Automated Event Detection for Active Monitoring
Tony McGregor, Hans-Werner Braun
The National Laboratory for Applied Network Research (NLANR)
San Diego Supercomputer Center, San Diego, California, USA
Pieter van Dijk
The University of Waikato, Hamilton, New Zealand
It is easy to collect large volumes of data through active
measurement. The AMP system [1][2][3], for example, consists of
around 120 monitors operating in a full mesh. Each monitor measures the
round trip time to each other monitor every minute and the route to
each other monitor every 10 minutes. The data is collected at
a central site, where it is made available to users as raw data and as
web-based performance graphs.
A set of graphs is available for each of the approximately 14,000
paths that are measured. While these graphs have proved useful once a
change in the network has been detected, they do not often aid in the
_detection_ of interesting network events. There are two reasons for
this. First, there are just too many of them. No human could examine
all the graphs often enough to detect network events in a timely way.
Second, even though most sites have a single AMP machine and only need
to examine the data for their site, the shortage of networking
personnel means that most network administrators are event driven;
they do not poll for work.
The overarching goal for the AMP project, and most other active
monitoring projects, is to increase the performance of the network for
its users. Supporting the diagnosis of network problems is a useful
function in this respect. However, the data AMP collects contains
enough information to aid in the discovery of network problems before
the users of the network report them.
Making practical use of the latent capacity of the data to discover
network problems is not trivial. Some of the
difficulties that must be overcome are:
- There are large volumes of data involved. AMP collects over 1Gb
of data a day and has an incoming data stream of about 1.5Mbps.
Any processing must be able to match this data rate,
and fit in the memory available on inexpensive equipment. The
difficulty is compounded by the large number of paths; in some
cases all of these paths must be monitored for events of interest.
- There is a lot of natural variation in the data. Paths differ in
their normal RTT and the amount of variability in this data. On one
path a 10% increase in RTT might be quite normal, while on another
it would indicate an unusual event.
- When a network event occurs, many elements are often affected
at the same time. Naive notification of every event could swamp
the system and the support staff with error notifications at the
very time they are likely to be under stress.
This paper describes a system for automatic detection of events in data
collected by active monitoring. The system is self-adjusting to fit
the data being monitored but simple enough to be implemented within
practical constraints. A post-processor sends notifications to users who
have requested them. This notification processor limits the number of
notifications using an exponential back-off to group successive
notifications into a single message.
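
The grouping behaviour described above can be sketched as follows. This is
only an illustration under stated assumptions, not the notification
processor itself: the class name BackoffNotifier, the 60 second base delay,
the doubling factor and the one hour cap are all hypothetical choices.

```python
class BackoffNotifier:
    """Hypothetical sketch: group events into messages with exponential back-off."""

    def __init__(self, base_delay=60.0, max_delay=3600.0):
        self.base_delay = base_delay  # assumed initial quiet period (seconds)
        self.max_delay = max_delay    # assumed upper bound on the quiet period
        self.delay = base_delay       # quiet period applied after the next send
        self.next_send = 0.0          # earliest time another message may go out
        self.pending = []             # events waiting to be grouped

    def event(self, now, description):
        """Record an event at time `now`; return a message to send, or None."""
        # A long quiet spell with nothing queued resets the back-off.
        if not self.pending and now >= self.next_send + self.delay:
            self.delay = self.base_delay
        self.pending.append(description)
        return self.flush(now)

    def flush(self, now):
        """Send everything queued if the quiet period has expired."""
        if not self.pending or now < self.next_send:
            return None
        message = "; ".join(self.pending)
        self.pending = []
        self.next_send = now + self.delay
        self.delay = min(self.delay * 2, self.max_delay)
        return message
```

In this sketch the first event is sent immediately, and later events are
held until the current quiet period expires, so a burst produces a few
grouped messages rather than one per event. A caller would also need to
invoke flush() periodically so that held events are delivered when no
further event arrives.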
The system is divided into algorithms; each algorithm is responsible
for detecting one class of network event. Currently two algorithms are
implemented. The first of these, the "Plateau Detector", detects a change
in the base RTT between a pair of monitors, even in the presence of (or
absence of) variance. (See, for example,
http://amp.nlanr.net/active/cgi-bin/daily.cgi?amp-startap/HPC/data/amp-vt/100.9.4.gz).
The second algorithm detects a change in the RTT variance over a path.
The algorithms are based around three "windows" that advance in time
as each new data item is received. The first of these (which is the
oldest in time) is used to characterise the normal state of the path. The
second window indicates the current state of the path. Simple statistics
are calculated on these windows and compared to give the basis of the
event detection. The third window is used as a filter; it removes
transient outliers from the data so that they do not cause false triggers.
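
A minimal sketch of the three-window idea for a single path is given below.
It is an illustration, not the Plateau Detector as implemented: the window
lengths, the use of a median as the outlier filter, the comparison of window
means, and the threshold of three standard deviations (with a 0.5 ms floor)
are all assumptions made for the example.

```python
from collections import deque
from statistics import mean, median, pstdev


class PlateauDetector:
    """Hypothetical three-window detector for the RTT series of one path."""

    def __init__(self, history_len=30, current_len=10, filter_len=3,
                 threshold=3.0, min_spread_ms=0.5):
        self.history = deque(maxlen=history_len)  # oldest samples: normal state
        self.current = deque(maxlen=current_len)  # recent samples: current state
        self.filter = deque(maxlen=filter_len)    # newest samples: outlier filter
        self.threshold = threshold                # assumed multiple of the spread
        self.min_spread_ms = min_spread_ms        # assumed floor for quiet paths

    def add(self, rtt_ms):
        """Feed one RTT sample; return True if the base RTT appears to have moved."""
        self.filter.append(rtt_ms)
        if len(self.filter) < self.filter.maxlen:
            return False
        # The median of the filter window discards isolated spikes.
        smoothed = median(self.filter)
        # The sample about to leave the current window ages into the history.
        if len(self.current) == self.current.maxlen:
            self.history.append(self.current[0])
        self.current.append(smoothed)
        if len(self.history) < self.history.maxlen:
            return False
        # Scale the threshold by the variability of this path's own history.
        spread = max(pstdev(self.history), self.min_spread_ms)
        return abs(mean(self.current) - mean(self.history)) > self.threshold * spread


def demo():
    """Synthetic data: steady RTT, one spike, then a shift to a higher base RTT."""
    detector = PlateauDetector()
    samples = [50 + (i % 2) for i in range(60)]
    samples[45] = 500                              # transient outlier
    samples += [80 + (i % 2) for i in range(40)]   # new plateau
    return [detector.add(s) for s in samples]
```

Because the threshold is derived from the history window, the same code
adapts to paths with different normal RTT and variability. Under the same
assumptions, a variance detector could compare the spread of the two windows
instead of their means.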
The paper describes these algorithms in detail and gives examples of
their operation in practical situations, including performance and
memory use. We conclude with suggestions for refinements and some new
algorithms to detect other important events, including loss and
outages.
References:
[1] McGregor A., Braun H-W. and Brown J. "The NLANR NAI
Network Analysis Infrastructure", IEEE Communications Magazine,
special issue on network measurement, pp. 122-128, May 2000.
[2] McGregor A.J. and Braun H-W. "Balancing Cost and Utility
in Active Monitoring: The AMP example", The Global Internet Summit
(Inet2000), 18-21 July 2000.
[3] Brown J., McGregor A. and Braun H-W.
"Network Performance Visualisation: Insight Through Animation",
PAM2000 Passive and Active Measurement Workshop, pp. 33-41, April 2000.