Automated Event Detection for Active Monitoring

Tony McGregor, Hans-Werner Braun
The National Laboratory for Applied Network Research (NLANR) San Diego Supercomputer Center, San Diego, California, USA

Pieter van Dijk
The University of Waikato, Hamilton, New Zealand

It is easy to collect large volumes of data through active measurement. The AMP system [1][2][3], for example, consists of around 120 monitors operating in a full mesh. Each monitor measures the round trip time to each other monitor every minute and the route to each other monitor every 10 minutes. The data is collected at a central site, where it is made available to users as raw data and as web-based performance graphs.

A set of graphs is available for each of the approximately 14,000 paths that are measured. While these graphs have proved useful once a change in the network has been detected, they do not often aid in the _detection_ of interesting network events. There are two reasons for this. First, there are just too many of them. No human could examine all the graphs often enough to detect network events in a timely way. Second, even though most sites have a single AMP machine and only need to examine the data for their site, the shortage of networking personnel means that most network administrators are event driven; they do not poll for work.

The overarching goal of the AMP project, and of most other active monitoring projects, is to increase the performance of the network for its users. Supporting the diagnosis of network problems is a useful function in this respect. However, the data AMP collects contains enough information to aid in the discovery of network problems before the users of the network report them.

Making use of the latent capacity of the data to discover network problems, in a practical way, is not trivial. Some of the difficulties that must be overcome are:

There are large volumes of data involved. AMP collects over 1Gb of data a day and has an incoming data stream of about 1.5Mbps. Any processing must be able to match this data rate and fit in the memory available on inexpensive equipment. The difficulty of meeting this need is compounded by the large number of paths. In some cases all these paths must be monitored for events of interest.
There is a lot of natural variation in the data. Paths differ in their normal RTT and the amount of variability in this data. On one path a 10% increase in RTT might be quite normal, while on another it would indicate an unusual event.
When a network event occurs, many elements are often affected at the same time. Naive notification of every event could swamp the system and the support staff with error notifications at the very time they are likely to be under stress.
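
The point about natural variation can be illustrated with a small worked example. The RTT samples below are invented for illustration (they are not AMP data), and the function name is hypothetical. The sketch expresses the same 10% rise in RTT in units of each path's own standard deviation, which is one simple way of seeing why a fixed percentage threshold would not suit every path.

```python
from statistics import mean, stdev

def deviation_in_sigmas(history, new_rtt):
    """Return how many sample standard deviations new_rtt lies above the history mean."""
    mu = mean(history)
    sigma = stdev(history)
    return (new_rtt - mu) / sigma

# Invented samples (milliseconds): a stable path and a noisy path, both with mean 50 ms.
stable_path = [50.0, 50.2, 49.8, 50.1, 49.9, 50.0]
noisy_path = [50.0, 58.0, 43.0, 55.0, 46.0, 48.0]

increase = 1.10  # the same 10% rise applied to each path's mean RTT
stable_score = deviation_in_sigmas(stable_path, mean(stable_path) * increase)
noisy_score = deviation_in_sigmas(noisy_path, mean(noisy_path) * increase)

print(f"stable path: {stable_score:.1f} standard deviations")
print(f"noisy path:  {noisy_score:.1f} standard deviations")
```

On the stable path the rise is many standard deviations above normal, while on the noisy path it is within one standard deviation and would be indistinguishable from ordinary variation.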

This paper describes a system for automatic detection of events in data collected by active monitoring. The system is self-adjusting to fit the data being monitored but simple enough to be implemented within practical constraints. A post-processor sends notifications to users who have requested them. This notification processor limits the number of notifications using an exponential back-off to group successive notifications into a single message.
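
As an illustration of how an exponential back-off can group notifications, the following is a minimal sketch and not the AMP notification processor itself. The class name, the 60 second initial interval, the one hour cap, and the rule for resetting the interval after a quiet period are all assumptions made for this example.

```python
class BackoffNotifier:
    """Group events into messages, doubling the minimum gap between messages."""

    def __init__(self, base_interval=60.0, max_interval=3600.0):
        self.base_interval = base_interval  # assumed initial gap, in seconds
        self.max_interval = max_interval    # assumed upper limit on the gap
        self.interval = base_interval       # gap to apply after the next message
        self.next_send = 0.0                # earliest time the next message may be sent
        self.pending = []                   # events waiting to be grouped
        self.sent = []                      # (time, [events]) messages, kept for inspection

    def add_event(self, now, event):
        # Assumed reset rule: if nothing is waiting and a further full interval
        # has passed beyond next_send, return to the initial gap.
        if not self.pending and now >= self.next_send + self.interval:
            self.interval = self.base_interval
        self.pending.append(event)
        self.flush(now)

    def flush(self, now):
        """Send one grouped message if the back-off period has elapsed."""
        if self.pending and now >= self.next_send:
            self.sent.append((now, self.pending))
            self.pending = []
            self.next_send = now + self.interval
            self.interval = min(self.interval * 2, self.max_interval)

notifier = BackoffNotifier()
# Hypothetical (time in seconds, event name) pairs arriving in a burst.
for t, name in [(0, "a"), (10, "b"), (30, "c"), (70, "d")]:
    notifier.add_event(t, name)
messages = notifier.sent
```

In this run the first event is sent immediately and the next three are held and delivered together as one message once the first 60 second gap has passed. A real deployment would also call `flush` from a timer so that held events are delivered even if no further event arrives.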

The system is divided into algorithms; each algorithm is responsible for detecting one class of network event. Currently two algorithms are implemented. The first of these, the "Plateau Detector", detects a change in the base RTT between a pair of monitors, whether or not variance is present. (See, for example, http://amp.nlanr.net/active/cgi-bin/daily.cgi?amp-startap/HPC/data/amp-vt/100.9.4.gz). The second algorithm detects a change in the RTT variance over a path.

The algorithms are based around three "windows" that advance in time as each new data item is received. The first of these (which is the oldest in time) is used to characterise the normal state of the path. The second window indicates the current state of the path. Simple statistics are calculated on these windows and compared to give the basis of the event detection. The third window is used as a filter; it removes transient outliers from the data so that they do not cause false triggers.
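
The three-window idea can be sketched as follows. This is a hedged illustration rather than the algorithm described in the paper: the class name, the window lengths, the use of a running median as the outlier filter, the comparison of window means against a multiple of the history's standard deviation, and the `min_sigma` floor are all assumptions chosen to keep the example short.

```python
from collections import deque
from statistics import mean, median, pstdev

class ThreeWindowDetector:
    """Flag a change in base RTT using history, current and filter windows."""

    def __init__(self, history_len=30, current_len=5, filter_len=3,
                 threshold=3.0, min_sigma=0.5):
        self.history = deque(maxlen=history_len)        # oldest data: normal state
        self.current = deque(maxlen=current_len)        # recent data: current state
        self.outlier_filter = deque(maxlen=filter_len)  # newest raw samples
        self.threshold = threshold  # assumed: allowed shift, in standard deviations
        self.min_sigma = min_sigma  # assumed floor (ms) so very quiet paths do not over-trigger

    def add(self, rtt):
        """Feed one RTT sample; return True if a change in base RTT is flagged."""
        # Filter window: a running median suppresses isolated spikes.
        self.outlier_filter.append(rtt)
        smoothed = median(self.outlier_filter)
        # The oldest value in the current window graduates into the history window.
        if len(self.current) == self.current.maxlen:
            self.history.append(self.current[0])
        self.current.append(smoothed)
        if len(self.history) < self.history.maxlen:
            return False  # still learning the normal state
        base = mean(self.history)
        sigma = max(pstdev(self.history), self.min_sigma)
        if abs(mean(self.current) - base) > self.threshold * sigma:
            self.history.clear()  # relearn the normal state at the new level
            return True
        return False

# Invented RTT series (ms): steady near 50, one isolated spike, then a step to 70.
pattern = [50.0, 50.4, 49.6, 50.2, 49.8]
samples = pattern * 8 + [200.0] + pattern + [70.0] * 10
detector = ThreeWindowDetector()
flags = [detector.add(rtt) for rtt in samples]
```

With this invented series the single 200 ms spike is absorbed by the filter window and raises no flag, while the step from about 50 ms to 70 ms is flagged once, shortly after it begins. A variance detector could follow the same structure, comparing the spread of the two windows instead of their means.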

The paper describes these algorithms in detail and gives examples of their operation in practical situations, including performance and memory use. We conclude with suggestions for refinements and some new algorithms to detect other important events, including loss and outages.

References:

[1] McGregor A., Braun H-W. and Brown J. "The NLANR NAI Network Analysis Infrastructure" IEEE Communications Magazine special issue on network measurement, pp. 122-128, May 2000.
[2] McGregor A.J. and Braun H-W. "Balancing Cost and Utility in Active Monitoring: The AMP example" The Global Internet Summit - Inet2000, 18-21 July 2000.
[3] Brown J., McGregor A. and Braun H-W. "Network Performance Visualisation: Insight Through Animation" PAM2000 Passive and Active Measurement Workshop, pp. 33-41, April 2000.