About Test Box Alarms
- How does the alarm work?
- Alarm codes
- How to interpret the emails from the alarm program?
- Who gets emails from the alarm program?
- More information about this program.
- Information for test-box operators at the NCC.
What the alarm system tries to do.
The alarm system maintains median and percentile delays for each test-box that sends packets to the box where the system runs. It compares the values for the most recent measurements against the results taken over a longer period. If all recent results are consistent with the longer term medians, nothing happens. If the recent results are outside the expected range, the program generates an alarm.
Note: The parameters are still being tuned. The numbers reflect their value at the time this page was created. They may have changed since then. However, this does not affect the principles behind the alarm.
The alarm system consists of 2 programs: LTA and STA, with a few support scripts.
The Long Term Average (LTA) program divides the day into 4 equal periods of 6 hours and for each test-box that is sending data to this box, the LTA program maintains a distribution of the one-way-delays measured during this period. The 4 periods are: 0:00-6:00 GMT, 6:00-12:00 GMT, 12:00-18:00 GMT and 18:00-24:00 GMT. These intervals roughly correspond to night, morning, afternoon and evening in most of the RIPE-NCC service area (and to evening, night, morning and afternoon in the US). By selecting them this way, day-night effects are excluded as much as possible. The LTA program updates the distributions once a day using the data from the previous day. Data older than 30 days is removed from the distributions.
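The mapping from a measurement time to one of the four periods can be sketched as follows. This is a minimal illustration, not the LTA program itself: the function name `lta_period` and the label list are assumptions made here.

```python
from datetime import datetime, timezone

PERIOD_HOURS = 6  # the day is divided into 4 equal periods of 6 hours

# Labels for the 4 periods, all in GMT.
PERIOD_LABELS = ["0:00-6:00", "6:00-12:00", "12:00-18:00", "18:00-24:00"]


def lta_period(unix_time):
    """Return the index (0 to 3) of the 6-hour GMT period containing unix_time."""
    hour = datetime.fromtimestamp(unix_time, tz=timezone.utc).hour
    return hour // PERIOD_HOURS
```

A measurement taken at Unix time 938390715 (shortly after midnight GMT) falls into period 0, labelled 0:00-6:00.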
The Short Term Average (STA) program maintains the same distributions but for a much shorter period of only the last 30 minutes.
Every 15 minutes, both the short and long term distributions are parameterized by 3 percentiles: 5%, 50% and 95%. Each percentile is the delay below which 5%, 50% or 95% of the measured points fall. So if the 5% percentile is 10.0 ms, then 5% of the delays measured between two test-boxes are less than 10 ms, and 95% are more than 10 ms. The code then compares the short term results against the long term results.
If the short term results are above what is expected from the long term average, then an alarm message will be sent to the hosts of the two (sending and receiving) test-boxes. The format of the message is explained here. The time constants of 30 and 15 minutes mean that an unusual condition has to exist for 15 to 30 minutes before an alarm is raised. A typical situation is shown in this figure:
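As an illustration of the three percentiles, here is a small sketch using the nearest-rank method. The page does not say which percentile method the alarm programs use, so the method and the names `percentile` and `summarize` are assumptions.

```python
def percentile(delays, p):
    """Nearest-rank percentile for an integer p between 1 and 100.

    Returns the smallest measured delay such that at least p% of the
    points are at or below it. This method is an assumption; the alarm
    programs may interpolate differently.
    """
    if not delays:
        raise ValueError("no measurements")
    ordered = sorted(delays)
    # Integer form of ceil(p * n / 100), which avoids floating-point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(delays):
    """Return the (5%, 50%, 95%) percentiles of a list of delays in ms."""
    return tuple(percentile(delays, p) for p in (5, 50, 95))
```

For 100 measurements with delays of 1, 2, ..., 100 ms, `summarize` returns `(5, 50, 95)`.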
The figure shows the delay between two test-boxes for a 24 hour period. Around -15 (or 9am) something caused the delays to go up from 30ms to about 100ms. This created an alarm message similar to the one shown here.
The programs maintain state, so when the alarm condition disappears, the hosts of the two test-boxes will receive another message. In the figure above, this happened at 11am. Other conditions where the short term average differs from the long term average are recorded but no error message will be sent.
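The stateful behaviour described above can be sketched as a comparison of the previous and the current alarm code for a pair of test-boxes. This sketch assumes that a message is sent on every transition into or out of the "Alarm" code (code 1 in the list of alarm codes) and on no other change. The function name is hypothetical.

```python
ALARM = 1  # alarm code for "Alarm"


def message_for_transition(old_code, new_code):
    """Return "SET", "RESET" or None for a change from old_code to new_code.

    Assumption: only entering or leaving the Alarm state produces an
    email. Every other change is recorded without a message.
    """
    if old_code != ALARM and new_code == ALARM:
        return "SET"
    if old_code == ALARM and new_code != ALARM:
        return "RESET"
    return None
```

With this rule, a change from 0 to 1 gives "SET", a change from 1 to 0 gives "RESET", and a change from 0 to 32 gives no message.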
The alarm conditions are shown in the figure above. There are 9 cases:
- 0: No Alarm. No alarm condition.
- 1: Alarm. Lower percentile of the short term average above the higher percentile of the long term average.
- 2: No Long Data. Not enough valid measurements for the long term average. The minimum is currently set to 100 measurements.
- 4: No Short Data. Not enough valid measurements for the short term average. The minimum is currently set to 10 measurements.
- 8: Larger Spread. Lower percentile below the long term average, higher percentile above the long term average.
- 16: Improving. Higher percentile of the short term average below the lower percentile of the long term average.
- 32: Go Up. Delays slowly increase, both lower and higher percentile above the long term average.
- 64: Go Down. Delays slowly decrease, both lower and higher percentile below the long term average.
- 127: Undefined. Any other case, should never happen.
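One possible reading of these cases is sketched below. It is not the STA program's actual logic and rests on several assumptions that the page does not confirm: "the long term average" is taken to mean the long term 5% to 95% band, lower percentiles are compared with lower and higher with higher, the data checks come first, and a short term band that lies inside the long term band means "No Alarm". The function name `classify` is hypothetical.

```python
NO_ALARM, ALARM, NO_LONG_DATA, NO_SHORT_DATA = 0, 1, 2, 4
LARGER_SPREAD, IMPROVING, GO_UP, GO_DOWN, UNDEFINED = 8, 16, 32, 64, 127

MIN_LONG_POINTS = 100  # current value according to this page
MIN_SHORT_POINTS = 10  # current value according to this page


def classify(long_n, long_lo, long_hi, short_n, short_lo, short_hi):
    """Return an alarm code from point counts and 5%/95% percentiles (ms)."""
    # Assumed order: the data checks come before any comparison.
    if long_n < MIN_LONG_POINTS:
        return NO_LONG_DATA
    if short_n < MIN_SHORT_POINTS:
        return NO_SHORT_DATA
    if short_lo > long_hi:  # short band entirely above the long band
        return ALARM
    if short_hi < long_lo:  # short band entirely below the long band
        return IMPROVING
    if short_lo >= long_lo and short_hi <= long_hi:
        return NO_ALARM  # assumed: short band inside the long band
    if short_lo < long_lo and short_hi > long_hi:
        return LARGER_SPREAD
    if short_lo >= long_lo and short_hi > long_hi:
        return GO_UP  # overlapping, but shifted upwards
    if short_lo < long_lo and short_hi <= long_hi:
        return GO_DOWN  # overlapping, but shifted downwards
    return UNDEFINED  # not reachable with ordinary numeric input
```

Under this reading, a long term band of 59.0 to 225.0 ms and a short term band of 238.0 to 299.0 ms give code 1, provided both distributions hold enough points.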
When the alarm on the test-box is triggered (that is, the STA program returned "Alarm" for a pair of test-boxes), a message like this will be sent to the hosts of the pair of test-boxes that generated the alarm:
Date: Mon, 27 Sep 1999 00:05:16 GMT
The Subject: line tells whether an alarm has been set or reset. The line "The testbox alarm..." shows which test-box generated the alarm. Note that each host will receive a message both for data sent to their box and for data originating from their box. In the latter case, the alarm will be sent by the receiving test-box. The following lines tell what happened:
- TB xx: Sending test-box.
- at 938390715: Unix time when the alarm was triggered. Note that 938390715 corresponds to (approximately) the time in the "Date:" field.
- ALARM SET old: XX new: XX: What happened, followed by the old and new alarm codes.
- Long: A B/C/D: A is the number of points used in the long term average, B, C and D are the 5%, 50% and 95% percentiles of this distribution. All values are in ms.
- Short: A B/C/D: A is the number of points used in the short term average, B, C and D are the 5%, 50% and 95% percentiles of this distribution. All values are in ms.
If 2 (or more) pairs of test-boxes generate an alarm condition at the same time, then this pair of lines will be repeated for the next test-box.
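The Unix time in the message can be converted to a readable GMT time with a few lines of code. The helper name `alarm_time` is an assumption made for this sketch.

```python
from datetime import datetime, timezone


def alarm_time(unix_seconds):
    """Convert the Unix time from an alarm message to a GMT datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


print(alarm_time(938390715).strftime("%a, %d %b %Y %H:%M:%S GMT"))
# prints "Mon, 27 Sep 1999 00:05:15 GMT"
```

The result is one second before the time in the "Date:" field of the example message, which is why the two times are described as approximately equal.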
In this particular case, data sent from tt23 (test-box #23) to tt01 (test-box #1) generated an alarm. The delay distribution over the last 30 days had a lower percentile of 59.0 ms, a median of 66.5 ms and a 95% percentile of 225.0 ms. In the last 30 minutes, these numbers climbed to 238.0, 277.5 and 299.0 ms respectively. Since this is above what was expected, the alarm was set. To find out where tt01 and tt23 are located, click here.
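For readers who process these emails automatically, the Long: and Short: lines can be parsed with a short sketch like the one below. It assumes the lines look exactly like "Long: A B/C/D" with whitespace between the fields. The point counts 1200 and 25 are made up for illustration, since the page does not state them, and the names are hypothetical.

```python
import re

# Assumed layout: "Long: A B/C/D" or "Short: A B/C/D".
SUMMARY_RE = re.compile(
    r"^\s*(Long|Short):\s*(\d+)\s+([\d.]+)/([\d.]+)/([\d.]+)\s*$"
)


def parse_summary(line):
    """Return (kind, number_of_points, (p5, p50, p95)) for one summary line."""
    match = SUMMARY_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized summary line: {line!r}")
    kind, count, p5, p50, p95 = match.groups()
    return kind, int(count), (float(p5), float(p50), float(p95))


# Percentiles from the example above; the point counts are hypothetical.
_, _, long_pcts = parse_summary("Long: 1200 59.0/66.5/225.0")
_, _, short_pcts = parse_summary("Short: 25 238.0/277.5/299.0")

# Alarm condition: short term 5% percentile above long term 95% percentile.
alarm = short_pcts[0] > long_pcts[2]
```

Here `alarm` is `True`, because 238.0 ms is above 225.0 ms.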
The line Satellite conditions shows the satellite conditions for the receiving test-box. This information is intended to verify the alarm. The next line (Mon Sep 27...) shows:
- Mon Sep 27 00:02:20 1999: A time stamp.
- Satellites seen from X to Y: The interval during which the samples described in the next point were taken.
- a b c d e f g h i j : Satellite conditions are sampled every 64 seconds and summarized once an hour. The first number shows the number of samples where the receiver saw no satellites, the second one the number of samples where it saw 1 satellite, the third one the number of samples where it saw 2 satellites and so on.
So, in the example above (0 0 0 0 21 29 7 0 0 0), there were 21 samples where the receiver saw 4 satellites, 29 where it saw 5 and 7 samples where it saw 6 satellites.
This information can be used to verify the alarm. If ALL entries are in the first bin (60 0 0 0 0 0 0 0 0) then the alarm might be a false alarm caused by a drifting clock rather than a network problem. Note that a FEW entries in the lowest bin, up to about 1 out of 3, are no reason for concern.
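The check described above can be sketched as follows. The function names are assumptions, and the rule implemented is only the one stated here: an alarm is suspect when every sample falls into the zero-satellite bin.

```python
def parse_satellite_counts(text):
    """Turn a line such as "0 0 0 0 21 29 7 0 0 0" into a list of counts.

    The entry at index k is the number of samples with k satellites seen.
    """
    return [int(token) for token in text.split()]


def zero_satellite_fraction(counts):
    """Fraction of samples in which the receiver saw no satellites."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no samples")
    return counts[0] / total


def clock_suspect(counts):
    """True only when every sample fell into the zero-satellite bin."""
    total = sum(counts)
    return total > 0 and counts[0] == total
```

For the example line (0 0 0 0 21 29 7 0 0 0) the zero-satellite fraction is 0 and `clock_suspect` returns `False`, so the alarm is more likely to reflect a real network problem.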
If you suspect that the alarms are caused by a drifting clock, then you may want to look at the raw satellite data.
The URL will only appear for test-boxes equipped with a web-server, that is, series C and D. Clicking on the link takes you to the web site on the box, which shows a plot of the delay and loss during the last hours.
When the alarm condition disappears, the host will receive a second email. This looks like:
Date: Mon, 27 Sep 1999 00:20:18 GMT
The format is almost the same, except that the subject says that the alarm is reset.
Who gets email from the alarm program?
By default, the alarm messages are sent to the contact person(s) for the test-box. If you want this to be changed, please contact the test-box operators.