About the testbox alarm.
This page describes the test-box alarm system.
What the alarm system tries to do.
The alarm system maintains median and percentile delays for each
test-box that sends packets to the box where the system runs. It compares
the values for the most recent measurements against the results taken over
a longer period. If all recent results are consistent with the longer term
medians, nothing happens. If the recent results are outside the expected
range, the program generates an alarm.
Note: The parameters are still being tuned. The numbers on this page reflect
their values at the time this page was created. They may have changed since then.
However, this does not affect the principles behind the alarm.
The alarm system consists of 2 programs: LTA and STA, with a
few support scripts.
The Long Term Average (LTA) program divides the day into 4 equal
periods of 6 hours. For each test-box that is sending data to this box,
the LTA program maintains a distribution of the one-way delays measured
during each period. The 4 periods are: 0:00-6:00 GMT, 6:00-12:00 GMT,
12:00-18:00 GMT and 18:00-24:00 GMT. These intervals roughly correspond to
night, morning, afternoon and evening in most of the RIPE-NCC service area
(and to evening, night, morning and afternoon in the US). By selecting
them this way, day-night effects are excluded as much as possible. The LTA
program updates the distributions once a day using the data from the
previous day. Data older than 30 days is removed from the distributions.
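
The mapping from a measurement time to one of the four periods can be sketched as follows. This is only an illustration, not the LTA program's actual code: the names `PERIODS` and `lta_period` are hypothetical, and the sketch assumes measurement times are available as Unix timestamps.

```python
from datetime import datetime, timezone

# The four 6-hour GMT periods described above.
PERIODS = ("0:00-6:00", "6:00-12:00", "12:00-18:00", "18:00-24:00")


def lta_period(unix_time: int) -> int:
    """Return the index (0..3) of the 6-hour GMT period containing unix_time."""
    hour = datetime.fromtimestamp(unix_time, tz=timezone.utc).hour
    return hour // 6


if __name__ == "__main__":
    # Timestamp taken from the sample alarm message on this page.
    print(PERIODS[lta_period(938390715)])
```

Bucketing on the GMT hour keeps measurements from the same part of the day together, which is what lets the long term distributions exclude day-night effects.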
The Short Term Average (STA) program maintains the same
distributions but for a much shorter period of only the last 30 minutes.
Every 15 minutes, both the short and long term distributions are
parameterized by 3 percentiles: 5%, 50% and 95%. Each percentile is the
delay below which 5%, 50% or 95% of the measured points fall,
respectively. So if the 5%-percentile is 10.0 ms, then 5% of the delays
measured between two test-boxes are less than 10 ms, and 95% are more than 10
ms. The code then compares the short term results against the long term
results.
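
As an illustration of how a distribution can be reduced to these 3 numbers, here is a minimal sketch using the nearest-rank definition of a percentile. The page does not say which percentile definition the LTA and STA programs use, so the method and the names `percentile` and `summarize` are assumptions.

```python
import math


def percentile(delays, p):
    """Nearest-rank percentile: the smallest value with at least p% of points at or below it."""
    if not delays:
        raise ValueError("need at least one delay measurement")
    ordered = sorted(delays)
    # Multiply before dividing so whole-number ranks are computed exactly.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(delays):
    """Return the (5%, 50%, 95%) percentiles of a list of delays in ms."""
    return tuple(percentile(delays, p) for p in (5, 50, 95))


if __name__ == "__main__":
    # Hypothetical delays of 1, 2, ..., 100 ms.
    delays_ms = [float(d) for d in range(1, 101)]
    print(summarize(delays_ms))
```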
If the short term results are above what is expected from the long term
average, then an alarm message will be sent to the hosts of the two (sending
and receiving) test-boxes. The format of the message is explained below.
The time constants of 30 and 15 minutes mean that an
unusual condition has to exist for 15 to 30 minutes before an alarm is
raised. A typical situation is shown in this figure:
The figure shows the delay between two test-boxes for a 24 hour period.
Around -15 (or 9am) something caused the delays to go up from 30ms to about
100ms. This created an alarm message similar to the one shown below.
The programs maintain state, so when the alarm condition disappears, the
hosts of the two test-boxes will receive another message. In the figure
above, this happened at 11am. Other conditions where the short term average
differs from the long term average are recorded but no error message will
be sent.
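
The state keeping described above can be sketched as a small per-pair lookup of the previous alarm code. This is a hypothetical illustration: the class `AlarmState`, the `"SET"` and `"RESET"` return values and the use of 0 as the initial code are assumptions. Only the code 1 for "Alarm" comes from the list of alarm codes on this page.

```python
ALARM = 1  # alarm code for "Alarm" as listed on this page


class AlarmState:
    """Remember the last alarm code per (sender, receiver) pair."""

    def __init__(self):
        self._last = {}

    def update(self, pair, new_code):
        """Record new_code and return "SET", "RESET" or None (no message)."""
        old_code = self._last.get(pair, 0)  # assumption: start in "No Alarm"
        self._last[pair] = new_code
        if new_code == ALARM and old_code != ALARM:
            return "SET"
        if old_code == ALARM and new_code != ALARM:
            return "RESET"
        # Any other transition is only recorded; no message is sent.
        return None


if __name__ == "__main__":
    state = AlarmState()
    for code in (32, 1, 1, 0):
        print(code, state.update(("tt23", "tt01"), code))
```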
The alarm conditions are shown in the figure above. There are 9 cases:
- 0: No Alarm.
No alarm condition.
- 1: Alarm.
Lower percentile of the short term average above the higher percentile of
the long term average.
- 2: No Long Data.
Not enough valid measurements for the long term average, currently set
to 100 measurements.
- 4: No Short Data.
Not enough valid measurements for the short term average, currently set
to 10 measurements.
- 8: Larger Spread.
Lower percentile below the long term average, higher percentile above
the long term average.
- 16: Improving.
Higher percentile of the short term average below the lower percentile of
the long term average.
- 32: Go Up.
Delays slowly increase, both lower and higher percentile above the long
term average.
- 64: Go Down.
Delays slowly decrease, both lower and higher percentile below the long
term average.
- 127: Undefined.
Any other case, should never happen.
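
The cases above can be sketched as a single classification function. This is one possible reading, not the STA program's code: it assumes that "above" or "below the long term average" means comparing each short term percentile with the corresponding long term percentile, and that the checks are made in the order shown. The function name `classify` and the constant names are hypothetical. The thresholds of 100 and 10 measurements are the values quoted in the list.

```python
NO_ALARM, ALARM, NO_LONG_DATA, NO_SHORT_DATA = 0, 1, 2, 4
LARGER_SPREAD, IMPROVING, GO_UP, GO_DOWN, UNDEFINED = 8, 16, 32, 64, 127

MIN_LONG_POINTS = 100   # value quoted in the list above
MIN_SHORT_POINTS = 10   # value quoted in the list above


def classify(n_long, long_p, n_short, short_p):
    """Classify one pair of test-boxes.

    long_p and short_p are (5%, 50%, 95%) percentiles in ms.
    """
    if n_long < MIN_LONG_POINTS:
        return NO_LONG_DATA
    if n_short < MIN_SHORT_POINTS:
        return NO_SHORT_DATA
    long_lo, _, long_hi = long_p
    short_lo, _, short_hi = short_p
    if short_lo > long_hi:
        return ALARM
    if short_hi < long_lo:
        return IMPROVING
    # Assumption: compare like with like (5% with 5%, 95% with 95%).
    if short_lo < long_lo and short_hi > long_hi:
        return LARGER_SPREAD
    if short_lo > long_lo and short_hi > long_hi:
        return GO_UP
    if short_lo < long_lo and short_hi < long_hi:
        return GO_DOWN
    if short_lo >= long_lo and short_hi <= long_hi:
        return NO_ALARM
    return UNDEFINED  # remaining ties and NaN values fall through to here


if __name__ == "__main__":
    # Numbers taken from the sample alarm message on this page.
    print(classify(518, (59.0, 66.5, 225.0), 36, (238.0, 277.5, 299.0)))
```

With the numbers from the sample message, the short term 5% percentile (238.0 ms) lies above the long term 95% percentile (225.0 ms), so this sketch returns code 1, in line with the "new: Alarm" reported there.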
When the alarm on the test-box is triggered (that is, the STA program
returned "Alarm" for a pair of test-boxes), a message like this will be sent to
the hosts of the pair of test-boxes that generated the alarm:
Date: Mon, 27 Sep 1999 00:05:16 GMT
From: ttraffic@ripe.net
To: tt-ops@ripe.net
Subject: Testbox ALARM SET
The testbox alarm program on tt01.ripe.net found:
TB 23 at 938390715 ALARM SET old: Go Up, new: Alarm
TB 23 at 938390715 Long: 518 59.0/ 66.5/225.0 Short: 36 238.0/277.5/299.0
Satellite conditions on tt01.ripe.net:
Mon Sep 27 00:02:20 1999: Satellites seen from 19990926 225900 to 235900: 0 0 0 0 21 29 7 0 0 0
This message has been sent to: tt23 tt01
Satellite conditions on tt01.ripe.net:
Wed Apr 10 12:00:00 2002:Satellites seen from 20020410 110000 to 120000: 0 0 0 0 0 4 39 14 0 0
To see how the delays developed in the last days, open this URL:
http://tt01.ripe.net:10259/cgi-bin/multiple.cgi?&tt23=209.211.237.18&delay=delay&loss=loss&RRD_START=now-2days&RRD_END=now"
Access to this page is limited to the owner of tt01.ripe.net,
please contact tt-ops@ripe.net with any access errors
For an explanation of this email please see
http://www.ripe.net/test-traffic/General/alarm.html
For an explanation of this email please see http://www.ripe.net/test-traffic/Host_testbox/alarm.html
The Subject: line tells whether an alarm has been set or reset. The line
"The testbox alarm..." shows which test-box generated the alarm. Note
that each host will receive a message both for data sent to their box and for
data originating from their box. In the latter case, the alarm will be sent by the
receiving test-box. The following lines tell what happened:
- TB xx: Sending test-box.
- at 938390715: Unix time when the alarm was triggered. Note that
938390715 corresponds to (approximately) the time in the "Date:"
field.
- ALARM SET old: XX new: XX: What happened, followed by the old
and new alarm codes.
- Long: A B/C/D: A is the number of points used in the long term
average, B, C and D are the 5%, 50% and 95% percentiles of this
distribution. All values are in ms.
- Short: A B/C/D: A is the number of points used in the short term
average, B, C and D are the 5%, 50% and 95% percentiles of this
distribution. All values are in ms.
If 2 (or more) pairs of test-boxes generate an alarm condition at the same
time, then this pair of lines will be repeated for the next test-box.
In this particular case, data sent from tt23 (test-box #23) to tt01
(test-box #1) generated an alarm. The delay distribution over the last 30
days had a 5% percentile of 59.0 ms, a median of 66.5 ms and a 95% percentile
of 225.0 ms. In the last 30 minutes, these numbers climbed to 238.0, 277.5 and
299.0 ms respectively. Since this is above what was expected, the alarm was set.
To find out where tt01 and tt23 are located, click
here.
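
For readers who want to process these messages automatically, the "Long: ... Short: ..." line can be parsed along the following lines. This is a sketch based only on the sample message above: the regular expression `STATS_RE`, the function `parse_stats_line` and the dictionary keys are hypothetical, and the real message format may vary.

```python
import re
from datetime import datetime, timezone

# Pattern derived from the sample line on this page; spaces after "/" are allowed.
STATS_RE = re.compile(
    r"TB\s+(?P<tb>\d+)\s+at\s+(?P<time>\d+)\s+"
    r"Long:\s*(?P<n_long>\d+)\s+(?P<long>[\d.]+/\s*[\d.]+/\s*[\d.]+)\s+"
    r"Short:\s*(?P<n_short>\d+)\s+(?P<short>[\d.]+/\s*[\d.]+/\s*[\d.]+)"
)


def _triple(text):
    """Turn 'a/ b/c' into a tuple of three floats (ms)."""
    return tuple(float(part) for part in text.split("/"))


def parse_stats_line(line):
    """Return a dict describing a 'Long/Short' line, or None if it does not match."""
    m = STATS_RE.search(line)
    if m is None:
        return None
    return {
        "testbox": int(m.group("tb")),
        "when": datetime.fromtimestamp(int(m.group("time")), tz=timezone.utc),
        "long_points": int(m.group("n_long")),
        "long_ms": _triple(m.group("long")),
        "short_points": int(m.group("n_short")),
        "short_ms": _triple(m.group("short")),
    }


if __name__ == "__main__":
    sample = ("TB 23 at 938390715 Long: 518 59.0/ 66.5/225.0 "
              "Short: 36 238.0/277.5/299.0")
    print(parse_stats_line(sample))
```

Converting the Unix time to GMT gives 27 Sep 1999 00:05:15, which agrees with the "Date:" field of the sample message to within a second.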
The line Satellite conditions shows the satellite conditions for
the receiving test-box. This information is intended to verify the alarm.
The next line (Mon Sep 27...) shows:
- Mon Sep 27 00:02:20 1999: A time stamp.
- Satellites seen from X to Y: The interval during which
the samples described in the next point were taken.
- a b c d e f g h i j :
Satellite conditions are sampled every 64 seconds and summarized once
an hour. The first
number shows the number of samples where the receiver saw no satellites,
the second one the number of samples where it saw 1 satellite, the third
one the number of samples where it saw 2 satellites and so on.
So, in the example above (0 0 0 0 21 29 7 0 0 0), there were
21 samples where the receiver saw 4 satellites, 29 where it saw 5 and 7 samples
where it saw 6 satellites.
This information can be used to verify the alarm. If ALL entries are in the first
bin (60 0 0 0 0 0 0 0 0) then the alarm might be a false alarm caused by a
drifting clock rather than a network problem. Note that a FEW entries in the
lowest bin, up to about 1 out of 3, are no reason for concern.
If you suspect that the alarms are caused by a drifting clock, then you may want
to look at the raw satellite data.
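
The check described above can be sketched as follows. The function names are hypothetical, and the 1-in-3 tolerance is taken from the "about 1 out of 3" rule of thumb in the text rather than from the alarm program itself.

```python
def parse_satellite_bins(line):
    """Return the per-satellite-count sample counts from a 'Satellites seen' line."""
    # The counts follow the last colon on the line.
    return [int(count) for count in line.rpartition(":")[2].split()]


def zero_satellite_fraction(bins):
    """Fraction of samples in which the receiver saw no satellites."""
    total = sum(bins)
    if total == 0:
        raise ValueError("no satellite samples in this interval")
    return bins[0] / total


def all_in_first_bin(bins):
    """True if every sample saw no satellites (possible drifting clock)."""
    return sum(bins) > 0 and bins[0] == sum(bins)


def within_tolerance(bins, tolerance=1 / 3):
    """True if the share of zero-satellite samples is no reason for concern."""
    return zero_satellite_fraction(bins) <= tolerance


if __name__ == "__main__":
    line = ("Mon Sep 27 00:02:20 1999: Satellites seen from "
            "19990926 225900 to 235900: 0 0 0 0 21 29 7 0 0 0")
    bins = parse_satellite_bins(line)
    print(bins, all_in_first_bin(bins), within_tolerance(bins))
```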
The URL will only appear for test-boxes equipped with a web-server,
that is, series C and D. Clicking on the link takes you to the web site on the
box, with a plot showing the delay and loss during the last hours.
When the alarm condition disappears, the host will receive a second email.
Its format is almost the same, except that the subject says that the alarm
is reset.
Who gets email from the alarm program?
By default, the alarm messages are sent to the contact person(s) for the
test-box. If you want this to be changed, please contact the test-box operators.
More information.
For more details about this program, please contact the test-box operators.