You're viewing an archived page. It is no longer being updated.
The Test Traffic Measurement Service (TTM) was shut down on 1 July 2014. This information is available for historical reference.
The alarm system maintains median and percentile delays for each test-box that sends packets to the box where the system runs. It compares the values for the most recent measurements against the results taken over a longer period. If all recent results are consistent with the longer term medians, nothing happens. If the recent results are outside the expected range, the program generates an alarm.
Note: The parameters are still being tuned. The numbers reflect their value at the time this page was created. They may have changed since then. However, this does not affect the principles behind the alarm.
The alarm system consists of 2 programs: LTA and STA, with a few support scripts.
The Long Term Average (LTA) program divides the day into 4 equal periods of 6 hours. For each test-box that is sending data to this box, and for each period, it maintains a distribution of the one-way delays measured during that period. The 4 periods are: 0:00-6:00 GMT, 6:00-12:00 GMT, 12:00-18:00 GMT and 18:00-24:00 GMT. These intervals roughly correspond to night, morning, afternoon and evening in most of the RIPE-NCC service area (and to evening, night, morning and afternoon in the US). By selecting them this way, day-night effects are excluded as much as possible. The LTA program updates the distributions once a day using the data from the previous day. Data older than 30 days is removed from the distributions.
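The period selection and the 30-day retention described above can be sketched as follows. This is only an illustration, not the LTA program itself: the function names `period_index` and `prune` are hypothetical, and timestamps are assumed to be timezone-aware.

```python
from datetime import datetime, timedelta, timezone

PERIOD_HOURS = 6      # four equal periods per day, as described in the text
RETENTION_DAYS = 30   # data older than this is dropped from the distributions


def period_index(ts: datetime) -> int:
    """Return 0..3 for the 6-hour GMT period that contains ts.

    0 = 0:00-6:00, 1 = 6:00-12:00, 2 = 12:00-18:00, 3 = 18:00-24:00 GMT.
    """
    return ts.astimezone(timezone.utc).hour // PERIOD_HOURS


def prune(samples, now):
    """Keep only (timestamp, delay_ms) samples from the last 30 days."""
    cutoff = now - timedelta(days=RETENTION_DAYS)
    return [(ts, delay) for ts, delay in samples if ts >= cutoff]
```

A delay measured at 00:05:16 GMT would fall in period 0, so it would only ever be compared with other night-time measurements.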
The Short Term Average (STA) program maintains the same distributions but for a much shorter period of only the last 30 minutes.
Every 15 minutes, both the short and long term distributions are parameterized by 3 percentiles: 5%, 50% and 95%. Each percentile is the delay value below which 5%, 50% or 95% of the measured points fall. For example, if the 5% percentile is 10.0 ms, then 5% of the delays measured between two test-boxes are less than 10 ms and 95% are more than 10 ms. The code then compares the short term results against the long term results.
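As an illustration of this parameterization, the sketch below reduces a list of delays to the three percentiles. The page does not say how the original programs compute percentiles; the nearest-rank method used here is an assumption, and the names `percentile` and `summarize` are hypothetical.

```python
def percentile(sorted_delays, p):
    """Nearest-rank percentile (an assumed method, not confirmed by the text).

    Returns the smallest sample such that at least p percent of the
    samples are less than or equal to it. p is an integer from 1 to 100.
    """
    n = len(sorted_delays)
    if n == 0:
        raise ValueError("no samples")
    rank = max(1, (p * n + 99) // 100)  # integer ceiling of p * n / 100
    return sorted_delays[rank - 1]


def summarize(delays):
    """Return the (5%, 50%, 95%) percentiles of a collection of delays in ms."""
    ordered = sorted(delays)
    return tuple(percentile(ordered, p) for p in (5, 50, 95))
```

The same `summarize` call would be applied to the 30-minute window and to the matching 30-day distribution, giving two triples that can be compared directly.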
If the short term results are above what is expected from the long term average, an alarm message is sent to the hosts of the two (sending and receiving) test-boxes. The format of the message is explained here. The time constants of 30 and 15 minutes mean that an unusual condition has to exist for 15 to 30 minutes before an alarm is raised. A typical situation is shown in this figure:
The figure shows the delay between two test-boxes for a 24 hour period. Around -15 (or 9am) something caused the delays to go up from 30ms to about 100ms. This created an alarm message similar to the one shown here.
The programs maintain state, so when the alarm condition disappears, the hosts of the two test-boxes will receive another message. In the figure above, this happened at 11am. Other conditions where the short term average differs from the long term average are recorded but no error message will be sent.
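The stateful behaviour described above, one message when the alarm is set and one when it is reset, might look roughly like the sketch below. The exact comparison rule used by the STA program is not given on this page, so `is_alarm` is a loudly labelled assumption (short term median above the long term 95% percentile), and the class name `AlarmState` is hypothetical.

```python
def is_alarm(short, long):
    """ASSUMPTION: alarm when the short term median exceeds the long term
    95% percentile. The real rule is not documented on this page.

    Both arguments are (5%, 50%, 95%) percentile triples in ms.
    """
    return short[1] > long[2]


class AlarmState:
    """Remember the alarm state per (sender, receiver) pair of test-boxes."""

    def __init__(self):
        self.active = {}  # (sender, receiver) -> True while the alarm is set

    def update(self, pair, short, long):
        """Return 'set' or 'reset' on a state change, otherwise None."""
        now = is_alarm(short, long)
        before = self.active.get(pair, False)
        self.active[pair] = now
        if now and not before:
            return "set"    # first evaluation that sees the alarm condition
        if before and not now:
            return "reset"  # the alarm condition has disappeared
        return None         # no change, so no message
```

Because only transitions produce a result, a condition that persists over several 15-minute evaluations generates a single "set" message rather than one per evaluation.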
The alarm conditions are shown in the figure above. There are 9 cases:
When the alarm on the test-box is triggered (that is, the STA program returned "Alarm" for a pair of test-boxes), a message like this will be sent to the hosts of the pair of test-boxes that generated the alarm:
Date: Mon, 27 Sep 1999 00:05:16 GMT |
The Subject: line tells whether an alarm has been set or reset. The line "The testbox alarm..." shows which test-box generated the alarm. Note that each host will receive a message both for data sent to their box and for data originating from their box. In the latter case, the alarm will be sent by the receiving test-box. The following lines tell what happened:
If 2 (or more) pairs of test-boxes generate an alarm condition at the same time, then this pair of lines will be repeated for the next test-box.
In this particular case, data sent from tt23 (test-box #23) to tt01 (test-box #1) generated an alarm. The delay distribution over the last 30 days had a 5% percentile of 59.0 ms, a median of 66.5 ms and a 95% percentile of 225.0 ms. In the last 30 minutes, these numbers climbed to 238.0, 277.5 and 299.0 ms respectively. Since this is above what was expected, the alarm was set. To find out where tt01 and tt23 are located, click here.
The line Satellite conditions shows the satellite conditions for the receiving test-box. This information is intended to verify the alarm. The next line (Mon Sep 27...) shows:
So, in the example above (0 0 0 0 21 29 7 0 0 0), there were 21 samples where the receiver saw 4 satellites, 29 where it saw 5 and 7 samples where it saw 6 satellites.
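The reading of the satellite line given in this example can be sketched as a small parser. It assumes, as the example suggests, that the first number counts samples with 0 visible satellites, the second counts samples with 1, and so on; the function name `parse_satellite_bins` is hypothetical.

```python
def parse_satellite_bins(line):
    """Map number of visible satellites -> number of samples.

    Assumes the bins are ordered from 0 satellites upwards, which matches
    the worked example in the text. Empty bins are omitted from the result.
    """
    counts = [int(token) for token in line.split()]
    return {satellites: n for satellites, n in enumerate(counts) if n}


bins = parse_satellite_bins("0 0 0 0 21 29 7 0 0 0")
# bins == {4: 21, 5: 29, 6: 7}
```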
This information can be used to verify the alarm. If ALL entries are in the first bin (60 0 0 0 0 0 0 0 0), then the alarm might be a false alarm caused by a drifting clock rather than a network problem. Note that a FEW entries in the lowest bin, up to about 1 out of 3, are no reason for concern.
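This check can be expressed as a short helper. It is only a sketch of the rule of thumb stated above; the function names are hypothetical, and the text does not say how to interpret cases where more than about a third, but not all, of the samples fall in the first bin.

```python
def lowest_bin_fraction(counts):
    """Fraction of samples that fall in the first (lowest) bin."""
    total = sum(counts)
    return counts[0] / total if total else 0.0


def suspect_clock_drift(counts):
    """True only when ALL samples are in the first bin, the one case the
    text singles out as a possible false alarm caused by a drifting clock."""
    total = sum(counts)
    return total > 0 and counts[0] == total
```

For the line `60 0 0 0 0 0 0 0 0` the helper flags a possible drifting clock, while for `0 0 0 0 21 29 7 0 0 0` it does not.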
If you suspect that the alarms are caused by a drifting clock, then you may want to look at the raw satellite data.
The URL will only appear for test-boxes equipped with a web-server, that is, series C and D. Clicking on the link takes you to the web site on the box, with a plot showing the delay and loss during the last hours.
When the alarm condition disappears, the host will receive a second email. This looks like:
Date: Mon, 27 Sep 1999 00:20:18 GMT |
The format is almost the same, except that the subject says that the alarm is reset.
By default, the alarm messages are sent to the contact person(s) for the test-box. If you want this to be changed, please contact the test-box operators.
For more details about this program, please contact the test-box operators.