[atlas] some thoughts and questions regarding probe "stability"
Robert Kisteleki robert at ripe.net
Tue Jul 22 15:53:13 CEST 2014
Hello,

To provide a bit of background information about $subject:

In order to receive reports from the probes, and to deliver (measurement) commands to them, we maintain a bidirectional channel from each probe to the infrastructure. At the moment this uses SSH. We consider a probe to be "connected" as long as this channel is open, and "disconnected" when it's not. Note that this is only an indicator of the probe's stability, not a precise quality metric.

These connections can break for a number of reasons -- administrative actions, probe power loss, power cycle, path problems between probe and infrastructure (including the NAT box, if applicable), and infrastructure availability. For example, every now and then we disconnect the probes to make them upgrade, or have to reboot the server a probe is connected to. All these events show up as disconnects.

The disconnection time mostly depends on the reason for the disconnect -- for example, a probe reboot can be done in seconds, a firmware upgrade takes something like 5-15 minutes, and a controller reboot can cause up to 2 hours of non-connectedness.

Finally: as Philip mentioned, we don't optimise for high connection times. The probes execute the pre-scheduled measurements even if they are not connected to the infrastructure.

Regards,
Robert