[atlas] some thoughts and questions regarding probe "stability"
Wilfried Woeber Woeber at CC.UniVie.ac.at
Thu Jul 17 14:48:40 CEST 2014
Hi Folks,

triggered by the discussion related to DNSMON, and an issue (power, resolved) with one of my V1 probes, I'd like to get some input, or start a discussion or an investigation.

To start with, I am not very clear what the term "stability" would/should mean in this context, as the probes are supposed to buffer measurement data locally, at least for a while (true?).

So, here goes...

Obviously, looking at some Atlas Stat pages, there are probes with a 100% uptime.

Now, looking at the 3 under my supervision (2x V1, 1 Anchor), ref "Connected" and "Disconnected", there's no chance to get near that value, as all of them tend to topple over on a regular basis, mostly for a *short* period of time in the range of 0m(!) to some 30+m.

With respect to the behaviour of the Anchor, which is mounted in the same rack as the backbone router it connects to, in a Data Center, we tried to correlate the (reported) disconnection events with the router and interface logs for the probe. No luck there; there was also no maintenance work or the like, so I presume that the Anchor didn't reboot and that there were no "real" network problems.
Let's compare the most recent dis/connection logs for my 3 pets:

ID 6009
Connected at          Connected for   Disconnected at       Disconnected for
2014-07-14 03:58:03   3d 8h 16m       Still Connected
2014-05-27 03:03:54   48d 0h 46m      2014-07-14 03:50:47   0h 7m
2014-05-20 15:19:02   6d 11h 37m      2014-05-27 02:57:00   0h 6m
2014-05-14 21:16:56   5d 17h 59m      2014-05-20 15:16:22   0h 2m
2014-04-08 16:03:21   36d 5h 1m       2014-05-14 21:05:17   0h 11m

ID 0466
Connected at          Connected for   Disconnected at       Disconnected for
2014-07-13 23:31:05   3d 12h 45m      Still Connected
2014-07-09 23:05:40   3d 23h 54m      2014-07-13 22:59:49   0h 31m
2014-06-16 10:53:21   23d 11h 55m     2014-07-09 22:49:04   0h 16m
2014-05-25 09:03:06   22d 1h 38m      2014-06-16 10:42:00   0h 11m
2014-05-24 20:34:50   11h 54m         2014-05-25 08:29:12   0h 33m

ID 0414
Connected at          Connected for   Disconnected at       Disconnected for
2014-07-07 23:41:23   9d 12h 35m      Still Connected
2014-07-02 03:58:45   5d 19h 31m      2014-07-07 23:29:54   0h 11m
2014-06-13 09:37:50   18d 18h 7m      2014-07-02 03:45:08   0h 13m
2014-06-08 13:22:14   4d 20h 7m       2014-06-13 09:29:38   0h 8m
2014-05-21 08:29:23   18d 4h 45m      2014-06-08 13:15:11   0h 7m

Again, I fail to see any obvious correlation. What am I missing?

Does anyone else see a similar pattern?

How to start debugging, if there's anything that needs debugging?

Thanks for your ideas and help!
Wilfried