[atlas] some thoughts and questions regarding probe "stability"

Thu Jul 17 18:44:15 CEST 2014

Just a thought: are you connected to a "green" switch that might be
dropping power to the port when idle? If the probe can't handle that, it
would disconnect from the network and the process would start over.

Bryan Socha
Network Engineer
DigitalOcean

On Thu, Jul 17, 2014 at 12:03 PM, Philip Homburg <philip.homburg at ripe.net>
wrote:

> Hi Wilfried,
>
> > Let's compare the most recent dis/connection logs for my 3 pets:
>
> Here is what I found in our logs:
>
> > ID 6009
> > 2014-07-14 03:58:03   3d 8h 16m        Still Connected
>
> Upgrade to firmware 4650
>
> > 2014-05-27 03:03:54   48d 0h 46m       2014-07-14 03:50:47    0h 7m
>
> Hard to say, some network glitch
>
> > 2014-05-20 15:19:02   6d 11h 37m       2014-05-27 02:57:00    0h 6m
>
> Anchor was rebooted
>
> > 2014-05-14 21:16:56   5d 17h 59m       2014-05-20 15:16:22    0h 2m
>
> Network glitch
>
>
> https://atlas.ripe.net/atlas/udm.html?1026358.increase_type=rel&1026358.current_shift=150&1026358.current_clip=250&1026358.group_by=cc&1026358.show_me_filter=max,pls&msm_id=1026358&1026358.start_timestamp=1400098401&1026358.end_timestamp=1400102942&1026358.selected_probes=6001,6002,6003,6019,6022,6031,6040,6052#tab-seismograph1026358
>
> > 2014-04-08 16:03:21   36d 5h 1m        2014-05-14 21:05:17    0h 11m
>
> Anchor was rebooted
>
> > ID 0466
> > 2014-07-13 23:31:05   3d 12h 45m       Still Connected
>
> Some network glitch, unclear what
>
> > 2014-07-09 23:05:40   3d 23h 54m       2014-07-13 22:59:49    0h 31m
>
> Probe upgraded firmware, reason for disconnect got lost
>
> > 2014-06-16 10:53:21   23d 11h 55m      2014-07-09 22:49:04    0h 16m
>
> Network problem
>
> > 2014-05-25 09:03:06   22d 1h 38m       2014-06-16 10:42:00    0h 11m
>
> Some network problem.
>
> > 2014-05-24 20:34:50   11h 54m          2014-05-25 08:29:12    0h 33m
>
> Unclear
>
> > ID 0414
> > 2014-07-07 23:41:23   9d 12h 35m       Still Connected
>
> Some network problem
>
> > 2014-07-02 03:58:45   5d 19h 31m       2014-07-07 23:29:54    0h 11m
>
> Power cycled?
>
> > 2014-06-13 09:37:50   18d 18h 7m       2014-07-02 03:45:08    0h 13m
>
> Some network problem. High RTTs
>
> > 2014-06-08 13:22:14   4d 20h 7m        2014-06-13 09:29:38    0h 8m
>
> Power cycled?
>
> > 2014-05-21 08:29:23   18d 4h 45m       2014-06-08 13:15:11    0h 7m
>
> Same.
>
> > Again, I fail to see some obvious correlation, what am I missing?
> >
> > Does anyone else see a similar pattern?
> >
> > How to start debugging, if there's anything that needs debugging?
>
> A couple of points:
> 1) The connection between a probe (or anchor) and its controller doesn't
> have to be perfectly stable. It has to be good enough that probes will
> report results in a timely fashion and can get commands. But nothing
> beyond that.
> 2) For a single probe to see a network failure (with measurements using
> the default parameters) the failure has to last for at least 10 minutes.
> That way a couple of measurements will have a chance to report on the
> failure. In contrast, the connection between a probe and the controller
> is already terminated if the network is down for one minute.
> 3) When a target is measured by many probes then it is likely that at
> least some probes will pick up an event. But from one probe on its own,
> it is hard to say anything.
> 4) Version 1 probes tend to reboot after losing the connection to the
> controller due to memory fragmentation issues. That is unfortunate, but
> we can't really do anything about it. Version 3 probes and anchors just
> report their results a little later.
>
> Philip
>
>
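
Point 2 above can be sketched in a few lines of Python. This is only a
minimal sketch, assuming the two thresholds from that point (one minute for
the controller connection, ten minutes for default measurements) act as hard
cut-offs, which is a simplification. The names parse_minutes and classify are
hypothetical, and the "disconnected for" values from the logs are only an
upper bound on the real outage, since they include reconnect time.

```python
import re

# Thresholds taken from point 2 above; treating them as exact cut-offs
# is an assumption made for illustration.
CONTROLLER_TIMEOUT_MIN = 1    # controller connection drops after ~1 minute
MEASUREMENT_VISIBLE_MIN = 10  # a single probe needs ~10 minutes of failure

UNIT_MINUTES = {"d": 1440, "h": 60, "m": 1}


def parse_minutes(text):
    """Convert a duration such as '3d 8h 16m' or '0h 7m' into minutes."""
    pairs = re.findall(r"(\d+)\s*([dhm])", text)
    return sum(int(number) * UNIT_MINUTES[unit] for number, unit in pairs)


def classify(outage):
    """Report which of the two thresholds an outage of this length crosses."""
    minutes = parse_minutes(outage)
    return {
        "minutes": minutes,
        "controller_disconnect": minutes >= CONTROLLER_TIMEOUT_MIN,
        "visible_in_measurements": minutes >= MEASUREMENT_VISIBLE_MIN,
    }


if __name__ == "__main__":
    # "Disconnected for" values quoted above for ID 6009.
    for gap in ["0h 7m", "0h 6m", "0h 2m", "0h 11m"]:
        print(gap, classify(gap))
```

Under these assumptions, most of the gaps listed for ID 6009 are long enough
to drop the controller connection but too short for a single probe's default
measurements to report on them.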