[atlas] some thoughts and question regarding probe "stability"
Philip Homburg philip.homburg at ripe.net
Fri Jul 18 14:23:43 CEST 2014
On 2014/07/18 12:12 , Wilfried Woeber wrote:
> Hi Philip + Team,
>
> Philip Homburg wrote:
>
> first of all thanks for investigating!

No problem. I was also curious myself why 'normal' probes would disconnect.
Most time is spent looking at the exceptions.

> [...]
>> More like, the controller 'pings' the probe every 20 seconds and after 3
>> missed responses the connection is terminated.
>>
>> And for the Atlas system as a whole, that works. But the goal of the
>> Atlas system is not to have a probe connected as long as possible.
>
> That's fully understood.
>
> I'm still having a couple of questions :-)
>
> 1) if I do understand correctly, the decision to label a probe "disconnected"
>    is made by the associated collector, based on pings? (btw. - "real" pings
>    on ICMP or internal over the channel?)

Connected/disconnected is based on whether a probe has an ssh connection to a
controller. There is a keepalive mechanism within the ssh protocol to see if
the other end is still there. That ssh mechanism is used to abort the
connection. Nothing to do with real (ICMP) pings.

> 2) if that's the case, is there an easy way to find out to which collector a
>    probe is "assigned"? (is this static or dynamic?)

I don't know why, but that information is not shown to normal users. Of
course, if you can capture traffic, you can easily find out :-)

The assignment is dynamic.

> 3) if a probe, in particular an anchor, gets updated with a new firmware, is
>    it possible that the ethernet IF does *not* go down? (Note, the 6009 is an
>    old, big, beta box! Is there a difference with the new soekris probes?)

On regular probes a firmware upgrade always involves a reboot. On anchors the
Atlas 'firmware' is an rpm. There is no reason to reboot the box or bring its
interface down to upgrade the Atlas rpm.

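
To make the keepalive arithmetic from question 1 concrete, here is a minimal
Python sketch. The 20 second interval and the limit of 3 missed responses are
the numbers quoted above. Everything else is an assumption for illustration:
the function name is hypothetical, keepalive number i is assumed to be sent at
i * 20 seconds, and the connection is assumed to be dropped at the moment the
third consecutive keepalive goes unanswered. This is not the controller's
actual code, and the real ssh timing may differ by up to one interval.

```python
KEEPALIVE_INTERVAL_S = 20  # from the thread: one keepalive every 20 seconds
MAX_MISSED = 3             # from the thread: 3 missed responses end the connection


def seconds_until_disconnect(responses, interval=KEEPALIVE_INTERVAL_S,
                             max_missed=MAX_MISSED):
    """Return the time in seconds at which the connection is dropped, or None.

    `responses` holds one boolean per keepalive, True if the probe answered.
    Keepalive number i (starting at 1) is assumed to be sent at i * interval.
    """
    missed = 0
    for i, answered in enumerate(responses, start=1):
        if answered:
            missed = 0  # any answer resets the count of consecutive misses
        else:
            missed += 1
            if missed >= max_missed:
                return i * interval
    return None  # never reached max_missed consecutive misses
```

Under these assumptions a probe that goes silent is marked disconnected about
a minute later, while a single lost keepalive followed by an answer has no
effect.
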
> Just to be very clear, I just want to understand how to interpret things,
> 'cause I already had an issue with one of my v1 probes, and in the end it
> turned out that the USB power feed was just borderline, problem gone after
> replacement.

Yes it is good to keep an eye on those things. We can only look at probes
statistically or in response to tickets, mail, etc.

> And as an ISP and backbone operator, seeing stuff as "down" or "disconnected",
> without a good explanation, starts to itch after a while :-)

I think the best page to look at is the 'Result from Built-in Measurements'.
If those graphs look fine, then there is no real reason to worry. Unless the
probe keeps connecting and disconnecting multiple times a day or something
like that.
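
If you keep your own record of when a probe disconnected, a rough check for
the "multiple times a day" pattern could look like the sketch below. It
assumes you already have the disconnect times as Python datetime objects; how
you collect them is outside the sketch. The function name and the threshold of
3 per day are hypothetical choices for illustration, not anything Atlas
defines.

```python
from collections import Counter
from datetime import datetime


def flapping_days(disconnect_times, threshold=3):
    """Return {date: count} for days with at least `threshold` disconnects.

    `disconnect_times` is an iterable of datetime objects, one per disconnect.
    The default threshold of 3 is an arbitrary illustrative value.
    """
    per_day = Counter(ts.date() for ts in disconnect_times)
    return {day: n for day, n in per_day.items() if n >= threshold}
```

A day with one or two disconnects is left out of the result, which matches the
idea that occasional disconnects are not a reason to worry.
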