[Atlas-anchors-pilot] Fwd: Dual uplink failure on OSL2 core switch

Tore Anderson tore.anderson at redpill-linpro.com
Mon Mar 4 14:07:35 CET 2013


FYI, we had an incident impacting the network connectivity of
no-osl-as39029 today, which is directly attached to the impacted switch.
Apologies for any trouble this may have caused. See below for a copy of
the incident report.

----------------

Dual uplink failure on OSL2 core switch

This message is sent to inform you about an event that affects the
operation of one or more of your systems hosted at Redpill Linpro.

The following significant events have been noted:

2013-03-04 - 12:04 (SLA begins) - Last remaining uplink fails
2013-03-04 - 12:42 - Connectivity fully restored

Total SLA time: 38 minutes.
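
As a quick cross-check of the figure above, the SLA duration can be
recomputed from the two timestamps in the event list. This is only a
minimal sketch: the variable names and the timestamp format string are
assumptions made for illustration, not part of any Redpill Linpro
tooling.

```python
from datetime import datetime

# Timestamps copied from the event list above. The format string is an
# assumption chosen to match how they are written in this report.
FMT = "%Y-%m-%d %H:%M"
sla_start = datetime.strptime("2013-03-04 12:04", FMT)  # last uplink fails
restored = datetime.strptime("2013-03-04 12:42", FMT)   # connectivity restored

# Whole minutes elapsed between the two events.
sla_minutes = int((restored - sla_start).total_seconds() // 60)
print(f"Total SLA time: {sla_minutes} minutes")
```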

Description:

After inserting two SFP+ optical transceivers in the core switch
in our OSL2 data centre, both ports connecting the core switch to our
backbone went down. The exact reason why this happened is unknown, but
it must be a bug in the PHY handling of the switch, as inserting
transceivers in one port *should* not have any impact on any other
ports. No other ports had similar problems. However, the uplink ports
are directly adjacent to the ports where the new transceivers were
inserted, so our current theory is that the ports are managed in pairs
by the hardware, and that some property of the new transceivers
confused the chips/software handling the ports.

Consequences:

The OSL2 data centre completely lost connectivity to the internet and
the rest of our backbone.

The backbone connection to our KSD1 data centre also transits this
switch, so the connection between Karlstad and Oslo was lost.
This led to blackholing of traffic entering our network in Karlstad
addressed to destinations in Oslo (including OSL3) and vice versa.

Finally, peering and transit connections to Telenor, TDC (Oslo), and
Level3 were lost. Our connections to NIX, Global Crossing, Availo, and
TDC (Karlstad) were not impacted. Our backbone therefore retained full
internet connectivity, although some routes may have been less optimal
than normal.

Response:

After determining that the problem was due to failed uplinks, the SFP+
optical transceivers for the uplinks were re-seated. This immediately
restored all connectivity.

Severity: 1. Critical.

Regards,
Tore Anderson
Managed Services
Redpill Linpro
