
FS#10307 — FS#14176 — GSW Paris

Attached to Project: Network
Incident
GSW
CLOSED
100%
The Globalswitch PoP is down. We're investigating...
Date:  Wednesday, 29 July 2015, 18:02
Reason for closing:  Done
Comment by OVH - Wednesday, 29 July 2015, 16:09

The outage was caused by human error: a change to the OSPF configuration cut off the GSW router.
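
For illustration only (the report does not give the exact command, and the process ID here is hypothetical): on an IOS router, deleting the whole OSPF process is a single line, which is how a change confirmed without thinking can take a router out of the IGP instantly:

gsw-1-a9(config)#no router ospf 1

Once the process is gone, the router stops advertising its prefixes into the IGP.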

The TH2 PoP in Paris has taken over the traffic.


Comment by OVH - Wednesday, 29 July 2015, 17:02

The th2-1-a9 and some other links are unstable. The other GSW routers are still communicating.

We're investigating...

Apparently one of the "reflector" routers (rf-3-a1) didn't tell the other routers that GSW was down, so the GSW routes remained installed.

We're shutting down the BGP sessions towards rf-3-a1 and th2-1-a9 to see if that works.

That fixed it. So that's where the issue is.

We're cutting off all BGP sessions.

rf-3-a1#clear ip bgp *
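
A hedged sketch of the per-session test described above (neighbour address is hypothetical; 16276 is OVH's public ASN): shutting down one neighbour at a time narrows the fault before resetting everything with clear ip bgp *.

rf-3-a1(config)#router bgp 16276
rf-3-a1(config-router)#neighbor 192.0.2.9 shutdown
! 192.0.2.9 stands in for the session towards th2-1-a9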


Comment by OVH - Wednesday, 29 July 2015, 17:03

rf-1-a1 is down along with GSW.

We reset rf-3-a1, which apparently has a bug.
For a few minutes we were therefore connected to only one route reflector, rf-2-a1.


Comment by OVH - Wednesday, 29 July 2015, 17:05

Resetting rf-3-a1 fixed the route propagation problem that isolating the gsw-1-a9 router should already have resolved.

Traffic is back to normal. The connections handled by gsw-1-a9 were the main ones impacted:
- 50% Free
- 50% Orange
- 30% Telefonica (Backup)
- 50% Google Europe

Transit:
- 20G Cogent
- 40G Tata
- 20G Level3
- 10G Telia

The rest of the backbone continued to function as normal.


Comment by OVH - Wednesday, 29 July 2015, 17:06

Summary of the gsw-1-a9 reconfiguration:
We shut down the BGP sessions with the PNI and transit peers.
We reinstated the OSPF configuration.
It's up.
We're re-establishing the BGP sessions with the peers.
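
The summary above maps to roughly the following IOS sequence (a sketch, not the actual change; the OSPF process ID, network statement, and neighbour address are hypothetical):

! 1. Cut the BGP sessions with the PNI and transit peers
gsw-1-a9(config)#router bgp 16276
gsw-1-a9(config-router)#neighbor 192.0.2.1 shutdown
gsw-1-a9(config-router)#exit
! 2. Reinstate the OSPF configuration
gsw-1-a9(config)#router ospf 1
gsw-1-a9(config-router)#network 10.0.0.0 0.255.255.255 area 0
gsw-1-a9(config-router)#exit
! 3. Once OSPF has converged, re-establish the peer sessions
gsw-1-a9(config)#router bgp 16276
gsw-1-a9(config-router)#no neighbor 192.0.2.1 shutdown

Bringing the IGP up before the BGP sessions avoids attracting peer traffic onto a router that cannot yet reach the rest of the backbone.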


Comment by OVH - Wednesday, 29 July 2015, 17:07

Everything is up.


Comment by OVH - Wednesday, 29 July 2015, 18:01

Hello,

An incident has just occurred on the routing of one of the two routers in Paris: gsw-1-a9. The incident was caused by human error. One of the engineers on the network team (that's my team) accidentally deleted the OSPF configuration on the router, despite double-checking it. He confirmed the change without realising what he had done (on autopilot), and as a result the gsw-1-a9 router went down.

Everything should still have continued to function. But a bug on the 3rd reflector router, rf-3-a1, stopped it from alerting the rest of the backbone that gsw-1-a9 was down. rf-1-a1 would normally have done so, but rf-1-a1 was down during the outage. The backbone therefore continued to behave as if the gsw-1-a9 router were up, and we saw routing loops in the traceroutes.

We reset all BGP sessions on rf-3-a1, but since only rf-2-a1 was then synchronising BGP between all the routers in Europe (rf-1-a1 and gsw-1-a9 being down), the European network was intermittent: each router alternated between reachable and unreachable for 60-120 seconds.

Everything then came back up; then we reconfigured the gsw-1-a9 router.
The backbone is up.

Please accept our sincere apologies for this outage. Human errors can happen and the backbone is meant to prevent this type of issue. We will investigate to find the bug on our RR (ASR1002). Then we will give the team a good hiding…

Best,
Octave