The outage was caused by human error: deleting the OSPF configuration cut off the GSW router.
TH2 in Paris has taken over the traffic.
Comment by OVH - Wednesday, 29 July 2015, 17:02
The th2-1-a9 and some other links are unstable. The other GSW routers are still communicating.
We're investigating...
Apparently one of the route-reflector routers (rf-3-a1) didn't tell the other routers that GSW was down, so the routes via GSW remained installed.
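One way to confirm that rf-3-a1 was still announcing the stale routes would be to list what it advertises to one of its clients; this is a sketch with a hypothetical client address, not a command from the actual incident:
rf-3-a1#show ip bgp neighbors 192.0.2.10 advertised-routes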
We're cutting off the BGP session towards rf-3-a1 and th2-1-a9 to see if that works.
That fixed it. So that's where the issue is.
We're cutting off all BGP sessions.
rf-3-a1#clear ip bgp *
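For reference, a soft reset refreshes the routes without tearing down every session; whether it would have cleared this particular bug is an open question:
rf-3-a1#clear ip bgp * soft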
Comment by OVH - Wednesday, 29 July 2015, 17:03
rf-1-a1 is down with GSW.
We reset rf-3-a1, which apparently has a bug.
For a few minutes we were therefore connected to only a single RR, rf-2-a1.
Comment by OVH - Wednesday, 29 July 2015, 17:05
Resetting rf-3-a1 fixed the communication problem that isolating the gsw-1-a9 router should already have resolved.
Traffic is back to normal. Mainly the connections managed by gsw-1-a9 were impacted:
- 50% Free
- 50% Orange
- 30% Telefonica (backup)
- 50% Google Europe
Transit:
- 20G Cogent
- 40G Tata
- 20G Level3
- 10G Telia
The rest of the backbone continued to function as normal.
Comment by OVH - Wednesday, 29 July 2015, 17:06
Summary of the gsw-1-a9 configuration:
We cut off the BGP sessions with PNI and Transit.
We reinstated the OSPF configuration.
It's up.
We're bringing the BGP sessions with the peers back up.
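As a rough sketch, the recovery steps above would look something like this on an IOS router (the neighbor address, OSPF process number and network statement are hypothetical, not the actual gsw-1-a9 configuration; 16276 is OVH's AS number):
gsw-1-a9(config)#router bgp 16276
gsw-1-a9(config-router)#neighbor 192.0.2.1 shutdown
gsw-1-a9(config-router)#router ospf 1
gsw-1-a9(config-router)#network 10.0.0.0 0.255.255.255 area 0
gsw-1-a9(config-router)#router bgp 16276
gsw-1-a9(config-router)#no neighbor 192.0.2.1 shutdown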
Comment by OVH - Wednesday, 29 July 2015, 17:07
Everything is up.
Comment by OVH - Wednesday, 29 July 2015, 18:01
Hello,
An incident has just occurred on the routing of the 2 routers in Paris: gsw-1-a9. The incident was caused by human error. One of the engineers in the network team (that’s my team) accidentally deleted the OSPF configuration on the router, despite double checking the configuration. He confirmed the change without realising it (on autopilot)... and as a result the gsw-1-a9 router went down.
Everything should still have continued to function, however. But a bug on the 3rd route-reflector router, rf-3-a1, stopped it from alerting the rest of the backbone that gsw-1-a9 was down. rf-2-a1 did it, but rf-1-a1 was down during the outage. The backbone therefore continued to behave as if the gsw-1-a9 router were up. We saw that there was a looping problem in the traceroutes.
We’ve reset all BGP sessions on rf-3-a1, but given that only rf-2-a1 was then synchronising BGP between all the routers in Europe (because rf-1-a1 and gsw-1-a9 were down), the European network was intermittent: each router was alternately reachable and unreachable for 60-120 seconds.
Everything then came back up; then we reconfigured the gsw-1-a9 router.
The backbone is up.
Please accept our sincere apologies for this outage. Human errors can happen and the backbone is meant to prevent this type of issue. We will investigate to find the bug on our RR (ASR1002). Then we will give the team a good hiding…
Best,
Octave