
FS#4145 — gsw-4

Attached to Project: Network
Type: Incident
Category: Global Switch
Status: CLOSED
Percent Complete: 100%
We have a routing problem on gsw-4.


Date: Tuesday, 11 May 2010, 15:27
Reason for closing: Done
Comment by OVH - Thursday, 29 April 2010, 11:12


We are seeing some strange logs on a few routers regarding the IPs used
for the Global Switch routers.

Apr 29 10:29:38 20g.vss-3-6k.routers.chtix.eu 22890: Apr 29 09:29:23 GMT: %COMMON_FIB-6-FIB_RECURSION_VIA_SELF: 213.251.190.48/28 is found to resolve via itself during setting up switching info
Apr 29 10:29:38 20g.vss-3-6k.routers.chtix.eu 22891: Apr 29 09:29:23 GMT: %COMMON_FIB-SW1_DFC8-6-FIB_RECURSION_VIA_SELF: 213.251.190.48/28 is found to resolve via itself during setting up switching info
Apr 29 10:29:38 20g.vss-3-6k.routers.chtix.eu 22892: Apr 29 09:29:23 GMT: %COMMON_FIB-SW2_DFC9-6-FIB_RECURSION_VIA_SELF: 213.251.190.48/28 is found to resolve via itself during setting up switching info
Apr 29 10:29:38 20g.vss-3-6k.routers.chtix.eu 22893: Apr 29 09:29:23 GMT: %COMMON_FIB-SW1_DFC9-6-FIB_RECURSION_VIA_SELF: 213.251.190.48/28 is found to resolve via itself during setting up switching info
Apr 29 10:29:38 20g.vss-3-6k.routers.chtix.eu 22894: Apr 29 09:29:23 GMT: %COMMON_FIB-SW2_DFC8-6-FIB_RECURSION_VIA_SELF: 213.251.190.48/28 is found to resolve via itself during setting up switching info
Apr 29 10:29:38 20g.vss-3-6k.routers.chtix.eu 22895: Apr 29 09:29:23 GMT: %COMMON_FIB-SW2_SPSTBY-6-FIB_RECURSION_VIA_SELF: 213.251.190.48/28 is found to resolve via itself during setting up switching info
Apr 29 10:29:38 20g.vss-3-6k.routers.chtix.eu 22896: Apr 29 09:29:23 GMT: %COMMON_FIB-SW1_SP-6-FIB_RECURSION_VIA_SELF: 213.251.190.48/28 is found to resolve via itself during setting up switching info
Apr 29 10:29:38 20g.vss-3-6k.routers.chtix.eu 22897: Apr 29 09:29:23 GMT: %COMMON_FIB-SW1_DFC1-6-FIB_RECURSION_VIA_SELF: 213.251.190.48/28 is found to resolve via itself during setting up switching info

It seems that, this morning, the routers did not like the announcement of
213.251.190.48/28 in both OSPF and BGP.
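
For reference: the message means that the forwarding entry for the /28
resolves through a next hop that itself falls inside that same /28. A purely
illustrative sketch of the commands one might run on the VSS to inspect this
(only the prefix comes from the logs above):

! Illustrative diagnostic session; only the prefix is taken from the logs.
show ip route 213.251.190.48 255.255.255.240 longer-prefixes
show ip cef 213.251.190.48/28 detail
show ip bgp 213.251.190.48/28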

We have just removed the BGP announcement and kept only the OSPF one.
Another nice bug.
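
Illustratively, withdrawing the announcement amounts to something like the
following on the edge router (the AS number is a placeholder; the OSPF side
is left untouched):

! Hypothetical sketch; <ASN> stands in for the real AS number.
router bgp <ASN>
 no network 213.251.190.48 mask 255.255.255.240
! The /28 remains advertised internally by the existing OSPF configuration.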


Comment by OVH - Thursday, 29 April 2010, 11:27

The origin of the problem is probably the maintenance task, carried out as an
emergency, that we performed this morning on DECIX in Frankfurt:
http://travaux.ovh.com/?do=details&id=4131

The shutdown/no shutdown on DECIX thus probably caused a small overload on
the VSS while it was recalculating its BGP tables. This recurring problem of
VSS overload under BGP load will soon be resolved with the introduction of
2 ASR 1000s as route collectors for the whole network. This router is
specifically designed for large BGP tables and a high volume of BGP
operations.
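
If those route collectors act as iBGP route reflectors (an assumption on the
design, not something stated here), the core of such a configuration on the
ASR 1000 would look roughly like this, with placeholder AS number and
neighbor address:

! Hypothetical sketch; <ASN> and 10.0.0.1 are placeholders.
router bgp <ASN>
 neighbor 10.0.0.1 remote-as <ASN>
 neighbor 10.0.0.1 route-reflector-client
! The ASR 1000 then absorbs the BGP recalculation load instead of the VSS.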


Comment by OVH - Tuesday, 11 May 2010, 14:14

We have just been hit by an attack. The attack is blocked now; it does not
necessarily have anything to do with this morning.


Comment by OVH - Tuesday, 11 May 2010, 14:15

We are investigating.


Comment by OVH - Tuesday, 11 May 2010, 15:12

Well.
Since approximately 21:00 we have had a problem on gsw-4-c1 which impacts 50%
of our clients' bays in Global Switch, and which sometimes affects gsw-3 as
well. We have moved the routing of our secondary DNS servers to a new router.
Still down.
We have:
- looked for the attack we are under, and we cannot find it
- looked for an attack coming from one of our clients, with the same result
- restarted one of the 2 routing cards; some ports ended up in a fault state
  (see the checks sketched below). This caused a reboot of the second card,
  and a few of its ports ended up in a fault state as well
- rebooted the whole router; 95% of it came back up
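
For context, these are the kind of commands used on this class of chassis to
check what the restarts left behind (illustrative, not from the ticket;
mapping "fault state" to err-disabled is an assumption):

! Check line-card status, fault-state ports and redundancy state.
show module
show interfaces status err-disabled
show redundancy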

Our bet is therefore the following scenario: following this morning's attack,
something in the hardware was pushed to its limit and broke in the afternoon.

We are looking for 2 spare routing cards and will then replace the cards one
after the other. If we are lucky, everything will come back up. We think the
probability that it is the chassis itself that is at fault is not zero.

In the first case (cards only): everything back up around midnight.
In the second case (the chassis): around 1:30/2:00 a.m.


Comment by OVH - Tuesday, 11 May 2010, 15:18

We have shut down the remaining "up" port on the second card. Things look
better. We have now cut all routing via card 2. All of the clients are up.

So it is probably card #2 in the router that has the hardware problem, and we
will therefore replace it in approximately 1 hour.


Comment by OVH - Tuesday, 11 May 2010, 15:20

We are replacing the card.


Comment by OVH - Tuesday, 11 May 2010, 15:24

Changed. The configuration is synchronized and BGP is back up.
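
Typical post-swap checks would be along these lines (illustrative, not taken
from the ticket):

! Verify card/redundancy state and that the BGP sessions re-established.
show module
show redundancy
show ip bgp summary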

Everything is up again.

We are sorry for the length of the outage.
Hardware failures are never "clean".