FS#326 — FS#4408 — ldn-1-6k

Attached to Project— Network
Incident
the whole network
CLOSED
100%
The router is down.
Date:  Monday, 26 July 2010, 00:59AM
Reason for closing:  Done
Comment by OVH - Saturday, 24 July 2010, 20:29PM

fra-5 and th1-1 are failing: not enough CPU.
We have disabled MPLS across the entire backbone.


Comment by OVH - Saturday, 24 July 2010, 20:52PM

Jul 24 20:28:13 40g.fra-5-6k.routers.chtix.eu 623150: Jul 24 19:27:53 GMT: %FIB-2-FIBDOWN: CEF has been disabled due to a low memory condition.
Jul 24 20:28:13 40g.fra-5-6k.routers.chtix.eu 623151: It can be re-enabled by configuring "ip cef [distributed]"


Comment by OVH - Saturday, 24 July 2010, 20:54PM

We isolated fra-5.


Comment by OVH - Saturday, 24 July 2010, 20:57PM

ams-1 went down. The router has just come back up.


Comment by OVH - Saturday, 24 July 2010, 21:01PM

Proxy ARP disabled on vss-2.


Comment by OVH - Saturday, 24 July 2010, 21:39PM

fra-5-6k is back. Its cards are not yet fully back up.
ams-1-6k is back, in the same state: it has rebooted a card again.
ldn-1-6k crashed; we are fixing it over the serial cable, boot in progress.
vss-2-6k: proxy ARP has been re-enabled.

This is the worst backbone crash we have ever had at OVH ...
It was a domino effect across routers that had not been rebooted in a long time and whose RAM is fragmented.

Jul 24 20:21:29 40g.fra-5-6k.routers.chtix.eu 622981: Pool: Processor Free: 30087848 Cause: Memory fragmentation
Jul 24 20:21:29 40g.fra-5-6k.routers.chtix.eu 622982: Alternate Pool: None Free: 0 Cause: No Alternate pool
Jul 24 20:21:29 40g.fra-5-6k.routers.chtix.eu 622983: -Process= "IP RIB Update", ipl= 0, pid= 164
Jul 24 20:21:29 40g.fra-5-6k.routers.chtix.eu 622984: -Traceback= 4102AD28 41030958 410433E0 413C2D10 42289224 406417AC 42305768 409D2680 40983230 40983350
Jul 24 20:21:29 40g.fra-5-6k.routers.chtix.eu 622985: Jul 24 19:21:07 GMT: %FIB-3-NORPXDRQELEMS: Exhausted XDR queuing elements while preparing message for slot/cpu 1/0
Jul 24 20:21:29 40g.fra-5-6k.routers.chtix.eu 622986: -Process= "IP RIB Update", ipl= 0, pid= 164
Jul 24 20:21:29 40g.fra-5-6k.routers.chtix.eu 622987: -Traceback= 413C2DE0 42289224 406417AC 42305768 409D2680 40983230 40983350
Jul 24 20:21:46 40g.fra-5-6k.routers.chtix.eu 623015: Jul 24 19:21:11 GMT: %FIB-3-NOMEM: Malloc Failure, disabling DCEF
Jul 24 20:27:34 40g.fra-5-6k.routers.chtix.eu 623147: Jul 24 19:27:15 GMT: %C6KFIB-4-DISABLED: Hardware FIB forwarding disabled, reverting to only software forwarding.

It is time to deploy a new generation of routers.
This was planned, but only for September (the hardware has to be available first).


Comment by OVH - Saturday, 24 July 2010, 21:44PM

We have reverted a queue modification on the 10G ports to restore the old values. We had made the change this week to increase the port buffers.
Apparently the router did not handle this option correctly.


Comment by OVH - Saturday, 24 July 2010, 21:47PM

fra-5: there are still some problems:
Jul 24 20:30:53 GMT: %TFIB-SP-7-SCANSABORTED: TFIB scan not completing. MAC string updated.
-Traceback= 40E40578 40E40904 40F1664C 40E18AD8 40E19078 40DFF760 40DFFB7C 40DFFE58 40E00AD8
Jul 24 20:31:11 GMT: %TFIB-DFC4-7-SCANSABORTED: TFIB scan not completing. MAC string updated.
-Traceback= 20F6AE38 20F6B1C4 2103E87C 20F43398 20F43938 20F2A020 20F2A43C 20F2A718 20F2B398
Jul 24 20:31:14 GMT: %TFIB-DFC1-7-SCANSABORTED: TFIB scan not completing. MAC string updated.
-Traceback= 20F6AE38 20F6B1C4 2103E87C 20F43398 20F43938 20F2A020 20F2A43C 20F2A718 20F2B398
Jul 24 20:31:15 GMT: %TFIB-DFC5-7-SCANSABORTED: TFIB scan not completing. MAC string updated.


Comment by OVH - Saturday, 24 July 2010, 21:47PM

Jul 24 21:32:47 40g.fra-5-6k.routers.chtix.eu 418: Jul 24 20:32:27 GMT: %C6KPWR-SP-4-DISABLED: power to module in slot 8 set off (Module Failed SCP dnld)


Comment by OVH - Saturday, 24 July 2010, 21:47PM

Jul 24 21:33:07 160G.rbx-1-6k.routers.ovh.net 48924: Jul 24 20:32:47 GMT: %DIAG-SP-3-TEST_FAIL: Module 9: TestMacNotification{ID=13} has failed. Error code = 0x1


Comment by OVH - Saturday, 24 July 2010, 21:48PM

We are booting the cards one by one:
fra-5-6k(config)#no power en module 2
fra-5-6k(config)#no power en module 7
fra-5-6k(config)#no power en module 8
fra-5-6k(config)#no power en module 9


Comment by OVH - Saturday, 24 July 2010, 21:48PM

Jul 24 21:37:20 40g.fra-5-6k.routers.chtix.eu 707: Jul 24 20:36:57 GMT: %IPACCESS-2-NOMEMORY: Alloc fail for acl-config buffer. Disabling distributed mode on lc
Jul 24 21:37:20 40g.fra-5-6k.routers.chtix.eu 708: Jul 24 20:36:57 GMT: %IPACCESS-2-NOMEMORY: Alloc fail for acl-config buffer. Disabling distributed mode on lc
Jul 24 21:37:20 40g.fra-5-6k.routers.chtix.eu 709: Jul 24 20:36:58 GMT: %FIB-3-NOMEM: Malloc Failure, disabling DCEF


Comment by OVH - Saturday, 24 July 2010, 21:48PM

Jul 24 21:37:55 40g.fra-5-6k.routers.chtix.eu 718: Jul 24 20:37:26 GMT: %SYS-2-MALLOCFAIL: Memory allocation of 64 bytes failed from 0x420B35A8, alignment 8
Jul 24 21:37:55 40g.fra-5-6k.routers.chtix.eu 719: Pool: Processor Free: 0 Cause: Not enough free memory
Jul 24 21:37:55 40g.fra-5-6k.routers.chtix.eu 720: Alternate Pool: None Free: 0 Cause: No Alternate pool
Jul 24 21:37:55 40g.fra-5-6k.routers.chtix.eu 721: -Process= "Tag Control", ipl= 0, pid= 278
Jul 24 21:37:55 40g.fra-5-6k.routers.chtix.eu 722: -Traceback= 4102AD28 410315F0 420B35B0 420B4960 420BBF90 421EFA60 420BD978 420B7760 420BB770
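
The excerpts above use the Cisco IOS message tag convention `%FACILITY-SEVERITY-MNEMONIC`, with an optional sub-facility such as `SP` or `DFC4` before the severity digit (0 is the most severe, 7 is debugging). As a rough sketch only, the Python below pulls that tag out of a few lines copied from this ticket and tallies them by severity. The names `parse_tag`, `count_by_severity` and `SAMPLE` are illustrative assumptions, not part of any OVH or Cisco tooling.

```python
import re
from collections import Counter

# Standard Cisco IOS severity levels, index = severity digit.
SEVERITY_NAMES = ["emergencies", "alerts", "critical", "errors",
                  "warnings", "notifications", "informational", "debugging"]

# %FACILITY[-SUBFACILITY]-SEVERITY-MNEMONIC:  e.g. %TFIB-SP-7-SCANSABORTED:
TAG_RE = re.compile(
    r"%(?P<facility>[A-Z0-9_]+)"
    r"(?:-(?P<sub>[A-Z0-9_]+))?"
    r"-(?P<severity>[0-7])"
    r"-(?P<mnemonic>[A-Z0-9_]+):"
)

def parse_tag(line):
    """Return (facility, subfacility, severity, mnemonic), or None if the line has no tag."""
    m = TAG_RE.search(line)
    if m is None:
        return None
    return (m.group("facility"), m.group("sub"),
            int(m.group("severity")), m.group("mnemonic"))

def count_by_severity(lines):
    """Count tagged lines per severity name; untagged continuation lines are skipped."""
    counts = Counter()
    for line in lines:
        tag = parse_tag(line)
        if tag is not None:
            counts[SEVERITY_NAMES[tag[2]]] += 1
    return counts

# Lines copied verbatim from the comments in this ticket.
SAMPLE = [
    "Jul 24 20:21:46 40g.fra-5-6k.routers.chtix.eu 623015: Jul 24 19:21:11 GMT: %FIB-3-NOMEM: Malloc Failure, disabling DCEF",
    "Jul 24 20:27:34 40g.fra-5-6k.routers.chtix.eu 623147: Jul 24 19:27:15 GMT: %C6KFIB-4-DISABLED: Hardware FIB forwarding disabled, reverting to only software forwarding.",
    "Jul 24 20:30:53 GMT: %TFIB-SP-7-SCANSABORTED: TFIB scan not completing. MAC string updated.",
    "Jul 24 21:37:55 40g.fra-5-6k.routers.chtix.eu 719: Pool: Processor Free: 0 Cause: Not enough free memory",
    "Jul 24 21:37:55 40g.fra-5-6k.routers.chtix.eu 718: Jul 24 20:37:26 GMT: %SYS-2-MALLOCFAIL: Memory allocation of 64 bytes failed from 0x420B35A8, alignment 8",
]

if __name__ == "__main__":
    for name, n in sorted(count_by_severity(SAMPLE).items()):
        print(name, n)
```

Sorting a collected log this way makes the severity 2 and 3 memory messages stand out from the debugging-level TFIB noise.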


Comment by OVH - Saturday, 24 July 2010, 22:26PM

fra-5 is down again. It is a memory problem. We are doing a hard reboot.


Comment by OVH - Saturday, 24 July 2010, 22:28PM

We have isolated all sessions on fra-5 and disconnected everything.
We are saving the configuration, then rebooting.


Comment by OVH - Saturday, 24 July 2010, 23:04PM

On ldn-1-6k, the crashinfo shows:
Jul 24 19:05:24 GMT: %C6K_PLATFORM-SP-2-PEER_RESET: SP is being reset by the RP


Comment by OVH - Saturday, 24 July 2010, 23:52PM

We have restored all sessions on fra-5. It is stable.

We believe the cause is memory exhaustion and memory fragmentation that built up after we set up the secured paths via "london/amsterdam" and "paris/frankfurt".
The ldn, ams and fra routers consumed more memory because of the new information, and we are evidently reaching the upper limits. On ldn, for example, 73 MB out of 1 GB remains free, but only 53 MB of that is unfragmented.
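
To put the ldn figures in proportion, here is a small worked calculation. It assumes that 1 GB (Go) means 1024 MB (Mo) and that the 53 MB figure is the largest contiguous free block; neither assumption is stated in the ticket. The function names are illustrative.

```python
def free_ratio(total_mb, free_mb):
    """Share of the memory pool that is still free."""
    return free_mb / total_mb

def fragmentation(free_mb, largest_block_mb):
    """Share of the free memory that lies outside the largest contiguous block."""
    return 1 - largest_block_mb / free_mb

TOTAL_MB = 1024    # assumption: 1 GB = 1024 MB
FREE_MB = 73       # free memory reported on ldn
LARGEST_MB = 53    # assumption: the unfragmented part is one contiguous block

if __name__ == "__main__":
    print(f"free: {free_ratio(TOTAL_MB, FREE_MB):.1%}")                 # free: 7.1%
    print(f"fragmented: {fragmentation(FREE_MB, LARGEST_MB):.1%}")      # fragmented: 27.4%
```

Under these assumptions about 7% of the pool is free, and more than a quarter of that free space cannot serve a large contiguous allocation.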


Comment by OVH - Monday, 26 July 2010, 00:59AM

It will be fixed with the BGP collector router, which has been ordered and is due to arrive in 5 weeks. We will then have fewer BGP sessions per router, and only simple BGP.