Incident Details
Resolved [21/12/2022 10:47]
In summary, here are details of the issue observed on Friday afternoon / evening:
- A significant and unexpected memory leak was observed on core equipment at our Telehouse West (THW) PoP.
- It was determined that the best course of action was to carry out a controlled reload out of hours.
- We began slowly culling broadband sessions terminating at THW and steering them to other PoPs in preparation.
- A short time later, memory on the THW core was exhausted; the BGP process terminated, which resulted in all broadband sessions on LNSs at the PoP disconnecting.
- All broadband circuits that were operating via THW were automatically steered to other PoPs in our network.
- At this point we had no choice but to carry out an emergency reload of the core.
- Leased lines operating from THW were impacted throughout.
- The reload of the core took 30 minutes to complete; however, a secondary issue was identified with the hardware of one of the switches.
- Half of the leased lines were restored, while on-site hands moved the affected NNIs from the failed switch to the other, which involved configuration changes.
- Circuits were impacted for between 1 hour and, at worst, 4 hours; the majority of circuits were back up around the 1 to 2 hour point.
- We do not plan to move the NNIs again, to ensure that there is no further disruption.
- Owing to fulfilment issues, the replacement hardware is now expected to arrive today, but to avoid any further risk, installation will be postponed until the New Year.
- We have raised the memory leak issue with Cisco TAC.
We apologise for the disruption this caused.
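For context on the memory leak described above, the short Python sketch below illustrates the general approach of polling a Cisco device's processor memory pool over SNMP and warning as it nears exhaustion. It is purely illustrative and is not our monitoring tooling: the hostname, community string, polling interval and threshold are placeholders, and it assumes the standard CISCO-MEMORY-POOL-MIB OIDs and the net-snmp snmpget utility are available.

#!/usr/bin/env python3
"""Illustrative sketch only: poll a router's processor memory pool over
SNMP and warn when usage trends towards exhaustion. All names and values
below are placeholders, not production configuration."""

import subprocess
import time

HOST = "core-router.example.net"    # placeholder hostname
COMMUNITY = "public"                # placeholder SNMPv2c community string
# CISCO-MEMORY-POOL-MIB, processor pool (index 1 is typical): used / free bytes
OID_USED = "1.3.6.1.4.1.9.9.48.1.1.1.5.1"
OID_FREE = "1.3.6.1.4.1.9.9.48.1.1.1.6.1"
ALERT_THRESHOLD = 0.90              # warn once more than 90% of the pool is used


def snmp_get(oid: str) -> int:
    """Fetch a single integer value with net-snmp's snmpget (-Oqv prints the value only)."""
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST, oid],
        text=True,
    )
    return int(out.strip())


def main() -> None:
    while True:
        used = snmp_get(OID_USED)
        free = snmp_get(OID_FREE)
        ratio = used / (used + free)
        print(f"processor pool: {used} used / {free} free ({ratio:.1%})")
        if ratio > ALERT_THRESHOLD:
            print("WARNING: memory pool approaching exhaustion")
        time.sleep(300)             # poll every five minutes


if __name__ == "__main__":
    main()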
Update [19/12/2022 08:12]
This morning engineers will be at Telehouse West to fit replacement hardware. This is expected to be non-service-affecting. The impacted leased line NNIs we moved from the failed switch will remain where they are, with no further disruption planned.
Update [16/12/2022 21:25]
All affected circuits are now up, and we expect them to remain stable.
We apologise for the disruption caused this afternoon / evening; it most certainly isn't what anyone wanted to see. The issue was twofold: it began with the need to complete an emergency reload of the core at Telehouse West, which was then followed by a hardware failure on one of the switches.
We are working with the hardware vendor to establish the root cause of the initial disruption and to source replacement hardware. We will put plans in place for scheduling these changes, and will provide a more detailed RFO (reason for outage) via the status feed early next week.
Update [16/12/2022 20:12]
The majority of NNIs are now restored and we can see circuits coming back up. Two NNIs are still causing issues, which we are working to address.
Update [16/12/2022 19:10]
We are dealing with a hardware failure and are actively moving the affected NNIs across to an alternative switch. We will provide further updates as soon as we can.
Circuits not affected by the hardware failure are now online and should remain stable.
Apologies for the continued disruption this evening.
Update [16/12/2022 18:32]
One of our core devices reached a state of memory exhaustion; as a result, we had no choice but to complete an emergency reload. This has now taken place; however, on its return, the second switch in the chassis entered a looping state. We're sending engineers to Telehouse West to investigate.
Investigating [16/12/2022 17:11]
We are having to complete an emergency reload of our Telehouse West core. This will impact leased line circuits connected to this PoP. We do not take this decision lightly, but it has to be actioned immediately. Further updates to follow.