According to M1, the issue originated with two hardware boxes, called mobile site switches, which caused intermittent connection problems between M1's switches and customer databases. The cause of this is still being investigated by their vendor.
The connection problems then affected major network processes, leading to more and more errors that tied up resources to the point where customers began experiencing problems with calls, data and messaging as reported.
As an analogy, think of a helpful neighbour coming to tell you that your office signboard has cracked, then another neighbour dropping by to tell you the same thing, and then a third, to the point that your office is so crowded with neighbours that you can't even leave it to attend any of the client meetings for the day.
According to M1, the switch problem was actually identified and stabilised quickly, but the resulting congestion took time to clear because of the complexity of their Advanced Telecommunications Computing Architecture, which was implemented to offer a better customer experience.
To take the analogy further, the only way out of your crowded office, when you can't see the door or tell how many neighbours are inside, is to take time to figure out who is nearest the exit and how best to get everyone to leave. You could call a friend to come to the entrance and ask the neighbour nearest the door to leave, creating some space so that the people who are fairly near the door can leave, and so on. Depending on the number of people in your office and the size of your office, it could take some time before you have cleared a pathway to the door.
“We are continually upgrading our network to deliver better customer experience and improve network resiliency. With a more sophisticated network, it has inevitably increased the level of complexity in troubleshooting,” explained Karen Kooi, Chief Executive Officer, M1, who also apologised to customers and outlined three measures in the aftermath of the breakdown.
“While the root cause for the mobile site switch instability is being investigated, we are reconfiguring our site switches to eliminate congestion that may arise due to unstable intermittent connections. We will also be deploying a software enhancement that will enable us to better manage sudden and unexpected extreme traffic conditions. In addition, an independent expert will be appointed to review our network architecture and connectivity to further enhance network performance.”
Toh Soon Seah, CEO and Founder, NetGain Systems, which provides real-time IT infrastructure management solutions, said that prevention is better than cure in cases such as these.
"IT monitoring solutions would have alerted the IT team of these intermittent connections between the switches and customer databases as the performance threshold limits would have been reached and triggered off the alerts. Hundred percent uptime* can be achieved if IT teams take pro-active and preventive steps to react to the possible causes of IT failure before it becomes a major outage issue," he said.
Toh Soon Seah, CEO, NetGain Systems. Source: NetGain Systems.
"IT outages can cause businesses to lose revenue, lose reputation among customers and even result in compliance penalties. For example, M1 was down in 2012 for three days and was subsequently fined S$1.5 million by IDA Singapore.
"In today’s world of social media, when customers experience outages, they go to social networks to show their displeasure and this could impact the reputation of the business almost at the speed of hitting a share button," Toh warned.
*Mobile customer base as reported in financial statements for the year 2013.
*Downtime or outage refers to the period when there has been a breakdown in service delivery. Uptime in contrast refers to the period when service delivery is normal.