Following Monday’s massive outage that took down all of its services, Facebook has published a blog post detailing what happened. According to Santosh Janardhan, the company’s vice president of infrastructure, the outage began with what should have been routine maintenance. At some point yesterday, a command was issued that was meant to assess the availability of the backbone network connecting all of Facebook’s computing facilities. Instead, the command unintentionally took down those connections. Janardhan said a bug in the company’s internal audit tool failed to stop the command from executing.
That issue caused a secondary problem that ultimately turned yesterday’s outage into the international incident it became. When Facebook’s DNS servers could not connect to the company’s primary data centers, they stopped advertising the Border Gateway Protocol (BGP) routing information that the rest of the internet needs in order to reach those servers.
“The end result was that our DNS servers became unreachable even though they were still operational,” said Janardhan. “This made it impossible for the rest of the internet to find our servers.”
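The failure mode Janardhan describes can be illustrated with a minimal sketch. It assumes a simple model in which each DNS location announces a route only while it can reach the data centers. The `DnsNode` class, the node names, and the `health_check` method are hypothetical and are not taken from Facebook’s software.

```python
from dataclasses import dataclass


@dataclass
class DnsNode:
    """Hypothetical model of one DNS location; not Facebook's actual software."""
    name: str
    advertising: bool = True  # whether this node's BGP route is currently announced

    def health_check(self, data_center_reachable: bool) -> None:
        # Losing the connection to the data centers withdraws the route;
        # regaining it announces the route again.
        self.advertising = data_center_reachable


def reachable_nodes(nodes):
    """Return the names of nodes the rest of the internet can still route to."""
    return [node.name for node in nodes if node.advertising]


nodes = [DnsNode("edge-a"), DnsNode("edge-b"), DnsNode("edge-c")]

# Isolated fault: one location loses its backbone link and steps aside.
nodes[0].health_check(data_center_reachable=False)
print(reachable_nodes(nodes))  # ['edge-b', 'edge-c']

# Backbone-wide fault: every node fails the same check at the same time,
# so every route is withdrawn even though the DNS processes keep running.
for node in nodes:
    node.health_check(data_center_reachable=False)
print(reachable_nodes(nodes))  # []
```

In this toy model the check is sensible when a single location fails, since traffic shifts to the healthy ones. When the shared backbone fails, the same rule removes every location at once.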
As we learned partway through yesterday, what made an already difficult situation worse was that the outage made it impossible for Facebook’s engineers to connect to the servers they needed to fix. In addition, the loss of DNS meant they could not use many of the internal tools they rely on to investigate and resolve network problems under normal circumstances. That meant the company had to physically send personnel to its data centers, a task complicated by the physical safeguards in place at those locations.
“They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them,” according to Janardhan. Once it had restored the backbone network, Facebook was careful not to turn everything back on at once, because the resulting surge in power and computing demand could have caused more crashes.
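The idea of bringing services back gradually can be sketched as a simple batching routine. This is an illustration under stated assumptions, not a description of Facebook’s procedure. The `staged_restore` function, the service names, the load figures, and the per-step capacity are all hypothetical.

```python
def staged_restore(services, capacity_per_step):
    """Group (name, load) pairs, in order, into batches whose total load fits the cap."""
    batches, current, current_load = [], [], 0
    for name, load in services:
        if load > capacity_per_step:
            raise ValueError(f"{name} alone exceeds the per-step capacity")
        # Start a new batch when adding this service would exceed the cap.
        if current and current_load + load > capacity_per_step:
            batches.append(current)
            current, current_load = [], 0
        current.append(name)
        current_load += load
    if current:
        batches.append(current)
    return batches


# Hypothetical services with made-up relative load units.
services = [("service-a", 2), ("service-b", 3), ("service-c", 6),
            ("service-d", 5), ("service-e", 4)]

for step, batch in enumerate(staged_restore(services, capacity_per_step=10), start=1):
    print(step, batch)
# 1 ['service-a', 'service-b']
# 2 ['service-c']
# 3 ['service-d', 'service-e']
```

Each step adds only as much demand as the assumed cap allows, which is the point of not switching everything on together.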
“Every failure like this is an opportunity to learn and get better, and there’s plenty for us to learn from this one,” said Janardhan. “After every issue, small and large, we do an extensive review process to understand how we can make our systems more resilient. That process is already underway.”