Facebook Outage
As this CNBC article reports, and as many people experienced firsthand, Facebook had major outages today. More information is here: https://www.cnbc.com/2019/03/13/facebook-suffers-outage-related-to-core-whatsapp-and-instagram.html I'm going to post some thoughts and things I've noticed. Many of these guesses may well be wrong, but it's interesting to speculate:
Here are a few interesting things I noticed as services came back. The Messenger web portal was not working, but the problem seemed to be in the login system rather than in Messenger itself. I also noticed that certain wall posts were failing. I saw this from Chicago, which at the time showed up as a major red dot on https://downdetector.com/status/facebook/map/. That has changed now. The failing wall posts are interesting: there was likely a problem reconfiguring the database servers that handle writes, and my guess is that Facebook fell back to read-only mode wherever possible. Messenger also seemed to be unaffected as time went on. My guess is that Messenger is less complicated, since it's private conversations per user rather than posts that any user can read, so its infrastructure is likely simpler. However, I did hear that WhatsApp was down along with Instagram.
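To make the read-only guess above concrete, here is a minimal Python sketch of how a service might keep serving reads while refusing writes once its write path looks unhealthy. This is purely my own illustration, not Facebook's actual design: the class `DegradableStore`, the `primary_healthy` flag, and the in-memory dictionary standing in for read replicas are all hypothetical.

```python
class WriteUnavailable(Exception):
    """Raised when a write is refused because the store is degraded."""


class DegradableStore:
    """Toy store that falls back to read-only mode (hypothetical names throughout)."""

    def __init__(self):
        self.primary_healthy = True  # would be set by a health check in a real system
        self.read_only = False
        self._replica = {}           # stands in for read replicas or caches

    def write_post(self, post_id, text):
        if self.read_only:
            # Already degraded: fail fast instead of hammering the write path.
            raise WriteUnavailable("read-only mode: try again later")
        if not self.primary_healthy:
            # The first failed write trips the switch into read-only mode.
            self.read_only = True
            raise WriteUnavailable("write path unavailable: switching to read-only")
        self._replica[post_id] = text

    def read_post(self, post_id):
        # Reads never depend on the write path, so they keep working while degraded.
        return self._replica.get(post_id)

    def recover(self):
        # Called once the write path has been fixed.
        self.primary_healthy = True
        self.read_only = False


store = DegradableStore()
store.write_post(1, "hello")
store.primary_healthy = False        # simulate the write servers going away
try:
    store.write_post(2, "this post fails")
except WriteUnavailable as err:
    print("write refused:", err)
print("old post still readable:", store.read_post(1))
```

From the outside this would look a lot like what I saw: existing content loads fine, while new wall posts fail until someone fixes the write path and calls something like `recover()`.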
The article says this was not a DDoS attack, and I completely believe that. All of these issues look like misconfigurations, accidentally shut-down servers, network outages, or some combination of those problems. The WhatsApp and Instagram outages also suggest a core infrastructure problem: possibly someone accidentally shut down some servers, as happened with AWS (https://www.theregister.co.uk/2017/03/01/aws_s3_outage/), or someone pushed a flawed change to the network infrastructure that disrupted routes between datacenters and/or individual servers. In a giant computing infrastructure each component needs to interact precisely with the others, so pushing changes can be difficult. One interesting observation, though, is how quickly things appeared to revert to read-only. Another is that such a large system could be restored fairly quickly. I still remember 2007, when Facebook would periodically schedule hour-long planned downtimes. Even counting this outage, they have gained a lot by implementing infrastructure automation.