Why Does The Internet Keep Going Down Every Once In A While?
Dhir Acharya
In recent months, Facebook, as well as other Internet platforms, has gone down repeatedly. And there's a reason behind these annoying incidents.
In recent months, we have witnessed a number of Internet outages, including ones that hit giant social networks like Facebook, WhatsApp, and Instagram. During those incidents, the platforms were not offline, but they were unable to load anything in their interfaces.
There are a number of reasons behind these outages, which can originate at web-hosting providers, CDN systems, or even the Internet’s backbone. Now, let’s dig a little into why Internet services and websites go down. The most common cause is a failure at a CDN, and Cloudflare is one of those providers.
What is Cloudflare?
Cloudflare is a US company that provides services for websites, including distributed domain name servers, online security, DDoS protection, and content delivery networks. Essentially, it stands between Internet users and the hosting providers of firms like Facebook.
Cloudflare reported that as of 2017, it had around 12 million customer websites, with around 20,000 more joining each day. So if Cloudflare goes down, pretty much everyone goes down too.
What caused the Facebook outage earlier this month?
Well, Cloudflare went down. It was not even due to some kind of attack; the company just messed up. It decided to update its WAF (Web Application Firewall), which protects websites. The company often does this in a test mode rather than in an isolated test environment, which normally shouldn’t cause any problems.
However, one of the new rules had a glitch that caused CPU usage to spike to 100%. On top of that, the company deployed the test to the entire world rather than to a handful of users as usual. As a result, CPU usage spiked on every machine across the globe, hence the “502 Bad Gateway” error that users saw on websites.
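To illustrate why deploying to "a handful of users" first matters, here is a minimal sketch of a staged rollout in Python. It is not Cloudflare's actual deployment system: the stage sizes, the 80% CPU limit, the machine names, and the `apply_and_measure` callback are all assumptions made up for the example.

```python
from typing import Callable, List

CPU_LIMIT = 0.80  # hypothetical threshold: abort once CPU usage exceeds 80%


def staged_rollout(machines: List[str],
                   stages: List[float],
                   apply_and_measure: Callable[[str], float]) -> List[str]:
    """Apply a change to growing fractions of the fleet.

    Stops at the first machine whose CPU usage exceeds CPU_LIMIT and
    returns the machines that received the change up to that point.
    """
    updated: List[str] = []
    for fraction in stages:
        # Number of machines that should be updated by the end of this stage.
        target = max(1, int(len(machines) * fraction))
        for machine in machines[len(updated):target]:
            cpu = apply_and_measure(machine)
            updated.append(machine)
            if cpu > CPU_LIMIT:
                # Stop here: the rest of the fleet never sees the bad rule.
                return updated
    return updated


def bad_rule(machine: str) -> float:
    """Hypothetical stand-in for a faulty rule: CPU always hits 100%."""
    return 1.0


fleet = [f"edge-{i}" for i in range(1000)]  # made-up machine names
touched = staged_rollout(fleet, stages=[0.01, 0.10, 1.0],
                         apply_and_measure=bad_rule)
print(len(touched))  # 1
```

In this toy run the faulty rule is caught on the very first machine, so 999 of the 1,000 machines never receive it. Pushing the same rule to all machines at once, as described above, gives the check no chance to stop anything.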
It took Cloudflare 20 minutes just to figure out what had gone wrong, followed by another 30 minutes or more to roll back the update. And even hours later, people still couldn’t load websites.
But there’s a bigger issue
This outage was not the first of its kind, and surely will not be the last because of BGP – the Border Gateway Protocol.
Since the Internet isn’t a centralized system, there must be a way for computers to reach websites and services across the world. Simply put, data needs to flow without being controlled by any single entity, and BGP is what makes that happen. So when BGP messes up, it can affect just about everyone.
Earlier this month, Verizon accidentally mishandled the protocol and took down a large part of the Internet. By accident, it made a small company the preferred path for a lot of Internet routes. To make it easier to understand, imagine Uber telling all of its drivers that the best route to every destination runs through one narrow market lane (a gully): the result is that no one can get anywhere.
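One common way a wrong route becomes "preferred" is by being more specific than the legitimate one, because routers pick the most specific matching prefix. The article does not say which mechanism was at work in this incident, so treat the following Python sketch purely as an illustration of that general rule. The network names are made up, and the addresses come from a range reserved for documentation.

```python
import ipaddress


def best_route(destination, routes):
    """Return the next hop of the matching route with the longest prefix."""
    addr = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), next_hop)
        for prefix, next_hop in routes
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    # The most specific route (largest prefix length) wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# Legitimate route, announced by a hypothetical large provider.
routes = [("203.0.113.0/24", "big-provider")]
print(best_route("203.0.113.10", routes))  # big-provider

# A leaked, more specific route from a hypothetical small company.
routes.append(("203.0.113.0/25", "small-company"))
print(best_route("203.0.113.10", routes))  # small-company
```

Nothing about the legitimate route changed, yet traffic for that address now heads to the small network, which is the "everyone drives through the same lane" effect from the analogy.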
In a similar incident in November 2018, Google suffered a major outage related to a BGP reroute. While this incident was not officially a malicious hijack, China and Russia came under suspicion because Russia’s Transtelecom and China Telecom were the first to accept the wrong route, sending a huge amount of Google-related traffic through their networks.
And in 2014, BGP was truly hijacked when a hacker rerouted the traffic on 52 networks of 19 ISPs. Essentially, he redirected cryptocurrency miners to a mining pool under his control and collected the profits they were supposed to earn.
Now, what is BGP?
IBM’s Yakov Rekhter and Cisco’s Kirk Lougheed initially conceived BGP in 1989 during lunch at an Internet engineering conference. And since 1994, when the current version (BGP-4) was introduced, the protocol has mostly remained the same.
The protocol works like a map that lets computers move data around the Internet. The networks that make up the Internet run on industrial-grade nodes at ISPs, each of which controls a set of routes and IP addresses. For traffic to flow, they have to tell the rest of the world about these routes.
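The idea of networks telling each other about routes can be sketched in a few lines of Python. This is a deliberately simplified model, not real BGP: the network names (AS1 to AS5) and their links are hypothetical, and real routers apply business policies rather than simply keeping the shortest path.

```python
from collections import deque


def propagate(links, origin):
    """Flood one route announcement outward from `origin`.

    Each network records the chain of networks the announcement passed
    through, listed from itself back to the origin.
    """
    paths = {origin: [origin]}
    queue = deque([origin])
    while queue:
        current = queue.popleft()
        for neighbor in links.get(current, []):
            if neighbor not in paths:
                # Breadth-first order: the first path learned is a shortest one.
                paths[neighbor] = [neighbor] + paths[current]
                queue.append(neighbor)
    return paths


# Hypothetical networks and the neighbors each one is connected to.
links = {
    "AS1": ["AS2", "AS3"],
    "AS2": ["AS1", "AS4"],
    "AS3": ["AS1", "AS4"],
    "AS4": ["AS2", "AS3", "AS5"],
    "AS5": ["AS4"],
}

paths = propagate(links, "AS1")
print(paths["AS5"])  # ['AS5', 'AS4', 'AS2', 'AS1']
```

AS5 never talks to AS1 directly. It only knows how to reach it because AS4 passed the announcement along, which is why every network depends on its neighbors reporting routes honestly and correctly.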
Why does BGP suck?
One problem with this three-decade-old protocol is that it relies on trust. Rekhter and Lougheed did not design BGP to independently verify the routes that individual networks claim, and it has nothing like the shared passphrases used in encryption. Therefore, the protocol cannot tell when a network announces bad routes, whether by accident or because it was hijacked.
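The difference between trusting every announcement and checking it can be shown with a small Python sketch. The registry below is a hypothetical stand-in for a list of who is allowed to announce which addresses. It is not part of BGP as described here, and the prefix and network numbers are reserved documentation values chosen for illustration.

```python
import ipaddress

# Hypothetical registry: which network may originate which prefix.
AUTHORIZED = {
    "198.51.100.0/24": "AS64500",
}


def accept_blindly(prefix, origin):
    """Trust-based behavior: every announcement is believed."""
    return True


def accept_if_authorized(prefix, origin, registry=AUTHORIZED):
    """Accept only if the registry lists `origin` for a covering prefix."""
    announced = ipaddress.ip_network(prefix)
    for registered, owner in registry.items():
        if announced.subnet_of(ipaddress.ip_network(registered)):
            return owner == origin
    return False  # in this sketch, unknown prefixes are rejected


# A network that does not own the prefix announces it anyway.
print(accept_blindly("198.51.100.0/24", "AS64511"))        # True
print(accept_if_authorized("198.51.100.0/24", "AS64511"))  # False
print(accept_if_authorized("198.51.100.0/24", "AS64500"))  # True
```

With the blind check, the bogus announcement is accepted whether it was a typo or an attack. The second check rejects it, but only for networks that actually perform the lookup.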
What else can knock out Internet access?
Companies can mess up the Internet on their own, too. In June, Google Cloud’s us-east1 region went offline just because a maintenance event physically damaged some of the fiber bundles linking its cloud servers. To work around this, the tech giant rerouted some of its cloud traffic, which created gridlock and increased latency not only for users but also for all the sites running on the platform, including Snapchat, Shopify, and YouTube.
The problem is concerning as Google is among the world’s largest hosting providers. When such a major cloud hosting service fails, it drags down all clients with it.
We need to try harder
Each year, thousands of BGP routing incidents happen, most of which are accidental and have minor effects. However, some of them are malicious and disruptive. And while governments have been aware of this for years, little progress has been made in addressing such a major national security issue.
After all, BGP makes it easy to reroute the traffic of an entire country through another one, or even to knock out its Internet access completely.
But there’s hope. Amid concerns over hackers targeting BGP, a group of network operators has been working with the Internet Society since 2014 to codify and promote “BGP best practices.” In addition, an international committee of UK and US government officials and Internet experts has been working on a defense framework to protect BGP from hijacks; its research was published in 2018.
The remaining problem is that no matter how effective the framework is, it’s challenging to have every ISP implement it. And it only takes one weak link in the chain to break down the entire Internet.