BGP Explained: the protocol that may be behind Facebook’s disappearance

On Monday, Facebook was completely knocked offline, taking Instagram and WhatsApp (not to mention a few other websites) down with it. Many have been quick to say that the incident had to do with BGP, or Border Gateway Protocol, citing sources from inside Facebook, traffic analysis, and the gut instinct that “it’s always DNS or BGP.” Facebook is on its way back up, but this all raises the question:

What is BGP?

At a very basic level, BGP is one of the systems the internet uses to get your traffic where it needs to go as quickly as possible. Because there are tons of different internet service providers, backbone routers, and servers responsible for your data making it to, say, Facebook, there are a ton of different routes your packets could end up taking. BGP’s job is to show them the way and make sure they take the best one.

I’ve heard BGP described as a system of post offices, an air traffic controller, and more, but I think my favorite explanation was one that likened it to a map. Imagine BGP as a bunch of people making and updating maps that show you how to get to YouTube or Facebook.

When it comes to BGP, the internet is broken up into big networks known as autonomous systems. You can sort of imagine them as island nations — they’re networks controlled by a single entity, which could be an ISP like Comcast, a company like Facebook, or some other big organization like a government or major university. It would be extremely difficult to build bridges connecting every island to all the others, so BGP is responsible for telling you which islands (or autonomous systems) you have to go through to get to your destination.

Since the internet is always changing, the maps need to be updated — you don’t want your ISP to lead you down an old road that no longer goes to Google. Because it’d be a massive undertaking to map the entire internet all the time, autonomous systems share their maps. They’ll occasionally talk to their island neighbors to see and copy any updates they’ve made to their maps.
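
If you want to picture that map-copying in code, here’s a toy Python sketch. The island names and the route format are made up, and real BGP neighbors exchange structured UPDATE messages over a TCP session (with plenty of policy applied before anything is accepted), but the gist is the same: copy your neighbor’s routes, remembering that you now reach those places through them.

```
# Toy model of "copying a neighbor's map." All names and routes are invented.
def learn_from_neighbor(my_map, neighbor, neighbor_map):
    for destination, path in neighbor_map.items():
        learned = [neighbor] + path  # we'd reach the destination through the neighbor
        known = my_map.get(destination)
        # Keep the learned route only if it's new or shorter than what we already had.
        if known is None or len(learned) < len(known):
            my_map[destination] = learned
    return my_map

my_map = {"IslandC": ["IslandB", "IslandE", "IslandC"]}
neighbor_map = {"IslandC": ["IslandC"], "IslandD": ["IslandD"]}

print(learn_from_neighbor(my_map, "IslandA", neighbor_map))
# {'IslandC': ['IslandA', 'IslandC'], 'IslandD': ['IslandA', 'IslandD']}
```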

Using maps as a framework, it’s easy to imagine how things can go wrong. Back when consumers first got access to GPS, there were always jokes about it having you drive off a cliff or into the middle of the desert. The same thing can happen with BGP — if someone makes a mistake, it can end up leading traffic somewhere it’s not supposed to go, which will cause problems. If it isn’t caught, that mistake will end up on everyone’s map. There are other ways this can go wrong, but we’ll get to those in a bit.

Yeah, yeah, maps. Give me an example.

Of course! This is massively simplified, but imagine you want to connect to an imaginary tech news website called Convergence. Convergence uses the ISP NetSend, and you use DecadeConnect. In this example, DecadeConnect and NetSend can’t talk directly to each other, but your ISP can talk to Border Communications, which can talk to Form, which can talk to NetSend. If that’s the only route, then BGP would make sure that you and Convergence could communicate through it. But if, alternatively, both DecadeConnect and NetSend were connected to ThirdLevel, BGP would likely choose to route your traffic through it, since that’s a shorter hop.
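
Here’s that scenario as a rough Python sketch. The provider names come straight from the made-up example above, and a plain breadth-first search stands in for BGP’s actual decision process (a path-vector protocol with far more policy involved), but it shows why the ThirdLevel route wins on hop count:

```
from collections import deque

# The made-up providers from the example above, as a graph of who peers with whom.
peers = {
    "DecadeConnect": ["BorderCommunications", "ThirdLevel"],
    "BorderCommunications": ["DecadeConnect", "Form"],
    "Form": ["BorderCommunications", "NetSend"],
    "NetSend": ["Form", "ThirdLevel"],
    "ThirdLevel": ["DecadeConnect", "NetSend"],
}

def shortest_as_path(start, goal):
    """Breadth-first search: a stand-in for BGP preferring shorter AS paths."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in peers[path[-1]]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(shortest_as_path("DecadeConnect", "NetSend"))
# ['DecadeConnect', 'ThirdLevel', 'NetSend'] -- the shorter hop through ThirdLevel
```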

Okay, so BGP is like maps that detail all the fastest ways from you to a website?

Right! Unfortunately, it can get even more complicated, because the shortest path doesn’t always equal the best one. There are plenty of reasons why a routing algorithm would choose one path over another — cost can be a factor as well, since some networks charge others to carry their traffic.
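
As a sketch of how that plays out: real BGP implementations compare attributes like local preference before they even look at AS-path length, so a longer but contractually preferred (or cheaper) route can beat a shorter one. The numbers and names below are invented for illustration:

```
# Invented candidate routes to the same destination. In BGP's decision process,
# a higher local preference wins before AS-path length is even considered.
candidate_routes = [
    {"via": "ThirdLevel", "as_path_len": 2, "local_pref": 100},
    {"via": "BorderCommunications", "as_path_len": 3, "local_pref": 200},  # preferred partner
]

best = max(candidate_routes, key=lambda r: (r["local_pref"], -r["as_path_len"]))
print(best["via"])  # 'BorderCommunications', despite the longer path
```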

Also, maps are super tricky! I discovered this just recently trying to plan a trip where roads existed on one map and not another or were different between maps. One road even had three different names across three maps. If it’s that hard to pin down for a “town” that has all of five roads, imagine what it’s like trying to connect the entire internet together. Real roads don’t change that often, but websites can move from one country to another or change, add, or subtract service providers, and the internet just has to deal with it.

I remember something like this from my algorithms and data structures class — trying to build algos to find the shortest route.

I’ll take your word on that. I dropped out as soon as I heard about graphs.

But Facebook didn’t! In fact, it’s built its own BGP system, which lets it do “fast incremental updates,” according to a paper presented earlier this year. That said, the system the company describes there is meant for communication within data centers — at this point, it’s hard to say what caused Facebook’s problems on Monday, and it’d take someone smarter than me to say whether Facebook’s data center communications could cause this kind of issue. Cybersecurity reporter Brian Krebs claims that the outage was caused by a “routine BGP update.”

In Facebook’s engineering update, the company said the issue was caused by “configuration changes on the backbone routers that coordinate network traffic between our data centers.” That then led to a “cascading effect on the way [Facebook’s] data centers communicate, bringing [its] services to a halt.” At least to my eye, it reads like the problem was Facebook communicating within itself, not with the outside world (though that can obviously cause a worldwide outage, given how much of its own network stack Facebook controls).

What does DNS have to do with all this?

To borrow an explanation from Cloudflare: DNS tells you where you’re going, and BGP tells you how to get there. DNS is how computers know what IP address a website or other resource can be found at, but that knowledge itself isn’t helpful — if you ask your friend where their house is, you’re still probably going to need GPS to get you there.
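
You can see the “where” half of that split from any ordinary program: a DNS lookup hands back an IP address and nothing else. The “how,” meaning which routers and networks your packets cross to reach that address, is decided out on the network by BGP, invisibly to your code. A tiny Python sketch (using example.com rather than any particular site):

```
import socket

# DNS answers "where": it turns a name into an IP address.
address = socket.gethostbyname("example.com")
print(address)

# BGP answers "how": the routers between you and that address pick the path.
# Nothing in this program, and nothing in DNS, controls that part.
```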

Cloudflare also has a great technical rundown of how BGP errors can also mess up DNS requests — the article is specifically about Monday’s Facebook incident, so it’s worth a read if you’re looking for an explanation of what it looked like from an autonomous system’s perspective.

What can go wrong with BGP?

Many things. According to Cloudflare, two notable incidents include a Turkish ISP accidentally telling the entire internet to route all of its traffic through the Turkish ISP’s network in 2004, and a Pakistani ISP accidentally knocking YouTube offline worldwide in 2008 after trying to block it only for its own users. Because BGP updates spread from autonomous system to autonomous system (which, as a reminder, is one of the things that makes the protocol so darn useful), one group’s mistake can cascade across the internet.

One group getting owned can also cause problems — in 2018, hackers were able to hijack requests to Amazon’s DNS and steal thousands of dollars in Ethereum by compromising a separate ISP’s BGP servers. Amazon wasn’t the one hacked, but traffic meant for it ended up somewhere else.
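
Part of why that kind of hijack works is that routers prefer the most specific route they know about, so announcing a smaller, more specific block of addresses lets an attacker “win” over the legitimate owner. Here’s a minimal sketch of that longest-prefix-match behavior, using documentation address ranges rather than anything from the real 2018 incident:

```
import ipaddress

destination = ipaddress.ip_address("198.51.100.25")

# Two competing announcements covering that address (both ranges are invented).
announcements = {
    "198.51.100.0/24": "legitimate network",
    "198.51.100.0/25": "hijacker's more specific announcement",
}

# Routers use longest-prefix match: the most specific route containing the
# destination wins, regardless of who announced it.
matches = [
    (ipaddress.ip_network(prefix), owner)
    for prefix, owner in announcements.items()
    if destination in ipaddress.ip_network(prefix)
]
best = max(matches, key=lambda m: m[0].prefixlen)
print(best[1])  # "hijacker's more specific announcement"
```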

Or, you can mess it up and delete your entire service off the internet with a bad BGP update. BGP is lovingly called the duct tape of the internet, but no adhesive is perfect.

So what happened to Facebook?

It seems like Facebook’s servers, for some reason, told everyone to take them off their maps. Facebook has issued an initial report, but it’s light on details — it’s possible Facebook plans on releasing a more in-depth explanation of why the changes were made later on, but this may also be the last we hear about it (at least officially).

However, Cloudflare’s CTO reports that the service saw a ton of BGP updates from Facebook (most of which were route withdrawals, or erasing lines on the map leading to Facebook) right before it went dark. One of Fastly’s tech leads tweeted that Facebook stopped providing routes to Fastly when it went offline, and KrebsOnSecurity backs up the idea that it was some update to Facebook’s BGP that knocked out its services.
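
To picture what those withdrawals look like from another network’s point of view, here’s a toy sketch (the table entries are invented). Once the withdrawals are processed, there’s simply no line left on the map that says how to reach Facebook:

```
# A toy routing table from some other network's perspective. Entries are invented;
# real tables map IP prefixes to paths, not names.
routing_table = {
    "facebook.com prefixes": ["SomeTransitProvider", "Facebook"],
    "example.org prefixes": ["AnotherProvider"],
}

# A BGP withdrawal says "forget the route I previously told you about."
withdrawals = ["facebook.com prefixes"]

for destination in withdrawals:
    routing_table.pop(destination, None)  # erase that line on the map

print(routing_table)
# Only 'example.org prefixes' remains; there's no path left to Facebook.
```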

I’d recommend Cloudflare’s explanation if you want nitty-gritty technical details.

If BGP was the problem, how does Facebook fix it?

Given that the outage went on for hours, the answer seems to be “not easily.” Facebook needed to make sure that it was advertising the correct records and that those records were picked up by the internet at large. In other words, it needed to make sure its maps were right and that everyone could see them.

That’s easier said than done, though. There were reports of Facebook employees being locked out of badge-protected doors and of employees struggling to communicate. In situations like these, you not only have to figure out who has the knowledge and the permissions to solve the problem, but also how to connect those people. And when your entire company is dead in the water, that’s no easy task — The Verge received reports of engineers being physically sent to a Facebook data center in California to try to fix the problem.

Would Web3 solve this problem?

Stop it. I will cry.

But to quickly answer the question, probably not — even if Facebook hopped on the decentralized train, there’d still have to be some protocol telling you where to find its resources. We’ve seen that it’s possible to misconfigure or mess up blockchain contracts before, so I’d be a bit suspicious of anyone who said that a contract and blockchain-based internet would be immune to this kind of issue.

Sure was fishy timing on that outage given all the bad Facebook news, huh?

Right, so obviously, the fact that this all happened while a whistleblower was going on TV and airing out Facebook’s dirty laundry makes it really easy to come up with alternative explanations. But it’s just as possible that this is an innocent mistake that some (very, very unfortunate) person on Facebook’s IT staff made.

For what it’s worth, that’s Facebook’s explanation. It lays the blame on a “faulty configuration change” that it made, not any devious hacks.

Update October 4th, 10:44PM ET: Updated with information from Facebook’s official engineering post.

