Let BGP Convoy Your Data Home.
Clued-in businesses these days wouldn't think of running their critical computer systems without an uninterruptible power supply (UPS). Yet, until recently, many of those same businesses blithely connected to the Internet through a single link to a single provider, leaving them subject to sudden disconnection from the global network and their customers, partners, and remote employees. And even when that link was working, they may not have been getting the best possible connection to the backbone and the rest of the Internet beyond, depending on the routing expertise and connectedness of their ISP.
The good news is that an increasing number of sites have discovered multi-homing (connecting to more than one ISP) and "intelligent" routing services (ISPs or backbones that promise the best route for your traffic). These make their Internet connection more reliable and give them assurance that their data is taking the best possible route (or at least a better route than most) through the network to its destination, whether next door or around the world. The bad news is that the proliferation of multi-homed sites and the increasing stress of multiplying routing policies threaten to bring the Internet to its knees.
Forget the much-ballyhooed IPv4 address-exhaustion problem of a few years back; Classless Inter-Domain Routing (CIDR) and Network Address Translation (NAT) took care of that. Recent estimates by Geoff Huston of Telstra put the exhaustion of the IPv4 address space some 19 years in the future at current rates of growth. According to him, it's now a shortage of routing resources, such as AS numbers and the ability of the core routers to handle increasingly large routing tables, that will eventually crash the Internet if something isn't done.
But multi-homing and intelligent routing are too useful to abandon. The Internet infrastructure will have to adapt; several IETF efforts are already under way. The main goal of this article is to explain at a very high level how Internet routing works (Part One), and the tools available to integrators for dealing with and profiting from Internet routing deficiencies (Part Two).
BGP And Internet Routing
The fundamental unit of Internet routing is not the router but the Autonomous System (AS), a group of routers controlled by the same routing authority--this may be a backbone, an ISP, or a business.
The border routers of each Autonomous System (AS) use the Border Gateway Protocol (BGP) to notify their peers in other ASs to which they connect about the route they use to reach a given CIDR prefix; that is, an address with a network mask that indicates a group of sequential IP addresses. For instance, in the figure, which illustrates how an Internet route announcement works, the prefix 18.104.22.168/24 indicates the addresses from 22.214.171.124 to 126.96.36.199.
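As a minimal sketch of what a /24 prefix covers, the snippet below expands one with Python's standard ipaddress module. The prefix 192.0.2.0/24 is a reserved documentation block used here as a stand-in; it is an assumption for illustration and not the prefix shown in the figure.

```python
import ipaddress

# 192.0.2.0/24 is a documentation prefix chosen for illustration only.
prefix = ipaddress.ip_network("192.0.2.0/24")

first_address = prefix[0]    # lowest address covered by the prefix
last_address = prefix[-1]    # highest address covered by the prefix

print(prefix.netmask)        # the network mask implied by /24
print(prefix.num_addresses)  # how many sequential addresses the prefix spans
print(first_address, last_address)
```

The /24 fixes the first 24 bits of the address, leaving 8 bits free, so the prefix spans 2^8 = 256 sequential addresses.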
Network providers try to announce the largest prefixes possible to avoid bloating the router tables with too-specific announcements. Multi-homing by its nature forces the announcement of smaller (more specific) routes, hence the growing problem with Internet routing.
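The difference between one large announcement and several specific ones can be sketched with the same standard ipaddress module. The four /24 blocks below come from private address space and are purely illustrative assumptions, as is the idea that one of them belongs to a multi-homed customer.

```python
import ipaddress

# Four contiguous /24 blocks a provider might hold (illustrative private space).
specifics = [ipaddress.ip_network(f"10.1.{i}.0/24") for i in range(4)]

# Aggregated, they collapse into a single, larger prefix to announce.
aggregates = list(ipaddress.collapse_addresses(specifics))
print(aggregates)

# A multi-homed customer's block still has to be announced on its own through
# the second provider, even though the aggregate already covers it.
customer_block = ipaddress.ip_network("10.1.2.0/24")
covered = customer_block.subnet_of(aggregates[0])
print(covered)
```

One aggregate costs every router a single table entry; each more-specific announcement that escapes aggregation costs an additional one.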
Routing within an AS is handled by an Interior Gateway Protocol (an IGP such as OSPF or IS-IS, or in many cases, especially backbones, Interior BGP, or IBGP), but this routing information is not visible outside the AS. It does, however, have an effect, and is thought by many to be part of the reason Internet routing stability is increasingly problematic.
Each border router in an AS must establish a BGP session with its peers, or neighbors, in every other AS to which it is connected: a permanent TCP connection over which route announcements (also called route advertisements) are transmitted. When a BGP session first comes up, the two routers exchange their entire routing, or forwarding, tables, minus routes that are filtered out to prevent various horrible things from happening. (It's possible, for instance, to route the traffic from one major ISP to another through your own site by not filtering properly, which basically shuts down your router in a self-inflicted denial-of-service attack!)
After that, only route changes are transmitted. If a BGP session goes down for any reason, all the routes announced via that session are withdrawn. This effect is virtually immediate in next-hop routers, and then propagates through the Internet as the affected routers announce the changes to their peers. BGP takes from three minutes to 15 minutes or more to stabilize globally; in pathological cases, it may not stabilize at all.
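The life cycle of a session described above (full table first, then incremental changes, then withdrawal of everything when the session drops) can be sketched as a toy model. The class and method names are hypothetical, and the prefixes and AS numbers come from ranges reserved for documentation; no real router API is implied.

```python
class PeerSession:
    """Toy model of the routes one router has learned over one BGP session."""

    def __init__(self, peer_name):
        self.peer_name = peer_name
        self.routes = {}  # prefix -> AS path learned from this peer

    def receive_full_table(self, table):
        # Session start: the peer sends every route it is willing to share.
        self.routes = dict(table)

    def receive_update(self, announced=None, withdrawn=()):
        # Afterwards only changes arrive: withdrawals and new announcements.
        for prefix in withdrawn:
            self.routes.pop(prefix, None)
        self.routes.update(announced or {})

    def session_down(self):
        # Losing the session withdraws every route learned through it.
        lost = sorted(self.routes)
        self.routes.clear()
        return lost


session = PeerSession("isp-a")
session.receive_full_table({
    "192.0.2.0/24": [64500],
    "198.51.100.0/24": [64500, 64501],
})
session.receive_update(
    announced={"203.0.113.0/24": [64500, 64502]},
    withdrawn=["198.51.100.0/24"],
)
lost_prefixes = session.session_down()
print(lost_prefixes)
```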
A router has four tables in which information obtained via BGP is stored:
* Adj-RIB-In, which stores route announcements received from each peer.
* The main BGP table, which is the sum of all the Adj-RIB-In tables, with whatever parameters have been added by the action of route maps (discussed below).
* A forwarding or routing table, which stores the ONE route selected, according to routing policy, to reach a given prefix. This is what the router uses to make its real-time forwarding decision. It is this table that's getting too big for even the huge core routers to handle--sure, they've got the most processing power, but they also have to have a route for every announced prefix in the world!
* Adj-RIB-Out, which stores routes to be announced to each peer. A filter is applied to remove routes which should not, for various reasons, be propagated.
A route announcement (see the figure) consists of four basic parts:
* The destination prefix (an IP address plus subnet mask; for instance 188.8.131.52/24, which represents 256 IP addresses).
* A list of the ASs that a packet will traverse to reach that destination (AS-PATH).
* The next hop for that route (usually the IP address of the router making the announcement).
* A set of BGP attributes.
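The four parts listed above map naturally onto a small record type. This is a sketch only: the field names are hypothetical, and the prefix, next hop, and AS numbers are taken from ranges reserved for documentation rather than from the figure.

```python
from dataclasses import dataclass, field
import ipaddress


@dataclass
class RouteAnnouncement:
    prefix: ipaddress.IPv4Network     # destination prefix
    as_path: list                     # ASs a packet will traverse (AS-PATH)
    next_hop: ipaddress.IPv4Address   # where to deliver packets for this prefix
    attributes: dict = field(default_factory=dict)  # remaining BGP attributes


announcement = RouteAnnouncement(
    prefix=ipaddress.ip_network("192.0.2.0/24"),
    as_path=[64500, 64501],
    next_hop=ipaddress.ip_address("198.51.100.1"),
    attributes={"MED": 10},
)
print(announcement)
```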
In effect, a route announcement is a promise that any packets addressed to the destination prefix that are delivered to the given next hop will be moved one hop closer to that destination.
BGP attributes describe various metrics about a route and are used by a router to apply the site's or AS's routing policy to routes received from other ASs or sent to them. There are several types:
* Some are passed along from AS to AS all the way from beginning (the router closest to the destination) to end (the router closest to the source). An example of this is AS-PATH, which is basically an ordered list of all the ASs that a packet will pass through on its way to the destination prefix.
* Some are set locally, passed to the next AS, but not further. An example of this is MULTI-EXIT-DISCRIMINATOR (MED), which is used by an AS to tell the ASs to which it is connected which is the best router to use if multiple ones are available.
* Some are set locally and not passed to the next AS. An example of this is LOCAL-PREF, a very powerful attribute that overrides almost every other one and expresses a preference for a given route.
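The three scopes above can be sketched as a single export step at an AS boundary. The function name and dictionary keys are hypothetical, and the AS numbers are from the documentation range; the sketch assumes the usual convention that an AS prepends its own number to AS-PATH when it announces a route onward.

```python
def export_to_neighbor_as(route, local_as, local_med=None):
    """Build the announcement a neighboring AS would receive for this route."""
    # AS-PATH travels end to end; each AS adds itself at the front.
    exported = {"as_path": [local_as] + route["as_path"]}
    # MED goes one AS and no further: a received MED is dropped here, and only
    # a value set locally is sent to the neighbor.
    if local_med is not None:
        exported["med"] = local_med
    # LOCAL-PREF never leaves the AS, so it is not copied at all.
    return exported


received = {"as_path": [64501], "med": 50, "local_pref": 200}
exported = export_to_neighbor_as(received, local_as=64500, local_med=10)
print(exported)
```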
Choosing A Route
When a router receives a route announcement from a peer, it stores it in an Adj-RIB-In table. A router has one Adj-RIB-In for every peer it is connected to via a BGP session. The router then applies the routing policy of the AS (i.e. decides which route to use for a given prefix) by applying a route map to all the announcements in all the Adj-RIB-In tables. This route map is a set of paired match and set statements that are basically if-then rules applied to the four parts of the routing announcements to decide how to set associated BGP attributes.
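A route map's paired match and set statements can be sketched as a list of (match, set) pairs. Everything here is illustrative: the function name, the dictionary keys, the starting LOCAL-PREF of 100, and the choice to let unmatched routes pass through unchanged are all assumptions, and real router configuration syntax differs by vendor.

```python
import ipaddress


def apply_route_map(route, route_map):
    """Return a copy of the route with the first matching clause's sets applied."""
    for match, set_values in route_map:
        if match(route):
            return {**route, **set_values}
    return dict(route)  # simplification: unmatched routes pass unchanged


route_map = [
    # If the path runs through AS 64501, then prefer the route.
    (lambda r: 64501 in r["as_path"], {"local_pref": 200}),
    # If the prefix is more specific than a /24, then de-prefer it.
    (lambda r: ipaddress.ip_network(r["prefix"]).prefixlen > 24, {"local_pref": 50}),
]

preferred = apply_route_map(
    {"prefix": "192.0.2.0/24", "as_path": [64501, 64502], "local_pref": 100},
    route_map,
)
untouched = apply_route_map(
    {"prefix": "198.51.100.0/24", "as_path": [64503], "local_pref": 100},
    route_map,
)
print(preferred["local_pref"], untouched["local_pref"])
```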
After the routing announcement has been passed through the route map, all of the information in it has been acted on, and local BGP parameters have been set. The resulting list of routes is stored in the router's main BGP table. The router's control plane then acts on all the routing announcements for a given prefix according to a decision tree specified in the BGP standard that ranks the various attributes. In this decision tree LOCAL-PREF outweighs everything except a dead route (unless the proprietary Cisco attribute WEIGHT is used). Because a dead route is eliminated before any preference is considered, the next-best route takes over as soon as the preferred one disappears, which means that failover from a dead router is automatic with BGP.
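A heavily abbreviated version of that decision tree might look like the sketch below. It is not the full process from the BGP standard: it assumes only three ranking steps (LOCAL-PREF, then AS-PATH length, then MED), ignores WEIGHT and the later tie-breakers, and uses hypothetical field names throughout.

```python
def select_best_route(candidates):
    """Pick one route for a prefix from all candidates in the main BGP table."""
    # A dead route (unreachable next hop) is discarded before any ranking.
    usable = [r for r in candidates if r["next_hop_reachable"]]
    if not usable:
        return None
    # Highest LOCAL-PREF wins; then shortest AS-PATH; then lowest MED.
    return min(usable, key=lambda r: (-r["local_pref"], len(r["as_path"]), r["med"]))


candidates = [
    {"peer": "isp_a", "local_pref": 100, "as_path": [64500],
     "med": 0, "next_hop_reachable": True},
    {"peer": "isp_b", "local_pref": 200, "as_path": [64501, 64510, 64511],
     "med": 0, "next_hop_reachable": True},
]

best = select_best_route(candidates)        # LOCAL-PREF beats the shorter path
candidates[1]["next_hop_reachable"] = False  # the preferred route dies
fallback = select_best_route(candidates)     # selection falls back on its own
print(best["peer"], fallback["peer"])
```

The failover needs no special handling: rerunning the same selection after the preferred route dies is enough to promote the remaining one.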
The application of this decision tree results in one and only one route being chosen for a given prefix. This route is then stored in the routing or forwarding table. Then a set of outgoing filters is applied, again using route maps, to decide which routes should be passed along (announced) to the router's peers. The resulting set of routes is stored in the Adj-RIB-Out table and sent out via the appropriate BGP session(s).
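Putting the tables together, the flow from Adj-RIB-In to Adj-RIB-Out can be sketched for a small multi-homed site. The function, the peer names, and the simplified two-step ranking are assumptions, and a single Adj-RIB-Out stands in for the per-peer tables a real router keeps. The export filter here passes only locally originated routes, one way to avoid carrying traffic between the two ISPs.

```python
def build_tables(adj_rib_in, export_filter):
    # Main BGP table: every candidate from every peer, grouped by prefix.
    bgp_table = {}
    for peer, routes in adj_rib_in.items():
        for route in routes:
            bgp_table.setdefault(route["prefix"], []).append({**route, "peer": peer})

    # Forwarding table: exactly one selected route per prefix
    # (highest LOCAL-PREF, then shortest AS-PATH; a simplified ranking).
    forwarding_table = {
        prefix: max(routes, key=lambda r: (r["local_pref"], -len(r["as_path"])))
        for prefix, routes in bgp_table.items()
    }

    # Adj-RIB-Out: only selected routes that survive the outgoing filter.
    adj_rib_out = {p: r for p, r in forwarding_table.items() if export_filter(r)}
    return bgp_table, forwarding_table, adj_rib_out


adj_rib_in = {
    "local": [{"prefix": "192.0.2.0/24", "as_path": [], "local_pref": 100}],
    "isp_a": [{"prefix": "203.0.113.0/24", "as_path": [64500, 64510], "local_pref": 100}],
    "isp_b": [{"prefix": "203.0.113.0/24", "as_path": [64501], "local_pref": 100}],
}

# Announce only routes this site originates (empty AS-PATH).
bgp_table, forwarding_table, adj_rib_out = build_tables(
    adj_rib_in, export_filter=lambda r: r["as_path"] == []
)
print(forwarding_table["203.0.113.0/24"]["peer"], list(adj_rib_out))
```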
In practice, what this procedure means is that routers use and announce a single route for each prefix to their peers, whether via BGP for those in other ASs or via IBGP for those in the same AS.
This article is the first in a two-part series. The second part will appear in the July issue of CTR.