Dynamic Routing on Linux: A 6-Month FRRouting Production Post-Mortem

Networking tutorial - IT technology blog

The Breaking Point of Static Route Management

Six months ago, our network reached a level of complexity that manual configuration could no longer support. We had scaled from three simple gateways to a sprawling infrastructure of twelve subnets across data centers in Ashburn and Frankfurt. Our routing table was a fragile collection of static entries. Every time we provisioned a new VLAN, or a site-to-site VPN tunnel flapped, a sysadmin had to update half a dozen servers by hand.

The collapse happened on a Tuesday night. A simple typo in an ip route add command caused a 40-minute outage for our entire staging environment. A single gateway began black-holing packets because it lacked the return path for a newly created subnet. This wasn’t just a human-error problem. It was a scalability wall. We needed a network that could heal itself without a human in the loop.

The Core Problem: Static Routes Are “Dumb”

Static routing fails because it lacks state awareness. To the Linux kernel, a static route is a blind instruction. If the target interface is “UP,” the kernel will keep pushing packets into that interface. It doesn’t matter if the next-hop router has crashed or if an upstream provider is experiencing a massive routing leak. Packets simply disappear into the void.

To run a resilient stack, we needed three capabilities that static routes can’t provide:

  • Sub-Second Failover: Traffic must automatically reroute if a primary link drops.
  • Automated Discovery: New subnets should announce their presence to the entire fabric instantly.
  • Health Monitoring: Routes must be withdrawn the moment a destination becomes unreachable.

Choosing the Right Stack: Quagga vs. BIRD vs. FRR

I spent a week testing the three main contenders for Linux routing. Each has a specific niche, but only one fit our production needs.

1. Quagga

Quagga is the grandfather of Linux routing, but it shows its age. Development has stalled, and it struggles with modern multi-threaded workloads. During testing, it felt sluggish and lacked the robust API support we wanted for future automation.

2. BIRD

BIRD is a powerhouse. It is the industry standard for Internet Exchange Points (IXPs) handling millions of routes. However, its configuration syntax is a custom programming language. Unless you have a dedicated network engineer to manage BGP policies, the learning curve is prohibitively steep for a standard DevOps team.

3. FRRouting (FRR)

FRR is the modern fork of Quagga, backed by heavyweights like Nvidia and Broadcom. It uses vtysh, a shell that mimics the Cisco IOS/Arista EOS workflow. For anyone who has touched a hardware switch, it feels familiar. It handles OSPF, BGP, and EVPN with ease, making it the most versatile choice for our hybrid environment.

The Implementation: OSPF and BGP in Production

After 180 days in production, our setup has proven remarkably stable. We use OSPF for internal (East-West) traffic and BGP for external (North-South) connectivity. This dual-protocol approach balances speed with granular control.

Step 1: Installation

On Ubuntu 22.04 or 24.04, skip the default OS repositories. They often lag behind the latest stable releases. Instead, use the official FRR repository to ensure you have the latest security patches.

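# Add the official repository and its signing key (apt-key is deprecated)
curl -s https://deb.frrouting.org/frr/keys.gpg | sudo tee /usr/share/keyrings/frrouting.gpg > /dev/null
FRRVER="frr-stable"
echo deb '[signed-by=/usr/share/keyrings/frrouting.gpg]' https://deb.frrouting.org/frr \
     $(lsb_release -s -c) $FRRVER | sudo tee -a /etc/apt/sources.list.d/frr.list

# Refresh the package index and install FRR
sudo apt update && sudo apt install frr frr-pythontools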

Next, enable the specific protocols you need by editing /etc/frr/daemons. For our setup, we set bgpd=yes and ospfd=yes, then restarted the service.
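The daemons edit can also be scripted. The following is a minimal sketch, assuming the packaged /etc/frr/daemons file lists each daemon on its own line as name=no. The helper name enable_frr_daemons is hypothetical and not part of FRR.

```shell
#!/bin/sh
# Hypothetical helper: flip "<daemon>=no" to "<daemon>=yes" in an FRR daemons file.
# Usage: enable_frr_daemons FILE DAEMON...
enable_frr_daemons() {
    file=$1
    shift
    for daemon in "$@"; do
        # Only rewrite lines that consist of exactly "<daemon>=no".
        sed -i "s/^${daemon}=no\$/${daemon}=yes/" "$file"
    done
}

# On a real gateway (run as root, not executed here):
#   cp /etc/frr/daemons /etc/frr/daemons.bak
#   enable_frr_daemons /etc/frr/daemons bgpd ospfd
#   systemctl restart frr
```

Keeping a backup of the daemons file before the edit makes it easy to roll back if the service fails to start.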

Step 2: Internal Routing with OSPF

OSPF is our “set it and forget it” tool for internal subnets. It ensures that every gateway knows about every other gateway. Use vtysh to configure it rather than editing raw text files.

sudo vtysh
configure terminal
router ospf
  network 10.0.0.0/24 area 0
  network 192.168.1.0/24 area 0
end
write memory

This configuration eliminated our manual tracking. When we add a new interface to Area 0, the route propagates to the rest of the cluster in under 200ms.
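Before trusting propagation, it helps to confirm that adjacencies have actually formed. The sketch below assumes the usual tabular output of show ip ospf neighbor, where the state (for example Full/DR) is the third column; the column layout can differ between FRR versions, and count_full_neighbors is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count neighbors whose State column (3rd field)
# starts with "Full". Reads "show ip ospf neighbor" output on stdin.
count_full_neighbors() {
    awk '$3 ~ /^Full/ { n++ } END { print n + 0 }'
}

# On a gateway (not executed here):
#   sudo vtysh -c 'show ip ospf neighbor' | count_full_neighbors
#   sudo vtysh -c 'show ip route ospf'
```

A count lower than the number of expected peers on the segment is a hint that an interface is missing from Area 0 or that hello packets are being dropped.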

Step 3: External Peering via BGP

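BGP is essential for connecting to cloud providers like AWS or Azure. Here is a simplified version of our peer config for an AWS Direct Connect gateway. Note that recent FRR releases enable bgp ebgp-requires-policy by default in the traditional profile, so an eBGP peer like this one exchanges no routes until inbound and outbound policies are attached to the neighbor.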

router bgp 65001
  neighbor 169.254.0.1 remote-as 65002
  neighbor 169.254.0.1 description AWS-Primary
  !
  address-family ipv4 unicast
    network 10.50.0.0/16
  exit-address-family
exit

Hard-Won Lessons from the Field

Transitioning to dynamic routing changed our entire operational philosophy. Here are three critical takeaways from the last six months.

1. Visibility is Everything

Dynamic routing is great until a link starts “flapping” (rapidly going up and down). This can trigger a route recalculation storm. We now export FRR metrics to Prometheus. We need to know if a BGP session drops within seconds, long before users report latency spikes.
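As an illustration of the kind of signal worth exporting, here is a minimal sketch that turns BGP peer states into Prometheus text-format lines. It is not necessarily how our exporter works: the metric name frr_bgp_peer_up, the helper name, and the output path in the comments are all hypothetical, and the jq filter assumes the JSON summary exposes a peers object with a state field, which should be checked against your FRR version.

```shell
#!/bin/sh
# Hypothetical helper: read "peer state" pairs on stdin and print one
# Prometheus gauge per peer (1 = Established, 0 = anything else).
bgp_states_to_metrics() {
    while read -r peer state; do
        [ -n "$peer" ] || continue
        if [ "$state" = "Established" ]; then up=1; else up=0; fi
        printf 'frr_bgp_peer_up{peer="%s"} %d\n' "$peer" "$up"
    done
}

# On a gateway (needs jq; JSON layout and output path are assumptions):
#   sudo vtysh -c 'show bgp ipv4 unicast summary json' \
#     | jq -r '(.ipv4Unicast // .) | .peers | to_entries[] | "\(.key) \(.value.state)"' \
#     | bgp_states_to_metrics > /var/lib/node_exporter/textfile/frr_bgp.prom
```

An alert on this gauge dropping to 0 for a peer fires on the session state itself rather than on downstream latency symptoms.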

2. Respect the vtysh Workflow

Avoid the temptation to manually hack /etc/frr/frr.conf. Using vtysh provides real-time syntax checking. It allows you to apply changes live without tearing down existing traffic flows. Get comfortable with show ip route and show ip bgp summary; they are your best diagnostic tools.

3. Never Trust a Neighbor

Always implement prefix lists. If you don’t filter your BGP neighbors, a misconfigured peer could accidentally send you a default route (0.0.0.0/0). This would effectively hijack all your outbound traffic. We use strict filters to only allow specific, expected subnets.

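ip prefix-list ONLY-OUR-SUBNETS permit 10.0.0.0/8 ge 24
!
route-map IMPORT-FILTER permit 10
  match ip address prefix-list ONLY-OUR-SUBNETS
! Bind the filter inbound, reusing the example peer from Step 3
router bgp 65001
  address-family ipv4 unicast
    neighbor 169.254.0.1 route-map IMPORT-FILTER in
  exit-address-family
!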
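To make the effect of ge 24 concrete, the following sketch mimics how that single prefix-list entry would judge a few candidate prefixes. It is an approximation for illustration only: FRR compares bits, while this helper checks the first octet, which is equivalent here only because the base prefix is a /8. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper mimicking "permit 10.0.0.0/8 ge 24":
# the first 8 bits must equal 10 and the mask length must be 24..32.
only_our_subnets_permits() {
    addr=${1%/*}              # address part, e.g. 10.1.2.0
    len=${1#*/}               # mask length, e.g. 24
    first_octet=${addr%%.*}
    [ "$first_octet" -eq 10 ] && [ "$len" -ge 24 ] && [ "$len" -le 32 ]
}

for prefix in 10.1.2.0/24 10.50.0.0/16 0.0.0.0/0 192.168.1.0/24; do
    if only_our_subnets_permits "$prefix"; then
        echo "$prefix permit"
    else
        echo "$prefix deny"
    fi
done
```

Only 10.1.2.0/24 is permitted. The default route is denied, and so is a /16 inside 10.0.0.0/8, because the route-map has no further permit clause and anything unmatched falls through to its implicit deny.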

The Bottom Line

Our network is now significantly more resilient. During a recent hardware failure on an edge gateway, FRR rerouted traffic to a backup path in roughly 2.4 seconds. No one on the engineering team had to wake up. If you manage more than a handful of Linux nodes, stop using static routes. The setup time for FRRouting pays for itself the moment your first link fails and your users notice nothing.
