Cloudflare

Cloudflare has shared details about a recent Border Gateway Protocol (BGP) route leak that affected IPv6 traffic for 25 minutes. The incident caused significant congestion and packet loss, with approximately 12 Gbps of traffic dropped, and its impact reached external networks beyond Cloudflare’s direct customers.

What is a BGP route leak?

BGP is the protocol that exchanges routing information between the independently operated networks, known as autonomous systems (ASes), that make up the Internet. A BGP route leak occurs when an autonomous system violates routing policy by advertising routes learned from one peer or provider to another peer or provider.

Cloudflare classified this incident as a mix of Type 3 and Type 4 route leaks, as defined in RFC 7908. These leaks violate the “valley-free routing” rules that dictate how routes should be propagated based on the business relationships between networks. Violating these rules can draw traffic onto unstable or unintended paths, often resulting in congestion, suboptimal performance, or, as in this case, traffic being dropped outright when firewall filters reject it.
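The valley-free rule and the RFC 7908 leak types it maps to can be sketched in a few lines of Python. This is an illustrative model only: the `Rel` enum and the `may_export` and `leak_type` functions are hypothetical names invented for this example, and the sketch covers only the relationship-based leak types (Types 1 to 4), not re-origination or internal-prefix leaks.

```python
from enum import Enum
from typing import Optional


class Rel(Enum):
    """Business relationship of a BGP neighbor, seen from the local AS."""
    CUSTOMER = "customer"
    PEER = "peer"
    PROVIDER = "provider"


def may_export(learned_from: Rel, export_to: Rel) -> bool:
    """Valley-free rule: customer routes may go to anyone;
    peer and provider routes may go only to customers."""
    return learned_from is Rel.CUSTOMER or export_to is Rel.CUSTOMER


# RFC 7908 leak types that follow directly from the neighbor relationships.
_LEAK_TYPES = {
    (Rel.PROVIDER, Rel.PROVIDER): 1,  # hairpin turn
    (Rel.PEER, Rel.PEER): 2,          # lateral peer-to-peer leak
    (Rel.PROVIDER, Rel.PEER): 3,      # provider routes leaked to a peer
    (Rel.PEER, Rel.PROVIDER): 4,      # peer routes leaked to a provider
}


def leak_type(learned_from: Rel, export_to: Rel) -> Optional[int]:
    """Return the RFC 7908 leak type, or None if the export is valley-free."""
    if may_export(learned_from, export_to):
        return None
    return _LEAK_TYPES[(learned_from, export_to)]
```

Under this model, re-advertising a provider-learned route to a peer yields Type 3 and re-advertising a peer-learned route to a provider yields Type 4, the two categories named in the incident report.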

Cause of the Incident and Consequences

The root cause was an accidental misconfiguration on a router. The trigger was a policy change intended to stop Cloudflare from advertising Bogotá IPv6 prefixes. However, removing the specific prefix lists left the export policy overly permissive: the remaining match on internal route type accepted all internal (iBGP) IPv6 routes and exported them externally to all of Cloudflare’s BGP neighbors in Miami.
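A small Python model may help show why dropping a prefix list widens a policy term instead of narrowing it. This is a sketch under stated assumptions, not Cloudflare's actual configuration: the `Route` and `Term` classes, the `exported` helper, and the documentation prefixes under 2001:db8::/32 are all hypothetical. The assumption is that a term matches a route when every condition it specifies is satisfied, so a term with fewer conditions matches more routes.

```python
from dataclasses import dataclass
from ipaddress import ip_network
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Route:
    prefix: str
    route_type: str  # "internal" (learned via iBGP) or "external"


@dataclass(frozen=True)
class Term:
    """Hypothetical export-policy term: every condition that is set must match."""
    route_type: Optional[str] = None
    prefix_list: Optional[Tuple[str, ...]] = None

    def matches(self, route: Route) -> bool:
        if self.route_type is not None and route.route_type != self.route_type:
            return False
        if self.prefix_list is not None:
            net = ip_network(route.prefix)
            if not any(net.subnet_of(ip_network(p)) for p in self.prefix_list):
                return False
        return True  # an unset condition places no restriction on the route


def exported(term: Term, routes: List[Route]) -> List[str]:
    """Prefixes that the term would accept for export."""
    return [r.prefix for r in routes if term.matches(r)]


routes = [
    Route("2001:db8:1::/48", "internal"),  # stands in for the site being withdrawn
    Route("2001:db8:2::/48", "internal"),  # unrelated internal route
]

# Before: the term is scoped by both route type and a prefix list.
before = Term(route_type="internal", prefix_list=("2001:db8:1::/48",))
# After: the prefix list is removed, leaving only the route-type match.
after = Term(route_type="internal")
```

In this model `exported(before, routes)` contains only the first prefix, while `exported(after, routes)` contains every internal route, which mirrors the overly permissive behavior described above.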

Although the main impact of such incidents is loss of reliability, they also have a security dimension, as they can facilitate BGP hijacking attacks, allowing unauthorized parties to intercept and analyze traffic.

Detection and Mitigation

Cloudflare detected the issue shortly after it appeared. The company’s engineers manually reverted the settings and paused the automation, stopping the impact within 25 minutes. The code change that triggered the incident was later reverted.

Lessons Learned and Prevention Measures

Cloudflare noted that the incident is similar to another that occurred in July 2020 and has proposed measures to prevent future occurrences:

  • Stricter export safeguards: Implementation of community-based safeguards to control route propagation.
  • Controls in CI/CD: Integration of automated checks into the Continuous Integration/Continuous Deployment (CI/CD) pipeline to detect routing policy errors.
  • Improved early detection: Optimize monitoring systems to quickly identify routing anomalies.
  • RPKI and ASPA validation: Promotion of Resource Public Key Infrastructure (RPKI) route origin validation and Autonomous System Provider Authorization (ASPA) as best practices for BGP routing.
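As one illustration of the CI/CD control in the list above, the following Python sketch flags export-policy terms that accept routes without any narrowing condition. The policy layout (a dictionary with `terms`, `match`, and `action` keys) and the function name `find_permissive_terms` are assumptions made for this example; a real pipeline would parse the vendor's actual configuration format.

```python
from typing import Dict, List

# Match conditions that restrict *which* prefixes a term can export.
NARROWING_KEYS = ("prefix_list", "community")


def find_permissive_terms(policy: Dict) -> List[str]:
    """Return the names of accepting terms that have no prefix-list or
    community condition and could therefore export far more than intended."""
    findings = []
    for term in policy.get("terms", []):
        if term.get("action") != "accept":
            continue
        match = term.get("match", {})
        if not any(match.get(key) for key in NARROWING_KEYS):
            findings.append(term["name"])
    return findings


# Hypothetical policy resembling the failure mode: after the prefix list
# is removed, only the route-type match remains on an accepting term.
risky_policy = {
    "name": "EXPORT-V6",
    "terms": [
        {"name": "site-routes", "match": {"route_type": "internal"}, "action": "accept"},
        {"name": "default", "match": {}, "action": "reject"},
    ],
}

safe_policy = {
    "name": "EXPORT-V6",
    "terms": [
        {
            "name": "site-routes",
            "match": {"route_type": "internal", "community": ["EXPORT-OK"]},
            "action": "accept",
        },
        {"name": "default", "match": {}, "action": "reject"},
    ],
}
```

A pipeline step could fail the build whenever `find_permissive_terms` returns a non-empty list, forcing a human review before the change reaches a router. The community condition in `safe_policy` also hints at how a community-based export safeguard might look in the same model.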

Conclusion

The Cloudflare BGP route leak incident highlights the complexity and importance of network configuration management in large-scale infrastructures. Rapid detection and response limited the impact to 25 minutes, but the root cause shows the need for more robust safeguards in CI/CD and configuration automation processes to prevent human errors that can have far-reaching consequences for the global network.