Cyber resilience engineering has been a recurring theme over the last few years, and the resilience gong has sounded again with the Federal Aviation Administration (FAA) system outage, which disrupted airline service nationwide and caused thousands of flight delays. DevOps, Site Reliability Engineering (SRE), and Cyber Resiliency Engineering all offer methodologies that could have helped the FAA reduce its risk.
It is worth noting that the FAA is most likely drowning in technical debt in the form of outdated technology, fragile infrastructure, complexity, and lack of documentation. Geoff Freeman, president and CEO of the U.S. Travel Association, said, “Today’s FAA catastrophic system failure is a clear sign that America’s transportation network desperately needs significant upgrades” (Muntean & Wallace, 2023). That statement underscores the technical debt at the FAA and how badly it needs attention.
The MITRE Cyber Resiliency Engineering Framework (MITRE CREF) focuses on four primary goals: Anticipate, Withstand, Recover, and Adapt. In this case, the FAA did not anticipate the outage and did not have resilience mechanisms (redundancy, automation, etc.) in place to withstand it. Recovery has been slow and painful, which suggests a lack of documentation, damage assessment, monitoring, root cause analysis capabilities, simplicity, and protected backup and restore capabilities. The question is: How will the FAA adapt and improve after this outage? Will the FAA designate significant budget toward reducing technical debt and improving resilience, with the goal of preventing future infrastructure failures?
The FAA is not alone in this predicament. Incident response over the last several years has taught us that most companies, from large enterprises down to small businesses, carry substantial technical debt and lack documentation, redundancy, and recoverability. Often, this is due to budget restrictions and the pressure to drive down costs and limit investments. Sometimes, the technical debt and lack of resiliency stem from a lack of attention and prioritization by the business. IT is now the backbone of most businesses, so it deserves more priority and more investment at most organizations. How should a business start to prioritize resilience? Consider a resilience framework like the MITRE Cyber Resiliency Engineering Framework (MITRE CREF) as a starting point.
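As one way to make that starting point concrete, the sketch below scores a capability checklist against the four goals named above (Anticipate, Withstand, Recover, Adapt) and reports which goal has the weakest coverage. It is a minimal illustration, not part of MITRE CREF itself: the `Capability` class, the function names, the sample entries, and the grouping of capabilities under each goal are all assumptions made for this example, loosely based on the gaps discussed in this article.

```python
from dataclasses import dataclass

# The four goals named in the article.
GOALS = ("Anticipate", "Withstand", "Recover", "Adapt")


@dataclass(frozen=True)
class Capability:
    """One checklist item (hypothetical structure, not a MITRE artifact)."""
    name: str
    goal: str       # one of GOALS
    in_place: bool  # True if the organization has this capability today


def coverage_by_goal(capabilities):
    """Return the fraction of capabilities in place for each goal.

    A goal with no checklist items maps to None (not assessed).
    """
    result = {}
    for goal in GOALS:
        relevant = [c for c in capabilities if c.goal == goal]
        if relevant:
            result[goal] = sum(c.in_place for c in relevant) / len(relevant)
        else:
            result[goal] = None
    return result


def weakest_goals(capabilities):
    """Return the assessed goal(s) with the lowest coverage."""
    assessed = {
        goal: score
        for goal, score in coverage_by_goal(capabilities).items()
        if score is not None
    }
    if not assessed:
        return []
    lowest = min(assessed.values())
    return [goal for goal in GOALS if assessed.get(goal) == lowest]


# Illustrative answers for a fictional organization (assumed data).
SAMPLE = [
    Capability("failure scenario planning", "Anticipate", False),
    Capability("redundancy", "Withstand", False),
    Capability("automation", "Withstand", True),
    Capability("documentation", "Recover", False),
    Capability("monitoring", "Recover", True),
    Capability("root cause analysis", "Recover", True),
    Capability("protected backup and restore", "Recover", True),
    Capability("budget for reducing technical debt", "Adapt", True),
]

if __name__ == "__main__":
    for goal, score in coverage_by_goal(SAMPLE).items():
        label = "not assessed" if score is None else f"{score:.0%}"
        print(f"{goal}: {label}")
    print("Prioritize first:", ", ".join(weakest_goals(SAMPLE)))
```

A simple in-place/not-in-place checklist is deliberately coarse. Its value is in giving the business a shared, visible view of where investment is most needed, which a real assessment would then refine.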
Muntean, P., & Wallace, G. (2023, January 11). FAA system outage causes thousands of flight delays and cancellations across the US. CNN. Retrieved from https://www.cnn.com/travel/article/faa-computer-outage-flights-grounded/index.html