Learning from the Past: Mars Orbiter Failure Due to Quality

            NASA’s Mars Climate Orbiter (MCO) mission ended in 1999 with the loss of the spacecraft after a series of defects went uncaught.  The root cause was a mismatch of units across system interfaces between Lockheed Martin and NASA:  some software used metric units and some used imperial units, so spacecraft navigation and tracking data were wrong, which led to incorrect course adjustments and transfer burns (Dodd, 2020).  These errors added up to the loss of the orbiter, which was followed by inquiries and NASA’s follow-up investigation (Isbell & Savage, 1999).  The post-mortem identified eight contributing factors, and this article considers how each could be mitigated in future missions through Failure Mode & Effects Analysis (FMEA), Six Sigma, Value Stream Mapping (VSM), and resilience engineering.
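
            The unit mismatch described above can be illustrated with a minimal sketch in which a quantity carries its unit with it instead of travelling as a bare number.  The quantity chosen here (thruster impulse), the class name `Impulse`, and the function `update_trajectory_model` are assumptions made for illustration, not a description of the actual flight or ground software.

```python
from dataclasses import dataclass

# 1 pound-force second expressed in newton-seconds.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """Thruster impulse, always stored internally in newton-seconds (SI)."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(value)

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # Convert at the boundary so everything downstream sees SI only.
        return cls(value * LBF_S_TO_N_S)


def update_trajectory_model(impulse: Impulse) -> float:
    """Hypothetical consumer that refuses bare numbers with no unit attached."""
    if not isinstance(impulse, Impulse):
        raise TypeError("impulse must be an Impulse, not a bare number")
    return impulse.newton_seconds


# A producer working in imperial units must say so explicitly.
contractor_output = Impulse.from_pound_force_seconds(10.0)
model_input = update_trajectory_model(contractor_output)
print(f"{model_input:.3f} N*s")
```

            The design choice is that the conversion happens once, at the interface, and a value without a declared unit is rejected instead of being silently misread.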

Failure Mode & Effects Analysis (FMEA), Six Sigma, Value Stream Mapping (VSM), and Resilience Engineering

            In design, engineering, manufacturing, and testing at NASA, two tools and methods would reduce defects and use predictive analysis to avoid mission failures:  Failure Mode & Effects Analysis (FMEA) and Six Sigma.  FMEA is a qualitative tool used to systematically analyze the ways a design or process could fail and the effects of each failure (Forrest, n.d.).  Six Sigma is a quality management method built around reducing defects and waste.  It is primarily quantitative, but it can also quantify qualitative factors in order to provide data-driven insights.
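
            One common way FMEA results are prioritized, which the sources cited here do not spell out, is the risk priority number (RPN):  the product of severity, occurrence, and detection scores, each typically rated from 1 to 10.  The sketch below assumes that convention; the failure modes are drawn from this article, but every score is a made-up placeholder and not data from the mission.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    description: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (almost certain)
    detection: int   # 1 (almost always detected) to 10 (almost never detected)

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def rank_by_risk(modes):
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical scores, for illustration only.
failure_modes = [
    FailureMode("Anomaly report not filed formally", 6, 5, 6),
    FailureMode("Unit mismatch across a software interface", 10, 4, 8),
    FailureMode("Navigation work not peer reviewed", 7, 5, 5),
]

ranked = rank_by_risk(failure_modes)
for mode in ranked:
    print(f"{mode.rpn:4d}  {mode.description}")
```

            A high detection score means the failure is hard to detect, which is why a defect that slips past every review can outrank one that occurs more often.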

            Value stream mapping (VSM) is a lean tool that flowcharts and documents every step of a process and can be used for process improvement.  A value stream map illustrates how information is passed between process steps (What is Value Stream Mapping (VSM), n.d.).  This would have been useful in the NASA project to detail the process as well as the inputs, outputs, and information flows at each process step.

            Resilience engineering, popularized by Erik Hollnagel and colleagues in the mid-2000s, is the practice of designing and engineering systems that can adjust their functions before, during, or after disturbances or changes so that they can continue operations under unexpected conditions (Lay, 2019).  In other words, resilience engineering produces systems that are designed to bounce back from adversity and error.  Because it designs for the failure of one or more subsystems, it would have been well suited to the Mars orbiter.

8 Contributing Factors to Orbiter Failure and Suggestions for Mitigation

            “Errors went undetected within ground-based computer models of how small thruster firings on the spacecraft were predicted and then carried out on the spacecraft during its interplanetary trip to Mars” (Isbell & Savage, 1999).  For this factor, Failure Mode & Effects Analysis (FMEA) would have explored the downstream effects of thruster firing errors.  A Six Sigma implementation could have flagged outliers in the computer models’ results, potentially exposing anomalies and errors.  In addition, more prototyping and testing would have improved the likelihood of catching this defect.
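
            The outlier detection mentioned above can be sketched with control limits, a technique commonly associated with Six Sigma:  limits are computed from a baseline period believed to be healthy, and later observations outside those limits are flagged for investigation.  The residual values below (predicted minus observed, in arbitrary units) and the function names are hypothetical and are not taken from the mission data.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, upper) limits at `sigmas` standard deviations from the baseline mean."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation
    return center - sigmas * spread, center + sigmas * spread


def flag_outliers(observations, limits):
    """Return (index, value) pairs for observations outside the limits."""
    lower, upper = limits
    return [(i, x) for i, x in enumerate(observations) if x < lower or x > upper]


# Hypothetical model residuals from a period believed to be healthy.
baseline_residuals = [0.02, -0.01, 0.00, 0.03, -0.02, 0.01, -0.03, 0.02]
# Hypothetical later residuals to be checked against the baseline.
new_residuals = [0.01, -0.02, 0.35, 0.00]

limits = control_limits(baseline_residuals)
flagged = flag_outliers(new_residuals, limits)
print(flagged)
```

            Computing the limits from a separate baseline matters:  if a suspect value is included in the sample used to set the limits, it inflates the standard deviation and can hide itself.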

            “The operational navigation team was not fully informed on the details of the way that Mars Climate Orbiter was pointed in space, as compared to the earlier Mars Global Surveyor mission” (Isbell & Savage, 1999).  This failure of communication and information sharing could have been remedied with a value stream map (VSM) detailing the information that must pass to and from each process step (What is Value Stream Mapping (VSM), n.d.).
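
            A value stream map is normally a diagram, but the information hand-offs it documents can also be recorded as data and checked automatically.  The following is a minimal sketch under that assumption; the step names, information items, and unit labels are hypothetical examples and not the project’s actual process.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessStep:
    name: str
    inputs: dict = field(default_factory=dict)   # information item -> expected unit or format
    outputs: dict = field(default_factory=dict)  # information item -> supplied unit or format


def find_handoff_gaps(steps):
    """Walk the steps in order and report inputs that are missing or arrive in a different unit."""
    available = {}
    gaps = []
    for step in steps:
        for item, unit in step.inputs.items():
            if item not in available:
                gaps.append(f"{step.name}: '{item}' is not produced by any earlier step")
            elif available[item] != unit:
                gaps.append(
                    f"{step.name}: '{item}' expected in {unit}, supplied in {available[item]}"
                )
        available.update(step.outputs)
    return gaps


# Hypothetical two-step value stream.
steps = [
    ProcessStep("Spacecraft team", outputs={"thruster impulse": "lbf*s"}),
    ProcessStep(
        "Navigation team",
        inputs={"thruster impulse": "N*s", "spacecraft attitude details": "document"},
        outputs={"trajectory estimate": "km"},
    ),
]

gaps = find_handoff_gaps(steps)
for gap in gaps:
    print(gap)
```

            Run on this example, the check reports both a unit disagreement and an input that no upstream step provides, which are the two kinds of hand-off gap discussed in this article.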

            “A final, optional engine firing to raise the spacecraft’s path relative to Mars before its arrival was considered but not performed for several interdependent reasons” (Isbell & Savage, 1999).  FMEA would have been well suited to identifying the need for the additional engine firing based on a possible failure state.  Resilience engineering would have recognized that the optional firing provided the margin needed to adapt to necessary path corrections.

            “The systems engineering function within the project that is supposed to track and double-check all interconnected aspects of the mission was not robust enough, exacerbated by the first-time handover of a Mars-bound spacecraft from a group that constructed it and launched it to a new, multi-mission operations team” (Isbell & Savage, 1999).  In addition to a systems engineering problem, which FMEA and VSM could have helped with, this points to a lack of total system testing.  FMEA could have recognized failure points from a systemic point of view.  VSM, applied at the level of the total system, would have illustrated and documented the inter-process dependencies that failed when the parts were assembled into an entire system.

            “Some communications channels among project engineering groups were too informal” (Isbell & Savage, 1999).  Neither FMEA, nor resilience engineering, nor Six Sigma could have improved communication processes during the design process between engineering teams.  Better communication methods and channels should have been specified by the project manager so that better verbal, non-verbal, written, and visual communication was present (5 Effective Communication Methods, n.d.).  Defining a communications plan to formalize communication methods and frequency is basic project management.

            “The small mission navigation team was oversubscribed, and its work did not receive peer review by independent experts” (Isbell & Savage, 1999).  This factor illustrates two problems.  First, the team was oversubscribed:  its members were doing too many tasks at once or belonged to several teams with competing responsibilities.  Ideally, each engineer could focus on one part of the project, but in practice almost every project team is oversubscribed and there is almost never excess capacity.  The fix would have been to use a risk assessment to prioritize mission-critical systems and to staff those systems with dedicated engineers who were not oversubscribed.  Second, there was no peer review by independent experts.  This is a process problem in product release:  each system should have had a quality assurance and testing phase that included outside audit and review.

            “Personnel were not trained sufficiently in areas such as the relationship between the operation of the mission and its detailed navigational characteristics, or the process of filing formal anomaly reports” (Isbell & Savage, 1999).  In Six Sigma, anomaly reporting feeds data collection and analysis (Sherman, n.d.), and the way anomaly data is gathered is defined in the data collection plan (Hessing, n.d.).  The training gap should have been addressed in the project plan, with training on mission operations, navigation, and the anomaly reporting process included as planned work.

            “The process to verify and validate certain engineering requirements and technical interfaces between some project groups, and between the project and its prime mission contractor, was inadequate” (Isbell & Savage, 1999).  Like the failure of the systems engineering function, this failure is a matter of system-wide specifications and project management, including designing, engineering, and testing the interconnections, interfaces, and dependencies between components in the system.  Addressing it requires more deliberate planning of those interfaces, as well as more testing of them.

            In summary, quality management through tools and methodologies like Six Sigma, Value Stream Mapping (VSM), Resilience Engineering, and Failure Mode & Effects Analysis (FMEA) may have been useful in identifying defects and problems in the design, engineering, production, and testing portions of the NASA project.

Considerations Before Implementing Suggested Tools and Methodologies

            Before implementing tools and methodologies like Failure Mode & Effects Analysis (FMEA), Six Sigma, Value Stream Mapping (VSM), and resilience engineering, it is important to identify the quality management processes already in place within the project at NASA.  For instance, if the project team is already using Scrum for project management along with Six Sigma and an ISO 9001 approach to process management, then it would be worthwhile to examine the existing processes and add Value Stream Mapping (VSM) to the existing process documentation, if it is not already part of their ISO 9001 approach.  The biggest improvement to this imaginary project and process management stack (Scrum, Six Sigma, ISO 9001 with VSM) would be Failure Mode & Effects Analysis (FMEA).  Therefore, it is important to evaluate what NASA already has in place.

            In addition to identifying what quality management is already in place, it is important to evaluate the effectiveness of the project management and engineering teams in implementing the quality management already in place.  Undoubtedly NASA already has some quality management methodologies implemented on the project.  The failures may only be in implementation of those methodologies.

Conclusion

            NASA’s Mars Climate Orbiter failure was a failure in quality management around design, engineering, production, and testing processes between NASA and Lockheed Martin (Dodd, 2020).  Had proper engineering, review, and testing been in place, someone would likely have identified the root cause at some point:  imperial units and metric units mixed between systems.  In order to improve on NASA’s quality management, Failure Mode & Effects Analysis (FMEA), Six Sigma, Value Stream Mapping (VSM), and resilience engineering should be used in conjunction with existing management practices.

References

5 Effective Communication Methods In Project Management. (n.d.). Your Software PM Mentor. Retrieved from https://pmbasics101.com/communication-methods-project-management/

Dodd, T. (2020, May 14). Metric vs. Imperial Units:  How NASA lost a 327 Million Dollar Mission to Mars. Everyday Astronaut. Retrieved from https://everydayastronaut.com/mars-climate-orbiter/

Forrest, G. (n.d.). FMEA (Failure Mode and Effects Analysis) Quick Guide. iSixSigma. Retrieved from https://www.isixsigma.com/tools-templates/fmea/fmea-quick-guide/

Hessing, T. (n.d.). Data Collection Plan. Six Sigma Study Guide. Retrieved from https://sixsigmastudyguide.com/data-collection-plan/

Isbell, D. & Savage, D. (1999, November 10). Mars Climate Orbiter Failure Board Releases Report, Numerous NASA Actions Underway in Response. NASA. Retrieved from https://www.nasa.gov/home/hqnews/1999/99-134.txt

Kumar, P. (2022, December 8). What is Six Sigma:  Everything You Need to Know About It. Simplilearn. Retrieved from https://www.simplilearn.com/what-is-six-sigma-a-complete-overview-article

Lay, E. (2019, November 9). Hollnagel: What is Resilience Engineering. Resilience Engineering Association. Retrieved from https://www.resilience-engineering-association.org/blog/2019/11/09/what-is-resilience-engineering/

Sherman, P. (n.d.). Interpreting Anomalies Correctly Can Help Avoid Waste. iSixSigma. Retrieved from https://www.isixsigma.com/tools-templates/graphical-analysis-charts/interpreting-anomalies-correctly-can-help-avoid-waste/

What is Value Stream Mapping (VSM)? (n.d.). ASQ. Retrieved from https://asq.org/quality-resources/lean/value-stream-mapping

Published by Art Ocain

