12 Lessons Learned From Software Meltdowns
Hackers broke into the administrative console of Code Spaces’ Amazon Web Services account in June 2014 and deleted all the company’s files, including its backups. Code Spaces never recovered and went out of business. LESSON: Store disaster recovery materials in a different cloud, on premises or at another secure location.
Mt. Gox, once the largest bitcoin exchange, calculated account balances by tracking user activity. It lacked a place to keep a permanent record of each transaction, so when hackers stole money from user accounts, there was no transaction history to prove that the money had ever been in those accounts. The now-defunct exchange lost $500 million of the Internet currency. LESSON: Keep a permanent, auditable history of every change. Use version control to know exactly what changes were made to source code, and when.
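A permanent record of each transaction can be as simple as an append-only ledger from which balances are recomputed. The following is a minimal sketch of that idea, not a description of any real exchange's system; the `Entry` and `Ledger` names and their fields are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Entry:
    """One immutable ledger record (hypothetical schema)."""
    account: str
    amount: Decimal  # positive = deposit, negative = withdrawal
    memo: str


class Ledger:
    """Append-only ledger: entries are added, never edited or removed."""

    def __init__(self):
        self._entries = []

    def record(self, account, amount, memo):
        self._entries.append(Entry(account, Decimal(amount), memo))

    def balance(self, account):
        # The balance is derived from the history, so it can be audited.
        return sum(
            (e.amount for e in self._entries if e.account == account),
            Decimal("0"),
        )

    def history(self, account):
        return [e for e in self._entries if e.account == account]


ledger = Ledger()
ledger.record("alice", "2.5", "deposit")
ledger.record("alice", "-1.0", "withdrawal")
print(ledger.balance("alice"))  # 1.5
```

Because every balance is a sum over recorded entries, a disputed balance can be checked against `history()` instead of being taken on faith.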
A bug in General Electric’s energy management system stalled the alarm system in a local utility’s control room. Result: a series of server failures that went unannounced because the alarms stayed silent, and 50 million Americans and Canadians were without power for two days. LESSON: Alarms and other fail-safes can fail. Since each system is a potential single point of failure, additional layers are necessary to guarantee alerts for catastrophic system failures.
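One common extra layer is a heartbeat check that runs independently of the alarm system and complains when the alarm system itself goes quiet. The sketch below is illustrative only; the `HeartbeatMonitor` class, its 30-second threshold and the timestamps are assumptions, not details from the incident.

```python
class HeartbeatMonitor:
    """Tracks when a monitored system last reported that it was alive."""

    def __init__(self, max_silence_s, started_at):
        self.max_silence_s = max_silence_s
        self.last_beat = started_at

    def beat(self, now):
        # Called by the monitored system each time it completes a cycle.
        self.last_beat = now

    def is_stalled(self, now):
        # True when the monitored system has been silent for too long.
        return now - self.last_beat > self.max_silence_s


monitor = HeartbeatMonitor(max_silence_s=30, started_at=0)
monitor.beat(now=10)
print(monitor.is_stalled(now=25))   # False
print(monitor.is_stalled(now=100))  # True
```

The monitor should run on separate hardware from the system it watches, or it becomes part of the same single point of failure.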
A technician failed to update the code on one of Knight Capital’s eight servers dedicated to high-frequency trading. The company unknowingly executed orders in 45 minutes instead of over several days. Result: $440 million lost. LESSON: Automate as much as possible, but require peer review for major code changes and put procedures in place to ensure that steps are completed for manual work.
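An automated pre-launch check can catch a server that was missed during a manual rollout. This is a minimal sketch under assumed names: the server inventory, the build identifiers and the `find_stale_servers` function are all hypothetical.

```python
def find_stale_servers(deployed, expected):
    """Return the servers whose reported build differs from the expected one."""
    return sorted(name for name, build in deployed.items() if build != expected)


# Hypothetical inventory: server name -> build identifier it reports.
deployed = {f"server-{i}": "build-2" for i in range(1, 8)}
deployed["server-8"] = "build-1"  # the one server that was missed

stale = find_stale_servers(deployed, expected="build-2")
if stale:
    print("Do not enable trading; stale servers:", stale)
```

Making the launch procedure refuse to continue while `stale` is non-empty turns a checklist item into an enforced step.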
The now-defunct Vancouver Stock Exchange programmed its system to truncate its index at three decimal places instead of following the standard practice of rounding. The index lost an average of 0.0005 points on each of the 2,800 daily trades. Two years later, the exchange corrected the error and the index shot up over 500 points overnight. LESSON: Check and double-check design implementations. Simulate whenever possible to ensure that they abide by common sense and produce the results you want.
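A short simulation shows how quickly the loss accumulates: 0.0005 points on each of 2,800 trades is about 1.4 points a day. The sketch below applies the same price changes three ways, assuming both truncation and rounding happen at the third decimal place; the starting value, the size of each change and the random seed are illustrative assumptions.

```python
import random
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

THOUSANDTH = Decimal("0.001")


def simulate(updates, seed=1):
    """Apply identical index changes exactly, with truncation and with rounding."""
    rng = random.Random(seed)
    exact = truncated = rounded = Decimal("1000.000")
    for _ in range(updates):
        # Hypothetical change of up to +/-0.01 points, six decimal places.
        delta = Decimal(rng.randint(-10_000, 10_000)) / Decimal(1_000_000)
        exact += delta
        truncated = (truncated + delta).quantize(THOUSANDTH, rounding=ROUND_DOWN)
        rounded = (rounded + delta).quantize(THOUSANDTH, rounding=ROUND_HALF_EVEN)
    return exact, truncated, rounded


exact, truncated, rounded = simulate(2_800)  # one day of trades, per the text
print("drift with truncation:", exact - truncated)
print("drift with rounding:  ", exact - rounded)
```

The truncated index falls steadily behind the exact value, while the rounded index stays close to it because rounding errors largely cancel out.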
In 1996, the European Space Agency’s rocket Ariane 5 exploded after takeoff because of the software in the inertial reference system. A 64-bit floating-point number was converted to a 16-bit signed integer, but the value was too large for the destination format. The conversion failed, and the $7-billion rocket veered off its flight path and blew up. LESSON: Know your systems, their limits, input and output ranges, and make sure your data types support every possible value at both ends of each calculation.
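A 16-bit signed integer can hold only values from -32,768 to 32,767, so any narrowing conversion needs an explicit range check. The following sketch illustrates the check in Python; the `to_int16` helper is hypothetical and is not the flight software's code.

```python
INT16_MIN, INT16_MAX = -32_768, 32_767


def to_int16(value):
    """Convert to a 16-bit signed integer, refusing values that do not fit."""
    result = int(value)  # truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return result


print(to_int16(1234.9))  # 1234
try:
    to_int16(40_000.0)
except OverflowError as exc:
    print("rejected:", exc)
```

Rejecting the value is only half the job: the caller still has to decide what a safe fallback is when the conversion is refused.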
In 2007, a coding error was introduced into the quantitative fund system of AXA Rosenberg, which replaced human judgment with statistical algorithms for stock investment decisions. When the error was discovered two years later, the company took six months to fix the problem, while trying to hide the truth from investors. Result: a record $242-million fine. LESSON: Invest in code review and use it. Properly performing code can mean the difference between success and failure. And honesty really is the best policy.
NASA’s Mars Climate Orbiter, a robotic space probe, burned up when it overshot its target due to a coding error. The ground-based team used imperial units instead of metric units, causing a discrepancy between the planned and actual altitude. LESSON: Although the contract between NASA and Lockheed Martin called for the use of metric units, that critical detail fell through the cracks. Consistent collaboration between teams can catch such mistakes because it fosters better communication on critical project requirements and components.
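One way to keep a unit mix-up like this from slipping through is to make the unit part of the interface, so a bare number cannot cross between teams. The sketch below uses impulse as the example quantity; the `Impulse` type and its constructor names are hypothetical.

```python
from dataclasses import dataclass

# Newton-seconds per pound-force second (1 lbf = 4.4482216152605 N).
N_S_PER_LBF_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """An impulse stored internally in SI units (hypothetical type)."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value):
        return cls(float(value))

    @classmethod
    def from_pound_force_seconds(cls, value):
        # The conversion happens once, at the boundary where the data enters.
        return cls(float(value) * N_S_PER_LBF_S)


thrust_report = Impulse.from_pound_force_seconds(10.0)
print(round(thrust_report.newton_seconds, 3))  # 44.482
```

Callers must state which unit they are supplying, and everything downstream reads a single, unambiguous field.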
The $2 billion-plus cloud computing system for military communication in Iraq and Afghanistan suffered from reliability problems. Server failures forced users to reboot every 5.5 hours, and workstation operators experienced a failure every 10 hours. LESSON: Designing for high availability with reliable fail-over should always be a key goal for critical systems. Stress, load and other forms of testing are crucial to provide guaranteed uptime in high-risk operations.
NASDAQ’s computer systems had been tested for volumes of only up to 40,000 orders, but on the morning of Facebook’s historic IPO, 496,000 orders overwhelmed them. The result: an error that prevented traders from learning the true opening price. LESSON: Overestimate load factors. Your testing regimen should at least meet, or preferably exceed, expected real-world use. When in doubt, add another zero or two to load factors.
Early Therac machines had hardware interlocks to prevent the high-power electron beam from being activated in place of the intended low-power beam. The later version, the Therac-25, did not have the hardware and relied instead on software. When the software failed, three people died of excess radiation. LESSON: High-risk operations should always have many layers of protection systems. The Therac-25’s hardware and software were not tested jointly until the machine was assembled at the hospital, far too late in the development process.
Due to deficiencies in its electronic throttle control system, Toyota recalled millions of vehicles with hybrid anti-lock brake software. The problem: a single bit could be flipped in memory by an intense ray of sunlight, causing unintended acceleration. Also, sometimes the real-time OS overwrote critical data because it could not store all the data the vehicle produced. LESSON: Install multiple layers of redundancy and protection in decision-making systems. Errors could have been avoided by knowing the system component limitations, choosing other components and enforcing stack limits.
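A simple software layer of protection against a flipped bit is to store each critical value alongside its bitwise complement and verify the pair on every read. The sketch below illustrates the technique in Python for a 16-bit value; the `MirroredValue` class is hypothetical, and real embedded code would typically do this in C with further safeguards.

```python
MASK = 0xFFFF  # treat values as 16-bit for this illustration


class MirroredValue:
    """Keeps a value and its bitwise complement to detect corruption."""

    def __init__(self, value):
        self.set(value)

    def set(self, value):
        self.primary = value & MASK
        self.mirror = ~value & MASK

    def get(self):
        # An intact pair always XORs to all ones.
        if self.primary ^ self.mirror != MASK:
            raise RuntimeError("memory corruption detected")
        return self.primary


throttle = MirroredValue(0x1234)
print(hex(throttle.get()))  # 0x1234

throttle.primary ^= 0x0004  # simulate a single flipped bit
try:
    throttle.get()
except RuntimeError as exc:
    print(exc)  # memory corruption detected
```

Detection alone does not make the system safe; the caller still needs a defined fail-safe action, such as falling back to an idle throttle, when the check fails.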