12 Lessons Learned From Software Meltdowns

 
 
By Karen A. Frenkel  |  Posted 12-12-2014
  • Code Spaces: A Killing in the Cloud

    Hackers broke into the administrative console of Code Spaces' Amazon Web Services account in June 2014 and deleted all of the company's files, including its backups. Code Spaces never recovered and went out of business. LESSON: Store disaster recovery materials in a different cloud, on premises or at another secure location.
  • Bitcoin Exchange Collapse

    Mt. Gox, once the largest bitcoin exchange, calculated account balances by tracking user activity but kept no permanent record of each transaction. When hackers stole money from user accounts, there was no transaction history to prove the funds had ever been there. The now-defunct exchange lost $500 million worth of the Internet currency. LESSON: Use version control to know exactly what changes were made to source code, and when.
  • Northeast Blackout of 2003

    A bug in General Electric's energy management system stalled the alarm system in a local utility's control room. The result: a cascade of server failures went unnoticed because the alarms never fired, and 50 million Americans and Canadians were without power for up to two days. LESSON: Alarms and other fail-safes can fail. Because each system is a potential single point of failure, additional layers are necessary to guarantee that catastrophic failures actually trigger alerts.
  • Accelerated High-Frequency Trading

    A technician failed to replace old code on one of the eight servers Knight Capital dedicated to high-frequency trading. The company unknowingly executed in 45 minutes a flood of orders that was meant to be spread over several days. Result: roughly $440 million lost. LESSON: Automate as much as possible, but require peer review for major code changes, and for manual work put procedures in place to ensure that every step is completed.
  • Vancouver Stock Exchange Index

    The now-defunct exchange programmed its system to truncate the index to three decimal places instead of following the standard practice of rounding. The index lost an average of 0.0005 points on each of roughly 2,800 daily trades. Two years later, the exchange corrected the error, and the index shot up more than 500 points overnight. LESSON: Check and double-check design decisions. Simulate whenever possible to ensure that implementations abide by common sense and produce the results you want.
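    The effect of such a bug is easy to demonstrate with a short, hypothetical simulation (the figures below are invented for illustration, not taken from exchange records): each update chops the index to three decimal places, silently discarding about half a thousandth of a point per trade.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP
import random

def truncate3(x: Decimal) -> Decimal:
    # Mimic the bug: chop to three decimal places, discarding the remainder.
    return x.quantize(Decimal("0.001"), rounding=ROUND_DOWN)

def round3(x: Decimal) -> Decimal:
    # Standard practice: round to the nearest thousandth instead.
    return x.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)

random.seed(1)  # deterministic illustration
index_truncated = index_rounded = Decimal("1000.000")
for _ in range(2800):  # roughly one day's worth of trades
    change = Decimal(str(round(random.uniform(-0.05, 0.05), 6)))
    index_truncated = truncate3(index_truncated + change)
    index_rounded = round3(index_rounded + change)

# Truncation discards about 0.0005 points per update on average, so the
# truncated index drifts steadily below the correctly rounded one: after
# 2,800 trades the gap is already on the order of 1.4 points.
print(index_rounded - index_truncated)
```

    Compounded over roughly two years of trading days, a per-day drift of this size accounts for an error of hundreds of points, which is why the correction was so dramatic.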
  • European Space Agency Rocket Explosion

    In 1996, the European Space Agency's Ariane 5 rocket exploded shortly after liftoff because of a fault in its inertial reference system's software. A 64-bit floating-point number was converted to a 16-bit signed integer, but the value was too large for the destination format. The conversion failed, the rocket veered off its flight path, and the product of a $7 billion development program blew up. LESSON: Know your systems and their limits, including input and output ranges, and make sure your data types can represent every possible value at both ends of each calculation.
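    A minimal sketch of the kind of range-checked conversion the lesson calls for, in Python rather than the Ada actually used on board: the cast fails loudly, with an exception the caller can handle, instead of producing a silently wrong value or an unhandled trap.

```python
INT16_MIN, INT16_MAX = -2**15, 2**15 - 1  # range of a signed 16-bit integer

def to_int16_checked(value: float) -> int:
    """Convert a float to a signed 16-bit integer, failing loudly on overflow."""
    result = int(value)  # truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a signed 16-bit integer")
    return result

print(to_int16_checked(1234.5))    # fits comfortably
try:
    to_int16_checked(65536.0)      # out of range: the check fires
except OverflowError as exc:
    print(f"caught: {exc}")        # handled, rather than crashing the system
```

    The point is not the check itself but where it sits: every conversion whose input range is not provably within the output type's range needs one.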
  • Fund System Error

    In 2007, a coding error was introduced into the quantitative fund system of AXA Rosenberg, which replaced human judgment with statistical algorithms for stock investment decisions. When the error was discovered two years later, the company took six months to fix the problem while trying to hide the truth from investors. Result: a record $242 million fine. LESSON: Invest in code review and use it. Correctly functioning code can mean the difference between success and failure. And honesty really is the best policy.
  • Mars Robotic Orbiter

    NASA's Mars Climate Orbiter burned up when it overshot its target because of a coding error: the ground-based team used English imperial units instead of metric units, causing a discrepancy between the planned and actual altitude. LESSON: The contract between NASA and Lockheed Martin called for metric units, but that critical detail fell through the cracks. Consistent collaboration between teams fosters better communication on critical project requirements and components, and can catch such mistakes.
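    One defense against unit mix-ups is to make the unit part of the type. A minimal Python sketch (the class and names here are illustrative, not NASA's code): values are stored in one canonical SI unit, and English-unit input can enter only through an explicit, named conversion.

```python
from dataclasses import dataclass

LBF_S_TO_N_S = 4.4482216152605  # 1 pound-force second expressed in newton-seconds

@dataclass(frozen=True)
class Impulse:
    """A thruster impulse stored in one canonical unit: newton-seconds."""
    newton_seconds: float

    @classmethod
    def from_pound_force_seconds(cls, lbf_s: float) -> "Impulse":
        # English units can only enter through this explicit, named conversion.
        return cls(lbf_s * LBF_S_TO_N_S)

# Ground software producing pound-force seconds must say so in code;
# everything downstream only ever sees newton-seconds.
burn = Impulse.from_pound_force_seconds(10.0)
print(round(burn.newton_seconds, 3))  # 44.482
```

    With this discipline, a raw number with ambiguous units can never cross a team boundary unannounced.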
  • U.S. Army's Distributed Common Ground System

    The $2 billion-plus cloud computing system for military communication in Iraq and Afghanistan suffered from reliability problems: server failures forced users to reboot every 5.5 hours, and workstation operators experienced a failure every 10 hours. LESSON: Designing for high availability with reliable failover should always be a key goal for critical systems. Stress, load and other forms of testing are crucial to guaranteeing uptime in high-risk operations.
  • Facebook's IPO on NASDAQ

    NASDAQ's computer systems had been tested for volumes of up to 40,000 orders, so on the morning of Facebook's historic IPO, 496,000 orders overwhelmed the system. The result: an error that prevented traders from learning the true opening price. LESSON: Overestimate load factors. Your testing regimen should at least meet, and preferably exceed, expected real-world use. When in doubt, add another zero or two to your load factors.
  • Therac-25 Radiation Therapy Disaster

    Earlier Therac machines had hardware interlocks that prevented the high-power electron beam from activating when the low-power beam was intended. The Therac-25 dropped those interlocks and relied on software alone. When the software failed, three people died of radiation overdoses. LESSON: High-risk operations should always have many layers of protection. The Therac-25's hardware and software were not tested together until the machine was assembled at the hospital, far too late in the development process.
  • Toyota Vehicle Recall

    Because of deficiencies in its electronic throttle control system, Toyota recalled millions of vehicles with hybrid anti-lock brake software. The problem: a single bit could be flipped in memory (by a stray cosmic ray, for example), causing unintended acceleration. In addition, the real-time OS sometimes overwrote critical data because it could not store everything the vehicle produced. LESSON: Install multiple layers of redundancy and protection in decision-making systems. These errors could have been avoided by knowing the components' limitations, choosing different components and enforcing stack limits.
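    One classic layer of protection against single-bit flips is triple modular redundancy: store a critical value three times and take a majority vote when reading it back. A minimal sketch (illustrative, not Toyota's code):

```python
from collections import Counter

def majority_vote(a: int, b: int, c: int) -> int:
    """Return the value that at least two of three redundant copies agree on."""
    value, count = Counter((a, b, c)).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no two copies agree; fail safe instead of guessing")
    return value

throttle = 0b0110                 # the intended value, stored three times
corrupted = throttle ^ 0b0100     # one copy suffers a single-bit flip
print(majority_vote(throttle, corrupted, throttle))  # 6: the flip is outvoted
```

    A flip in any one copy is outvoted by the other two; if all three disagree, the system fails safe rather than acting on corrupt data.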
 

Software for high-risk operations and mission-critical services requires a level of discipline and care above and beyond normal product development. The 12 software malfunctions above (some coupled with hardware faults) and the lessons learned come from Perforce Software, which offers a development management and collaboration platform. "This awful menagerie of failures tells us that the human beings who write code aren't getting any better at avoiding mistakes," says John Williston, product marketing manager at Perforce. "Rather than testing only for the expected and hoping for the best, we should expect failure, plan for it, force our systems into failure and harden them against it." Is that overkill for the average application? Williston says it depends on how loyal you think your customers are, and whether they'll return to your app after it malfunctions. When significant resources or human lives are on the line, he points out, such caution is the uncommon sense the software industry desperately needs.

 
 
 
 
 
Karen A. Frenkel writes about technology and innovation and lives in New York City.

 
 
 
 
 
 
