For most of the past few weeks I have been moderating an intense and evolving debate on software quality for one of my clients. The software engineering organization is struggling to balance speed of development (which is in many ways their proxy for market responsiveness), quality of deliverables, and lifecycle cost (their deliverables are potentially in use for many years and have a significant ongoing engineering cost). The software’s requirements are often fluid, during and after development, and it’s critical that it be both highly reliable and scalable to millions of users. With a few thousand engineers involved, it’s also critical that some common understanding of “quality” is in place and agreed to.
A lot of the debate has focused on testing. Total Quality, however, would suggest that although testing is necessary, it is not sufficient. Testing focuses on inspection, not on prevention. To over-simplify: you test in hopes of demonstrating that the software has no defects (because you have a good, high-quality development process), not to detect the defects that are present but should not be there (because you don't).
After several years of significant effort, the code my client was developing (and testing) still isn’t of the high quality they are looking for. So we decided to go back, start from some basics, and look again at the issue of software quality.
First we developed a series of “quality gates” that the software development process had to pass before the code could be delivered to production. Most of these quality characteristics were already being looked at, but in a haphazard fashion and with different degrees of emphasis. Our work has standardized these characteristics and how they are established. The gates go something like this:
Does the software work? This is the code quality gate. The code has to compile cleanly and pass a set of specific (and highly automated) code structure and execution tests to ensure that the organization’s standards are being followed. The code also has to integrate into the automated build process the teams use. With modern software platform technologies, the teams also have to adhere to object naming conventions and namespace management rules, so these get checked as well. Passing this gate will not eliminate every possible coding defect (some of the tests are heuristic, not deterministic), but it should catch the most critical coding mistakes and bad practices. The biggest issue here was teaching the developers adequate testing skills and getting them to actually run the tests. That eventually led them to adopt a Test-Driven Development (TDD) approach and to combine the development and test organizations into a single project organization, a change that is still in progress.
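To make the flavor of this gate concrete, here is a minimal sketch of the kind of automated structure check it might run, written in Python purely for illustration. The naming rules (snake_case functions, CamelCase classes) and the script itself are hypothetical stand-ins, not the client’s actual tooling.

```python
# Hypothetical example of an automated "code structure" check that a
# code-quality gate might run in CI. The naming conventions are stand-ins
# for whatever rules the organization actually enforces.
import ast
import re
import sys

FUNC_RE = re.compile(r"^[a-z_][a-z0-9_]*$")    # snake_case functions
CLASS_RE = re.compile(r"^[A-Z][A-Za-z0-9]*$")  # CamelCase classes

def check_naming(path: str) -> list[str]:
    """Return a list of naming violations found in one Python source file."""
    with open(path, encoding="utf-8") as f:
        tree = ast.parse(f.read(), filename=path)
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if not FUNC_RE.match(node.name):
                problems.append(f"{path}:{node.lineno} function '{node.name}' is not snake_case")
        elif isinstance(node, ast.ClassDef):
            if not CLASS_RE.match(node.name):
                problems.append(f"{path}:{node.lineno} class '{node.name}' is not CamelCase")
    return problems

if __name__ == "__main__":
    violations = [v for path in sys.argv[1:] for v in check_naming(path)]
    for v in violations:
        print(v)
    sys.exit(1 if violations else 0)  # a non-zero exit fails the gate in the build
```

The point is not the specific rules but the mechanism: the check is cheap, deterministic where it can be, and wired into the build so a failure blocks the gate rather than relying on someone remembering to look.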
Does the software do what it’s supposed to? This is the functional quality gate. The Product Owners (the business managers who specified what the code was supposed to do in the first place) have to provide test definitions and test cases, which they do with the help of the project team. Here we also ensure that the code is state independent: the same initial conditions always produce the same results, and changes to initial conditions that are not supposed to affect the code have no effect on the results. This is quite hard to do well, and we have had some major silo-busting to do within the product management organization to make sure that everyone who should be involved gets (and stays) involved.
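Here is a small, hypothetical illustration of what a state-independence check can look like in test form. The price_order function is invented for the example; the point is simply that repeated runs with the same inputs, and runs where only “irrelevant” inputs change, must produce identical results.

```python
# Hypothetical sketch of a state-independence check: the same inputs should
# always produce the same result, and changes to inputs that are supposed to
# be irrelevant should not change the result. `price_order` is an invented
# example function, not the client's actual code.
import unittest

def price_order(quantity: int, unit_price: float, customer_note: str = "") -> float:
    """Toy function under test: the note field must not affect the price."""
    return round(quantity * unit_price, 2)

class StateIndependenceTest(unittest.TestCase):
    def test_same_inputs_same_result(self):
        first = price_order(3, 19.99)
        second = price_order(3, 19.99)
        self.assertEqual(first, second)

    def test_irrelevant_input_has_no_effect(self):
        baseline = price_order(3, 19.99, customer_note="")
        varied = price_order(3, 19.99, customer_note="gift wrap, please")
        self.assertEqual(baseline, varied)

if __name__ == "__main__":
    unittest.main()
```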
Can everyone use it effectively? This is the usability characteristic of software quality. Even if the code passes the first two gates, it’s not worth putting into production if it’s so hard to use that no one will actually use it. Usability testing involves both automated structure analysis and actual observed use by a panel of user testers. It’s time-consuming at present, and we are looking at ways to make this aspect of quality assessment less onerous.
Did we handle all the non-functional requirements? This gate is mostly about security, privacy, regulatory compliance, contractual compliance and similar concerns. There are quite a few mandatory requirements that all code has to satisfy before moving to production, and we developed another set of automated tests to support this gate.
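As an illustration of the sort of automated non-functional check involved, here is a hedged sketch that verifies a staging endpoint returns a mandated set of security headers. The URL and the header list are assumptions for the example, not the client’s actual compliance rules.

```python
# Hypothetical sketch of an automated non-functional check: verify that a
# staging endpoint returns the security headers the organization mandates.
# The URL and the required-header list are illustrative assumptions only.
import sys
import urllib.request

REQUIRED_HEADERS = {
    "Strict-Transport-Security",
    "X-Content-Type-Options",
    "Content-Security-Policy",
}

def missing_security_headers(url: str) -> set[str]:
    with urllib.request.urlopen(url) as response:
        # Normalize header-name casing before comparing.
        present = {name.title() for name, _ in response.getheaders()}
    return REQUIRED_HEADERS - present

if __name__ == "__main__":
    missing = missing_security_headers("https://staging.example.com/health")
    for name in sorted(missing):
        print(f"missing required header: {name}")
    sys.exit(1 if missing else 0)  # any missing header fails the gate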
Will the new code run adequately on the production platform? This is the deployability and production quality gate. We check that the economic cost of the production environment is within a reasonable range (the company’s operations group does not have an infinite budget for hardware and bandwidth) and ensure that we can afford to put the code into production. And when we do, we test whether it will scale, be manageable, and not break anything else that shares platform resources.
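To give a sense of what an automated deployability check might look like, here is an illustrative scalability smoke test: fire a burst of concurrent requests at a staging endpoint and fail the gate if the 95th-percentile latency exceeds a budget. The endpoint, request volume, and latency budget are invented for the example.

```python
# Hypothetical scalability smoke test for the deployability gate. All the
# constants below (URL, request count, latency budget) are assumptions for
# illustration, not the client's real figures.
import statistics
import sys
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "https://staging.example.com/health"
REQUESTS = 200
WORKERS = 20
P95_BUDGET_SECONDS = 0.5

def timed_request(_: int) -> float:
    """Issue one request and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(URL, timeout=10) as response:
        response.read()
    return time.perf_counter() - start

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        latencies = list(pool.map(timed_request, range(REQUESTS)))
    p95 = statistics.quantiles(latencies, n=100)[94]  # 95th percentile
    print(f"p95 latency: {p95:.3f}s over {REQUESTS} requests")
    sys.exit(1 if p95 > P95_BUDGET_SECONDS else 0)
```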
So if we pass all of these gates we should have “quality” production code. Now we apply one last test:
Can we maintain it cost effectively? We know from experience that we will make regular changes to the production code over its lifetime and we want to be able to do this quickly, efficiently and without introducing defects.
This is the gate that has generated the most debate. The software engineering organization requires that the development team provide support for production code for the first 90 days after release to production and support the sustaining engineering group for a further 90 days as responsibility for support transitions. This is a new model; previously the developers moved on immediately to a new project and responsibility went straight to sustaining engineering, who never felt that the code was well enough documented (and structured and specified) to make their jobs easy. The new process is already teaching developers to pay attention to support issues, something they never needed to understand before.
This probably seems like a lot of additional work for the development organization. And it is. But the costs of defective code in the production environment (both direct, in terms of sustaining engineering, and indirect, in terms of lost revenue and reputation) were becoming significant. Management felt it had no choice but to focus on improved quality first and turn to productivity issues later. So far, despite the added burden of all these quality-related activities (many of which already took place, but simply weren’t very effective), we have not seen a slowdown in software availability. Some teams are actually moving faster than before, despite the new work they have to perform, because they spend much less time and effort on remediation and last-minute adjustments to fix issues that only show up as the code is transitioned to production.
Note that at this stage we are not attempting to quantify software quality. Gates are “Pass/Fail” (although the release managers can provisionally accept “failed” code under a few circumstances). We are tracking defects detected after release to production and tracing them back to continuously improve the development (and, over time, the sustaining engineering) processes. Eventually, we hope to have enough information on the overall performance of production code to calculate and monitor a “quality” index. For now, however, pass/fail is good enough, and it lets us focus on immediate process improvements (mostly better defect-avoidance practices in design and coding), better test automation, and improved productivity (which, with the fixed-scope model we are using, translates to reduced development time).
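For what it’s worth, the roll-up logic behind a pass/fail release decision can stay very simple. The sketch below is illustrative only (the gate names and the provisional-acceptance flag are assumptions about how one might record the decision), but it captures the rule described above: every gate must pass, or be explicitly accepted by a release manager.

```python
# Hypothetical sketch of rolling up pass/fail gate results into a release
# decision, including a release manager's provisional acceptance of a failed
# gate. The gate names and data model are illustrative only.
from dataclasses import dataclass

@dataclass
class GateResult:
    name: str
    passed: bool
    provisionally_accepted: bool = False  # release manager override

def release_approved(results: list[GateResult]) -> bool:
    """Approve only if every gate passed or was provisionally accepted."""
    return all(r.passed or r.provisionally_accepted for r in results)

if __name__ == "__main__":
    results = [
        GateResult("code quality", passed=True),
        GateResult("functional", passed=True),
        GateResult("usability", passed=False, provisionally_accepted=True),
        GateResult("non-functional", passed=True),
        GateResult("deployability", passed=True),
    ]
    print("release approved" if release_approved(results) else "release blocked")
```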
I’m going to be tracking this for the next few months, and I’ll let you know how things develop.