Perfect Software and Other Illusions About Testing (2008) is one of the books I read when I first got interested in software testing. It’s written by Gerald M. Weinberg, whose The Psychology of Computer Programming I’ve also read.

Perfect Software and Other Illusions About Testing book cover

The book focuses on the psychological aspects of why software testing is often done poorly and explains what testing is really about. Weinberg illustrates this with many stories, relatable to anyone with some years in the software industry, to show the flawed thinking around testing.

While the book does not cover specific tools or techniques, the information given is still applicable to any software project today. A useful read for anyone who is interested in software quality.

Here are my notes from a recent reread.

Notes

Chapter 1: Why Do We Bother Testing?

  • No matter how hard we try to do a perfect job, we will sometimes make mistakes.
  • Recognize your own fallibility and that others are no different.
  • Testing is fundamentally about gathering information to help make better decisions.
  • While we can make decisions without testing, having additional information can improve our chances of a satisfactory outcome.
  • Information helps to reduce risk.
  • Testing is one of the ways to obtain that risk-reducing information.
  • Good testing involves balancing the need to mitigate risk against the risk of trying to gather too much information.
  • Before you start testing, think about the risks of the product and whether testing can help address them.
  • Assessment of risk is subjective, because it’s about predicting the future.
  • Risk is also subjective because the same risk feels different to different people.
  • Questions that testing can help answer:
    • Does the software do what we want it to do?
    • If the software doesn’t do what we want it to do, how much work will be involved in fixing it?
    • Does the software not do what we don’t want it to do?
    • Does the software do what we intended?
    • Will the software do what our customers want?
    • Will the software satisfy other business needs?
    • What are the likelihood and consequences of failure?
  • How much testing we do depends on both how likely failure is and how serious the consequences would be:
| Likelihood of Failure | Consequences of Failure | Amount of Testing |
| --- | --- | --- |
| Low | Low | Least |
| Low | High | Moderate |
| High | Low | Moderate |
| High | High | Most |
  • If humans were perfect thinkers, we wouldn’t need to test our work.
  • But we are imperfect and irrational human beings. Therefore, we test, and we test our testing.
  • The tester’s job is to inform. The manager’s job is to decide.
  • Identify all the information needed for decision-making.
  • Prioritize risks correctly.
  • Testing itself does not improve a product; the improving is done by fixing the bugs that testing has uncovered.
  • Testing happens throughout development, not just at the end.

Chapter 2: What Testing Cannot Do

  • Information doesn’t reduce risk if it’s ignored.
  • Testing costs time and money; consider this when deciding what to test.
  • Information can also be a burden: knowing about a problem requires you to act on it, which can be worse than not knowing at all.
  • Testing is not fixing.
  • More data doesn’t mean better information.
  • Question what’s missing in test reports; the omissions can reveal information about the testers.
  • Our decisions are emotional, not rational.
  • People are emotionally invested in not finding out that they’ve made mistakes.
  • Some developers respond to bugs by working more carefully, while others get defensive or try to hide them.
  • Poor testing can give you the impression that a product is better or worse than it actually is.
  • Don’t bother testing if you are not going to evaluate or act on the results.
  • Build trust with your testers; their findings are only useful if the team accepts them.
  • Don’t manipulate testers to get the answers you want.
  • Try to stay calm and rational when making decisions.
  • Evaluate the quality of test data, ask how numbers were obtained and what they actually mean.
  • Prepare adequately before testing by understanding the purpose of the application you’re testing.
  • Coordinate testing with the rest of the project by allocating time and resources to fix what’s found.
  • Give testers the time they need: rushing produces dangerously misleading results.
  • When others make decisions you disagree with, they may seem irrational, but they are often being rational based on different values.
  • Test findings are not only useful to decide whether to ship, but also to customer service and support staff.

Chapter 3: Why Not Just Test Everything?

  • “Testing may convincingly demonstrate the presence of bugs, but can never demonstrate their absence.” — Edsger W. Dijkstra
  • It’s impossible to test everything because the number of possible tests for any program is infinite.
  • If you can’t see inside the program, you don’t know the exact conditions to test to cover all scenarios.
  • If a program can run on different configurations, it increases the number of possible tests by orders of magnitude.
  • The same tests performed in different order might produce different results (due to state), requiring even more tests.
  • Timing and true randomness are additional variables that make exhaustive testing infeasible.
  • So every test suite is a sample, a subset of tests that ideally is representative of all possible tests.
  • Sampling is subjective; there is no objective answer to “how many tests are enough?”
  • A sample can end up under- or over-reporting the problems in your product.
  • The goals of sampling are to cover all interesting conditions and to reduce the set of tests to a manageable, affordable level.

Chapter 4: What’s the Difference Between Testing and Debugging?

  • Testing is commonly confused with debugging, but these are distinct activities.
  • Testing for discovery: to perform actions on the software under test to discover new information.
  • Pinpointing: isolating the conditions under which the bug occurs.
  • Locating: finding the location of the bug in the code to fix it.
  • Determining significance: balancing the risks of repairing versus not repairing this bug in the code.
  • Repairing: changing code to remove the problem.
  • Troubleshooting: diagnosing a problem that’s reported by someone else.
  • Testing to learn: testing software not to find bugs, but to build a better understanding of the software.
  • In small organizations, team members easily shift between these activities.
  • In larger organizations, if responsibilities are not well-defined, it can lead to conflict and inefficiency.
  • You cannot know in advance how much time finding bugs will take; if you knew where the bugs were, you wouldn’t need to find them.
  • Task switching has a cost; if you switch too often you’ll become ineffective.
  • Testing is not a low-priority task that can be interrupted for just about any reason: it requires concentration to do reliably.
  • Repairs done in a hurry are highly likely to make things worse. Always retest after repairing.
  • Code that is designed and built to be testable can greatly reduce the time and effort associated with all aspects of testing.
  • The fact that a bug cannot be consistently reproduced is not a reason to ignore it.
  • If you want to improve the testing culture in your organization, it’s better to change things gradually.

Chapter 5: Meta-Testing

  • Testing isn’t the only way to learn about product quality.
  • The people closest to the product are often blind to its flaws.
  • The ability to find key documents is an indicator of how well an organization manages its development process.
  • It’s a warning sign when people are not alarmed by a huge number of bugs.
  • Metrics that seem meaningful can create false confidence when measured poorly.
  • When bug severity is routinely downplayed, the product looks better than it is.
  • When people only take responsibility for quality within their own part, it becomes acceptable to discover a bug and not report it.
  • A tester spending hours on the wrong thing is an organizational failure, not an individual one.
  • Code that is delivered late needs more testing, not less: if it was hard to write, it is likely hard to get right.
  • Finding more bugs does not mean fewer bugs remain.
  • Tests that pass are not evidence that the program is “correct”; it only shows that something did or did not fail under specific conditions.
  • Testing under low load does not predict performance at scale.
  • Testers need to be kept in the loop to confirm that what they reported was accurately understood and correctly acted upon.
  • When bug reports are unwelcome, they stop being filed and remain unresolved.
  • Having good developers doesn’t mean you can skip testing; even good developers make mistakes.
  • Meta-information gives you hints about the quality of information, but it requires evidence to support it.
  • Morale suffers when the team doesn’t keep up with bugs.
  • Bugs that are not recorded are a waste of time and money.
  • Recording every bug is also wasteful; find the right level of recording for each bug type.
  • Build systems that enforce a testing process, rather than relying on people to follow processes.
  • All information is interpreted and therefore subjective. Be aware of the forces that influence the information you’re using.

Chapter 6: Information Immunity

  • Whenever we feel threatened, we instinctively defend ourselves. It protects us from information we don’t want to hear.
  • People have different ways of defending themselves.
  • People get more defensive when self-esteem is low, or they’re under pressure.
  • Testing especially can trigger various defensive reactions.
  • Repression is denying or overlooking what makes us uncomfortable, consciously or not.
  • The most common reaction testers hear from developers is “I didn’t change anything.” Translation: “I didn’t change anything that should matter, that I thought was important, except for what I did change.”
  • Rationalization is constructing a justification that appears reasonable to defend a position driven by emotion.
  • Logical arguments usually don’t change a position that wasn’t arrived at by logic.
  • Projection is criticizing others for having the same flaws we dislike in ourselves.
  • Displacement is redirecting blame onto someone or something else, thereby avoiding responsibility: “the user must be doing something wrong.”
  • Overcompensation is overdoing something to make up for a perceived flaw.
  • Compulsiveness is the inability to stop a counterproductive behavior pattern.
  • Information is neutral, but people’s reactions to information are rarely neutral.
  • To properly assess testing information, you must take into account people’s emotional defenses.
  • Be aware of your own defenses and how they distort your view of information.
  • Learn to listen to objections and disagreements, as there’s usually something of value in them.

Chapter 7: How to Deal With Defensive Reactions

  • Accusing people of being defensive almost always makes it worse.
  • Fear drives defensive reactions.
  • Identify what a person fears, try to reduce that fear and see what changes.
  • If an argument is not based on logic, you cannot use logic to argue against it.
  • Common defensive reactions like “it works the way it’s designed” or “no one will care” are hard to identify as rationalizations because there’s some truth in them.
  • Tailor your communication to each individual based on what drives their defensive reactions.
  • If people react defensively, don’t insult them by saying they don’t care about quality. Instead, educate them.
  • Being under pressure is not an excuse to stop thinking. In fact, it’s a reason to think more carefully.

Chapter 8: What Makes a Good Test?

  • You cannot find every bug. How would you know you’ve found them all?
  • You cannot judge whether a test is good or bad without considering the context it’s used in.
  • You can only assess whether your testing was good after the product has been in use and real bugs have surfaced.
  • By looking back and analyzing how many and what type of bugs were missed, you can improve your testing process.
  • Unfortunately, some bugs turn up many years later.
  • Software can have bugs that were not bugs when the system shipped, due to external changes such as new hardware or operating systems.
  • To find out how effective your testers are, you could insert bugs intentionally to see how many would be detected.
  • Test quality is always an estimate, as you can never know the total number of bugs.
  • Managers shouldn’t judge testers by how many bugs they find, otherwise only poor systems make testers look good.
  • Managers mistakenly see “no bugs found” as “no value” from testers.
  • Questions to assess bad testing:
    • Does testing give me the information I’m after?
    • Is testing documented?
    • Is it honest?
    • Is it understandable? If not, how can you know whether it’s good or bad?
    • Does it cover what matters most?
    • Is it actually performed?
    • Are there inconsistencies in the results?
  • Think hard about the goals you are trying to achieve, otherwise testing becomes very difficult.
  • Good testing does not compensate for bad code.
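The bug-seeding idea above can be turned into a rough estimate of how many real bugs remain, using the same mark-recapture logic ecologists use to estimate animal populations. A minimal sketch with hypothetical numbers of my own choosing, not from the book:

```python
def estimate_total_bugs(seeded: int, seeded_found: int, real_found: int) -> float:
    """Estimate the total number of real bugs from a bug-seeding experiment.

    If testers found `seeded_found` of the `seeded` intentional bugs, assume
    they found the same fraction of the real bugs, so:
        total_real ~= real_found * seeded / seeded_found
    """
    if seeded_found == 0:
        raise ValueError("No seeded bugs found; the sample tells us nothing yet.")
    return real_found * seeded / seeded_found

# Example: 10 bugs seeded; testers find 8 of them, plus 40 real bugs.
# Estimated total real bugs: 40 * 10 / 8 = 50, so roughly 10 remain undiscovered.
print(estimate_total_bugs(seeded=10, seeded_found=8, real_found=40))
```

Like any test quality metric, this is only an estimate: it assumes seeded bugs are as hard to find as real ones, which is rarely exactly true.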

Chapter 9: Major Fallacies About Testing

  • The Blaming Fallacy: the more time spent looking for someone to blame, the less chance the actual problem gets solved.
  • The Exhaustive Testing Fallacy: the belief that it’s possible to test everything.
  • The Testing-Produces-Quality Fallacy: testing reveals quality, but does not produce it.
  • The Decomposition Fallacy: testing each part separately doesn’t tell you what happens when they work together.
  • The Composition Fallacy: if you only test the system as a whole, the parts that make up the system are not fully tested.
  • The All-Testing-Is-Testing Fallacy: the belief that all tests are equal, causing gaps in your testing (e.g., relying only on unit tests).
  • The Everything-Is-Equally-Easy-to-Test Fallacy: assuming all tests take similar effort, leading to optimistically wrong time estimates.
  • The Any-Idiot-Can-Test Fallacy: the belief that testing is just “banging keys”. Good testing is exploratory and investigative.

Chapter 10: Testing is More Than Banging Keys

  • The real work of testing is the thinking that happens before the execution.
  • Testing is not limited to software.
  • Any action that gathers information to inform a decision is a test.
  • People gather information and act on it all the time without labeling it as testing.
  • Dogfooding is a form of testing because it helps developers make decisions about the product.
  • A demonstration is not a test: its purpose is to avoid surprises, whereas a test is designed to find them.
  • Coverage metrics do not prove something is meaningfully tested.
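The coverage point is easy to demonstrate: a test can execute every line of a function and still miss its bug. A hypothetical sketch (the function, its bug, and the test are my own illustration, not from the book):

```python
def apply_discount(price: float, percent: float) -> float:
    """Intended behavior: subtract `percent` percent from `price`."""
    return price - price * percent  # Bug: should be `percent / 100`.

def test_apply_discount():
    # This test executes 100% of the lines in apply_discount...
    result = apply_discount(100.0, 0.0)
    # ...but only checks the trivial zero-discount case, so the bug survives.
    assert result == 100.0

test_apply_discount()  # Passes despite the bug: full line coverage, weak test.
```

Coverage tells you what was executed, not what was checked.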

Chapter 11: Information Intake

  • The Satir Interaction Model breaks down communication into four parts: intake, meaning, significance and response.
  • Intake involves not only observing the world, but a selection process as well.
  • Meaning is given to intake data, and interacts with the intake process. For example, it can lead us to take in more information.
  • Significance is determined to prioritize intake and its meaning, otherwise the world would be overwhelming.
  • Response is the act of formulating an action to take.
  • Miscommunication happens because people listen selectively.
  • People respond differently based on the source of the information they receive.
  • Timing makes a difference: people miss information when they’re paying attention somewhere else.
  • Too much information can overwhelm the receiver’s ability to process it.
  • Sometimes finding an answer requires actively shifting our attention, because we naturally filter out information outside our focus.
  • People tend to jump to conclusions because the human mind attaches meaning automatically.
  • When presented with an interpretation, ask for the data before making meaning of it.

Chapter 12: Making Meaning

  • Always think of at least three possible interpretations of a test result before coming to a conclusion.
  • If you don’t know what a program is supposed to do, you cannot say with certainty that it’s wrong.
  • Test-first philosophy is useful because it forces developers to write the expected results before writing a line of code.
  • When requirements are unclear, clarify them first before making meaning of test results.
  • Before reaching for more information, use what you already have.
  • Capture metadata (e.g. software version, OS, browser) in bug and test reports and automate this where possible, for full context.
  • When a result doesn’t make sense, look beyond the data.
  • Words carry different meanings to different people (e.g. “it failed”); choose your words carefully.
  • Don’t assume two things are the same just because they appear to be.
  • Sometimes it’s better to use imprecise language to prevent triggering an emotional response.
  • Involve the whole team in making meaning to avoid overlooking important information.
  • The same meaning can have different significance to different people.
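The test-first note above is worth a concrete sketch: the expected results are written down before the implementation exists. The function and expectations here are hypothetical examples of mine, not from the book:

```python
# Test-first: state the expected results before writing the implementation.
def test_normalize_whitespace():
    assert normalize_whitespace("  hello   world ") == "hello world"
    assert normalize_whitespace("") == ""
    assert normalize_whitespace("one") == "one"

# Only now write the code that must satisfy those expectations.
def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return " ".join(text.split())

test_normalize_whitespace()  # The spec existed before the code did.
```

Writing the assertions first forces you to decide what “correct” means while it is still cheap to change your mind.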

Chapter 13: Determining Significance

  • Significance is the importance attached to the bug by whoever gets to decide what to do about it.
  • The same bug can have different significance to different people.
  • People can attribute higher or lower significance because of hidden personal agendas, which may not reflect what they actually think.
  • You cannot determine the significance of a single bug without considering all others.
  • Monetary value is a useful input for significance, but not always the only or most important one.
  • Don’t make significance ratings for bugs too complicated: four levels of significance should be enough.
  • Address significant problems first.
  • Don’t dismiss emotions entirely; sometimes they can point to something of significance.
  • Bug severity and repair difficulty are unrelated: a typo that’s easy to fix could be disastrous.
  • There is no objective way to assess significance: our emotional reaction to the data drives it.
  • Significance can change when the system changes.
  • Be mindful that deciding not to fix a bug can be disappointing for testers, so communicate this carefully.

Chapter 14: Making a Response

  • In an IBM study, a dozen failed projects were attributed to “bad luck”. The investigators found that successful projects had an equal amount of “bad luck”, but they succeeded because of how they were managed.
  • Well-managed software projects are not in a rush; bad management produces the crunch.
  • You aren’t going to get lucky: anticipate bugs from the start of the project and plan accordingly.
  • Time estimates for testing are often wrong because they don’t account for delays such as fixing and retesting bugs.
  • Fault-feedback-ratio (FFR): the percentage of bug fixes that introduce new bugs.
  • The baseline FFR is around 25%: one new bug introduced per four fixes.
  • Fixing under pressure raises the FFR, which means rushing makes things worse.
  • Testing is sometimes an afterthought, so it gets whatever time is left when everything else is done.
  • Postponing testing doesn’t prevent but guarantees a crunch: bugs accumulate silently and surface all at once.
  • When time runs out, stop finding new bugs and focus on fixing the most significant ones you already know about.
  • Continuously adjust a project’s schedule as testing reveals new information.
  • Don’t waste time testing something that’s likely to be significantly rebuilt.
  • Cutting testing to meet a deadline doesn’t save the project.
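The FFR arithmetic above shows why rushing backfires. A small deterministic approximation (my own sketch, not from the book) of how many fixes it takes to clear a backlog when a fraction of fixes introduces a new bug:

```python
def fixes_until_clean(initial_bugs: int, ffr: float, max_fixes: int = 10_000) -> int:
    """Count the fixes needed to clear a bug backlog when, on average,
    a fraction `ffr` of fixes introduces one new bug each."""
    remaining = float(initial_bugs)
    fixes = 0
    while remaining >= 1 and fixes < max_fixes:
        remaining -= 1    # one fix removes one bug...
        remaining += ffr  # ...but introduces `ffr` new bugs on average
        fixes += 1
    return fixes

# At the baseline FFR of 25%, clearing 100 bugs takes 133 fixes.
# Under pressure at 75%, the same backlog takes 397 fixes, roughly 3x the work.
print(fixes_until_clean(100, 0.25))
print(fixes_until_clean(100, 0.75))
```

Each fix removes `1 - ffr` bugs net, so the work grows as `1 / (1 - ffr)`; pushing FFR toward 100% makes the backlog effectively unclearable.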

Chapter 15: Preventing Testing from Growing More Difficult

  • Testing keeps getting harder because we keep building more ambitious software.
  • Requirements creep causes products to grow out of control, and it’s a management failure.
  • While adding a single feature can be relatively easy, it has nonlinear effects on testing time that are rarely considered.
  • Your application always runs within a larger system, so account for the complexity of everything it depends on.
  • Keep testing under control by building incrementally: build and test a small piece before moving on to the next.
  • Write acceptance tests before building each component, so you know when it’s done before integrating it.
  • Fixing a bug as early as possible helps to keep costs down, before it causes other issues that also require fixing.

Chapter 16: Testing Without Machinery

  • Technical reviewing is a powerful testing technique complementary to automated testing.
  • Technical reviews find different bugs than automated testing, since they include other data sources (e.g. design specs) besides code.
  • Every common objection to doing a review (e.g. “no one understands it”) reveals a problem in the product or process.
  • Start reviewing the most severe problems first before spending time on minor issues.
  • Even when a review tells the truth, it may require effort to convince others.
  • Testers should participate in technical reviews, as they can find logic bugs and design flaws without needing to understand code.
  • Testers learn from technical reviews how developers think and become more effective at testing.
  • Review code for testability: if it’s easier to test, it will also cost less time.

Chapter 17: Testing Scams

  • Tools amplify whatever process they’re applied to: if your testing process is broken, adding tools will make it worse.
  • Don’t evaluate testing tools based on a demonstration only, but evaluate using your own benchmarks.
  • Testimonials in sales presentations are not evidence of a good tool.
  • The more expensive a tool, the more likely people rationalize to keep using it even when it’s not effective.
  • Beware of tools that claim to require minimal human involvement; only humans can operate them meaningfully.
  • Outsourcing testing doesn’t mean you can outsource thinking.
  • Managing testing by numbers alone incentivizes people to game the system.

Chapter 18: Oblivious Scams

  • Record bugs immediately after they’re found, to avoid losing valuable information that can’t be recovered from memory later.
  • Even when testers make things look better than they are with good intentions, it ultimately hinders improvement.
  • Strained relationships between testers and developers can lead to maliciously skewed test results.
  • Beta testing usually draws attention to the commonly used features of a product, leaving bugs in less-used areas undetected.
  • Don’t treat bugs as isolated problems, but try to find root causes that can reveal related bugs.
  • Don’t mistake partial testing for complete testing.
  • Real data is messy, so question test reports that look too perfect.