In a previous blog post titled Let’s say “No” to groupthink and stop quoting the Chaos Report, I wrote:
“We need to be able to examine the underlying data and measurement methods used as the basis for any report or study on IT project failures. Without examining the data, to continue quoting such reports is simply engaging in groupthink”
While we will never be able to examine the actual data on which the Chaos Report is based, we now have research that refutes its findings. In summary, this research found the Chaos Report to be misleading and one-sided: its definitions pervert estimation practice and result in meaningless figures.
Laurenz Eveleens and Chris Verhoef, of Vrije Universiteit Amsterdam, recently published the research in the article “The Rise and Fall of the Chaos Report Figures” in the January/February 2010 issue of IEEE Software magazine.
I had the opportunity recently to interview Mr. Verhoef about this research. Here is the full text of the interview:
What was the motivation for doing this research?
This particular research paper is part of a larger project called EQUITY, which is short for Exploring Quantifiable IT Yields. Let me tell you a bit more about that project.
The invisible motor of our western economy is software, an emerging production factor comparable to natural resources, labor, and capital. Current paradigms hold that software is just a cost center and that these costs must be lowered. This is like saying that from less iron ore, more steel must be produced. The EQUITY project intends to explore potential connections between value creation and information technology, to enable competing with software in a calculated manner.
The bottom line is that we wish to trace the actual impact of IT on value creation or destruction, e.g., in the form of stock value, also known as the equity of a firm. It is our ambition to develop a quantitative approach that is both accurate and usable within software-intensive organizations to facilitate rational decision-making about software investments. Achieving this would be a breakthrough, since no one has successfully explored the territory of information technology yields by purely quantitative means before.
Within the EQUITY project we work on developing the competencies to understand the possible connections between investing in software and the ensuing value creation or destruction via quantitative methods. Using such methods enables the development of predictive models so that competing with software becomes feasible through maximizing value creation and minimizing value destruction.
In the EQUITY project we work with six people: four Ph.D. students and a former top executive. Let me introduce them briefly:
- Erald Kulk just received his Ph.D. and worked on requirements creep. With real-world data he figured out when volatile requirements are healthy and when they start to become dangerous. Without requirements change you get the system you asked for, and with some healthy modifications you get the system that you meant. But when you do not know what you want, creep turns into a failure factor. We came up with (complex) mathematical methods that warn you at an early stage that you have reached the danger zone of failure. Dr. Kulk also worked on predicting IT project risks like budget overrun and how you can quantify this risk in terms of easily measured aspects of IT projects. Erald Kulk was recruited by our national government, where he assists our federal CIO, Mr. Hillenaar, with the installation of nationwide IT portfolio management to improve the IT performance of the Dutch government.
- Peter Kampstra is another Ph.D. student working on the EQUITY project. He is a very talented young man with a great intuition for mathematics and statistics. You could call him Mr. Beanplot, since he invented a new statistical tool he dubbed a beanplot. We used his intuitive statistical visualization technique (see paper and spreadsheet) to benchmark the risk of failure of large Dutch governmental projects against 6,000 IT projects in the private sector. He also works on the reliability of function points counts. When investing in custom IT systems, it is important to know “how much” IT you are going to make. The function point measure is one of the possible candidates. We investigated many tens of thousands of function point totals from many projects. It turned out that the function point totals were a good measure on which to base predictions. The totals gave plausible numbers and were accurate. Peter is still working on the EQUITY project.
- Then we have Lukasz Kwiatkowski. While Erald and Peter work with management data, Lukasz also works with source code. The idea is that IT decision-making is ruled by existing applications, whether you like it or not. We call that the bit-to-board approach. We extract bit-level data from large source portfolios and aggregate that up to the executive level. No information gets lost by management filters. A good example is operational cost. This is often a significant factor but what can you do about it? The answer is to dive into the source code and look for the low-hanging fruit. Lukasz worked on a nice example where he waded through a source portfolio of 20 million lines of code (250 apps) of a large multinational company, seeking to reduce MIPS. We could identify just a very small part of the giant portfolio as code that, once optimized, could cut MIPS usage, and with it operational cost, by 5-10%.
- Laurenz Eveleens is working on quantifying the quality of IT forecasts. By now you have seen that an important aspect of IT decision-making is that executives use only prior experience and forecasts as bases for their decisions. Obviously, you have to know the quality of those forecasts. But it turns out that not many researchers work on that. Again, with large amounts of data from various industrial parties, we worked on methods to assess forecasting quality. Also, complex math is involved, and we went to great lengths to get it all right. Laurenz was recruited by PricewaterhouseCoopers, where he works in the Software Assessment and Control group. One day a week he works on finalizing his Ph.D. thesis.
- Finally, Dr. Rob Peters is also working on the EQUITY project. Rob is a veteran academic who spent many years at a university and holds a Ph.D. in econometrics. He also worked for many years at ING Group, a large financial services provider based in the Netherlands. He initiated quantitative thinking at ING, and that is where we met years ago when I was invited by ING to work with them on IT portfolio management. Rob and I are working with the Ph.D. students and the industrial parties on the important themes of the EQUITY project. We also collaborate on IT portfolio management. For instance, we recently proposed a method to quantify the yield of risk-bearing IT portfolios.
You can imagine that this type of research is only possible with substantial amounts of code and data. We have access to this type of data because of our decades-long connections with many industrial parties, and the added value our research brings to them. Of course this data is not meant for publication or sharing with others; it is crucial data that the competition is not allowed to have.
Of course that is a problem within our field; data is scarce and almost never publicly available.
The Chaos Report data and methods of measurement are not available for verification. You say in your report that:
Nicholas Zvegintsov has placed low reliability on information where researchers keep the actual data and data sources hidden. He argued that because the Standish Group hasn’t explained, for instance, how it chose the organizations it surveyed, what survey questions it asked, or how many good responses it received, there’s little to believe.
Yes we fully agree. Now the problem is that you often cannot publish actual data. Instead we publish statistical aggregates of the data. That is not as good as the data itself but it is a start.
Isn’t it expected that research studies, especially those with enormous impact, such as the Chaos Report, disclose their data and analysis methods to the research community for verification and validation?
This question has been asked more than once of Standish but they would not disclose their data.
Why do you think the Chaos Report is so widely quoted without any basis to validate its findings?
I think because the numbers are astounding; at least, that is why I quoted these reports. In 1994 they came up with a 16% success rate. In retrospect I can predict that kind of percentage with a small Gedanken experiment. Suppose we are to predict cost, time, and the amount of functionality. Success means the actuals come in below the cost and time predictions and above the predicted amount of functionality. Now assume we have a 50% chance of getting each number right (so this is random!). If the three numbers are not correlated, the combined chance of getting all three right is 0.5 × 0.5 × 0.5 = 12.5%. So a 16% success rate is in fact high: it is barely better than chance. Now the snag is that not many of those quoting this report really read these definitions out loud and absorbed their true meaning.
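To check the arithmetic behind that Gedanken experiment, here is a quick simulation of my own (an illustration, not code from the paper):

```python
import random

# Gedanken experiment: cost, time, and functionality forecasts each have a
# 50% chance of landing on the "successful" side of the actual, independently.
random.seed(42)
trials = 100_000
successes = sum(
    all(random.random() < 0.5 for _ in range(3))  # all three land right by pure chance
    for _ in range(trials)
)
print(f"Success rate by chance alone: {successes / trials:.1%}")  # ~12.5% (0.5 ** 3)
```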
Others have previously challenged the Chaos Report findings. In your report you have cited Nicholas Zvegintsov, Robert Glass, and Magne Jørgensen. How is your approach to challenging the Chaos Report different from previous ones?
Laurenz Eveleens and I were working on assessing the quality of IT forecasts using large amounts of data from various sources. The Standish Group definitions are about some form of forecasting quality, not about what constitutes success in general terms. We carried out the exact same calculations as Standish reported on in their Chaos Chronicles. It turned out that these results were not at all in accordance with reality. Therefore the research is not reproducible. In medical science this is normal procedure: when someone publishes a result, other groups try to reproduce it.
Zvegintsov’s argument was about the Standish Group’s practice of non-disclosure. Glass argued that if so many projects fail, how can we claim to live in an information age? Jørgensen’s argument was twofold: the definitions did not cover all cases, and other research findings were wildly different. In fact, other research in this area suffers from the same problem as the Standish figures. Also, that research does not take institutional bias into account, which leads to meaningless rates. So for us it is no surprise that Jørgensen found these large discrepancies.
Our argument is fundamentally different; we have actual data, we know the quality of it, and we apply it to their definitions. The outcomes simply do not at all coincide with reality.
You applied the Standish definitions to extensive data when you collected 5,457 forecasts of 1,211 real-world projects totaling hundreds of millions of Euros. What is the process you went through to get this data and how long did the research take from start to finish?
It takes decades to build industrial relations so that important and confidential data comes your way. Once relations are firm and added value is returned, plenty of data becomes available.
How did you make sure that your research uses the same underlying assumptions or measurements as those used in the Chaos Report?
If you read the public versions of their reports closely this information is there.
Since you released your findings, what has been the reaction from other researchers and the media?
In 2009 we published a mathematically dense and substantial paper, Quantifying IT Forecast Quality. This paper contained the findings that we separately published in early 2010 in IEEE Software. On the Internet the IEEE Software paper is now attracting attention. There is a lot of discussion going on about the Standish reports. Our findings seem to be trickling into those discussions.
Scientific articles and media reports widely cite the Chaos Report. The report found its way to the President of the United States to support the claim that processes and U.S. software products are inadequate. What impact do the findings of the Chaos Report have on software projects and project management in general?
If quoting and citation is a measure for impact then the impact in general is still substantial.
What impact do you hope your report findings will make?
We hope that others will also make an effort to assess the forecasting quality of their own data so that fact-based decision-making in our field becomes the norm.
The Chaos Report defines a project as successful based on how well it did with respect to its original estimates of cost, time, and functionality. Can you give us a brief summary of the definitions used by the Chaos Report for successful, challenged, and failed projects?
Laurenz and I translated their definitions into more mathematical terms, but they are equivalent:
- Resolution Type 1, or project success. The project is completed, the forecast to actual ratios (f/a) of cost and time are ≥1, and the f/a ratio of the amount of functionality is ≤1.
- Resolution Type 2, or project challenged. The project is completed and operational, but f/a < 1 for cost and time and f/a > 1 for the amount of functionality.
- Resolution Type 3, or project failed. The project is canceled at some point during the development cycle.
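To make these definitions concrete, here is a minimal sketch of how I read them; the code is my own illustration, not the Standish Group’s or the authors’ method:

```python
def standish_resolution(cost_fa: float, time_fa: float, functionality_fa: float,
                        completed: bool = True) -> str:
    """Classify a project from its forecast-to-actual (f/a) ratios.

    cost_fa, time_fa, functionality_fa: forecast divided by actual.
    Everything completed but not 'successful' is treated as 'challenged' here;
    as Jorgensen noted (mentioned earlier in the interview), the original
    definitions do not cover all cases, so this is a simplification.
    """
    if not completed:
        return "Type 3: failed"       # canceled before completion
    if cost_fa >= 1 and time_fa >= 1 and functionality_fa <= 1:
        return "Type 1: successful"   # at or under budget and schedule, at least the promised functionality
    return "Type 2: challenged"

# A project only 5% over budget counts as challenged ...
print(standish_resolution(cost_fa=0.95, time_fa=1.0, functionality_fa=1.0))
# ... while one that asked for more than three times the money it needed counts as a success.
print(standish_resolution(cost_fa=3.33, time_fa=1.2, functionality_fa=0.9))
```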
Let’s talk about the four findings from your research. Your first finding is that the definitions are misleading. Can you explain to us the basis for this conclusion?
They’re misleading because they’re based solely on estimation accuracy for cost, time, and functionality. But Standish labels projects as successful or challenged, suggesting much more than deviations from their original estimates.
So basically the definitions of successful and challenged projects are based on estimation deviation only. Readers of the report who associate words like “challenged” and “success” with something other than their definitions will interpret the figures differently.
Your second finding is that the report contains unrealistic rates. I know you go to great lengths in the report on how you arrived at this conclusion but can you give us a summary of your findings?
The Standish Group’s measures are one-sided because they neglect underruns for cost and time and overruns for the amount of functionality. We took a best-in-class forecasting organization and used projects for which we had cost and amount of functionality estimates. The quality of those forecasts was high; half the projects have a time-weighted average deviation of 11% for cost and 20% deviation for functionality. Combined, half the projects have an average time-weighted deviation of only 15% from both actuals. In IT this is known as best-in-class.
Yet, even though this organization’s cost and functionality forecasts are accurate, when we apply the Standish definitions to the initial forecasts, we find only a 35% success rate. This is unrealistic.
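To get a feel for why accurate forecasts can still score poorly on this yardstick, here is a small sketch of my own. The error ranges are assumptions loosely inspired by the figures above, and the errors are drawn independently and symmetrically, which real projects are not, so the resulting rate is not the 35% from the study; the point is only that an accurate but unbiased forecaster is not rewarded by a one-sided criterion.

```python
import random

# Assumed, illustrative error ranges: forecasts within 15% of actual for cost and
# time, within 20% for functionality, equally likely to fall above or below.
random.seed(1)
trials = 100_000
successes = 0
for _ in range(trials):
    cost_fa = 1 + random.uniform(-0.15, 0.15)
    time_fa = 1 + random.uniform(-0.15, 0.15)
    functionality_fa = 1 + random.uniform(-0.20, 0.20)
    if cost_fa >= 1 and time_fa >= 1 and functionality_fa <= 1:
        successes += 1
print(f"Standish 'success' rate: {successes / trials:.1%}")  # roughly 12.5%, despite accurate forecasts
```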
Your third finding is that basing estimates on the Chaos definitions perverts forecasting accuracy. You say:
The organization adopted the Standish definitions to establish when projects were successful. This caused project managers to overstate budget requests to increase the safety margin for success. However, this practice perverted forecast quality.
What led you to this conclusion?
If you optimize for a high Standish success rate, the strategy is to not exceed the duration and budget that were initially stated and to not deliver less functionality than initially promised. In practice, what you do is ask for a lot of time and money and promise nothing. This is exactly what we found in one company. Indeed, this company had high Standish ratings, but 50% of the projects had a time-weighted average deviation of 233% or more from the actual. Hence, these definitions hinder rather than help improving estimation practice.
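To see how padding plays out against these definitions, here is a small worked example with invented numbers (the 233% above is a time-weighted average over many projects; this is a single forecast, using a plain relative deviation):

```python
# Invented figures: a padded budget request for a project that needed far less.
forecast_cost = 1_000_000   # what was asked for
actual_cost   =   300_000   # what the project really cost

fa_ratio  = forecast_cost / actual_cost                      # 3.33, so f/a >= 1: a Standish "success"
deviation = abs(forecast_cost - actual_cost) / actual_cost   # how far the forecast was off
print(f"f/a = {fa_ratio:.2f}, deviation from actual = {deviation:.0%}")  # ~233%
```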
Your fourth and final conclusion is that the Chaos Report provides meaningless figures. You say:
Comparing all case studies together, we show that without taking forecasting biases into account, it is almost impossible to make any general statement about estimation accuracy across institutional boundaries.
Can you give an overview of some of the work you did to arrive at this conclusion?
We found institutional biases in forecasting. For instance, we found a salami tactic: systematically underestimating the actual. We also found sandbagging: systematically overestimating. When you average numbers with an unknown bias, the average does not mean anything. And that is what Standish did.
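A tiny illustration, with invented numbers, of why averaging over unknown biases is meaningless:

```python
# Two organizations with opposite, systematic forecasting biases (f/a ratios).
salami      = [0.50, 0.60, 0.55, 0.50]   # salami tactic: consistently underestimates
sandbagging = [1.50, 1.40, 1.45, 1.50]   # sandbagging: consistently overestimates

pooled = salami + sandbagging
average_fa = sum(pooled) / len(pooled)
print(f"Pooled average f/a: {average_fa:.2f}")  # 1.00, which looks like perfect forecasting
# The pooled average describes neither organization: one asks for roughly half of
# what it needs, the other for roughly half again as much.
```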
What was your reaction to the Standish Group’s response to your findings that:
All data and information in the Chaos reports and all Standish reports should be considered Standish opinion and the reader bears all risk in the use of this opinion.
Laurenz and I fully support this disclaimer, which to our knowledge was never stated in the Chaos reports.
What is your advice to those who continue to use the Chaos Report project failure statistics, without really understanding the basis of its conclusions?
So what is next for Mr. Verhoef?
Helping IT governors make IT decision-making more fact-based and transparent.
How can our readers contact you and find out more about your research?
There’s plenty of information on the Web and one can reach me via email:
If you like this interview, you will also like: Advanced Project Thinking – A conversation with Dr. Harvey Maylor