Estimation with no historical data: a Monte Carlo approach
By Sanzio Castor, MSc, CSM
Better estimates can significantly change the course of a software development project and its budget. This article proposes using a basic Monte Carlo simulation to compensate for the lack of accuracy in estimates that are not driven by historical data.
Often, the goal is to predict the schedule needed to deliver a specific amount of functionality, and estimators are frequently forced to provide a single-point estimate. The first suggestion is to re-estimate each feature's best and worst cases (Table 1). A Fibonacci sequence is used because it reflects the greater uncertainty associated with estimates for larger units of work.
Why not insert a column with the ‘most likely case’ and then calculate the ‘expected case’ using the Program Evaluation and Review Technique (PERT) formula? It would be a reasonable short-term solution to the problem. The long-term solution would be to work with estimators to make their ‘most likely case’ estimates more accurate. However, computing an ‘expected case’ requires an analogy between the new project and a similar past one, and that is wishful thinking in this case study. In this example, the estimate for each story comes from an individual expert's judgment.
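For reference, the standard PERT three-point formula weights the most likely case four times as heavily as either extreme. The symbols are notation assumed for this note rather than taken from the article: B is the best case, M the most likely case, W the worst case, and E the resulting expected case.

```latex
E = \frac{B + 4M + W}{6}
```

With B = 5, M = 6, and W = 8, for instance, this gives E = 37/6, or roughly 6.2.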
A list of user stories based on Mike Cohn’s case study, “Bomb Shelter Studios,” with estimates and effort values, would look like this¹:
As Jon Wittwer states, “the Monte Carlo method is just one of many methods for analyzing uncertainty propagation, where the goal is to determine how random variation, lack of knowledge, or error affects the sensitivity, performance, or reliability of the system that is being modeled.”² In pseudocode, the formula to calculate the MC value for each story is:
if random < 0.5
    apply first estimate;
else
    apply second estimate;
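The selection rule above can be sketched in Python as follows. This is a minimal illustration, not the article's spreadsheet: the function name `pick_mc_value` and its parameters are assumptions, and `random.random()` stands in for a uniform draw in [0, 1).

```python
import random


def pick_mc_value(first_estimate, second_estimate, rng=random):
    """Return one of the two estimates, each with probability 0.5.

    `rng` is any object with a random() method returning a float in
    [0, 1); it defaults to Python's random module (an assumption made
    for this sketch).
    """
    if rng.random() < 0.5:
        return first_estimate
    return second_estimate


# Example: a story estimated once as 2 and once as 3.
mc_value = pick_mc_value(2, 3)
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable, which helps when comparing results between recalculations.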
How do we map the MC value to the best and worst cases? A second random integer within the range between the best and worst cases is generated. For example, for the second user story, 3 was the MC value. This value points to the third row of Table 1. Then the number 6 was generated, a random integer between 5 and 8. In this case, we made use of MS Excel's RAND function. Every time the worksheet is recalculated, a new random number is generated. Remember that “the key to Monte Carlo simulation is generating the set of random inputs.”
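The lookup and second draw described above might look like this in Python. Only the row for an MC value of 3 (best case 5, worst case 8) comes from the text; every other row of `EFFORT_RANGES` is a made-up placeholder for Table 1, and the names are assumptions for this sketch.

```python
import random

# Stand-in for Table 1: MC value -> (best case, worst case).
# Only the entry 3 -> (5, 8) is taken from the article's example;
# the remaining rows are hypothetical placeholders.
EFFORT_RANGES = {
    1: (1, 2),
    2: (3, 5),
    3: (5, 8),
    5: (8, 13),
    8: (13, 21),
}


def sample_effort(mc_value, ranges=EFFORT_RANGES, rng=random):
    """Draw a random integer effort between the best and worst cases."""
    best, worst = ranges[mc_value]
    # randint includes both endpoints, so best and worst can both occur.
    return rng.randint(best, worst)


effort = sample_effort(3)  # some integer from 5 to 8
```

In a spreadsheet, the same draw is commonly built from RAND by scaling and truncating, for example `INT(RAND()*(worst-best+1))+best`.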
The next step is to total the effort column. Subsequently, 5000 sets of random inputs are generated and the effort sum is evaluated for each of the 5000 sets (MS Excel can handle all the iterations with a simple macro). A sample of the results is shown here:
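As a rough sketch of the iteration step, the loop below totals the effort for one set of random inputs and repeats that 5000 times. The story list and most of the range table are hypothetical placeholders (only 3 -> (5, 8) appears in the text), and the function names are assumptions for this illustration rather than the article's macro.

```python
import random

# Hypothetical stand-in for Table 1: MC value -> (best case, worst case).
RANGES = {1: (1, 2), 2: (3, 5), 3: (5, 8), 5: (8, 13), 8: (13, 21)}

# Hypothetical stories, each with a (first estimate, second estimate) pair.
STORIES = [(1, 2), (2, 3), (3, 5), (5, 5), (8, 5)]


def simulate_total(stories, ranges, rng):
    """Total effort for one set of random inputs."""
    total = 0
    for first, second in stories:
        # Step 1: choose one of the two estimates with equal probability.
        mc_value = first if rng.random() < 0.5 else second
        # Step 2: draw an integer effort between the best and worst cases.
        best, worst = ranges[mc_value]
        total += rng.randint(best, worst)
    return total


def run_simulation(stories, ranges, iterations=5000, seed=None):
    """Return the list of total efforts, one per iteration."""
    rng = random.Random(seed)
    return [simulate_total(stories, ranges, rng) for _ in range(iterations)]


totals = run_simulation(STORIES, RANGES, seed=42)
```

Fixing `seed` is a convenience for reproducing a run; leaving it as `None` gives a fresh set of totals each time, much like recalculating the worksheet.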
Using the data in Table 3, the final table (Table 4) presents the cumulative probability corresponding to each possible total effort to complete the software development project.
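One way to derive such a cumulative table from a list of simulated totals is sketched below. The input values are invented purely for illustration and are not the article's results; the function name is likewise an assumption.

```python
from collections import Counter


def cumulative_probability(totals):
    """Return (total effort, P(effort <= total)) pairs in ascending order."""
    counts = Counter(totals)
    n = len(totals)
    running = 0
    table = []
    for effort in sorted(counts):
        running += counts[effort]
        table.append((effort, running / n))
    return table


# Made-up totals standing in for simulation output.
sample_totals = [30, 32, 32, 35, 35, 35, 38, 40]

for effort, probability in cumulative_probability(sample_totals):
    print(f"{effort:>4}  {probability:.3f}")
```

Reading the table from the top, the first effort value whose cumulative probability reaches a chosen confidence level, say 0.9, is the total that the simulated project met or beat in that share of the iterations.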
Instead of relying on a simplistic single-point estimate, the model incorporates probability and does not depend on historical data. According to Steve McConnell, “the key point is that all estimates include a probability, whether the probability is stated or implied. An explicitly stated probability is one sign of a good estimate.”³
Mike Cohn, Agile Estimating and Planning (Prentice Hall PTR, 2005)
Jon Wittwer, Monte Carlo Simulation Basics (Vertex42.com, 2004), http://vertex42.com/ExcelArticles/mc/MonteCarloSimulation.html
Steve McConnell, Software Estimation: Demystifying the Black Art (Microsoft Press, 2006)
Juanjuan Zang, Agile Estimation with Monte Carlo Simulation (Agile Processes in Software Engineering and Extreme Programming: 9th International Conference, XP 2008)