Chapter 6
Project Management and Software Engineering
Introduction
Project management is a weak link in the software engineering chain. It
is also a weak link in the academic curricula of many universities. More
software failures and problems such as cost and schedule overruns can be
attributed to poor project management than to poor programming or to
poor software engineering practices. Because poor project management is
so harmful to good software engineering, it is relevant to a book on best
practices.
Working as an expert witness in a number of cases involving cancelled
projects, quality problems, and other significant failures, the author
observed bad project management practices in almost every case. Not
only were project management problems common, but in some lawsuits,
the project managers and higher executives actually interfered with
good software engineering practices by canceling inspections and
truncating testing in the mistaken belief that such actions would shorten
development schedules.
For example, in a majority of breach of contract lawsuits, project man-
agement issues such as inadequate estimating, inadequate quality con-
trol, inadequate change control, and misleading or even false status
tracking occur repeatedly.
As the recession continues, it is becoming critical to analyze every
aspect of software engineering in order to lower costs without degrading
operational efficiency. Improving project management is on the critical
path to successful cost reduction.
Project management needs to be defined in a software context. The
term project management has been artificially narrowed by tool vendors
so that it has become restricted to the activities of critical path analysis
and the production of various scheduling aids such as PERT and Gantt
charts. For successful software project management, many other activi-
ties must be supported.
Table 6-1 illustrates 20 key project management functions, and how
well they are performed circa 2009, based on observations within about
150 companies. The scoring range is from -10 for very poor performance
to +10 for excellent performance.
Using this scoring method that runs from +10 to -10, the midpoint or
average is 0. Observations made over the past few years indicate that
project management is far below average in far too many critical activities.
The top item in Table 6-1, reporting "red flag" items, refers to notify-
ing clients and higher managers that a project is in trouble. In almost
every software breach of contract lawsuit, problems tend to be concealed
or ignored, which delays trying to solve problems until they grow too
serious to be cured.
TABLE 6-1  Software Project Management Performance Circa 2009

Project Management Functions                        Score   Definition
 1. Reporting "red flag" problems                    -9.5   Very poor
 2. Defect removal efficiency measurements           -9.0   Very poor
 3. Benchmarks at project completion                 -8.5   Very poor
 4. Requirements change estimating                   -8.0   Very poor
 5. Postmortems at project completion                -8.0   Very poor
 6. Quality estimating                               -7.0   Very poor
 7. Productivity measurements                        -6.0   Poor
 8. Risk estimating                                  -3.0   Poor
 9. Process improvement tracking                     -2.0   Poor
10. Schedule estimating                               1.0   Marginal
11. Initial application sizing                        2.0   Marginal
12. Status and progress tracking                      2.0   Marginal
13. Cost estimating                                   3.0   Fair
14. Value estimating                                  4.0   Fair
15. Quality measurements                              4.0   Fair
16. Process improvement planning                      4.0   Fair
17. Quality and defect tracking                       5.0   Good
18. Software assessments                              6.0   Good
19. Cost tracking                                     7.0   Very good
20. Earned-value tracking                             8.0   Very good
    Average                                          -0.8   Poor
The main reason for such mediocre performance by software project
managers is probably the lack of effective curricula at the university and
graduate school level. Few software engineers and even fewer MBA
students are taught anything about the economic value of software quality
or how to measure defect removal efficiency levels, which is actually the
most important single measurement in software engineering.
Given the tools and methods for effective software project management
that are available in 2009, a much different profile would emerge if
software managers were trained at state-of-the-art levels.
Table 6-2 makes the assumption that a much improved curriculum for
project managers could be made available within ten years, coupled with
the assumption that project managers would then be equipped with
modern sizing, cost estimating, quality estimating, and measurement
tools. Table 6-2 shows what software project managers could do if they
were well trained and well equipped.
Instead of jumping blindly into projects with poor estimates and
inadequate quality plans, Table 6-2 shows that it is at least theoretically
possible for software project managers to plan and estimate with high
precision, measure with even higher precision, and create benchmarks
for every major application when it is finished. Unfortunately, the
technology of software project management is much better than actual
day-to-day performance.

TABLE 6-2  Potential Software Project Management Performance by 2019

Project Management Functions                        Score   Definition
 1. Reporting "red flag" problems                    10.0   Excellent
 2. Benchmarks at project completion                 10.0   Excellent
 3. Postmortems at project completion                10.0   Excellent
 4. Status and progress tracking                     10.0   Excellent
 5. Quality measurements                             10.0   Excellent
 6. Quality and defect tracking                      10.0   Excellent
 7. Cost tracking                                    10.0   Excellent
 8. Defect removal efficiency measurements            9.0   Excellent
 9. Productivity measurements                         9.0   Very good
10. Software assessments                              9.0   Very good
11. Earned-value tracking                             9.0   Very good
12. Quality estimating                                8.0   Very good
13. Initial application sizing                        8.0   Very good
14. Cost estimating                                   8.0   Very good
15. Risk estimating                                   7.0   Good
16. Schedule estimating                               7.0   Good
17. Process improvement tracking                      6.0   Good
18. Value estimating                                  6.0   Good
19. Process improvement planning                      6.0   Good
20. Requirements change estimating                    5.0   Good
    Average                                           8.4   Very good
As of 2009, the author estimates that only about 5 percent of U.S.
software projects create benchmarks of productivity and quality data at
completion. Less than one-half of 1 percent submit benchmark data to a
formal benchmark repository such as that maintained by the International
Software Benchmarking Standards Group (ISBSG), Software Productivity
Research (SPR), the David Consulting Group, Quality and Productivity
Management Group (QPMG), or similar organizations.
Every significant software project should prepare formal benchmarks
at the completion of the project. There should also be a postmortem
review of development methods to ascertain whether improvements
might be useful for future projects.
As of 2009, many independent software project management tools are
available, but each only supports a portion of overall software project
management responsibilities. A new generation of integrated software
project management tools is approaching, which has the promise of
eliminating the gaps in current project management tools and improv-
ing the ability to share information from tool to tool. New classes of
project management tools such as methodology management tools have
also joined the set available to the software management community.
Software project management is one of the most demanding jobs of
the 21st century. Software project managers are responsible for the con-
struction of some of the most expensive assets that corporations have
ever attempted to build. For example, large software systems cost far
more to build and take much longer to construct than the office build-
ings occupied by the companies that have commissioned the software.
Really large software systems in the 100,000-function point range can
cost more than building a domed football stadium, a 50-story skyscraper,
or a 70,000-ton cruise ship.
Not only are large software systems expensive, but they also have
one of the highest failure rates of any manufactured object in human
history. The term failure refers to projects that are cancelled without
completion due to cost or schedule overruns, or which run later than
planned by more than 25 percent.
For software failures and disasters, the great majority of blame can
be assigned to the management community rather than to the tech-
nical community. Table 6-3 is derived from one of the author's older
books, Patterns of Software System Failure and Success, published by
the International Thomson Press. Note the performance of software
managers on successful projects as opposed to their performance associ-
ated with cancellations and severe overruns.
TABLE 6-3  Software Management Performance on Successful and Unsuccessful Projects

Activity                  Successful Projects   Unsuccessful Projects
Sizing                    Good                  Poor
Planning                  Very good             Fair
Estimating                Very good             Very poor
Tracking                  Good                  Poor
Measurement               Good                  Very poor
Quality control           Excellent             Poor
Change control            Excellent             Poor
Problem resolutions       Good                  Poor
Risk analysis             Good                  Very poor
Personnel management      Good                  Poor
Supplier management       Good                  Poor
Overall Performance       Very good             Poor
As mentioned in Chapter 5 of this book, the author's study of proj-
ect failures and analysis of software lawsuits for breach of contract
reached the conclusion that project failures correlate more closely to
the number of managers involved with software projects than they do
with the number of software engineers.
Software projects with more than about six first-line managers tend
to run late and over budget. Software projects with more than about
12 first-line managers tend to run very late and are often cancelled.
As can easily be seen, deficiencies of the software project management
function are a fundamental root cause of software disasters. Conversely,
excellence in project management can do more to raise the probability of
success than almost any other factor, such as buying better tools, or chang-
ing programming languages. (This is true for larger applications above
1000 function points. For small applications in the range of 100 function
points, software engineering skills still dominate results.)
On the whole, improving software project management performance
can do more to optimize software success probabilities and to minimize
failure probabilities than any other known activity. However, improving
software project management performance is also one of the more difficult
improvement strategies. If it were easy to do, the software industry would
have many more successes and far fewer failures than in fact occur.
A majority of the failures of software projects can be attributed to fail-
ures of project management rather than to failures of the technical staff.
For example, underestimating schedules and resource requirements is
associated with more than 70 percent of all projects that are cancelled
due to overruns. Another common problem of project management is
ignoring or underestimating the work associated with quality control
and defect removal. Yet another management problem is failure to deal
with requirements changes in an effective manner.
Given the high costs and significant difficulty associated with software
system construction, you might think that software project managers
would be highly trained and well equipped with state-of-the-art planning
and estimating tools, with substantial analyses of historical software cost
structures, and with very thorough risk analysis methodologies. These
are natural assumptions to make, but they are false. Table 6-4 illustrates
patterns of project management tool usage of leading, average, and lag-
ging software projects.
Table 6-4 shows that managers on leading projects not only use a
wider variety of project management tools, but they also use more of
the features of those tools.
In part due to the lack of academic preparation for software project
managers, most software project managers are either totally untrained
or at best partly trained for the work at hand. Even worse, software
project managers are often severely under-equipped with state-of-the-
art tools.
TABLE 6-4  Numbers and Size Ranges of Project Management Tools
(Size data expressed in terms of function point metrics)

Project Management Tool        Lagging    Average    Leading
Project planning                 1,000      1,250      3,000
Project cost estimating                                3,000
Statistical analysis                                   3,000
Methodology management                        750      3,000
Benchmarks                                             2,000
Quality estimation                                     2,000
Assessment support                            500      2,000
Project measurement                                    1,750
Portfolio analysis                                     1,500
Risk analysis                                          1,500
Resource tracking                  300        750      1,500
Value analysis                                350      1,250
Cost variance reporting                       500      1,000
Personnel support                  500        500        750
Milestone tracking                            250        750
Budget support                                250        750
Function point analysis                       250        750
Backfiring: LOC to FP                                    750
Function point subtotal          1,800      5,350     30,250
Number of tools                      3         10         18
From data collected from consulting studies performed by the author,
less than 25 percent of U.S. software project managers have received any
formal training in software cost estimating, planning, or risk analysis;
less than 20 percent of U.S. software project managers have access to
modern software cost-estimating tools; and less than 10 percent have
access to any significant quantity of validated historical data from proj-
ects similar to the ones they are responsible for.
The comparatively poor training and equipment of project managers
is troubling. There are at least a dozen commonly used software cost-
estimating tools such as COCOMO, KnowledgePlan, Price-S, SEER,
SLIM, and the like. Of a number of sources of benchmark data, the
International Software Benchmarking Standards Group (ISBSG) has
the most accessible data collection.
By comparison, the software technical personnel who design and build
software are often fairly well trained in the activities of analysis, design,
and software development, although there are certainly gaps in topics
such as software quality control and software security.
The phrase "project management" has unfortunately been narrowed
and misdefined in recent years by vendors of automated tools for sup-
porting real project managers. The original broad concept of project
management included all of the activities needed to control the outcome
of a project: sizing deliverables, estimating costs, planning schedules
and milestones, risk analysis, tracking, technology selection, assessment
of alternatives, and measurement of results.
The more narrow concept used today by project management tool
vendors is restricted to a fairly limited set of functions associated with
the mechanics of critical path analysis, work breakdown structuring,
and the creation of PERT charts, Gantt charts, and other visual sched-
uling aids. These functions are of course part of the work that project
managers perform, but they are neither the only activities nor even the
most important ones for software projects.
The gaps and narrow focus of conventional project management tools
are particularly troublesome when the projects in question are software
related. Consider a very common project management question associ-
ated with software projects: What will be the results to software sched-
ules and costs from the adoption of a new development method such as
Agile development or the Team Software Process (TSP)?
Several commercial software estimating tools can predict the results
of both Agile and TSP development methods, but not even one standard
project management tool such as Microsoft Project has any built-in
capabilities for automatically adjusting its assumptions when dealing
with alternative software development approaches.
The same is also true for the project management considerations of
other software-related technologies, such as formal inspections in
addition to testing, static analysis, the ISO 9000-9004 standards, the
SEI maturity model, reusable components, ITIL, and so forth.
The focus of this chapter is primarily on the activities and tasks asso-
ciated with software project management. Project managers also spend
quite a bit of time dealing with personnel issues such as hiring, apprais-
als, pay raises, and staff specialization. Due to the recession, project
managers will probably also face tough decisions involving layoffs and
downsizing.
Most software project managers are also involved with departmental
and corporate issues such as creating budgets, handling travel requests,
education planning, and office space planning. These are important
activities, but are outside the scope of what managers do when they are
involved specifically with project management.
The primary focus of this chapter is on the tools and methods that
are the day-to-day concerns of software project managers, that is, sizing,
estimating, planning, measurement and metrics, quality control, process
assessments, technology selection, and process improvement.
There are 15 basic topics that project managers need to know about,
and each topic is a theme of some importance to professional software
project managers:
1. Software sizing
2. Software project estimating
3. Software project planning
4. Software methodology selection
5. Software technology and tool selection
6. Software quality control
7. Software security control
8. Software supplier management
9. Software progress and problem tracking
10. Software measurements and metrics
11. Software benchmarking
12. Software risk analysis
13. Software value analysis
14. Software process assessments
15. Software process improvements
These 15 activities are not the only topics of concern to software project
managers, but they are critical topics in terms of the ability to control
major software projects. Unless at least 10 of these 15 are performed in
a capable and competent manner, the probability of the project running
out of control or being cancelled will be alarmingly high.
Because the author's previous books on Estimating Software Costs
(McGraw-Hill, 2007) and Applied Software Measurement (McGraw-Hill,
2008) dealt with many managerial topics, this book will cover only 3 of
the 15 management topics:
1. Software sizing
2. Software progress and problem tracking
3. Software benchmarking
Sizing is the precursor to estimating. Sizing has many different
approaches, but several new approaches have been developed within
the past year.
Software progress tracking is among the most critical of all software
project management activities. Unfortunately, based on depositions
and documents discovered during litigation, software progress track-
ing is seldom performed competently. Even worse, when projects are
in trouble, tracking tends to conceal problems until it is too late to
solve them.
Software benchmarking is underreported in the literature. As this
book is in production, the ISO standards organization is preparing a
new ISO standard on benchmarking. It seems appropriate to discuss
how to collect benchmark data and what kinds of reports constitute
effective benchmarks.
Software Sizing
The term sizing refers to methods for predicting the volume of various
deliverable items such as source code, specifications, and user manu-
als. Software bugs or defects should also be included in sizing, because
they cost more money and take more time than any other software
"deliverable." Bugs are an accidental deliverable, but they are always
delivered, like it or not, so they need to be included in sizing. Because
requirements are unstable and grow during development, changes and
growth in application requirements should be sized, too.
Sizing is the precursor to cost estimating and is one of the most criti-
cal software project management tasks. Sizing is concerned with pre-
dicting the volumes of major kinds of software deliverables, including
but not limited to those shown in Table 6-5.
As can be seen from the list of deliverables, the term sizing includes
quite a few deliverables. Many more things than source code need to be
predicted to have complete size and cost estimates.
TABLE 6-5  Software Deliverables Whose Sizes Should Be Quantified

Paper documents
  Requirements
    Text requirements
    Function requirements (features of the application)
    Nonfunctional requirements (quality and constraints)
    Use-cases
    User stories
    Requirements change (new features)
    Requirements churn (changes that don't affect size)
  Architecture
    External architecture (SOA, client-server, etc.)
    Internal architecture (data structure, platforms, etc.)
  Specifications and design
    External
    Internal
  Planning documents
    Development plans
    Quality plans
    Test plans
    Security plans
    Marketing plans
    Maintenance and support plans
  User manuals
    Reference manuals
    Maintenance manuals
    Translations into foreign languages
  Tutorial materials
    Translations of tutorial materials
  Online HELP screens
    Translations of HELP screens
Source code
  New source code
  Reusable source code from certified sources
  Reusable source code from uncertified sources
  Inherited or legacy source code
  Code added to support requirements change and churn
Test cases
  New test cases
  Reusable test cases
Bugs or defects
  Requirements defects (original)
  Requirements defects (in changed requirements)
  Architectural defects
  Design defects
  Code defects
  User documentation defects
  "Bad fixes" or secondary defects
  Test case defects
Note that while bugs or defects are accidental deliverables, there are
always latent bugs in large software applications and they have serious
consequences. Therefore, estimating defect potentials and defect removal
efficiency levels are critical tasks of software application sizing.
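Defect removal efficiency has a simple, commonly published definition: defects removed before release divided by the total of pre-release removals plus defects reported by users (conventionally counted for the first 90 days of use). The short Python sketch below illustrates the calculation; the defect counts are hypothetical.

# Sketch of the DRE calculation; the counts below are hypothetical.
def defect_removal_efficiency(removed_before_release, found_after_release):
    """Fraction of total defects removed prior to release; user-reported
    defects are conventionally counted for the first 90 days of use."""
    total = removed_before_release + found_after_release
    return removed_before_release / total

# Example: 950 defects removed during development, 50 reported by users.
print(f"DRE = {defect_removal_efficiency(950, 50):.0%}")  # prints: DRE = 95%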
This section discusses several methods of sizing software applications,
which include but are not limited to:
1. Traditional sizing by analogy with similar projects
2. Traditional sizing using "lines of code" metrics
3. Sizing using story point metrics
4. Sizing using use-case metrics
5. Sizing using IFPUG function point metrics
6. Sizing using other varieties of function point metrics
7. High-speed sizing using function point approximations
8. High-speed sizing legacy applications using backfiring
9. High-speed sizing using pattern matching
10. Sizing application requirements changes
Accurate estimation and accurate schedule planning depend on
having accurate size information, so sizing is a critical topic for success-
ful software projects. Size and size changes are so important that a new
management position called "scope manager" has come into existence
over the past few years.
New methods for formal size or scope control have been created.
Interestingly, the two most common methods were developed in very
distant locations from each other. A method called Southern Scope was
developed in Australia, while a method called Northern Scope was devel-
oped in Finland. Both of these scope-control methods focus on change
controls and include formal sizing, reviews of changes, and other tech-
niques for quantifying the impact of growth and change. While other
size control methods exist, the Southern Scope and Northern Scope
methods both appear to be more effective than leaving changes to ordi-
nary practices.
Because thousands of software applications exist circa 2009, care-
ful forensic analysis of existing software should be a good approach
for predicting the sizes of future applications. As of 2009, many "new"
applications are replacements of existing legacy applications. Therefore,
historical data would be useful, if it were reliable and accurate.
Size is a useful precursor for estimating staffing, schedules, effort,
costs, and quality. However, size is not the only factor that needs to be
known. Consider an analogy with home construction. You need to know
the number of square feet or square meters in a house to perform a cost
estimate. But you also need to know the specifics of the site, the con-
struction materials to be used, and any local building codes that might
require costly additions such as hurricane-proof windows or special
septic systems.
For example, a 3000-square-foot home to be constructed on a flat
suburban lot with ordinary building materials might be constructed for
$100 per square foot, or $300,000. But a luxury 3000-square-foot home
built on a steep mountain slope that requires special support and uses
exotic hardwoods might cost $250 per square foot or $750,000.
Similar logic applies to software. An embedded application in a medi-
cal device may cost twice as much as the same size application that
handles business data. This is because the liabilities associated with
software in medical devices require extensive verification and validation
compared with ordinary business applications.
(Author's note: Prior to the recession, one luxury home was built on a
remote lake so far from civilization that it needed a private airport and
its own electric plant. The same home featured handcrafted windows
and wall panels created on site by artists and craftspeople. The bud-
geted cost was about $40 million, or more than $6,000 per square foot.
Needless to say, this home was built before the Wall Street crash since
the owner was a financier.)
Three serious problems have long been associated with software sizing:
(1) Most of the facts needed to create accurate sizing of software deliver-
ables are not known until after the first cost estimates are required; (2)
Some sizing methods such as function point analysis are time-consuming
and expensive, which limits their utility for large applications; and (3)
Software deliverables are not static in size and tend to grow during
development. Estimating growth and change is often omitted from sizing
techniques. Let us now consider a number of current software sizing
approaches.
Traditional Sizing by Analogy
The traditional method of sizing software projects has been that of anal-
ogy with older projects that are already completed, so that the sizes of
their deliverables are known. However, newer methods are available
circa 2009 and will be discussed later in this chapter.
The traditional sizing-by-analogy method has not been very success-
ful for a variety of reasons. It can only be used for common kinds of
software projects where similar projects exist. For example, sizing by
analogy works fairly well for compilers, since there are hundreds of
compilers to choose from. The analogy method can also work for other
familiar kinds of applications such as accounting systems, payroll sys-
tems, and other common application types. However, if an application is
unique, and no similar applications have been constructed, then sizing
by analogy is not useful.
Because older legacy applications predate the use of story points or
use-case points, or sometimes even function points, not every legacy
application is helpful in terms of providing size guidance for new
applications. For more than 90 percent of legacy applications, their
size is not known with precision, and even code volumes are not
known, due to "dead code" and calls to external routines. Also, many
of their deliverables (i.e., requirements, specifications, plans, etc.)
have long since disappeared or were not updated, so their sizes may
not be available.
Since legacy applications tend to grow at an annual rate of about 8
percent, their current size is not representative of their initial size at
their first release. Very seldom is data recorded about requirements
growth, so this can throw off sizing by analogy.
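As a rough illustration of why current size misleads, the sketch below backs out an approximate size at first release from a current measurement, assuming the roughly 8 percent annual growth rate cited above; the specific figures are hypothetical.

# Hypothetical figures; the ~8 percent annual growth rate is the one cited above.
def original_size(current_size_fp, age_years, annual_growth=0.08):
    """Approximate size at first release, assuming compound growth since then."""
    return current_size_fp / ((1 + annual_growth) ** age_years)

# A 10-year-old legacy application measured at 10,000 function points today
# would have started out at roughly 4,600 function points.
print(round(original_size(10_000, 10)))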
Even worse, a lot of what is called "historical data" for legacy applications
is very inaccurate and can't be relied upon to predict future applications.
Even if legacy size is known, legacy effort and costs are usually incom-
plete. The gaps and missing elements in historical data include unpaid
overtime (which is almost never measured), project management effort,
and the work of part-time specialists who are not regular members of the
development team (database administration, technical writers, quality
assurance, etc.). The missing data on legacy application effort, staffing,
and costs is called leakage in the author's books. For small applications
with one or two developers, this leakage from historical data is minor.
But for large applications with dozens of team members, leakage of miss-
ing effort and cost data can top 50 percent of total effort.
Leakage of effort and cost data is worse for internal applications
developed by organizations that operate as cost centers and that therefore
have no overwhelming business need for precision in recording effort and
cost data. Outsourced applications and software built under contract have
more accurate effort and cost data, but even here unpaid overtime is often
omitted.
It is an interesting point to think about, but one of the reasons why IT
projects seem to have higher productivity rates than systems or embed-
ded software is that IT project historical data "leaks" a great deal more
than systems and embedded software. This leakage is enough by itself
to make IT projects look at least 15 percent more productive than sys-
tems or embedded applications of the same size in terms of function
points. The reason is that most IT projects are created in a cost-center
environment, while systems and embedded applications are created in
a profit-center environment.
The emergence of the International Software Benchmarking Standards
Group (ISBSG) has improved the situation somewhat, since ISBSG now
has about 5000 applications of various kinds that are available to the
software engineering community. All readers who are involved with
software are urged to consider collecting and providing benchmark data.
Even if the data cannot be submitted to ISBSG for proprietary or busi-
ness reasons, keeping such data internally will be valuable.
The ISBSG questionnaires assist by collecting the same kinds of infor-
mation for hundreds of applications, which facilitates using the data for
estimating purposes. Also, companies that submit data to the ISBSG
organization usually have better-than-average effort and cost tracking
methods, so their data is probably more accurate than average.
Other benchmark organizations such as Software Productivity
Research (SPR), the Quality and Productivity Management Group
(QPMG), the David Consulting Group, and a number of others have
perhaps 60,000 projects, but this data has limited distribution to specific
clients. This private data is also more expensive than ISBSG data. A
privately commissioned set of benchmarks with a comparison to similar
relevant projects may cost between $25,000 and $100,000, based on the
number of projects examined. Of course, the on-site private benchmarks
are fairly detailed and also correct common errors and omissions, so the
data is fairly reliable.
What would be useful for the industry is a major expansion in soft-
ware productivity and quality benchmark data collection. Ideally, all
development projects and all major maintenance and enhancement
projects would collect enough data so that benchmarks would become
standard practices, rather than exceptional activities.
For the immediate project under development, the benchmark data
is valuable for showing defects discovered to date, effort expended to
date, and ensuring that schedules are on track. In fact, similar but
less formal data is necessary just for status meetings, so a case can be
made that formal benchmark data collection is close to being free since
the information is needed whether or not it will be kept for benchmark
purposes after completion of the project.
Unfortunately, while sizing by analogy should be useful, flaws and
gaps with software measurement practices have made both sizing by
analogy and also historical data of questionable value in many cases.
Timing of sizing by analogy
If there are benchmarks or historical size data from similar projects,
this form of sizing can be done early, even before the requirements for
the new application are fully known. This is one of the earliest methods
of sizing. However, if historical data is missing, then sizing by analogy
can't be done at all.
Usage of sizing by analogy
There are at least 3 million existing software applications that might,
in theory, be utilized for sizing by analogy. However, from visits to many
large companies and government agencies, the author hypothesizes that
fewer than 100,000 existing legacy applications have enough historical
data for sizing by analogy to be useful and accurate. About another
100,000 have partial data but so many errors that sizing by analogy
would be hazardous. About 2.8 million legacy applications either have
little or no historical data, or the data is so inaccurate that it should not
be used. For many legacy applications, no reliable size data is available
in any metric.
Schedules and costs
This form of sizing is quick and inexpensive, assuming that benchmarks
or historical data are available. If neither size nor historical data is
available, the method of sizing by analogy cannot be used. In general,
benchmark data from an external source such as ISBSG, the David
Consulting Group, QPMG, or SPR will be more accurate than internal
data. The reason for this is that the external benchmark organizations
attempt to correct common errors, such as omitting unpaid overtime.
Cautions and counter indications
The main counter indication is that sizing by analogy does not work at
all if there is neither historical data nor accurate benchmarks. A caution
about this method is that historical data is usually incomplete and leaves
out critical information such as unpaid overtime. Formal benchmarks
collected for ISBSG or one of the other benchmark companies will usually
be more accurate than most internal historical data, which is of very poor
reliability.
Traditional Sizing Based on Lines
of Code (LOC) Metrics
When the "lines of code" or LOC metric originated in the early 1960s,
software applications were small and coding composed about 90 percent
of the effort. Today in 2009, applications are large, and coding composes
less than 40 percent of the effort. Between the 1960s and today, the useful-
ness of LOC metrics degraded until that metric became actually harmful.
Today in 2009, using LOC metrics for sizing is close to professional mal-
practice. Following are the reasons why LOC metrics are now harmful.
The first reason that LOC metrics are harmful is that after more than
40 years of usage, there are still no standard counting rules for source
code! LOC metrics can be counted using either physical lines or logical
statements. There can be more than a 500 percent difference in apparent
size of the same code segment when the counting method switches
between physical lines and logical statements.
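The toy Python sketch below illustrates why the two counting methods diverge; the semicolon-counting rule is a deliberately crude stand-in for real logical-statement counting, and the code fragments are contrived.

# Crude illustration only: a physical line is newline-delimited text, and a
# semicolon is used as a stand-in for one logical statement. Real code
# counters (and real languages) are far more complicated than this.
def count_physical(code):
    return len([line for line in code.splitlines() if line.strip()])

def count_logical(code):
    return code.count(";")

dense = "a = 1; b = 2; c = 3; d = 4; e = 5;"           # 1 physical, 5 logical
sparse = "total =\n    a +\n    b +\n    c +\n    d;"   # 5 physical, 1 logical

for fragment in (dense, sparse):
    print(count_physical(fragment), "physical lines,",
          count_logical(fragment), "logical statements")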
In the first edition of the author's book Applied Software Measurement
in 1991, formal rules for counting source code based on logical statements
were included. These rules were used by Software Productivity Research
(SPR) for backfiring when collecting benchmark data. But in 1992, the
Software Engineering Institute (SEI) issued their rules for counting source
code, and the SEI rules were based on counts of physical lines. Since both
the SPR counting rules and the SEI counting rules are widely used, but
totally different, the effect is essentially that of having no counting rules.
(The author did a study of the code-counting methods used in major
software journals such as IEEE Software, IBM Systems Journal,
CrossTalk, the Cutter Journal, and so on. About one-third of the articles
used physical lines, one-third used logical statements, and the remain-
ing third used LOC metrics, but failed to mention whether physical
lines or logical statements (or both) were used in the article. This is a
serious lapse on the part of both the authors and the referees of soft-
ware engineering journals. You would hardly expect a journal such as
Science or Scientific American to publish quantified data without care-
fully explaining the metrics used to collect and analyze the results.
However, for software engineering journals, poor measurements are the
norm rather than the exception.)
The second reason that LOC metrics are hazardous is because they
penalize high-level programming languages in direct proportion to the
power of the language. In other words, productivity and quality data
expressed using LOC metrics looks better for assembly language than
for Java or C++.
The penalty is due to a well-known law of manufacturing economics,
which is not well understood by the software community: When a manu-
facturing process has a large number of fixed costs and there is a decline
in the number of units manufactured, the cost per unit must go up.
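A one-line cost model makes this law concrete; the dollar figures below are purely hypothetical.

# Hypothetical numbers: $100,000 of fixed (non-coding) cost and $10 of
# variable cost per line. Shrinking the code volume raises the apparent
# cost per line even though the fixed costs are unchanged.
def cost_per_line(fixed_cost, variable_cost_per_line, lines_of_code):
    return (fixed_cost + variable_cost_per_line * lines_of_code) / lines_of_code

print(cost_per_line(100_000, 10, 10_000))  # low-level language:  $20 per LOC
print(cost_per_line(100_000, 10, 2_000))   # high-level language: $60 per LOC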
A third reason is that LOC metrics can't be used to size or measure
noncoding activities such as requirements, architecture, design, and
user documentation. An application written in the C programming lan-
guage might have twice as much source code as the same application
written in C++. But the requirements and specifications would be the
same size.
It is not possible to size paper documents from source code without
adjusting for the level of the programming language. For languages
such as Visual Basic that do not even have source code counting rules
available, it is barely possible to predict source code size, much less the
sizes of any other deliverables.
The fourth reason that LOC metrics are harmful is that circa 2009,
more than 700 programming languages exist, and they vary from very
low-level languages such as assembly to very high-level languages such
as ASP.NET. More than 50 of these languages have no known counting
rules.
The fifth reason is that most modern applications use more than a
single programming language, and some applications use as many as
15 different languages, each of which may have unique code counting rules.
Even a simple mix of Java and HTML makes code counting difficult.
Historically, the development of Visual Basic and its many competi-
tors and descendants changed the way many modern programs are
developed. Although "visual" languages do have a procedural source
code portion, much of the more complex programming uses button con-
trols, pull-down menus, visual worksheets, and reusable components.
In other words, programming is being done without anything that can
be identified as a "line of code" for sizing, measurement, or estimation
purposes. Today in 2009, perhaps 60 percent of new software applications
are developed using either object-oriented languages or visual
languages (or both). Indeed, sometimes as many as 12 to 15 different
languages are used in the same application.
For large systems, programming itself is only the fourth most expen-
sive activity. The three higher-cost activities cannot be measured or esti-
mated effectively using the lines of code metric. Also, the fifth major cost
element, project management, cannot easily be estimated or measured
using the LOC metric either. Table 6-6 shows the ranking in descending
order of software cost elements for large applications.
TABLE 6-6  Rank Order of Large System Software Cost Elements
1. Defect removal (inspections, static analysis, testing, finding and fixing bugs)
2. Producing paper documents (plans, architecture, specifications, user manuals)
3. Meetings and communication (clients, team members, managers)
4. Programming
5. Project management

The usefulness of a metric such as lines of code, which can only measure
and estimate one out of the five major software cost elements of
software projects, is a significant barrier to economic understanding.
Following is an excerpt from the 3rd edition of the author's book
Applied Software Measurement (McGraw-Hill, 2008), which illustrates
the economic fallacy of KLOC metrics. Here are two case studies showing
both the LOC results and function point results for the same application
in two languages: basic assembly and C++. In Case 1, we will assume
that an application is written in assembly. In Case 2, we will assume
that the same application is written in C++.
Case 1: Application written in the assembly language
Assume that the
assembly language program required 10,000 lines of code, and the vari-
ous paper documents (specifications, user documents, etc.) totaled to 100
pages. Assume that coding and testing required ten months of effort,
and writing the paper documents took five months of effort. The entire
project totaled 15 months of effort, and so has a productivity rate of 666
LOC per month. At a cost of $10,000 per staff month, the application
cost $150,000. Expressed in terms of cost per source line, the cost is $15
per line of source code.
Case 2: The same application written in the C++ language
Assume that the
C++ version of the same application required only 1000 lines of code.
The design documents probably were smaller as a result of using an
object-oriented (OO) language, but the user documents are the same size
as the previous case: assume a total of 75 pages were produced. Assume
that coding and testing required one month, and document production
took four months. Now we have a project where the total effort was only
five months, but productivity expressed using LOC has dropped to only
200 LOC per month. At a cost of $10,000 per staff month, the applica-
tion cost $50,000 or only one-third as much as the assembly language
version. The C++ version is a full $100,000 cheaper than the assembly
version, so clearly the C++ version has much better economics. But the
cost per source line for this version has jumped to $50.
Even if we measure only coding, we still can't see the value of high-level
languages by means of the LOC metric: the coding rates for the assembly
and C++ versions were identical at 1000 LOC per month, even though the
C++ version took only one month as opposed to ten months for the
assembly version.
Since both the assembly and C++ versions were identical in terms of
features and functions, let us assume that both versions were 50 func-
tion points in size. When we express productivity in terms of function
points per staff month, the assembly version had a productivity rate of
Project Management and Software Engineering
369
3.33 function points per staff month. The C++ version had a productivity
rate of 10 function points per staff month. When we turn to costs, the
assembly version had a cost of $3000 per function point, while the C++
version had a cost of $1000 per function point. Thus, function point met-
rics clearly match the assumptions of standard economics, which define
productivity as goods or services produced per unit of labor or expense.
Lines of code metrics, on the other hand, do not match the assump-
tions of standard economics and in fact show a reversal. Lines of code
metrics distort the true economic picture by so much that their use for
economic studies involving more than one programming language might
be classified as professional malpractice.
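The arithmetic of the two case studies can be reproduced in a few lines of Python, which makes the reversal easy to see; the figures are exactly those used in Case 1 and Case 2 above.

# Figures taken directly from Case 1 and Case 2 above.
COST_PER_STAFF_MONTH = 10_000
FUNCTION_POINTS = 50  # both versions deliver the same 50 function points

cases = {"Assembly": {"loc": 10_000, "months": 15},
         "C++":      {"loc": 1_000,  "months": 5}}

for name, c in cases.items():
    cost = c["months"] * COST_PER_STAFF_MONTH
    print(f"{name:8s} {c['loc'] / c['months']:7.1f} LOC/month"
          f"  ${cost / c['loc']:5.2f} per LOC"
          f"  {FUNCTION_POINTS / c['months']:5.2f} FP/month"
          f"  ${cost / FUNCTION_POINTS:7.2f} per FP")

The LOC columns favor assembly, while the function point columns correctly show the C++ version as far more economical.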
Timing of sizing by lines of code
Unless the application being sized is going to replace an existing legacy
application, this method is pure guesswork until the code is written. If
code benchmarks or historical code size data from similar projects exist,
this form of sizing can be done early, assuming the new language is the
same as the former language. However, if there is no code size history,
or if the old language is not the same as the new one, sizing using lines
of code can't be done with accuracy, and it can't be done at all until the
code is written, which is far too late. When either the new application or
the old application (or both) use multiple languages, code counting
becomes very complicated and difficult.
Usage of lines of code sizing
As of 2009, at least 3 million legacy appli-
cations still are in use, and another 1.5 million are under development.
However, of this total of about 4.5 million applications, the author esti-
mates that more than 4 million use multiple programming languages
or use languages for which no effective counting rules exist. Of the
approximate total of 500,000 applications that use primarily a single
language where counting rules do exist, no fewer than 500 program-
ming languages have been utilized. Essentially, code sizing is inaccurate
and hazardous, except for applications that use a single language such
as assembler, C, dialects of C, COBOL, Fortran, Java, and about 100
others.
In today's world circa 2009, sizing using LOC metrics still occurs
in spite of the flaws and problems with this metric. The Department
of Defense and military software are the most frequent users of LOC
metrics. The LOC metric is still widely used by systems and embedded
applications. The older waterfall method often employed LOC sizing, as
does the modern Team Software Process (TSP) development method.
Schedules and costs
This form of sizing is quick and inexpensive, assum-
ing that automated code counting tools are available. However, if the
application has more than two programming languages, automated code
counting may not be possible. If the application uses some modern lan-
guage, code counting is impossible because there are no counting rules for
the buttons and pull-down menus used to "program" in some languages.
Cautions and counter indications
The main counter indication is that lines of code metrics penalize
high-level languages. Another caution is that this method is hazardous
for sizing requirements, specifications, and paper documents. Also, counts
of physical lines of code may differ from counts of logical statements by
more than 500 percent. Since the software literature and published
productivity data are ambiguous as to whether logical or physical lines
are used, this method has a huge margin of error.
Sizing Using Story Point Metrics
The Agile development method was created in part because of a reaction
against the traditional software cost drivers shown in Table 6-6. The
Agile pioneers felt that software had become burdened by excessive vol-
umes of paper requirements and specifications, many of which seemed
to have little value in actually creating a working application.
The Agile approach tries to simplify and minimize the production of
paper documents and to accelerate the ability to create working code.
The Agile philosophy is that the goal of software engineering is the cre-
ation of working applications in a cost-effective fashion. In fact, the goal
of the Agile method is to transform the traditional software cost drivers
into a more cost-effective sequence, as shown in Table 6-7.
TABLE 6-7  Rank Order of Agile Software Cost Elements
1. Programming
2. Meetings and communication (clients, team members, managers)
3. Defect removal (inspections, static analysis, testing, finding and fixing bugs)
4. Project management
5. Producing paper documents (plans, architecture, specifications, user manuals)

As part of simplifying the paper deliverables of software applications,
a method for gathering the requirements for Agile projects is that of user
stories. These are very concise statements of specific requirements that
consist of only one or two sentences, which are written on 3"×5" cards
to ensure compactness.
An example of a basic user story for a software cost-estimating tool
might be, The estimating tool should include currency conversion between
dollars, euros, and yen.
Once created, user stories are assigned relative weights called story
points, which reflect their approximate difficulty and complexity compared
with other stories for the same application. The currency conversion
example just shown is quite simple and straightforward (except for the
fact that currencies fluctuate on a daily basis), so it might be assigned a
weight of 1 story point. Currency conversion is a straightforward
mathematical calculation and also is readily available from online sources,
so this is not a difficult story or feature to implement.
The same cost-estimating application will of course perform other
functions that are much harder and more complex than currency con-
version. An example of a more difficult user story might be, The esti-
mating tool will show the effects of CMMI levels on software quality and
productivity.
This story is much harder to implement than currency conversion,
because the effects of CMMI levels vary with the size and nature of the
application being developed. For small and simple applications, CMMI
levels have hardly any impact, but for large and complex applications,
the higher CMMI levels have a significant impact. Obviously, this story
would have a larger number of story points than currency conversion,
and might be assigned a weight of 5, meaning that it is at least five
times as difficult as the previous example.
The assignment of story point weights for a specific application is
jointly worked out between the developers and the user representative.
Thus, for specific applications, there is probably a high degree of math-
ematical consistency between story point levels; that is, levels 1, 2, 3,
and so on, probably come close to capturing similar levels of difficulty.
The Agile literature tends to emphasize that story points are units of
size, not units of time or effort. That being said, story points are in fact
often used for estimating team velocity and even for estimating the
overall schedules of both sprints and entire applications.
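A minimal sketch of that common, if informal, practice is shown below; the backlog size, velocity, and sprint length are hypothetical.

# Hypothetical backlog, velocity, and sprint length.
import math

backlog_story_points = 240
velocity_per_sprint = 30   # story points completed per sprint, measured
sprint_length_weeks = 2    # from the team's own history

sprints_needed = math.ceil(backlog_story_points / velocity_per_sprint)
print(sprints_needed, "sprints, or about",
      sprints_needed * sprint_length_weeks, "weeks")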
However, user stories and therefore story points are very flexible, and
there is no guarantee that Agile teams on two different applications will
use exactly the same basis for assigning story point weights.
It may be that as the Agile approach gains more and more adherents
and wider usage, general rules for determining story point weights will
be created and utilized, but this is not the case circa 2009.
It would be theoretically possible to develop mathematical conversion
rules between story points and other metrics such as IFPUG function
points, COSMIC function points, use-case points, lines of code, and so
forth. However, for this to work, story points would need to develop
guidelines for consistency between applications. In other words, quanti-
ties such as 1 story point, 2 story points, and so on, would have to have
the same values wherever they were applied.
From looking at samples of story points, there does not seem to be a
strict linear relation between user stories and story points in terms of
effort. What might be a useful approximation is to assume that for each
increase of 1 in terms of story points, the IFPUG function points needed
for the story would double. For example:
Story Points    IFPUG Function Points
     1                   2
     2                   4
     3                   8
     4                  16
     5                  32
This method is of course hypothetical, but it would be interesting to
carry out trials and experiments and create a reliable conversion table
between story points and function points.
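For what it is worth, the hypothetical doubling rule shown in the table above is trivial to express in code; it is repeated here only as a basis for such experiments, not as an established conversion.

# The author's hypothetical doubling rule, not an established conversion.
def story_points_to_function_points(story_points):
    return 2 ** story_points

for sp in range(1, 6):
    print(sp, "story point(s) =", story_points_to_function_points(sp),
          "IFPUG function points (hypothetical)")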
It would be useful if the Agile community collected valid historical
data on effort, schedules, defects, and other deliverables and submitted
them to benchmarking organizations such as ISBSG. Larger volumes
of historical data would facilitate the use of story points for estimating
purposes and would also speed up the inclusion of story points in com-
mercial estimating tools such as COCOMO, KnowledgePlan, Price-S,
SEER, SLIM, and the like.
A few Agile projects have used function point metrics in addition to
story points. But as this book is written in 2009, no Agile projects have
submitted formal benchmarks to ISBSG or to other public benchmark
sources. Some Agile projects have been analyzed by private benchmark
organizations, but the results are proprietary and confidential.
As a result, there is no reliable quantitative data circa 2009 that
shows either Agile productivity or Agile quality levels. This is not a sign
of professional engineering, but it is a sign of how backwards "software
engineering" is compared with more mature engineering fields.
Timing of sizing with story points
While Agile projects attempt an overview of an entire application at the
start, user stories occur continuously with every sprint throughout
development. Therefore, user stories are intended primarily for the
current sprint and don't have much to say about future sprints that will
occur downstream. As a result, story points are hard to use for early
sizing of entire applications, although useful for the current sprint.
Usage of story point metrics
Agile is a very popular method, but it is far from being the only software
development method. The author estimates that circa 2009, about 1.5
million new applications are being developed. Of these, perhaps 200,000
use the Agile method and also use story points. Story points are used
primarily for small to mid-sized IT software applications between about
250 and 5000 function points.
Story points are not used often for large applications greater than
10,000 function points, nor are they often used for embedded, systems,
and military software.
Schedules and costs
Since story points are assigned informally by team
consensus, this form of sizing is quick and inexpensive. It is possible to
use collections of story cards and function points, too. User stories could
be used as a basis for function point analysis. But Agile projects tend to
stay away from function points. It would also be possible to use some of
the high-speed function point methods with Agile projects, but as this
book is written, there is no data that shows this being done.
Cautions and counter indications
The main counter indication for story points is that they tend to be
unique for specific applications. Thus, it is not easy to compare
benchmarks between two or more different Agile applications using story
points, because there is no guarantee that the applications used the same
weights for their story points.
Another counter indication is that story points are useless for com-
parisons with applications that were sized using function points, use-
case points, or any other software metric. Story points can only be used
for benchmark comparisons against other story points, and even here
the results are ambiguous.
A third counter indication is that there are no large-scale collections
of benchmark data that are based on story points. For some reason, the
Agile community has been lax on benchmarks and collecting historical
data. This is why it is so hard to ascertain if Agile has better or worse
productivity and quality levels than methods such as TSP, iterative
development, or even waterfall development. The shortage of quantita-
tive data about Agile productivity and quality is a visible weakness of
the Agile approach.
Sizing Using Use-Case Metrics
Use-cases have been in existence since the 1980s. They were originally
discussed by Ivar Jacobson and then became part of the unified model-
ing language (UML). Use-cases are also an integral part of the Rational
Unified Process (RUP), and Rational itself was acquired by IBM. Use-
cases have both textual and several forms of graphical representation.
Outside of RUP, use-cases are often used for object-oriented (OO) appli-
cations. They are sometimes used for non-OO applications as well.
Use-cases describe software application functions from the point of view
of a user or actor. Use-cases can occur in several levels of detail, including
"brief," "casual," and "fully dressed," which is the most detailed. The fully
dressed use-cases are of sufficient detail that they can be used for function
point analysis and also can be used to create use-case points.
Use-cases include other topics besides actors, such as preconditions,
postconditions, and several others. However, these are well defined and
fairly consistent from application to application.
Use-cases and user stories have similar viewpoints, but use-cases
are more formal and often much larger than user stories. Because of
the age and extensive literature about use-cases, they tend to be more
consistent from application to application than user stories do.
Some criticisms are aimed at use-cases for not dealing with nonfunc-
tional requirements such as security and quality. But this same criti-
cism could be aimed with equal validity at any design method. In any
case, it is not difficult to append quality, security, and other nonfunc-
tional design issues to use-cases.
Use-case points are based on calculations and logic somewhat simi-
lar to function point metrics in concept but not in specific details. The
factors that go into use-case points include technical and environmen-
tal complexity factors. Once calculated, use-case points can be used to
predict effort and costs for software development. About 20 hours of
development work per use-case has been reported, but the activities
that go into this work can vary.
Use-case diagrams and supporting text can be used to calculate func-
tion point metrics as well as use-case metrics. In fact, the rigor and con-
sistency of use-cases should allow automatic derivation of both use-case
points and function points.
The use-case community tends to be resistant to function points and
asserts that use-cases and function points look at different aspects,
which is only partly true. However, since both can yield information on
work hours per point, it is obvious that there are more similarities than
the use-case community wants to admit to.
If you assume that a work month consists of 22 days at 8 hours per
day, there are about 176 hours in a work month. Function point produc-
tivity averages about 10 function points per staff month, or 17.6 work
hours per function point.
Assuming that use-case productivity averages about 8.8 use-cases
per month, which is equivalent to 20 hours per use-case, it can be seen
that use-case points and IFPUG function points yield results that are
fairly close together.
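The arithmetic behind this comparison is simple enough to show directly. The sketch below is only a minimal illustration using the averages cited above (a 176-hour work month, 10 function points per staff month, and 20 hours per use-case); none of these are fixed constants.

# A minimal sketch of the work-month arithmetic described above.
HOURS_PER_MONTH = 22 * 8            # 22 work days at 8 hours = 176 hours

fp_per_month = 10.0                 # cited average for IFPUG function points
hours_per_fp = HOURS_PER_MONTH / fp_per_month           # 17.6 hours

hours_per_use_case = 20.0           # cited average for use-case points
use_cases_per_month = HOURS_PER_MONTH / hours_per_use_case   # 8.8

print(f"Hours per function point: {hours_per_fp:.1f}")
print(f"Use-cases per month:      {use_cases_per_month:.1f}")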
Other authors and benchmark organizations such as the David
Consulting Group and ISBSG have published data on conversion ratios
between IFPUG function point metrics and use-case points. While the
other conversion ratios are not exactly the same as the ones in this
chapter, they are quite close, and the differences are probably due to
using different samples.
There may be conversion ratios between use-case points and COSMIC
function points, Finnish function points, or other function point variants,
but the author does not use any of the variants and has not searched
their literature.
Of course, productivity rates using both IFPUG function points and
use-case points have wide ranges, but overall they are not far apart.
Timing of sizing with use-case points Use-cases are used to define require-
ments and specifications, so use-case points can be calculated when use-
cases are fairly complete; that is, toward the end of the requirements
phase. Unfortunately, formal estimates are often needed before this
time.
Usage of use-case points  RUP is a very popular method, but it is far from being the only software development method. The author esti-
mates that circa 2009, about 1.5 million new applications are being
developed. Of these, perhaps 75,000 use the RUP method and also use-
case points. Perhaps another 90,000 projects are object-oriented and
utilize use-cases, but not RUP. Use-case points are used for both small
and large software projects. However, the sheer volume of use-cases
becomes cumbersome for large applications.
Schedules and costs  Since use-case points have simpler calculations than function points, this form of sizing is somewhat quicker than func-
tion point analysis. Use-case points can be calculated at a range of
perhaps 750 per day, as opposed to about 400 per day for function point
analysis. Even so, the cost for calculating use-case points can top $3
per point if manual sizing is used. Obviously, automatic sizing would
be a great deal cheaper and also faster. In theory, automatic sizing of
use-case points could occur at rates in excess of 5000 use-case points
per day.
Cautions and counter indications  The main counter indication for use-case points is that there are no large collections of benchmark data
that use them. In other words, use-case points cannot yet be used for
comparisons with industry databases such as ISBSG, because function
point metrics are the primary metric for benchmark analysis.
Another counter indication is that use-case points are useless for com-
parisons with applications that were sized using function points, story
points, lines of code, or any other software metric. Use-case points can
only be used for benchmark comparisons against other use-case points,
and even here the results are sparse and difficult to find.
A third counter indication is that supplemental data such as pro-
ductivity and quality is not widely collected for projects that utilize
use-cases. For some reason, both the OO and RUP communities have
been lax on benchmarks and collecting historical data. This is why it
is so hard to ascertain if RUP or OO applications have better or worse
productivity and quality levels than other methods. The shortage of
quantitative data about RUP productivity and quality compared with
other methods such as Agile and TSP productivity and quality is a vis-
ible weakness of the use-case point approach.
Sizing Based on IFPUG Function
Point Analysis
Function point metrics were developed by A.J. Albrecht and his col-
leagues at IBM in response to a directive by IBM executives to find a
metric that did not distort economic productivity, as did the older lines
of code metric. After research and experimentation, Albrecht and his
colleagues developed a metric called "function point" that was indepen-
dent of code volumes.
Function point metrics were announced at a conference in 1978 and
put into the public domain. In 1984, responsibility for the counting rules
of function point metrics was transferred from IBM to a nonprofit organi-
zation called the International Function Point Users Group (IFPUG).
Sizing technologies based on function point metrics have been pos-
sible since this metric was introduced in 1978. Function point sizing is
more reliable than sizing based on lines of code because function point
metrics support all deliverable items: paper documents, source code,
test cases, and even bugs or defects. Thus, function point sizing has
transformed the task of sizing from a very difficult kind of work with a
high error rate to one of acceptable accuracy.
Although the counting rules for function points are complex today, the essence of function point analysis is a weighted formula that includes five elements:
1. Inputs
2. Outputs
3. Logical files
4. Inquiries
5. Interfaces
There are also adjustments for complexity. The actual rules for count-
ing function points are published by the International Function Point
Users Group (IFPUG) and are outside the scope of this section.
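Since the official IFPUG rules are outside the scope of this section, the following sketch is only an illustration of the shape of the calculation: it applies the commonly published weights for the five element types to a hypothetical set of counts and returns an unadjusted total. It is not a substitute for the IFPUG counting manual.

# Illustrative sketch of an unadjusted function point count using the
# weights commonly published for the five IFPUG element types. The
# official IFPUG counting manual is the authority; this only shows the
# shape of the weighted formula.

WEIGHTS = {                       # (low, average, high) complexity weights
    "inputs":        (3, 4, 6),
    "outputs":       (4, 5, 7),
    "inquiries":     (3, 4, 6),
    "logical_files": (7, 10, 15),
    "interfaces":    (5, 7, 10),
}

def unadjusted_fp(counts):
    """counts maps element type -> (low, average, high) occurrence counts."""
    total = 0
    for element, (lo, avg, hi) in counts.items():
        w_lo, w_avg, w_hi = WEIGHTS[element]
        total += lo * w_lo + avg * w_avg + hi * w_hi
    return total

# Hypothetical application with a handful of each element type.
example = {
    "inputs":        (5, 10, 2),
    "outputs":       (4, 8, 1),
    "inquiries":     (3, 5, 0),
    "logical_files": (2, 4, 1),
    "interfaces":    (1, 2, 0),
}
print(unadjusted_fp(example))     # unadjusted total, before complexity adjustments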
The function point counting items can be quantified by reviewing
software requirements and specifications. Note that conventional paper
specifications, use-cases, and user stories can all be used for function
point analysis. The counting rules also include complexity adjustments.
The exact rules for counting function points are outside the scope of this
book and are not discussed.
Now that function points are the most widely used software size
metric in the world, thousands of projects have been measured well
enough to extract useful sizing data for all major deliverables: paper
documents such as plans and manuals, source code, and test cases. Here
are a few examples from all three sizing domains. Table 6-8 illustrates
typical document volumes created for various kinds of software.
Table 6-8 illustrates only a small sample of the paperwork and docu-
ment sizing capabilities that are starting to become commercially avail-
able. In fact, as of 2009, more than 90 kinds of document can be sized
using function points, including translations into other national lan-
guages such as Japanese, Russian, Chinese, and so on.
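As an illustration of how such ratios are applied, the short sketch below multiplies an application's size in function points by pages-per-function-point ratios of the kind shown in Table 6-8. The two ratios used are the MIS column values from that table, and the 1000–function point application is hypothetical.

# Document sizing from function points, using pages-per-function-point
# ratios of the kind shown in Table 6-8 (MIS column values).
PAGES_PER_FP_MIS = {"User requirements": 0.45, "Test plans": 0.25}

app_size_fp = 1_000               # hypothetical application size
for document, ratio in PAGES_PER_FP_MIS.items():
    print(f"{document}: about {app_size_fp * ratio:.0f} pages")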
Not only can function points be used to size paper deliverables, but
they can also be used to size source code, test cases, and even software
bugs or defects. In fact, function point metrics can size the widest range
of software deliverables of any known metric.
For sizing source code volumes, data now is available on roughly 700
languages and dialects. There is also embedded logic in several com-
mercial software estimating tools for dealing with multiple languages
in the same application.
Since the function point total of an application is known at least
roughly by the end of requirements, and in some detail by the middle of
the specification phase, it is now possible to produce fairly accurate size
estimates for any application where function points are utilized. This
form of sizing is now a standard function for many commercial software
estimating tools such as COCOMO II, KnowledgePlan, Price-S, SEER,
SLIM, and others.
The usefulness of IFPUG function point metrics has made them the
metric of choice for software benchmarks. As of 2009, benchmarks based
on function points outnumber all other metrics combined.
TABLE 6-8   Number of Pages Created per Function Point for Software Projects

                              MIS        Systems    Military    Commercial
                              Software   Software   Software    Software
User requirements             0.45       0.50       0.85        0.30
Functional specifications     0.80       0.55       1.75        0.60
Logic specifications          0.85       0.50       1.65        0.55
Test plans                    0.25       0.10       0.55        0.25
User tutorial documents       0.30       0.15       0.50        0.85
User reference documents      0.45       0.20       0.85        0.90
Total document set            3.10       2.00       6.15        3.45

The ISBSG benchmark data currently has about 5000 projects and is growing at a rate of perhaps 500 projects per year.
The proprietary benchmarks by companies such as QPMG, the David
Consulting Group, Software Productivity Research, Galorath Associates,
and several others total perhaps 60,000 software projects using function
points and grow at a collective rate of perhaps 1000 projects per year.
There are no other known metrics that even top 1000 projects.
Over the past few years, concerns have been raised that software
applications also contain "nonfunctional requirements" such as per-
formance, quality, and so on. This is true, but the significance of these
tends to be exaggerated.
Consider the example of home construction. A major factor in the
cost of home construction is the size of the home, measured in terms of
square feet or square meters. The square footage, the amenities, and
the grade of construction materials are user requirements. But in the
author's state (Rhode Island), local building codes add significant costs
due to nonfunctional requirements. Homes built near a lake, river, or
aquifer require special hi-tech septic systems, which cost about $30,000
more than standard septic systems. Homes built within a mile of the
Atlantic Ocean require hurricane-proof windows, which cost about three
times more than standard windows.
These government mandates are not user requirements. But they
would not occur without a home being constructed, so they can be dealt
with as subordinate cost elements. Therefore, estimates and measures
such as "cost per square foot" are derived from the combination of func-
tional user requirements and government building codes that force
mandated nonfunctional requirements on homeowners.
Timing of IFPUG function point sizing  IFPUG function points are derived from requirements and specifications, and can be quantified by the time
initial requirements are complete. However, the first formal cost esti-
mates usually are needed before requirements are complete.
Usage of IFPUG function points  While the IFPUG method is the most widely used form of function point analysis, none of the function point
methods are used widely. Out of an approximate total of perhaps 1.5
million new software applications under development circa 2009, the
author estimates that IFPUG function point metrics are currently
being used on about 5000 applications. Function point variants, back-
firing, and function point approximation methods are probably in use
on another 2500 applications. Due to limitations in the function point
method itself, IFPUG function points are seldom used for applications
greater than 10,000 function points and can't be used at all for small
updates less than 15 function points in size.
Schedules and costs This form of sizing is neither quick nor inexpen-
sive. Function point analysis is so slow and expensive that applications
larger than about 10,000 function points are almost never analyzed.
Normal function point analysis requires a certified function point analyst if the counts are to be accurate (uncertified counts are highly inaccurate). Normal function point analysis proceeds at a rate of between 400
and 600 function points per day. At a daily average consulting fee of $3000,
the cost is between $5.00 and $7.50 for every function point counted.
Assuming an average cost of $6.00 per function point counted,
counting a 10,000­function point application would cost $60,000. This
explains why normal function point analysis is usually only performed
for applications in the 1000-function point size range.
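The cost arithmetic cited above is easy to reproduce; the sketch below uses only the consulting fee and counting rates from the text.

# Sketch of the counting-cost arithmetic cited above.
daily_fee = 3000.0                          # approximate daily consulting fee
for fp_per_day in (400, 600):
    print(fp_per_day, "FP/day ->", round(daily_fee / fp_per_day, 2), "$ per function point")

app_size = 10_000                           # function points
print("Approximate cost:", app_size * 6.00, "USD at $6 per function point")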
Later in this section, various forms of high-speed function point
approximation are discussed. It should be noted that automatic func-
tion point counting is possible when formal specification methods such
as use-cases are utilized.
Cautions and counter indications The main counter indication with func-
tion point analysis is that it is expensive and fairly time-consuming.
While small applications less than 1000 function points can be sized
in a few days, large systems greater than 10,000 function points would
require weeks. No really large systems greater than 100,000 function
points have ever been sized with function points due to the high costs
and the fact that the schedule for the analysis would take months.
Another counter indication is that from time to time, the counting
rules change. When this occurs, historical data based on older versions
of the counting rules may change or become incompatible with newer
data. This situation requires conversion rules from older to newer count-
ing rules. If nonfunctional requirements are indeed counted separately
from functional requirements, such a change in rules would cause sig-
nificant discontinuities in historical benchmark data.
Another counter indication is that there is a lower limit for function point
analysis. Small changes less than 15 function points can't be sized due to
the lower limits of the adjustment factors. Individually, these changes are
trivial, but within large companies, there may be thousands of them every
year, so their total cost can exceed several million dollars.
A caution is that accurate function point analysis requires certified
function point counters who have successfully passed the certification
examination offered by IFPUG. Uncertified counters should not be used
because the counting rules are too complex. As with tax regulations, the
rules change fairly often.
Function point analysis is accurate and useful, but slow and expen-
sive. As a result, a number of high-speed function point methods have
been developed and will be discussed later in this section.
Sizing Using Function Point Variations
The success of IFPUG function point metrics led to a curious situation.
The inventor of function point metrics, A.J. Albrecht, was an electri-
cal engineer by training and envisioned function points as a general-
purpose metric that could be used for information technology projects,
embedded software, systems software, and military software, and even
games and entertainment software. However, the first published results
that used function point metrics happened to be information technology
applications such as accounting and financial software.
The historical accident that function point metrics were first used for
IT applications led some researchers to conclude that function points
only worked for IT applications. As a result, a number of function point
variations have come into being, with many of them being aimed at sys-
tems and embedded software. These function point variations include
but are not limited to:
1. COSMIC function points
2. Engineering function points
3. 3-D function points
4. Full function points
5. Feature points
6. Finnish function points
7. Mark II function points
8. Netherlands function points
9. Object-oriented function points
10. Web-object function points
When IFPUG function points were initially used for systems and
embedded software, it was noted that productivity rates were lower
for these applications. This is because systems and embedded software
tend to be somewhat more complex than IT applications and really are
harder to build, so productivity will be about 15 percent lower than for
IT applications of the same size.
However, rather than accepting the fact that some embedded and
systems applications are tougher than IT applications and will there-
fore have lower productivity rates, many function point variants were
developed that increased the apparent size of embedded and systems
applications so that they appear to be about 15 percent larger than
when measured with IFPUG function points.
As mentioned earlier, it is an interesting point to think about, but one
of the reasons why IT projects seem to have higher productivity rates
than systems or embedded software is that IT project historical data
leaks a great deal more than historical data from systems and embedded
software. This is because IT applications are usually developed by a
cost center, but systems and embedded software are usually developed
by a profit center. This leakage is enough by itself to make IT projects
look at least 15 percent more productive than systems or embedded
applications of the same size in terms of function points. It is perhaps a
coincidence that the size increases for systems and embedded software
predicted by function point variants such as COSMIC are almost exactly
the same as the leakage rates from IT application historical data.
Not all of the function point variants are due to a desire to puff up the
sizes of certain kinds of software, but many had that origin. As a result, as of 2009 the term function point is extremely ambiguous and includes
many variations. It is not possible to mix these variants and have a single
unified set of benchmarks. Although some of the results may be similar,
mixing the variants into the same benchmark data collection would be like
mixing yards and meters or statute miles and nautical miles.
The function point variations all claim greater accuracy for certain
kinds of software than IFPUG function points, but what this means is
that the variations produce larger counts than IFPUG for systems and
embedded software and for some other types of software. This is not the
same thing as "accuracy" in an objective sense.
In fact, there is no totally objective way of ascertaining the accuracy of
either IFPUG function points or the variations. It is possible to ascertain
the differences in results between certified and uncertified counters, and
between groups of counters who calculate function points for the same
test case. But this is not true accuracy: it's only the spread of human
variation.
With so many variations, it is now very difficult to use any of them for
serious estimating and planning work. If you happen to use one of the vari-
ant forms of function points, then it is necessary to seek guidance from the
association or group that controls the specific counting rules used.
As a matter of policy, inventors of function point variants should be
responsible for creating conversion rules between these variants and
IFPUG function points, which are the oldest and original form of func-
tional measurement. However, with few exceptions, there are no really
effective conversion rules. There are some conversion rules between
IFPUG and COSMIC and also between several other variations such
as the Finnish and Netherlands functional metrics.
The older feature point metric was jointly developed by A.J. Albrecht
and the author, so it was calibrated to produce results that matched
IFPUG function points in over 90 percent of cases; for the other
10 percent, the counting rules created more feature points than function
points, but the two could be converted by mathematical means.
There are other metrics with multiple variations such as statute miles
and nautical miles, Imperial gallons and U.S. gallons, or temperature
measured using Fahrenheit or Celsius. Unfortunately, the software
industry has managed to create more metric variations than any other
form of "engineering." This is yet another sign that software engineering
is not yet a true engineering discipline, since it does not yet know how
to measure results with high precision.
Timing of function point variant sizing Both IFPUG function points and
the variations such as COSMIC are derived from requirements and
specifications, and can be quantified by the time initial requirements are
complete. However, the first formal cost estimates usually are needed
before requirements are complete.
Usage of function point variations  The four function point variations that are certified by the ISO standards organization include the IFPUG,
COSMIC, Netherlands, and Finnish methods. Because IFPUG is much
older, it has more users. The COSMIC, Netherlands, and Finnish meth-
ods probably have between 200 and 1000 applications currently using
them. The older Mark II method probably had about 2000 projects
mainly in the United Kingdom. The other function point variations
have perhaps 50 applications each.
Schedules and costs  IFPUG, COSMIC, and most variations require about the same amount of time. These forms of sizing are neither quick
nor inexpensive. Function point analysis of any flavor is so slow and
expensive that applications larger than about 10,000 function points
are almost never analyzed.
Normal function point analysis for all of the variations requires a certified function point analyst if the counts are to be accurate (uncertified counts are highly inaccurate). Normal function point analysis proceeds
at a rate of between 400 and 600 function points per day. At a daily
average consulting fee of $3000, the cost is between $5.00 and $7.50 for
every function point counted.
Assuming an average cost of $6 per function point counted for the
major variants, counting a 10,000­function point application would cost
$60,000. This explains why normal function point analysis is usually
only performed for applications in the 1000-function point size range.
Cautions and counter indications  The main counter indication with function point analysis for all variations is that it is expensive and
fairly time-consuming. While small applications less than 1000 function
points can be sized in a few days, large systems greater than 10,000
function points would require weeks. No really large systems greater
than 100,000 function points have ever been sized using either IFPUG
or the variations such as COSMIC due to the high costs and the fact
that the schedule for the analysis would take months.
Another counter indication is that there is a lower limit for function
point analysis. Small changes less than 15 function points can't be sized
due to the lower limits of the adjustment factors. This is true for all of
the variations such as COSMIC, Finnish, and so on. Individually, these
changes are trivial, but large companies could have thousands of them
every year at a total cost exceeding several million dollars.
A caution is that accurate function point analysis requires a certified
function point counter who has successfully passed the certification exam-
ination offered by the function point association that controls the metric.
Uncertified counters should not be used, because the counting rules are
too complex. As with tax regulations, the rules change fairly often.
Function point analysis is accurate and useful, but slow and expen-
sive. As a result, a number of high-speed function point methods have
been developed and will be discussed later in this section.
High-Speed Sizing Using Function
Point Approximations
The slow speed and high costs of normal function point analysis were
noted within a few years of the initial development of function point
metrics. Indeed, the very first commercial software cost-estimating tool
that supported function point metrics, SPQR/20 in 1985, supported a
method of high-speed function point analysis based on approximation
rather than actual counting.
The term approximation refers to developing a count of function points
without having access to, or knowledge of, every factor that determines
function point size when using normal function point analysis.
The business goal of the approximation methods is to achieve func-
tion point totals that would come within about 15 percent of an actual
count by a certified counter, but achieve that result in less than one
day of effort. Indeed, some of the approximation methods operate in
only a minute or two. The approximation methods are not intended
as a full substitute for function point analysis, but rather to provide
quick estimates early in development. This is because the initial cost
estimate for most projects is demanded even before requirements are
complete, so there is no way to carry out formal function point analysis
at that time.
There are a number of function point approximation methods circa
2009, but the ones that are most often used include
1. Unadjusted function points
2. Function points derived from simplified complexity adjustments
3. Function points "light"
4. Function points derived from data mining of legacy applications
5. Function points derived from questions about the application
6. Function points derived from pattern matching (discussed later in
this section)
The goal of these methods is to improve on the average counting speed
of about 400 function points per day found with normal function point
analysis. That being said, the "unadjusted" function point method seems
to achieve rates of about 700 function points per day. The method using
simplified complexity factors achieves rates of about 600 function points
per day. The function point "light" method achieves rates of perhaps 800
function points per day.
The function point light method was developed by David Herron of
the David Consulting Group, who is a certified function point counter.
His light method is based on simplifying the standard counting rules
and especially the complexity adjustments.
The method based on data mining of legacy applications is technically
interesting. It was developed by a company called Relativity Technologies
(now part of Micro Focus). For COBOL and other selected languages, the
Relativity function point tool extracts hidden business rules from source
code and uses them as the basis for function point analysis.
The technique was developed in conjunction with certified function
point analysts, and the results come within a few percentage points of
matching standard function point analysis. The nominal speed of this
approach is perhaps 1000 function points per minute (as opposed to 400
per day for normal counts). For legacy applications, this method can be
very valuable for retrofitting function points and using them to quantify
maintenance and enhancement work.
There are several methods of approximation based on questions
about the application. Software Productivity Research (SPR) and Total
Metrics both have such tools available. The SPR approximation methods
are embedded in the KnowledgePlan estimation tool. The Total Metrics
approximation method is called Function Point Outline and deals with
some interesting external attributes of software applications, such as
the size of the requirements or functional specifications.
As noted earlier in this chapter, function points have long been used
to measure and predict the size of requirements and specifications. The
FP Outline approach merely reverses the mathematics and uses known
document sizes to predict function points, which is essentially another
form of backfiring. Of course, document size is only one of the questions
asked, but the idea is to create function point approximations based on
easily available information.
The speed of the FP Outline tool and the other question-based func-
tion point approximation methods seems to be in the range of perhaps
4000 function points per day, as opposed to the 400 function points per
day of normal function point analysis.
Timing of function point approximation sizing  The methods based on questions about applications can be used earlier than standard function
points. Function points "light" can be used at the same time as stan-
dard function points; that is, when the requirements are known. The
data mining approach requires existing source code and hence is used
primarily for legacy applications. However, the approximation methods
that use questions about software applications can be used very early
in requirements: several months prior to when standard function point
analysis might be carried out.
Usage of function point approximations  The function point approximation methods vary in usage. The Relativity method and the Total Metrics
method were only introduced in 2008, so usage is still growing: perhaps
250 projects each. The older approximation methods may have as many
as 750 projects each.
Schedules and costs  The main purpose of the approximation methods is to achieve faster function point counts and lower costs than IFPUG,
COSMIC, or any other standard method of function point analysis. Their
speed of operation ranges between about twice that of standard function
points up to perhaps 20 times standard function point analysis. The cost
per function point counted runs from less than 1 cent up to perhaps $3,
but all are cheaper than standard function point analysis.
Cautions and counter indications  The main counter indication with function point approximation is accuracy. The Relativity method matches
standard IFPUG function points almost exactly. The other approximation
methods only come within about 15 percent of manual counts by certified
counters. Of course, coming within 15 percent of a normal count three months earlier than function points could otherwise be counted, and at perhaps one-tenth the cost of normal function point analysis, is a significant business advantage.
Sizing Legacy Applications Based
on "Backfiring" or LOC to Function
Point Conversion
The concept of backfiring is nothing more than reversing the direction
of the equations used when predicting source code size from function
points. The technology of backfiring or direct conversion of LOC data
into the equivalent number of function points was pioneered by Allan
Albrecht, the inventor of the function point metric. The first backfire
data was collected within IBM circa 1975 as a part of the original devel-
opment of function point metrics.
The first commercial software estimating tool to support backfiring
was SPQR/20, which came out in 1985 and supported bi-directional
sizing for 30 languages. Today, backfiring is a standard function for
many commercial software estimating tools such as the ones already
mentioned earlier in this section.
From 30 languages in 1985, the number of languages that can be
sized or backfired has now grown to more than 450 circa 2009, when
all dialects are counted. Of course, for the languages where no counting
rules exist, backfiring is not possible. Software Productivity Research
publishes an annual table of conversion ratios between logical lines of
code and function points, and the current edition circa 2009 contains
almost 700 programming languages and dialects. Similar tables are
published by other consulting organizations such as Gartner Group and
the David Consulting Group.
There are far too many programming languages to show more than a
few examples in this short subsection. Note also that the margin of error
when backfiring is rather large. Even so, the results are interesting and
now widely utilized. Following are examples taken from the author's
Table of Programming Languages and Levels, which is updated sev-
eral times a year by Software Productivity Research (Jones, 1996). This
data indicates the ranges and median values in the number of source
code statements required to encode one function point for selected lan-
guages. The counting rules for source code are based on logical state-
ments and are defined in an appendix of the author's book Applied
Software Measurement (McGraw-Hill, 2008). Table 6-9 shows samples
of the ratios of logical source code statements to function points. A full
table for all 2,500 or so programming languages would not fit within
the book.
Although backfiring is usually not as accurate as actually counting
function points, there is one special case where backfiring is more accu-
rate: very small modifications to software applications that have fewer
than 15 function points. For changes less than 1 function point, backfir-
ing is one of only two current approaches for deriving function points.
(The second approach is pattern matching, which will be discussed later
in this section.)
While backfiring is widely used and also supported by many com-
mercial software cost-estimating tools, the method is something of an
"orphan," because none of the function point user groups such as IFPUG,
COSMIC, and the like have ever established committees to evaluate back-
firing or produced definitive tables of backfire data.
TABLE 6-9   Ratios of Logical Source Code Statements to Function Points for Selected Programming Languages

                                              Source Statements per Function Point
Language                    Nominal Level       Low        Mean       High
1st Generation                   1.00           220         320        500
Basic assembly                   1.00           200         320        450
Macro assembly                   1.50           130         213        300
C                                2.50            60         128        170
BASIC (interpreted)              2.50            70         128        165
2nd Generation                   3.00            55         107        165
FORTRAN                          3.00            75         107        160
ALGOL                            3.00            68         107        165
COBOL                            3.00            65         107        150
CMS2                             3.00            70         107        135
JOVIAL                           3.00            70         107        165
PASCAL                           3.50            50          91        125
3rd Generation                   4.00            45          80        125
PL/I                             4.00            65          80         95
MODULA 2                         4.00            70          80         90
ADA 83                           4.50            60          71         80
LISP                             5.00            25          64         80
FORTH                            5.00            27          64         85
QUICK BASIC                      5.50            38          58         90
C++                              6.00            30          53        125
Ada 9X                           6.50            28          49        110
Data base                        8.00            25          40         75
Visual Basic (Windows)          10.00            20          32         37
APL (default value)             10.00            10          32         45
SMALLTALK                       15.00            15          21         40
Generators                      20.00            10          16         20
Screen painters                 20.00             8          16         30
SQL                             27.00             7          12         15
Spreadsheets                    50.00             3           6          9
One potential use of backfiring would be to convert historical data
for applications that used story points or use-case points into function
point form. This would only require deriving logical code size and then
using published backfire ratios.
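A minimal sketch of the backfiring calculation itself follows: a count of logical source statements divided by a published statements-per-function-point ratio. The ratios below are the mean values from Table 6-9; any real conversion should also consider the low and high ends of the ranges, since the spread is wide.

# Minimal backfiring sketch: logical statements divided by the published
# statements-per-function-point ratio for the language (Table 6-9 means).
STATEMENTS_PER_FP = {"C": 128, "COBOL": 107, "C++": 53, "SQL": 12}

def backfire(logical_loc, language):
    return logical_loc / STATEMENTS_PER_FP[language]

print(round(backfire(1_000_000, "COBOL")))   # roughly 9,346 function points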
It would also be fairly trivial for various kinds of code analyzers such
as complexity analysis tools or static analysis tools to include backfire
algorithms, as could compilers for that matter.
Even though the function point associations ignore backfiring, many
benchmark organizations such as Software Productivity Research (SPR),
the David Consulting Group, QPMG, Gartner Group, and so on, do pub-
lish tables of backfire conversion ratios.
While many languages in these various tables have the same level
from company to company, other languages vary widely in the apparent
number of source code statements per function point based on which
company's table is used. This is an awkward problem, and coopera-
tion among metrics consulting groups would be useful to the industry,
although it will probably not occur.
Somewhat surprisingly, as of 2009, all of the published data on back-
firing relates to standard IFPUG function point metrics. It would be
readily possible to generate backfiring rules for COSMIC function
points, story points, use-case points, or any other metric, but this does
not seem to have happened, for unknown reasons.
Timing of backfire function point sizing Since backfiring is based on source
code, its primary usage is for sizing legacy applications so that historical
maintenance data can be expressed in terms of function points. A sec-
ondary usage for backfiring is to convert historical data based on lines
of code metrics into function point data so it can be compared against
industry benchmarks such as those maintained by ISBSG.
Usage of backfire function points The backfire method was created in
part by A.J. Albrecht as a byproduct of creating function point met-
rics. Therefore, backfiring has been in continuous use since about 1975.
Because of the speed and ease of backfiring, more applications have
been sized with this method than almost any other. Perhaps as many
as 100,000 software applications have been sized via backfiring.
Schedules and costs If source code size is known, the backfiring form of
sizing is both quick and inexpensive. Assuming automated code count-
ing, rates of more than 10,000 LOC per minute can be converted into
function point form. This brings the cost down to less than 1 cent per
function point, as opposed to about $6 per function point for normal
manual function point analysis. Backfiring does not require a certified
counter. Of course, the accuracy is not very high.
Cautions and counter indications The main counter indication for back-
firing is that it is not very accurate. Due to variations in program-
ming styles, individual programmers can vary by as much as 6-to-1 in
the number of lines of code used to implement the same functionality.
Therefore, backfiring also varies widely. When backfiring is used for hundreds of applications in the same language, such as COBOL, the average value of about 106 code statements in the procedure and data divisions yields reasonably accurate function point totals. But for languages with few samples, the ranges are very wide.
A second caution is that there are no standard methods for counting
lines of code. The backfire approach was originally developed based on
counts of logical statements. If backfiring is used on counts of physical
lines, the results might vary by more than 500 percent from backfiring
the same samples using logical statements.
Another counter indication is that backfiring becomes very compli-
cated for applications coded in two or more languages. There are auto-
mated tools that can handle backfire conversions for any number of
languages, but it is necessary to know the proportions of code in each
language for the tools to work.
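For the multi-language case just described, the conversion is applied language by language and summed, which is roughly what the automated tools do. The sketch below assumes a hypothetical language mix and uses the Table 6-9 mean ratios.

# Hedged sketch of multi-language backfiring: each language's logical
# statements are converted with its own ratio and the results are summed.
STATEMENTS_PER_FP = {"COBOL": 107, "SQL": 12, "C": 128}

def backfire_mixed(loc_by_language):
    return sum(loc / STATEMENTS_PER_FP[lang]
               for lang, loc in loc_by_language.items())

legacy_app = {"COBOL": 800_000, "SQL": 50_000, "C": 150_000}   # hypothetical mix
print(round(backfire_mixed(legacy_app)))    # approximate size in function points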
A final caution is that the published rules that show conversion ratios
between lines of code and function points vary based on the source. The
published rules by the David Consulting Group, Gartner Group, the
Quality and Productivity Management Group (QPMG), and Software
Productivity Research (SPR) do not show the same ratios for many
languages. Since none of the function point associations such as IFPUG
have ever studied backfiring, nor have any universities, there is no over-
all authoritative source for validating backfire assumptions.
Backfiring remains popular and widely used, even though of question-
able accuracy. The reason for its popularity is because of the high costs
and long schedules associated with normal function point analysis.
Sizing Based on Pattern Matching
The other sizing methods in this section are in the public domain and are
available for use as needed. But sizing based on pattern matching has had
a patent application filed, so the method is not yet generally available.
The pattern-matching method was not originally created as a sizing
method. It was first developed to provide an unambiguous way of identify-
ing applications for benchmark purposes. After several hundred applica-
tions had been measured using the taxonomy, it was noted that applications
with the same patterns on the taxonomy were of the same size.
Pattern matching is based on the fact that thousands of legacy applica-
tions have been created, and for a significant number, size data already
exists. By means of a taxonomy that captures the nature, scope, class,
and type of existing software applications, a pattern is created that can
be used to size new software applications.
What makes pattern-matching work is a taxonomy that captures
key elements of software applications. The taxonomy consists of seven
topics: (1) nature, (2) scope, (3) class, (4) type, (5) problem complexity,
(6) code complexity, and (7) data complexity. Each topic uses numeric
values for identification.
In comparing one software project against another, it is important to
know exactly what kinds of software applications are being compared. This
is not as easy as it sounds. The industry lacks a standard taxonomy of software projects that can be used to identify projects in a clear and unambiguous fashion, other than the taxonomy described here.
The author has developed a multipart taxonomy for classifying proj-
ects in an unambiguous fashion. The taxonomy is copyrighted and
explained in several of the author's previous books including Estimating
Software Costs (McGraw-Hill, 2007) and Applied Software Measurement
(McGraw-Hill, 2008). Following is the taxonomy:
When the taxonomy is used for benchmarks, four additional factors
from public sources are part of the taxonomy:
Country code         =    1       (United States)
Region code          =    06      (California)
City code            =    408     (San Jose)
NAIC industry code   =    1569    (Telecommunications)
These codes are from telephone area codes, ISO codes, and the North
American Industry Classification (NAIC) codes of the Department of
Commerce. These four codes do not affect the size of applications, but
provide valuable information for benchmarks and international eco-
nomic studies. This is because software costs vary widely by country,
geographic region, and industry. For historical data to be meaningful, it
is desirable to record all of the factors that influence costs.
The portions of the taxonomy that are used for estimating application
size include the following factors:
PROJECT NATURE: __
1. New program development
2. Enhancement (new functions added to existing software)
3. Maintenance (defect repair to existing software)
4. Conversion or adaptation (migration to new platform)
5. Reengineering (re-implementing a legacy application)
6. Package modification (revising purchased software)
PROJECT SCOPE: __
1. Algorithm
2. Subroutine
3. Module
4. Reusable module
5. Disposable prototype
6. Evolutionary prototype
7. Subprogram
8. Stand-alone program
9. Component of a system
10. Release of a system (other than the initial release)
11. New departmental system (initial release)
12. New corporate system (initial release)
13. New enterprise system (initial release)
14. New national system (initial release)
15. New global system (initial release)
PROJECT CLASS: __
1. Personal program, for private use
2. Personal program, to be used by others
3. Academic program, developed in an academic environment
4. Internal program, for use at a single location
5. Internal program, for use at multiple locations
6. Internal program, for use on an intranet
7. Internal program, developed by external contractor
8. Internal program, with functions used via time sharing
9. Internal program, using military specifications
10. External program, to be put in public domain
11. External program to be placed on the Internet
12. External program, leased to users
13. External program, bundled with hardware
14. External program, unbundled and marketed commercially
15. External program, developed under commercial contract
16. External program, developed under government contract
17. External program, developed under military contract
PROJECT TYPE: __
1. Nonprocedural (generated, query, spreadsheet)
2. Batch application
3. Web application
4. Interactive application
5. Interactive GUI applications program
6. Batch database applications program
7. Interactive database applications program
8. Client/server applications program
9. Computer game
10. Scientific or mathematical program
11. Expert system
12. Systems or support program including "middleware"
13. Service-oriented architecture (SOA)
14. Communications or telecommunications program
15. Process-control program
16. Trusted system
17. Embedded or real-time program
18. Graphics, animation, or image-processing program
19. Multimedia program
20. Robotics, or mechanical automation program
21. Artificial intelligence program
22. Neural net program
23. Hybrid project (multiple types)
PROBLEM COMPLEXITY: ________
1. No calculations or only simple algorithms
2. Majority of simple algorithms and simple calculations
3. Majority of simple algorithms plus a few of average complexity
4. Algorithms and calculations of both simple and average complexity
5. Algorithms and calculations of average complexity
6. A few difficult algorithms mixed with average and simple
7. More difficult algorithms than average or simple
8. A large majority of difficult and complex algorithms
9. Difficult algorithms and some that are extremely complex
10. All algorithms and calculations are extremely complex
CODE COMPLEXITY: _________
1. Most "programming" done with buttons or pull-down controls
2. Simple nonprocedural code (generated, database, spreadsheet)
3. Simple plus average nonprocedural code
4. Built with program skeletons and reusable modules
5. Average structure with small modules and simple paths
6. Well structured, but some complex paths or modules
7. Some complex modules, paths, and links between segments
8. Above average complexity, paths, and links between segments
9. Majority of paths and modules are large and complex
10. Extremely complex structure with difficult links and large modules
DATA COMPLEXITY: _________
1. No permanent data or files required by application
2. Only one simple file required, with few data interactions
3. One or two files, simple data, and little complexity
4. Several data elements, but simple data relationships
5. Multiple files and data interactions of normal complexity
6. Multiple files with some complex data elements and interactions
7. Multiple files, complex data elements and data interactions
8. Multiple files, majority of complex data elements and interactions
9. Multiple files, complex data elements, many data interactions
10. Numerous complex files, data elements, and complex interactions
As most commonly used for either measurement or sizing, users will
provide a series of integer values to the factors of the taxonomy, as
follows:
PROJECT NATURE          1
PROJECT SCOPE           8
PROJECT CLASS           11
PROJECT TYPE            15
PROBLEM COMPLEXITY      5
DATA COMPLEXITY         6
CODE COMPLEXITY         2
Although integer values are used for nature, scope, class, and type, up
to two decimal places can be used for the three complexity factors. The
algorithms will interpolate between integer values. Thus, permissible
values might also be
PROJECT NATURE          1
PROJECT SCOPE           8
PROJECT CLASS           11
PROJECT TYPE            15
PROBLEM COMPLEXITY      5.25
DATA COMPLEXITY         6.50
CODE COMPLEXITY         2.45
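To make the idea of a "pattern" concrete, the sketch below holds the seven taxonomy answers in a simple record. The field names come from the taxonomy above, while the record type itself is only an illustration and not part of the patented method.

# A minimal sketch of a taxonomy "pattern" as a simple record. The seven
# fields come from the taxonomy above; the dataclass itself is only an
# illustration, not the patented method.
from dataclasses import dataclass, astuple

@dataclass(frozen=True)
class TaxonomyPattern:
    nature: float
    scope: float
    project_class: float
    project_type: float
    problem_complexity: float
    data_complexity: float
    code_complexity: float

example = TaxonomyPattern(1, 8, 11, 15, 5.25, 6.50, 2.45)
print(astuple(example))       # the pattern used for matching against measured projects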
The combination of numeric responses to the taxonomy provides a
unique "pattern" that facilitates both measurement and sizing. The funda-
mental basis for sizing based on pattern matching rests on two points:
1. Observations have demonstrated that software applications that
have identical patterns in terms of the taxonomy are also close to
being identical in size expressed in function points.
2. The seven topics of the taxonomy are not equal in their impacts.
The second key to pattern matching is the derivation of the relative
weights that each factor provides in determining application size.
To use the pattern-matching approach, mathematical weights are
applied to each parameter. The specific weights are defined in the
patent application for the method and are therefore proprietary and
not included here. However, the starting point for the pattern-matching
approach is the average sizes of the software applications covered by the
"scope" parameter. Table 6-10 illustrates the unadjusted average values
prior to applying mathematical adjustments.
As shown in Table 6-10, an initial starting size for a software applica-
tion is based on user responses to the scope parameter. Each answer is
assigned an initial starting size value in terms of IFPUG function points.
These size values have been determined by examination of applications
already sized using standard IFPUG function point analysis. The initial
size values represent the mode of applications or subcomponents that
have been measured using function points.
The scope parameter by itself only provides an approximate initial
value. It is then necessary to adjust this value based on the other param-
eters of class, type, problem complexity, code complexity, and data com-
plexity. These adjustments are part of the patent application for sizing
based on pattern matching.
From time to time, new forms of software will be developed. When this
occurs, the taxonomy can be expanded to include the new forms.
TABLE 6-10   Initial Starting Values for Sizing by Pattern Matching

APPLICATION SCOPE PARAMETER
Value    Definition                        Size in Function Points
 1.      Algorithm                                       1
 2.      Subroutine                                      5
 3.      Module                                         10
 4.      Reusable module                                20
 5.      Disposable prototype                           50
 6.      Evolutionary prototype                        100
 7.      Subprogram                                    500
 8.      Stand-alone program                         1,000
 9.      Component of a system                       2,500
10.      Release of a system                         5,000
11.      New Departmental system                    10,000
12.      New Corporate system                       50,000
13.      New Enterprise system                     100,000
14.      New National system                       250,000
15.      New Global system                         500,000
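The sketch below illustrates the general idea, and only the general idea: it looks up the unadjusted starting size from the scope parameter using the Table 6-10 values and then applies adjustment multipliers for the complexity factors. The starting sizes come from the table; the multipliers are purely hypothetical placeholders, since the real weights are proprietary to the patent application.

# Hedged sketch of pattern-matching sizing: starting size from the scope
# parameter (Table 6-10 values), then hypothetical complexity adjustments.
SCOPE_STARTING_SIZE = {
    1: 1, 2: 5, 3: 10, 4: 20, 5: 50, 6: 100, 7: 500, 8: 1_000,
    9: 2_500, 10: 5_000, 11: 10_000, 12: 50_000, 13: 100_000,
    14: 250_000, 15: 500_000,
}

def approximate_size(scope, problem, code, data):
    base = SCOPE_STARTING_SIZE[scope]
    # Hypothetical adjustment: treat 5 as "average" for each complexity
    # factor and nudge the base size a few percent per step above or below.
    adjustment = 1.0
    for factor in (problem, code, data):
        adjustment *= 1.0 + 0.05 * (factor - 5)
    return base * adjustment

# The example pattern from the text: scope 8 (stand-alone program),
# problem 5.25, data 6.50, code 2.45.
print(round(approximate_size(8, problem=5.25, code=2.45, data=6.50)))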
The taxonomy can be used well before an application has started its
requirements. Since the taxonomy contains information that should be
among the very first topics known about a future application, it is pos-
sible to use the taxonomy months before requirements are finished and
even some time before they begin.
It is also possible to use the taxonomy on legacy applications that
have been in existence for many years. It is often useful to know the
function point totals of such applications, but normal counting of func-
tion points may not be feasible since the requirements and specifications
are seldom updated and may not be available.
The taxonomy can also be used with commercial software, and indeed
with any form of software including classified military applications
where there is sufficient public or private knowledge of the application
to assign values to the taxonomy tables.
The taxonomy was originally developed to produce size in terms of
IFPUG function points and also logical source code statements. However,
the taxonomy could also be used to produce size in terms of COSMIC
function points, use-case points, or story points. To use the taxonomy
with other metrics, historical data would need to be analyzed.
The sizing method based on pattern matching can be used for any
size application ranging from small updates that are only a fraction
of a function point up to massive defense applications that might top
300,000 function points. Table 6-11 illustrates the pattern-matching sizing method for a sample of 150 software applications. Each application was sized in less than one minute.
TABLE 6-11   Sample of 150 Applications Sized Using Pattern Matching
Note 1: IFPUG rules version 4.2 are assumed.
Note 2: Code counts are based on logical statements, not physical lines.

Application                                Size in Function    Language   Total          Lines per
                                           Points (IFPUG 4.2)  Level      Source Code    Function Point
  1. Star Wars missile defense                  352,330         3.50      32,212,992        91
  2. Oracle                                     310,346         4.00      24,827,712        80
  3. WWMCCS                                     307,328         3.50      28,098,560        91
  4. U.S. Air Traffic control                   306,324         1.50      65,349,222       213
  5. Israeli air defense system                 300,655         4.00      24,052,367        80
  6. SAP                                        296,764         4.00      23,741,088        80
  7. NSA Echelon                                293,388         4.50      20,863,147        71
  8. North Korean border defenses               273,961         3.50      25,047,859        91
  9. Iran's air defense system                  260,100         3.50      23,780,557        91
 10. Aegis destroyer C&C                        253,088         4.00      20,247,020        80
 11. Microsoft VISTA                            157,658         5.00      10,090,080        64
 12. Microsoft XP                               126,788         5.00       8,114,400        64
 13. IBM MVS                                    104,738         3.00      11,172,000       107
 14. Microsoft Office Professional               93,498         5.00       5,983,891        64
 15. Airline reservation system                  38,392         2.00       6,142,689       160
 16. NSA code decryption                         35,897         3.00       3,829,056       107
 17. FBI Carnivore                               31,111         3.00       3,318,515       107
 18. Brain/Computer interface                    25,327         6.00       1,350,757        53
 19. FBI fingerprint analysis                    25,075         3.00       2,674,637       107
 20. NASA space shuttle                          23,153         3.50       2,116,878        91
 21. VA patient monitoring                       23,109         1.50       4,929,910       213
 22. F115 avionics package                       22,481         3.50       2,055,438        91
 23. Lexis-Nexis legal analysis                  22,434         3.50       2,051,113        91
 24. Russian weather satellite                   22,278         3.50       2,036,869        91
 25. Data warehouse                              21,895         6.50       1,077,896        49
 26. Animated film graphics                      21,813         8.00         872,533        40
 27. NASA Hubble controls                        21,632         3.50       1,977,754        91
 28. Skype                                       21,202         6.00       1,130,759        53
 29. Shipboard gun controls                      21,199         3.50       1,938,227        91
 30. Natural language translation                20,350         4.50       1,447,135        71
 31. American Express billing                    20,141         4.50       1,432,238        71
 32. M1 Abrams battle tank                       19,569         3.50       1,789,133        91
 33. Boeing 747 avionics package                 19,446         3.50       1,777,951        91
 34. NASA Mars rover                             19,394         3.50       1,773,158        91
 35. Travelocity                                 19,383         8.00         775,306        40
 36. Apple iPhone                                19,366        12.00         516,432        27
 37. Nuclear reactor controls                    19,084         2.50       2,442,747       128
 38. IRS income tax analysis                     19,013         4.50       1,352,068        71
 39. Cruise ship navigation                      18,896         4.50       1,343,713        71
 40. MRI medical imaging                         18,785         4.50       1,335,837        71
 41. Google search engine                        18,640         5.00       1,192,958        64
 42. Amazon web site                             18,080        12.00         482,126        27
 43. Order entry system                          18,052         3.50       1,650,505        91
 44. Apple Leopard                               17,884        12.00         476,898        27
 45. Linux                                       17,505         8.00         700,205        40
 46. Oil refinery process control                17,471         3.50       1,597,378        91
 47. Corporate cost accounting                   17,378         3.50       1,588,804        91
 48. FedEx shipping controls                     17,378         6.00         926,802        53
 49. Tomahawk cruise missile                     17,311         3.50       1,582,694        91
 50. Oil refinery process control                17,203         3.00       1,834,936       107
 51. ITT System 12 telecom                       17,002         3.50       1,554,497        91
 52. Ask search engine                           16,895         6.00         901,060        53
 53. Denver Airport luggage                      16,661         4.00       1,332,869        80
 54. ADP payroll application                     16,390         3.50       1,498,554        91
 55. Inventory management                        16,239         3.50       1,484,683        91
 56. eBay transaction controls                   16,233         7.00         742,072        46
 57. Patriot missile controls                    15,392         3.50       1,407,279        91
 58. Second Life web site                        14,956        12.00         398,828        27
 59. IBM IMS database                            14,912         1.50       3,181,283       213
 60. America Online (AOL)                        14,761         5.00         944,713        64
 61. Toyota robotic mfg.                         14,019         6.50         690,152        49
 62. Statewide child support                     13,823         6.00         737,226        53
 63. Vonage VOIP                                 13,811         6.50         679,939        49
 64. Quicken 2006                                11,339         6.00         604,761        53
 65. ITMPI web site                              11,033        14.00         252,191        23
 66. Motor vehicle registrations                 10,927         3.50         999,065        91
 67. Insurance claims handling                   10,491         4.50         745,995        71
 68. SAS statistical package                     10,380         6.50         511,017        49
 69. Oracle CRM features                          6,386         4.00         510,878        80
 70. DNA analysis                                 6,213         9.00         220,918        36
 71. Enterprise JavaBeans                         5,877         6.00         313,434        53
 72. Software renovation tool suite               5,170         6.00         275,750        53
 73. Patent data mining                           4,751         6.00         253,400        53
 74. EZ Pass vehicle controls                     4,571         4.50         325,065        71
 75. U.S. patent applications                     4,429         3.50         404,914        91
 76. Chinese submarine sonar                      4,017         3.50         367,224        91
 77. Microsoft Excel 2007                         3,969         5.00         254,006        64
 78. Citizens bank online                         3,917         6.00         208,927        53
 79. MapQuest                                     3,793         8.00         151,709        40
 80. Bank ATM controls                            3,625         6.50         178,484        49
 81. NVIDIA graphics card                         3,573         2.00         571,637       160
 82. Lasik surgery (wave guide)                   3,505         3.00         373,832       107
 83. Sun D-Trace utility                          3,309         6.00         176,501        53
 84. Microsoft Outlook                            3,200         5.00         204,792        64
 85. Microsoft Word 2007                          2,987         5.00         191,152        64
 86. Artemis Views                                2,507         4.50         178,250        71
 87. ChessMaster 2007 game                        2,227         6.50         109,647        49
 88. Adobe Illustrator                            2,151         4.50         152,942        71
 89. SpySweeper antispyware                       2,108         3.50         192,757        91
 90. Norton antivirus software                    2,068         6.00         110,300        53
 91. Microsoft Project 2007                       1,963         5.00         125,631        64
 92. Microsoft Visual Basic                       1,900         5.00         121,631        64
 93. Windows Mobile                               1,858         5.00         118,900        64
 94. SPR KnowledgePlan                            1,785         4.50         126,963        71
 95. All-in-one printer                           1,780         2.50         227,893       128
 96. AutoCAD                                      1,768         4.00         141,405        80
 97. Software code restructuring                  1,658         4.00         132,670        80
 98. Intel Math function library                  1,627         9.00          57,842        36
 99. Sony PlayStation game controls               1,622         6.00          86,502        53
100. PBX switching system                         1,592         3.50         145,517        91
101. SPR Checkpoint                               1,579         3.50         144,403        91
102. Microsoft Links golf game                    1,564         6.00          83,393        53
103. GPS navigation system                        1,518         8.00          60,730        40
104. Motorola cell phone                          1,507         6.00          80,347        53
105. Seismic analysis                             1,492         3.50         136,438        91
106. PRICE-S                                      1,486         4.50         105,642        71
107. Sidewinder missile controls                  1,450         3.50         132,564        91
108. Apple iPod                                   1,408        10.00          45,054        32
109. Property tax assessments                     1,379         4.50          98,037        71
110. SLIM                                         1,355         4.50          96,342        71
111. Microsoft DOS                                1,344         1.50         286,709       213
112. Mozilla Firefox                              1,340         6.00          71,463        53
113. CAI APO (original estimate)                  1,332         8.00          53,288        40
114. Palm OS                                      1,310         3.50         119,772        91
115. Google Gmail                                 1,306         8.00          52,232        40
116. Digital camera controls                      1,285         5.00          82,243        64
117. IRA account management                       1,281         4.50          91,096        71
118. Consumer credit report                       1,267         6.00          67,595        53
119. Laser printer driver                         1,248         2.50         159,695       128
120. Software complexity analyzer                 1,202         4.50          85,505        71
121. JAVA compiler                                1,185         6.00          63,186        53
122. COCOMO II                                    1,178         4.50          83,776        71
123. Smart bomb targeting                         1,154         5.00          73,864        64
124. Wikipedia                                    1,142        12.00          30,448        27
125. Music synthesizer                            1,134         4.00          90,736        80
126. Configuration control                        1,093         4.50          77,705        71
127. Toyota Prius engine                          1,092         3.50          99,867        91
128. Cochlear implant (internal)                  1,041         3.50          95,146        91
129. Nintendo Game Boy DS                         1,002         6.00          53,455        53
130. Casio atomic watch                             993         5.00          63,551        64
131. Football bowl selection                        992         6.00          52,904        53
132. COCOMO I                                       883         4.50          62,794        71
133. APAR analysis and routing                      866         3.50          79,197        91
134. Computer BIOS                                  857         1.00         274,243       320
135. Automobile fuel injection                      842         2.00         134,661       160
136. Antilock brake controls                        826         2.00         132,144       160
137. Quick Sizer Commercial                         794         6.00          42,326        53
138. CAI APO (revised estimate)                     761         8.00          30,450        40
139. LogiTech cordless mouse                        736         6.00          39,267        53
140. Function point workbench                       714         4.50          50,800        71
141. SPR SPQR/20                                    699         4.50          49,735        71
142. Instant messaging                              687         5.00          43,944        64
143. Golf handicap analyzer                         662         8.00          26,470        40
144. Denial of service virus                        138         2.50          17,612       128
145. Quick Sizer prototype                           30        20.00             480        16
146. ILOVEYOU computer worm                          22         2.50           2,838       128
147. Keystroke logger virus                          15         2.50           1,886       128
148. MYDOOM computer virus                            8         2.50           1,045       128
149. APAR bug report                               3.85         3.50             352        91
150. Screen format change                          0.87         4.50              62        71
AVERAGE                                          33,269         4.95       2,152,766        65
Because the pattern-matching approach is experimental and being cali-
brated, the information shown in Table 6-11 is provisional and subject to
change. The data should not be used for any serious business purpose.
Note that the column labeled "language level" refers to a mathematical rule that was developed at IBM in the 1970s. The original definition of "level" was the number of statements in a basic assembly language that would be needed to provide the same function as one statement in a higher-level language. Using this rule, COBOL is a "level 3" language because three assembly statements would be needed to provide the function of one COBOL statement. Using the same rule, Smalltalk would be a level 18 language, while Java would be a level 6 language.
When function point metrics were developed in IBM circa 1975, the
existing rules for language level were extended to include the number
of logical source code statements per function point.
For both backfiring and predicting source code size using pattern matching, language levels are a required parameter. Fortunately, published data on language levels exists for about 700 programming languages and dialects.
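As a minimal sketch of the arithmetic involved, assuming only the ratio implied by Table 6-11 (the Computer BIOS entry shows that a level 1 basic assembly language corresponds to roughly 320 logical statements per function point), forward sizing and backfiring become simple conversions:

# Minimal sketch of language-level sizing arithmetic. The 320 constant is
# an assumption taken from Table 6-11, where a level 1 (basic assembly)
# language corresponds to 320 logical statements per function point.

ASSEMBLY_STATEMENTS_PER_FP = 320  # level 1 baseline implied by Table 6-11

def lines_per_function_point(language_level: float) -> float:
    """Logical source statements needed per function point at a given level."""
    return ASSEMBLY_STATEMENTS_PER_FP / language_level

def predict_source_size(function_points: float, language_level: float) -> int:
    """Forward sizing: from function points to logical source statements."""
    return round(function_points * lines_per_function_point(language_level))

def backfire_function_points(source_statements: float, language_level: float) -> float:
    """Backfiring: from logical source statements to approximate function points."""
    return source_statements / lines_per_function_point(language_level)

# Entry 31 of Table 6-11: 20,141 function points at language level 4.50
print(predict_source_size(20_141, 4.50))                  # about 1.43 million statements
print(round(backfire_function_points(1_432_238, 4.50)))   # about 20,141 function points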
Timing of pattern-matching sizing Because the taxonomy used for pat-
tern matching is generic, it can be used even before requirements are
fully known. In fact, pattern matching is the sizing method that can be
applied the earliest in software development: long before normal func-
tion point analysis, story points, use-case points, or any other known
metric. It is the only method that can be used before requirements analysis begins, and hence it can provide a useful size approximation before any money is committed to a software project.
Usage of pattern matching  Because the pattern matching approach is covered by a patent application and still experimental, usage as of 2009 has been limited to about 250 trial software applications.
It should be noted that because pattern matching is based on an exter-
nal taxonomy rather than on specific requirements, the pattern-match-
ing approach can be used to size applications that are impossible to size
using any other method. For example, it is possible to size classified mili-
tary software being developed by other countries such as Iran and North
Korea, neither of which would provide such information knowingly.
Schedules and costs  The pattern-matching approach is embodied in a prototype sizing tool that can predict application size at rates in excess
of 300,000 function points per minute. This makes pattern matching
the fastest and cheapest sizing method yet developed. The method is so
fast and so easy to perform that several size estimates can easily be per-
formed using best-case, expected-case, and worst-case assumptions.
Even without the automated prototype, the pattern-matching calcu-
lations can be performed using a pocket calculator or even by hand in
perhaps 2 minutes per application.
Cautions and counter indications The main counter indication for pattern
matching is that it is still experimental and being calibrated. Therefore,
results may change unexpectedly.
Another caution is that the accuracy of pattern matching needs to be
examined with a large sample of historical projects that have standard
function point counts.
Sizing Software Requirements Changes
Thus far, all of the sizing methods discussed have produced size esti-
mates that are valid only for a single moment. Observations of software
projects indicate that requirements grow and change at rates of between
1 percent and more than 2 percent every calendar month during the
design and coding phases.
Therefore, if the initial size estimate at the end of the requirements
phase is 1000 function points, then this total might grow by 6 percent or
60 function points during the design phase and by 8 percent or 80 func-
tion points during the coding phase. When finally released, the original
1000 function points will have grown to 1140.
Because growth in requirements is related to calendar schedules,
really large applications in the 10,000-function point range or higher can
top 35 percent or even 50 percent in total growth. Obviously, this much
growth will have a significant impact on both schedules and costs.
Some software cost-estimating tools such as KnowledgePlan include
algorithms that predict growth rates in requirements and allow users
to either accept or reject the predictions. Users can also include their
own growth predictions.
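A minimal sketch of the creep arithmetic used in the worked example above; the phase growth percentages are treated as user-supplied assumptions rather than calibrated values:

# Minimal sketch of requirements-creep projection. As in the worked example,
# each phase's growth percentage is applied to the size measured at the end
# of the requirements phase. The percentages below are illustrative only.

def size_at_release(initial_fp: float, phase_growth_pct: dict) -> float:
    """Initial size plus the function points added by creep in each phase."""
    added = sum(initial_fp * pct / 100.0 for pct in phase_growth_pct.values())
    return initial_fp + added

# 1000 function points at the end of requirements, 6 percent creep during
# design and 8 percent during coding, as in the example above
print(size_at_release(1000, {"design": 6.0, "coding": 8.0}))  # 1140.0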
There are two flavors of requirements change:
Requirements creep These are changes to requirements that cause func-
tion point totals to increase and that also cause more source code to be
written. Such changes should be sized and, if they are significant, they should of course be included in revised cost and schedule estimates.
Requirements churn  These are changes that do not add to the function point size total of the application, but which may cause code to be writ-
ten. An example of churn might be changing the format or appearance
of an input screen, but not adding any new queries or data elements. An
analogy from home construction might be replacing existing windows
with hurricane-proof windows that fit the same openings. There is no
increase in the square footage or size of the house, but there will be
effort and costs.
Software application size is never stable and continues to change during
development and also after release. Therefore, sizing methods need to be
able to deal with changes and growth in requirements, and these require-
ments changes will also cause growth in source code volumes.
Requirements creep has a more significant impact than just growth
itself. As it happens, because changing requirements tend to be rushed,
they have higher defect potentials than the original requirements. Defects introduced by late changes also tend to be harder to find and eliminate, because inspections may be skipped and testing will be less thorough. As a result, creeping requirements on large software projects tend to contribute more delivered defects than the original requirements do. For large systems in the 10,000-function point range,
almost 50 percent of the delivered defects can be attributed to require-
ments changes during development.
Software Progress and
Problem Tracking
From working as an expert witness in a number of software lawsuits,
the author noted a chronic software project management problem. Many
projects that failed or had serious delays in schedules or quality prob-
lems did not identify any problems during development by means of
normal progress reports.
Depositions and discovery documents showed that both software engineers and first-line project managers knew about the problems, but the information was not included in status reports to clients and senior management when
the problems were first noticed. Not until very late, usually too late
to recover, did higher management or clients become aware of serious
delays, quality problems, or other significant issues.
When asked why the information was concealed, the lower managers usually said that they did not want to look bad to executives. Of
course, when the problems finally surfaced, the lower managers looked
very bad, indeed.
By contrast, projects that are successful always deal with problems
in a more rational fashion. They identify the problems early, assemble
task groups to solve them, and usually bring them under control before
they become so serious that they cannot be fixed. One of the interesting
features of the Agile method is that problems are discussed on a daily
basis. The same is true for the Team Software Process (TSP).
Software problems are somewhat like serious medical problems. They
usually don't go away by themselves, and many require treatment by
professionals in order to eliminate them.
Once a software project is under way, there are no fixed and reli-
able guidelines for judging its rate of progress. The civilian software
industry has long utilized ad hoc milestones such as completion of
design or completion of coding. However, these milestones are notori-
ously unreliable.
Tracking software projects requires dealing with two separate
issues: (1) achieving specific and tangible milestones, and (2) expend-
ing resources and funds within specific budgeted amounts.
Because software milestones and costs are affected by requirements
changes and "scope creep," it is important to measure the increase in
size of requirements changes, when they affect function point totals.
However, as mentioned in a previous section in this chapter, some
requirements changes do not affect function point totals, which are
termed requirements churn. Both creep and churn occur at random
intervals. Churn is harder to measure than creep and is often measured
via "backfiring" or mathematical conversion between source code state-
ments and function point metrics.
As of 2009, automated tools are available that can assist project man-
agers in recording the kinds of vital information needed for milestone
reports. These tools can record schedules, resources, size changes, and
also issues or problems.
For an industry now more than 60 years of age, it is somewhat sur-
prising that there is no general or universal set of project milestones for
indicating tangible progress. From the author's assessment and bench-
mark studies, following are some representative milestones that have
shown practical value.
Note that these milestones assume an explicit and formal review
connected with the construction of every major software deliverable.
Table 6-12 shows representative tracking milestones for large soft-
ware projects. Formal reviews and inspections have the highest defect
removal efficiency levels of any known kind of quality control activity,
and are characteristic of "best in class" organizations.
The most important aspect of Table 6-12 is that every milestone is
based on completing a review, inspection, or test. Just finishing up a
document or writing code should not be considered a milestone unless
the deliverables have been reviewed, inspected, or tested.
TABLE 6-12  Representative Tracking Milestones for Large Software Projects
 1. Requirements document completed
 2. Requirements document review completed
 3. Initial cost estimate completed
 4. Initial cost estimate review completed
 5. Development plan completed
 6. Development plan review completed
 7. Cost tracking system initialized
 8. Defect tracking system initialized
 9. Prototype completed
10. Prototype review completed
11. Complexity analysis of base system (for enhancement projects)
12. Code restructuring of base system (for enhancement projects)
13. Functional specification completed
14. Functional specification review completed
15. Data specification completed
16. Data specification review completed
17. Logic specification completed
18. Logic specification review completed
19. Quality control plan completed
20. Quality control plan review completed
21. Change control plan completed
22. Change control plan review completed
23. Security plan completed
24. Security plan review completed
25. User information plan completed
26. User information plan review completed
27. Code for specific modules completed
28. Code inspection for specific modules completed
29. Code for specific modules unit tested
30. Test plan completed
31. Test plan review completed
32. Test cases for specific test stage completed
33. Test case inspection for specific test stage completed
34. Test stage completed
35. Test stage review completed
36. Integration for specific build completed
37. Integration review for specific build completed
38. User information completed
39. User information review completed
40. Quality assurance sign off completed
41. Delivery to beta test clients completed
42. Delivery to clients completed
In the litigation where the author worked as an expert witness, these
criteria were not met. Milestones were very informal and consisted
primarily of calendar dates, without any validation of the materials
themselves.
Also, the format and structure of the milestone reports were inad-
equate. At the top of every milestone report, problems and issues or "red
flag" items should be highlighted and discussed first.
During depositions and reviews of court documents, it was noted that
software engineering personnel and many managers were aware of the
problems that later triggered the delays, cost overruns, quality prob-
lems, and litigation. At the lowest levels, these problems were often
included in weekly status reports or discussed at daily team meetings.
But for the higher-level milestone and tracking reports that reached
clients and executives, the hazardous issues were either omitted or
glossed over.
A suggested format for monthly progress tracking reports delivered
to clients and higher management would include these sections:
Suggested Format for Monthly Status Reports for Software Projects
1. New "red flag" problems noted this month
2. Status of last month's "red flag" problems
3. Discussion of "red flag" items more than one month in duration
4. Change requests processed this month versus change requests
predicted
5. Change requests predicted for next month
6. Size in function points for this month's change requests
7. Size in function points predicted for next month's change
requests
8. Change requests that do not affect size in function points
9. Schedule impacts of this month's change requests
10. Cost impacts of this month's change requests
11. Quality impacts of this month's change requests
12. Defects found this month versus defects predicted
13. Defect severity levels of defects found this month
14. Defect origins (requirements, design, code, etc.)
15. Defects predicted for next month
16. Costs expended this month versus costs predicted
17. Costs predicted for next month
18. Earned value for this month's deliverable (if earned value is used)
19. Deliverables completed this month versus deliverables predicted
20. Deliverables predicted for next month
Although the suggested format somewhat resembles the items calcu-
lated using the earned value method, this format deals explicitly with
the impact of change requests and also uses function point metrics for
expressing costs and quality data.
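A hypothetical sketch of how the suggested report format might be captured as a structured record, so that the "red flag" sections always lead and the quantitative fields are explicit; the field names below are illustrative only, not part of any standard:

# Hypothetical record for the suggested monthly status report. Field names
# are illustrative; they mirror the 20 sections listed above.

from dataclasses import dataclass, field

@dataclass
class MonthlyStatusReport:
    # Problems and issues always come first
    new_red_flags: list = field(default_factory=list)
    prior_red_flag_status: list = field(default_factory=list)
    long_running_red_flags: list = field(default_factory=list)
    # Change requests, with size expressed in function points
    changes_processed: int = 0
    changes_predicted_next_month: int = 0
    change_size_fp: float = 0.0
    change_size_fp_next_month: float = 0.0
    churn_requests: int = 0            # changes with no function point impact
    schedule_impact_days: float = 0.0
    cost_impact: float = 0.0
    quality_impact: str = ""
    # Defects found versus predicted, with severity and origin breakdowns
    defects_found: int = 0
    defects_predicted: int = 0
    defects_by_severity: dict = field(default_factory=dict)
    defects_by_origin: dict = field(default_factory=dict)
    defects_predicted_next_month: int = 0
    # Costs, earned value, and deliverables
    cost_actual: float = 0.0
    cost_predicted: float = 0.0
    cost_predicted_next_month: float = 0.0
    earned_value: float = 0.0          # only if earned value is used
    deliverables_completed: int = 0
    deliverables_predicted: int = 0
    deliverables_predicted_next_month: int = 0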
An interesting question is the frequency with which milestone prog-
ress should be reported. The most common reporting frequency is
monthly, although an exception report can be filed at any time it is
suspected that something has occurred that can cause perturbations.
For example, serious illness of key project personnel or resignation of
key personnel might very well affect project milestone completions, and
this kind of situation cannot be anticipated.
It might be thought that monthly reports are too far apart for small
projects that only last six or fewer months in total. For small projects,
weekly reports might be preferred. However, small projects usually do
not get into serious trouble with cost and schedule overruns, whereas
large projects almost always get in trouble with cost and schedule
overruns. This article concentrates on the issues associated with large
projects. In the litigation where the author has been an expert wit-
ness, every project under litigation except one was larger than 10,000
function points.
The simultaneous deployment of software sizing tools, estimating
tools, planning tools, and methodology management tools can pro-
vide fairly unambiguous points in the development cycle that allow
progress to be judged more or less effectively. For example, software
sizing technology can now predict the sizes of both specifications and
the volume of source code needed. Defect estimating tools can predict
the numbers of bugs or errors that might be encountered and discov-
ered. Although such milestones are not perfect, they are better than the
former approaches.
Project management is responsible for establishing milestones, moni-
toring their completion, and reporting truthfully on whether the mile-
stones were successfully completed or encountered problems. When
serious problems are encountered, it is necessary to correct the problems
before reporting that the milestone has been completed.
Failing or delayed projects usually lack serious milestone tracking.
Activities are often reported as finished while work is still ongoing.
Milestones on failing projects are usually dates on a calendar rather
than completion and review of actual deliverables.
Delivering documents or code segments that are incomplete, contain
errors, and cannot support downstream development work is not the
way milestones are used by industry leaders.
Another aspect of milestone tracking among industry leaders is what
happens when problems are reported or delays occur. The reaction
is strong and immediate: corrective actions are planned, task forces
assigned, and correction begins. Among laggards, on the other hand,
problem reports may be ignored, and very seldom do corrective actions
occur.
In more than a dozen legal cases involving projects that failed or were
never able to operate successfully, project tracking was inadequate in
every case. Problems were either ignored or brushed aside, rather than
being addressed and solved.
Because milestone tracking occurs throughout software development,
it is the last line of defense against project failures and delays. Milestones
should be established formally and should be based on reviews, inspec-
tions, and tests of deliverables. Milestones should not be the dates that
deliverables more or less were finished. Milestones should reflect the
dates that finished deliverables were validated by means of inspections,
testing, and quality assurance review.
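One way to make the "validated, not merely finished" rule concrete is to record both dates and treat a milestone as complete only when the validation date exists; a hypothetical sketch, not a description of any particular tracking tool:

# Hypothetical milestone record: a milestone counts as complete only when
# its deliverable has passed a review, inspection, or test.

from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Milestone:
    name: str                                   # e.g., "Requirements document"
    deliverable_finished: Optional[date] = None
    validation_passed: Optional[date] = None    # review, inspection, or test sign-off

    @property
    def completed(self) -> bool:
        return self.validation_passed is not None

    @property
    def completion_date(self) -> Optional[date]:
        # The reported milestone date is the validation date, not the drafting date
        return self.validation_passed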
An interesting form of project tracking has been developed by the
Shoulders Corp for keeping track of object-oriented projects. This method
uses a 3-D model of software objects and classes using Styrofoam balls
of various sizes that are connected by dowels to create a kind of mobile.
The overall structure is kept in a location viewable by as many team
members as possible. The mobile makes the status instantly visible to
all viewers. Color-coded ribbons indicate the status of each component, with different colors indicating design complete, code complete, documentation complete, and testing complete (gold).
possible problems or delays. This method provides almost instantaneous
visibility of overall project status. The same method has been automated
using a 3-D modeling package, but the physical structures are easier
to see and have proven more useful on actual projects. The Shoulders
Corporation method condenses a great deal of important information
into a single visual representation that nontechnical staff can readily
understand.
Daily status meetings that center on problems and possible delays are very useful. When formal written reports are submit-
ted to higher managers or clients, the data should be quantified. In addi-
tion, possible problems that might cause delays or quality issues should
be the very first topics in the report because they are more important
than any other topics that are included.
Software Benchmarking
As this book is being written in early 2009, a new draft standard on performance benchmarks is being circulated for review by the International Organization for Standardization (ISO). The current draft, which is not yet approved, deals with concepts and definitions and will be followed by additional standards later. Readers should check with ISO for additional information.
One of the main business uses of software measurement and metric
data is that of benchmarking, or comparing the performance of a com-
pany against similar companies within the same industry, or related
industries. (The same kind of data can also be used as a "baseline" for
measuring process improvements.)
The term benchmark is far older than the computing and software professions. It seems to have originated in carpentry as a mark of standard length on workbenches. The term soon spread to other domains.
Another early definition of benchmark was in surveying, where it indi-
cated a metal plate inscribed with the exact longitude, latitude, and
altitude of a particular point. Also from the surveying domain comes
the term baseline which originally defined a horizontal line measured
with high precision to allow it to be used for triangulation of heights
and distances.
When the computing industry began, the term benchmark was origi-
nally used to define various performance criteria for processor speeds,
disk and tape drive speeds, printing speeds, and the like. This definition
is still in use, and indeed a host of new and specialized benchmarks has
been created in recent years for new kinds of devices such as CD-ROM
drives, multisynch monitors, graphics accelerators, solid-state flash
disks, high-speed modems, and the like.
As a term for measuring the relative performance of organizations
in the computing and software domains, the term benchmark was first
applied to data centers in the 1960s. This was a time when computers
were entering the mainstream of business operations, and data centers
were proliferating in number and growing in size and complexity. This
usage is still common for judging the relative efficiencies of data center
operations.
Benchmark data has a number of uses and a number of ways of being
gathered and analyzed. The most common and significant ways of gath-
ering benchmark data are these five:
1. Internal collection for internal benchmarks This form is data
gathered for internal use within a company or government unit
by its own employees. In the United States, the author estimates
that about 15,000 software projects have been gathered using this
method, primarily by large and sophisticated corporations such as
AT&T, IBM, EDS, Microsoft, and the like. This internal benchmark
data is proprietary and is seldom made available to other organiza-
tions. The accuracy of internal benchmark data varies widely. For
some sophisticated companies such as IBM, internal data is very
accurate. For other companies, the accuracy may be marginal.
2. Consultant collection for internal benchmarks The second
form is that of data gathered for internal use within a company
or government unit by outside benchmark consultants. The author
estimates that about 20,000 software projects have been gathered
using this method, since benchmark consultants are fairly numer-
ous. This data is proprietary, with the exception that results may
be included in statistical studies without identifying the sources
of the data. Outside consultants are used because benchmarks are
technically complicated to do well, and specialists generally outper-
form untrained managers and software engineers. Also, the extensive
experience of benchmark consultants helps in eliminating leakage
and in finding other problems.
3. Internal collection for public or ISBSG benchmarks This
form is data gathered for submission to an external nonprofit bench-
mark organization such as the International Software Benchmarking
Standards Group (ISBSG) by a company's own employees. The author
estimates that in the United States perhaps 3000 such projects have
been submitted to the ISBSG. This data is readily available and
can be commercially purchased by companies and individuals. The
data submitted to ISBSG is also made available via monographs
and reports on topics such as estimating, the effectiveness of vari-
ous development methods, and similar topics. The questionnaires
for such benchmarks are provided to clients by the ISBSG, together
with instructions on how to collect the data. This method of gathering data is inexpensive, but the results may vary because answers are not always consistent from one company to another.
4. Consultant collection for proprietary benchmarks This
form consists of data gathered for submission to an external for-
profit benchmark organization such as Gartner Group, the David
Consulting Group, Galorath Associates, the Quality and Productivity
Management Group, Software Productivity Research (SPR), and
others by consultants who work for the benchmark organizations.
Such benchmark data is gathered via on-site interviews. The author
estimates that perhaps 60,000 projects have been gathered by the
for-profit consulting organizations. This data is proprietary, with the
exception of statistical studies that don't identify data sources. For
example, this book and the author's previous book, Applied Software
Measurement, utilize corporate benchmarks gathered by the author
and his colleagues under contract. However, the names of the clients
and projects are not mentioned due to nondisclosure agreements.
5. Academic benchmarks This form is data gathered for academic
purposes by students or faculty of a university. The author estimates
that perhaps 2000 projects have been gathered using this method.
Academic data may be used in PhD or other theses, or it may be used
for various university research projects. Some of the academic data
will probably be published in journals or book form. Occasionally,
such data may be made available commercially. Academic data is
usually gathered via questionnaires distributed by e-mail, together
with instructions for filling them out.
When all of these benchmark sources are summed, the total is about
100,000 projects. Considering that at least 3 million legacy applications
exist and another 1.5 million new projects are probably in development,
the sum total of all software benchmarks is only about 2 percent of
software projects.
When the focus narrows to benchmark data that is available to the
general public through nonprofit or commercial sources, the U.S. total is
only about 3000 projects, which is only about 0.07 percent. This is far too
small a sample to be statistically valid for the huge variety of software
classes, types, and sizes created in the United States. The author sug-
gests that public benchmarks from nonprofit sources such as the ISBSG
should expand up to at least 2 percent or about 30,000 new projects out
of 1.5 million or so in development. It would also be useful to have at
least a 1 percent sample of legacy applications available to the public,
or another 30,000 projects.
A significant issue with current benchmark data to date is the unequal
distribution of project sizes. The bulk of all software benchmarks are for
projects between about 250 and 2500 function points. There is very little
benchmark data for applications larger than 10,000 function points,
even though these are the most expensive and troublesome kinds of
applications. There is almost no benchmark data available for small
maintenance projects below 15 function points in size, even though such
projects outnumber all other sizes put together.
Another issue with benchmark data is the unequal distribution by
project types. Benchmarks for IT projects comprise about 65 percent
of all benchmarks to date. Systems and embedded software comprise
about 15 percent, commercial software about 10 percent, and military
software comprises about 5 percent. (Since the Department of Defense
and the military services own more software than any other organiza-
tions on the planet, the lack of military benchmarks is probably due to
the fact that many military projects are classified.) The remaining 5
percent includes games, entertainment, iPhone and iPod applications,
and miscellaneous applications such as tools.
Categories of Software Benchmarks
There are a surprisingly large number of kinds of software benchmarks,
and they use different metrics, different methods, and are aimed at dif-
ferent aspects of software as a business endeavor.
Benchmarks are primarily collections of quantitative data that show
application, phase, or activity productivity rates. Some benchmarks
also include application quality data in the form of defects and defect
removal efficiency. In addition, benchmarks should also gather informa-
tion about the programming languages, tools, and methods used for the
application.
Over and above benchmarks, the software industry also performs soft-
ware process assessments. Software process assessments gather detailed
data on software best practices and on specific topics such as project
management methods, quality control methods, development methods,
maintenance methods, and the like. The process assessment method
developed by the Software Engineering Institute (SEI) that evaluates
an organization's "capability maturity level" is probably the best-known
form of assessment, but there are several others as well.
Since it is obvious that assessment data and benchmark data are
synergistic, there are also hybrid methods that collect assessment and
benchmark data simultaneously. These hybrid methods tend to use
large and complicated questionnaires and are usually performed via
on-site consultants and face-to-face interviews. However, it is possible
to use e-mail or web-based questionnaires and communicate with soft-
ware engineers and managers via Skype or some other method rather
than actual travel.
The major forms of software benchmarks included in this book circa
2009 are
1. International software benchmarks
2. Industry software benchmarks
3. Overall software cost and resource benchmarks
4. Corporate software portfolio benchmarks
5. Project-level software productivity and quality benchmarks
6. Phase-level software productivity and quality benchmarks
7. Activity-level software productivity and quality benchmarks
8. Software outsource versus internal performance benchmarks
9. Software maintenance and customer support benchmarks
10. Methodology benchmarks
11. Assessment benchmarks
12. Hybrid assessment and benchmark studies
13. Earned-value benchmarks
14. Quality and test coverage benchmarks
15. Cost of quality (COQ) benchmarks
16. Six Sigma benchmarks
17. ISO quality standard benchmarks
18. Security benchmarks
19. Software personnel and skill benchmarks
20. Software compensation benchmarks
21. Software turnover or attrition benchmarks
22. Software performance benchmarks
23. Software data center benchmarks
24. Software customer satisfaction benchmarks
25. Software usage benchmarks
26. Software litigation and failure benchmarks
27. Award benchmarks
As can be seen from this rather long list of software-related bench-
marks, the topic is much more complicated than might be thought.
International software benchmarks  Between the recession and global software competition, it is becoming very important to be able to com-
pare software development practices around the world. International
software benchmarking is a fairly new domain, but has already begun
to establish a substantial literature, with useful books by Michael
Cusumano, Watts Humphrey, Howard Rubin, and Edward Yourdon as
well as by the author of this book. One weakness with the ISBSG data
is that country of origin is deliberately concealed. This policy should be
reconsidered in light of the continuing recession.
When performing international benchmarks, many local factors need
to be recorded. For example, Japan has at least 12 hours of unpaid over-
time per week, while other countries such as Canada and Germany have
hardly any. In Japan the workweek is about 44 hours, while in Canada
it is only 36 hours. Vacation days also vary from country to country,
as do the number of public holidays. France and the EU countries, for
example, have more than twice as many vacation days as the United
States.
Of course, the most important international topics for the purposes of
outsourcing are compensation levels and inflation rates. International
benchmarks are a great deal more complex than domestic benchmarks.
Industry benchmarks As the recession continues, more and more atten-
tion is paid to severe imbalances among industries in terms of costs
and salaries. For example, the large salaries and larger bonuses paid to
bankers and financial executives have shocked the world business com-
munity. Although not as well-known because the amounts are smaller,
financial software executives and financial software engineering per-
sonnel earn more than similar personnel in other industries, too. As
the recession continues, many companies are facing the difficult ques-
tion of whether to invest significant amounts of money and effort into
improving their own software development practices, or to turn over all
software operations to an outsourcing vendor who may already be quite
sophisticated. Benchmarks of industry schedules, effort, and costs will
become increasingly important.
As of 2009, enough industry data exists to show interesting variations
between finance, insurance, health care, several forms of manufactur-
ing, defense, medicine, and commercial software vendors.
Overall software cost and resource benchmarks Cost and resources at
the corporate level are essentially similar to the classic data center
benchmarking studies, only transferred to a software development
organization. These studies collect data on the annual expenditures
for personnel and equipment, number of software personnel employed,
number of clients served, sizes of software portfolios, and other tangible
aspects associated with software development and maintenance. The
results are then compared against norms or averages from companies
of similar sizes, companies within the same industry, or companies that
have enough in common to make the comparisons interesting. These
high-level benchmarks are often produced by "strategic" consulting
organizations such as McKinsey, Gartner Group, and the like. This form
of benchmark does not deal with individual projects, but rather with
corporate or business-group expense patterns.
In very large enterprises with multiple locations, similar benchmarks
are sometimes used for internal comparisons between sites or divisions.
The large accounting companies and a number of management consult-
ing companies can perform general cost and resource benchmarks.
Corporate software portfolio benchmarks  A corporate portfolio can be as large as 10 million function points and contain more than 5000 applica-
tions. The applications can include IT projects, systems software, embed-
ded software, commercial software, tools, outsourced applications, and
open-source applications. Very few companies know how much software
is in their portfolios. Considering that the total portfolio is perhaps the
most valuable asset that the company owns, the lack of portfolio-level
benchmarks is troubling.
There are so few portfolio benchmarks because of the huge size of
portfolios and the high costs of collecting data on the entire mass of
software owned by large corporations.
A portfolio benchmark study in which the author participated for
a large manufacturing conglomerate took about 12 calendar months
and involved 10 consultants who visited at least 24 countries and 60
companies owned by the conglomerate. Just collecting data for this one
portfolio benchmark cost more than $2 million. However, the value of
the portfolio itself was about $15 billion. That is a very significant asset
and therefore deserves to be studied and understood.
Of course, for a smaller company whose portfolio was concentrated
in a single data center, such a study might have been completed in a
month by only a few consultants. But unfortunately, large corporations
are usually geographically dispersed, and their portfolios are highly
fragmented across many cities and countries.
Project-level productivity and quality benchmarks Project-level produc-
tivity and quality benchmarks drop down below the level of entire
organizations and gather data on specific projects. These project-level
benchmark studies accumulate effort, schedule, staffing, cost, and qual-
ity data from a sample of software projects developed and/or maintained
by the organization that commissioned the benchmark. Sometimes the
sample is as large as 100 percent, but more often the sample is more
limited. For example, some companies don't bother with projects below
a certain minimum size, such as 50 function points, or exclude projects
that are being developed for internal use as opposed to projects that are
going to be released to external clients.
Project-level productivity and quality benchmarks are sometimes per-
formed using questionnaires or survey instruments that are e-mailed or
distributed to participants. This appears to be the level discussed in the
new ISO draft benchmark standard. Data at the project level includes
schedules, effort in hours or months, and costs. Supplemental data on
programming languages and methodologies may be included. Quality
data should be included, but seldom is.
To avoid "apples to oranges" comparisons, companies that perform
project-level benchmark studies normally segment the data so that sys-
tems software, information systems, military software, scientific soft-
ware, and other kinds of software are compared against projects of the
same type. Data is also segmented by application size, to ensure that
very small projects are not compared against huge systems. New proj-
ects and enhancement and maintenance projects are also segmented.
Although collecting data at the project level is fairly easy to do, there
is no convenient way to validate the data or to ensure that "leakage"
has not omitted a significant quantity of work and therefore costs. The
accuracy of project level data is always suspect.
Phase-level productivity and quality benchmarks  Unfortunately, project-level data is essentially impossible to validate and therefore tends to
be unreliable. Dropping down to the level of phases provides increased
granularity and therefore increased value. There are no standard defi-
nitions of phases that are universally agreed to circa 2009. However,
a common phase pattern includes requirements, design, development,
and testing.
When a benchmark study is carried out as a prelude to software process
improvement activities, the similar term baseline is often used. In this
context, the baseline reflects the productivity, schedule, staffing, and/or
quality levels that exist when the study takes place. These results can
then be used to measure progress or improvements at future intervals.
Benchmarks and baselines collect identical information and are essen-
tially the same. Project-level data is not useful for baselines, so phase-
level data is the minimum level of granularity that can show process
improvement results.
Phase-level benchmarks are used by the ISBSG and also frequently
used in academic studies. In fact, the bulk of the literature on software
benchmarks tends to deal with phase-level data. Enough phase-level
data is now available to have established fairly accurate averages and
ranges for the United States, and preliminary averages for many other
countries.
Activity-level productivity and quality benchmarks Unfortunately, mea-
surement that collects only project data is impossible to validate. Phase-
level data is hard to validate because many activities such as technical
documentation and project management cross phase boundaries.
Activity-based benchmarks are even more detailed than the project-
level benchmarks already discussed. Activity-based benchmarks drop
down to the level of the specific kinds of work that must be performed in
order to build a software application. For example, the 25 activities used
by the author since the 1980s include specific sub-benchmarks for require-
ments, prototyping, architecture, planning, initial design, detail design,
design reviews, coding, reusable code acquisition, package acquisition,
code inspections, independent verification and validation, configuration
control, integration, user documentation, unit testing, function testing,
integration testing, system testing, field testing, acceptance testing, inde-
pendent testing, quality assurance, installation, and management.
Activity-based benchmarks are more difficult to perform than other
kinds of benchmark studies, but the results are far more useful for
process improvement, cost reduction, quality improvement, schedule
improvement, or other kinds of improvement programs. The great
advantage of activity-based benchmarks is that they reveal very impor-
tant kinds of information that the less granular studies can't provide.
For example, for many kinds of software projects, the major cost drivers
are associated with the production of paper documents (plans, speci-
fications, user manuals) and with quality control (inspections, static
analysis, testing). Both paperwork costs and defect removal costs are
often more expensive than coding. Findings such as this are helpful in
planning improvement programs and calculating returns on invest-
ments. But to know the major cost drivers within a specific company
or enterprise, it is necessary to get down to the level of activity-based
benchmark studies.
Activity-based benchmarks are normally collected via on-site interviews,
although today Skype or a conference call might be used. The benchmark
interview typically takes about two hours and involves the project man-
ager and perhaps three team members. Therefore the client effort is about eight staff hours, plus the consultant's time for collecting the benchmark itself.
If function points are counted by the consultant, they would take addi-
tional time.
Software outsource versus internal performance benchmarks One of the
most frequent reasons that the author has been commissioned to carry
out productivity and quality benchmark studies is that a company
is considering outsourcing some or all of their software development
work.
Usually the outsource decision is being carried out high in the com-
pany at the CEO or CIO levels. The lower managers are alarmed that
they might lose their jobs, and so they commission productivity and
quality studies to compare in-house performance against both industry
data and also data from major outsource vendors in the United States
and abroad.
Until recently, U.S. performance measured in terms of function points
per month was quite good compared with the outsource countries of
China, Russia, India, and others. However, when costs were measured,
the lower labor costs overseas gave offshore outsourcers a competitive
edge. Within the past few years, inflation rates have risen faster over-
seas than in the United States, so the cost differential has narrowed.
IBM, for example, recently decided to build a large outsource center in
Iowa due to the low cost-of-living compared with other locations.
The continuing recession has resulted in a surplus of U.S. software
professionals and also lowered U.S. compensation levels. As a result, cost
data is beginning to average out across a large number of countries. The
recession is affecting other countries too, but since travel costs continue
to go up, it is becoming harder or at least less convenient to do business
overseas.
Software maintenance and customer support benchmarks  As of 2009, there are more maintenance and enhancement software engineers than
development software engineers. Yet benchmarks for maintenance and
enhancement work are not often performed. There are several reasons
for this. One reason is that maintenance work has no fewer than 23
different kinds of update to legacy applications, ranging from minor
changes through complete renovation. Another reason is that a great
deal of maintenance work involves changes less than 15 function points
in size, which is below the boundary level of normal function point
analysis. Although individually these small changes may be fast and
inexpensive, there are thousands of them, and their cumulative costs
in large companies total to millions of dollars per year.
One of the key maintenance metrics is maintenance assignment scope, or the amount of software one person can keep up and running. Other maintenance metrics include number of
users supported, rates at which bugs are fixed, and normal productivity
rates expressed in terms of function points per month or work hours
per function point. Defect potentials and defect removal efficiency level
are also important.
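A minimal sketch of how assignment scope feeds a staffing estimate; the scope value below is an illustrative assumption, since real values should come from a company's own benchmark data:

# Minimal sketch: maintenance assignment scope is the amount of software,
# in function points, that one maintenance programmer can keep running.
# The scope value in the example is an illustrative assumption only.

import math

def maintenance_staff_needed(portfolio_size_fp: float, assignment_scope_fp: float) -> int:
    """Approximate maintenance headcount for a portfolio of a given size."""
    return math.ceil(portfolio_size_fp / assignment_scope_fp)

# A 1,000,000 function point portfolio with an assumed assignment scope of
# 1,500 function points per maintenance programmer
print(maintenance_staff_needed(1_000_000, 1_500))  # 667 people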
One strong caution for maintenance benchmarks: the traditional "cost
per defect" metric is seriously flawed and tends to penalize quality. Cost
per defect achieves the lowest costs for the buggiest software. It also
seems to be cheaper early rather than late, but this is really a false
conclusion based on overhead rather than actual time and motion.
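A short worked sketch of why cost per defect behaves this way, using purely illustrative numbers: test preparation and execution are largely fixed costs, so the application with more bugs appears to have the lower cost per defect even though its total cost per function point is far higher.

# Illustrative arithmetic only: why "cost per defect" rewards buggy software.

FIXED_TEST_COST = 10_000.0   # assumed cost to write and run the test suite
COST_PER_FIX = 100.0         # assumed cost to repair one defect
SIZE_FP = 1_000.0            # both applications are 1,000 function points

def cost_per_defect(defects_found: int) -> float:
    return (FIXED_TEST_COST + COST_PER_FIX * defects_found) / defects_found

def cost_per_function_point(defects_found: int) -> float:
    return (FIXED_TEST_COST + COST_PER_FIX * defects_found) / SIZE_FP

print(cost_per_defect(500), cost_per_function_point(500))  # buggy app:   120.0 per defect, 60.0 per FP
print(cost_per_defect(50), cost_per_function_point(50))    # quality app: 300.0 per defect, 15.0 per FP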
The new requirements for service and customer support included in
the Information Technology Infrastructure Library (ITIL) are giving
a new impetus to maintenance and support benchmarks. In fact, ITIL
benchmarks should become a major subfield of software benchmarks.
Methodology benchmarks There are many different forms of software
development methodology such as Agile development, extreme program-
ming (XP), Crystal development, waterfall development, the Rational
Unified Process (RUP), iterative development, object-oriented develop-
ment (OO), rapid application development (RAD), the Team Software
Process (TSP), and dozens more. There are also scores of hybrid develop-
ment methods and probably hundreds of customized or local methods
used only by a single company.
In addition to development methods, a number of other approaches
can have an impact on software productivity, quality, or both. Some
of these include Six Sigma, quality function deployment (QFD), joint
application design (JAD), and software reuse.
Benchmark data should be granular and complete enough to dem-
onstrate the productivity and quality levels associated with various
development methods. The ISBSG benchmark data is complete enough
to do this. Also, the data gathered by for-profit benchmark organizations
such as QPMG and SPR can do this, but there are logistical problems.
The logistical problems include the following: Some of the popular
development methods such as Agile and TSP use nonstandard metrics
such as story points, use-case points, ideal time, and task hours. The
data gathered using such metrics is incompatible with major industry
benchmarks, all of which are based on function point metrics and stan-
dard work periods.
Another logistical problem is that very few organizations that use some
of these newer methods have commissioned benchmarks by outside con-
sultants or used the ISBSG data questionnaires. Therefore, the effective-
ness of many software development methods is ambiguous and uncertain.
Conversion of data to function points and standard work periods is technically possible, but has not yet been performed by the Agile community or by the users of most other methods that rely on nonstandard metrics.
Assessment benchmarks  Software assessment has been available in large companies such as IBM since the 1970s. IBM-style assessments
became popular when Watts Humphrey left IBM and created the assess-
ment method for the Software Engineering Institute (SEI) circa 1986.
By coincidence, the author also left IBM and created the Software
Productivity Research (SPR) assessment method circa 1984.
Software process assessments received a burst of publicity from the
publication of two books. One of these was Watts Humphrey's book
Managing the Software Process (Addison Wesley, 1989), which describes
the assessment method used by the Software Engineering Institute (SEI).
A second book on software assessments was the author's Assessment
and Control of Software Risks (Prentice Hall, 1994), which describes
the results of the assessment method used by Software Productivity
Research (SPR). Because both authors had been involved with software
assessments at IBM, the SEI and SPR assessments had some attributes
in common, such as a heavy emphasis on software quality.
Both the SEI and SPR assessments are similar in concept to medical
examinations. That is, both assessment approaches try to find every-
thing that is right and everything that may be wrong with the way
companies build and maintain software. Hopefully, not too much will be
wrong, but it is necessary to know what is wrong before truly effective
therapy programs can be developed.
By coincidence, both SPR and SEI utilize 5-point scales in evaluating
software performance. Unfortunately, the two scales run in opposite
directions. The SPR scale is based on a Richter scale, with the larger
numbers indicating progressively more significant hazards. The SEI
scale uses "1" as the most primitive score, and moves toward "5" as
processes become more rigorous. Following is the SEI scoring system,
and the approximate percentages of enterprises that have been noted
at each of the five levels.
SEI Scoring System for the Capability Maturity Model (CMM)
Definition          Frequency
1 = Initial         75.0%
2 = Repeatable      15.0%
3 = Defined          7.0%
4 = Managed          2.5%
5 = Optimizing       0.5%
As can be seen, about 75 percent of all enterprises assessed using the
SEI approach are at the bottom level, or "initial." Note also that the SEI
scoring system lacks a midpoint or average.
A complete discussion of the SEI scoring system is outside the scope
of this book. The SEI scoring is based on patterns of responses to a set
of about 150 binary questions. The higher SEI maturity levels require
"Yes" answers to specific patterns of questions.
Following is the SPR scoring system, and the approximate percent-
ages of results noted within three industry groups: military software,
systems software, and management information systems software.
SPR Assessment Scoring System
Definition       Frequency (Overall)   Military Frequency   Systems Frequency   MIS Frequency
1 = Excellent     2.0%                  1.0%                 3.0%                1.0%
2 = Good         18.0%                 13.0%                26.0%               12.0%
3 = Average      56.0%                 57.0%                50.0%               65.0%
4 = Poor         20.0%                 24.0%                20.0%               19.0%
5 = Very Poor     4.0%                  5.0%                 2.0%                3.0%
The SPR scoring system is easier to describe and understand. It is
based on the average responses to the 300 or so SPR questions on the
complete set of SPR assessment questionnaires.
By inversion and mathematical compression of the SPR scores, it is
possible to establish a rough equivalence between the SPR and SEI
scales, as follows:
SPR Scoring Range   Equivalent SEI Score   Approximate Frequency
5.99 to 3.00        1 = Initial            80.0%
2.99 to 2.51        2 = Repeatable         10.0%
2.01 to 2.50        3 = Defined             5.0%
1.01 to 2.00        4 = Managed             3.0%
0.01 to 1.00        5 = Optimizing          2.0%
The conversion between SPR and SEI assessment results is not per-
fect, of course, but it does allow users of either assessment methodology
to have an approximate indication of how they might have appeared
using the other assessment technique.
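A minimal sketch of that approximate conversion, using only the score bands shown in the table above; remember that the scales run in opposite directions (lower SPR scores are better, higher SEI levels are better):

# Minimal sketch of the approximate SPR-to-SEI conversion, using only the
# score bands shown in the table above.

def spr_to_sei_level(spr_score: float) -> int:
    """Map an SPR score (0.01 to 5.99, lower is better) to an approximate
    SEI CMM level (1 to 5, higher is better)."""
    if not 0.01 <= spr_score <= 5.99:
        raise ValueError("SPR scores run from 0.01 to 5.99")
    if spr_score >= 3.00:
        return 1      # Initial      (SPR 5.99 to 3.00)
    if spr_score >= 2.51:
        return 2      # Repeatable   (SPR 2.99 to 2.51)
    if spr_score >= 2.01:
        return 3      # Defined      (SPR 2.50 to 2.01)
    if spr_score >= 1.01:
        return 4      # Managed      (SPR 2.00 to 1.01)
    return 5          # Optimizing   (SPR 1.00 to 0.01)

print(spr_to_sei_level(3.2))   # 1 (Initial)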
There are other forms of assessment too. For example, ISO quality
certification uses a form of software assessment, as do the SPICE and
TickIT approaches in Europe.
In general, software assessments are performed by outside consul-
tants, although a few organizations do have internal assessment experts.
For SEI-style assessments, a number of consulting groups are licensed
to carry out the assessment studies and gather data.
Hybrid assessment and benchmark studies  Benchmark data shows productivity and quality levels, but does not explain what caused them.
Assessment data shows the sophistication of software development
practices, or the lack of same. But assessments usually collect no quan-
titative data.
Obviously, assessment data and benchmark data are synergistic, and
both need to be gathered. The author believes that a merger of assessment and benchmark data would be very useful to the industry.
In fact the author's own benchmarks are always hybrid and gather
assessment and benchmark data concurrently.
One of the key advantages of hybrid benchmarks is that the quantita-
tive data can demonstrate the economic value of the higher CMM and
CMMI levels. Without empirical benchmark data, the value of ascending
the CMMI from level 1 to level 5 is uncertain. But benchmarks do demonstrate substantially better productivity and quality for CMMI levels 3, 4, and 5 compared with levels 1 and 2.
The software industry would benefit from a wider consolidation of
assessment and benchmark data collection methods. The advantage of
the hybrid approach is that it minimizes the number of times managers
and technical personnel are interviewed or asked to provide informa-
tion. This keeps the assessment and benchmark data collection activi-
ties from being intrusive or interfering with actual day-to-day work.
Some of the kinds of data that need to be consolidated to get an over-
all picture of software within a large company or government group
include
1. Demographic data on team sizes
2. Demographic data on specialists
3. Demographic data on colocation or geographic dispersion of teams
4. Application size using several metrics (function points, story points,
LOC, etc.)
5. Volumes of reusable code and other deliverables
6. Rates of requirements change during development
7. Data on project management methods
8. Data on software development methods
9. Data on software maintenance methods
10. Data on specific programming languages
11. Data on specific tool suites used
12. Data on quality-control and testing methods
13. Data on defect potentials and defect removal efficiency levels
14. Data on security-control methods
15. Activity-level schedule, effort, and cost data
Hybrid assessment and benchmark data collection could gather all
of this kind of information in a fairly cost-effective and nonintrusive
fashion.
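As a purely illustrative sketch (the class and field names below are hypothetical, not a published schema), the fifteen kinds of data just listed could be consolidated into one hybrid record per application, which is what makes a single nonintrusive collection pass feasible.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class HybridBenchmarkRecord:
    """Illustrative container for one application's combined assessment and
    benchmark data; field names are hypothetical, not a published schema."""
    team_size: int                            # 1. team demographics
    specialists: Dict[str, int]               # 2. specialists by occupation group
    locations: List[str]                      # 3. colocation or geographic dispersion
    size_function_points: float               # 4. size in several metrics
    size_story_points: float
    size_loc: int
    reuse_percent: float                      # 5. reusable code and deliverables
    requirements_growth_percent: float        # 6. requirements change during development
    management_methods: List[str]             # 7. project management methods
    development_methods: List[str]            # 8. development methods
    maintenance_methods: List[str]            # 9. maintenance methods
    languages: List[str]                      # 10. programming languages
    tools: List[str]                          # 11. tool suites
    quality_methods: List[str]                # 12. quality-control and testing methods
    defect_potential_per_fp: float            # 13. defect potential and removal efficiency
    defect_removal_efficiency: float
    security_methods: List[str]               # 14. security-control methods
    activity_effort_hours: Dict[str, float]   # 15. activity-level schedule, effort, cost
    activity_cost_dollars: Dict[str, float] = field(default_factory=dict)
```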
Earned-value benchmarks  The earned-value method of comparing accumulated
effort and costs against predicted milestones and deliverables
is widely used on military software applications; indeed, it is a require-
ment for military contracts. However, outside of the defense community,
earned-value calculations are also used on some outsource contracts and
occasionally on internal applications.
Earned-value calculations are performed at frequent intervals, usu-
ally monthly, and show progress versus expense levels. The method is
somewhat specialized and the calculations are complicated, although
dozens of tools are available that can carry them out.
The earned-value approach by itself is not a true benchmark because
it has a narrow focus and does not deal with topics such as quality,
requirements changes, and other issues. However, the data that is col-
lected for the earned-value approach is quite useful for benchmark stud-
ies, and could also show correlations with assessment results such as
the levels of the capability maturity model integration (CMMI).
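Although the surrounding process is specialized, the core earned-value quantities reduce to a few standard formulas. The sketch below is illustrative only (the values are invented and it is not drawn from any particular earned-value tool); it computes the usual cost and schedule variances and performance indices from planned value, earned value, and actual cost.

```python
def earned_value_metrics(pv: float, ev: float, ac: float) -> dict:
    """Standard earned-value indicators from planned value (PV/BCWS),
    earned value (EV/BCWP), and actual cost (AC/ACWP)."""
    return {
        "cost_variance": ev - ac,       # positive = under budget
        "schedule_variance": ev - pv,   # positive = ahead of schedule
        "cpi": ev / ac,                 # cost performance index
        "spi": ev / pv,                 # schedule performance index
    }

# Illustrative monthly snapshot: $500K planned, $450K earned, $520K spent.
print(earned_value_metrics(pv=500_000, ev=450_000, ac=520_000))
```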
Quality and test coverage benchmarks Software quality is poorly rep-
resented in the public benchmark data offered by nonprofit organiza-
tions such as ISBSG (International Software Benchmarking Standards
Group). In fact, software quality is not handled very well across the
software industry, including by some major players such as Microsoft.
Companies such as IBM that do take quality seriously measure all
defects from requirements through development and out into the field.
The data is used to create benchmarks of two very important metrics:
defect potentials and defect removal efficiency. The term defect poten-
tials refers to the sum total of defects that are likely to be found in
software. The term defect removal efficiency refers to the percentage
of defects found and removed by every single review, inspection, static
analysis run, and test stage.
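A minimal sketch may make the removal-efficiency arithmetic concrete. It assumes the common convention of comparing defects found before release against the eventual total, including defects later reported by users; all defect counts below are invented for illustration.

```python
# Invented defect counts for one application, by removal stage, in the order
# the stages were executed, plus defects reported by users after release.
stage_defects = {
    "requirements and design inspections": 120,
    "code inspections": 200,
    "static analysis": 150,
    "all test stages": 400,
}
field_defects = 30  # reported by users after delivery

found_before_release = sum(stage_defects.values())
total_defects = found_before_release + field_defects

# Cumulative removal efficiency: share of all known defects found before release.
print(f"Cumulative removal efficiency: {found_before_release / total_defects:.1%}")

# Per-stage efficiency: defects a stage finds, divided by the defects still
# present when that stage begins.
remaining = total_defects
for stage, found in stage_defects.items():
    print(f"{stage}: {found / remaining:.1%}")
    remaining -= found
```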
In addition, quality benchmarks may also include topics such as
complexity measured using cyclomatic and essential complexity; test
coverage (percentage of code actually touched by test cases); and defect
severity levels. There is a shortage of industry data on many quality
topics, such as bugs or errors in test cases themselves.
In general, the software industry needs more and better quality and
test coverage benchmarks. The test literature provides very sparse
information on topics such as numbers of test cases, numbers of test
runs, and defect removal efficiency levels.
A strong caution about quality benchmarks is that "cost per defect"
is not a safe metric to use because it penalizes quality. The author
regards this metric as approaching professional malpractice. A better
metric for quality economics is that of defect removal cost per func-
tion point.
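The penalty is easiest to see with a small worked example using invented numbers: because much of testing cost is fixed, the cleaner application looks worse on cost per defect even though its quality economics are better, while defect removal cost per function point ranks the two correctly.

```python
# Two hypothetical 1,000-function-point applications that share the same fixed
# testing cost but differ in the number of defects found during testing.
FUNCTION_POINTS = 1_000
FIXED_TEST_COST = 50_000        # writing and running the test cases
REPAIR_COST_PER_DEFECT = 100    # variable cost of fixing each defect found

for name, defects_found in [("low quality", 500), ("high quality", 50)]:
    total_cost = FIXED_TEST_COST + defects_found * REPAIR_COST_PER_DEFECT
    print(
        f"{name:12s}: cost per defect = ${total_cost / defects_found:,.0f}; "
        f"removal cost per function point = ${total_cost / FUNCTION_POINTS:,.2f}"
    )
# low quality : cost per defect = $200;   removal cost per function point = $100.00
# high quality: cost per defect = $1,100; removal cost per function point = $55.00
```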
Cost of quality (COQ) benchmarks It is unfortunate that such an impor-
tant idea as the "cost of quality" has such an inappropriate name.
Quality is not only "free" as pointed out by Phil Crosby of ITT, but it
also has economic value. The COQ measure should have been named
something like the "cost of defects." In any case, the COQ approach is
older than the software and computing industry and derives from a
number of pioneers such as Joseph Juran, W. Edwards Deming, Kaoru
Ishikawa, Genichi Taguchi, and others.
The traditional cost elements of COQ include prevention, appraisal,
and failure costs. While these are workable for software, software COQ
often uses cost buckets such as defect prevention, inspection, static
analysis, testing, and delivered defect repairs. The ideas are the same,
but the nomenclature varies to match software operations.
Many companies perform COQ benchmark studies of both software
applications and engineered products. There is a substantial literature
on this topic and dozens of reference books.
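A COQ benchmark is, at its core, an exercise in rolling project costs into the buckets just described and reporting each bucket's share of the total. The sketch below uses the software-oriented bucket names from the text and invented dollar amounts.

```python
# Invented cost-of-quality figures for one application, grouped into the
# software-oriented buckets described in the text.
coq_buckets = {
    "defect prevention": 40_000,
    "inspections": 60_000,
    "static analysis": 15_000,
    "testing": 180_000,
    "delivered defect repairs": 90_000,
}

total_coq = sum(coq_buckets.values())
for bucket, cost in coq_buckets.items():
    print(f"{bucket:25s} ${cost:>9,}  ({cost / total_coq:.1%} of total COQ)")
print(f"{'total cost of quality':25s} ${total_coq:>9,}")
```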
"Six Sigma" is a mathematical expression that
Six Sigma benchmarks
deals with limiting defects to no more than 3.4 per 1 million opportuni-
ties. While this quantitative result appears to be impossible for software,
the philosophy of Six Sigma is readily applied to software.
The Six Sigma approach uses a fairly sophisticated and complex suite
of metrics to examine software defect origins, defect discovery methods,
defects delivered to customers, and other relevant topics. However, the
Six Sigma approach is also about using such data to improve both defect
prevention and defect detection.
A number of flavors of Six Sigma exist, but the most important flavor
circa 2009 is that of "Lean Six Sigma," which attempts a minimalist
approach to the mathematics of defects and quality analysis.
The Six Sigma approach is not an actual benchmark in the tradi-
tional sense of the word. As commonly used, a benchmark is a discrete
collection of data points gathered in a finite period, such as collecting
data on 50 applications developed in 2009 by a telecommunications
company.
The Six Sigma approach is not fixed in time or limited in number
of applications. It is a continuous loop of data collection, analysis, and
improvement that continues without interruption once it is initiated.
Although the ideas of Six Sigma are powerful and often effective,
there is a notable gap in the literature and data when Six Sigma is
applied to software. As of 2009, there is not a great deal of empirical data
that shows the application of Six Sigma raises defect removal efficiency
levels or lowers defect potentials.
The overall U.S. average for defect potentials circa 2009 is about 5.00
bugs per function point, while defect removal efficiency averages about
85 percent. This combination leaves a residue of 0.75 bug per function
point when software is delivered to users.
Given the statistical nature of Six Sigma metrics, it would be inter-
esting to compare all companies that use Lean Six Sigma or Six Sigma
for software against U.S. averages. If so, one might hope that defect
potentials would be much lower (say about 3.00 bugs per function point),
while removal efficiency would be much higher (say greater than 95 percent).
Unfortunately, this kind of data is sparse and not yet available in suf-
ficient quantity for a convincing statistical study.
As it happens, one way of achieving Six Sigma for software would
be to achieve a defect removal efficiency rate of 99.999 percent, which
has actually never occurred. However, it would seem useful to compare
actual levels of defect removal efficiency against this Six Sigma theo-
retical target.
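The comparison suggested here is a one-line calculation: delivered defects per function point are approximately the defect potential multiplied by the share of defects not removed. The sketch below applies that arithmetic to the U.S. averages quoted above, to the hypothetical Six Sigma shop, and to the 99.999 percent theoretical target; it is illustrative only.

```python
def delivered_defects_per_fp(defect_potential: float, removal_efficiency: float) -> float:
    """Approximate defects per function point remaining at delivery."""
    return defect_potential * (1.0 - removal_efficiency)

scenarios = {
    "U.S. average circa 2009": (5.00, 0.85),        # about 0.75 per function point
    "hypothetical Six Sigma shop": (3.00, 0.95),    # about 0.15 per function point
    "99.999% removal target": (5.00, 0.99999),      # about 0.00005 per function point
}

for name, (potential, efficiency) in scenarios.items():
    density = delivered_defects_per_fp(potential, efficiency)
    print(f"{name:28s}: {density:.5f} delivered defects per function point")
```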
From a historical standpoint, defect removal efficiency calculations
did not originate in the Six Sigma domain, but rather seemed to origi-
nate in IBM, when software inspections were being compared with other
forms of defect removal activities in the early 1970s.
ISO quality benchmarks  Organizations that need certification for
the ISO 9000-9004 quality standards or for other newer relevant ISO
standards undergo an on-site examination of their quality methods
and procedures, and especially the documentation for quality control
approaches. This certification is a form of benchmark and actually is
fairly expensive to carry out. However, there is little or no empirical data
that ISO certification improves software quality in the slightest.
In other words, neither defect potentials nor defect removal efficiency
levels of ISO certified organizations seem to be better than similar
uncertified organizations. Indeed there is anecdotal evidence that aver-
age software quality for uncertified companies may be slightly higher
than for certified companies.
Security benchmarks  With the exception of studies by Homeland
Security, the FBI, and more recently, the U.S. Congress, there is almost
a total absence of security benchmarks at the corporate level. As the
recession lengthens and security attacks increase, there is an urgent
need for security benchmarks that can measure topics such as the resis-
tance of software to attack; numbers of attacks per company and per
application; costs of security flaw prevention; costs of recovery from
security attacks and denial of service attacks; and evaluations of the
most effective forms of security protection.
The software journals do include benchmarks for antivirus and anti-
spyware applications and firewalls, showing ease of use and the number
of viruses detected or missed. However, these benchmarks are
somewhat ambiguous and casual.
So far as can be determined, there are no known benchmarks on topics
such as the number of security attacks against Microsoft Vista, Oracle,
SAP, Linux, Firefox, Internet Explorer, and the like. It would be useful to
have monthly benchmarks on these topics. The lack of effective security
benchmarks is a sign that the software industry is not yet fully up to
speed on security issues.
Software personnel and skill benchmarks  Personnel and skills inventory
benchmarks in the context of software are a fairly new arrival
on the scene. Software has become one of the major factors in global
business. Some large corporations have more than 50,000 software per-
sonnel of various kinds, and quite a few companies have more than 2500.
Over and above the large numbers of workers, the total complement of
specific skills and occupation groups associated with software is now
approaching 90.
As discussed in earlier chapters, large enterprises have many dif-
ferent categories of specialists in addition to their general software
engineering populations: for example, quality assurance specialists,
integration and test specialists, human factors specialists, performance
specialists, customer support specialists, network specialists, database
administration specialists, technical communication specialists, main-
tenance specialists, estimating specialists, measurement specialists,
function point counting specialists, and many others.
There are important questions about how many specialists of various
kinds are needed and how they should be recruited, trained, and perhaps
certified in their areas of specialization. There are also questions
dealing with the best way of placing specialists within the overall software
organization structures. Benchmarking in this domain involves collecting
information on how companies of various sizes in various industries deal
with the increasing need for specialization in an era of downsizing and
business process reengineering due to the continuing recession.
A new topic of increasing importance due to the recession is the
distribution of foreign software workers who are working in the
United States on temporary work-related visas. This topic recently drew
press attention when it was noted that Microsoft and Intel were laying off
U.S. workers at a faster rate than foreign workers.
Software compensation benchmarks  Compensation benchmarks have
been used for more than 25 years for nonsoftware studies, and software
compensation was soon added to these partly open or blind benchmarks.
The way compensation benchmarks work is that many companies
provide data on the compensation levels that they pay to various workers
using standard job descriptions. A neutral consulting company analyzes
the data and reports back to each company. Each report shows how spe-
cific companies compare with group averages. In the partly open form,
the names of the other companies are identified but of course their actual
data is concealed. In the blind form, the number of participating compa-
nies is known, but none of the companies are identified. There are legal
reasons for having these studies carried out in blind or partly open forms,
which involve possible antitrust regulations or conspiracy charges.
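The mechanics of such a study are simple to sketch: a neutral party computes group statistics and returns to each participant only its own position relative to the group. The company names and salary figures below are fictitious.

```python
# Fictitious annual salaries for one standard job description, keyed by
# participant. In a blind study only the neutral consultant sees this mapping.
submissions = {
    "Company A": 92_000,
    "Company B": 88_000,
    "Company C": 101_000,
    "Company D": 95_000,
}

group_average = sum(submissions.values()) / len(submissions)

# Each participant receives only its own position versus the group; the other
# companies' figures (and, in a fully blind study, their names) stay hidden.
for company, salary in submissions.items():
    delta = salary - group_average
    print(f"Report to {company}: {delta:+,.0f} versus a group average of ${group_average:,.0f}")
```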
Software turnover and attrition benchmarks This form of benchmark was
widely used outside of software before software became a major business
function. The software organizations merely joined in when they became
large enough for attrition to become an important issue.
Attrition and turnover benchmarks are normally carried out by
human resource organizations rather than software organizations.
They are classic benchmarks that are usually either blind or partly
open. Dozens or even hundreds of companies report their attrition
and turnover rates to a neutral outside consulting group, which then
returns statistical results to each company. Each company's rate is
compared with the group, but the specific rates for the other partici-
pants are concealed.
There are also internal attrition studies within large corporations such
as IBM, Google, Microsoft, EDS, and the like. The author has had access
to some very significant data from internal studies. The most important
finding was that software engineers with the highest appraisal scores
leave in the greatest numbers. The most common reason cited for leav-
ing in exit interviews is that good technical workers don't like working
for bad managers.
Software performance benchmarks Software execution speed or perfor-
mance is one of the older forms of benchmark, and has been carried out
since the 1970s. These are highly technical benchmarks that consider
application throughput or execution speed for various kinds of situa-
tions. Almost every personal computer magazine has benchmarks for
topics such as graphics processing, operating system load times, and
other performance issues.
Software data center benchmarks  This form of benchmark is probably
the oldest form for the computing and software industry and has been
carried out continuously since the 1960s. Data center benchmarks are
performed to gather information on topics such as availability of hard-
ware and software, mean time to failure of software applications, and
defect repair intervals. The new Information Technology Infrastructure
Library (ITIL) includes a host of topics that need to be examined so they
can be included in service agreements.
While data center benchmarks are somewhat separate from software
benchmarks, the two overlap because poor data center performance
tends to correlate with poor quality levels of installed software.
Customer satisfaction benchmarks Formal customer satisfaction surveys
have long been carried out by computer and software vendors such as
IBM, Hewlett-Packard, Unisys, Google, and some smaller companies,
too. These benchmark studies are usually carried out by the market-
ing organization and are used to suggest improvements in commercial
software packages.
There are some in-house benchmarks of customer satisfaction within
individual companies such as insurance companies that have thousands
of computer users. These studies may also be correlated to data center
benchmarks.
Software usage benchmarks As software becomes an important business
and operational tool, it is obvious that software usage tends to improve
the performance of various kinds of knowledge work and clerical work.
In fact, prior to the advent of computers, the employment patterns of
insurance companies included hundreds of clerical workers who han-
dled applications, claims, and other clerical tasks. Most of these were
displaced by computer software, and as a result the demographics of
insurance companies changed significantly.
Function point metrics can be used to measure consumption of soft-
ware just as well as they can measure production of software. Although
usage benchmarks are rare in 2009, they are likely to grow in impor-
tance as the recession continues.
Usage benchmarks of software project managers, for example, indi-
cate that managers who are equipped with about 3000 function points
of cost estimating tools and 3000 function points of project management
tools have fewer failures and shorter schedules for their projects than
managers who attempt estimating and planning by hand.
Usage studies also indicate that many knowledge workers who are
well equipped with software outperform colleagues who are not so well
equipped. This is true for knowledge work such as law, medicine, and
engineering, and also for work where data plays a significant role such
as marketing, customer support, and maintenance.
Software consumption benchmark studies are just getting started
circa 2009, but are likely to become major forms of benchmarks within
ten years, especially if the recession continues.
Software litigation and failure benchmarks In lawsuits for breach of con-
tract, poor quality, fraud, cost overruns, or project failure, benchmarks
play a major role. Usually in such cases software expert witnesses are
hired to prepare reports and testify about industry norms for topics
such as quality control, schedules, costs, and the like. Industry experts
are also brought in for tax cases if the litigation involves the value or
replacement costs of software assets.
The expert reports produced for lawsuits attempt to compare the
specifics of the case against industry background data for topics such
as defect removal efficiency levels, schedules, productivity, costs, and
the like.
The one key topic where litigation is almost unique in gathering
data is that of the causes of software failure. Most companies that have
internal failures don't go to court. But failures where the software was
developed under contract go to court with high frequency. These law-
suits have extensive and thorough discovery and deposition phases, so
the expert witnesses who work on such cases have access to unique data
that is not available from any other source.
Benchmarks based on litigation are perhaps the most complete source
of data on why projects are terminated, run late, exceed their budgets,
or have excessive defect volumes after release.
Award benchmarks  There are a number of organizations that offer
awards for outstanding performance. For example, the Baldrige Award
is well known for quality and customer service. The Forbes Annual issue
on the 100 best companies to work for is another kind of award. J.D.
Power and Associates issues awards for various kinds of service and
support excellence. For companies that aspire to "best in class" status,
a special kind of benchmark can be carried out dealing with the criteria
of the Baldrige Awards.
If a company is a candidate for some kind of award, quite a bit of work
is involved in collecting the necessary benchmark information. However,
only fairly sophisticated companies that are actually doing a good job
are likely to have such expenses.
As of 2009, probably at least a dozen awards are offered by vari-
ous corporations, government groups, and software journals. There are
awards for customer service, for high quality, for innovative applica-
tions, and for many other topics as well.
Types of Software Benchmark
Studies Performed
There are a number of methodologies used to gather the data for bench-
mark studies. These include questionnaires that are administered by
mail or electronic mail, on-site interviews, or some combination of
mailed questionnaires augmented by interviews.
Benchmarking studies can also be "open" or "blind" in terms of
whether the participants know who else has provided data and infor-
mation during the benchmark study.
Open benchmarks In a fully open study, the names of all participating
organizations are known, and the data they provide is also known. This
kind of study is difficult to do between competitors, and is normally
performed only for internal benchmark studies of the divisions and
locations within large corporations.
Because of corporate politics, the individual business units within a
corporation will resist open benchmarks. When IBM first started software
benchmarks, there were 26 software development labs, and each lab man-
ager claimed that "our work is so complex that we might be penalized."
However, IBM decided to pursue open benchmarks, and that was a good
decision because it encouraged the business units to improve.
Partly open benchmarks  One of the common variations of an open study
is a limited benchmark, often between only two companies. In a two-com-
pany benchmark, both participants sign fairly detailed nondisclosure
agreements, and then provide one another with very detailed informa-
tion on methods, tools, quality levels, productivity levels, schedules, and
the like. This kind of study is seldom possible for direct competitors, but
is often used for companies that do similar kinds of software but operate
in different industries, such as a telecommunications company sharing
data with a computer manufacturing company.
In partly open benchmark studies, the names of the participating
organizations are known, even though which company provided specific
points of data is concealed. Partly open studies are often performed
within specific industries such as insurance, banking, telecommuni-
cations, and the like. In fact, studies of this kind are performed for a
variety of purposes besides software topics. Some of the other uses of
partly open studies include exploring salary and benefit plans, office
space arrangements, and various aspects of human relations and
employee morale.
An example of a partly open benchmark is a study of the productivity
and quality levels of insurance companies in the Hartford, Connecticut,
area where half a dozen are located. All of these companies are com-
petitors, and all are interested in how they compare with the others.
Therefore, a study gathered data from each and reported back on how
each company compared with the averages derived from all of the com-
panies. But information on how a company such as Hartford Insurance
compared with Aetna or Travelers would not be provided.
Blind benchmarks  In blind benchmark studies, none of the participants
know the names of the other companies that participate. In extreme
cases, the participants may not even know the industries from which
the other companies were drawn. This level of precaution would only
be needed if there were very few companies in an industry, or if the
nature of the study demanded extraordinary security measures, or if
the participants are fairly direct competitors.
When large corporations first start collecting benchmark data, it is
obvious that the top executives of various business units will be con-
cerned. They all have political rivals, and no executive wants his or her
business unit to look worse than a rival business unit. Therefore, every
executive will want blind benchmarks that conceal the results of spe-
cific units. This is a bad mistake, because nobody will take the data
seriously.
For internal benchmark and assessment studies within a company,
it is best to show every unit by name and let corporate politics serve as
an incentive to improve. This brings up the important point that bench-
marks have a political aspect as well as a technical aspect.
Since executives and project managers have rivals, and corporate
politics are often severe, nobody wants to be measured unless they are
fairly sure the results will indicate that they are better than average,
or at least better than their major political opponents.
Benchmark Organizations Circa 2009
A fairly large number of consulting companies collect benchmark data of
various kinds. However, these consulting groups tend to be competitors,
and therefore it is difficult to have any kind of coordination or consolida-
tion of benchmark information.
As it happens, three of the more prominent benchmark organizations do
collect activity-level data in similar fashions: The David Consulting Group,
Quality and Productivity Management Group (QPMG), and Software
Productivity Research (SPR). This is due to the fact that the principals for all
three organizations have worked together in the past. However, although
the data collection methods are similar, there are still some differences.
But the total volume of data among these three is probably the largest
collection of benchmark data in the industry. Table 6-13 shows examples
of software benchmark organizations.

TABLE 6-13   Examples of Software Benchmark Organizations
1.  Business Applications Performance Corporation (BAPco)
2.  Construx
3.  David Consulting Group
4.  Forrester Research
5.  Galorath Associates
6.  Gartner Group
7.  Information Technology Metrics and Productivity Institute (ITMPI)
8.  International Software Benchmarking Standards Group (ISBSG)
9.  ITABHI Corporation
10. Open Standards Benchmarking Collaborative (OSBC)
11. Process Fusion
12. Quality and Productivity Management Group (QPMG)
13. Quality Assurance Institute (QAI)
14. Quality Plus
15. Quantitative Software Management (QSM)
16. Software Engineering Institute (SEI)
17. Software Productivity Research (SPR)
18. Standard Performance Evaluation Corporation (SPEC)
19. Standish Group
20. Total Metrics
For all of these 20 examples of benchmark organizations, IFPUG
function points are the dominant metric, followed by COSMIC function
points as a distant second.
Reporting Methods for Benchmark
and Assessment Data
Once assessment and benchmark data has been collected, two interest-
ing questions are who gets to see the data, and what is it good for?
Normally, assessment and benchmarks are commissioned by an exec-
utive who wants to improve software performance. For example, bench-
marks and assessments are sometimes commissioned by the CEO of a
corporation, but more frequently by the CIO or CTO.
The immediate use of benchmarks and assessments is to show the
executive who commissioned the study how the organization compares
against industry data. The topics of interest at the executive level
include
Benchmark Contents (standard benchmarks)
Number of projects in benchmark sample
Country and industry identification codes
Application sizes
Methods and tools used
Growth rate of changing requirements
Productivity rates by activity
Net productivity for entire project
Schedules by activity
Net schedule for entire project
Staffing levels by activity
Specialists utilized
Average staff for entire project
Effort by activity
Total effort for entire project
Costs by activity
Total costs for entire project
Comparison to industry data
Suggestions for improvements based on data
Once an organization starts collecting assessment and benchmark
data, it usually wants to improve. This implies that data collection
will be an annual event, and that the data will be used as baselines to
show progress over multiple years.
When improvement occurs, companies will want to assemble an
annual baseline report that shows progress for the past year and the
plans for the next year. These annual reports are produced on the same
schedule as corporate annual reports for shareholders; that is, they are
created in the first quarter of the next fiscal year.
The contents for such an annual report would include
Annual Software Report for Corporate Executives and Senior Management
CMMI levels by business group
Completed software projects by type
IT applications
Systems software
Embedded applications
Commercial packages
Other (if any)
Cancelled software projects (if any)
Total costs of software in current year
Unbudgeted costs in current year
Litigation
Denial of service attacks
Malware attacks and recovery
Costs by type of software
Costs of development versus maintenance
Customer satisfaction levels
Employee morale levels
Average productivity
Ranges of productivity
Average quality
Discovered defects during development
Delivered defects reported by clients in 90 days
Cost of quality (COQ) for current year
Comparison of local results to ISBSG and other external benchmarks
Most of the data in the annual report would be derived from assess-
ment and benchmark studies. However, a few topics such as those deal-
ing with security problems such as denial of service attacks are not part
of either standard benchmarks or standard assessments. They require
special studies.
Summary and Conclusions
Between about 1969 and today in 2009, software applications have
increased enormously in size and complexity. In 1969, the largest appli-
cations were fewer than 1000 function points, while in 2009, they top
100,000 function points in size.
In 1969, programming or coding was the major activity for software
applications and constituted about 90 percent of the total effort. Most
applications used only a single programming language. The world total
of programming languages was fewer than 25. Almost the only spe-
cialists in 1969 were technical writers and perhaps quality assurance
workers.
Today in 2009, coding or programming is less than 40 percent of the
effort for large applications, and the software industry now has more
than 90 kinds of specialists. More than 700 programming languages exist, and
almost every modern application uses at least two programming lan-
guages; some use over a dozen.
As the software industry increased in numbers of personnel, size of
applications, and complexity of development, project management fell
behind. Today in 2009, project managers are still receiving training that
might have been effective in 1969, but it falls short of what is needed in
today's more complicated world.
Even worse, as the recession increases in severity, there is an urgent
need to lower software costs. Project managers and software engineers
need to have enough solid empirical data to evaluate and understand
every single cost factor associated with software. Unfortunately, poor mea-
surement practices and a shortage of solid data on quality, security, and
costs have put the software industry in a very bad economic position.
Software costs more than almost any other manufactured product;
it is highly susceptible to security attacks; and it is filled with bugs or
defects. Yet due to the lack of reliable benchmark and quality data, it is
difficult for either software engineers or project managers to deal with
these serious problems effectively.
The software industry needs better quality, better security, lower
costs, and shorter schedules. But until solid empirical data is gathered
on all important projects, both software engineers and project manag-
ers will not be able to plan effective solutions to industrywide problems.
Many process improvement programs are based on nothing more than
adopting the methodology du jour, such as Agile in 2009, without any
empirical data on whether it will be effective. Better measurements and
better benchmarks are the keys to software success.
Readings and References
Abran, Alain and Reiner R. Dumke. Innovations in Software Measurement. Aachen,
Germany: Shaker-Verlag, 2005.
Abran, Alain, Manfred Bundschuh, Reiner Dumke, Christof Ebert, and Horst Zuse.
Software Measurement News, Vol. 13, No. 2, Oct. 2008.
Boehm, Dr. Barry. Software Engineering Economics. Englewood Cliffs, NJ: Prentice
Hall, 1981.
Booch, Grady. Object Solutions: Managing the Object-Oriented Project. Reading, MA:
Addison Wesley, 1995.
Brooks, Fred. The Mythical Man-Month. Reading, MA: Addison Wesley, 1975, rev. 1995.
Bundschuh, Manfred and Carol Dekkers. The IT Measurement Compendium. Berlin:
Springer-Verlag, 2008.
Capability Maturity Model Integration. Version 1.1. Software Engineering Institute,
Carnegie-Mellon Univ., Pittsburgh, PA. March 2003. www.sei.cmu.edu/cmmi/
Charette, Bob. Application Strategies for Risk Management. New York: McGraw-Hill,
1990.
Charette, Bob. Software Engineering Risk Analysis and Management. New York:
McGraw-Hill, 1989.
Cohn, Mike. Agile Estimating and Planning. Englewood Cliffs, NJ: Prentice Hall PTR,
2005.
DeMarco, Tom. Controlling Software Projects. New York: Yourdon Press, 1982.
Ebert, Christof and Reiner Dumke. Software Measurement: Establish, Extract,
Evaluate, Execute. Berlin: Springer-Verlag, 2007.
Ewusi-Mensah, Kweku. Software Development Failures. Cambridge, MA: MIT Press,
2003.
Galorath, Dan. Software Sizing, Estimating, and Risk Management: When Performance
is Measured Performance Improves. Philadelphia: Auerbach Publishing, 2006.
Garmus, David and David Herron. Function Point Analysis--Measurement Practices for
Successful Software Projects. Boston: Addison Wesley Longman, 2001.
Garmus, David and David Herron. Measuring the Software Process: A Practical Guide
to Functional Measurement. Englewood Cliffs, NJ: Prentice Hall, 1995.
Glass, R.L. Software Runaways: Lessons Learned from Massive Software Project
Failures. Englewood Cliffs, NJ: Prentice Hall, 1998.
Harris, Michael, David Herron, and Stacia Iwanicki. The Business Value of IT:
Managing Risks, Optimizing Performance, and Measuring Results. Boca Raton, FL:
CRC Press (Auerbach), 2008.
Humphrey, Watts. Managing the Software Process. Reading, MA: Addison Wesley, 1989.
International Function Point Users Group (IFPUG). IT Measurement--Practical Advice
from the Experts. Boston: Addison Wesley Longman, 2002.
Johnson, James, et al. The Chaos Report. West Yarmouth, MA: The Standish Group, 2000.
Jones, Capers. Assessment and Control of Software Risks. Englewood Cliffs, NJ:
Prentice Hall, 1994.
Jones, Capers. Estimating Software Costs. New York: McGraw-Hill, 2007.
Jones, Capers. Patterns of Software System Failure and Success. Boston: International
Thomson Computer Press, December 1995.
Jones, Capers. Program Quality and Programmer Productivity. IBM Technical Report
TR 02.764, IBM. San Jose, CA. January 1977.
Jones, Capers. Programming Productivity. New York: McGraw-Hill, 1986.
Jones, Capers. Software Assessments, Benchmarks, and Best Practices. Boston: Addison
Wesley Longman, 2000.
Jones, Capers. "Software Project Management Practices: Failure Versus Success."
CrossTalk, Vol. 19, No. 6 (June 2006): 4–8.
Jones, Capers. "Why Flawed Software Projects are not Cancelled in Time." Cutter IT
Journal, Vol. 10, No. 12 (December 2003): 12–17.
Laird, Linda M. and Carol M. Brennan. Software Measurement and Estimation: A
Practical Approach. Hoboken, NJ: John Wiley & Sons, 2006.
McConnell, Steve. Software Estimating: Demystifying the Black Art. Redmond, WA:
Microsoft Press, 2006.
Park, Robert E., et al. Checklists and Criteria for Evaluating the Costs and Schedule
Estimating Capabilities of Software Organizations. Technical Report CMU/SEI 95-
SR-005. Pittsburgh, PA: Software Engineering Institute, January 1995.
Park, Robert E., et al. Software Cost and Schedule Estimating--A Process Improvement
Initiative. Technical Report CMU/SEI 94-SR-03. Pittsburgh, PA: Software
Engineering Institute, May 1994.
Parthasarathy, M.A. Practical Software Estimation--Function Point Metrics for
Insourced and Outsourced Projects. Upper Saddle River, NJ: Infosys Press, Addison
Wesley, 2007.
Putnam, Lawrence H. and Ware Myers. Industrial Strength Software--Effective
Management Using Measurement. Los Alamitos, CA: IEEE Press, 1997.
Putnam, Lawrence H. Measures for Excellence--Reliable Software on Time, Within
Budget. Englewood Cliffs, NJ: Yourdon Press, Prentice Hall, 1992.
Roetzheim, William H. and Reyna A. Beasley. Best Practices in Software Cost and
Schedule Estimation. Saddle River, NJ: Prentice Hall PTR, 1998.
Stein, Timothy R. The Computer System Risk Management Book and Validation Life
Cycle. Chico, CA: Paton Press, 2006.
Strassmann, Paul. Governance of Information Management: The Concept of an
Information Constitution, Second Edition. (eBook). Stamford, CT: Information
Economics Press, 2004.
Strassmann, Paul. Information Payoff. Stamford, CT: Information Economics Press,
1985.
Strassmann, Paul. Information Productivity. Stamford, CT: Information Economics
Press, 1999.
Strassmann, Paul. The Squandered Computer. Stamford, CT: Information Economics
Press, 1997.
Stukes, Sherry, Jason Deshoretz, Henry Apgar, and Ilona Macias. Air Force Cost
Analysis Agency Software Estimating Model Analysis. TR-9545/008-2. Contract
F04701-95-D-0003, Task 008. Management Consulting & Research, Inc. Thousand
Oaks, CA. September 30 1996.
Stutzke, Richard D. Estimating Software-Intensive Systems. Upper Saddle River, NJ:
Addison Wesley, 2005.
Symons, Charles R. Software Sizing and Estimating--Mk II FPA (Function Point
Analysis). Chichester, UK: John Wiley & Sons, 1991.
Wellman, Frank. Software Costing: An Objective Approach to Estimating and
Controlling the Cost of Computer Software. Englewood Cliffs, NJ: Prentice Hall,
1992.
Whitehead, Richard. Leading a Development Team. Boston: Addison Wesley, 2001.
Yourdon, Ed. Death March--The Complete Software Developer's Guide to Surviving
"Mission Impossible" Projects. Upper Saddle River, NJ: Prentice Hall PTR, 1997.
Yourdon, Ed. Outsource: Competing in the Global Productivity Race. Englewood Cliffs,
NJ: Prentice Hall PTR, 2005.