1. SRE & Software Fault Management Through Measurement and Modeling
Rick Karcich
rkarcich@cisco.com
2. SRE & SW Fault Management
• SRE is a Measurement of Change Activity & Test
Activity Problem
• Composed of 3 phases
o Static Code Measurement
Measuring the code that implements each requirement
o Measurement of Change, Build-to-Build, Sprint-to-
Sprint
Developing the Change Profile
o Dynamic Test Measurement
Developing the Test Profile
• Understanding how effective our tests are at
hitting changed portions of code releases
SRE == Software Reliability Engineering
3. Measurement as the basis for SRE
improvement
• A Basic Problem
o Testers find too many bugs too late in a
release/sprint cycle
o Bug fixing destroys the crisp, predictable
execution of a project
• The Goal
o Predictably meeting release schedules with
known quality/reliability
o Develop information actionable by testers
4. Measurement as the basis for SRE
improvement…
• We don’t really want to know about the things
that we can measure
o Lines of Code
o Statements
• We really want to understand the things that we
cannot measure
o Software faults
o Software development effort
5. Measurement as the basis for SRE
improvement…
• Modern software systems change continuously
• They evolve functionally
• The code base evolves as a result
6. Software Evolution: Measuring A
Moving Target
• We assume that we are developing/maintaining a
single program
• In effect, we are really working with many
programs over time
• They are different programs in a very real sense
• We must identify and measure each version of
each program module
8. The Introduction Of Faults
• People make errors in the interpretation of
their tasks
o System Analysts
o Systems Designers
o Developers
• These errors are manifested as faults in
o Specifications
o Design
o Programs
• Faults, when executed, result in failures
10. Execution Consequences Of
Faults: Failures
• Faults are found in program modules
• A fault can only cause a failure if it is executed
• Different functionalities execute different sets of
modules
• Faults are associated with program
functionalities
• A test suite generated from a representative
operational profile is a precondition for reliability
analysis
11. Faults And Uncertainty
• Can never know when all faults have been found
• May use past experience to anticipate fault count
• Must create a fault surrogate
o Obtained from past development efforts
o Varies directly with faults
o Anticipates distribution of faults in modules
12. Development of FI from Raw Metrics – Static Measurement
Raw metrics combined into the Fault Index (FI):
• Comments
• Executable Statements
• Non-executable Statements
• Total Operators
• Unique Operators
• Function Operators
• Total Operands
• Unique Operands
• Unique Actual Operands
• Nodes
• Edges
• Paths
• Maximum Path Length
• Average Path Length
• Cycles
13. FI as a Fault Surrogate
• The FI metric is a statistical synthesis of program
module complexity
• Program modules may be ordered by FI
• The relative complexity of a software system is
the average FI of the component modules
• Validation of the FI concept
o Correlates well (0.90) with measures of software
faults
14. A Fault Index
• FI is a synthesized, dimensionless metric
• FI is a fault surrogate
o Composed of metrics closely related to faults
o Highly correlated with faults
15. Converting Data to Information
[Figure: Program Modules → CMA (Metric Analysis) → Lots of Data → PCA/FI (Principal Components Analysis) → FI]
Sample raw metric data shown in the figure:
12 23 54 12 203 39 238 34
7 13 64 12 215 9 39 238
11 21 54 12 241 39 238 35
5 33 44 12 205 39 138 44
42 55 54 12 113 29 234 14
Sample FI values shown in the figure: 100, 90, 110, 95, 105
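The data-to-information step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's actual method: it standardizes each raw metric, projects every module onto the first principal component only (found by power iteration), and rescales the result to a mean of 100 and a standard deviation of 10. The function names, the single-component simplification, and the 100/10 scaling are all assumptions; the sample rows are the ones shown in the figure, and the output is not expected to reproduce the figure's FI values.

```python
from statistics import fmean, pstdev

# Raw metric rows from the slide (one row per module).
SAMPLE_METRICS = [
    [12, 23, 54, 12, 203, 39, 238, 34],
    [7, 13, 64, 12, 215, 9, 39, 238],
    [11, 21, 54, 12, 241, 39, 238, 35],
    [5, 33, 44, 12, 205, 39, 138, 44],
    [42, 55, 54, 12, 113, 29, 234, 14],
]

def zscores(rows):
    """Standardize each metric (column); constant columns become all zeros."""
    cols = list(zip(*rows))
    means = [fmean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # avoid dividing by zero
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]

def first_component(z, iterations=500):
    """Dominant eigenvector of the covariance of z, by power iteration."""
    n, p = len(z), len(z[0])
    cov = [[sum(r[i] * r[j] for r in z) / n for j in range(p)] for i in range(p)]
    vec = [1.0] * p  # start vector; fixes the sign of the result
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0.0:  # every metric is constant: nothing to project on
            return vec
        vec = [x / norm for x in nxt]
    return vec

def fault_index(rows, mean=100.0, sd=10.0):
    """Hypothetical FI: first-component score rescaled to (mean, sd)."""
    z = zscores(rows)
    weights = first_component(z)
    raw = [sum(a * b for a, b in zip(r, weights)) for r in z]
    centre, spread = fmean(raw), pstdev(raw) or 1.0
    return [mean + sd * (x - centre) / spread for x in raw]

if __name__ == "__main__":
    for module, fi in enumerate(fault_index(SAMPLE_METRICS), start=1):
        print(f"module {module}: FI = {fi:.1f}")
```

A fuller version would retain several principal components (the "domains") and weight their scores; the single-component form is kept here only to show the shape of the computation.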
16. Measurement of Change, Build-to-Build (Sprint-to-Sprint) – Code Churn and Code Delta
• Fault Index(FI) acts as a proxy for faults
• FI values change from one build to the next as a
module is changed
• Code deltas are differences in FI values from one
build to the next
• Code churn is the absolute value of Code Delta –
this is the measure of change activity
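As a worked illustration of the two definitions above, the sketch below computes per-module code delta and code churn between two builds. The module names and FI values are made up for the example, and treating a module that is absent from a build as having an FI of 0 in that build is an assumption, not something the slides state.

```python
def code_deltas(fi_prev, fi_next):
    """Per-module FI difference between two builds (dicts: module -> FI)."""
    modules = sorted(set(fi_prev) | set(fi_next))
    # Assumption: a module missing from a build counts as FI 0 in that build,
    # so an added module's delta is +FI and a removed module's delta is -FI.
    return {m: fi_next.get(m, 0.0) - fi_prev.get(m, 0.0) for m in modules}

def code_churn(deltas):
    """Code churn is the absolute value of each code delta."""
    return {m: abs(d) for m, d in deltas.items()}

# Hypothetical FI values for two successive builds.
build_i = {"A": 100.0, "B": 90.0, "C": 110.0}
build_j = {"A": 100.0, "B": 97.5, "D": 95.0}  # B changed, C removed, D added

deltas = code_deltas(build_i, build_j)
churn = code_churn(deltas)
net_delta = sum(deltas.values())
total_churn = sum(churn.values())

if __name__ == "__main__":
    print(deltas)       # per-module signed change
    print(net_delta)    # net change in fault burden
    print(total_churn)  # total change activity
```

The net delta can be small even when the churn is large: removals and additions cancel in the delta but both count as change activity in the churn.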
17. The Measurement of Change Process
[Figure: Source Code for Build i and Build j → Measurement Tools → Baseline → Baselined Build i and Baselined Build j → PCA Domain Scores → Change: Code Churn, Code Delta]
18. Baselining A Software
Development Project
• Software changes over software builds
• Measurements, such as relative complexity,
change across builds
• An initial (arbitrary) build serves as a baseline
• Relative complexity of each build
• Measure change in fault surrogate from an initial
baseline
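One way to picture baselining is sketched below: the mean and standard deviation of each raw metric are taken from the baseline build only, and every later build is standardized against those fixed values, so that scores stay comparable from build to build. The function names, the two metric columns, and the numbers are assumptions made for illustration.

```python
from statistics import fmean, pstdev

def baseline_stats(rows):
    """Mean and standard deviation of each metric column in the baseline build."""
    cols = list(zip(*rows))
    return [(fmean(c), pstdev(c) or 1.0) for c in cols]

def standardize(rows, stats):
    """Standardize any build's metrics against the baseline's statistics."""
    return [[(v - m) / s for v, (m, s) in zip(r, stats)] for r in rows]

# Hypothetical data: columns are (executable statements, nodes).
baseline_build = [[10, 2], [30, 6]]
later_build = [[40, 4]]

stats = baseline_stats(baseline_build)
scores = standardize(later_build, stats)

if __name__ == "__main__":
    print(stats)
    print(scores)
```

Because the statistics are frozen at the baseline, a module that grows in a later build moves away from the baseline mean instead of shifting the mean itself.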
Guru? Just tell me where I’m at… I know where I want to go…
Integrating measurement into build process, measuring change…
Integrating measurement into the test process, measuring coverage…
Talk about the concept of Errors > Faults > Failures
Classic SRE techniques … ‘very good at predicting the past’…
Build predictive models for fault identification. This is the main goal of the entire QA effort, under constraints of limited time and resources…
Software doesn’t fail as a function of time; it fails as a function of what it is asked to do. In customer environments, it fails when customers ask it to do things for which it has not been adequately tested…
We synthesize the Fault Index(FI) that represents the potential fault burden for each software module…
An overall Fault Index for an entire build can also be developed…
The value of this FI cannot be overstated. For example, during the code inspection process, there is simply not enough time or manpower to thoroughly examine the entire code base. The FI value will identify those modules that are potentially most fault prone so that the inspection process can be focused where it will yield the most value in identifying potential faults.
From N > N+1
Module B changed
Module C eliminated
Module F changed
Module H changed
Modules L & M added
The ‘fault surrogate’ is the Fault Index… and is central to our modeling of software failures as a function of software behavior…
This is a list of measurements to be taken as Iteration 0…
As we learn more about the character of faults our customers are seeing, e.g., latency/timing faults, we can add measurements to gauge these kinds of faults…
Build a static software measurement tool.
This tool is, in fact, a modification to the compiler system. The measurements that we need to make on the code are best taken by the compiler. When these measurements are integrated with the compiler they will occur automatically during the build process. This obviates the need for a separate software measurement system.
Create the infrastructure to capture the measurement data during the compilation/build process.
This involves the creation of a database that will collect measurement data for all versions of code modules.
Create the infrastructure to capture build lists.
Build lists are vital to the software measurement process. The build list will contain a complete list of all code modules and their version numbers for each build. These data, in conjunction with the measurement database, will permit measurement data to be established for each build of the system.
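A minimal sketch of how a build list and the measurement database fit together is shown below. The in-memory dictionaries stand in for the real database, and the module names, version numbers, build identifier, and metric values are all hypothetical.

```python
# Hypothetical measurement database: (module, version) -> measured metrics.
measurements = {
    ("parser.c", "1.3"): {"executable_statements": 120, "nodes": 40},
    ("parser.c", "1.4"): {"executable_statements": 135, "nodes": 44},
    ("lexer.c", "2.0"): {"executable_statements": 80, "nodes": 25},
}

# Hypothetical build lists: build id -> {module: version} for that build.
build_lists = {
    "build_7": {"parser.c": "1.3", "lexer.c": "2.0"},
    "build_8": {"parser.c": "1.4", "lexer.c": "2.0"},
}

def measurements_for_build(build_id, build_lists, measurements):
    """Look up the measurements of the exact module versions in one build."""
    result = {}
    for module, version in build_lists[build_id].items():
        # A KeyError here means a module version was built but never measured.
        result[module] = measurements[(module, version)]
    return result

def changed_modules(build_a, build_b, build_lists):
    """Modules whose version differs, or which exist in only one build."""
    a, b = build_lists[build_a], build_lists[build_b]
    return sorted(m for m in set(a) | set(b) if a.get(m) != b.get(m))

if __name__ == "__main__":
    print(measurements_for_build("build_8", build_lists, measurements))
    print(changed_modules("build_7", "build_8", build_lists))
```

Keying measurements by module and version means each module version is measured once, however many builds include it.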
The relationships among the many software complexity metrics have made the use of these metrics untenable as project management tools. ‘Relative Complexity’ provides the notion of a single metric: it assigns a single value to each program in a program set so that the programs can be ordered by their complexity.
In a civil-engineering (topographical) metaphor, the baseline build is the point-in-time reference from which successive code deltas are measured.