This study analyzed 680,209 builds from 1,276 open source projects on Travis CI to evaluate noise and heterogeneity in historical build data. It found that 12% of passing builds contain actively ignored failures, that breakages can persist for up to 423 days, and that 67% of the analyzed breakages are stale. Overall, 9%-14% of builds are incorrectly labelled because of noise. Build outcomes are also heterogeneous: 41% of broken builds failed due to problems outside of the build tool (Maven), and environment-specific breakage is commonplace. The implications are that researchers should filter noise out before analyses and account for heterogeneity, while tool builders should look beyond tool-specific insight when automating breakage recovery.
Noise and Heterogeneity in Historical Build Data: An Empirical Study of Travis CI
1. Noise and Heterogeneity in Historical Build Data: An Empirical Study of Travis CI
Keheliya Gallaba (@keheliya, keheliya.github.io)
Christian Macho (@Mitschiiii, mitschi.github.io)
Martin Pinzger (@pinzger, pinzger.github.io)
Shane McIntosh (@shane_mcintosh, shanemcintosh.org)
2. Automated builds check the impact of changes on the software product
Source Code -> Build System -> Deliverables
3. Build outcome data is used to solve software engineering research problems
For understanding and predicting build breakage
For measuring the build breakage rate
For communicating the current build status
5. Can the off-the-shelf historical CI build data be trusted?
The zdavatz/spreadsheet project has had the allow_failure feature enabled for the entire lifetime of the project!
9. We look for passing builds with actively ignored failures
Start: 680,209 builds
Select passing builds: 496,204 builds
Select builds with failing jobs: 59,904 builds
Check if the allow_failure property is enabled for the failing jobs in .travis.yml
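The selection steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling. It assumes each build has already been loaded into a plain dict with hypothetical fields `state`, `config` (the parsed .travis.yml), and `jobs`. In .travis.yml the key is spelled `allow_failures` and sits under `matrix` (or its alias `jobs`); a rule is treated here as matching a job when every key/value pair in the rule also appears in the job's configuration.

```python
def rule_matches(job_config, rule):
    """True when every key/value pair of the rule also appears in the job config."""
    return bool(rule) and all(job_config.get(k) == v for k, v in rule.items())


def has_actively_ignored_failure(build):
    """True for a passing build with a failing job covered by an allow_failures rule."""
    if build["state"] != "passed":
        return False
    config = build["config"]
    # Assumption: the rules live under `matrix` or its alias `jobs`.
    section = config.get("matrix") or config.get("jobs") or {}
    rules = section.get("allow_failures", [])
    return any(
        job["state"] in ("failed", "errored")
        and any(rule_matches(job["config"], rule) for rule in rules)
        for job in build["jobs"]
    )


# Hypothetical build record, for illustration only.
example_build = {
    "state": "passed",
    "config": {"matrix": {"allow_failures": [{"rvm": "ruby-head"}]}},
    "jobs": [
        {"state": "passed", "config": {"rvm": "2.4"}},
        {"state": "failed", "config": {"rvm": "ruby-head"}},
    ],
}

print(has_actively_ignored_failure(example_build))
```

The build above is reported as passing even though one of its two jobs failed, which is the situation the slide calls an actively ignored failure.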
10. Passing build outcomes do not always indicate that the build was entirely clean
12% of passing builds have an actively ignored failure.
Up to 87% of the jobs are actively ignored.
11. Passively ignored breakages may introduce noise when all breakages are assumed to be distracting
Start: 680,209 builds
Build filtering: 610,550 builds
Graph construction using version control data
Graph analysis
Long breakage sequences may mean developers passively ignored failures by not immediately fixing them.
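One way to picture the graph analysis step is to follow parent links between builds and count how many consecutive broken builds precede each fix. The sketch below is a simplification under stated assumptions: every commit has at most one parent (no merges), and the input format, a dict from commit id to a `(parent, state)` pair, is hypothetical. It is not the authors' implementation.

```python
def breakage_sequence_lengths(builds):
    """Return the sorted lengths of maximal runs of broken builds.

    builds: dict mapping commit id -> (parent commit id or None, state),
    where state is "broken" or "passed".
    """
    broken = {c for c, (_, state) in builds.items() if state == "broken"}
    # A broken commit that has a broken child is not the end of a run.
    continued = {
        parent
        for commit, (parent, _) in builds.items()
        if commit in broken and parent in broken
    }
    lengths = []
    for tip in broken - continued:
        length, current = 0, tip
        # Walk back through parents for as long as the builds stay broken.
        while current in broken:
            length += 1
            current = builds[current][0]
        lengths.append(length)
    return sorted(lengths)


# Hypothetical history: a -> b -> c -> d -> e
history = {
    "a": (None, "passed"),
    "b": ("a", "broken"),
    "c": ("b", "broken"),
    "d": ("c", "passed"),
    "e": ("d", "broken"),
}
print(breakage_sequence_lengths(history))
```

Long runs in the output would correspond to breakages that stayed unfixed across many commits.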
12. In some cases, builds can remain broken for 423 days
The overall median length of a failure sequence is five commits.
13. One of the reasons for ignoring a build breakage: Staleness
Developers may become desensitized to stale* breakages.
*If the project has encountered a given breakage in the past, it is a stale breakage.
14. We measure staleness in Maven build breakages
Maven Build Log -> Maven Log Analyzer
Build fails due to the same reason as a prior failure? NO: Not Stale Breakage. YES: continue.
Failure details are equal to a prior failure? NO: Not Stale Breakage. YES: Stale Breakage.
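The two checks in the flowchart can be sketched as a single pass over a project's breakages in chronological order. This is an illustrative sketch only: it assumes each breakage has already been reduced to a hypothetical `(reason, details)` pair by some log analysis step, and it treats a breakage as stale when both values equal those of an earlier breakage.

```python
def label_stale(breakages):
    """Label each breakage as stale (True) or not stale (False).

    breakages: chronological list of (reason, details) tuples for one project.
    """
    seen = set()
    labels = []
    for reason, details in breakages:
        key = (reason, details)
        # Stale only if the same reason AND the same details occurred before.
        labels.append(key in seen)
        seen.add(key)
    return labels


# Hypothetical breakages, for illustration only.
sample = [
    ("compilation", "Foo.java"),
    ("test", "BarTest"),
    ("compilation", "Foo.java"),
    ("compilation", "Baz.java"),
]
print(label_stale(sample))
```

In the sample, only the third breakage repeats an earlier reason and its details, so it is the only one labelled stale.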
15. Two of every three build breakages (67%) that we analyze are stale
16. We propose Signal-To-Noise Ratio to quantify the proportion of noise
                  Has Ignored Breakages     No Ignored Breakages
Broken Builds     False Build Breakages     True Build Breakages
Passing Builds    False Build Successes     True Build Successes
Signal-To-Noise Ratio = Signal / Noise
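A small sketch of how such a ratio could be computed from the four cells above. The slide does not spell out the formula, so this rests on an assumption: signal is taken to be the correctly labelled builds (true breakages plus true successes) and noise the incorrectly labelled ones (false breakages plus false successes). Function and parameter names are hypothetical.

```python
def signal_to_noise(true_breakages, true_successes, false_breakages, false_successes):
    """Ratio of correctly labelled builds to incorrectly labelled builds."""
    signal = true_breakages + true_successes
    noise = false_breakages + false_successes
    if noise == 0:
        raise ValueError("no noisy builds: the ratio is undefined")
    return signal / noise


def noise_share(snr):
    """Fraction of all builds that are noise for a given signal-to-noise ratio."""
    return 1 / (1 + snr)


# Made-up counts, for illustration only.
snr = signal_to_noise(true_breakages=30, true_successes=60,
                      false_breakages=4, false_successes=6)
print(snr, noise_share(snr))
```

Under this reading, a ratio between 6 and 10 corresponds to one noisy build in every 7 to 11 builds, since the noise share is 1 / (1 + ratio).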
17. One in every 7 to 11 builds (9%-14%) is incorrectly labelled
18. Noise may influence analyses based on build outcome data
Passing build outcomes do not always indicate that the build was entirely clean
Build breakages can persist for up to 485 commits (423 days)
67% of build breakages we analyze are stale
9%-14% of builds are incorrectly labelled
19. Are build outcomes homogeneous?
22. Builds can break for various reasons
Compilation Failure
Test Failure
Dependency Resolution Failure
Deployment Failure
We extend Maven Log Analyzer to parse and classify broken Maven build logs by type
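To make the idea of classifying build logs by type concrete, here is a minimal keyword-based sketch. It is not the Maven Log Analyzer and does not reproduce its rules. The patterns are messages that Maven and its common plugins typically print, chosen here as assumptions for illustration, and the first matching rule wins.

```python
import re

# Illustrative patterns only; a real analyzer would need far more rules.
RULES = [
    ("Dependency Resolution Failure", re.compile(r"Could not resolve dependencies")),
    ("Compilation Failure", re.compile(r"COMPILATION ERROR")),
    ("Test Failure", re.compile(r"There are test failures")),
    ("Deployment Failure", re.compile(r"Failed to deploy artifacts")),
]


def classify(log_text):
    """Return the first breakage category whose pattern occurs in the log."""
    for category, pattern in RULES:
        if pattern.search(log_text):
            return category
    return "Unclassified"


# Hypothetical log excerpt, for illustration only.
sample_log = "[INFO] Compiling 12 source files\n[ERROR] COMPILATION ERROR :\n"
print(classify(sample_log))
```

Logs that match none of the patterns fall into "Unclassified", which is where breakages outside the build tool would end up in this simplified version.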
23. Maven Log Analyzer supports new build breakage categories
Goal Failed: Ant Inside Maven, Run System/Java Program, Run Jetty Server, Manage Ruby Gems, Polyglot for Maven
Broken Outside Maven: No Log Available, Failed Before Maven, Travis Aborted, Failed After Maven, Travis Cancelled
24. Tool-specific breakage is rare.
41% of the broken builds failed due to problems outside of Maven.
25. Build outcomes are heterogeneous
Environment-specific breakage is commonplace
Tool-specific breakage is rare
Future automatic breakage recovery techniques should tackle issues in the CI scripts
26. Our observations have broader implications for researchers and tool builders
For Research Community:
Build outcome noise should be filtered out before analyses
Heterogeneity should be considered when training build outcome prediction models
For Tool Builders:
Automatic breakage recovery should look beyond tool-specific insight
Richer information should be included in build outcome reports and dashboards