Software fuzzing has long been a trusted method for finding vulnerabilities that are difficult to discover through traditional testing. The application of AI and ML to this field has already begun to bear promising results. This session covers the various methods of fuzzing, with examples, documentation, and related resources that can guide practitioners on where to start and which tools are ready to be applied today.
Moving to Modern DevOps with Fuzzing and ML - DevOps Next
1.
2. © 2020 Perforce Software, Inc.
#devopsnext-devops-code
LIVE SLACK Q&A
3. © 2020 Perforce Software, Inc.
Moving to Modern DevOps
with Fuzzing and ML
Justin Reock
4. 4 | DevOps Next 2020 perforce.com
Confidentiality Statement
The information contained in this document is strictly confidential, privileged, and
only for the information of the intended recipient. The information contained in this
document may not be otherwise used, disclosed, copied, altered, or distributed
without the prior written consent of Perforce Software, Inc.
5.
• I'm always fascinated by touch-free processes that use large aggregate sets of data to solve problems
• Although often considered "brute-force" solutions, given how large the playing field is, these days there's a science to culling an infinitely sized list down to one that is merely astronomical in size
• Software bloom, particularly in the world of free software, is continuing in much the predicted pattern: it is growing exponentially, and the exponents are getting quite large in 2020
• So our traditional means of software testing, and therefore software quality, will need to be rethought again to deal with this bloom
• Software fuzzing is an area I find particularly fascinating right now, as it attempts to use large aggregate data sets to automate quality
• An impressive number of vulnerabilities and bugs have been discovered recently using modern fuzzing techniques
• The application of AI and ML is beginning to show promise in improving these techniques even further
Why Choose This Topic?
6.
Doctors are the worst patients.
Coders are the worst testers.
That's why we QA!
7.
• Human cognition simply has limitations, and it becomes increasingly difficult to predict, and therefore account for, every possible testing scenario in order to prove software robustness
• Even if we could imagine all the right scenarios, how much of the code we write is even our code anymore?
• Largely, the business of application development concerns itself with the interplay of various prewritten dependencies
• Open-first development, of which I am a fervent supporter, opens us to a new set of unexpected states which might become bugs or even vulnerabilities
• Though QA teams are still the most reliable form of functional testing, total hardening of software is nearly impossible these days
• There's too much input, too much behind-the-scenes interplay, and too much reliance on direct and external dependencies to be sure we've taken our application logic down as many paths as possible
• At a certain point, we need other, non-interactive means of testing areas of the application that human testers may be blind to
The Limits of QA
8.
• Software fuzzing is one means of achieving this kind of testing, in which we attempt to automate taking an application down as many code execution paths as possible
• And that's really the point of any kind of testing, isn't it, ideally?
• Of course, there are so many logical paths now, right down to the very way we encode and decode the characters that form the UIs we interact with!
• The industry has derived other well-known methods, such as:
• Static Code Analysis – the code, syntax, dependency chain, etc. are analyzed to determine possible code quality issues; sometimes code is even executed and its output analyzed
• Symbolic Execution – code is analyzed and inputs are run through various valid states; program state is examined and symbols are populated according to a valid range
Automated Methods
9.
• Software fuzzing can complement other methods of automated testing, and a full testing solution should, at least right now in late 2020, include elements of all of these previously discussed methods
• Fuzzing attempts to take code execution down paths that were not, or could not be, determined through those other methods
• Static code analysis is still driven by human understanding of the syntax of the code being analyzed and the language the code is written in, so it deals very much in the realm of "validity"
• Symbolic execution can be used within static code analysis to help derive the output of various blocks of code, but it also lives mostly within the realm of valid inputs
• This is all well and good, but what about the myriad unaccounted-for scenarios that couldn't be derived by looking at the code?
• Fuzzing, or at least the goal of fuzzing, is to use input randomness to try to catch the program in code execution states that it didn't expect
Fuzzing
10.
Fuzzing at its Most Basic
Source: https://arxiv.org/pdf/1906.11133.pdf (Section 2)
11.
(a,b) => {
return (a / b);
}
1: [a=7, b=2] => 7 / 2 => 3.5 => A non-interesting state
2: [a=3, b=5] => 3 / 5 => 0.6 => A non-interesting state
3: [a=10,b=2] => 10 / 2 => 5 => A non-interesting state
4: [a=0,b=10] => 0 / 10 => 0 => A non-interesting state
…
??: [a=9, b=0] => 9 / 0 => An interesting state! Fatal divide-by-0 condition
Fuzzing – A Silly Example
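The silly example above can be run end to end with a few lines of Python. This is an illustrative sketch (the function and parameter names are my own, not from the talk): we throw random integer pairs at the divide routine and record only the inputs that trigger an unexpected, "interesting" state.

```python
import random

def divide(a, b):
    return a / b

def fuzz_divide(trials=10_000, seed=0):
    """Throw random integer pairs at divide() and collect the inputs
    that trigger an interesting (unhandled) state."""
    rng = random.Random(seed)
    interesting = []
    for _ in range(trials):
        a, b = rng.randint(-10, 10), rng.randint(-10, 10)
        try:
            divide(a, b)                 # non-interesting state
        except ZeroDivisionError:
            interesting.append((a, b))   # interesting state: divide by 0
    return interesting

crashes = fuzz_divide()
```

Every crashing input the fuzzer finds has b = 0, which is exactly the condition the hand-written test cases above kept missing until it was stumbled upon at random.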
12.
• The generation of inputs and the recognition of interesting states are what we'll predominantly focus on here; they are the biggest challenge to productive fuzzing, but also fuzzing's greatest benefit
• When realized properly, fuzzing can eliminate a lot of the bias of the tester, and even of the static analyzer
• Although, as pictured, some program knowledge can be used to derive effective means of generating the input set, or test corpus, the inputs are, as much as possible, not biased by the tester
• This is because we are, more or less, throwing fully random data at program inputs
• That's data that is random not just in content, but also in format and encoding
• So, throwing alphanumeric or obscure UTF-8 input, or otherwise, at, perhaps, an input that expects a number
• While the power and practicality of fuzzing are defined by this randomness, so is fuzzing's most serious weakness
• How can we possibly, out of a pool of infinitely random inputs, scale down to a corpus we know will generate lots of interesting states, without introducing too much bias?
• And, for the purposes of this presentation, how can AI and ML assist us in refining our corpus?
Fuzzing
13.
Types of Fuzzers
• Our test corpus is based on modifications to existing valid test cases, or to any corpus of test cases that has been known to generate "interesting states"
• This is generally unbounded, so a lot of corpus data ends up being useless, never generating any interesting states
Mutation-Based Fuzzing
• Improves on some of the problems with mutation-based fuzzing by generating a test corpus from the same input rules that are used to frame the normal test cases
• This makes it much more bounded than mutation-based fuzzing, which also means that we can measure how much of the possible testing surface has been explored with a generation-based fuzzer
Generation-Based Fuzzing
• Applies a bit of learning to a test corpus generated in a mutation-based way
• So, for instance, the fuzzer might retain some information on how many new interesting states were derived from a bit of corpus, and that might be combined with another bit of random or interesting data, and so on
Evolutionary Fuzzing
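The mutation-based approach described above can be sketched in a few lines. This is a minimal illustration under my own assumptions (a known-valid seed input, random bit flips as the mutation operator), not any particular fuzzer's implementation:

```python
import random

def mutate(seed_bytes, rng, max_flips=4):
    """Mutation-based fuzzing: derive a new test case by randomly
    flipping a few bits in a known-valid input."""
    data = bytearray(seed_bytes)
    for _ in range(rng.randint(1, max_flips)):
        pos = rng.randrange(len(data))
        data[pos] ^= 1 << rng.randrange(8)  # flip one bit at a random position
    return bytes(data)

rng = random.Random(42)
seed = b'{"user": "alice", "count": 3}'     # an existing valid test case
corpus = [mutate(seed, rng) for _ in range(5)]
```

Because the flips are unconstrained, most mutants are garbage to the target program; that is exactly the unbounded-corpus problem that generation-based and evolutionary fuzzing try to address.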
14.
• All of this advancement in fuzzing has helped, but it should be evident that huge advancements still need to be made if we want fuzzing to advance to the logical next step of touch-free testing
• For instance, fuzzing right now requires a great deal of software domain knowledge to be effective at:
• Recognizing that a state is in fact different from other states that have previously been encountered
• Knowing when we are spinning our wheels by generating a lot of varied input that makes the program "do the same thing" it has been doing for other inputs
• If a newly discovered code execution path is found, recognizing that the state is meaningful
• Determining how to interpret that state and provide taxonomy, i.e. was this a crash, a non-fatal condition, etc.
• Deciding how to report that state based on its taxonomy, i.e. should a heap dump be provided
• Beyond that, how do we know when to mutate our inputs?
• Even as creative humans, we run into the same cognitive limitations when we try to derive new ways of mutating input as we do when simply deriving the input in the first place
Limitations of Fuzzing
15.
That hasn't stopped us from making big advancements in software quality using the advanced fuzzing methods we've already described
LibFuzzer and ClusterFuzz
LibFuzzer is a mutation fuzzer that's easy to include in your own regressions; it is used by countless libraries and has uncovered thousands of bugs
ClusterFuzz is a Google-sponsored distributed fuzzing project that takes advantage of LibFuzzer and is approaching 50,000 discovered browser and OSS bugs (in OSS-Fuzz)
Yet we still have far to go in efficiently reducing our test corpus if we want to get to feasible touch-free testing
16.
• At this point, it's probably clear that evolutionary fuzzing and generation-based fuzzing bear the most promise for improving the test corpus through ML
• Generation-based fuzzing gives us a finite (albeit, in some cases, very large) test surface to select from, which means we can gauge how much of a test surface has been explored by a learning-based fuzzer
• So, for instance, if we trained a model to predict whether a newly generated bit of input would produce an interesting state, we could turn around and apply that prediction to a brand new piece of software
• This could, if properly trained, seriously shorten the number of random cycles necessary to filter down to generated input that will yield interesting states when applied to a brand new application
• Evolutionary fuzzing, though an entirely different approach, can benefit from ML as well
• Imagine training a model on which types of evolutionary mutations made to a test corpus actually end up yielding interesting states
• Evolutionary fuzzing's most pervasive limitation, the sheer, practically infinite amount of surface available to it, could be greatly optimized
Finally! ML and Fuzzing!
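To make the idea of "a model that predicts which inputs will be interesting" concrete, here is a deliberately tiny sketch. Everything in it is assumed for illustration: the feature set, the linear scoring function standing in for a trained model, and the weights (which in practice would come from training on past fuzzing runs):

```python
def features(data: bytes):
    """Toy feature vector: length, count of non-ASCII bytes, count of digits."""
    return (len(data),
            sum(b > 127 for b in data),
            sum(48 <= b <= 57 for b in data))

def score(data: bytes, weights):
    """A linear model standing in for a trained predictor of
    'how likely is this input to trigger an interesting state?'"""
    return sum(w * f for w, f in zip(weights, features(data)))

def prioritize(candidates, weights, keep=0.1):
    """Test scheduling: keep only the fraction of the corpus the
    model ranks as most likely to trigger interesting states."""
    ranked = sorted(candidates, key=lambda d: score(d, weights), reverse=True)
    return ranked[:max(1, int(len(ranked) * keep))]

weights = (0.1, 2.0, -0.5)  # assumed learned weights; not real trained values
corpus = [b"hello", b"\xff\xfe\xfd", b"12345", b"{}"]
top = prioritize(corpus, weights, keep=0.5)
```

With these assumed weights the scorer favors inputs heavy in non-ASCII bytes, so the expensive fuzzing cycles are spent on the half of the corpus the model rates highest rather than on everything.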
17.
✓ Reduction of the Test Corpus
✓ Optimized Mutation of the Test Corpus
✓ Interesting State Recognition
✓ Bug/Vulnerability Translation from Interesting State
✓ Elimination of Bias from the Test Corpus
Areas of Focus for ML in Fuzzing
18.
• With any learning model, we must first identify areas by which we can measure the effectiveness of the sample data that we throw at the learning network
• In the case of software fuzzing, one such yardstick can be established using test scheduling, the process of prioritizing a bit of test input based on how likely that bit is to trigger an interesting state
• Patrice Godefroid, best known for his SAGE fuzzing engine, which combines symbolic execution and generation-based fuzzing, is a leading researcher at Microsoft in this field
• SAGE is an interesting approach which, as Godefroid puts it, "[lets] a single symbolic execution generate thousands of new tests" by executing a cycle of symbolic execution and then generating thousands of corpus bits from that cycle
• SAGE is not really a learning solution, but it led Godefroid to his first major experiment in this arena, which he called his "Learn & Fuzz" solution
• "Learn & Fuzz" carries the goal of eliminating security vulnerabilities in the PDF parser in the Microsoft Edge browser, testing each PDF input field type that could render malicious behavior from a parsed document
ML and Fuzzing
19.
• Godefroid set up a recurrent neural network to keep track of whether fuzzed input in an "objectively valid" state would trigger a previously unknown interesting state
• In other words, for the derived data to be useful, it must not trigger any known or handled state in the program, including error states that have been trapped – but it must also trigger an interesting state
• This is a true "needle in a haystack": we must generate a small corpus of inputs which will cause unexpected things to happen in the PDF parser that were not already accounted for by input validation, encoding validation, and exception handling
• Pinpointing those needles, though, means reducing the test corpus by several orders of magnitude, which in turn greatly reduces the amount of expensive fuzzing that needs to be done
Learn & Fuzz
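To give a feel for the "learn the structure of valid inputs, then sample new candidates" idea behind Learn & Fuzz, here is a drastically simplified stand-in: an order-1 character model instead of the paper's RNN, trained on made-up strings that merely resemble PDF object bodies. None of this is the actual Learn & Fuzz implementation; it only illustrates the generate-from-learned-structure step:

```python
import random
from collections import defaultdict

def train_char_model(samples):
    """Order-1 character model: for each character, record which
    characters follow it in the training corpus (a toy stand-in
    for the RNN used in Learn & Fuzz)."""
    model = defaultdict(list)
    for s in samples:
        for a, b in zip(s, s[1:]):
            model[a].append(b)
    return model

def sample(model, start, length, rng):
    """Generate a new candidate input, one character at a time,
    following the learned transition statistics."""
    out = [start]
    for _ in range(length - 1):
        successors = model.get(out[-1])
        if not successors:
            break
        out.append(rng.choice(successors))
    return "".join(out)

# Made-up training strings resembling PDF object bodies (illustrative only).
valid_objects = ["1 0 obj << /Type /Page >> endobj",
                 "2 0 obj << /Length 42 >> endobj"]
model = train_char_model(valid_objects)
candidate = sample(model, "1", 30, random.Random(0))
```

The sampled candidates look structurally plausible (every character transition was seen in valid data) yet are novel combinations, which is precisely the tension the study explores: staying valid enough to get past the parser's front door while being strange enough to reach unexpected states.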
20.
• Godefroid took a somewhat adversarial approach, employing three different sampling strategies to see which would lead to the highest test coverage while producing enough objectively valid inputs to be useful
• A massive set of PDF files was stitched together to create a gigantic set of PDF field inputs, and those inputs were fuzzed using different algorithms
• Through a series of tests (outlined in my chapter!) Godefroid arrived at a model called SampleFuzz, and that model was shown to provide the highest overall coverage – the most important metric – with a completely acceptable "objectively valid" pass rate:
Learn & Fuzz
21.
• These results are very promising! Over and above the random and known sample sets, larger test coverage was generated
• But we can't ignore that Sample-10k, though it fell almost 2,000 cases short of SampleFuzz, also generated 10% more passable data
• The conclusion of the study is that there still exists tension between learning, which tries to make sense of unordered data by reducing chaos, and fuzzing, which tries to pinpoint various scenarios by increasing chaos!
• It should also be noted that no new bugs were found in those additional 2,000 valid test cases, so this study is still fairly academic
• All that means is that there is still room to grow in this field!
• Our last current study is ExploitMeter, which combines the accessibility of open source software with deep learning to determine patterns that indicate whether found interesting states are in fact exploitable
• So this is an example of using ML to recognize whether an "interesting state" is in fact a "useful state"
A Good, Academic Start
22.
• ExploitMeter itself is still nascent, only trying to predict whether a piece of software is likely to have exploitable vulnerabilities, based on the input types that it has learned are exploitable in other open source applications
ExploitMeter
http://www.cs.binghamton.edu/~ghyan/papers/pac17.pdf (Section V-D)
23.
• The great news is that there's still a *ton* of work to do in this field – and how many fields can still say that?
• And perhaps even better news is just how accessible deep learning frameworks are to modern developers
• Open source learning libraries like TensorFlow and PyBrain make it easy for anyone to get started with these types of experiments
• This is right on time, as our fully realized transformed future is just ahead of us, and the need for fully automated testing has never been higher
• Though we're still far from the Platonic ideal of a fuzzing framework, one that would eliminate the need for program knowledge to generate a useful corpus and identify interesting states, it's clear that the most promise for reaching this goal lies in deep learning
• Major advancements will be needed across the board for this to materialize – but imagine the bulletproof software landscape that will exist when we finally achieve it!
• The future of software quality is deep fuzzing – and the future is bulletproof!
A Lot to Do – A Good Problem to Have!
25.
Advancing the State of The Art
in AI and Testing
COMING UP NEXT…
TRACK
Testing Tools
Cognitive Engineering – Shifting
Right with Gated.AI Testing
TRACK
Continuous Testing
How Does AIOps Benefit DevOps
Pipeline and Software Quality
TRACK
DevOps & Code
26. © 2020 Perforce Software, Inc.
#devopsnext-devops-code
LIVE SLACK Q&A