Follow along with the R Markdown file at http://rpubs.com/alexcpsec/tiq-test-Summer2014-2
Source code available at:
https://github.com/mlsecproject/tiq-test
https://github.com/mlsecproject/tiq-test-Summer2014
https://github.com/mlsecproject/combine
---------
Full Abstract:
Threat Intelligence feeds are now being touted as the saving grace for SIEM and log management deployments, and as a way to supercharge incident detection and even response practices. We have heard similar promises before as an industry, so it is only fair to try to investigate. Since the actual number of breaches and attacks worldwide is unknown, it is impossible to measure how good threat intelligence feeds really are, right? Enter a new scientific breakthrough developed over the last 300 years: statistics!
This presentation will consist of a data-driven analysis of a cross-section of threat intelligence feeds (both open-source and commercial) to measure their statistical bias, their overlap, and how well they represent the unknown population of breaches worldwide. Are they a statistically good measure of the population of "bad stuff" happening out there? Is there even such a thing? How tuned to your specific threat surface are those feeds anyway? Regardless, can we actually make good use of them even if the threats they describe have no overlap with the actual incidents you have been seeing in your environment?
We will provide an open-source tool for attendees to extract, normalize and export data from threat intelligence feeds to use in their internal projects and systems. It will be pre-configured with current OSINT network feeds and easily extensible for private or commercial feeds. All the statistical code written and research data used (from the open-source feeds) will be made available in the spirit of reproducible research. Attendees will be able to use the tool itself to perform the same type of tests on their own data.
Join Alex and Kyle on a journey through the actual real-world usability of threat intelligence to find out which mix of open source and private feeds is right for your organization.
Measuring the IQ of your Threat Intelligence Feeds (#tiqtest)
1. Measuring the IQ of your Threat Intelligence Feeds (#TIQtest)
Alex Pinto, MLSec Project, @alexcpsec, @MLSecProject
Kyle Maxwell, Researcher, @kylemaxwell
2. whoami(s)
Alex Pinto
• Science guy at MLSec Project
• ML trainer
• Network security aficionado
• Tortured by SIEMs as a child
• Hacker Spirit Animal™: CAFFEINATED CAPYBARA
Kyle Maxwell
• Researcher at [REDACTED]
• Math Smuggler
• Recovering Incident Responder
• GPL zealot
• Hacker Spirit Animal™: AXIOMATIC ARMADILLO
(https://secure.flickr.com/photos/kobashi_san/)
(http://www.langorigami.com/art/gallery/gallery.php?tag=mammals&name=armadillo)
3. Agenda
• Threat Intel 102
• Measuring Intelligence
• Data Preparation
• Testing the Data
• Tools:
  • COMBINE
  • TIQ-TEST
• Some parting ideas
(http://www.savagechickens.com/2008/12/iq-test.html)
4. Threat Intel 102: Capability and Intent
• What are they able to do?
• What are they intending to do?
5. Threat Intel 102: Cage Matches
• Signatures vs Indicators
• Data vs Intelligence
• Tactical vs Strategic
• Atomic vs Composite
6. Threat Intel 102: Pyramid of Pain
“Simple” and “easy” aren’t always
(David Bianco – Pyramid of Pain)
7. What about IP addresses?
• Approximately same value as hostnames (APT vs DGA)
• Finite resource (until IPv6, that is)
• Managed / controlled by orgs
• Difficulty / economic incentives / implied “cost”
• Also, recyclable
(https://xkcd.com/865/)
8. Given IP addresses harvested from TI feeds, can we measure how much they “help” our defense metrics?
9. Introducing TIQ-TEST
• All these tests are available as R functions at https://github.com/mlsecproject/tiq-test
• Have fun, prove me wrong, suggest stuff
• Tools that implement those tests
• Sample data + R Markdown file
• The excuse to learn a statistical language you were waiting for!
10. Data Sources – Types of data
• Extract the “raw” information from indicator feeds
• Both IP addresses and hostnames were extracted
11. Data Sources – Feeds Selected
• Data was separated into “inbound” and “outbound”
12. Data Preparation and Cleansing
• Convert the hostname data to IP addresses:
  • Active IP addresses for the respective date (“A” query)
  • Passive DNS from Farsight Security (DNSDB)
• We removed non-public IPs from the dataset (RFC1918)
• Yeah, we know it is a “parking technique”
(https://xkcd.com/742/)
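The RFC1918 cleanup step on this slide could look roughly like the sketch below. It uses Python's standard `ipaddress` module; the function names `is_rfc1918` and `drop_private` and the sample addresses are illustrative assumptions, not code from COMBINE or TIQ-TEST.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(ip_string):
    """Return True if the address falls inside an RFC 1918 private range."""
    address = ipaddress.ip_address(ip_string)
    return any(address in network for network in RFC1918_NETWORKS)


def drop_private(indicators):
    """Keep only the indicators whose IP is outside the RFC 1918 ranges."""
    return [ip for ip in indicators if not is_rfc1918(ip)]


# Illustrative sample, not data from the talk.
sample = ["10.1.2.3", "8.8.8.8", "172.16.0.9", "192.168.1.1", "203.0.113.7"]
public_only = drop_private(sample)
```

The ranges are listed explicitly rather than relying on `is_private`, because that attribute also covers other reserved ranges that the slide does not mention.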
13. Data Preparation and Cleansing
• For each IP record (including the ones from hostnames):
  • Add asnumber and asname (from MaxMind ASN DB)
  • Add country (from MaxMind GeoLite DB)
  • Add rhost (again from DNSDB) – most popular “PTR”
• The experiments will be around ASNs and Geolocation
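As a rough illustration of the enrichment step, the sketch below attaches `asnumber`, `asname` and `country` to an IP record with a longest-prefix match. The lookup table is entirely made up (documentation prefixes and documentation AS numbers) and stands in for the MaxMind databases; the `enrich` function is a hypothetical helper, not the talk's implementation.

```python
import ipaddress

# Made-up lookup table standing in for the ASN and GeoIP databases.
# Prefixes are documentation ranges; AS numbers are documentation ASNs.
HYPOTHETICAL_DB = {
    "203.0.113.0/24": {"asnumber": 64496, "asname": "EXAMPLE-A", "country": "BR"},
    "198.51.100.0/24": {"asnumber": 64497, "asname": "EXAMPLE-B", "country": "US"},
    "198.51.100.128/25": {"asnumber": 64498, "asname": "EXAMPLE-C", "country": "CA"},
}

# Fields returned when no prefix in the table covers the address.
EMPTY = {"asnumber": None, "asname": None, "country": None}


def enrich(ip_string):
    """Attach asnumber, asname and country using the longest matching prefix."""
    address = ipaddress.ip_address(ip_string)
    best_network, best_fields = None, EMPTY
    for prefix, fields in HYPOTHETICAL_DB.items():
        network = ipaddress.ip_network(prefix)
        if address in network and (
            best_network is None or network.prefixlen > best_network.prefixlen
        ):
            best_network, best_fields = network, fields
    return {"ip": ip_string, **best_fields}


records = [enrich(ip) for ip in ["198.51.100.200", "198.51.100.5", "192.0.2.1"]]
```

A real pipeline would query the actual databases instead of a dictionary; the point here is only the shape of the enriched record.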
14. Data Preparation and Cleansing
• However, we will NOT be using maps. Just let it go.
16. Testing the Data
• Let’s generate some interesting metrics:
  • NOVELTY – How often do they update themselves?
  • OVERLAP – How do they compare to what you got?
  • POPULATION – what is in them anyway?
• Population is tricky:
  • Could mean the entire world (all IPv4 space)
  • Should ideally mean YOUR world
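One possible reading of the novelty and overlap metrics, sketched with plain set arithmetic. The exact definitions used by TIQ-TEST may differ; the ratios below (share of today's indicators that are new, share of yesterday's that disappeared, share of one feed found in another) are assumptions made for illustration, as are the function names and sample addresses.

```python
def novelty(previous_day, current_day):
    """Return (added_ratio, churn_ratio) between two daily snapshots of a feed."""
    previous_day, current_day = set(previous_day), set(current_day)
    added = current_day - previous_day      # indicators that are new today
    removed = previous_day - current_day    # indicators that dropped out
    added_ratio = len(added) / len(current_day) if current_day else 0.0
    churn_ratio = len(removed) / len(previous_day) if previous_day else 0.0
    return added_ratio, churn_ratio


def overlap(feed_a, feed_b):
    """Return the fraction of feed_a's indicators that also appear in feed_b."""
    feed_a, feed_b = set(feed_a), set(feed_b)
    return len(feed_a & feed_b) / len(feed_a) if feed_a else 0.0


# Illustrative snapshots built from documentation addresses.
monday = {"198.51.100.1", "198.51.100.2", "198.51.100.3", "198.51.100.4"}
tuesday = {"198.51.100.3", "198.51.100.4", "198.51.100.5",
           "198.51.100.6", "198.51.100.7"}

added_ratio, churn_ratio = novelty(monday, tuesday)
shared = overlap(monday, tuesday)
```

Note that `overlap` is not symmetric: the share of feed A found in feed B generally differs from the share of B found in A.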
17. But WHAT IS THE IQ?!??1?
• We will withhold judgment
• The best data composition is the best one for you
• We will do our best to explain results so you can decide.
• Maybe on further (or more private) research…
25. Population Test
• Let us use the ASN and GeoIP databases that we used to enrich our data as a reference of the “true” population.
• But, but, human beings are unpredictable! We will never be able to forecast this!
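A minimal sketch of the comparison behind a population test, assuming the goal is to contrast each country's share of a feed with its share of a reference population. All numbers below are invented for illustration, and `proportions` and `compare_to_reference` are hypothetical helper names.

```python
from collections import Counter


def proportions(labels):
    """Return each label's share of the total number of records."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def compare_to_reference(feed_countries, reference_share):
    """Return feed share minus reference share for every reference country."""
    feed_share = proportions(feed_countries)
    return {
        country: feed_share.get(country, 0.0) - expected
        for country, expected in reference_share.items()
    }


# Invented example: country labels of 10 enriched feed records.
feed_countries = ["US"] * 5 + ["BR"] * 3 + ["FR"] * 2
# Invented reference shares standing in for the "true" population.
reference_share = {"US": 0.4, "BR": 0.1, "FR": 0.5}

difference = compare_to_reference(feed_countries, reference_share)
```

A positive difference means the country is over-represented in the feed relative to the reference; whether that difference is meaningful is a question for the hypothesis tests on the later slides.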
28. Can we get a better look?
• Don’t like squinting either
• Statistical inference-based comparison models (hypothesis testing)
  • Exact binomial tests (when we have the “true” pop)
  • Chi-squared proportion tests (similar to independence tests)
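To make the exact binomial test concrete, here is a small standard-library sketch. It computes a two-sided p-value by summing the probabilities of every outcome that is no more likely than the observed one, which is one common definition of the two-sided exact test; it is offered as an illustration and is not taken from the TIQ-TEST code. The function names and the relative tolerance are assumptions.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_binomial_test(successes, trials, expected_share):
    """Two-sided exact binomial p-value.

    Sums the probability of every outcome that is at most as likely as the
    observed count under the expected share.
    """
    observed = binomial_pmf(successes, trials, expected_share)
    threshold = observed * (1 + 1e-7)  # small slack for floating-point ties
    p_value = 0.0
    for k in range(trials + 1):
        probability = binomial_pmf(k, trials, expected_share)
        if probability <= threshold:
            p_value += probability
    return min(1.0, p_value)


# 10 of 10 feed records in a country expected to hold half the population.
extreme = exact_binomial_test(10, 10, 0.5)
# 5 of 10 matches the expectation exactly.
typical = exact_binomial_test(5, 10, 0.5)
```

In this toy case only the outcomes 0 and 10 are as unlikely as the observation, so the first p-value is 2/1024, while an observation that matches the expected share yields a p-value of 1.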
29. Can we get a better look?
• We can better estimate, with confidence intervals, our measures of error.
• Also, p-values! (with apologies to Alex Hutton)
• We promise to be very conservative in using them.
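As one example of putting a confidence interval around a feed proportion, the sketch below computes a Wilson score interval. The slides do not say which interval the authors used, so the choice of Wilson, the function name `wilson_interval` and the default `z = 1.96` (roughly 95% confidence) are all assumptions.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion; z=1.96 is roughly 95% confidence."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    share = successes / trials
    denominator = 1 + z**2 / trials
    center = (share + z**2 / (2 * trials)) / denominator
    half_width = (
        z * sqrt(share * (1 - share) / trials + z**2 / (4 * trials**2))
        / denominator
    )
    return center - half_width, center + half_width


# 50 of 100 feed records fall in a given country (invented numbers).
low, high = wilson_interval(50, 100)
```

With 50 of 100 records the interval is roughly 0.40 to 0.60, and it narrows as the number of records grows.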
31. Hacker Spirit Animal™ Guide
• US – Eagle
• CA – Moose
• FR – Frog
• GB – Bulldog
• AU – Koala
• BR – Capybara / Toucan
• Texas – Armadillo
• Disclaimer: we do not endorse Geolocation-based attribution
37. Introducing COMBINE
• Components:
  1. Reaper gathers the threat data directly from feeds.
  2. Thresher normalizes it into a simplistic data model.
  3. Winnower optionally performs basic validation or enrichment.
  4. Baler transforms the data into CybOX, CSV, JSON, and CIM. (Only CSV and JSON work right now). Could also write others fairly easily. (nudge nudge, wink wink)
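The four stages can be pictured as a small pipeline. The sketch below is a toy reconstruction of the idea only: the function names (`reap`, `thresh`, `winnow`, `bale_csv`, `bale_json`), the record fields and the in-memory feed are hypothetical and are not COMBINE's actual code or data model.

```python
import csv
import io
import ipaddress
import json
import re

# Hypothetical in-memory feed; a real reaper would fetch feeds over the network.
RAW_FEED = """# example feed, comment lines start with '#'
203.0.113.10
evil.example.com
10.0.0.5
not an indicator!!
"""

RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]
HOSTNAME = re.compile(r"(?:[a-z0-9-]{1,63}\.)+[a-z]{2,63}")


def reap(raw_text):
    """Gather: return the non-empty, non-comment lines of a feed."""
    lines = (line.strip() for line in raw_text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def thresh(lines, source):
    """Normalize: turn each line into a record with entity, type and source."""
    records = []
    for line in lines:
        try:
            ipaddress.ip_address(line)
            kind = "ip"
        except ValueError:
            kind = "hostname"
        records.append({"entity": line, "type": kind, "source": source})
    return records


def winnow(records):
    """Validate: drop RFC 1918 addresses and malformed hostnames."""
    kept = []
    for record in records:
        if record["type"] == "ip":
            address = ipaddress.ip_address(record["entity"])
            if any(address in network for network in RFC1918):
                continue
        elif not HOSTNAME.fullmatch(record["entity"].lower()):
            continue
        kept.append(record)
    return kept


def bale_csv(records):
    """Export the records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["entity", "type", "source"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def bale_json(records):
    """Export the records as a JSON array."""
    return json.dumps(records)


records = winnow(thresh(reap(RAW_FEED), source="example-feed"))
```

Keeping each stage as a plain function over lists of records makes it easy to add another output format or validation step without touching the others.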
38. Introducing COMBINE
• Always trying to feed it more. Lots of possibilities, including your own data sources.
• We clearly do NOT endorse any included feeds.
39. Introducing COMBINE
• Enrichments - think metadata.
  • AS, geolocation
  • DNS resolutions courtesy of Farsight DNSDB
  • Ask them for an API key to test it, tell them Alex Pinto sent you ;)
40. MLSec Project
• Both projects have been released as GPLv3 by MLSec Project
• Will replace the internal versions we have on the main code
• Looking for participants and data sharing agreements
• Liked TIQ-TEST? We can benchmark your private feeds using these and other techniques
• Visit https://www.mlsecproject.org, message @MLSecProject or just e-mail me.
41. Take Aways
• Analyze your data.
• Extract value from it!
• Try before you buy! Different test results mean different things to different orgs.
• Use the tools! Suggest new tests!
• Share data with us! We take good care of it, make sure it gets proper exercise.
42. Thanks!
• Q&A?
• Feedback!
Alex Pinto, @alexcpsec, @MLSecProject
Kyle Maxwell, @kylemaxwell
“The measure of intelligence is the ability to change.” - Albert Einstein