Tailored, Machine Learning-driven Password Guessing Attacks and Mitigation

Georg Knabl

Georg Knabl
• self-employed IT-Consultant &
Software Engineer at
• based in Graz, Austria
• areas of expertise
• machine learning implementations
• web development
• information security

The Problem with Human Passwords

A Human Attack Vector
• people use password creation schemes
• types
• machine-random (&CtAEaCp?b&v"s%)
• human-general (123456)
• human-individual (John1970!)
• human-random (randomly typed, 34ghjk34f3hjkHGFC)
• What about correct horse battery staple?
• issues
• reduced entropy
• attacker: knowing scheme (+ personal data) => password
• humans limited in creativity
=> somebody else might have come up with the same scheme
=> schemes publicly available in password leaks
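The entropy reduction can be made concrete with a rough estimate. The sketch below assumes a 94-character printable ASCII alphabet for a machine-random password and, for the human-individual scheme <first name><year of birth>!, an attacker who already knows the scheme and the first name, so that only the year remains unknown (assumed here to be one of 100 candidates). All numbers are illustrative assumptions, not measurements.

```python
import math

def entropy_bits(choices_per_position: int, length: int) -> float:
    """Entropy in bits of a string with independent, uniformly chosen positions."""
    return length * math.log2(choices_per_position)

# Machine-random: 15 characters drawn from 94 printable ASCII symbols (assumption).
machine_random = entropy_bits(94, 15)

# Human-individual with known scheme and first name: only the year of birth
# is unknown; 100 candidate years is an assumed search range.
human_individual = math.log2(100)

print(f"machine-random:   {machine_random:.1f} bits")
print(f"human-individual: {human_individual:.1f} bits")
```

Under these assumptions the machine-random password carries roughly 98 bits, while the scheme-based password leaves the attacker with fewer than 7 bits to guess.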

Traditional Approaches
• Hybrid or rule-based: dictionaries, word-mangling rules
• Markov Models: high-probability character sequences
• Masks: reduce set to typical structures
• Brute-force: try every possible combination
[Figure: key space of the four approaches (Dunning, 2016)]
• tool support: hashcat, John-the-Ripper, PACK, CeWL, CUPP, …
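A hybrid attack can be sketched as a dictionary combined with a handful of word-mangling rules. The rules and suffixes below are hand-picked assumptions for illustration; they are not taken from any of the tools listed above, which ship their own, far larger rule sets.

```python
from itertools import product

def mangle(word):
    """Apply a few illustrative word-mangling rules to one dictionary word."""
    cases = {word, word.lower(), word.capitalize(), word.upper()}
    suffixes = ["", "1", "123", "!"]  # assumed, hand-picked suffix rules
    return {base + suffix for base, suffix in product(cases, suffixes)}

dictionary = ["password", "princess"]  # e.g. entries taken from a leaked list

# Union of all mangled variants, sorted for stable output.
candidates = sorted(set().union(*(mangle(word) for word in dictionary)))
print(len(candidates), candidates[:5])
```

Each dictionary word expands into a small, structured set of guesses, which keeps the key space far below that of brute force.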

Dictionary Sources
• password leaks: rockyou.txt, exploit.in, …
• tailored lists
• CeWL: web scraping
• CUPP: pre-defined questions
• example entries:
• CeWL (web scraping): Analytics, Website, Designs, Webdesign, Rebranding, passionately, simply, Factory, …
• CUPP (pre-defined questions): smithJohn@*, smithJohn@@, smithJohn_1, smithSmithy, smith_, smith_01, smith_01050, …
• password leak: 123456, 12345, 123456789, password, iloveyou, princess, 1234567, 12345678, abc123, …

Neural Networks
• analyze huge datasets
• learn hidden structures
• reproduce structures on new data
• supervised learning process: train on data => generate model => use model to analyze/generate

Recurrent Neural Networks (RNN)
• learn, analyze, reproduce sequences
• password = sequence of characters
• password list: next password
=> \n: just another character
(Olah, 2015)
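The idea that a password list is just one long character sequence can be sketched as follows. The example assumes the usual next-character prediction setup, in which the training target is the input shifted by one position; the two passwords are placeholders.

```python
passwords = ["abc123", "iloveyou"]

# Join the list into one stream; "\n" marks the end of each password.
stream = "".join(password + "\n" for password in passwords)

# Next-character prediction: the target is the input shifted by one position.
inputs, targets = stream[:-1], stream[1:]

for current_char, next_char in list(zip(inputs, targets))[:8]:
    print(repr(current_char), "->", repr(next_char))
```

After the last character of a password the model has to predict "\n", and after "\n" the first character of the next password, so starting a new password is learned like any other transition.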

RNN Tokenization
• vocabulary (index: character): 0: a, 1: b, 2: c, 3: d, 4: e, …, 92: \n
• training: source data "abc" => 0, 1, 2
• generation: 2, 3, 4 => target data "cde"
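A minimal sketch of this character-level tokenization is shown below. The alphabet is an assumption (ASCII letters, digits, punctuation and the newline), so its size differs from the vocabulary on the slide; only the principle of mapping characters to indices and back is the same.

```python
import string

# Assumed vocabulary: ASCII letters, digits, punctuation, plus "\n".
vocabulary = (string.ascii_lowercase + string.ascii_uppercase
              + string.digits + string.punctuation + "\n")
char_to_index = {char: index for index, char in enumerate(vocabulary)}

def encode(text):
    """Turn a string into a list of vocabulary indices."""
    return [char_to_index[char] for char in text]

def decode(indices):
    """Turn a list of vocabulary indices back into a string."""
    return "".join(vocabulary[index] for index in indices)

print(encode("abc"))
print(decode([2, 3, 4]))
```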

char-rnn
• RNN predicts character sequences based on training text
• by Andrej Karpathy
• https://github.com/karpathy/char-rnn
(Karpathy, 2015)

Works of Shakespeare
[Figure: training text and generated output (Karpathy, 2015)]

Linux Source Code
[Figure: training text and generated output (Karpathy, 2015)]

rockyou.txt
[Figure: training data and generated output]

General Human Passwords Guessing
• Neural Networks outperform other methods at above 10^10 guesses
• (almost) infinite number of passwords
(Melicher et al., 2016)

Exploiting Individual Human Password Schemes
A Machine Learning Approach

Relevance
• most passwords have individual context
• individual details publicly available (OSINT)
• social media
=> harvester scripts
• website user tables
=> leaked database dumps
• …
exploit.in

Tailored Password Lists
output of the trained model (example):
John2050
180374
09091958
06031982
160883
soni
John!
john!
j0hn.5m17h
john.smith
Smith866
asdfghj
John50

Data Protection Compliance
• EU-GDPR (General Data Protection Regulation)
• significant fines
• up to 20 million € or 4% of worldwide annual revenue
• processing personal data requires consent
• password lists contain personal information
=> use of publicly available leaked data is illegal
• imbalance
• info-sec researcher: has to comply & find (less ideal) alternatives
• attacker: ignores regulations & trains on best available data

Data Protection Compliance
• compliant solutions to collect data
• general passwords:
• use e.g. top-100,000 passwords list
=> no personal details contained
• individual details + passwords:
• compliance based on "public interest"? (GDPR Art. 6 (1) (e))
• collect consent from users
=> requires broad access to user data
a) directly store & relate data until training is finished
=> requires password storage in plaintext (!!!)
b) only store tokenized password schemes without user relation
=> requires all relatable personal data to be known at password hashing time

Challenges
• generate password sequences ✓
• GDPR compliance ?
• recognize & relate individual structures ?
• How to relate personal data?
• same scheme, different character sequences
<first name><year of birth>!
John1985!, Jane1992!
• dealing with obfuscations ?
• e.g. Leetspeak, all upper/lower case
j0hn1985!, JOHN1985!, john1985!

Generating a Dataset Containing Individual Details
• starting point: any password leak that contains a personal identifier
• char-rnn requires > 50,000 entries for proper results
• e.g. exploit.in (797 million credentials): <email address>:<password>
• collect, match and attach personal details to entries
• e.g. using social media harvester

Generating a Dataset Containing Individual Details
• example result:
Gender  Username   First Name  Last Name  Year of Birth  Password
f       margarete  Judy        Wells      1972           Wells106
f       sondra     Lucia       Morrow     1950           cvbnm
f       zakia      Gale        Weiss      1999           syndikat
f       eada       Ana         Elliott    1994           Ana94
f       karalee    Denise      Hanson     1965           OLIVER
m       agatha     Edmond      Daniels    1956           Agatha
…

Password Schemes Used
• Random: random choice from a top-X password list (e.g. 123456)
• Easy to Type: nearby characters on the keyboard (e.g. qwerty)
• Username: use person's username (e.g. smithy)
• First Name + "!": use person's first name plus exclamation mark (e.g. John!)
• Lowercased First Name + "!": use person's lowercased first name plus exclamation mark (e.g. john!)
• Last Name + Random Int: use person's last name plus a three-digit integer at the end (e.g. Smith758)
• Username Leetspeak: use person's username in Leetspeak (e.g. 5m17hy)
• First Name + Year of Birth (4 digits): use person's first name plus their year of birth (e.g. John1985)
• First Name + Year of Birth (2 digits): use person's first name plus their year of birth in two digits (e.g. John85)
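The schemes listed above can be sketched as small functions that turn a fake person into a fake password. This is an illustrative reconstruction, not the generator used for the thesis: the Leetspeak mapping is inferred from the example 5m17hy, and the pools TOP_PASSWORDS and KEYBOARD_WALKS are hypothetical stand-ins.

```python
import random

LEET = str.maketrans("aeilost", "4311057")  # assumed mapping, matches 5m17hy
TOP_PASSWORDS = ["123456", "password", "iloveyou"]  # stand-in for a top-X list
KEYBOARD_WALKS = ["qwerty", "asdfghj", "cvbnm"]     # assumed "easy to type" pool

SCHEMES = {
    "Random": lambda p, rng: rng.choice(TOP_PASSWORDS),
    "Easy to Type": lambda p, rng: rng.choice(KEYBOARD_WALKS),
    "Username": lambda p, rng: p["username"],
    "First Name + !": lambda p, rng: p["first_name"] + "!",
    "Lowercased First Name + !": lambda p, rng: p["first_name"].lower() + "!",
    "Last Name + Random Int": lambda p, rng: p["last_name"] + str(rng.randrange(100, 1000)),
    "Username Leetspeak": lambda p, rng: p["username"].lower().translate(LEET),
    "First Name + Year of Birth (4 digits)": lambda p, rng: p["first_name"] + str(p["year_of_birth"]),
    "First Name + Year of Birth (2 digits)": lambda p, rng: p["first_name"] + str(p["year_of_birth"])[-2:],
}

def fake_password(person, rng):
    """Pick one scheme at random and apply it to a fake person."""
    scheme = rng.choice(list(SCHEMES))
    return scheme, SCHEMES[scheme](person, rng)

person = {"username": "smithy", "first_name": "John",
          "last_name": "Smith", "year_of_birth": 1985}
rng = random.Random(42)  # fixed seed so runs are repeatable
for _ in range(3):
    print(fake_password(person, rng))
```

Because both the persons and the rules are synthetic, a dataset built this way contains no real personal data.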

Tokenization
• replace personal details with column id
• column id is just another character
• problem: exact matching fails to match obfuscations or abbreviations
• John != j0hn
• 1986 != 86
#  First Name  Year of Birth  Password  Resulting Password Tokens
1  Max         1983           Max1983!  column: First Name, column: Year of Birth, !
2  John        1986           John86!   column: First Name, 8, 6, !
3  Max         1987           123456    1, 2, 3, 4, 5, 6
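The exact-matching tokenization from the table can be sketched as follows. Column ids are represented here as string tokens such as "<First Name>" inside a list; this representation is an assumption made for readability, since in the approach described above a column id is treated as just another character of the vocabulary.

```python
def tokenize(password, record):
    """Replace exact occurrences of personal details with column tokens."""
    tokens, position = [], 0
    # Try longer values first so the longest matching detail wins.
    columns = sorted(record.items(), key=lambda item: -len(item[1]))
    while position < len(password):
        for column, value in columns:
            if value and password.startswith(value, position):
                tokens.append(f"<{column}>")
                position += len(value)
                break
        else:
            tokens.append(password[position])  # ordinary character
            position += 1
    return tokens

print(tokenize("Max1983!", {"First Name": "Max", "Year of Birth": "1983"}))
print(tokenize("John86!", {"First Name": "John", "Year of Birth": "1986"}))
print(tokenize("123456", {"First Name": "Max", "Year of Birth": "1987"}))
```

The second call shows the limitation named above: the abbreviated year 86 does not match 1986 and is tokenized as two plain digits.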

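Support Matching Using Data Variations
• add on-the-fly word mangling rules to columns
• Leetspeak
• lowercase
• uppercase
• …
original row:
Gender  Username  First Name  Last Name
f       tania     Kara        Rosales
…
row extended with variations (original, Leetspeak, lowercase, uppercase):
Gender:      f, f, f, F
Username:    tania, 74n14, tania, TANIA
First Name:  Kara, k4r4, kara, KARA
Last Name:   Rosales, r054135, rosales, ROSALES
…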
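Extending each column with mangled variants can be sketched as below. The Leetspeak mapping is inferred from the variants shown on the slide (74n14, k4r4, r054135); the rule names and the helper match_prefix are hypothetical and only illustrate how a variant could be related back to its column.

```python
LEET = str.maketrans("aeilost", "4311057")  # inferred from 74n14, k4r4, r054135

RULES = {
    "original": lambda value: value,
    "leetspeak": lambda value: value.lower().translate(LEET),
    "lowercase": str.lower,
    "uppercase": str.upper,
}

def expand(record):
    """Return {(column, rule): variant} for every column and mangling rule."""
    return {(column, rule): mangle(value)
            for column, value in record.items()
            for rule, mangle in RULES.items()}

def match_prefix(password, variants):
    """Find the longest variant the password starts with, or None."""
    hits = [(key, variant) for key, variant in variants.items()
            if variant and password.startswith(variant)]
    return max(hits, key=lambda hit: len(hit[1]), default=None)

record = {"Gender": "f", "Username": "tania",
          "First Name": "Kara", "Last Name": "Rosales"}
variants = expand(record)
print(match_prefix("k4r4!", variants))
```

With the variants in place, a password such as k4r4! can be related to the First Name column even though it does not contain the literal value Kara.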

Challenges
• generate password sequences ✓
• GDPR compliance ✓
=> use top-X password lists + fake rules
• recognize & relate individual structures ✓
=> column ids instead of individual details
• dealing with obfuscations ✓
=> on-the-fly word mangling rules to extend columns

Implementation
• Python application based on Sean Robertson's char-rnn.pytorch
• https://github.com/spro/char-rnn.pytorch
• adaptations (excerpt)
• matrix-based individual detail matching
• on-the-fly word-mangling rules

Training
samples generated during training:
• epoch 10: Whn, carickte, aanhls, cshscarn, suasso, ail, zpkoty, beigedl, 11883469, aw, aeeenl, aiseie, enal, faedni, bnoxtln
• epoch 40: Wh, ronis25, 44353133, maty, 0598971, treames, bicken, ratont, tulie, stocker, shathos, netrer, derfa, tolei, dorled
• epoch 280: Wh, ge, butter, jackout, 05081984, lllllll, sian, harder, chedle, raven, 11021985, supers, 17031988, spike, duddick

Attacking the Target
• collect data about victim & generate dataset
• use trained model to generate a tailored password list
• quality of list depends heavily on
• selected training data
• hyperparameter configuration
Gender  Username    First Name  Last Name  Year of Birth
m       john.smith  John        Smith      2050
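Turning generated token sequences into a tailored list for this target can be sketched as follows. The sequences in sampled are hypothetical stand-ins for samples drawn from a trained model, and the "<Column>" token notation is an assumption of this sketch.

```python
target = {"Gender": "m", "Username": "john.smith", "First Name": "John",
          "Last Name": "Smith", "Year of Birth": "2050"}

# Hypothetical token sequences, standing in for samples drawn from the model.
sampled = [
    ["<First Name>", "<Year of Birth>"],
    ["<First Name>", "!"],
    ["<Username>"],
    ["<First Name>", "!"],
    ["a", "s", "d", "f", "g", "h", "j"],
]

def detokenize(tokens, record):
    """Substitute column tokens with the target's details; keep other characters."""
    column_values = {f"<{column}>": value for column, value in record.items()}
    return "".join(column_values.get(token, token) for token in tokens)

# dict.fromkeys removes duplicates while preserving the generation order.
password_list = list(dict.fromkeys(detokenize(tokens, target) for tokens in sampled))
print(password_list)
```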

Results & Qualitative Analysis

Scheme Adoption
target:
Gender  Username    First Name  Last Name  Year of Birth
m       john.smith  John        Smith      2050
generated password list with the adopted schemes:
• John2050: First Name + Year of Birth (4 digits), learned
• 180374, 09091958, 06031982, 160883, soni: Random, stochastic character generation (mostly human dates)
• John!: First Name + "!", learned
• John!: duplicate because of few available rules
• [skipped until line 14]
• john!: Lowercased First Name + "!", learned using word mangling
• j0hn.5m17h: Username Leetspeak, learned using word mangling
• john.smith: Username, learned
• Smith866: Last Name + Random Int, partially learned + stochastic generation
• asdfghj: Easy to Type, learned
• John50: First Name + Year of Birth (2 digits), partially learned + stochastic generation
• [...]

Proving Password Scheme Adoption
1. use new fake dataset with same schemes
2. loop through each entry and generate an individual password list (1000 entries)
3. check if password is on that list
Gender  Username   First Name  Last Name  Year of Birth  Password
f       margarete  Judy        Wells      1972           Wells106 ?
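The three steps can be sketched as a small evaluation loop. The function toy_generator is a hypothetical stand-in that only knows one scheme; a real run would sample the tailored list from the trained model instead.

```python
def match_rate(dataset, generate_list, list_size=1000):
    """Fraction of entries whose password appears in their tailored list."""
    hits = 0
    for entry in dataset:
        # The generator only sees the personal details, never the password.
        details = {key: value for key, value in entry.items() if key != "Password"}
        guesses = generate_list(details, list_size)
        hits += entry["Password"] in guesses
    return hits / len(dataset)

def toy_generator(details, list_size):
    """Hypothetical generator: last name plus a three-digit number."""
    return [details["Last Name"] + f"{number:03d}" for number in range(list_size)]

dataset = [
    {"Username": "margarete", "First Name": "Judy", "Last Name": "Wells",
     "Year of Birth": "1972", "Password": "Wells106"},
    {"Username": "eada", "First Name": "Ana", "Last Name": "Elliott",
     "Year of Birth": "1994", "Password": "Ana94"},
]
print(match_rate(dataset, toy_generator))
```

Passing the generator as a parameter keeps the evaluation independent of the model, so differently configured models can be compared on the same fake dataset.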

Results
• 6 models with different configurations
• all models match about 70% in password lists of only ~100 lines
• optimized configurations increase matching efficiency
• recreated distributions of schemes

Mitigation Strategies
• generate own model and check user's password against generated lists
• attacker's model and dataset not available
=> password lists will differ
• long or complex passwords
• passwords might still be guessed if they contain personal information
• e.g. JohnSmith1985 is actually <column: firstname><column: lastname><column: year of birth>
• treating all human-like passwords as insecure
• requires classification of human-likeness
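The point about long passwords built from personal information can be illustrated with a simple check that measures how much of a password is covered by known details and their variants. This is a heuristic sketch under stated assumptions (an inferred Leetspeak mapping, two-digit abbreviation of numeric details, case-insensitive matching), not a method presented in the talk.

```python
LEET = str.maketrans("aeilost", "4311057")  # assumed Leetspeak mapping

def personal_coverage(password, details):
    """Share of password characters covered by known details or their variants."""
    remaining = password.lower()
    variants = set()
    for value in details:
        lowered = value.lower()
        # Numeric details (e.g. a year) are also tried in two-digit form.
        variants |= {lowered, lowered.translate(LEET),
                     lowered[-2:] if lowered.isdigit() else lowered}
    covered = 0
    for variant in sorted(variants, key=len, reverse=True):
        if len(variant) >= 2 and variant in remaining:
            covered += len(variant)
            # Blank out the matched part so it is not counted twice.
            remaining = remaining.replace(variant, "\0" * len(variant), 1)
    return covered / len(password) if password else 0.0

details = ["John", "Smith", "1985"]
print(personal_coverage("JohnSmith1985", details))
print(personal_coverage("j0hn85!", details))
```

A 13-character password such as JohnSmith1985 is fully covered by three known details, which is why length alone does not protect it.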

Human Password Classification
• using machine learning to classify human-likeness
• dataset (80k human + 80k machine labeled passwords)
• classifiers
• Logistic Regression
• Multinomial Naïve Bayes
• Linear Support Vector Machine
• Random Forest
• vectorizers
• TFIDF
• Count
• dataset sample (h = human, m = machine):
&CtAEaCp?b&v"s% m
-SUuf4TLtF m
mallrats h
bP0.}BO/L&{: m
^=c.rgH$z m
boxers h
j&uzHCutff_A{ m
656565 h
6>IB|~@4^n}K m
forever1 h
…
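To illustrate the idea without external dependencies, the sketch below implements a tiny multinomial Naïve Bayes over character counts from scratch and trains it only on the ten sample rows above. It is not the classifier from the talk: the real dataset has 80k passwords per class, and the function names train and classify are hypothetical.

```python
import math
from collections import Counter

SAMPLES = [
    ('&CtAEaCp?b&v"s%', "m"), ("-SUuf4TLtF", "m"), ("mallrats", "h"),
    ("bP0.}BO/L&{:", "m"), ("^=c.rgH$z", "m"), ("boxers", "h"),
    ("j&uzHCutff_A{", "m"), ("656565", "h"), ("6>IB|~@4^n}K", "m"),
    ("forever1", "h"),
]

def train(samples):
    """Count characters per label (a character-level count vectorizer)."""
    char_counts = {"h": Counter(), "m": Counter()}
    label_counts = Counter()
    for password, label in samples:
        char_counts[label].update(password)
        label_counts[label] += 1
    vocabulary = set(char_counts["h"]) | set(char_counts["m"])
    return char_counts, label_counts, vocabulary

def classify(password, model):
    """Return the label with the highest log posterior."""
    char_counts, label_counts, vocabulary = model
    scores = {}
    for label, counts in char_counts.items():
        total = sum(counts.values()) + len(vocabulary)  # Laplace smoothing
        score = math.log(label_counts[label] / sum(label_counts.values()))
        for char in password:
            score += math.log((counts[char] + 1) / total)
        scores[label] = score
    return max(scores, key=scores.get)

model = train(SAMPLES)
print(classify("boxers", model), classify("^=c.rgH$z", model))
```

With only ten training rows the model mostly memorizes which characters occur in which class, so it says little about unseen passwords; the structure of training and scoring is what the sketch is meant to show.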

Results
• accuracy human vs. machine-random: 99% correct
• classifier scores (higher = more human-like):
14061966 0.9961306540
y-JQ6{v;_yb|q 0.0000000000
ZBT4n#z-x 0.0000121259
longball 0.9920406811
vikings 0.9723564484
gunit 0.9683620674
.XP?]b36nP]l| 0.0000000000
8J9{Bd^ 0.0000107884
123india 0.9986476258
*[qg;t 0.0000058089
…

What about randomly-typed passwords?
• human-random passwords
• almost impossible for humans to distinguish
• previously trained model: 83% correct
• specifically trained model (human-random vs. machine-random): 94% correct
,asgl213
HGHfwjiofjiw!?
FEA452
dciuowed7983zy_
jksdgf644kjbndf
Xkkeelt7tad5z
sabjas012
123jfmvfkfn49fvk.
…

Conclusion
• machine learning can be used to efficiently attack passwords created by humans
• mitigation
• treat human passwords as insecure
• warn users or provide password policy
=> use machine learning model to identify human passwords
=> integrate on web servers & password storage services

Resources
• Thesis "Machine Learning-driven Password Lists":
• https://www.researchgate.net/publication/328719001_Machine_Learning-driven_Password_Lists
• Human Password Classifier:
• https://github.com/georgknabl/human-password-classifier
• ready-to-use trained models available via e-mail

"The only secure password is the one you can't remember."
Troy Hunt (haveibeenpwned.com)

Contact
DI (FH) Georg Knabl, MSc
IT-Consultant & Software Engineer
georg.knabl@pageonstage.at

Sources
• Dunning, Julian (2016). Statistics Will Crack Your Password. Available from: https://p16.praetorian.com/blog/statistics-will-crack-yourpassword-mask-structure [Mar. 3, 2018]
• Karpathy, Andrej (2015). The Unreasonable Effectiveness of Recurrent Neural Networks. Available from: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ [Nov. 10, 2017]
• Melicher, William, Blase Ur, Sean M. Segreti, Saranga Komanduri, Lujo Bauer, Nicolas Christin, and Lorrie Faith Cranor (2016). "Fast, Lean, and Accurate: Modeling Password Guessability Using Neural Networks". In: 25th USENIX Security Symposium (USENIX Security 16). Austin, TX: USENIX Association, pp. 175–191.
• Olah, Christopher (2015). Understanding LSTM Networks. Available from: http://colah.github.io/posts/2015-08-Understanding-LSTMs/ [Nov. 10, 2017]
