A machine learning model, specifically a recurrent neural network, can be trained on large datasets of passwords and personal details to generate targeted password guessing lists that exploit common human password construction schemes. In the experiments presented, all trained models matched about 70% of passwords within lists of only ~100 candidates. Mitigation strategies include treating passwords identified as likely human-generated as insecure and integrating a human password classifier on servers to warn users of the risk.
2. Georg Knabl
• self-employed IT-Consultant &
Software Engineer at
• based in Graz, Austria
• areas of expertise
• machine learning implementations
• web development
• information security
5. A Human Attack Vector
• people use password creation schemes
• types
• machine-random (&CtAEaCp?b&v"s%)
• human-general (123456)
• human-individual (John1970!)
• human-random (randomly typed, 34ghjk34f3hjkHGFC)
• What about correct horse battery staple?
• issues
• reduced entropy
• attacker: knowing scheme (+ personal data) => password
• humans limited in creativity
• somebody else might have come up with same scheme
• schemes publicly available in password leaks
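To make "reduced entropy" concrete, the following sketch compares the search space of a machine-random password with that of a human-individual scheme. The alphabet size, the number of first names, and the range of birth years are illustrative assumptions, not figures from the talk.

```python
import math

def bits(search_space: int) -> float:
    """Entropy in bits of one uniformly random pick from `search_space` options."""
    return math.log2(search_space)

# Assumption: 15 characters drawn uniformly from the 95 printable ASCII symbols.
machine_random = 95 ** 15

# Assumption: the attacker knows the scheme <first name><year of birth>! but not
# the person, and tries 10,000 first names combined with 100 birth years.
scheme_only = 10_000 * 100

# Scheme and personal data both known: exactly one candidate is left.
scheme_and_person = 1

for label, space in [("machine-random", machine_random),
                     ("scheme known", scheme_only),
                     ("scheme + personal data known", scheme_and_person)]:
    print(f"{label}: {bits(space):.1f} bits")
```

Once both the scheme and the personal data are known, no uncertainty remains, which is the situation described by "knowing scheme (+ personal data) => password".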
7. Traditional Approaches
• Hybrid or rule-based: dictionaries, word-mangling rules
• Markov models: high-probability character sequences
• Masks: reduce set to typical structures
• Brute-force: try every possible combination
[Figure: key space (Dunning, 2016)]
• tool support: hashcat, John the Ripper, PACK, CeWL, CUPP, …
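How much a mask reduces the key space compared with brute force can be estimated with a few lines of Python. This is a minimal sketch: the placeholders ?l, ?u, ?d, ?s and ?a follow hashcat's built-in charsets, while the parser itself is a simplification written for this example (no custom charsets, no escaped "??").

```python
# Sizes of hashcat's built-in charsets: lowercase, uppercase, digits,
# special characters, and all printable ASCII characters.
CHARSET_SIZES = {"l": 26, "u": 26, "d": 10, "s": 33, "a": 95}

def mask_keyspace(mask: str) -> int:
    """Number of candidates described by a mask such as '?u?l?l?l?d?d?d?d'."""
    size = 1
    i = 0
    while i < len(mask):
        if mask[i] == "?":
            size *= CHARSET_SIZES[mask[i + 1]]  # placeholder: one of several characters
            i += 2
        else:
            i += 1  # literal character: exactly one option
    return size

name_and_year = mask_keyspace("?u?l?l?l?d?d?d?d")   # structure of e.g. John1985
brute_force = mask_keyspace("?a" * 8)               # any 8 printable characters
print(f"mask: {name_and_year:,} candidates, brute force: {brute_force:,} candidates")
```

The mask covers only passwords with that exact structure, so the smaller key space is paid for with reduced coverage.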
10. Neural Networks
• analyze huge datasets
• learn hidden structures
• reproduce structures on new data
• supervised learning process: train on data => generate model => use model to analyze/generate
11. Recurrent Neural Networks (RNN)
• learn, analyze, reproduce sequences
• password = sequence of characters
• password list: sequence of passwords; the line break (\n) before the next password is just another character
(Olah, 2015)
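The idea that a password list is one long character sequence can be illustrated as follows. This is a simplified view of how character-level models are commonly trained (each character is paired with the one that follows it); it is not char-rnn's actual batching code, and the function name is made up for this example.

```python
def next_char_pairs(passwords):
    """Join passwords into one character stream and pair every character
    with its successor, which is what the network learns to predict."""
    stream = "".join(password + "\n" for password in passwords)
    return list(zip(stream[:-1], stream[1:]))

pairs = next_char_pairs(["abc", "de"])
for current, following in pairs:
    print(repr(current), "->", repr(following))
```

The pair ('c', '\n') shows that "end of password" is learned like any other character, and ('\n', 'd') shows that the start of the next password is learned as well.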
12. RNN Tokenization
• vocabulary (token id, character):
0 a
1 b
2 c
3 d
4 e
… …
92 \n
• training: source data "abc" => 0, 1, 2
• generation: 2, 3, 4 => target data "cde"
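A character vocabulary of this kind can be sketched in a few lines. The assumption here is that ids are assigned in sorted character order; the ordering used by an actual char-rnn run may differ, and the helper names are chosen for this example only.

```python
def build_vocab(text):
    """Map every distinct character to an integer id and back."""
    chars = sorted(set(text))
    char_to_id = {ch: i for i, ch in enumerate(chars)}
    id_to_char = {i: ch for ch, i in char_to_id.items()}
    return char_to_id, id_to_char

def encode(text, char_to_id):
    """Source data -> token ids (used for training)."""
    return [char_to_id[ch] for ch in text]

def decode(ids, id_to_char):
    """Token ids -> target data (used during generation)."""
    return "".join(id_to_char[i] for i in ids)

char_to_id, id_to_char = build_vocab("abcde")
print(encode("abc", char_to_id))      # [0, 1, 2]
print(decode([2, 3, 4], id_to_char))  # cde
```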
13. char-rnn
• RNN predicts character sequences based on training text
• by Andrej Karpathy
• https://github.com/karpathy/char-rnn
(Karpathy, 2015)
17. General Human Passwords Guessing
• Neural Networks outperform other methods at above 10^10 guesses
• (almost) infinite number of passwords
(Melicher et al., 2016)
19. Relevance
• most passwords have individual context
• individual details publicly available (OSINT)
• social media => harvester scripts
• website user tables => leaked database dumps
• …
exploit.in
20. Tailored Password Lists
training output
John2050
180374
09091958
06031982
160883
soni
John!
john!
j0hn.5m17h
john.smith
Smith866
asdfghj
John50
21. Data Protection Compliance
• EU-GDPR (General Data Protection Regulation)
• significant fines
• up to €20 million or 4% of worldwide annual revenue
• processing personal data requires consent
• password lists contain personal information
• using leaked data is illegal, even if it is publicly available
• imbalance
• info-sec researcher: has to comply & find (less ideal) alternatives
• attacker: ignores regulations & trains on best available data
22. Data Protection Compliance
• compliant solutions to collect data
• general passwords:
• use e.g. a top-100,000 passwords list => no personal details contained
• individual details + passwords:
• compliance based on "public interest"? (GDPR Art. 6 (1) (e))
• collect consent from users => requires broad access to user data
a) directly store & relate data until training is finished => requires password storage in plaintext (!!!)
b) only store tokenized password schemes without user relation => requires all relatable personal data to be known at password hashing time
23. Challenges
• generate password sequences ✓
• GDPR compliance ?
• recognize & relate individual structures ?
• How to relate personal data?
• same scheme, different character sequences
<first name><year of birth>!
John1985!, Jane1992!
• dealing with obfuscations ?
• e.g. Leetspeak, all upper/lower case
j0hn1985!, JOHN1985!, john1985!
24. Generating a Dataset Containing Individual Details
• starting point: any password leak that contains a personal identifier
• char-rnn requires > 50,000 entries for proper results
• e.g. exploit.in (797 million credentials): <email address>:<password>
• collect, match and attach personal details to entries
• e.g. using social media harvester
25. Generating a Dataset Containing Individual Details
• example result:
Gender  Username   First Name  Last Name  Year of Birth  Password
f       margarete  Judy        Wells      1972           Wells106
f       sondra     Lucia       Morrow     1950           cvbnm
f       zakia      Gale        Weiss      1999           syndikat
f       eada       Ana         Elliott    1994           Ana94
f       karalee    Denise      Hanson     1965           OLIVER
m       agatha     Edmond      Daniels    1956           Agatha
…
26. Password Schemes Used
• Random: random choice of top-X password list (e.g. 123456)
• Easy to Type: nearby characters on keyboard (e.g. qwerty)
• Username: use person's username (e.g. smithy)
• First Name + "!": use person's first name plus exclamation mark (e.g. John!)
• Lowercased First Name + "!": use person's lowercased first name plus exclamation mark (e.g. john!)
• Last Name + Random Int: use person's last name plus a three digit integer at the end (e.g. Smith758)
• Username Leetspeak: use person's username in Leetspeak (e.g. 5m17hy)
• First Name + Year of Birth (4 digits): use person's first name plus their year of birth (e.g. John1985)
• First Name + Year of Birth (2 digits): use person's first name plus their year of birth in two digits (e.g. John85)
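The schemes listed above can be written down as small functions that turn a fake person record into a password, which is one way to build a synthetic training set. This is a sketch under assumptions: the short top-password and keyboard-walk lists are placeholders, the Leetspeak mapping is inferred from the examples in the slides, and the three digit integer is assumed to lie between 100 and 999.

```python
import random

# Assumption: Leetspeak lowercases first, then maps a->4, e->3, i->1, l->1, o->0, s->5, t->7.
LEET = str.maketrans("aeilost", "4311057")
TOP_PASSWORDS = ["123456", "password", "12345678"]   # placeholder for a top-X list
KEYBOARD_WALKS = ["qwerty", "asdfgh", "zxcvbn"]      # placeholder "easy to type" strings

SCHEMES = {
    "Random": lambda p, rng: rng.choice(TOP_PASSWORDS),
    "Easy to Type": lambda p, rng: rng.choice(KEYBOARD_WALKS),
    "Username": lambda p, rng: p["username"],
    'First Name + "!"': lambda p, rng: p["first_name"] + "!",
    'Lowercased First Name + "!"': lambda p, rng: p["first_name"].lower() + "!",
    "Last Name + Random Int": lambda p, rng: f'{p["last_name"]}{rng.randint(100, 999)}',
    "Username Leetspeak": lambda p, rng: p["username"].lower().translate(LEET),
    "First Name + Year of Birth (4 digits)":
        lambda p, rng: f'{p["first_name"]}{p["year_of_birth"]}',
    "First Name + Year of Birth (2 digits)":
        lambda p, rng: f'{p["first_name"]}{p["year_of_birth"] % 100:02d}',
}

def make_password(person, rng):
    """Pick one scheme at random and apply it to a (fake) person record."""
    scheme = rng.choice(sorted(SCHEMES))
    return scheme, SCHEMES[scheme](person, rng)

person = {"username": "smithy", "first_name": "John",
          "last_name": "Smith", "year_of_birth": 1985}
rng = random.Random(0)
for _ in range(3):
    print(make_password(person, rng))
```

Choosing schemes uniformly is a simplification; a generator could just as well weight them to obtain a particular scheme distribution.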
27. Tokenization
• replace personal details with column id
• column id is just another character
• problem: exact matching fails to match obfuscations or abbreviations
• John != j0hn
• 1986 != 86
#  First Name  Year of Birth  Password  Resulting Password Tokens
1  Max         1983           Max1983!  <column: First Name>, <column: Year of Birth>, !
2  John        1986           John86!   <column: First Name>, 8, 6, !
3  Max         1987           123456    1, 2, 3, 4, 5, 6
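The replacement of personal details by column ids can be sketched as a left-to-right scan over the password. This is a minimal illustration with exact matching only; the string tokens such as "<column: First Name>" are used for readability, whereas a real tokenizer would reserve a single vocabulary entry per column.

```python
def tokenize(password, details):
    """Replace exact occurrences of personal details by column tokens.

    `details` maps a column name to that person's value, e.g.
    {"First Name": "Max", "Year of Birth": "1983"}.
    """
    # Try longer values first so a long detail is not shadowed by a shorter one.
    columns = sorted(details.items(), key=lambda item: len(item[1]), reverse=True)
    tokens, i = [], 0
    while i < len(password):
        for name, value in columns:
            if value and password.startswith(value, i):
                tokens.append(f"<column: {name}>")
                i += len(value)
                break
        else:
            tokens.append(password[i])  # no detail matches here: keep the character
            i += 1
    return tokens

print(tokenize("Max1983!", {"First Name": "Max", "Year of Birth": "1983"}))
print(tokenize("John86!", {"First Name": "John", "Year of Birth": "1986"}))
print(tokenize("123456", {"First Name": "Max", "Year of Birth": "1987"}))
```

The second call reproduces the problem named on the slide: "86" is not recognized as the year of birth because only the full value "1986" is compared.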
28. Support Matching Using Data Variations
• add on-the-fly word mangling rules to columns
• Leetspeak
• lowercase
• uppercase
• …
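• original row:
Gender  Username  First Name  Last Name
f       tania     Kara        Rosales
…
• row with variations (original, Leetspeak, lowercase, uppercase):
Gender      f        f        f        F
Username    tania    74n14    tania    TANIA
First Name  Kara     k4r4     kara     KARA
Last Name   Rosales  r054135  rosales  ROSALES
…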
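Expanding each column with mangled variations can be sketched as follows. The Leetspeak mapping is an assumption inferred from the examples in the slides (lowercase, then a->4, e->3, i->1, l->1, o->0, s->5, t->7), and the function names are chosen for this example.

```python
# Assumption: Leetspeak mapping inferred from examples such as tania -> 74n14.
LEET = str.maketrans("aeilost", "4311057")

def variations(value):
    """Return the original value plus its mangled forms, keyed by rule name."""
    return {
        "original": value,
        "leetspeak": value.lower().translate(LEET),
        "lowercase": value.lower(),
        "uppercase": value.upper(),
    }

def matching_variation(password, value):
    """Name of the first variation of `value` that the password starts with, if any."""
    for rule, text in variations(value).items():
        if password.startswith(text):
            return rule
    return None

row = {"Gender": "f", "Username": "tania", "First Name": "Kara", "Last Name": "Rosales"}
expanded = {column: variations(value) for column, value in row.items()}
print(expanded["Last Name"])

for password in ["John1985!", "j0hn1985!", "JOHN1985!", "john1985!"]:
    print(password, "->", matching_variation(password, "John"))
```

With the variations in place, obfuscated forms of a first name map back to the same column, so the scheme <first name><year of birth>! can be recognized regardless of case or Leetspeak.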
32. Attacking the Target
• collect data about victim & generate dataset
• use trained model to generate a tailored password list
• quality of list depends heavily on
• selected training data
• hyperparameter configuration
Gender Username First Name Last Name Year of Birth
m john.smith John Smith 2050
34. Scheme Adoption
• input:
Gender  Username    First Name  Last Name  Year of Birth
m       john.smith  John        Smith      2050
• generated password list, annotated with the scheme each entry follows:
John2050    => First Name + Year of Birth (4 digits): learned
180374
09091958
06031982
160883
soni        => Random: stochastic character generation (mostly human dates)
John!
John!       => First Name + "!": learned (duplicate because of few available rules)
[skipped until line 14]
john!       => Lowercased First Name + "!": learned using word mangling
[skipped until line 23]
j0hn.5m17h  => Username Leetspeak: learned using word mangling
[skipped until line 30]
john.smith  => Username: learned
[skipped until line 80]
Smith866    => Last Name + Random Int: partially learned + stochastic generation
[skipped until line 85]
asdfghj     => Easy to Type: learned
[skipped until line 514]
John50      => First Name + Year of Birth (2 digits): partially learned + stochastic generation
[...]
35. Proving Password Scheme Adoption
1. use new fake dataset with same schemes
2. loop through each entry and generate an individual password list (1000 entries)
3. check if password is on that list
Gender  Username   First Name  Last Name  Year of Birth  Password
f       margarete  Judy        Wells      1972           Wells106
=> is the password on the generated list?
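The three steps above amount to measuring a hit rate. The sketch below shows the loop; `toy_generator` is a hypothetical stand-in for the trained model (it simply applies a few fixed schemes), and the field names mirror the example table.

```python
def hit_rate(entries, generate_list, list_size=1000):
    """Share of entries whose real password appears in their tailored list."""
    hits = 0
    for entry in entries:
        # The generator only sees the personal details, never the password.
        details = {key: value for key, value in entry.items() if key != "Password"}
        candidates = generate_list(details, list_size)[:list_size]
        if entry["Password"] in candidates:
            hits += 1
    return hits / len(entries)

def toy_generator(details, list_size):
    """Hypothetical stand-in for the trained model: applies a few fixed schemes."""
    first, year = details["First Name"], details["Year of Birth"]
    guesses = [f"{first}{year}", f"{first}{year[-2:]}", f"{first}!",
               f"{first.lower()}!", details["Username"]]
    return guesses[:list_size]

entries = [
    {"Gender": "f", "Username": "margarete", "First Name": "Judy",
     "Last Name": "Wells", "Year of Birth": "1972", "Password": "Wells106"},
    {"Gender": "f", "Username": "eada", "First Name": "Ana",
     "Last Name": "Elliott", "Year of Birth": "1994", "Password": "Ana94"},
]
print(hit_rate(entries, toy_generator))
```

Calling the function with a smaller `list_size` shows how many passwords are already matched near the top of each list.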
36. Results
• 6 models with different configurations
• all models match about 70% of passwords within lists of only ~100 lines
• optimized configurations increase matching efficiency
• recreated distributions of schemes
38. Mitigation Strategies
• generating own model and checking user's password against generated lists
• attacker's model and dataset not available => password lists will differ
• long or complex passwords
• passwords might still be guessed if they contain personal information
• e.g. JohnSmith1985 is actually <column: firstname><column: lastname><column: year of birth>
• treating all human-like passwords as insecure
• requires classifying how likely a password is human-generated
39. Human Password Classification
• using machine learning to classify human likeliness
• dataset (80k human + 80k machine labeled passwords)
• classifiers
• Logistic Regression
• Multinomial Naïve Bayes
• Linear Support Vector Machine
• Random Forest
• vectorizers
• TFIDF
• Count
&CtAEaCp?b&v"s% m
-SUuf4TLtF m
mallrats h
bP0.}BO/L&{: m
^=c.rgH$z m
boxers h
j&uzHCutff_A{ m
656565 h
6>IB|~@4^n}K m
forever1 h
…
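As an illustration of the classification idea, the sketch below implements a small Multinomial Naive Bayes over character counts in plain Python and trains it on the ten labeled samples shown above. It is a from-scratch toy written for this example, not the published classifier, and ten samples are far too few for reliable predictions on unseen passwords.

```python
import math
from collections import Counter

class CharNaiveBayes:
    """Multinomial Naive Bayes over single-character counts, add-one smoothing."""

    def fit(self, passwords, labels):
        self.label_counts = Counter(labels)  # number of passwords per label
        self.char_counts = {label: Counter() for label in self.label_counts}
        for password, label in zip(passwords, labels):
            self.char_counts[label].update(password)
        self.vocab = set().union(*self.char_counts.values())
        return self

    def _log_score(self, password, label):
        counts = self.char_counts[label]
        denom = sum(counts.values()) + len(self.vocab)
        score = math.log(self.label_counts[label] / sum(self.label_counts.values()))
        for ch in password:
            if ch in self.vocab:  # characters never seen in training are ignored
                score += math.log((counts[ch] + 1) / denom)
        return score

    def predict(self, password):
        return max(self.label_counts, key=lambda label: self._log_score(password, label))

# Labeled samples from the slide: m = machine-generated, h = human-generated.
SAMPLES = [
    ('&CtAEaCp?b&v"s%', "m"), ("-SUuf4TLtF", "m"), ("mallrats", "h"),
    ("bP0.}BO/L&{:", "m"), ("^=c.rgH$z", "m"), ("boxers", "h"),
    ("j&uzHCutff_A{", "m"), ("656565", "h"), ("6>IB|~@4^n}K", "m"),
    ("forever1", "h"),
]
clf = CharNaiveBayes().fit([p for p, _ in SAMPLES], [label for _, label in SAMPLES])
for password in ["mallrats", "656565", "^=c.rgH$z"]:
    print(password, "->", clf.predict(password))
```

A realistic setup would train on the full 80k + 80k dataset and compare the listed classifiers and vectorizers rather than rely on single-character counts.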
43. Conclusion
• machine learning can be used to efficiently attack passwords created by humans
• mitigation
• treat human passwords as insecure
• warn users or provide password policy
• use machine learning model to identify human passwords
• integrate on web servers & password storage services
44. Resources
• Thesis "Machine Learning-driven Password Lists":
• https://www.researchgate.net/publication/328719001_Machine_Learning-driven_Password_Lists
• Human Password Classifier:
• https://github.com/georgknabl/human-password-classifier
• ready-to-use trained models available via e-mail
45.
"The only secure password is the one you can't remember."
Troy Hunt (haveibeenpwned.com)
46. Contact
DI (FH) Georg Knabl, MSc
IT-Consultant & Software Engineer
georg.knabl@pageonstage.at
47. Sources
• Dunning, Julian (2016). Statistics Will Crack Your Password. Available from: https://p16.praetorian.com/blog/statistics-will-crack-yourpassword-mask-structure [Mar. 3, 2018]
• Karpathy, Andrej (2015). The Unreasonable Effectiveness of Recurrent Neural Networks. Available from: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ [Nov. 10, 2017]
• Melicher, William, Blase Ur, Sean M. Segreti, Saranga Komanduri, Lujo Bauer, Nicolas Christin, and Lorrie Faith Cranor (2016). "Fast, Lean, and Accurate: Modeling Password Guessability Using Neural Networks". In: 25th USENIX Security Symposium (USENIX Security 16). Austin, TX: USENIX Association, pp. 175–191.
• Olah, Christopher (2015). Understanding LSTM Networks. Available from: http://colah.github.io/posts/2015-08-Understanding-LSTMs/ [Nov. 10, 2017]