Presentation for PyDataDC 2016
You've got data. Lots of it. You might not realize it, but people want to get their hands on that data. You probably don't want that, so let's go over a few things you can do to dissuade attackers from getting their grubby mitts on your hard-processed datastore. We'll cover the obvious things (spoiler alert: encryption) and then move on to some advanced techniques for keeping your data secure while still keeping it usable (that is to say, analyzable).
Eat Your Vegetables - Data Security for Data Scientists
1. 1
Eat Your Vegetables
Data Security for Data Scientists
Welcome to Eat Your Vegetables!
Hope you're having a great conference so far
2. 2
Agenda
1. Agenda
2. Intro
3. Convincing Time
4. Security Concepts
5. Tools
6. Questions
The agenda - self referential, eh?
Intro stuff
Why this is important (if you're not convinced already)
Some basic tips for security - NOT definitions
Tools to make security easier
Time for questions at the end
Slides will be online
3. 3
Name:
Will Voorhees
Occupation:
Software Engineer
Favorite Color:
Who's this guy?! - Your SPEAKER
Pic from Halloween 2010
Been doing tech for 15 years, but real software development for about 5
Currently working in security org creating enterprise security tools
4. Orange
Twitter:
@will2041
Why are we here?
Remember this bit from Red vs Blue?
To talk about vegetables!
Vegetables are good for you
And by "vegetable", I mean security
You've got to EAT YOUR VEGETABLES, just like mom said
Vegetables can be tasty
5. 45
Someone Wants Your Data
No, seriously. Someone wants your data. This is the premise for everything
"Well duh, my data is awesome!"
Attackers are interested in all kinds of data for all kinds of reasons
Even Pokemon Go accounts have value
It's not always monetary - 2015 Ashley Madison as an example
"Hacktivism"
6. 6 . 1
Why should I care?
Obvious stuff - money
About a year ago, insurance company Lloyd's estimated $400 bil/year lost to hacking
So what other reasons do I have?
7. 6 . 2
HIPAA
SOX
Trade Sanctions
(Government) contracts
Etc.
6 . 3
Analytical Confidence
Encryption/signing can provide a "no one touched this" guarantee
Nice out-of-the-box benefit of adding security
Nothing like re-running a model on data that's changed and freaking out
8. 6 . 4
Fun!
You're kidding...
Puzzles!
Red Team vs Blue Team competition
Caesar cipher
Some of the math is interesting
9. 6 . 5
You got me!
That was filler.
I'm sorry
They're valid reasons, they're just not the most important reason
10. 6 . 6
Trust
It's what's for dinner.
We're data stewards - everyone trusts us with data
Doesn't matter what data you have, someone trusts you with it
We are ultimately responsible for our data
Magic Information Security elves won't save us
11. 6 . 7
Trust Has a Cost
Governments lose national security - OPM (Office of Personnel Management), IRS
e-commerce sites lose sales
Remember that money thing? I lied! Journal of Cyber Security says a breach costs as much as the defense
It's cheaper to get hacked
So maybe it's not about money...
12. 6 . 8
Human Trust
Trust of people isn't as easily quantified
Target, Ashley Madison still in business - But what's the impact?
This is all very murky - needs more research
In absence of data, do what's right
13. 7
What can I do?
Good news: there's some easy stuff
Bad news: there's some really hard stuff
Easy stuff is pretty easy, once you learn it
Hard stuff is really hard, even after you learn it
Think Heartbleed, the 2014 OpenSSL bug
Buffer over-read bug let attackers read chunks of server memory
14. 8
Patching can proactively save your butt
Doing it often means you know how to do it quickly
Quick response can be really important - think Heartbleed or Shellshock
9 . 1
The Easy Stuff
I claim that the easy stuff is pretty easy!
Sad fact: doing the basics makes you better than a lot of companies
15. 9 . 2
Access Control
Don't leave things open to the world!
Some restriction is better than no restriction
Accounts can only do certain things
This is what keeps your intern from deleting your data lake
Let's use Nissan as an example
Leaf completely open to internet for physical control
Minimum: bad PR. Maximum: loss of life
16. 9 . 3
TLS
a.k.a. SSL
TLS replaced SSL
Everyone still calls TLS "SSL"
What - authentication and encryption on connections
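As a rough illustration of "authentication and encryption on connections", here is a minimal sketch using Python's standard `ssl` module. The helper name `make_client_context` is made up for this example; the calls themselves are standard library.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client-side TLS context that authenticates the server."""
    # create_default_context() loads the system CA certificates and turns on
    # certificate validation plus hostname checking.
    context = ssl.create_default_context()
    # Refuse the old SSL and early TLS protocol versions.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = make_client_context()
```

A context like this can be handed to `http.client.HTTPSConnection(host, context=context)` or used with `context.wrap_socket(sock, server_hostname=host)`, so the peer is verified before any data moves.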
17. 9 . 4
TLS Myths
Let's encrypt - https://letsencrypt.org
Cost
It's 2016 and Let's Encrypt is a thing
Performance impact is negligible
Gmail to SSL -> No special tuning, less than 1% CPU and ms of latency
18. 9 . 5
Account Separation
Yes, that's actually a fruit, but I started getting desperate
Your backup user doesn't need write access to your master DB
Whole companies have been lost because they used one account (Code Spaces)
Minimize blast radius
Backups on a separate account!
19. 9 . 6
Short Lived Credentials
Just like passwords, you need to rotate your keys
Limit blast radius
STS hands out temporary credentials
Short lived because...
21. 9 . 8
Scanners grab creds and spin up instances for Bitcoin mining, etc.
Short-lived creds limit the blast radius
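A sketch of what asking STS for temporary credentials might look like. It assumes a client that behaves like `boto3.client("sts")`, whose `get_session_token` call returns a `Credentials` dict with an `Expiration` timestamp. The `FakeSTSClient` class and the helper names are hypothetical and exist only so the example runs offline.

```python
from datetime import datetime, timedelta, timezone


class FakeSTSClient:
    """Offline stand-in (hypothetical) mimicking the STS response shape."""

    def get_session_token(self, DurationSeconds):
        return {
            "Credentials": {
                "AccessKeyId": "ASIAEXAMPLE",
                "SecretAccessKey": "example-secret",
                "SessionToken": "example-token",
                "Expiration": datetime.now(timezone.utc)
                + timedelta(seconds=DurationSeconds),
            }
        }


def fetch_temporary_credentials(sts_client, lifetime_seconds=900):
    """Ask STS for credentials that expire on their own."""
    response = sts_client.get_session_token(DurationSeconds=lifetime_seconds)
    creds = response["Credentials"]
    return {
        "access_key": creds["AccessKeyId"],
        "secret_key": creds["SecretAccessKey"],
        "session_token": creds["SessionToken"],
        "expires_at": creds["Expiration"],
    }


def needs_refresh(credentials, margin=timedelta(minutes=1), now=None):
    """True once the credentials are within `margin` of expiring."""
    now = now or datetime.now(timezone.utc)
    return now >= credentials["expires_at"] - margin


creds = fetch_temporary_credentials(FakeSTSClient(), lifetime_seconds=900)
```

If a scanner does grab these credentials, they stop working after fifteen minutes, which is the whole point of keeping the lifetime short.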
10
Hey, that's just general security stuff!
What about big data?!
Take a Breather
All that stuff gates access to the data
Even if you do nothing else, this is your first line of defense
But yes, let's talk about data
22. 11 . 1
Signatures
Provides that "no one messed with this" guarantee
Teased with "Analytical Confidence"
Complement to encryption
23. Signing vs hashing
Signing proves identity - hacker can just update a hash with the data
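The difference between a plain hash and a signature can be sketched with the standard library. This example uses HMAC, a keyed message authentication code, as a simple stand-in for signing; a public-key signature would serve the same role. The key and the sample data are made up.

```python
import hashlib
import hmac

# Hypothetical secret, kept somewhere the data is not.
SIGNING_KEY = b"hypothetical-secret-kept-away-from-the-data"


def plain_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def sign(data: bytes, key: bytes = SIGNING_KEY) -> str:
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify(data: bytes, signature: str, key: bytes = SIGNING_KEY) -> bool:
    # compare_digest avoids leaking information through timing.
    return hmac.compare_digest(sign(data, key), signature)


original = b"user_id,score\n1,0.93\n"
tampered = b"user_id,score\n1,0.12\n"

# An attacker who can rewrite the data can rewrite a plain hash to match.
forged_hash = plain_hash(tampered)
hash_check_passes = plain_hash(tampered) == forged_hash

# Without the key, the attacker cannot produce a matching signature.
signature = sign(original)
forged_signature = sign(tampered, key=b"attacker-guess")
```

Here `hash_check_passes` is true even though the data changed, while `verify(tampered, signature)` and `verify(tampered, forged_signature)` both fail.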
11 . 2
Encryption
But not really...
Cryptography Engineering: Design Principles and Practical Applications
First thing people think of
Not an "End All Be All"
People think slapping on encryption solves all the security issues
Really hard to get right
Confidentiality vs integrity
AES - some modes provide integrity, others don't
Waaay more to this than I can cover - Google or book
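To make "some modes provide integrity" concrete, here is a small sketch using AES-GCM from the `cryptography` package, one of the modes that authenticates the ciphertext. The sample plaintext and the helper `decrypt_or_none` are invented for illustration.

```python
import os

from cryptography.exceptions import InvalidTag
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)
aesgcm = AESGCM(key)
nonce = os.urandom(12)  # must never repeat for the same key

plaintext = b"account=1234;balance=100"
ciphertext = aesgcm.encrypt(nonce, plaintext, None)

# Flip one bit to simulate tampering in storage or in transit.
tampered = bytes([ciphertext[0] ^ 0x01]) + ciphertext[1:]


def decrypt_or_none(blob: bytes):
    """Return the plaintext, or None if the authentication tag does not match."""
    try:
        return aesgcm.decrypt(nonce, blob, None)
    except InvalidTag:
        return None
```

With an authenticated mode the tampered blob is rejected outright. A mode without integrity (CBC on its own, for example) would decrypt it to garbage without complaint.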
24. But there are other challenges...
11 . 3
Encrypted data is a pain
It's always going to be slower
Some tools just freak out at the thought - Bye bye grep
So we really want to work with unencrypted data
But for minimal time and only in certain places
Tooling can help with this - but it requires effort
Callback to companies going out of business, etc.
25. 11 . 4
Key management is a pain
More data = more keys
New data should use a different key
A leaked key doesn't reveal all your data
Now you have many keys to manage
Keep keys somewhere else!
Reference to key used to encrypt should be kept with data
S3 metadata can keep a key reference
Key serials - don't forget 'em
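One way the points above might fit together, as a sketch: each record gets its own Fernet key, the key lives in a separate store, and only a key ID travels with the data. The in-memory `key_store` dict stands in for a real key store, and the function names are hypothetical.

```python
import uuid

from cryptography.fernet import Fernet

# Stand-in for a key store that lives somewhere other than the data.
key_store = {}


def encrypt_record(plaintext: bytes) -> dict:
    """Encrypt with a fresh key and keep only a reference to it with the data."""
    key_id = str(uuid.uuid4())
    key = Fernet.generate_key()
    key_store[key_id] = key
    return {"key_id": key_id, "ciphertext": Fernet(key).encrypt(plaintext)}


def decrypt_record(record: dict) -> bytes:
    """Look the key up by its ID, then decrypt."""
    key = key_store[record["key_id"]]
    return Fernet(key).decrypt(record["ciphertext"])


first = encrypt_record(b"2016-10-07 batch")
second = encrypt_record(b"2016-10-08 batch")
```

On S3, the `key_id` could plausibly ride along as object metadata (the `Metadata` argument of boto3's `put_object`), so a leaked key exposes one object and not the whole bucket.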
26. 11 . 5
Bummed out yet?
Yes, I know that's not a vegetable
It's got the "vegetable" tag on Flickr... so remember the importance of correct tagging
27. 11 . 6
Tips
Decrypt, but be safe
Split it up
Work with metadata
Use the tools...
Not all doom and gloom
Minimize the amount of time data is unencrypted
When actually working with it, keep it somewhere safe
You don't need all data for everything. Split things up
If you don't need the actual data, just work with metadata and avoid encryption altogether
Speaking of tools...
28. 12 . 1
Tools!
You're not alone - lots of people care about security
There are low level libraries and high level tools
Every day, new tools that make security easier are being developed
Do NOT roll your own crypto
29. 12 . 2
High Level
JWT
python-jose
https://jwt.io/
https://github.com/mpdavis/python-jose
Start at the top of the stack
JWT = JSON Web Token
JOSE = JSON Object Signing and Encryption
Does signing, but no encryption
Can be used for powerful web AuthN/AuthZ
Encryption via TLS for connections
30. 12 . 3
JOSE
from jose import jws
signed = jws.sign({'a': 'b'}, 'secret', algorithm='HS256')
>>> 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJhIjoiYiJ9.jiMyrsmD8AoHWeQgmxZ5yq8z0lXS67_
Signed JSON
31. 12 . 4
Low Level
PyCrypto
PyOpenSSL
https://github.com/dlitz/pycrypto
https://github.com/pyca/pyopenssl
Taking a step down the stack here...
Both provide low-level crypto operations
You have the POWER!
32. 12 . 5
Too scary!
cryptography https://github.com/pyca/cryptography
OK, maybe we went too far down the stack
Most of us don't need low level primitives
By the same folks doing PyOpenSSL
Goal is to have human friendly crypto
33. 12 . 6
Example
from cryptography.fernet import Fernet
key = Fernet.generate_key()
f = Fernet(key)
token = f.encrypt(b"My giant binary blob")
f.decrypt(token)
Data encryption and decryption is easy!
35. 12 . 8
Oh, wait a second...
import subprocess


class Payload(object):
    """ Executes /bin/ls when unpickled. """
    def __reduce__(self):
        """ Run /bin/ls on the remote machine. """
        return (subprocess.Popen, (('/bin/ls',),))
Example from Travis Cunningham
Unpickling executes code on box
36. 12 . 9
Mitigations
Sign your pickles
Secure transfer
Don't pickle
Sign and verify pickles before unpickling
Trusted endpoints that allow changes in between don't work
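"Sign and verify pickles before unpickling" can be sketched with the standard library alone. The shared key and the function names are assumptions for illustration. The important detail is that `pickle.loads` is never reached unless the signature checks out.

```python
import hashlib
import hmac
import pickle

# Hypothetical shared secret known only to the sender and the receiver.
PICKLE_KEY = b"hypothetical-shared-secret"
DIGEST_SIZE = hashlib.sha256().digest_size


def dumps_signed(obj, key=PICKLE_KEY) -> bytes:
    """Pickle an object and prepend an HMAC-SHA256 tag."""
    payload = pickle.dumps(obj)
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload


def loads_signed(blob: bytes, key=PICKLE_KEY):
    """Verify the tag first; unpickle only if it matches."""
    tag, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(payload)


blob = dumps_signed({"model": "v1", "weights": [0.1, 0.2]})
```

This only helps if the key stays secret and both ends are trusted; anyone holding the key can still sign a malicious pickle.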
37. 13 . 1
Providers
Changing topics entirely here...
All that is great, but why do it when someone can provide it for you?!
Terribly biased to AWS, so we're going to focus on that, but a lot of this applies to any provider
38. 13 . 2
Server Side Encryption
S3 lets you do server side encryption
Can have bucket policy to enforce
Prevents a data leak from revealing everything
Great for regulatory compliance, but trusts Amazon wholly
Although they are trustworthy
More than 50% of IT professionals don't fully trust providers to not leak data
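The bucket policy mentioned above might look roughly like this. The sketch builds a policy document that denies any upload missing the server-side-encryption header; the bucket name, the `Sid`, and the helper name are placeholders.

```python
import json


def require_sse_policy(bucket_name: str) -> str:
    """Return a bucket policy (as JSON) that rejects unencrypted uploads."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                # Matches requests where the encryption header is absent.
                "Condition": {
                    "Null": {"s3:x-amz-server-side-encryption": "true"}
                },
            }
        ],
    }
    return json.dumps(policy, indent=2)


policy_json = require_sse_policy("example-bucket")
```

With boto3, a document like this would typically be applied through `put_bucket_policy(Bucket=..., Policy=policy_json)`, and uploads would then pass `ServerSideEncryption="AES256"` to `put_object`.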
If you're paranoid...
39. 13 . 3
Client Side Encryption
s3-encryption https://github.com/boldfield/s3-encryption
You encrypt things before sending to storage
See AmazonS3EncryptionClient
Key idea: you can add security to existing libraries! boto + cryptography = cool
Transparent/easy security gets people onboard
40. 13 . 4
Speaking of keys...
Key Management Service
You want keys, KMS gives you keys
Makes Amazon manage all the keys you're using for encryption
KMS keeps the key and limits access as you see fit
Here again, Java SDK has a full feature set to emulate
Direct envelope (key+data) encryption method
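Envelope encryption can be sketched as follows. The code assumes a client shaped like `boto3.client("kms")`, using its `generate_data_key` and `decrypt` calls, and uses Fernet for the local encryption step. `FakeKMSClient`, the key alias, and the function names are hypothetical and only there so the sketch runs without AWS.

```python
import base64
import os

from cryptography.fernet import Fernet


class FakeKMSClient:
    """Offline stand-in (hypothetical) mimicking two KMS calls."""

    def __init__(self):
        self._wrapped = {}

    def generate_data_key(self, KeyId, KeySpec):
        plaintext_key = os.urandom(32)
        blob = os.urandom(16)  # pretend this is the key encrypted under KeyId
        self._wrapped[blob] = plaintext_key
        return {"Plaintext": plaintext_key, "CiphertextBlob": blob, "KeyId": KeyId}

    def decrypt(self, CiphertextBlob):
        return {"Plaintext": self._wrapped[CiphertextBlob]}


def envelope_encrypt(kms_client, master_key_id: str, plaintext: bytes) -> dict:
    # KMS returns the data key twice: in the clear and wrapped by the master key.
    data_key = kms_client.generate_data_key(KeyId=master_key_id, KeySpec="AES_256")
    fernet_key = base64.urlsafe_b64encode(data_key["Plaintext"])
    # Store only the wrapped key next to the data; drop the plaintext key.
    return {
        "encrypted_key": data_key["CiphertextBlob"],
        "ciphertext": Fernet(fernet_key).encrypt(plaintext),
    }


def envelope_decrypt(kms_client, envelope: dict) -> bytes:
    response = kms_client.decrypt(CiphertextBlob=envelope["encrypted_key"])
    fernet_key = base64.urlsafe_b64encode(response["Plaintext"])
    return Fernet(fernet_key).decrypt(envelope["ciphertext"])


kms = FakeKMSClient()
envelope = envelope_encrypt(kms, "alias/example-master-key", b"quarterly numbers")
```

The data itself never goes to KMS; only the small data key does, and access to the master key is what gates decryption.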
41. 14
Conclusion!
Security is important! Trust is priceless.
Do the basics - they are better than nothing
Python has lots of security tools
Providers can help
Thank goodness he's done...
42. 15
Thanks/Promo Time
You fine people
District Data Labs - http://www.districtdatalabs.com
My first time speaking, so thanks for being my inaugural audience
Feedback welcome!
DDL is DC based data science research group
Come see what we're up to!
Talks in this room at 11:30, 1:15, and 3:00 tomorrow