As machine learning moves from niche to mainstream tech stacks, how do DevOps engineers prepare for a very different set of problems? A brief look at the new issues that arise from machine learning, an overview of cutting-edge "old school" solutions, and how to drag data science (kicking and screaming) into a world of automation.
Video: https://www.youtube.com/watch?v=KHxZCRajRiA
Join DevOps Exchange London here: http://meetup.com/DevOps-Exchange-London/
Follow DOXLON on Twitter: http://www.twitter.com/doxlon
4. Ravelin
Fraud Detection. Ravelin examines your visitor and
payment data in real time, telling your systems
which customers are fraudsters. We use Machine
Learning, Rule Engines, Graph Networks and
Industry Expertise to respond with scores in
milliseconds. Perfect for an on-demand world.
Raised $2m last year. Fintech. Hiring
12. Stack
Go + Python
AWS
MicroServices
Storage: Cassandra, Postgres, ElasticSearch, Redis, Graph Database X, ZooKeeper
Queue: NSQ, Kinesis
Instrumentation: InfluxDB, Grafana
Docker - but only for local dev
13. Doing Things The Right Way
Terraform
100% Automation
Horizontally Scalable
Continuous Integration
No need for SSH access
100% Visibility - Metrics & Logs
23. We ♡ BigQuery
Costs - $5 per terabyte, 5c per range query per terabyte
Managed, with no reserved compute resources needed!
Distributed columns easily append
Dataflow
Restrictions:
Can’t Update
No Indexes
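Because BigQuery charges for queries rather than for a running cluster, its cost is easy to reason about. Below is a minimal sketch, assuming the slide's $5 per terabyte applies to bytes scanned and that a query reads only the columns it references (BigQuery stores data by column). The table, column names and sizes are hypothetical, and 1 TB is taken as 10^12 bytes for simplicity.

```python
# Sketch: estimate a BigQuery on-demand query cost from the columns it reads.
# Assumptions (not from the talk): billing is per byte scanned at the slide's
# $5/TB rate, and only referenced columns are scanned. Sizes are made up.

PRICE_PER_TB_USD = 5.0   # rate quoted on the slide
BYTES_PER_TB = 10**12    # simplification: decimal terabytes


def query_cost_usd(column_bytes: dict, columns_read: list) -> float:
    """Estimated cost of a query that reads `columns_read` in full."""
    scanned = sum(column_bytes[c] for c in set(columns_read))  # each column counted once
    return scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Hypothetical 3 TB payments table: the raw request blob dominates its size.
payments = {
    "customer_id": 200 * 10**9,
    "amount": 100 * 10**9,
    "raw_request": 2_700 * 10**9,
}

print(query_cost_usd(payments, ["customer_id", "amount"]))  # scans 0.3 TB
print(query_cost_usd(payments, list(payments)))             # SELECT *: scans 3 TB
```

Under these assumptions the lever is which columns you touch: with no indexes, filtering rows does not shrink the scan, so `SELECT *` over a wide table is what gets expensive.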
27. Work on the Cloud!
“Stephen’s laptop was measurably heavier because of the amount
of data he had on it. We asked him nicely to move everything to
the cloud and now the internet is a little heavier” - Science 2016
28. Data
“Single point of success” - Jose, CTO, Hailo, 2014
AWS: 32 cores, 244GB RAM
Google Cloud Platform: 32 cores, 208GB RAM
Azure: 16 cores, 112GB RAM
31. Hardware - GPUs
Specific for Deep Learning
AWS have a GPU machine but $$$
No virtualization
Buy and build your own server
Q. How Deep is your problem?
Speech, Video, Images
Machine learning is becoming commonplace.
SaaS solutions are coming out.
The challenges have changed; I’ll try to cover some of the solutions my company has come up with.
But first, who am I and what authority do I have to speak about DevOps or machine learning?
Confession: I’m not really a full-time DevOps engineer, but when we founded the company I had to do a little bit of everything.
I had help, but I learnt a lot.
So what do we do?
Fraud detection: lots and lots of data, in real time, using lots of tools.
ML, rule engines, graph networks, etc.
We raised $2m and are growing rapidly. For those of you who like buzzwords, we are “fintech”, and we are hiring (I’ll come back to that later).
So, fraud, you say? What do you mean?
How big of a problem is it really?
How much data are you talking about?
How can this be? I’ve never lost money to fraud? Merchants do; in fact, they lose most of the figure above.
I’m sure 90% of people in this room have had a transaction declined or a card blocked because they were overseas, or for some completely unknown reason. That is because of fraud.
Chip and PIN stopped fraud, right? Sure, kind of, but that 14B just moved online.
The USA has only moved to chip this year (but not the PIN).
Detecting fraud is hard: there is an N hidden in all of the Ms.
Most startups out there sign up customers and have drop-outs. Not the hockey stick of Silicon Valley: they are constantly messing around with their funnel to find those evangelical customers.
Fraudsters look like your best customer ever, they sign up and start spending money - lots of it!
The cost of one fraudster to your business could be as much as...
100 real customers (depending on your margins).
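The arithmetic behind that claim can be sketched as follows, assuming a fraudulent order is a total loss of its value (goods shipped, payment charged back) while an honest order of the same value earns only its margin. The function name is illustrative, and fees are ignored.

```python
# Sketch of the "one fraudster = N real customers" arithmetic.
# Assumptions: a fraudulent order loses its full value; an honest order of the
# same value earns value * margin. Chargeback fees and other costs are ignored.


def customers_to_cover_one_fraudster(order_value: float, margin: float) -> float:
    """Honest orders of the same value needed to earn back one fraudulent order."""
    if not 0 < margin <= 1:
        raise ValueError("margin must be in (0, 1]")
    loss = order_value                      # the whole order is written off
    profit_per_customer = order_value * margin
    return loss / profit_per_customer       # simplifies to 1 / margin


for margin in (0.01, 0.05, 0.20):
    n = customers_to_cover_one_fraudster(100.0, margin)
    print(f"margin {margin:.0%}: {n:.0f} customers")
```

At a 1% margin this gives the 100 customers quoted above; chargeback fees would only push the number higher.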
So how do you stop fraudsters?
Glad you asked.
It’s a good job we have cutting-edge companies like Visa, Mastercard, Amex and all those trustworthy banks on it.
With all their might, their solution is...
3D Secure!
I’m sure 100% of people in this room have seen this page before.
Awesome, no more fraud. I mean, fraudsters don’t know your secret password, so job done.
For those who have been looking for the N, it is... here...
So Ravelin might as well pack up and go home; the banks have solved the issue for all of us.
I mean, everyone remembers that random password you set up once, 18 months ago, whilst trying to buy a stupid wedding-list gift. A password you only use every 3 months.
The problem is, neither can you, nor any other customer! So conversion drops...
Typically by 20-25% on websites.
I’ve spoken to a merchant who experiences a 50% drop-out rate on mobile!
So that’s the history. So what tools are we using that the banks are not (except Mondo, of course)?
Go: it’s a binary; you compile it and you put it somewhere.
DevOps need to get on this; life is better.
We’ve had to make room for Python: basically, the machine learning libraries are all in Python.
AWS - obviously
Microservices - obviously because we are a startup and we are cool
Storage: lots of different databases for specific needs. The right DB for the job.
Instrumentation, guys: do it! It is so useful. If something goes wrong, it is the first thing we look at.
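Since InfluxDB is the metrics store in the stack above, here is a minimal sketch of the first step of emitting a metric: formatting a point in InfluxDB's line protocol. Actually sending the line to the server is left out, only simple field types are handled, and the measurement and tag names are hypothetical.

```python
# Sketch: format one metric point in InfluxDB line protocol:
#   measurement[,tag=value...] field=value[,field=value...] timestamp
# Hypothetical names; string fields are not supported in this sketch.
import time
from typing import Optional


def _escape(s: str) -> str:
    # Commas, equals signs and spaces delimit line protocol, so escape them
    # in tag keys, tag values and field keys.
    return s.replace(",", r"\,").replace("=", r"\=").replace(" ", r"\ ")


def _format_field(value: object) -> str:
    if isinstance(value, bool):        # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"             # integer fields carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    raise TypeError(f"unsupported field type: {type(value).__name__}")


def line_protocol(measurement: str, tags: dict, fields: dict,
                  ts_ns: Optional[int] = None) -> str:
    """Build one point with a nanosecond timestamp (defaults to now)."""
    if not fields:
        raise ValueError("line protocol needs at least one field")
    head = measurement.replace(",", r"\,").replace(" ", r"\ ")
    for key in sorted(tags):           # InfluxDB recommends sorting tags by key
        head += f",{_escape(key)}={_escape(tags[key])}"
    body = ",".join(f"{_escape(k)}={_format_field(v)}" for k, v in sorted(fields.items()))
    ts = time.time_ns() if ts_ns is None else ts_ns
    return f"{head} {body} {ts}"


print(line_protocol("score_latency",
                    {"service": "scorer", "region": "eu-west-1"},
                    {"ms": 12.5, "count": 3},
                    ts_ns=1_465_839_830_100_400_200))
# score_latency,region=eu-west-1,service=scorer count=3i,ms=12.5 1465839830100400200
```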
Docker: not a fan (you get most of the workflow improvements from Go anyway), but we do use it for local dev, which it is awesome at.
I could rant about Docker for 20 minutes, but I need to move on.
Terraform: so much better than CloudFormation. Infrastructure as code: big thumbs up!
100% automation: kill a box and it comes straight back up again. Spin up everything at the click of a button, right?
CI: always be building. We just moved to a mono-repo, but that is a talk for another time.
SSH: if someone SSHes into prod, alarms should be sounding. In this day and age you shouldn’t need access.
Metrics: guys, get on this. How many of you have metrics?
But seriously, a word about servers...
This is not a server (or microservice); it’s a puppy, aka a pet.
Never name your servers...
Treat them like livestock, with numbers.
Infrastructure is a working farm, not a house. Servers/services are livestock.
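The "livestock" idea can be illustrated with a toy reconcile loop: servers get numbers rather than names, and anything that disappears is replaced until the running count matches the desired count. This is a hypothetical in-memory simulation (the `Herd` class is invented for illustration and calls no cloud API); on AWS this job is commonly done by an Auto Scaling group, and the talk does not say which mechanism Ravelin uses.

```python
# Toy sketch of "cattle, not pets": numbered instances, and a reconcile step
# that replaces anything missing. Hypothetical; in-memory only, scale-down
# is not modelled.
import itertools


class Herd:
    def __init__(self, name: str, desired: int):
        self.name = name
        self.desired = desired
        self._ids = itertools.count(1)   # numbers, never names
        self.running = set()

    def _launch(self) -> str:
        instance = f"{self.name}-{next(self._ids):03d}"
        self.running.add(instance)
        return instance

    def reconcile(self) -> list:
        """Launch replacements until the running count matches `desired`."""
        launched = []
        while len(self.running) < self.desired:
            launched.append(self._launch())
        return launched

    def terminate(self, instance: str) -> None:
        self.running.discard(instance)   # e.g. the cloud provider kills a box


herd = Herd("worker", desired=3)
print(herd.reconcile())        # three numbered workers
herd.terminate("worker-002")
print(herd.reconcile())        # a fresh, differently numbered replacement
```

No instance is special: the replacement gets the next number, and nothing refers to `worker-002` by name.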
True story: we use ZooKeeper. It’s very important to us; it deals with service discovery and global locking.
Last Wednesday, AWS decided to terminate one of our ZK nodes, for no reason.
Those of you who have worked with ZK will know that as long as you have quorum (2 out of 3) you’ll be OK, and we were.
But not only that: the new ZK box came online and rejoined the cluster without any manual help. I mean, we were worried, and 3 of us in the office were looking at our metrics page for 30 minutes straight, but it just worked.
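The "2 out of 3" rule is ZooKeeper's majority quorum: the ensemble keeps serving as long as a strict majority of its voting servers is up. A small sketch of the arithmetic (function names are illustrative):

```python
# Majority-quorum arithmetic for an ensemble of n voting servers.


def quorum_size(n: int) -> int:
    """Smallest strict majority of n servers."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Servers that can fail while a majority is still available."""
    return n - quorum_size(n)


for n in (3, 4, 5):
    print(f"{n} servers: quorum {quorum_size(n)}, survives {tolerated_failures(n)} failure(s)")
```

This is why ensembles are usually odd-sized: a 4-server ensemble needs 3 votes and so tolerates no more failures than a 3-server one.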
So those are our DevOps beliefs at Ravelin. What I want to cover now is the specific issues with machine learning and how we solved them.
Let’s start with data warehousing.
What:
We have databases for operational needs, e.g. Postgres, Cassandra etc., for real-time requests from services.
We don’t want to unleash our data scientists and their unoptimised queries on them.
Why:
Exploratory work without impacting production databases on real data
Query anything - unlimited resources for long running queries
How much data:
Terabytes of data. If I had 100GB, I would be tempted to move to BQ (BigQuery).
I want to walk through the history of data warehousing: 15-20 years ago...
Per year.
Then, 10-15 years ago, it came down in price a lot.
I’m calling this v1.5.
And v2 is what I assume all of you guys are used to...
The licence is free, but devs, servers and consultants cost a pretty penny.
Anyone here a Hadoop contractor - bet you own a house in London.
Anyone guess how much v3 costs?
We had the ability and skill to build our own cluster, but:
No need to plan for capacity, because we have on-demand resources
Maintenance time
DevOps time
Only pay when you use it
Dataflow
One thing I would say is that we have a really good account manager at Google who was a huge help. If you guys are serious and have big data, ping me and I will personally introduce you.
We know ML can work on distributed systems; however, it is complex. E.g. some algorithms require super-fast network cards.
But the majority of algorithms are built for a single machine, and you can just throw loads of memory at them.
37signals etc.: just scale up.
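"Just scale up" comes down to a quick capacity check against the single machines listed on the "Data" slide. Below is a minimal sketch using those RAM figures; the 3x headroom factor (for copies and intermediate structures during training) is an arbitrary illustrative assumption, not a number from the talk.

```python
# Sketch: which of the single machines quoted on the "Data" slide could hold
# a dataset in memory? The headroom factor is an illustrative assumption.

MACHINE_RAM_GB = {  # RAM per machine as quoted on the slide
    "AWS": 244,
    "Google Cloud Platform": 208,
    "Azure": 112,
}


def machines_that_fit(dataset_gb: float, headroom: float = 3.0) -> list:
    """Providers whose quoted machine has at least dataset_gb * headroom GB of RAM."""
    needed_gb = dataset_gb * headroom
    return sorted(name for name, ram in MACHINE_RAM_GB.items() if ram >= needed_gb)


print(machines_that_fit(50))   # needs 150 GB
print(machines_that_fit(80))   # needs 240 GB
print(machines_that_fit(100))  # needs 300 GB: time to think about distributing
```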
So RAM is all good, but GPUs are another kettle of fish.
GPU is good for a specific
When I say expensive, I mean in terms of money but also time