Creating a Data Science Team from an Architect's perspective. This is about team building: how to support a data science team with the right staff, including data engineers and DevOps.
2. So you want to data science.
Adam Muise
Chief Architect
2. Who am I?!
• Chief Architect at Paytm Labs!
• Paytm Labs is a data-driven lab founded to take on
the really hard problems of scaling up Fraud,
Recommendation, Rating, and Platform at Paytm!
• Paytm is an Indian payments/wallet company that already has 50 million wallets, adds almost 1 million wallets a day, and will have more than 100 million customers by the end of the year. Alibaba recently invested in us; perhaps you heard.
• I’ve also worked with Data Science teams at IBM,
Cloudera, and Hortonworks!
7. The Leadership!
If you are creating a data science
team, chances are that you are not a
Data Scientist. Data Scientists are
best applied to the problems of data,
not management.!
8. The Leadership!
Your boss (should ask): Why do you even need data science to solve the problem?
You (should) answer: The problem is too complex to solve without machine learning. Here's why.
You (should not) answer: Big data and data science are on the roadmap.
9. The Leadership!
You have your budget for a team of 2
data scientists. That’s a good start
right? Get ready to ask for more
money. !
10. The Leadership!
You need to ask your management for:!
- Budget for 2 data engineers for every data scientist you hire!
- Access to the data lake; failing that, access to the data warehouse!
- DevOps!
- Time to gain domain expertise before producing results!
- Exec-level cooperation from those teams who own the data and
tools you need and those who understand the data you need!
- A budget for servers/tools/additional storage based on a TCO
calculation you already did (right?)!
- A dedicated place for your team to work!
11. The Leadership!
Got DataLake?
No? Depending on your problem space, chances are you are building one, unless you can pull what you need from an existing data warehouse.
12. The Leadership!
You didn’t do a TCO (Total Cost of Ownership) calculation?
Ok, here you go:!
1. Internal/External cloud instances that can run Spark/
Hadoop/etc!
2. Storage costs (S3, internal, etc) for your analytical data
sets!
3. Lead time to get started, something like 1-2 months
depending on the complexity of the problem (Fraud
might take 3 months whereas Recommendation Engines
might be 1 month)!
4. Training time and costs for tools you didn’t know you
needed!
What / How much:
- 24-32 medium to large instances on AWS each month: $15,000 to $45,000 per month
- Storage costs for S3 (400TB to 2PB): $12,000 to $57,000 per month
- Salaries & operating expenses: 2 x $xxxxx (your operating costs including salaries for yourself and 3 people)
- Training (courses for tools and perhaps a conference trip for hiring): $5,000 to $15,000
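The ranges above roll up into a quick back-of-the-envelope TCO sketch. A minimal version, using only the slide's illustrative figures (the salary line is deliberately left as the slide's "2 x $xxxxx" placeholder):

```python
# Back-of-the-envelope monthly TCO for a starting data science team,
# using the illustrative ranges from the table above (not real quotes).

def monthly_tco(compute, storage, salaries_opex, training_amortized):
    """Sum the monthly cost components (all figures in dollars)."""
    return compute + storage + salaries_opex + training_amortized

# The slide leaves salaries as "2 x $xxxxx" -- fill in your own number.
salaries_opex = 0

low = monthly_tco(compute=15_000, storage=12_000,
                  salaries_opex=salaries_opex,
                  training_amortized=5_000 / 12)   # training budget spread over a year
high = monthly_tco(compute=45_000, storage=57_000,
                   salaries_opex=salaries_opex,
                   training_amortized=15_000 / 12)

print(f"Monthly TCO range (before salaries): ${low:,.0f} to ${high:,.0f}")
```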
14. The Team!
So you have permission, resources,
and a corner in an office. How do you
start? !
15. The Team!
Assemble your team in the following
order:!
1. Get a Data Engineer with a good analytical mind. Have them beg, borrow, or steal whatever data sets might be applicable to the problem. Without data, no data-sciencey stuff can happen.
16. The Team!
Assemble your team in the
following order:!
2. While you are getting
your data, hire or recruit
an internal Data Scientist. !
Easy, right?!
17. !!!!!!WARNING!!!!!!!
Data Science is not a mystical art form handed down by monks and taught over 50 years. You just need:
• a good math background
• academic or job experience with machine learning
• business context
• the ability to code
That can be easier to find than you think.

That being said, everybody seems to think they are data scientists these days, from the guy who writes the monthly SQL reports to your office manager who is a wiz at Excel.
18. The Team!
Assemble your team in the following
order:!
3. More Data Engineers. !
4. DevOps support (if you don’t have
a common resource pool to draw
from).!
19. The Team!
Keep your data science team innovative, keep them away from bureaucracy, keep them cool. Don't discount the cool factor.
They are supposed to solve hard problems, not deal with the everyday business issues. To stay objective, they need to be decoupled from the emergencies and the mediocre.
If that sounds elitist, then I challenge you to create a scaling fraud detection system with your existing data warehouse team. No really, try it.
20. The Team!
What will they do?!
The Data Engineer
Your data engineer is the heart and soul of your data science team and will get almost none of the credit in the end. They will help build your data pipeline, perform data transformations, optimize training, automate validation, and take the results into production.
If you are lucky, you have Data Scientists that respect this
role and will often take some of these roles on to help ensure
their vision reaches production. Instead of relying on luck,
you can hire this way too. !
21. The Team!
What will they do?!
The Data Scientist!
Your Data Scientist will explore the data, create models, validate,
explore the data again, go in a different direction, clarify
requirements, model again, validate, retract, and then produce a
good model. The process is not deterministic and is a mix of
research and implementation. A good Data Scientist will be able to
code in the tools you intend to implement production code with,
something like Scala on Spark.
Your Data Scientist will have or at least learn the business context
required to solve your problem. They will need to communicate with
business experts to validate their solutions actually solve the
problem or to help drive them in a new direction. !
22. The Team!
What will they do?!
DevOps!
Developer Operations will help
build that data pipeline for you. If
you have to build a Data Lake from
scratch, you are going to really rely
on these folks. They should be
elite, understand distributed
systems, ride a motorcycle, and be
someone you feel uncomfortable
standing next to in an elevator.!
23. Managing The Team!
If your Data Scientists are not stellar
coders, put a Data Engineer in their
grill and make them produce code.
They can’t contribute if they can’t get
their hands dirty. Data Science is not
an ivory tower. !
24. Managing The Team!
Introduce your team to the
business team that knows the
data or business processes
better than anyone else. Often
that’s not the CIO-favored DWH
team, but rather the Customer
Service Representatives*!
*This was especially true in fighting Fraud. !
25. Managing The Team!
Ways to make your team hate you:!
Data Scientists:!
• Don’t provide the data they need to create their models!
• Suggest that they create their own training data, from scratch!
• Provide ambiguous goals for the accuracy and precision of their models!
• Tell them to mine the data / don't have a plan!
• Don’t respect the time it takes to create a model!
Data Engineers:!
• Let the Data Scientists use whatever tool they want without respect to parallel processing or
implementation!
• Have no management control over your data sources!
DevOps:!
• Use anything by IBM, Microsoft, SAS, or Oracle in your pipeline!
• Let the Data Engineers decide on the infrastructure!
27. The Work!
Start out with a goal that is clear and
unambiguous:
“I want to detect and prevent 50% of
Fraud in my payments system”!
“I want to increase conversion rates in
my eCommerce platform by 20%”!
28. The Work!
Get as much of the raw data as soon as you can
and as fast as you can. Don’t have a Data Lake?
Get your Hadoop on ASAP. !
29. The Work!
Give the team time to research the
data, gain context and become
experts. !
30. The Work!
Data without context == a complete
lack of direction in research. !
Research needs constant checks to
ensure that the primary problem is
being solved. !
31. The Work!
Data Science Development !=
Engineering Software Development.!
You will have to separate your
research process from the
engineering process that delivers the
models to production. !
32. The Work!
Data Engineering is an ongoing
process. You will need to maintain
pipelines, adapt to schema changes,
implement data cleansing, maintain
metadata in the data lake, optimize
processing workflows, etc. You will
never outgrow the need for your Data
Engineers. !
34. The Architecture!
Start with the cloud. You need to get your infrastructure up as quickly as possible. At the beginning, this is cheaper than you think compared to the time and startup costs of creating an on-premise data lake, even/especially if you have an existing IT team*

*If you are a big corporation, your IT team is often the biggest barrier to your success in creating an independent Data Science team.
36. The Architecture: Lambda Architecture
Batch Ingest:!
• SQOOP from MySQL instances!
• Keep as much in HDFS as you can, offload to S3 for
DR/Archive and when you have colder data!
• Spark and other Hadoop processing tools can run
natively over S3 data so it’s never really gone (don’t
use Glacier in a processing workflow)!
Realtime Ingest:!
• Mypipe to get events from binary log data and push
into Kafka topics (under construction)!
• VoltDB connector to get events from DB and push to
Kafka (under construction)!
• Streaming data piped through Kafka!
• All Realtime data processed with Spark Streaming or
Storm from Kafka!
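The batch-ingest step above can be sketched as a single Sqoop job. A minimal example; the host, database, and table names here are hypothetical:

```
sqoop import \
  --connect jdbc:mysql://db-host/payments \
  --username etl_user -P \
  --table transactions \
  --target-dir /data/raw/transactions \
  --num-mappers 8
```

Land the raw tables in HDFS first; cold partitions can then be offloaded to S3 as described above.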
37. The Architecture!
As you grow, your processing and storage needs will likely mature. Consider moving to an on-premise solution for your Hadoop/processing architecture. You can always archive to S3 if you need DR and don't have the appetite to create two clusters.
38. The Architecture!
With an on-premise architecture, you
can interact with existing on-premise
production systems quickly. For us,
that means real-time Fraud detection
and action. You may find yourself
maintaining both in the long run.!
40. armando@paytm.com - @jabenitez
Supervised learning vs Anomaly detection
Anomaly detection:
๏ Very small number of positive examples
๏ Large number of negative examples
๏ Many different "types" of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we've seen so far.
Supervised learning:
๏ Ideally a large number of positive and negative examples
๏ Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to ones in the training set.
* Anomaly Detection - Andrew Ng - Coursera ML Course
41. What approach to follow?
๏ Not so good: One model to rule them all
๏ Better:
๏ Many models competing against each other
๏ 100s or 1000s of rules running in parallel
๏ Know thy customer
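The "100s or 1000s of rules running in parallel" idea can be sketched as a minimal rule engine that scores every rule against each event. The rule names and thresholds here are made up for illustration:

```python
# Minimal sketch of many fraud rules evaluated over a single transaction.
# Rule names and thresholds are hypothetical.

RULES = {
    "high_amount":    lambda txn: txn["amount"] > 10_000,
    "burst_velocity": lambda txn: txn["txns_last_hour"] > 20,
    "new_device":     lambda txn: txn["device_age_days"] < 1,
}

def score(txn):
    """Run every rule; return the names of the rules that fired."""
    return [name for name, rule in RULES.items() if rule(txn)]

txn = {"amount": 12_500, "txns_last_hour": 3, "device_age_days": 0}
print(score(txn))  # ['high_amount', 'new_device']
```

In production each rule would run in parallel over the event stream; the competing models in the bullet above combine the same way, with each model voting on the transaction.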
42. Feature Selection
๏ Want:
p(x) large for normal examples,
p(x) small for anomalous examples
๏ Most common problem:
comparable distributions for both normal and anomalous examples
๏ Possible solutions:
๏ Apply transformations and variable combinations:
๏ x_{n+1} = (x_1 + x_4)^2 / x_3
๏ Focus on variable ratios and transaction velocity
๏ Use deep learning for feature extraction
๏ Dimensionality reduction
๏ your solution here
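The transformation bullets above can be sketched in a few lines; the field names are hypothetical, and the first feature is the slide's own example combination:

```python
# Sketch of the feature-engineering ideas above: combine raw variables
# into ratios and velocities that separate normal from anomalous
# behaviour better. Field names are hypothetical.

def engineer_features(x1, x3, x4, txn_count, window_hours):
    feats = {}
    # the slide's example combination: x_{n+1} = (x1 + x4)^2 / x3
    feats["combo"] = (x1 + x4) ** 2 / x3
    # a variable ratio
    feats["ratio_1_3"] = x1 / x3
    # transaction velocity: events per hour over an observation window
    feats["velocity"] = txn_count / window_hours
    return feats

print(engineer_features(x1=2.0, x3=4.0, x4=2.0, txn_count=30, window_hours=6))
```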
45. What we have tried
๏ Density estimator
๏ 2D Profiles
๏ Anomaly detection
๏ Clustering
๏ Model ensemble (Random forest)
๏ Deep learning (RBM)
๏ Logistic Regression
Combine
47. Anomaly Detection* - Example
๏ Choose features x_i that are indicative of anomalous examples
๏ Fit parameters mu_j, sigma_j^2 of a normal distribution to each feature on the training set
๏ Given a new example x, compute p(x) = prod_j p(x_j; mu_j, sigma_j^2)
๏ Flag an anomaly if p(x) < epsilon
* Anomaly Detection - Andrew Ng - Coursera ML Course
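The steps above can be sketched as a tiny density-estimation detector; this is a pure-Python toy with made-up numbers, not the production model:

```python
# Minimal density-estimation anomaly detector: fit a per-feature Gaussian
# on normal examples, flag new points whose density falls below epsilon.
import math

def fit(train):
    """Per-feature mean and variance from rows of normal examples."""
    n, d = len(train), len(train[0])
    mu = [sum(row[j] for row in train) / n for j in range(d)]
    var = [sum((row[j] - mu[j]) ** 2 for row in train) / n for j in range(d)]
    return mu, var

def p(x, mu, var):
    """Product of independent univariate Gaussian densities."""
    prob = 1.0
    for xj, m, v in zip(x, mu, var):
        prob *= math.exp(-(xj - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
    return prob

def is_anomaly(x, mu, var, epsilon):
    return p(x, mu, var) < epsilon

train = [[1.0, 10.0], [1.2, 9.5], [0.9, 10.2], [1.1, 9.8]]  # normal examples
mu, var = fit(train)
print(is_anomaly([1.0, 10.0], mu, var, epsilon=1e-3))  # typical point -> False
print(is_anomaly([5.0, 50.0], mu, var, epsilon=1e-3))  # extreme point -> True
```

Epsilon is not guessed: it is tuned on the labeled cross-validation set described on the evaluation slides.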
48. Algorithm Evaluation
๏ Fit model p(x) on the training set
๏ On a cross validation/test example x, predict y = 1 (anomaly) if p(x) < epsilon, else y = 0
๏ Possible evaluation metrics:
๏ True positives, false positives, false negatives, true negatives
๏ Precision/Recall
๏ F1-score
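The metrics above are a handful of lines to compute; a minimal sketch over toy labels:

```python
# Precision, recall, and F1 from true vs. predicted labels (1 = anomaly).

def evaluate(y_true, y_pred):
    tp = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 1)
    fp = sum(1 for t, q in zip(y_true, y_pred) if t == 0 and q == 1)
    fn = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# toy cross-validation labels
print(evaluate([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```

With heavily skewed fraud labels, plain accuracy is useless, which is why the slide lists precision/recall and F1 instead.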
50. Anomaly Detection*
Assume we have some labeled data of anomalous and non-anomalous examples: y = 0 if standard behaviour, y = 1 if anomalous.
Training set: (assume normal examples/not anomalous)
Cross validation set:
Test set:
* Anomaly Detection - Andrew Ng - Coursera ML Course
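The split above (train only on normals, keep the scarce labeled anomalies for cross-validation and test) can be sketched as follows; the 60/20/20 proportions are a common convention, not from the slide:

```python
# Sketch of the split above: training set holds only normal examples;
# labeled anomalies (y = 1) are reserved for cross-validation and test.
# The 60/20/20 proportions are an assumed convention.

def split(normal, anomalous):
    n = len(normal)
    train = normal[: int(0.6 * n)]          # unlabeled normals for fitting p(x)
    rest = normal[int(0.6 * n):]
    half, a_half = len(rest) // 2, len(anomalous) // 2
    cv = [(x, 0) for x in rest[:half]] + [(x, 1) for x in anomalous[:a_half]]
    test = [(x, 0) for x in rest[half:]] + [(x, 1) for x in anomalous[a_half:]]
    return train, cv, test

normal = list(range(10))      # stand-ins for normal feature vectors
anomalous = ["a1", "a2"]      # stand-ins for the few labeled anomalies
train, cv, test = split(normal, anomalous)
print(len(train), len(cv), len(test))  # 6 3 3
```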
54. The lake again
(Diagram: the data lake grows from Lake Simcoe toward Lake Superior)
๏ Classic Lambda Architecture
๏ Various Processing Frameworks
๏ Near-Realtime Scoring/Alerting*
55. Fraud Capabilities and Technology
A. Batch Ingest and Analysis of transaction data from Database: traditional ETL tools for transfer, HDFS/S3 for storage, Spark for processing
B. Batch Behavioural and Portfolio heuristic fraud detection: model analysis with iPython/Scala Notebook, Spark for processing, HDFS/HBase/Cassandra for storage
C. Near-realtime anomaly and heuristic fraud detection: Kafka real-time ingest, Storm/Spark Streaming for near-realtime interception of data, HBase for model/rule storage and lookup
D. Online Model Scoring: JPMML/Spark Streaming for realtime model scoring