So you want to data
science.
Adam Muise
Chief Architect
Who am I?!
•  Chief Architect at Paytm Labs!
•  Paytm Labs is a data-driven lab founded to take on
the really hard problems of scaling up Fraud,
Recommendation, Rating, and Platform at Paytm!
•  Paytm is an Indian Payments/Wallet company that
already has 50 Million wallets, adds almost 1 Million
wallets a day, and will have more than 100 Million
customers by the end of the year. Alibaba recently
invested in us, perhaps you heard. !
•  I’ve also worked with Data Science teams at IBM,
Cloudera, and Hortonworks!
Paytm!
This presentation is
short so that you can
ask a lot of questions.!
Wisdom Nuggets…!
The Leadership!
The Leadership!
If you are creating a data science
team, chances are that you are not a
Data Scientist. Data Scientists are
best applied to the problems of data,
not management.!
The Leadership!
Your boss (should ask): Why do you even
need data science to solve the problem?!
You (should) answer: The problem is too
complex to solve without machine
learning. Here’s why.!
You (should not) answer: Big data and
data science are on the roadmap.!
The Leadership!
You have your budget for a team of 2
data scientists. That’s a good start
right? Get ready to ask for more
money. !
The Leadership!
You need to ask your management for:!
-  Budget for 2 data engineers for every data scientist you hire!
-  Access to the data lake; failing that, access to the data warehouse!
-  DevOps!
-  Time to gain domain expertise before producing results!
-  Exec-level cooperation from those teams who own the data and
tools you need and those who understand the data you need!
-  A budget for servers/tools/additional storage based on a TCO
calculation you already did (right?)!
-  A dedicated place for your team to work!
The Leadership!
Got DataLake?!
!
No? Depending on your
problem space,
chances are you are
building one, unless you
can pull what you need
from an existing data
warehouse.!
The Leadership!
You didn’t do a TCO (Total Cost of Ownership) calculation?
Ok, here you go:!
1.  Internal/External cloud instances that can run Spark/
Hadoop/etc!
2.  Storage costs (S3, internal, etc) for your analytical data
sets!
3.  Lead time to get started, something like 1-2 months
depending on the complexity of the problem (Fraud
might take 3 months whereas Recommendation Engines
might be 1 month)!
4.  Training time and costs for tools you didn’t know you
needed!
What | How much!
24-32 medium to large instances on AWS each month | $15,000 to $45,000 per month!
Storage costs for S3 (400TB to 2PB) | $12,000 to $57,000 per month!
Salaries & Operating Expenses | 2 x $xxxxx your operating costs including salaries for yourself and 3 people!
Training (Courses for Tools and perhaps a conference trip for hiring) | $5,000 to $15,000!
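The ranges above are easy to sanity-check with a back-of-envelope calculation. A quick sketch in Python, amortizing the one-time training budget over a year (the figures are the illustrative ones from the table, not real quotes):

```python
# Back-of-envelope monthly TCO sketch using the illustrative ranges above.
compute = (15_000, 45_000)   # AWS instances, per month
storage = (12_000, 57_000)   # S3, per month
training = (5_000, 15_000)   # one-time, amortized over 12 months

low = compute[0] + storage[0] + training[0] / 12
high = compute[1] + storage[1] + training[1] / 12
print(f"Monthly run rate (excl. salaries): ${low:,.0f} to ${high:,.0f}")
# → Monthly run rate (excl. salaries): $27,417 to $103,250
```

Salaries usually dominate either way, which is why the table lists them separately.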
The Team!
The Team!
So you have permission, resources,
and a corner in an office. How do you
start? !
The Team!
Assemble your team in the following
order:!
1. Get a Data Engineer with a good
analytical mind. Have them beg,
borrow, or steal whatever data sets
might be applicable to the
problem. Without data, no data-
sciencey stuff can happen.!
The Team!
Assemble your team in the
following order:!
2. While you are getting
your data, hire or recruit
an internal Data Scientist. !
Easy, right?!
!!!!!!WARNING!!!!!!!
Data Science is not a mystical art form handed down by monks and taught over
50 years. You just need:!
•  a good math background!
•  academic or job experience with machine learning!
•  business context!
•  the ability to code!
That can be easier to find than you think. !
!
That being said, everybody seems to think they are data scientists these days,
from the guy who writes the monthly SQL reports to your office manager who is a
whiz at Excel. !
The Team!
Assemble your team in the following
order:!
3. More Data Engineers. !
4. DevOps support (if you don’t have
a common resource pool to draw
from).!
The Team!
Keep your data science team innovative, keep
them away from bureaucracy, keep them cool.
Don’t discount the cool factor.!
They are supposed to solve hard problems, not
deal with the everyday business issues. To be
objective they need to be decoupled from the
emergencies and the mediocre. !
If that sounds elitist then I challenge you to
create a scaling fraud detection system with your
existing data warehouse team. No really, try it. !
The Team!
What will they do?!
The Data Engineer !
Your data engineer is the heart and soul of your data science
team and will get almost none of the credit in the end. They
will help build your data pipeline, perform data
transformations, optimize training, automate validation, and
take the results into production. !
If you are lucky, you have Data Scientists that respect this
role and will often take some of these roles on to help ensure
their vision reaches production. Instead of relying on luck,
you can hire this way too. !
The Team!
What will they do?!
The Data Scientist!
Your Data Scientist will explore the data, create models, validate,
explore the data again, go in a different direction, clarify
requirements, model again, validate, retract, and then produce a
good model. The process is not deterministic and is a mix of
research and implementation. A good Data Scientist will be able to
code in the tools that you intend to implement production code
with, something like Scala in Spark. !
Your Data Scientist will have or at least learn the business context
required to solve your problem. They will need to communicate with
business experts to validate their solutions actually solve the
problem or to help drive them in a new direction. !
The Team!
What will they do?!
DevOps!
Developer Operations will help
build that data pipeline for you. If
you have to build a Data Lake from
scratch, you are going to really rely
on these folks. They should be
elite, understand distributed
systems, ride a motorcycle, and be
someone you feel uncomfortable
standing next to in an elevator.!
Managing The Team!
If your Data Scientists are not stellar
coders, put a Data Engineer in their
grill and make them produce code.
They can’t contribute if they can’t get
their hands dirty. Data Science is not
an ivory tower. !
Managing The Team!
Introduce your team to the
business team that knows the
data or business processes
better than anyone else. Often
that’s not the CIO-favored DWH
team, but rather the Customer
Service Representatives*!
*This was especially true in fighting Fraud. !
Managing The Team!
Ways to make your team hate you:!
Data Scientists:!
•  Don’t provide the data they need to create their models!
•  Suggest that they create their own training data, from scratch!
•  Provide ambiguous goals for the accuracy and precision of their models!
•  Tell them to mine the data / don’t have a plan!
•  Don’t respect the time it takes to create a model!
Data Engineers:!
•  Let the Data Scientists use whatever tool they want without respect to parallel processing or
implementation!
•  Have no management control over your data sources!
DevOps:!
•  Use anything by IBM, Microsoft, SAS, or Oracle in your pipeline!
•  Let the Data Engineers decide on the infrastructure!
The Work!
The Work!
Start out with a clear goal that is
unambiguous. !
“I want to detect and prevent 50% of
Fraud in my payments system”!
“I want to increase conversion rates in
my eCommerce platform by 20%”!
The Work!
Get as much of the raw data as soon as you can
and as fast as you can. Don’t have a Data Lake?
Get your Hadoop on ASAP. !
!
The Work!
Give the team time to research the
data, gain context and become
experts. !
!
The Work!
Data without context == a complete
lack of direction in research. !
Research needs constant checks to
ensure that the primary problem is
being solved. !
!
The Work!
Data Science Development !=
Engineering Software Development.!
You will have to separate your
research process from the
engineering process that delivers the
models to production. !
!
The Work!
Data Engineering is an ongoing
process. You will need to maintain
pipelines, adapt to schema changes,
implement data cleansing, maintain
metadata in the data lake, optimize
processing workflows, etc. You will
never outgrow the need for your Data
Engineers. !
!
The Architecture!
The Architecture!
Start with the cloud. You need to get
your infrastructure up as quickly as
possible. At the beginning, this is
cheaper than you think compared to the
time and startup costs of creating an
on-premise data lake, even/especially if
you have an existing IT Team*!
!
*If you are a big corporation, your IT team is often the biggest barrier to your success in
creating an independent Data Science team.!
The Architecture!
We had to build a data lake. It looks like
this:!
!
The Architecture!
Lambda Architecture!
Batch Ingest:!
•  SQOOP from MySQL instances!
•  Keep as much in HDFS as you can, offload to S3 for
DR/Archive and when you have colder data!
•  Spark and other Hadoop processing tools can run
natively over S3 data so it’s never really gone (don’t
use Glacier in a processing workflow)!
Realtime Ingest:!
•  Mypipe to get events from binary log data and push
into Kafka topics (under construction)!
•  VoltDB connector to get events from DB and push to
Kafka (under construction)!
•  Streaming data piped through Kafka!
•  All Realtime data processed with Spark Streaming or
Storm from Kafka!
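The batch/realtime split above is the standard lambda pattern: a batch layer periodically recomputes views over all the data, while a speed layer keeps a realtime delta that gets merged in at query time. A minimal, library-free sketch of the idea (all names are illustrative, not from the Paytm pipeline):

```python
# Minimal sketch of the lambda pattern: batch recompute + realtime merge.
batch_store = []      # stands in for HDFS/S3
realtime_delta = []   # stands in for events arriving via Kafka

def batch_recompute():
    """Batch layer: full recompute of an aggregate (here, a plain sum)."""
    return sum(batch_store)

def query(batch_view):
    """Serving layer: merge the precomputed batch view with the speed layer."""
    return batch_view + sum(realtime_delta)

batch_store.extend([10, 20, 30])   # historical transactions
view = batch_recompute()           # the nightly batch job
realtime_delta.extend([5, 1])      # events since the last batch run
print(query(view))                 # → 66
```

In the real pipeline the "sum" is a model feature or fraud score, the batch layer is Spark over HDFS/S3, and the speed layer is Spark Streaming or Storm reading from Kafka.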
The Architecture!
As you grow, your processing and
storage needs will likely mature.
Consider moving to an on-premise
solution for your Hadoop/Processing
architecture. You can always archive
to S3 if you need DR and don’t have
the appetite to create two clusters.!
The Architecture!
With an on-premise architecture, you
can interact with existing on-premise
production systems quickly. For us,
that means real-time Fraud detection
and action. You may find yourself
maintaining both in the long run.!
What Actual Data Science
looks like…
armando@paytm.com - @jabenitez
Supervised learning vs Anomaly detection
Anomaly detection:
๏  Very small number of positive examples.
๏  Large number of negative examples.
๏  Many different “types” of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we’ve seen so far.
Supervised learning:
๏  Ideally large number of positive and negative examples.
๏  Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples likely to be similar to ones in the training set.
* Anomaly Detection - Andrew Ng - Coursera ML Course
What approach to follow?
๏  Not so good: One model to rule them all
๏  Better: 
๏  Many models competing against each other
๏  100s or 1000s of rules running in parallel
๏  Know thy customer
Feature Selection
๏  Want p(x) large (small) for normal examples, p(x) small (large) for anomalous examples
๏  Most common problem: comparable distributions for both normal and anomalous examples
๏  Possible solutions:
๏  Apply transformations and variable combinations: x_{n+1} = (x_1 + x_4)^2 / x_3
๏  Focus on variable ratios and transaction velocity
๏  Use deep learning for feature extraction
๏  Dimensionality reduction
๏  your solution here
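The transformation bullet above is easy to make concrete. A hedged Python sketch of this kind of feature engineering; the column names and values are made up for the example:

```python
# Illustrative feature engineering of the kind listed above: combine raw
# columns into powers and ratios that better separate normal from
# anomalous examples. Column layout (x1..x4) is hypothetical.

def engineer(row):
    x1, x2, x3, x4 = row
    return {
        "combo": (x1 + x4) ** 2 / x3,  # the x_{n+1} = (x_1 + x_4)^2 / x_3 transform
        "ratio": x1 / x2,              # a simple variable ratio
    }

print(engineer((2.0, 4.0, 2.0, 2.0)))  # → {'combo': 8.0, 'ratio': 0.5}
```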
Feature Selection
[Histogram: counts vs. Variable X, comparing background (BKG) and signal (SIG) distributions]
What we have tried
๏  Density estimator
๏  2D Profiles
๏  Anomaly detection
๏  Clustering
๏  Model ensemble (Random forest)
๏  Deep learning (RBM)
๏  Logistic Regression
…and combine them.
Gaussian distribution
[Slide shows the normal density: p(x; μ, σ²) = (1/√(2πσ²)) · exp(−(x − μ)² / (2σ²))]
Anomaly Detection* - Example
๏  Choose features, x_i, that are indicative of anomalous examples.
๏  Fit parameters μ_j, σ_j² of a per-feature normal distribution
๏  Given new example x, compute: p(x) = ∏_j p(x_j; μ_j, σ_j²)
๏  Anomaly if p(x) < ε
* Anomaly Detection - Andrew Ng - Coursera ML Course
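A minimal sketch of this per-feature Gaussian detector in pure Python, on tiny made-up data (a real implementation would run over Spark):

```python
import math

# Per-feature Gaussian anomaly detector, as described above: fit a normal
# distribution to each feature on normal data, multiply the densities,
# and flag an anomaly if p(x) < epsilon. Data is synthetic and illustrative.

def fit(rows):
    """Estimate per-feature mean and variance from normal examples."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    var = [sum((r[j] - mu[j]) ** 2 for r in rows) / n for j in range(d)]
    return mu, var

def p(x, mu, var):
    """Density under the independent-feature Gaussian model."""
    prob = 1.0
    for xj, m, v in zip(x, mu, var):
        prob *= math.exp(-(xj - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
    return prob

normal = [[1.0, 10.0], [1.2, 9.5], [0.9, 10.2], [1.1, 9.8]]
mu, var = fit(normal)
epsilon = 1e-3
print(p([1.05, 9.9], mu, var) < epsilon)   # False: close to training data
print(p([5.0, 30.0], mu, var) < epsilon)   # True: far from training data, anomaly
```

ε itself is tuned on the labeled cross validation set, as the evaluation slides below describe.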
armando@paytm.com - @jabenitez
Algorithm Evaluation
๏  Fit model on training set
๏  On a cross validation/test example x, predict y = 1 if p(x) < ε, else y = 0
๏  Possible evaluation metrics:
๏  True positive, false positive, false negative, true negative
๏  Precision/Recall
๏  F1-score
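The metrics above follow directly from the confusion-matrix counts; a small sketch with made-up counts:

```python
# Precision, recall, and F1 from raw confusion-matrix counts.
def prf(tp, fp, fn):
    precision = tp / (tp + fp)            # of flagged cases, how many were real
    recall = tp / (tp + fn)               # of real cases, how many we caught
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p_, r_, f1_ = prf(tp=8, fp=2, fn=4)
print(round(p_, 3), round(r_, 3), round(f1_, 3))   # → 0.8 0.667 0.727
```

Plain accuracy is useless here because fraud labels are so skewed; that is why the slide lists precision/recall and F1.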
Implementation
Anomaly Detection*
Assume we have some labeled data of anomalous and non-anomalous examples: y = 0 if standard behaviour, y = 1 if anomalous.
Training set: (assume normal examples/not anomalous)
Cross validation set:
Test set:
* Anomaly Detection - Andrew Ng - Coursera ML Course
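The split described above (train only on normal examples, put the scarce labeled anomalies into the cross validation and test sets) can be sketched as follows, on purely synthetic data:

```python
# Sketch of the data split above: the training set contains only normal
# examples (y = 0); labeled anomalies (y = 1) go into the CV and test sets.
normal    = [(f"n{i}", 0) for i in range(10)]   # synthetic normal examples
anomalies = [(f"a{i}", 1) for i in range(4)]    # synthetic labeled anomalies

train = [x for x, y in normal[:6]]        # normal only; fit μ, σ² here
cv    = normal[6:8] + anomalies[:2]       # small labeled mix for tuning ε
test  = normal[8:]  + anomalies[2:]       # held-out labeled mix
print(len(train), len(cv), len(test))     # → 6 4 4
```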
Transform, Normalize, Calculate
Scala
Creating Scalable
Architecture
Futures!
The lake again
[Diagram: “Lake Simcoe going on Lake Superior”: the classic Lambda Architecture, various processing frameworks, and near-realtime scoring/alerting]
Fraud Capabilities and Technology
A.  Batch Ingest and Analysis of transaction data from Database: Traditional ETL tools for transfer, HDFS/S3 for storage, Spark for processing
B.  Batch Behavioural and Portfolio heuristic fraud detection: Model analysis with iPython/Scala Notebook, Spark for processing, HDFS/HBase/Cassandra for storage
C.  Near-realtime anomaly and heuristic fraud detection: Kafka real-time ingest, introduce Storm/Spark Streaming for near-realtime interception of data, HBase for model/rule storage and lookup
D.  Online Model Scoring: JPMML/Spark Streaming for realtime model scoring
Our framework shopping list
Explore & Train: iPython & Scala Notebooks; Spark (::Core, ::MLlib, ::Streaming, ::GraphX?)
Ingest, Store, Score, & Act: Kafka, Hadoop, HBase, Cassandra, SolrCloud, & S3; intercept with Storm? Spark Streaming?; OpenScoring? JPMML? R?
Fin

Weitere ähnliche Inhalte

Was ist angesagt?

Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
StampedeCon
 
Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»
Anna Shymchenko
 
Big Data & Oracle Technologies
Big Data & Oracle TechnologiesBig Data & Oracle Technologies
Big Data & Oracle Technologies
Oleksii Movchaniuk
 
IDERA Live | The Ever Growing Science of Database Migrations
IDERA Live | The Ever Growing Science of Database MigrationsIDERA Live | The Ever Growing Science of Database Migrations
IDERA Live | The Ever Growing Science of Database Migrations
IDERA Software
 
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
StampedeCon
 
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin BuzzwordsDataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku
 

Was ist angesagt? (20)

Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
 
Data Science with Hadoop: A Primer
Data Science with Hadoop: A PrimerData Science with Hadoop: A Primer
Data Science with Hadoop: A Primer
 
Flexible Design
Flexible DesignFlexible Design
Flexible Design
 
The Ecosystem is too damn big
The Ecosystem is too damn big The Ecosystem is too damn big
The Ecosystem is too damn big
 
Smart data for a predictive bank
Smart data for a predictive bankSmart data for a predictive bank
Smart data for a predictive bank
 
Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic Architecture
 
Unlock the value in your big data reservoir using oracle big data discovery a...
Unlock the value in your big data reservoir using oracle big data discovery a...Unlock the value in your big data reservoir using oracle big data discovery a...
Unlock the value in your big data reservoir using oracle big data discovery a...
 
Big Data in the Cloud - Montreal April 2015
Big Data in the Cloud - Montreal April 2015Big Data in the Cloud - Montreal April 2015
Big Data in the Cloud - Montreal April 2015
 
Big Data & Oracle Technologies
Big Data & Oracle TechnologiesBig Data & Oracle Technologies
Big Data & Oracle Technologies
 
IDERA Live | The Ever Growing Science of Database Migrations
IDERA Live | The Ever Growing Science of Database MigrationsIDERA Live | The Ever Growing Science of Database Migrations
IDERA Live | The Ever Growing Science of Database Migrations
 
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
 
Big data trends challenges opportunities
Big data trends challenges opportunitiesBig data trends challenges opportunities
Big data trends challenges opportunities
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Datalake Architecture
 
The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016
 
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin BuzzwordsDataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
 
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...
 
Data Warehousing using Hadoop
Data Warehousing using HadoopData Warehousing using Hadoop
Data Warehousing using Hadoop
 
Summary introduction to data engineering
Summary introduction to data engineeringSummary introduction to data engineering
Summary introduction to data engineering
 
Rob peglar introduction_analytics _big data_hadoop
Rob peglar introduction_analytics _big data_hadoopRob peglar introduction_analytics _big data_hadoop
Rob peglar introduction_analytics _big data_hadoop
 

Andere mochten auch

ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016
Jayesh Thakrar
 

Andere mochten auch (9)

Introduction to streaming and messaging flume,kafka,SQS,kinesis
Introduction to streaming and messaging  flume,kafka,SQS,kinesis Introduction to streaming and messaging  flume,kafka,SQS,kinesis
Introduction to streaming and messaging flume,kafka,SQS,kinesis
 
Data Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache FlumeData Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache Flume
 
ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016
 
大型电商的数据服务的要点和难点
大型电商的数据服务的要点和难点 大型电商的数据服务的要点和难点
大型电商的数据服务的要点和难点
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVRO
 
Parquet overview
Parquet overviewParquet overview
Parquet overview
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
Flume vs. kafka
Flume vs. kafkaFlume vs. kafka
Flume vs. kafka
 

Ähnlich wie Paytm labs soyouwanttodatascience

Ähnlich wie Paytm labs soyouwanttodatascience (20)

Data Science Overview
Data Science OverviewData Science Overview
Data Science Overview
 
Lean Analytics: How to get more out of your data science team
Lean Analytics: How to get more out of your data science teamLean Analytics: How to get more out of your data science team
Lean Analytics: How to get more out of your data science team
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer
 
Big Data - JAX2011 (Pavlo Baron)
Big Data - JAX2011 (Pavlo Baron)Big Data - JAX2011 (Pavlo Baron)
Big Data - JAX2011 (Pavlo Baron)
 
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
The Right Data Warehouse: Automation Now, Business Value Thereafter
The Right Data Warehouse: Automation Now, Business Value ThereafterThe Right Data Warehouse: Automation Now, Business Value Thereafter
The Right Data Warehouse: Automation Now, Business Value Thereafter
 
What Managers Need to Know about Data Science
What Managers Need to Know about Data ScienceWhat Managers Need to Know about Data Science
What Managers Need to Know about Data Science
 
Dapper: the microORM that will change your life
Dapper: the microORM that will change your lifeDapper: the microORM that will change your life
Dapper: the microORM that will change your life
 
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
 
How to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePointHow to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePoint
 
Big Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil GamesBig Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil Games
 
Bbbt presentation 210415_final_2
Bbbt presentation 210415_final_2Bbbt presentation 210415_final_2
Bbbt presentation 210415_final_2
 
Journey of The Connected Enterprise - Knowledge Graphs - Smart Data
Journey of The Connected Enterprise - Knowledge Graphs - Smart DataJourney of The Connected Enterprise - Knowledge Graphs - Smart Data
Journey of The Connected Enterprise - Knowledge Graphs - Smart Data
 
The Lost Tales of Platform Design (February 2017)
The Lost Tales of Platform Design (February 2017)The Lost Tales of Platform Design (February 2017)
The Lost Tales of Platform Design (February 2017)
 
Business in the Driver’s Seat – An Improved Model for Integration
Business in the Driver’s Seat – An Improved Model for IntegrationBusiness in the Driver’s Seat – An Improved Model for Integration
Business in the Driver’s Seat – An Improved Model for Integration
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Rental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesRental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean Downes
 
Synapse NanoApps
Synapse NanoAppsSynapse NanoApps
Synapse NanoApps
 
How to Build Tools for Data Scientists That Don't Suck
How to Build Tools for Data Scientists That Don't SuckHow to Build Tools for Data Scientists That Don't Suck
How to Build Tools for Data Scientists That Don't Suck
 

Mehr von Adam Muise

KnittingBoar Toronto Hadoop User Group Nov 27 2012
KnittingBoar Toronto Hadoop User Group Nov 27 2012KnittingBoar Toronto Hadoop User Group Nov 27 2012
KnittingBoar Toronto Hadoop User Group Nov 27 2012
Adam Muise
 

Mehr von Adam Muise (20)

2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final
 
2015 feb 24_paytm_labs_intro_ashwin_armandoadam
2015 feb 24_paytm_labs_intro_ashwin_armandoadam2015 feb 24_paytm_labs_intro_ashwin_armandoadam
2015 feb 24_paytm_labs_intro_ashwin_armandoadam
 
Hadoop at the Center: The Next Generation of Hadoop
Hadoop at the Center: The Next Generation of HadoopHadoop at the Center: The Next Generation of Hadoop
Hadoop at the Center: The Next Generation of Hadoop
 
2014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part12014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part1
 
2014 sept 4_hadoop_security
2014 sept 4_hadoop_security2014 sept 4_hadoop_security
2014 sept 4_hadoop_security
 
2014 july 24_what_ishadoop
2014 july 24_what_ishadoop2014 july 24_what_ishadoop
2014 july 24_what_ishadoop
 
May 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLMay 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETL
 
2014 feb 24_big_datacongress_hadoopsession1_hadoop101
2014 feb 24_big_datacongress_hadoopsession1_hadoop1012014 feb 24_big_datacongress_hadoopsession1_hadoop101
2014 feb 24_big_datacongress_hadoopsession1_hadoop101
 
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture
 
2014 feb 5_what_ishadoop_mda
2014 feb 5_what_ishadoop_mda2014 feb 5_what_ishadoop_mda
2014 feb 5_what_ishadoop_mda
 
2013 Dec 9 Data Marketing 2013 - Hadoop
2013 Dec 9 Data Marketing 2013 - Hadoop2013 Dec 9 Data Marketing 2013 - Hadoop
2013 Dec 9 Data Marketing 2013 - Hadoop
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
 
What is Hadoop? Nov 20 2013 - IRMAC
What is Hadoop? Nov 20 2013 - IRMACWhat is Hadoop? Nov 20 2013 - IRMAC
What is Hadoop? Nov 20 2013 - IRMAC
 
What is Hadoop? Oct 17 2013
What is Hadoop? Oct 17 2013What is Hadoop? Oct 17 2013
What is Hadoop? Oct 17 2013
 
Sept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical Introduction
 
2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning
 
2013 march 26_thug_etl_cdc_talking_points
2013 march 26_thug_etl_cdc_talking_points2013 march 26_thug_etl_cdc_talking_points
2013 march 26_thug_etl_cdc_talking_points
 
2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog
 
KnittingBoar Toronto Hadoop User Group Nov 27 2012
KnittingBoar Toronto Hadoop User Group Nov 27 2012KnittingBoar Toronto Hadoop User Group Nov 27 2012
KnittingBoar Toronto Hadoop User Group Nov 27 2012
 
2012 sept 18_thug_biotech
2012 sept 18_thug_biotech2012 sept 18_thug_biotech
2012 sept 18_thug_biotech
 

Paytm labs soyouwanttodatascience

  • 1. So you want to data science. Adam Muise Chief Architect
  • 2. Who am I?! •  Chief Architect at Paytm Labs! •  Paytm Labs is a data-driven lab founded to take on the really hard problems of scaling up Fraud, Recommendation, Rating, and Platform at Paytm! •  Paytm is an Indian Payments/Wallet company, has 50 Million wallets already, adds almost 1 Million wallets a day, and will be greater than 100 Million customers by the end of the year. Alibaba recently invested in us, perhaps you heard. ! •  I’ve also worked with Data Science teams at IBM, Cloudera, and Hortonworks!
  • 4. This presentation is short so that you can ask a lot of questions.!
  • 7. The Leadership! If you are creating a data science team, chances are that you are not a Data Scientist. Data Scientists are best applied to the problems of data, not management.!
  • 8. The Leadership! Your boss (should ask): Why do you even data science to solve the problem?! You (should) answer: The problem is too complex to solve without machine learning. Here’s why.! You (should not) answer: Big data and data science is on the roadmap.!
  • 9. The Leadership! You have your budget for a team of 2 data scientists. That’s a good start right? Get ready to ask for more money. !
  • 10. The Leadership! You need to ask your management for:! -  Budget for 2 data engineers for every data scientist you hire! -  Access to the data lake, failing that, access the data warehouse! -  DevOps! -  Time to gain domain expertise before producing results! -  Exec-level cooperation from those teams who own the data and tools you need and those who understand the data you need! -  A budget for servers/tools/additional storage based on a TCO calculation you already did (right?)! -  A dedicated place for your team to work!
  • 11. The Leadership! Got DataLake?! ! No? Depending on your problem space, chances are you are building one unless you can pull what you need from an Existing Data Warehouse.!
  • 12. The Leadership! You didn’t do a TCO (Total Cost of Ownership) calculation? Ok, here you go:! 1.  Internal/External cloud instances that can run Spark/ Hadoop/etc! 2.  Storage costs (S3, internal, etc) for your analytical data sets! 3.  Lead time to get started, something like 1-2 months depending on the complexity of the problem (Fraud might take 3 months whereas Recommendation Engines might be 1 month)! 4.  Training time and costs for tools you didn’t know you needed! What! How much! 24-32 medium to large instances on AWS each month! $15,000 to $45,000 per month! Storage costs for S3 (400TB to 2PB)! $12,000 to $57,000 per month! Salaries & Operating Expenses! 2 x $xxxxx your operating costs including salaries for yourself and 3 people! Training! (Courses for Tools and perhaps a conference trip for hiring)! $5,000 to $15,000!
  • 14. The Team! So you have permission, resources, and a corner in an office. How do you start? !
  • 15. The Team! Assemble your team in the following order:! 1. Get a Data Engineer with a good analytical mind. Have him beg, borrow, or steal whatever data sets that might be applicable to the problem. Without data, no data sciencey stuff can happen.!
  • 16. The Team! Assemble your team in the following order:! 2. While you are getting your data, hire or recruit an internal Data Scientist. ! Easy, right?!
• 17. !!!!!!WARNING!!!!!!! Data Science is not a mystical art form handed down by monks and taught over 50 years. You just need:! •  a good math background! •  academic or job experience with machine learning ! •  business context! •  an understanding of how to code ! That can be easier to find than you think. ! ! That being said, everybody seems to think they are data scientists these days, from the guy who writes the monthly SQL reports to your office manager who is a whiz at Excel. !
  • 18. The Team! Assemble your team in the following order:! 3. More Data Engineers. ! 4. DevOps support (if you don’t have a common resource pool to draw from).!
• 19. The Team! Keep your data science team innovative, keep them away from bureaucracy, keep them cool. Don’t discount the cool factor.! They are supposed to solve hard problems, not deal with the everyday business issues. To stay objective, they need to be decoupled from the emergencies and the mediocre. ! If that sounds elitist then I challenge you to create a scaling fraud detection system with your existing data warehouse team. No really, try it. !
• 20. The Team! What will they do?! The Data Engineer ! Your data engineer is the heart and soul of your data science team and will get almost none of the credit in the end. They will help build your data pipeline, perform data transformations, optimize training, automate validation, and take the results into production. ! If you are lucky, you have Data Scientists who respect this role and will often take some of these roles on to help ensure their vision reaches production. Instead of relying on luck, you can hire this way too. !
• 21. The Team! What will they do?! The Data Scientist! Your Data Scientist will explore the data, create models, validate, explore the data again, go in a different direction, clarify requirements, model again, validate, retract, and then produce a good model. The process is not deterministic and is a mix of research and implementation. A good Data Scientist will be able to code in the tools that you intend to implement production code with, something like Scala in Spark. ! Your Data Scientist will have, or at least learn, the business context required to solve your problem. They will need to communicate with business experts to validate that their solutions actually solve the problem, or to help drive them in a new direction. !
  • 22. The Team! What will they do?! DevOps! Developer Operations will help build that data pipeline for you. If you have to build a Data Lake from scratch, you are going to really rely on these folks. They should be elite, understand distributed systems, ride a motorcycle, and be someone you feel uncomfortable standing next to in an elevator.!
  • 23. Managing The Team! If your Data Scientists are not stellar coders, put a Data Engineer in their grill and make them produce code. They can’t contribute if they can’t get their hands dirty. Data Science is not an ivory tower. !
  • 24. Managing The Team! Introduce your team to the business team that knows the data or business processes better than anyone else. Often that’s not the CIO-favored DWH team, but rather the Customer Service Representatives*! *This was especially true in fighting Fraud. !
• 25. Managing The Team! Ways to make your team hate you:! Data Scientists:! •  Don’t provide the data they need to create their models! •  Suggest that they create their own training data, from scratch! •  Provide ambiguous goals for the accuracy and precision of their models! •  Tell them to mine the data / don’t have a plan! •  Don’t respect the time it takes to create a model! Data Engineers:! •  Let the Data Scientists use whatever tool they want without respect to parallel processing or implementation! •  Have no management control over your data sources! DevOps:! •  Use anything by IBM, Microsoft, SAS, or Oracle in your pipeline! •  Let the Data Engineers decide on the infrastructure!
• 27. The Work! Start out with a goal that is clear and unambiguous. ! “I want to detect and prevent 50% of Fraud in my payments system”! “I want to increase conversion rates in my eCommerce platform by 20%”!
  • 28. The Work! Get as much of the raw data as soon as you can and as fast as you can. Don’t have a Data Lake? Get your Hadoop on ASAP. ! !
  • 29. The Work! Give the team time to research the data, gain context and become experts. ! !
  • 30. The Work! Data without context == a complete lack of direction in research. ! Research needs constant checks to ensure that the primary problem is being solved. ! !
  • 31. The Work! Data Science Development != Engineering Software Development.! You will have to separate your research process from the engineering process that delivers the models to production. ! !
  • 32. The Work! Data Engineering is an ongoing process. You will need to maintain pipelines, adapt to schema changes, implement data cleansing, maintain metadata in the data lake, optimize processing workflows, etc. You will never outgrow the need for your Data Engineers. ! !
• 34. The Architecture! Start with the cloud. You need to get your infrastructure up as quickly as possible. At the beginning, this is cheaper than you think compared to the time and startup costs of creating an on-premise data lake, even/especially if you have an existing IT Team*! ! *If you are a big corporation, your IT team is often the biggest barrier to your success in creating an independent Data Science team.!
  • 35. The Architecture! We had to build a data lake. It looks like this:! !
• 36. The Architecture! Lambda Architecture!
Batch Ingest:!
•  SQOOP from MySQL instances!
•  Keep as much in HDFS as you can, offload to S3 for DR/Archive and when you have colder data!
•  Spark and other Hadoop processing tools can run natively over S3 data so it’s never really gone (don’t use Glacier in a processing workflow)!
Realtime Ingest:!
•  Mypipe to get events from binary log data and push into Kafka topics (under construction)!
•  VoltDB connector to get events from DB and push to Kafka (under construction)!
•  Streaming data piped through Kafka!
•  All Realtime data processed with Spark Streaming or Storm from Kafka!
  • 37. The Architecture! As you grow, your processing and storage needs will likely mature. Consider moving to on-premise solution for your Hadoop/Processing architecture. You can always archive to S3 if you need DR and don’t have the appetite to create two clusters.!
  • 38. The Architecture! With an on-premise architecture, you can interact with existing on-premise production systems quickly. For us, that means real-time Fraud detection and action. You may find yourself maintaining both in the long run.!
  • 39. What Actual Data Science looks like…
• 40. armando@paytm.com - @jabenitez Supervised learning vs Anomaly detection
Anomaly detection:
๏  Very small number of positive examples
๏  Large number of negative examples.
๏  Many different “types” of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we’ve seen so far.
Supervised learning:
๏  Ideally large number of positive and negative examples.
๏  Enough positive examples for algorithm to get a sense of what positive examples are like; future positive examples likely to be similar to ones in training set.
* Anomaly Detection - Andrew Ng - Coursera ML Course
• 41. armando@paytm.com - @jabenitez What approach to follow?
๏  Not so good: One model to rule them all
๏  Better:
๏  Many models competing against each other
๏  100s or 1000s of rules running in parallel
๏  Know thy customer
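The "100s or 1000s of rules running in parallel" idea can be sketched as a set of predicate functions voting on a transaction. The rules, field names, and thresholds below are made up for illustration; a real system would weight rules and combine them with trained models.

```python
# Toy sketch: many simple fraud rules voting in parallel on one transaction.
rules = [
    lambda txn: txn["amount"] > 10_000,                      # high-value transfer
    lambda txn: txn["velocity_1h"] > 20,                     # many txns in the last hour
    lambda txn: txn["new_device"] and txn["amount"] > 1_000, # big spend from a new device
]

def fraud_score(txn):
    # Fraction of rules that fire; real systems weight and combine these signals
    return sum(rule(txn) for rule in rules) / len(rules)

txn = {"amount": 15_000, "velocity_1h": 3, "new_device": True}
print(fraud_score(txn))  # 2 of 3 hypothetical rules fire here
```

Keeping each rule tiny and independent is what lets you run thousands of them side by side and retire the ones that stop earning their keep.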
• 42. armando@paytm.com - @jabenitez Feature Selection
๏  Want p(x) large (small) for normal examples, p(x) small (large) for anomalous examples
๏  Most common problem: comparable distributions for both normal and anomalous examples
๏  Possible solutions:
๏  Apply transformations and variable combinations, e.g. x_{n+1} = (x1 + x4)² / x3
๏  Focus on variable ratios and transaction velocity
๏  Use deep learning for feature extraction
๏  Dimensionality reduction
๏  your solution here
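A variable combination like the one on the slide is plain feature engineering; here is a minimal sketch that appends the slide's example feature to a feature matrix (column indices are assumed 1-based as written on the slide, so x1 is column 0):

```python
import numpy as np

def add_combined_feature(X):
    """Append the slide's example feature x_{n+1} = (x1 + x4)^2 / x3
    (1-based feature indices, so columns 0, 3, and 2 of X)."""
    x1, x3, x4 = X[:, 0], X[:, 2], X[:, 3]
    new_feature = (x1 + x4) ** 2 / x3
    return np.column_stack([X, new_feature])

X = np.array([[1.0, 2.0, 4.0, 3.0]])
print(add_combined_feature(X))  # last column: (1 + 3)^2 / 4 = 4.0
```

The same pattern covers ratios and velocity features: compute the derived column, stack it on, and let validation tell you whether it separates normal from anomalous.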
• 44. armando@paytm.com - @jabenitez Feature Selection
[Plot: counts vs. variable X, comparing the background (BKG) and signal (SIG) distributions]
• 45. armando@paytm.com - @jabenitez What have we tried
๏  Density estimator
๏  2D Profiles
๏  Anomaly detection
๏  Clustering
๏  Model ensemble (Random forest)
๏  Deep learning (RBM)
๏  Logistic Regression
Combine
• 47. armando@paytm.com - @jabenitez Anomaly Detection* - Example
๏  Choose features, xi, that are indicative of anomalous examples.
๏  Fit parameters μj, σj² of a normal distribution to each feature.
๏  Given new example x, compute p(x) = ∏j p(xj; μj, σj²)
๏  Anomaly if p(x) < ε
* Anomaly Detection - Andrew Ng - Coursera ML Course
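The steps above (from Ng's Coursera course) fit a per-feature Gaussian on assumed-normal data and flag low-density points. A minimal NumPy sketch, with synthetic data and an arbitrary threshold ε chosen for illustration:

```python
import numpy as np

def fit(X):
    # Fit per-feature Gaussian parameters on (assumed normal) training data
    return X.mean(axis=0), X.var(axis=0)

def density(x, mu, var):
    # p(x) = product of independent univariate Gaussian densities
    return np.prod(np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var))

def is_anomaly(x, mu, var, eps):
    return density(x, mu, var) < eps

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(1000, 2))  # synthetic "normal" behaviour
mu, var = fit(X_train)

print(is_anomaly(np.array([0.0, 0.1]), mu, var, eps=1e-4))   # typical point -> False
print(is_anomaly(np.array([8.0, -7.0]), mu, var, eps=1e-4))  # far outlier -> True
```

In practice ε is tuned on the labeled cross-validation set described a few slides later, not hand-picked.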
• 48. armando@paytm.com - @jabenitez Algorithm Evaluation
๏  Fit model on training set
๏  On a cross validation/test example, predict
๏  Possible evaluation metrics:
๏  True positive, false positive, false negative, true negative
๏  Precision/Recall
๏  F1-score
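The metrics listed above fall straight out of the confusion-matrix counts; a minimal sketch on toy labels:

```python
def evaluate(y_true, y_pred):
    """Precision, recall, and F1 from raw 0/1 labels and predictions."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(evaluate(y_true, y_pred))  # precision, recall, f1 all ≈ 2/3 here
```

Precision/recall matter here because plain accuracy is useless when fraud is a tiny fraction of traffic: predicting "not fraud" for everything scores 99%+ accuracy and catches nothing.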
• 50. armando@paytm.com - @jabenitez Anomaly Detection*
Assume we have some labeled data of anomalous and non-anomalous examples: y = 0 if standard behaviour, y = 1 if anomalous.
Training set: (assume normal examples / not anomalous)
Cross validation set:
Test set:
* Anomaly Detection - Andrew Ng - Coursera ML Course
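The split above trains only on (assumed) normal examples and reserves every labeled anomaly for evaluation. A minimal sketch; the 60/20/20 proportions and the even split of anomalies between CV and test are conventional choices from the same course, not numbers from the slide:

```python
# Split for anomaly detection: train on normal data only,
# divide labeled anomalies between CV and test sets.
def split(normal, anomalous):
    n, a = len(normal), len(anomalous)
    train = normal[: n * 6 // 10]                        # 60% normal -> training
    cv = normal[n * 6 // 10 : n * 8 // 10] + anomalous[: a // 2]
    test = normal[n * 8 // 10 :] + anomalous[a // 2 :]
    return train, cv, test

normal = list(range(10_000))  # stand-ins for normal examples
anomalous = list(range(20))   # stand-ins for the rare labeled anomalies
train, cv, test = split(normal, anomalous)
print(len(train), len(cv), len(test))
```

Tune the threshold ε on the CV set, then report precision/recall/F1 on the untouched test set.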
• 51. armando@paytm.com - @jabenitez Transform, Normalize, Calculate
• 54. armando@paytm.com - @jabenitez The lake again
Lake Simcoe going on Lake Superior
Classic Lambda Architecture
Various Processing Frameworks
Near-Realtime Scoring/Alerting*
• 55. armando@paytm.com - @jabenitez Fraud Capabilities and Technology
A.  Batch Ingest and Analysis of transaction data from Database: Traditional ETL tools for transfer, HDFS/S3 for storage, Spark for processing
B.  Batch Behavioural and Portfolio heuristic fraud detection: Model analysis with iPython/Scala Notebook, Spark for processing, HDFS/HBase/Cassandra for storage
C.  Near-realtime anomaly and heuristic fraud detection: Kafka real-time ingest, introduce Storm/Spark Streaming for near-realtime interception of data, HBase for model/rule storage and lookup
D.  Online Model Scoring: JPMML/Spark Streaming for realtime model scoring
• 56. armando@paytm.com - @jabenitez Our framework shopping list
Explore & Train: iPython & Scala Notebooks
Ingest, Store, Score, & Act: Spark (::Core ::MLLib ::Streaming ::GraphX?), Kafka, Hadoop, HBase, Cassandra, SolrCloud, & S3
Intercept with Storm? Spark Streaming?
OpenScoring? JPMML? R?