Joining the Club: Using Spark to Accelerate Big Data at Dollar Shave Club

Abstract:

Data engineering at Dollar Shave Club has grown significantly over the last year. In that time, it has expanded in scope from conventional web analytics and business intelligence to include real-time, big data, and machine learning applications. We bootstrapped a dedicated data engineering team in parallel with developing a new category of capabilities, and the business value we delivered early on allowed us to forge new roles for our data products and services in developing and carrying out business strategy. This progress was made possible, in large part, by adopting Apache Spark as an application framework. This talk describes what we have been able to accomplish using Spark at Dollar Shave Club.

Bio:

Brett Bevers, Ph.D., is a backend engineer who leads the data engineering team at Dollar Shave Club. More importantly, he is an ex-academic driven to understand and tackle hard problems. His latest challenge has been to develop tools powerful enough to support data-driven decision making in high-value projects.

1. Joining The Club: Accelerating Big Data with Apache Spark (Dollar Shave Club)
2. Outline: Background on DSC; Engineering at DSC; Growth of Data Team; Show & Tell: Machine Learning Pipeline
3. Introduction: A David and Goliath Story
4. Goliath (figure: growth of new members)
5. Engineering at DSC
   • Frontend: Ember.js web apps; iOS and Android apps; HTML email
   • Backend: Ruby on Rails web backends; internal services (Ruby, Node.js, Golang, Python, Elixir); data and analytics (Python, SQL, Spark)
   • QA: CircleCI, SauceLabs, Jenkins; TestUnit, Selenium
   • IT: office and warehouse IT
6. Engineering at DSC (as covered on highscalability.com)
7. Data Engineering at DSC: A David and Big Data Story
8-11. Big Data: What is the barrier to entry?
   • Requires a different set of capabilities
   • Investing resources without an obvious ROI
   • Knowing where to start
12. Good Foundations
13. Data Engineering
   • Machine learning pipeline
   • Models served in production
   • Exploratory analysis
   • Customer segmentation (clustering)
   • Hypothesis testing
   • Data mining
   • NLP (topic modeling)
14. Data Engineering
   • Maxwell + Kafka + Spark Streaming
   • Streaming data replication
   • Streaming metrics directly from the data layer
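The deck names the Maxwell + Kafka + Spark Streaming stack but shows no code for it. A minimal sketch of consuming Maxwell's change events (MySQL binlog rows published to Kafka as JSON) with the Spark 1.x-era Python streaming API, which matches the deck's sqlContext usage; the topic, broker address, and metric are assumptions:

    import json
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="maxwell-replication")
    ssc = StreamingContext(sc, 10)  # 10-second micro-batches

    # Maxwell writes one JSON document per row change; "maxwell" is its default topic.
    stream = KafkaUtils.createDirectStream(
        ssc, ["maxwell"], {"metadata.broker.list": "kafka:9092"}
    )

    # Each message is a (key, value) pair; parse the value and count changes per
    # table as a stand-in for "streaming metrics directly from the data layer".
    changes = stream.map(lambda kv: json.loads(kv[1]))
    per_table = changes.map(lambda c: (c["table"], 1)).reduceByKey(lambda a, b: a + b)
    per_table.pprint()

    ssc.start()
    ssc.awaitTermination()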
15. Anatomy of a Machine Learning Pipeline
16-18. Box Manager Email
   Problem: order the product tiles in the "Box Manager Email" to maximize profit.
   Constraints:
   • Every customer sees some ordered set of products
   • Do not show products already added to the box
   Result: +25% revenue per email open
19. Strategy
   For each product, model the behavior which best distinguishes someone who buys that product from someone who buys other products; rank a product by the strength of the indicative behavior, when present, and rank it randomly otherwise.
   Model:
   • Logistic regression
   • Learns the "tipping point" between success and failure
   • Success = "buys product X"
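For concreteness, a minimal sketch of fitting such a per-product model with Spark ML (toy data, not the production pipeline; Spark 1.x-era imports to match the deck):

    from pyspark.ml.classification import LogisticRegression
    from pyspark.mllib.linalg import Vectors  # ml.linalg in Spark 2.x+

    # Toy training set: label 1.0 = "bought product X"; features = behavioral counts.
    train = sqlContext.createDataFrame(
        [(1.0, Vectors.dense([3.0, 0.0, 1.0])),
         (0.0, Vectors.dense([0.0, 2.0, 0.0]))],
        ["label", "features"])

    lr = LogisticRegression(maxIter=50, regParam=0.01)
    model = lr.fit(train)

    # P(buys product X) is the ranking signal for the email tiles.
    model.transform(train).select("probability").show()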
20. Design
   • Extract data from the data warehouse (Redshift)
   • Join that data with hand-curated metadata (knowledge base)
   • Aggregate and pivot events by customer and discretized time
   • Generate a training set of feature vectors
   • Select features to include in the final model
   • Train and productionize the final model
21. Extract

    def performExtraction(
        extractorClass, exportName, join_table=None, join_key_col=None,
        start_col=None, include_start_col=True, event_start_date=None
    ):
        customer_id_col = extractorClass.customer_id_col
        timestamp_col = extractorClass.timestamp_col
        extr_args = extractorArgs(
            customer_id_col, timestamp_col, join_table, join_key_col,
            start_col, include_start_col, event_start_date
        )
        extractor = extractorClass(**extr_args)
        export_path = redshiftExportPath(exportName)
        return extractor.exportFromRedshift(export_path)  # writes to Parquet
22. Extract

    def exportFromRedshift(self, path):
        export = self.exportDataFrame()
        writeParquetWithRetry(export, path)
        # Read the export back and keep it cached for downstream stages.
        return sqlContext.read.parquet(path) \
            .persist(StorageLevel.MEMORY_AND_DISK)

    def exportDataFrame(self):
        query = self.generateQuery()
        # urlOption and tempdir are connection/config values defined elsewhere.
        return sqlContext.read \
            .format("com.databricks.spark.redshift") \
            .option("url", urlOption) \
            .option("query", query) \
            .option("tempdir", tempdir) \
            .load()
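A hypothetical call into this extraction layer might look as follows; OrderEventExtractor and the argument values are invented for illustration, only performExtraction itself appears in the deck:

    # Hypothetical: extract order events, joined against a products table.
    ordersDF = performExtraction(
        OrderEventExtractor,      # subclass defining customer_id_col, timestamp_col
        exportName="order_events",
        join_table="products",
        join_key_col="product_id",
        event_start_date="2015-01-01"
    )
    ordersDF.printSchema()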
23. Domain Knowledge is Critical
   The way that an expert organizes and represents facts in their domain:
   • Guides feature extraction
   • Prevents overfitting
   • Vastly superior to unsupervised feature extraction (e.g., PCA)
24-25. Aggregate (Shard, Compress, Join) and Pivot!
   This dance is hard to choreograph:
   • 8,736 columns
   • 2.6 million rows
   • The DataFrame API is not optimized for extremely wide datasets
26. Aggregate (Shard, Compress, Join) and Pivot!

    def generateQuery(self):
        return """
            {0}
            FROM {1}
            GROUP BY customer_id, {2}, {3}, {4}
        """.format(
            self.selectClause(), self._tempTableName, self.bucketingExpr(),
            self.timestampCol, self.startDateExpr
        )

    def perform(self):
        self.preprocessedDataFrame().registerTempTable(self._tempTableName)
        return sqlContext.sql(self.generateQuery())
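Rendered with hypothetical values, the generated query would look roughly like the following; the column and table names are invented, and the actual SELECT clause and bucketing expression come from selectClause() and bucketingExpr():

    SELECT customer_id, DATE_TRUNC('month', created_at) AS month, COUNT(*) AS event_count
    FROM tmp_order_events
    GROUP BY customer_id, DATE_TRUNC('month', created_at), created_at, start_date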
27-28. Aggregate (Shard, Compress, Join) and Pivot!
   Dense:  (0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,0)
   Sparse: ( 18, (6,16), (1,2) )
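Reading: the second form compresses the first, a vector of length 18 whose only non-zero entries are a 1 at index 6 and a 2 at index 16. In MLlib terms (a sketch, not the deck's code):

    from pyspark.mllib.linalg import Vectors

    dense = Vectors.dense([0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,0])
    sparse = Vectors.sparse(18, [6, 16], [1.0, 2.0])  # (size, indices, values)
    assert sparse.toArray().tolist() == list(dense.toArray())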
29. Aggregate (Shard, Compress, Join) and Pivot!

    def perform(self):
        keyedMonthlyEvents = self.dataFrame.map(self.keyRow())
        pivotRDD = keyedMonthlyEvents \
            .combineByKey(
                self.initPivot(), self.pivotEvent(), self.combineDicts()
            ) \
            .map(self.convertToRow()) \
            .persist(StorageLevel.MEMORY_AND_DISK)
        return sqlContext.createDataFrame(pivotRDD, self.pivotedSchema())
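The three callbacks passed to combineByKey are not shown in the deck. One plausible shape for them, as methods on the same class, assuming each keyed event carries a column index and a count (all names hypothetical): build a {column_index: count} dict per customer, then merge dicts across partitions.

    def initPivot(self):
        # createCombiner: first event for a key seeds the dict.
        return lambda event: {event.column_index: event.count}

    def pivotEvent(self):
        # mergeValue: fold another event for the same key into the dict.
        def merge(acc, event):
            acc[event.column_index] = acc.get(event.column_index, 0) + event.count
            return acc
        return merge

    def combineDicts(self):
        # mergeCombiners: merge per-partition dicts for the same key.
        def combine(a, b):
            for k, v in b.items():
                a[k] = a.get(k, 0) + v
            return a
        return combine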
30. Aggregate (Compress, Shard, Join) and Pivot!
31-34. Featurize
   • "Explode" each customer's history into several "windows" of time
   • Define one or more prediction targets
   • Standardize each historical feature
   • Persist on S3 as text files of compressed sparse vectors
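A sketch of the last two steps using MLlib utilities, assuming `features` is an RDD of (label, vector) pairs produced by the windowing step; the S3 path is hypothetical, and LibSVM is a common text encoding for compressed sparse vectors, though the deck does not name the exact format:

    from pyspark.mllib.feature import StandardScaler
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.util import MLUtils

    # Standardize without centering so the vectors stay sparse.
    scaler = StandardScaler(withMean=False, withStd=True).fit(features.values())
    scaled = features.keys().zip(scaler.transform(features.values()))

    points = scaled.map(lambda lv: LabeledPoint(lv[0], lv[1]))
    MLUtils.saveAsLibSVMFile(points, "s3://dsc-data/featurized")  # hypothetical path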
35-41. Select Features
   1. Randomly select a set of new features to test
   2. Derive training set for new features + previously selected features
   3. Train model
   4. Calculate the p-value for each feature
   5. Retain significant features
   6. Repeat
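Slides 35-41 describe an iterative, significance-based forward selection. In pseudocode (the helpers are stand-ins; Spark's logistic regression does not report coefficient p-values out of the box, so that step is assumed to live in a helper):

    # Sketch of the selection loop; every helper and threshold is hypothetical.
    selected = set()
    while candidates:
        trial = selected | random_sample(candidates, k=50)   # 1. random new features
        train = derive_training_set(trial)                   # 2. project vectors onto trial set
        model = fit_logistic_regression(train)               # 3. train model
        pvalues = coefficient_pvalues(model, train)          # 4. p-value per feature
        selected = {f for f in trial if pvalues[f] < 0.05}   # 5. retain significant features
        candidates -= trial                                  # 6. repeat with the rest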
42. Production Model
   • Spark ML makes parameter tuning easy
   • Reusable modules!
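"Parameter tuning" in Spark ML usually means ParamGridBuilder plus CrossValidator; a minimal sketch over the regularization parameters (grid values are arbitrary, and `train` is the training DataFrame from the featurization steps):

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

    lr = LogisticRegression()
    grid = (ParamGridBuilder()
            .addGrid(lr.regParam, [0.001, 0.01, 0.1])
            .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
            .build())

    cv = CrossValidator(estimator=lr,
                        estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(),
                        numFolds=3)
    bestModel = cv.fit(train).bestModel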
43. brett.bevers@dollarshaveclub.com
   http://app.jobvite.com/m?33KSgiwI