Apache Spark and Bluemix Meetup
Jean-Baptiste Martin
July 6, 2016
Project industrialization with Apache Spark
Copyright © Capgemini 2015. All Rights Reserved
Who am I
▪ Jean-Baptiste Martin
▪ Managing Consultant at Capgemini
▪ Background: technical
▪ Big Data Analytics for 2 years
▪ Product manager, People Analytics
▪ Founder at Top Notch
Project industrialization with Apache Spark
1. Spark in People Analytics
2. Team Organization
3. Issue #1: Text Replace
4. Issue #2: Non-Serializable Objects
5. Issue #3: Unit Testing
6. Issue #4: Wall of Code
Code available at:
https://github.com/jeanbmar/meetup-spark
Spark in People Analytics
▪ What is People Analytics?
Spark in People Analytics
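[Figure: architecture diagram with four numbered steps (1 to 4). Labels: structured sources (DBMS, CSV files) and unstructured sources; data sets Employees, Candidates, Jobs; ODPi (HDFS access); HDFS (store); Analytics Engine (data reconciliation); Watson Explorer: WEX Engine (data indexing), WEX AppBuilder, visualization.]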
Team Organization
1. Prototyping:
• Technologies: Hadoop, Java, R, Watson Explorer
• Team Profiles: 4 big data dev (Java), 1 data scientist, 1 data analyst
2. Industrialization:
• Technologies: Hadoop, Java and Scala, Spark, Watson Explorer
• Team Profiles:
– 2 data scientists
– 2 software developers
– 1 sys admin
– 2 web developers
3. All along:
• Strong support from IBM (expertise, implementation, go-to-market)
Issues we faced
▪ Issue #1: Text Replace
▪ Issue #2: Non-Serializable Objects
▪ Issue #3: Unit Testing
▪ Issue #4: Wall of Code
Issue #1: Text Replace
▪ Searching and replacing text is a common task in natural language processing
« I work with WEX at Cap Gemini »
+
Cap Gemini → Capgemini
WEX → Watson Explorer
=
« I work with Watson Explorer at Capgemini »
Issue #1: Text Replace
▪ This becomes a problem when:
• There are many documents to process
• Dictionaries (synonyms, stopwords, protected words, …) contain 1000+ entries
▪ Traditional implementations:
• Loop over dictionary entries → LOW PERF AND/OR INCORRECT
• Regular expressions → LOW PERF
▪ We want: read the text once and perform transformations on the fly
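To illustrate why the dictionary loop can be incorrect as well as slow, here is a minimal sketch. It assumes the loop relies on plain substring replacement (`String.replace`); the class name `NaiveReplace` is hypothetical and does not come from the project. Each dictionary entry triggers a full pass over the text, and the entry "en" also matches inside "Fluent".

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class NaiveReplace {
    // Applies every dictionary entry in turn using plain substring matching.
    // Cost: one full pass over the text per dictionary entry.
    static String replaceAll(String text, Map<String, String> dictionary) {
        String result = text;
        for (Map.Entry<String, String> entry : dictionary.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new LinkedHashMap<>();
        dictionary.put("en", "english");
        System.out.println(replaceAll("Engineer. English. Fluent en.", dictionary));
        // prints: Engineer. English. Fluenglisht english.
    }
}
```

The word "Fluent" is corrupted because substring matching ignores word boundaries. Later entries can also rewrite the output of earlier ones, which is another source of incorrect results.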
Issue #1: Text Replace
▪ Solution
1. Expand dictionaries into HashMap objects, e.g.
W → X
WE → X
WEX → Watson Explorer
2. Read the text character by character and look up the accumulated characters in the HashMap objects
– X → the combination of characters is part of an existing entry
– null → no match
– Other → match
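The two steps above can be sketched as follows. This is an illustration under stated assumptions, not the code from the project repository: the class name `PrefixReplacer` is hypothetical, `Optional.empty()` stands in for the "X" marker, the longest match wins, and a match is only accepted on word boundaries (letters and digits count as word characters). Matching is case-sensitive.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class PrefixReplacer {
    // key -> Optional.empty(): key is only a prefix of an entry (the "X" marker)
    // key -> Optional.of(r):   key is a full entry with replacement r
    // absent key (null):       no entry starts with these characters
    private final Map<String, Optional<String>> lookup = new HashMap<>();

    PrefixReplacer(Map<String, String> dictionary) {
        for (String key : dictionary.keySet()) {
            for (int end = 1; end < key.length(); end++) {
                lookup.putIfAbsent(key.substring(0, end), Optional.empty());
            }
        }
        // Full entries are written last so they override prefix markers.
        dictionary.forEach((key, value) -> lookup.put(key, Optional.of(value)));
    }

    String replace(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int matchEnd = -1;
            String replacement = null;
            if (startsWord(text, i)) {
                StringBuilder candidate = new StringBuilder();
                for (int j = i; j < text.length(); j++) {
                    candidate.append(text.charAt(j));
                    Optional<String> hit = lookup.get(candidate.toString());
                    if (hit == null) {
                        break; // no match: stop extending the candidate
                    }
                    if (hit.isPresent() && endsWord(text, j + 1)) {
                        matchEnd = j + 1; // keep the longest match seen so far
                        replacement = hit.get();
                    }
                }
            }
            if (replacement != null) {
                out.append(replacement);
                i = matchEnd;
            } else {
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // A match may start at index 0 or right after a non-word character.
    private static boolean startsWord(String text, int index) {
        return index == 0 || !Character.isLetterOrDigit(text.charAt(index - 1));
    }

    // A match may end at the end of the text or right before a non-word character.
    private static boolean endsWord(String text, int end) {
        return end == text.length() || !Character.isLetterOrDigit(text.charAt(end));
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("Cap Gemini", "Capgemini");
        dictionary.put("WEX", "Watson Explorer");
        System.out.println(new PrefixReplacer(dictionary).replace("I work with WEX at Cap Gemini"));
        // prints: I work with Watson Explorer at Capgemini
    }
}
```

The scan only backtracks over characters that started a candidate which did not end in a match, so the cost per character depends on the length of the entries rather than on the number of dictionary entries.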
Issue #1: Text Replace
▪ Case 1:
• Have: “Engineer. English. Fluent en.”
• Want: “Engineer. English. Fluent english.”
▪ Case 2:
• Have: “Cap Gemini consultant and Big Data developer with strong xp on Hadoop, mostly Hadoop FS. BI background (DataStage, Cognos, Oracle, DB2). Worked on multiple Watson technologies, including Watson API and WEX.”
• Dictionary, 875 entries including:
Cap Gemini → Capgemini
Hadoop FS → HDFS
DataStage → IBM DataStage
Cognos → IBM Cognos
DB2 → IBM DB2
WEX → Watson Explorer
Issue #2: Non-Serializable Objects
▪ Sometimes, people need to use external libraries to perform specific transformations on objects
▪ Example: perform NLP transformations with Apache OpenNLP
▪ Problem:
• OpenNLP objects are not serializable → No broadcast
• OpenNLP objects take time to initialize → Never-ending closures
• We don’t want to convert OpenNLP source code (actually we tried)
Issue #2: Non-Serializable Objects
▪ Solution: Initialize singletons and bind them to Spark tasks using Java ThreadLocal
[Code screenshot not reproduced. Callouts: singleton class; bind singleton to task thread; will be called in closure]
Issue #2: Non-Serializable Objects
▪ Then call the transformation in closures:
[Code screenshot not reproduced. Callouts: retrieve holder from current task; get singleton object; get SimpleClass object]
▪ Benefits: objects are initialized only once per task instead of once per RDD element
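A minimal sketch of the ThreadLocal holder pattern, written without a Spark dependency so that it runs on its own. All names are hypothetical: `ExpensiveTokenizer` stands in for a slow-to-build, non-serializable library object, a plain thread stands in for a Spark task, and a Java stream stands in for the elements of a partition. The slides' actual classes are in the linked repository and may differ.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

// Stand-in for a heavy, non-serializable library object (e.g. an NLP model).
final class ExpensiveTokenizer {
    static final AtomicInteger INSTANCES = new AtomicInteger();

    ExpensiveTokenizer() {
        INSTANCES.incrementAndGet(); // imagine a slow model load here
    }

    String[] tokenize(String sentence) {
        return sentence.trim().split("\\s+");
    }
}

// Holder that binds one tokenizer to each thread. Nothing here is serialized:
// the static field is initialized in every JVM that loads the class.
final class TokenizerHolder {
    private static final ThreadLocal<ExpensiveTokenizer> LOCAL =
            ThreadLocal.withInitial(ExpensiveTokenizer::new);

    private TokenizerHolder() { }

    static ExpensiveTokenizer get() {
        return LOCAL.get();
    }
}

final class ThreadLocalDemo {
    // Shape of a closure body: it references the static holder,
    // never the non-serializable object itself.
    static int countTokens(String sentence) {
        return TokenizerHolder.get().tokenize(sentence).length;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> partition = Arrays.asList("I work with WEX", "at Cap Gemini");
        Thread task = new Thread(() -> {
            List<Integer> counts = partition.stream()
                    .map(ThreadLocalDemo::countTokens)
                    .collect(Collectors.toList());
            System.out.println(counts + " with "
                    + ExpensiveTokenizer.INSTANCES.get() + " instance(s)");
        });
        task.start();
        task.join();
        // prints: [4, 3] with 1 instance(s)
    }
}
```

In a Spark job, `countTokens` would be the function passed to a transformation such as `map`. Because a ThreadLocal is tied to a thread rather than to a task, an instance may be reused by later tasks scheduled on the same executor thread, which is usually harmless as long as the object keeps no per-task state.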
Issue #3: Unit Testing
▪ One major step when moving from prototype to production is to define a proper testing strategy
▪ How people usually test (non-exhaustive):
1. They run everything on the cluster o/
2. They use a local context
▪ What we did:
• Use a local context
▪ Problem: jobs grab content from HDFS using Oozie job.properties
▪ Solution: set up a flexible configuration that operates seamlessly both on the cluster and locally
Issue #3: Unit Testing
▪ What it looks like:
[Code screenshot not reproduced. Callouts: class applying a set of transformations; this grabs files on HDFS, can’t be used locally]
Issue #3: Unit Testing
▪ How can we operate seamlessly with a remote or local job.properties?
Using this ConfigHelper class
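One way such a configuration helper could be structured is sketched below. This is not the ConfigHelper from the repository: the names `ResourceOpener`, `LocalOpener`, `JobConfig` and the property key used in the usage note are all hypothetical. The idea is that the job reads every location from job.properties and opens it through a pluggable opener, so a test can point the same code at local files.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical abstraction over "where do the bytes come from".
interface ResourceOpener {
    InputStream open(String location) throws IOException;
}

// Opener used by local tests: reads from the local file system.
final class LocalOpener implements ResourceOpener {
    @Override
    public InputStream open(String location) throws IOException {
        return Files.newInputStream(Paths.get(location));
    }
}

// Loads job.properties through the opener and resolves resources the same way.
final class JobConfig {
    private final Properties properties = new Properties();
    private final ResourceOpener opener;

    JobConfig(ResourceOpener opener, String propertiesLocation) throws IOException {
        this.opener = opener;
        try (InputStream in = opener.open(propertiesLocation)) {
            properties.load(in);
        }
    }

    String get(String key) {
        String value = properties.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        return value;
    }

    // Opens the resource whose location is stored under the given key.
    InputStream openResource(String key) throws IOException {
        return opener.open(get(key));
    }
}
```

A test would build `new JobConfig(new LocalOpener(), "src/test/resources/job.properties")` (a hypothetical path) and read, say, a `dictionary.path` entry from it. On the cluster, a second `ResourceOpener` backed by the Hadoop `FileSystem` API would be passed in instead, leaving the transformation classes unchanged.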
Issue #3: Unit Testing
[Code screenshot not reproduced. Callouts: call conf; grab on FS]
Issue #3: Unit Testing
▪ Finally, our test:
Issue #4: Wall of Code
▪ Object-oriented modeling doesn’t apply well in Spark
▪ As a result, we tend to write huge functions with tons of transformations
▪ People Analytics V0.01alpha: 1 class
▪ How we managed this:
▪ We grouped consistent sets of transformations into functional classes
[Code screenshot not reproduced. Callouts: functional class; consecutive operations grouped in the run method]
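The grouping described above might look like the sketch below. It uses `java.util.stream.Stream` as a stand-in for an RDD so that it runs without Spark, and the names `Step`, `CleanText`, `Tokenize` and `Pipeline` are hypothetical rather than taken from the project. Each functional class owns one coherent set of consecutive operations behind a single `run` method, which keeps the job entry point short and lets each class be tested separately.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// One functional class = one named, testable group of consecutive operations.
interface Step<I, O> {
    Stream<O> run(Stream<I> input);
}

// Normalizes raw lines: trims them, drops empty ones, lower-cases the rest.
final class CleanText implements Step<String, String> {
    @Override
    public Stream<String> run(Stream<String> input) {
        return input.map(String::trim)
                    .filter(line -> !line.isEmpty())
                    .map(line -> line.toLowerCase(Locale.ROOT));
    }
}

// Splits each cleaned line into whitespace-separated tokens.
final class Tokenize implements Step<String, List<String>> {
    @Override
    public Stream<List<String>> run(Stream<String> input) {
        return input.map(line -> Arrays.asList(line.split("\\s+")));
    }
}

// The job entry point only wires the steps together.
final class Pipeline {
    static List<List<String>> process(List<String> documents) {
        Stream<String> cleaned = new CleanText().run(documents.stream());
        return new Tokenize().run(cleaned).collect(Collectors.toList());
    }
}
```

With Spark, the same shape would take and return RDDs (or Datasets) instead of streams, and each class could be exercised in a unit test with a local context.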
Thank You
Credits:
jean-baptiste.martin@capgemini.com
jerome.delvigne@capgemini.com
Code available at:
https://github.com/jeanbmar/meetup-spark
The information contained in this presentation is proprietary.
Copyright © 2015 Capgemini. All rights reserved.
Rightshore® is a trademark belonging to Capgemini.
www.capgemini.com
About Capgemini
With 180,000 people in over 40 countries, Capgemini is one of
the world's foremost providers of consulting, technology and
outsourcing services. The Group reported 2014 global revenues
of EUR 10.573 billion.
Together with its clients, Capgemini creates and delivers
business, technology and digital solutions that fit their needs,
enabling them to achieve innovation and competitiveness. A
deeply multicultural organization, Capgemini has developed its
own way of working, the Collaborative Business Experience™,
and draws on Rightshore®, its worldwide delivery model.
Learn more about us at www.capgemini.com.
