1. Apache Spark and Bluemix Meetup
Jean-Baptiste Martin
July 6, 2016
Project industrialization with Apache Spark
2. Copyright © Capgemini 2015. All Rights Reserved
Who am I
Jean-Baptiste Martin
Managing Consultant at Capgemini
Background: technical
Big Data Analytics for 2 years
Product manager People Analytics
Founder at Top Notch
3.
Project industrialization with Apache Spark
1. Spark in People Analytics
2. Team Organization
3. Issue #1: Text Replace
4. Issue #2: Non-Serializable Objects
5. Issue #3: Unit Testing
6. Issue #4: Wall of Code
Code available at:
https://github.com/jeanbmar/meetup-spark
5.
Spark in People Analytics
[Architecture diagram with four numbered steps. Labels:]
• Unstructured data
• Structured data: DBMS, CSV files
• Employees, Candidates, Jobs
• HDFS: store
• ODPi: HDFS access
• Analytics Engine: data reconciliation
• Watson Explorer (WEX): AppBuilder, WEX Engine; data indexing, visualization
6.
Team Organization
1. Prototyping:
• Technologies: Hadoop, Java, R, Watson Explorer
• Team Profiles: 4 big data dev (Java), 1 data scientist, 1 data analyst
2. Industrialization:
• Technologies: Hadoop, Java and Scala, Spark, Watson Explorer
• Team Profiles:
– 2 data scientists
– 2 software developers
– 1 sys admin
– 2 web developers
3. All along:
• Strong support from IBM (expertise, implementation, go-to-market)
7.
Issues we faced
Issue #1: Text Replace
Issue #2: Non-Serializable Objects
Issue #3: Unit Testing
Issue #4: Wall of Code
8.
Issue #1: Text Replace
Finding and replacing text is a common operation in natural language processing
Input: “I work with WEX at Cap Gemini”
Dictionary:
Cap Gemini → Capgemini
WEX → Watson Explorer
Output: “I work with Watson Explorer at Capgemini”
9.
Issue #1: Text Replace
This becomes a problem when:
• There are many documents to process
• Dictionaries (synonyms, stopwords, protected words, …) contain 1000+ entries
Traditional implementations:
• Loop over dictionary entries → low performance and/or incorrect results
• Regular expressions → low performance
Goal: read the text once and apply the transformations on the fly
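To make the "incorrect" part concrete, here is a minimal sketch of the loop-over-entries approach using plain substring replacement. The class name NaiveReplace is hypothetical, and the sample sentence and the en → english entry are borrowed from Case 1 of this deck. This is an illustration, not the project's original code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class NaiveReplace {
    // Applies every dictionary entry in turn with plain substring replacement.
    static String replaceAll(String text, Map<String, String> dictionary) {
        String result = text;
        for (Map.Entry<String, String> entry : dictionary.entrySet()) {
            // One full pass over the text per entry; also matches inside longer words.
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new LinkedHashMap<>();
        dictionary.put("en", "english");
        // Prints "Engineer. English. Fluenglisht english."
        System.out.println(replaceAll("Engineer. English. Fluent en.", dictionary));
    }
}
```

The text is rescanned once per entry, so the cost grows with the dictionary size, and "Fluent" is corrupted because the entry also matches inside a word.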
10.
Issue #1: Text Replace
Solution
1. Expand dictionaries into HashMap objects, e.g. for the entry WEX → Watson Explorer:
W → X
WE → X
WEX → Watson Explorer
2. Read the text character by character and look up the characters read so far in the HashMap objects:
– X: the combination of characters is part of an existing word
– null: no match
– Other: match
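The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the repository: the class name DictionaryReplacer, the PARTIAL marker (standing in for the X on the slide), the longest-match rule, and the word-boundary check are all choices made for this example.

```java
import java.util.HashMap;
import java.util.Map;

final class DictionaryReplacer {
    // Marker for "these characters start a longer entry" (the X on the slide).
    // A distinct String instance, so it can be told apart by identity.
    private static final String PARTIAL = new String("X");

    private final Map<String, String> table = new HashMap<>();

    DictionaryReplacer(Map<String, String> dictionary) {
        for (Map.Entry<String, String> entry : dictionary.entrySet()) {
            String key = entry.getKey();
            // Register every proper prefix without overwriting a full entry.
            for (int end = 1; end < key.length(); end++) {
                table.putIfAbsent(key.substring(0, end), PARTIAL);
            }
            table.put(key, entry.getValue());
        }
    }

    String replace(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int pos = 0;
        while (pos < text.length()) {
            int matchEnd = -1;
            String replacement = null;
            if (isBoundary(text, pos)) { // entries may only start at a word boundary
                for (int end = pos + 1; end <= text.length(); end++) {
                    String value = table.get(text.substring(pos, end));
                    if (value == null) {
                        break; // no entry starts with these characters
                    }
                    if (value != PARTIAL && isBoundary(text, end)) {
                        matchEnd = end; // remember the longest full match so far
                        replacement = value;
                    }
                }
            }
            if (replacement != null) {
                out.append(replacement);
                pos = matchEnd; // replaced output is never rescanned
            } else {
                out.append(text.charAt(pos));
                pos++;
            }
        }
        return out.toString();
    }

    // False only when position i falls between two letters or digits.
    private static boolean isBoundary(String text, int i) {
        boolean before = i > 0 && Character.isLetterOrDigit(text.charAt(i - 1));
        boolean after = i < text.length() && Character.isLetterOrDigit(text.charAt(i));
        return !(before && after);
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("Cap Gemini", "Capgemini");
        dictionary.put("WEX", "Watson Explorer");
        // Prints "I work with Watson Explorer at Capgemini"
        System.out.println(new DictionaryReplacer(dictionary).replace("I work with WEX at Cap Gemini"));
    }
}
```

The number of lookups at each position is bounded by the longest dictionary entry rather than by the number of entries, which is what makes large dictionaries affordable.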
11.
Issue #1: Text Replace
Case 1:
• Have: “Engineer. English. Fluent en.”
• Want: “Engineer. English. Fluent english.”
Case 2:
• Have: “Cap Gemini consultant and Big Data developer with strong xp on Hadoop,
mostly Hadoop FS. BI background (DataStage, Cognos, Oracle, DB2). Worked on
multiple Watson technologies, including Watson API and WEX.”
• Dictionary, 875 entries including:
Cap Gemini → Capgemini
Hadoop FS → HDFS
DataStage → IBM DataStage
Cognos → IBM Cognos
DB2 → IBM DB2
WEX → Watson Explorer
12.
Issue #2: Non-Serializable Objects
Sometimes an external library is needed to perform specific
transformations on objects
Example: NLP transformations with Apache OpenNLP
Problem:
• OpenNLP objects are not serializable → no broadcast
• OpenNLP objects take time to initialize → never-ending closures
• We don’t want to convert the OpenNLP source code (we actually tried)
13.
Issue #2: Non-Serializable Objects
Solution: initialize singletons and bind them to Spark tasks using Java
ThreadLocal
[Code slide; annotations:]
• Singleton class
• Bind singleton to task thread
• Will be called in closure
14.
Issue #2: Non-Serializable Objects
Then call the transformation in closures:
[Code slide; annotations:]
• Retrieve holder from current task
• Get singleton object
• Get SimpleClass object
Benefit: objects are initialized only once per task instead of once per RDD element
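The holder pattern described on these two slides can be sketched in plain Java. This is a simplified illustration under assumptions, not the repository code: SimpleClass stands in for a costly non-serializable object such as an OpenNLP model, SimpleClassHolder and HolderDemo are hypothetical names, and an ordinary thread plays the role of a Spark task so the example runs without a Spark dependency.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for a costly, non-serializable library object.
final class SimpleClass {
    static final AtomicInteger INSTANCES = new AtomicInteger();

    SimpleClass() {
        INSTANCES.incrementAndGet(); // imagine a slow model load here
    }

    String transform(String input) {
        return input.toUpperCase(Locale.ROOT);
    }
}

// Keeps one SimpleClass per thread; the object itself is never serialized.
final class SimpleClassHolder {
    private static final ThreadLocal<SimpleClass> INSTANCE =
            ThreadLocal.withInitial(SimpleClass::new);

    private SimpleClassHolder() { }

    static SimpleClass get() {
        return INSTANCE.get(); // created on first use by the calling thread
    }
}

final class HolderDemo {
    // Plays the role of one task: processes a partition on its own thread.
    static List<String> runTask(List<String> partition) {
        List<String> out = new ArrayList<>();
        Thread task = new Thread(() -> {
            for (String element : partition) {
                // This call is what would sit inside the closure.
                out.add(SimpleClassHolder.get().transform(element));
            }
        });
        task.start();
        try {
            task.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return out;
    }
}
```

Calling HolderDemo.runTask on a three-element list constructs SimpleClass once, not three times, because every element processed by the same thread reuses the instance held in the ThreadLocal.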
15.
Issue #3: Unit Testing
A major step when moving from prototype to production is defining a proper
testing strategy
Common ways to run tests (non-exhaustive):
1. Run everything on the cluster
2. Use a local context
What we did:
• Use a local context
Problem: jobs grab content from HDFS using Oozie job.properties
Solution: set up a flexible configuration that works seamlessly both on the cluster and locally
16.
Issue #3: Unit Testing
What it looks like:
[Code slide; annotations:]
• Class applying a set of transformations
• This grabs files on HDFS and cannot be used locally
17.
Issue #3: Unit Testing
How can we operate seamlessly with a remote or local job.properties?
By using a ConfigHelper class
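One possible shape for such a helper is sketched below. Only the name ConfigHelper comes from the slide. The Opener interface, the LOCAL opener, the get method, and the inputPath key used afterwards are assumptions made for illustration, and the actual class in the repository may look quite different. The idea is that the job asks the helper for its settings, and only the way job.properties is opened changes between cluster and local runs.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

final class ConfigHelper {
    // Opens a properties source. A cluster run would plug in an opener
    // that reads from HDFS (not shown, since it needs the Hadoop client).
    interface Opener {
        InputStream open(String location) throws IOException;
    }

    // Opener for local runs and unit tests: reads from the local file system.
    static final Opener LOCAL = location -> Files.newInputStream(Paths.get(location));

    private final Properties properties = new Properties();

    ConfigHelper(String location, Opener opener) {
        try (InputStream in = opener.open(location)) {
            properties.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot read " + location, e);
        }
    }

    // Returns the value for a key, failing fast when the key is missing.
    String get(String key) {
        String value = properties.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        return value;
    }
}
```

A test would build the helper with new ConfigHelper("src/test/resources/job.properties", ConfigHelper.LOCAL) (a hypothetical path) and pass values such as get("inputPath") to the transformation classes, so those classes never open HDFS themselves.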
20.
Issue #4: Wall of Code
Object-oriented modeling does not apply well in Spark
As a result, we tend to write huge functions with tons of transformations
People Analytics V0.01alpha: 1 class
How we managed this:
We grouped consistent sets of transformations into functional classes
[Code slide; annotations:]
• Functional class
• Consecutive operations of the class in its run method
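The grouping idea can be sketched as follows. The Step interface and the CleanText, CountWords and Pipeline classes are hypothetical names chosen for this example, and java.util.List with streams stands in for RDDs so the code runs without Spark. Each class holds one coherent set of consecutive transformations behind a single run method.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// One coherent group of transformations behind a single run method.
interface Step<I, O> {
    O run(I input);
}

// Trims lines, drops empty ones, and lowercases the rest.
final class CleanText implements Step<List<String>, List<String>> {
    @Override
    public List<String> run(List<String> lines) {
        return lines.stream()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .map(line -> line.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }
}

// Counts whitespace-separated words across all lines.
final class CountWords implements Step<List<String>, Integer> {
    @Override
    public Integer run(List<String> lines) {
        return lines.stream()
                .mapToInt(line -> line.split("\\s+").length)
                .sum();
    }
}

final class Pipeline {
    // The job itself shrinks to a short chain of named steps.
    static int wordCount(List<String> rawLines) {
        List<String> cleaned = new CleanText().run(rawLines);
        return new CountWords().run(cleaned);
    }
}
```

Each step can then be unit tested on its own with a small in-memory input, which ties back to the local-context testing approach of Issue #3.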
21.
Thank You
Credits:
jean-baptiste.martin@capgemini.com
jerome.delvigne@capgemini.com
Code available at:
https://github.com/jeanbmar/meetup-spark
22. The information contained in this presentation is proprietary.
Copyright © 2015 Capgemini. All rights reserved.
Rightshore® is a trademark belonging to Capgemini.
www.capgemini.com
About Capgemini
With 180,000 people in over 40 countries, Capgemini is one of
the world's foremost providers of consulting, technology and
outsourcing services. The Group reported 2014 global revenues
of EUR 10.573 billion.
Together with its clients, Capgemini creates and delivers
business, technology and digital solutions that fit their needs,
enabling them to achieve innovation and competitiveness. A
deeply multicultural organization, Capgemini has developed its
own way of working, the Collaborative Business Experience™,
and draws on Rightshore®, its worldwide delivery model.
Learn more about us at www.capgemini.com.