SlideShare ist ein Scribd-Unternehmen logo
1 von 34
Downloaden Sie, um offline zu lesen
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
May, 2015
Douglas Eisenstein - Advanti
Stanislav Seltser - Advanti
BOSTON 2015
@opendatasci
O P E N
D A T A
S C I E N C E
C O N F E R E N C E_
Spark, Python, and Parquet
Learn How to Use Spark, Python, and Parquet for
Loading and Transforming Data in 45 Minutes
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Agenda
Use Case Background
What’s Spark and Parquet?
Demo: code + data = fun part
Questions
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
TAKEAWAYS: WHAT WILL YOU LEARN TODAY?
Learn new technologies and be motivated to start using them
3	
  
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
In about the same time it will take you to mow your lawn…
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Takeaways: you will learn …
1.  What is Spark and Parquet and why to consider it?
2.  A live demo showing you 5 useful transforms in Spark
3.  Instructions for DIY cluster: Spark, Hadoop, Hive, Parquet, and C*
4.  “Take home” fund holdings open dataset from Morningstar
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
USE CASE: DATASET, TRADEOFFS, SOLUTION
What’s our use case? Why choose Spark?
6	
  
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Holdings Data
“The contents of an investment portfolio held by an individual or entity such as a mutual
fund or pension fund.” – Investopedia 2015
Sources:
o  3rd Parties: Morningstar, Lipper, FactSet, Bloomberg, Thomson Reuters
o  Internal: Specialized portfolio accounting systems per asset type
o  Public: SEC 13F’s when AUM is >= $100M, includes hedge funds
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Use Case: Challenges
1.  Millions of files, Reshape data, pre-aggregate holdings,
disambiguate securities, unstack/stack data
2.  Create wide tables (think columnar) for storing time series data
about 1M instruments over 12 months for 381k funds
3.  Rules are abstracted to work on “holdings” making them vendor-
agnostic and reusable (13F’s, 3rd parties, proprietary, etc)
4.  All sorts of “data usability” issues: missing positions, missing
identifiers, sparse derivative data, irregular fund reporting, etc
5.  Need to create your own “ownership views” by asset class and
period (ex. Fixed Income Ownership View)
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Open Dataset Description
Key Points
o  Holdings are electively contributed for fresher ranks
o  Long and Short positions included, unique…
o  Datasets: free open data, trial data, paid licensed data
o  Our dataset starts in 2014-01 across all fund types
Descriptive Statistics
o  History starts in 2000+
o  Fund types: Open, Closed, ETFs, SMAs
o  381k funds since 2014
o  1M unique securities since 2014
o  82 legal types, ex Corporate Bond or Equity Swap
Open Dataset: open-ended funds for report period 2014-06-10
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Common Usages
1.  Ownership: Top N largest long/short positions per asset class
2.  Flow: Compute holdings flow across various financial assets
3.  Comparison: Peer group analysis through holdings attribution
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
What’s the best part about starting a new project? You get to “play”
with new tech toys right? But which ones….?
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Data Processing Landscape
o  f
Too
many
choices
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Modern General Purpose Data Pipeline
Why:
1.  Spark for in-memory transforms/analytics and fast exploration
2.  Parquet for fast columnar data retrieval
3.  Python as the glue layer and to re-use data transforms
Data Pipeline:
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Granular Options
Remember… Technology moves quickly!
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
SPARK: WHAT IS IT?
Everyone hears about it, but what is Spark?
15	
  
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
If you do any Googling, don’t search for Spark 1.4
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
What is Spark?
Highlights
o  General purpose data processing engine
o  In-memory data persistence
o  Interactive shells in Scala and Python
Key Concepts
o  RDD: Collections of objects stored in RAM or on Disk
o  Transforms: map(), filter(), reduceByKey(), join()
o  Actions: count(), collect(), save()
Spark SQL
o  Runs SQL or HQL queries
o  In-line SQL UDF
o  Distributed DataFrame
o  Parquet, Hive, Cassandra
Graphic	
  sourced:	
  hAps://goo.gl/fpLrSs	
  
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
What is a Spark DataFrame?
What’s DataFrame?
•  A collection of rows organized into named columns
•  Expressive language for wrangling data into something usable
•  Ability to select, filter, and aggregate structured data
What’s a Spark DataFrame?
•  Distributed collection of data organized into named columns
•  Conceptually equivalent to a table in a relational database
•  Relational data processing: project, filter, aggregate, join, etc
•  Operations: groupBy(),
join(), sql(),
unique()
•  Create UDF’s and push them
into SparkSQL
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Distributed Data Aggregation
o  Distributed high-performance data, once only available by
expensive appliances: Netezza, GreenPlum, Exadata, Teradata, etc
o  Code 1-liner doing a SUM() between Pandas (standalone) and
Spark (distributed)
Pandas DataFrame Spark DataFrame
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
PARQUET: WHAT IS IT?
No, it’s not your Grandma’s flooring…
20	
  
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
What is Parquet?
Parquet File Format Parquet in HDFS
“Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem,
regardless of the choice of data processing framework, data model or programming language.” –
parquet.apache.org
•  Columnar File Format
•  Supports Nested Data Structures
•  Not tied to any commercial framework
•  Accessible by HIVE, Spark, Pig, Drill, MR
•  R/W in HDFS or local file system
•  Gaining Strong Usage…
Features	
  
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
What is Columnar Data?
•  Limit’s IO to data needed
•  Columnar compresses better
•  Type specific encodings available
•  Enables vectorized execution engines
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
How is Columnar Data Read?
Projection à
9 columns
Predicate à
50 Columns Wide
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Spark DataFrame è Parquet
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
DEMO: HIVE, PARQUET, SPARK SQL, DATAFRAME
Prepare CSV’s in HIVE, persist in Parquet, show Spark SQL and
DataFrame transforms using interactive shell in PySpark
25	
  
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Spark SQL is Experimental
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Demo Outline
T
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
OBSTACLES: NOTHING IS EASY
Expect workarounds and time spent hacking, but consider this, it’s
a learning opportunity…
28	
  
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Nothing is easy – unless you’re Ricky Bobby – “Shake and Bake”
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Obstacles
Spark
•  Don’t expect Pandas-convenient Spark DataFrames (yet at least)
(e.g. no upsampling, backfilling, etc)
•  RDD’s are powerful, but you’ll need to get your hands dirty writing
lower-level code (not a bad thing)
•  Allocate enough memory on all of your nodes, it’s a hog!
Parquet
•  Date and binary support are pending although timestamp,
decimal, char, and varchar are now supported in Hive 0.14.0
•  You won’t get Vertica or Cassandra level response times (maybe in
the future – that’s my speculation)
Getting Started
•  Learn by doing: reading is useful to gain the basics but just “jump
right in and do it”
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
QUESTIONS
Douglas Eisenstein
@dougeisenstein
doug.eisenstein@advantisolutions.com
Helping to create fast, reliable, and transparent modern data pipelines for financial analytics
Talk feedback: https://goo.gl/Y0UuWy
31	
  
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
APPENDIX
Using Spark, Python, and Cassandra for Loading and
Transformations at Scale
32	
  
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Resources
hAp://training.databricks.com/workshop/itas_workshop.pdf	
  (Intro	
  to	
  Spark	
  PDF)	
  	
  
hAps://databricks-­‐training.s3.amazonaws.com/slides/SparkSQLTraining.Summit.July2014.pdf	
  (Spark	
  SQL	
  Training	
  Module)	
  	
  
hAp://training.databricks.com/workshop/itas_workshop.pdf	
  (Spark	
  101)	
  	
  
hAp://spark-­‐summit.org/2014/training	
  (General	
  Spark	
  Videos)	
  	
  
hAps://spark.apache.org/docs/1.3.0/sql-­‐programming-­‐guide.html	
  (Spark	
  SQL	
  and	
  DataFrame’s	
  —	
  awesome)	
  	
  
hAp://www.slideshare.net/EvanChan2/2014-­‐07olapcassspark	
  (OLAP	
  with	
  Cassandra	
  and	
  Spark)	
  	
  
hAps://databricks.com/blog/2015/03/24/spark-­‐sql-­‐graduates-­‐from-­‐alpha-­‐in-­‐spark-­‐1-­‐3.html	
  (Latest	
  Spark/DataFrame/Parquet)	
  	
  
hAps://spark.apache.org/docs/latest/programming-­‐guide.html	
  (Basics	
  of	
  Spark	
  Development)	
  	
  
hAps://spark.apache.org/docs/latest/configura=on.html	
  (Spark	
  Configura=on)	
  	
  
hAps://academy.datastax.com/demos/datastax-­‐enterprise-­‐joining-­‐tables-­‐apache-­‐spark	
  (Joining	
  Cassandra	
  Tables	
  in	
  Spark)	
  	
  
hAp://www.infoobjects.com/author/rishi/	
  (Spark	
  /	
  Parquet	
  Integra=on)	
  	
  
hAps://parquet.incubator.apache.org/presenta=ons/	
  (Parquet	
  Videos/Slides,	
  awesome)	
  	
  
hAp://www.slideshare.net/databricks/spark-­‐sqlsse2015public	
  (All	
  about	
  DataFrame	
  by	
  the	
  author)	
  	
  
hAp://www.infoobjects.com/category/spark_cookbook/	
  (Good	
  walkthrough	
  of	
  Spark	
  Demos)	
  	
  
hAp://tobert.github.io/post/2014-­‐07-­‐15-­‐installing-­‐cassandra-­‐spark-­‐stack.html	
  (Spark	
  /	
  Cassandra	
  Integra=on	
  from	
  scratch)	
  	
  
hAp://blog.cloudera.com/blog/2015/05/working-­‐with-­‐apache-­‐spark-­‐or-­‐how-­‐i-­‐learned-­‐to-­‐stop-­‐worrying-­‐and-­‐love-­‐the-­‐
shuffle/	
  (Spark	
  by	
  Data	
  Engineer)	
  	
  
hAp://blog.cloudera.com/blog/2015/03/how-­‐to-­‐tune-­‐your-­‐apache-­‐spark-­‐jobs-­‐part-­‐1/	
  (Tuning	
  Spark)	
  	
  
hAps://spark-­‐summit.org/2015-­‐east/wp-­‐content/uploads/2015/03/SSE15-­‐21-­‐Sandy-­‐Ryza.pdf	
  ()	
  	
  
hAps://www.youtube.com/watch?v=0OM68k3np0E&list=PL-­‐x35fyliRwiiYSXHyI61RXdHlYR3QjZ1&index=5	
  ()	
  
http://www.slideshare.net/dataera/parquet-format
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
About Me
o  My vision is to make data preparation fast and reliable
o  I help financial firms with data-intensive processes
o 
o  In my spare time: CrossFit, Baseball, No Gardening!
@dougeisenstein

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
 
Data Science
Data ScienceData Science
Data Science
 
Hadoop to spark-v2
Hadoop to spark-v2Hadoop to spark-v2
Hadoop to spark-v2
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science London
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
PySaprk
PySaprkPySaprk
PySaprk
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sql
 
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsApache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 

Ähnlich wie Spark, Python and Parquet

Stephen Dillon - Fast Data Presentation Sept 02
Stephen Dillon - Fast Data Presentation Sept 02Stephen Dillon - Fast Data Presentation Sept 02
Stephen Dillon - Fast Data Presentation Sept 02
Stephen Dillon
 

Ähnlich wie Spark, Python and Parquet (20)

Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with Spark
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
 
Spark Intro @ analytics big data summit
Spark  Intro @ analytics big data summitSpark  Intro @ analytics big data summit
Spark Intro @ analytics big data summit
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
 
Ischools workshop - 4 - data discovery
Ischools workshop - 4 - data discoveryIschools workshop - 4 - data discovery
Ischools workshop - 4 - data discovery
 
Survey of Spark for Data Pre-Processing and Analytics
Survey of Spark for Data Pre-Processing and AnalyticsSurvey of Spark for Data Pre-Processing and Analytics
Survey of Spark for Data Pre-Processing and Analytics
 
Data Engineering A Deep Dive into Databricks
Data Engineering A Deep Dive into DatabricksData Engineering A Deep Dive into Databricks
Data Engineering A Deep Dive into Databricks
 
How Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science StackHow Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science Stack
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
 
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
 
Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark
 
How to Create the Google for Earth Data (XLDB 2015, Stanford)
How to Create the Google for Earth Data (XLDB 2015, Stanford)How to Create the Google for Earth Data (XLDB 2015, Stanford)
How to Create the Google for Earth Data (XLDB 2015, Stanford)
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
 
Skillshare - Let's talk about R in Data Journalism
Skillshare - Let's talk about R in Data JournalismSkillshare - Let's talk about R in Data Journalism
Skillshare - Let's talk about R in Data Journalism
 
Stephen Dillon - Fast Data Presentation Sept 02
Stephen Dillon - Fast Data Presentation Sept 02Stephen Dillon - Fast Data Presentation Sept 02
Stephen Dillon - Fast Data Presentation Sept 02
 
Minimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data VirtualizationMinimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data Virtualization
 
IIPGH Webinar 1: Getting Started With Data Science
IIPGH Webinar 1: Getting Started With Data ScienceIIPGH Webinar 1: Getting Started With Data Science
IIPGH Webinar 1: Getting Started With Data Science
 
Databases for Data Science
Databases for Data ScienceDatabases for Data Science
Databases for Data Science
 
FAIR data: LOUD for all audiences
FAIR data: LOUD for all audiencesFAIR data: LOUD for all audiences
FAIR data: LOUD for all audiences
 

Mehr von odsc

Kaggle The Home of Data Science
Kaggle The Home of Data ScienceKaggle The Home of Data Science
Kaggle The Home of Data Science
odsc
 
The Art of Data Science
The Art of Data Science The Art of Data Science
The Art of Data Science
odsc
 
Frontiers of Open Data Science Research
Frontiers of Open Data Science ResearchFrontiers of Open Data Science Research
Frontiers of Open Data Science Research
odsc
 

Mehr von odsc (20)

Understanding the Chief Data Officer
Understanding the Chief Data Officer Understanding the Chief Data Officer
Understanding the Chief Data Officer
 
Machine-In-The-Loop for Knowledge Discovery
Machine-In-The-Loop for Knowledge DiscoveryMachine-In-The-Loop for Knowledge Discovery
Machine-In-The-Loop for Knowledge Discovery
 
API Driven Development
API Driven Development API Driven Development
API Driven Development
 
Mobile technology Usage by Humanitarian Programs: A Metadata Analysis
Mobile technology Usage by Humanitarian Programs: A Metadata AnalysisMobile technology Usage by Humanitarian Programs: A Metadata Analysis
Mobile technology Usage by Humanitarian Programs: A Metadata Analysis
 
Productionizing Deep Learning From the Ground Up
Productionizing Deep Learning From the Ground UpProductionizing Deep Learning From the Ground Up
Productionizing Deep Learning From the Ground Up
 
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hive
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and HiveBig Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hive
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hive
 
Think Breadth, Not Depth
Think Breadth, Not DepthThink Breadth, Not Depth
Think Breadth, Not Depth
 
Data Science at Dow Jones: Monetizing Data, News and Information
Data Science at Dow Jones: Monetizing Data, News and InformationData Science at Dow Jones: Monetizing Data, News and Information
Data Science at Dow Jones: Monetizing Data, News and Information
 
Building a Predictive Analytics Solution with Azure ML
Building a Predictive Analytics Solution with Azure MLBuilding a Predictive Analytics Solution with Azure ML
Building a Predictive Analytics Solution with Azure ML
 
Beyond Names
Beyond NamesBeyond Names
Beyond Names
 
How Woman are Conquering the S&P 500
How Woman are Conquering the S&P 500How Woman are Conquering the S&P 500
How Woman are Conquering the S&P 500
 
Domain Expertise and Unstructured Data
Domain Expertise and Unstructured DataDomain Expertise and Unstructured Data
Domain Expertise and Unstructured Data
 
Kaggle The Home of Data Science
Kaggle The Home of Data ScienceKaggle The Home of Data Science
Kaggle The Home of Data Science
 
Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions
 
Machine Learning with scikit-learn
Machine Learning with scikit-learnMachine Learning with scikit-learn
Machine Learning with scikit-learn
 
Bridging the Gap Between Data and Insight using Open-Source Tools
Bridging the Gap Between Data and Insight using Open-Source ToolsBridging the Gap Between Data and Insight using Open-Source Tools
Bridging the Gap Between Data and Insight using Open-Source Tools
 
Top 10 Signs of the Textpocalypse
Top 10 Signs of the TextpocalypseTop 10 Signs of the Textpocalypse
Top 10 Signs of the Textpocalypse
 
The Art of Data Science
The Art of Data Science The Art of Data Science
The Art of Data Science
 
Frontiers of Open Data Science Research
Frontiers of Open Data Science ResearchFrontiers of Open Data Science Research
Frontiers of Open Data Science Research
 
Feature Engineering
Feature Engineering Feature Engineering
Feature Engineering
 

Kürzlich hochgeladen

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Kürzlich hochgeladen (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Spark, Python and Parquet

  • 1. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   May, 2015 Douglas Eisenstein - Advanti Stanislav Seltser - Advanti BOSTON 2015 @opendatasci O P E N D A T A S C I E N C E C O N F E R E N C E_ Spark, Python, and Parquet Learn How to Use Spark, Python, and Parquet for Loading and Transforming Data in 45 Minutes
  • 2. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Agenda Use Case Background What’s Spark and Parquet? Demo: code + data = fun part Questions
  • 3. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   TAKEAWAYS: WHAT WILL YOU LEARN TODAY? Learn new technologies and be motivated to start using them 3  
  • 4. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   In about the same time it will take you to mow your lawn…
  • 5. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Takeaways: you will learn … 1.  What is Spark and Parquet and why to consider it? 2.  A live demo showing you 5 useful transforms in Spark 3.  Instructions for DIY cluster: Spark, Hadoop, Hive, Parquet, and C* 4.  “Take home” fund holdings open dataset from Morningstar
  • 6. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   USE CASE: DATASET, TRADEOFFS, SOLUTION What’s our use case? Why choose Spark? 6  
  • 7. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Holdings Data “The contents of an investment portfolio held by an individual or entity such as a mutual fund or pension fund.” – Investopedia 2015 Sources: o  3rd Parties: Morningstar, Lipper, FactSet, Bloomberg, Thomson Reuters o  Internal: Specialized portfolio accounting systems per asset type o  Public: SEC 13F’s when AUM is >= $100M, includes hedge funds
  • 8. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Use Case: Challenges 1.  Millions of files, Reshape data, pre-aggregate holdings, disambiguate securities, unstack/stack data 2.  Create wide tables (think columnar) for storing time series data about 1M instruments over 12 months for 381k funds 3.  Rules are abstracted to work on “holdings” making them vendor- agnostic and reusable (13F’s, 3rd parties, proprietary, etc) 4.  All sorts of “data usability” issues: missing positions, missing identifiers, sparse derivative data, irregular fund reporting, etc 5.  Need to create your own “ownership views” by asset class and period (ex. Fixed Income Ownership View)
  • 9. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Open Dataset Description Key Points o  Holdings are electively contributed for fresher ranks o  Long and Short positions included, unique… o  Datasets: free open data, trial data, paid licensed data o  Our dataset starts in 2014-01 across all fund types Descriptive Statistics o  History starts in 2000+ o  Fund types: Open, Closed, ETFs, SMAs o  381k funds since 2014 o  1M unique securities since 2014 o  82 legal types, ex Corporate Bond or Equity Swap Open Dataset: open-ended funds for report period 2014-06-10
  • 10. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Common Usages 1.  Ownership: Top N largest long/short positions per asset class 2.  Flow: Compute holdings flow across various financial assets 3.  Comparison: Peer group analysis through holdings attribution
  • 11. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   What’s the best part about starting a new project? You get to “play” with new tech toys right? But which ones….?
  • 12. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Data Processing Landscape o  f Too many choices
  • 13. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Modern General Purpose Data Pipeline Why: 1.  Spark for in-memory transforms/analytics and fast exploration 2.  Parquet for fast columnar data retrieval 3.  Python as the glue layer and to re-use data transforms Data Pipeline:
  • 14. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Granular Options Remember… Technology moves quickly!
  • 15. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   SPARK: WHAT IS IT? Everyone hears about it, but what is Spark? 15  
  • 16. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   If you do any Googling, don’t search for Spark 1.4
  • 17. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   What is Spark? Highlights o  General purpose data processing engine o  In-memory data persistence o  Interactive shells in Scala and Python Key Concepts o  RDD: Collections of objects stored in RAM or on Disk o  Transforms: map(), filter(), reduceByKey(), join() o  Actions: count(), collect(), save() Spark SQL o  Runs SQL or HQL queries o  In-line SQL UDF o  Distributed DataFrame o  Parquet, Hive, Cassandra Graphic  sourced:  hAps://goo.gl/fpLrSs  
  • 18. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   What is a Spark DataFrame? What’s DataFrame? •  A collection of rows organized into named columns •  Expressive language for wrangling data into something usable •  Ability to select, filter, and aggregate structured data What’s a Spark DataFrame? •  Distributed collection of data organized into named columns •  Conceptually equivalent to a table in a relational database •  Relational data processing: project, filter, aggregate, join, etc •  Operations: groupBy(), join(), sql(), unique() •  Create UDF’s and push them into SparkSQL
  • 19. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Distributed Data Aggregation o  Distributed high-performance data, once only available by expensive appliances: Netezza, GreenPlum, Exadata, Teradata, etc o  Code 1-liner doing a SUM() between Pandas (standalone) and Spark (distributed) Pandas DataFrame Spark DataFrame
  • 20. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   PARQUET: WHAT IS IT? No, it’s not your Grandma’s flooring… 20  
  • 21. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   What is Parquet? Parquet File Format Parquet in HDFS “Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.” – parquet.apache.org •  Columnar File Format •  Supports Nested Data Structures •  Not tied to any commercial framework •  Accessible by HIVE, Spark, Pig, Drill, MR •  R/W in HDFS or local file system •  Gaining Strong Usage… Features  
  • 22. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   What is Columnar Data? •  Limit’s IO to data needed •  Columnar compresses better •  Type specific encodings available •  Enables vectorized execution engines
  • 23. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   How is Columnar Data Read? Projection à 9 columns Predicate à 50 Columns Wide
  • 24. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Spark DataFrame è Parquet
  • 25. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   DEMO: HIVE, PARQUET, SPARK SQL, DATAFRAME Prepare CSV’s in HIVE, persist in Parquet, show Spark SQL and DataFrame transforms using interactive shell in PySpark 25  
  • 26. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Spark SQL is Experimental
  • 27. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Demo Outline T
  • 28. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   OBSTACLES: NOTHING IS EASY Expect workarounds and time spent hacking, but consider this, it’s a learning opportunity… 28  
  • 29. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Nothing is easy – unless you’re Ricky Bobby – “Shake and Bake”
  • 30. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Obstacles Spark •  Don’t expect Pandas-convenient Spark DataFrames (yet at least) (e.g. no upsampling, backfilling, etc) •  RDD’s are powerful, but you’ll need to get your hands dirty writing lower-level code (not a bad thing) •  Allocate enough memory on all of your nodes, it’s a hog! Parquet •  Date and binary support are pending although timestamp, decimal, char, and varchar are now supported in Hive 0.14.0 •  You won’t get Vertica or Cassandra level response times (maybe in the future – that’s my speculation) Getting Started •  Learn by doing: reading is useful to gain the basics but just “jump right in and do it”
  • 31. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   QUESTIONS Douglas Eisenstein @dougeisenstein doug.eisenstein@advantisolutions.com Helping to create fast, reliable, and transparent modern data pipelines for financial analytics Talk feedback: https://goo.gl/Y0UuWy 31  
  • 32. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   APPENDIX Using Spark, Python, and Cassandra for Loading and Transformations at Scale 32  
  • 33. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Resources hAp://training.databricks.com/workshop/itas_workshop.pdf  (Intro  to  Spark  PDF)     hAps://databricks-­‐training.s3.amazonaws.com/slides/SparkSQLTraining.Summit.July2014.pdf  (Spark  SQL  Training  Module)     hAp://training.databricks.com/workshop/itas_workshop.pdf  (Spark  101)     hAp://spark-­‐summit.org/2014/training  (General  Spark  Videos)     hAps://spark.apache.org/docs/1.3.0/sql-­‐programming-­‐guide.html  (Spark  SQL  and  DataFrame’s  —  awesome)     hAp://www.slideshare.net/EvanChan2/2014-­‐07olapcassspark  (OLAP  with  Cassandra  and  Spark)     hAps://databricks.com/blog/2015/03/24/spark-­‐sql-­‐graduates-­‐from-­‐alpha-­‐in-­‐spark-­‐1-­‐3.html  (Latest  Spark/DataFrame/Parquet)     hAps://spark.apache.org/docs/latest/programming-­‐guide.html  (Basics  of  Spark  Development)     hAps://spark.apache.org/docs/latest/configura=on.html  (Spark  Configura=on)     hAps://academy.datastax.com/demos/datastax-­‐enterprise-­‐joining-­‐tables-­‐apache-­‐spark  (Joining  Cassandra  Tables  in  Spark)     hAp://www.infoobjects.com/author/rishi/  (Spark  /  Parquet  Integra=on)     hAps://parquet.incubator.apache.org/presenta=ons/  (Parquet  Videos/Slides,  awesome)     hAp://www.slideshare.net/databricks/spark-­‐sqlsse2015public  (All  about  DataFrame  by  the  author)     hAp://www.infoobjects.com/category/spark_cookbook/  (Good  walkthrough  of  Spark  Demos)     hAp://tobert.github.io/post/2014-­‐07-­‐15-­‐installing-­‐cassandra-­‐spark-­‐stack.html  (Spark  /  Cassandra  Integra=on  from  scratch)     hAp://blog.cloudera.com/blog/2015/05/working-­‐with-­‐apache-­‐spark-­‐or-­‐how-­‐i-­‐learned-­‐to-­‐stop-­‐worrying-­‐and-­‐love-­‐the-­‐ shuffle/  (Spark  by  Data  Engineer)     hAp://blog.cloudera.com/blog/2015/03/how-­‐to-­‐tune-­‐your-­‐apache-­‐spark-­‐jobs-­‐part-­‐1/  (Tuning  Spark)     hAps://spark-­‐summit.org/2015-­‐east/wp-­‐content/uploads/2015/03/SSE15-­‐21-­‐Sandy-­‐Ryza.pdf  ()     hAps://www.youtube.com/watch?v=0OM68k3np0E&list=PL-­‐x35fyliRwiiYSXHyI61RXdHlYR3QjZ1&index=5  ()   http://www.slideshare.net/dataera/parquet-format
  • 34. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   About Me o  My vision is to make data preparation fast and reliable o  I help financial firms with data-intensive processes o  o  In my spare time: CrossFit, Baseball, No Gardening! @dougeisenstein