Innovation and Reinvention Driving Transformation
OCTOBER 9, 2018
2018 HPCC Systems® Community Day
James McMullan
Innovation with Connection,
The new HPCC Systems Plugins and Modules
Overview
• Why Integrate HPCC Systems, Spark and Hadoop?
• Spark-Thor Component
• Goals, Features
• Spark-HPCC Connector
• Goals, Features, Demo
• HDFS Connector
• Goals, Features, Demo
• Closing Thoughts
Why Integrate HPCC Systems, Spark and Hadoop?
• Our goal: Allow you to do more
• Combine strengths of the different ecosystems
• Still in early stages
• Exploring the potential of these integrations
• More than one compelling use case
• Python statistical and ML libraries through PySpark
• New data formats
• New methods of consuming data
Spark-Thor Component
HPCC Systems Spark-Thor Component - Goals
• Easy setup of co-located Spark & HPCC Systems
• Easy configuration
• Allow custom configuration
• Unified startup of Spark & HPCC Systems
• Default configuration that works with HPCC Systems
• Log directories
• Work directories
• Resource allocation
HPCC Systems Spark-Thor Component - Features
• Spark-Thor Component Installation
• Packaged as a plugin
• Platform build option
• Easy Configuration through configmgr
• Spark cluster mirrors Thor cluster
• Resource allocation settings
• Custom configuration through spark-env.sh
• Default configuration
• Fixes common issues
• Works with HPCC Systems configuration
• Easily Start & Stop Spark
• Unified startup
• Uses existing HPCC Systems scripts
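As a rough illustration of "Spark cluster mirrors Thor cluster", the sketch below derives per-node Spark settings from a Thor node description and renders them as spark-env.sh lines. SPARK_WORKER_CORES, SPARK_WORKER_MEMORY, SPARK_LOG_DIR and SPARK_WORKER_DIR are standard Spark standalone variables. The ThorNode class, the directory paths and the 50% resource share are assumptions made for this example, not what the plugin actually generates.

```python
from dataclasses import dataclass


@dataclass
class ThorNode:
    """Hypothetical description of one Thor node (not an HPCC Systems type)."""
    cores: int
    memory_gib: int


def render_spark_env(node: ThorNode, spark_share: float = 0.5,
                     log_dir: str = "/var/log/HPCCSystems/myspark",
                     work_dir: str = "/var/lib/HPCCSystems/myspark") -> str:
    """Render spark-env.sh lines that give Spark a share of the node's resources.

    The default paths and the share policy are illustrative assumptions.
    """
    cores = max(1, int(node.cores * spark_share))
    memory = max(1, int(node.memory_gib * spark_share))
    lines = [
        f"export SPARK_WORKER_CORES={cores}",
        f"export SPARK_WORKER_MEMORY={memory}g",
        f"export SPARK_LOG_DIR={log_dir}",
        f"export SPARK_WORKER_DIR={work_dir}",
    ]
    return "\n".join(lines)


env_text = render_spark_env(ThorNode(cores=16, memory_gib=64))
print(env_text)
```

Any line produced this way could equally be written by hand in spark-env.sh, which is the custom configuration path listed above.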
Spark-HPCC Connector
Spark-HPCC Connector - Goals
• Read and write data from Spark
• Reliable and easy to use
• Performant
• Memory usage
• I/O throughput
• Row construction cost
• Allow co-location of Spark and HPCC
• Use HPCC Systems data with Spark MLlib
• RDD and DataFrame support
Spark-HPCC Connector - Features
Reading data from HPCC Systems
• Co-located and remote
• HPCC Systems record definitions
• All scalar types, child datasets, and sets of scalars
• Construct RDD of Rows
• Easy translation from RDD to DataFrame
• Easy translation from RDD to ML Datasets
• Field & row filtering on HPCC Systems side
• Distributed reading
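The idea behind field and row filtering on the HPCC Systems side, combined with distributed reading, can be sketched in plain Python. Each file part is filtered and projected where it is stored, so only matching rows and requested fields are handed back. The read_part function, the sample data and the field names are made up for this example and do not reflect the connector's actual API.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, object]


def read_part(part: List[Row], fields: List[str],
              row_filter: Callable[[Row], bool]) -> Iterator[Row]:
    """Apply the row filter and the field projection next to the data."""
    for row in part:
        if row_filter(row):
            yield {name: row[name] for name in fields}


# Two file parts, as if stored on two different nodes (made-up data).
file_parts = [
    [{"id": 1, "accidents": 3, "state": "GA"},
     {"id": 2, "accidents": 1, "state": "FL"}],
    [{"id": 3, "accidents": 0, "state": "GA"},
     {"id": 4, "accidents": 4, "state": "NY"}],
]

# Each part is read independently; the filter keeps rows with accidents < 3.
result = [
    row
    for part in file_parts
    for row in read_part(part, ["id", "accidents"], lambda r: r["accidents"] < 3)
]
print(result)
```

In a real cluster the per-part reads would run in parallel on separate Spark partitions; the list comprehension here only stands in for that.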
[Figure: row filtering example using the condition Accidents < 3]
Spark-HPCC Connector - Features
Writing data to HPCC Systems
• Co-located writes only
• RDD<Row> of Spark SQL data-types
• Integer, Long, Float, Double, BigDecimal, String, Sequence, Row, byte[]
• Automated Row to Record translation
• Distributed writing
• Creation of new datasets only
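To make "automated Row to Record translation" more concrete, here is a minimal sketch that infers an ECL-style record layout from the values in one row. Python types stand in for the Spark SQL types listed above. The mapping table is an assumption chosen for illustration (the connector may select different ECL types or widths), and the function names are hypothetical.

```python
from decimal import Decimal

# Assumed mapping from Python stand-ins for Spark SQL types to ECL field types.
ECL_TYPES = {
    int: "INTEGER8",
    float: "REAL8",
    Decimal: "DECIMAL32_8",
    str: "UTF8",
    bytes: "DATA",
}


def ecl_type(value) -> str:
    """Return the assumed ECL type for a scalar or a non-empty list of scalars."""
    if isinstance(value, list):
        # A sequence of scalars becomes a set of the element type.
        return "SET OF " + ecl_type(value[0])
    return ECL_TYPES[type(value)]


def record_definition(row: dict) -> str:
    """Build an ECL RECORD definition with one field per row entry."""
    fields = [f"  {ecl_type(value)} {name};" for name, value in row.items()]
    return "\n".join(["RECORD", *fields, "END;"])


layout = record_definition({"id": 7, "score": 0.5, "name": "Ann", "tags": ["a", "b"]})
print(layout)
```

Nested rows (child records) are left out to keep the sketch short.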
[Figure: Row to Record translation]
Spark-HPCC Connector - Features
PySpark Support
• Utilizes Scala / Java API
• Reading & Writing
• Same features & limitations
• Py4J library used to construct JavaRDDs
• PySpark picklers used for Serialization / Deserialization
• Uses PySpark configured pickler
• PySpark introduces additional overhead
• RDD Serialization and Deserialization required
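The extra cost mentioned above comes from every row crossing the JVM/Python boundary in serialized form. The sketch below uses Python's standard pickle module as a stand-in to show that round trip; PySpark uses its own configured serializer, and the sample rows are made up.

```python
import pickle

# Made-up rows standing in for records read from a dataset.
rows = [(i, f"name-{i}", i * 0.5) for i in range(1000)]

# Serialize on one side of the boundary, deserialize on the other.
payload = pickle.dumps(rows, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)

print(len(payload), "bytes serialized for", len(rows), "rows")
```

Both steps touch every row, which is why the PySpark path carries overhead that the Scala / Java API does not.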
Spark-HPCC Connector - Demo
Spark-HPCC Connector - Results
• Read and write data from Spark ✔
• Reliable and easy to use ✔
• Performant ✔
• Memory usage: No additional row / field overhead
• I/O throughput: 1.5 GBit / s
• Row construction cost: 30 million rows / s
• Co-location of Spark and HPCC Systems ✔
• HPCC Systems data with Spark MLlib ✔
• RDD and DataFrame support ✔
Spark-HPCC Connector - Roadmap
• Remote writes
• Improved performance
• Better data locality planning
• Scala/Java generic RDDs: RDD<YourClass>
• Automated field mapping
• Automated type conversion
• Automated filtering
• DataFrameReader and DataFrameWriter
• No intermediate RDD
Spark-HPCC Connector - Availability
• HPCC Systems GitHub:
• https://github.com/hpcc-systems/Spark-HPCC
• We are open for feedback & feature requests
• We want to hear about your use cases!
• Pull requests welcome!
HDFS Connector
HDFS Connector - Goals
• Read and write HDFS data from HPCC Systems
• Reliable and easy to use
• Performant
• Memory usage
• I/O throughput
• Row construction cost
• Few dependencies
• Allow co-location of HPCC Systems and HDFS
• Support multiple file formats
HDFS Connector - Features
Reading from HDFS
• Thor and CSV files
• Thor files support all HPCC Systems record layouts
• Including variable length records and records with children
• CSV files support only scalar datatypes
• Integers, Reals, Decimals, Strings, Varstrings, UTF8
• Automatic field mapping and filtering
• Distributed Reading
• Dynamically split datasets down to the HDFS block size (64 MiB)
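Splitting a dataset down to the HDFS block size can be sketched as a simple planning step: cut the file into ranges of at most 64 MiB and deal them out to the reading nodes. The plan_splits function and the round-robin assignment are assumptions for illustration. A real reader also has to handle records that straddle a block boundary, which is not shown here.

```python
from typing import List, Tuple

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MiB, the block size stated above


def plan_splits(file_size: int, nodes: int,
                block_size: int = BLOCK_SIZE) -> List[List[Tuple[int, int]]]:
    """Cut a file into (offset, length) ranges and assign them round-robin."""
    plan: List[List[Tuple[int, int]]] = [[] for _ in range(nodes)]
    offset = 0
    index = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        plan[index % nodes].append((offset, length))
        offset += length
        index += 1
    return plan


# A hypothetical 200 MiB file read by two nodes.
plan = plan_splits(file_size=200 * 1024 * 1024, nodes=2)
for node, ranges in enumerate(plan, start=1):
    print(f"node {node}: {ranges}")
```

With these inputs each node receives two ranges, and the last range is the 8 MiB remainder.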
HDFS Connector - Features
Writing to HDFS
• Supports Thor and CSV files
• Thor files support all HPCC Systems record layouts
• CSV files support only scalar datatypes
• Integers, Reals, Decimals, Strings, Varstrings, UTF8
• Distributed writing
• Aware of HPCC Systems cluster topology
• Additional metadata added to Thor Files
• Record structure validation and dynamic splitting
• Multiple write modes: Create Only, Overwrite or Append
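The three write modes behave much like the familiar file-open modes. The sketch below models them against a local temporary file rather than HDFS. The WriteMode enum and write_rows function are hypothetical names for this example, and the connector's own option names may differ.

```python
import os
import tempfile
from enum import Enum


class WriteMode(Enum):
    CREATE_ONLY = "x"   # fail if the target already exists
    OVERWRITE = "w"     # replace any existing content
    APPEND = "a"        # add to the end of existing content


def write_rows(path: str, rows: list, mode: WriteMode) -> None:
    """Write rows as lines to a local file using the chosen mode's semantics."""
    with open(path, mode.value, encoding="utf-8") as handle:
        for row in rows:
            handle.write(row + "\n")


target = os.path.join(tempfile.mkdtemp(), "demo.csv")
write_rows(target, ["1,a"], WriteMode.CREATE_ONLY)
write_rows(target, ["2,b"], WriteMode.APPEND)

# A second create-only write to the same target must be rejected.
try:
    write_rows(target, ["3,c"], WriteMode.CREATE_ONLY)
    create_only_rejected = False
except FileExistsError:
    create_only_rejected = True

with open(target, encoding="utf-8") as handle:
    after_append = handle.read().splitlines()

write_rows(target, ["9,z"], WriteMode.OVERWRITE)
with open(target, encoding="utf-8") as handle:
    after_overwrite = handle.read().splitlines()

print(create_only_rejected, after_append, after_overwrite)
```

Create Only is the safe default when an existing dataset must never be clobbered by accident.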
HDFS Connector - Demo
HDFS Connector - Results
• Read and write HDFS data ✔
• Reliable and easy to use ✔
• Performant ✔
• Memory usage: No additional memory overhead
• I/O throughput: TBD
• Record construction cost: TBD
• Few dependencies ✔
• Co-location of HPCC Systems and HDFS ✔
• Support multiple file formats ✔
HDFS Connector - Roadmap
• Parquet support
• Reading & Writing
• Automatic column filtering
• Expand library support to Hadoop libhdfs
• Statically linking to Apache Hawq libhdfs3
• Support Hadoop HDFS add-ons
• S3A client
• Performance tuning
HDFS Connector - Availability
• Available as Technical Preview
• HPCC Systems GitHub:
• https://github.com/hpcc-systems/HDFS-Connector
• We are open for feedback & feature requests
• We want to hear about your use cases!
• Pull requests welcome!
Closing Thoughts
• Integration work between HPCC Systems, Spark and Hadoop is ongoing
• Our goal: allow you to do more with your data
• Spark-HPCC and HDFS Connector available now
• Feedback, Feature requests and PRs wanted
• Tell us about your use cases!
Questions?
Spark-Thor Plugin:
https://hpccsystems.com/download
Spark-HPCC Connector:
https://github.com/hpcc-systems/Spark-HPCC
HDFS Connector Tech Preview:
https://github.com/hpcc-systems/HDFS-Connector

Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowgargpaaro
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxVivek487417
 
Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdftheeltifs
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...gajnagarg
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...gajnagarg
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjurptikerjasaptiker
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 

Kürzlich hochgeladen (20)

Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdf
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 

Innovation with Connection, The new HPCC Systems Plugins and Modules

  • 1. Innovation and Reinvention Driving Transformation OCTOBER 9, 2018 2018 HPCC Systems® Community Day James McMullan Innovation with Connection, The new HPCC Systems Plugins and Modules
  • 2. Overview • Why Integrate HPCC Systems, Spark and Hadoop? • Spark-Thor Component • Goals, Features • Spark-HPCC Connector • Goals, Features, Demo • HDFS Connector • Goals, Features, Demo • Closing Thoughts
  • 3. Why Integrate HPCC Systems, Spark and Hadoop? • Our goal: Allow you to do more • Combine strengths of the different ecosystems • Still in early stages • Exploring the potential of these integrations • More than one compelling use case • Python statistical and ML libraries through PySpark • New data formats • New methods of consuming data
  • 5. HPCC Systems Spark-Thor Component - Goals • Easy setup of co-located Spark & HPCC Systems • Easy configuration • Allow custom configuration • Unified startup of Spark & HPCC Systems • Default configuration that works with HPCC Systems • Log directories • Work directories • Resource allocation
  • 6. HPCC Systems Spark-Thor Component - Features • Spark-Thor Component Installation • Packaged as a plugin • Platform build option • Easy Configuration through configmgr • Spark cluster mirrors Thor cluster • Resource allocation settings • Custom configuration through spark-env.sh • Default configuration • Fixes common issues • Works with HPCC Systems configuration • Easily Start & Stop Spark • Unified startup • Uses existing HPCC Systems scripts
  • 8. Spark-HPCC Connector - Goals • Read and write data from Spark • Reliable and easy to use • Performant • Memory usage • I/O throughput • Row construction cost • Allow co-location of Spark and HPCC • Use HPCC Systems data with Spark MLlib • RDD and DataFrame support
  • 9. Spark-HPCC Connector - Features: Reading data from HPCC Systems • Co-located and remote • HPCC Systems record definitions • All scalar types, child datasets, and sets of scalars • Construct RDD of Rows • Easy translation from RDD to DataFrame • Easy translation from RDD to ML Datasets • Field & row filtering on HPCC Systems side • Distributed reading [Figure: row filtering example, Accidents < 3]
  • 10. Spark-HPCC Connector - Features: Writing data to HPCC Systems • Co-located writes only • RDD<Row> of Spark SQL data-types • Integer, Long, Float, Double, BigDecimal, String, Sequence, Row, byte[] • Automated Row to Record translation • Distributed writing • Creation of new datasets only [Figure: Row to Record]
  • 11. Spark-HPCC Connector - Features: PySpark Support • Utilizes Scala / Java API • Reading & Writing • Same features & limitations • Py4J library used to construct JavaRDDs • PySpark picklers used for Serialization / Deserialization • Uses PySpark configured pickler • PySpark introduces additional overhead • RDD Serialization and Deserialization required
  • 12. Spark-HPCC Connector - Demo
  • 13. Spark-HPCC Connector - Results • Read and write data from Spark ✔ • Reliable and easy to use ✔ • Performant ✔ • Memory usage: No additional row / field overhead • I/O throughput: 1.5 GBit/s • Row construction cost: 30 million rows/s • Co-location of Spark and HPCC Systems ✔ • HPCC Systems data with Spark MLlib ✔ • RDD and DataFrame support ✔
  • 14. Spark-HPCC Connector - Roadmap • Remote writes • Improved performance • Better data locality planning • Scala/Java generic RDDs: RDD<YourClass> • Automated field mapping • Automated type conversion • Automated filtering • DataFrameReader and DataFrameWriter • No intermediate RDD
  • 15. Spark-HPCC Connector - Availability • HPCC Systems GitHub: • https://github.com/hpcc-systems/Spark-HPCC • We are open for feedback & feature requests • We want to hear about your use cases! • Pull requests welcome!
  • 17. HDFS Connector - Goals • Read and write HDFS data from HPCC Systems • Reliable and easy to use • Performant • Memory usage • I/O throughput • Row construction cost • Few dependencies • Allow co-location of HPCC Systems and HDFS • Support multiple file formats
  • 18. HDFS Connector - Features: Reading from HDFS • Thor and CSV files • Thor files support all HPCC Systems record layouts • Including variable length records and records with children • CSV files support only scalar datatypes • Integers, Reals, Decimals, Strings, Varstrings, UTF8 • Automatic field mapping and filtering • Distributed Reading • Dynamically split datasets down to the HDFS block size (64 MiB) [Figure: Node 1, Node 2]
  • 19. HDFS Connector - Features: Writing to HDFS • Supports Thor and CSV files • Thor files support all HPCC Systems record layouts • CSV files support only scalar datatypes • Integers, Reals, Decimals, Strings, Varstrings, UTF8 • Distributed writing • Aware of HPCC Systems cluster topology • Additional metadata added to Thor Files • Record structure validation and dynamic splitting • Multiple write modes: Create Only, Overwrite or Append [Figure: Node 1, Node 2]
  • 20. HDFS Connector - Demo
  • 21. HDFS Connector - Results • Read and write HDFS data ✔ • Reliable and easy to use ✔ • Performant ✔ • Memory usage: No additional memory overhead • I/O throughput: TBD • Record construction cost: TBD • Few dependencies ✔ • Co-location of HPCC Systems and HDFS ✔ • Support multiple file formats ✔
  • 22. HDFS Connector - Roadmap • Parquet support • Reading & Writing • Automatic column filtering • Expand library support to Hadoop libhdfs • Statically linking to Apache Hawq libhdfs3 • Support Hadoop HDFS add-ons • S3A client • Performance tuning
  • 23. HDFS Connector - Availability • Available as Technical Preview • HPCC Systems GitHub: • https://github.com/hpcc-systems/HDFS-Connector • We are open for feedback & feature requests • We want to hear about your use cases! • Pull requests welcome!
  • 24. Closing Thoughts • Integration work between HPCC Systems, Spark and Hadoop is ongoing • Our goal: allow you to do more with your data • Spark-HPCC and HDFS Connector available now • Feedback, feature requests and PRs wanted • Tell us about your use cases!
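
Slide 18 mentions that distributed reads dynamically split datasets down to the HDFS block size (64 MiB). The sketch below shows one way such a split plan could be computed. It is an illustration only, not the HDFS Connector's actual algorithm: the function name plan_splits and the even, contiguous assignment of blocks to nodes are assumptions, and the sketch ignores record boundaries, which a real reader of variable-length Thor records would have to respect.

```python
from typing import List, Tuple

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MiB, the HDFS block size quoted on slide 18


def plan_splits(file_size: int, num_nodes: int,
                block_size: int = BLOCK_SIZE) -> List[List[Tuple[int, int]]]:
    """Assign contiguous (offset, length) byte ranges to reader nodes.

    Hypothetical helper: every range is one full block, except for the
    final range, which covers whatever is left of the file.
    """
    if file_size < 0 or num_nodes <= 0 or block_size <= 0:
        raise ValueError("invalid file size, node count, or block size")

    # Cut the file into block-sized byte ranges.
    blocks = [(offset, min(block_size, file_size - offset))
              for offset in range(0, file_size, block_size)]

    # Hand each node a contiguous run of blocks. Any remainder goes to the
    # first nodes, so block counts differ by at most one between nodes.
    base, extra = divmod(len(blocks), num_nodes)
    plan, start = [], 0
    for node in range(num_nodes):
        count = base + (1 if node < extra else 0)
        plan.append(blocks[start:start + count])
        start += count
    return plan
```

For a 200 MiB file read by two nodes, this yields four ranges (three full blocks and one 8 MiB tail), two per node. Keeping each node's ranges contiguous is a design choice made here for simplicity. A planner that is aware of data locality could assign blocks differently.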

Editor's Notes

  1. Our goal is to allow you to do more: integrating the strengths of HPCC and the Spark and Hadoop ecosystems makes that possible. Still in early stages: our team and other teams are exploring the potential of these integrations. More than one compelling use case: there are some great ML and statistical libraries being built in Python, and PySpark lets us leverage them easily without having to add support ourselves. Support for different data formats (Parquet, Avro, ORC): lots of third parties are building Spark connectors for different data formats. Support for different ways of consuming data: Spark Streaming and third-party connectors that we don't have to build.
  2. Co-located and remote. HPCC record definitions. All scalar types: strings, numeric values, data blobs, records. Child datasets. Sets of scalars. Construct RDD of Rows. Easy translation from RDD to DataFrame. Easy translation from RDD to ML datasets. Field filtering on the HPCC side.
  3. Create a dataset in HPCC (it should show multiple scalar fields and child datasets). Read that dataset in Spark / Scala: print out the schema, count the number of rows, and print out the first and last rows. Modify the dataset in Spark (string concatenation). Write the dataset back to HPCC. Show the original dataset and Spark's changes. Demonstrate reading a dataset in PySpark.
  4. Create a dataset in HPCC. Read it from HPCC and write it to HDFS as a Thor file. Show the dataset in the HDFS web app. Read the dataset from HDFS and compare it to the HPCC dataset. Create another dataset in HPCC and append it to the Thor dataset in HDFS. Show that the dataset has expanded in the HDFS web app. Read the Thor dataset from HDFS and write it back as a CSV dataset. Read the CSV dataset in Spark?
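
The HDFS demo in note 4 relies on the three write modes listed on slide 19: Create Only, Overwrite, and Append. The sketch below illustrates what those modes mean, using ordinary local files in place of HDFS. It does not use the HDFS Connector's API: WriteMode, write_dataset, read_dataset, and demo are hypothetical names introduced for illustration, and the mapping onto Python's "x", "w", and "a" file modes is an assumption about the intended behavior.

```python
import os
import tempfile
from enum import Enum


class WriteMode(Enum):
    """Hypothetical names for the three write modes listed on slide 19."""
    CREATE_ONLY = "create_only"
    OVERWRITE = "overwrite"
    APPEND = "append"


def write_dataset(path: str, rows: list, mode: WriteMode) -> None:
    """Write rows as comma-separated lines to a local file (a stand-in for HDFS)."""
    if mode is WriteMode.CREATE_ONLY:
        flag = "x"  # raises FileExistsError if the file already exists
    elif mode is WriteMode.OVERWRITE:
        flag = "w"  # replaces any existing content
    else:
        flag = "a"  # adds rows after the existing content
    with open(path, flag, encoding="utf-8") as handle:
        for row in rows:
            handle.write(",".join(str(field) for field in row) + "\n")


def read_dataset(path: str) -> list:
    """Read the file back as a list of string fields per row."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n").split(",") for line in handle]


def demo() -> list:
    """Mirror the demo flow: create a dataset, append to it, read it back."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "people.csv")
        write_dataset(path, [(1, "Ada")], WriteMode.CREATE_ONLY)
        write_dataset(path, [(2, "Grace")], WriteMode.APPEND)
        return read_dataset(path)
```

Calling demo() returns both rows, which corresponds to the "dataset has expanded" step of the demo. A second Create Only write to the same path would fail instead of silently replacing the data.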