As part of the 2018 HPCC Systems Summit Community Day event:
The HPCC Systems platform team continues to expand interoperability with third-party systems, which broadens the platform's feature set and facilitates custom solutions. James will share an update on the latest connectors available, including the Spark-HPCC Connector and the upcoming HDFS Connector plugin.
James McMullan has a broad range of software engineering experience, from developing low-level system drivers for X-ray fluorescence equipment to mobile video games and web applications. He is a recent addition to the LexisNexis team and is part of the HPCC Systems platform team, where he has been working on connectors integrating HPCC Systems with the Spark and Hadoop ecosystems.
2. Overview
• Why Integrate HPCC Systems, Spark and Hadoop?
• Spark-Thor Component
• Goals, Features
• Spark-HPCC Connector
• Goals, Features, Demo
• HDFS Connector
• Goals, Features, Demo
• Closing Thoughts
3. Why Integrate HPCC Systems, Spark and Hadoop?
• Our goal: Allow you to do more
• Combine strengths of the different ecosystems
• Still in early stages
• Exploring the potential of these integrations
• More than one compelling use case
• Python statistical and ML libraries through PySpark
• New data formats
• New methods of consuming data
5. HPCC Systems Spark-Thor Component - Goals
• Easy setup of co-located Spark & HPCC Systems
• Easy configuration
• Allow custom configuration
• Unified startup of Spark & HPCC Systems
• Default configuration that works with HPCC Systems
• Log directories
• Work directories
• Resource allocation
6. HPCC Systems Spark-Thor Component - Features
• Spark-Thor Component Installation
• Packaged as a plugin
• Platform build option
• Easy Configuration through configmgr
• Spark cluster mirrors Thor cluster
• Resource allocation settings
• Custom configuration through spark-env.sh
• Default configuration
• Fixes common issues
• Works with HPCC Systems configuration
• Easily Start & Stop Spark
• Unified startup
• Uses existing HPCC Systems scripts
8. Spark-HPCC Connector - Goals
• Read and write data from Spark
• Reliable and easy to use
• Performant
• Memory usage
• I/O throughput
• Row construction cost
• Allow co-location of Spark and HPCC
• Use HPCC Systems data with Spark MLlib
• RDD and DataFrame support
9. Spark-HPCC Connector - Features
Reading data from HPCC Systems
• Co-located and remote
• HPCC Systems record definitions
• All scalar types, child datasets, and sets of scalars
• Construct RDD of Rows
• Easy translation from RDD to Dataframe
• Easy translation from RDD to ML Datasets
• Field & row filtering on HPCC Systems side
• Distributed reading
[Figure: distributed read with a row filter, e.g. "Accidents < 3", applied on the HPCC Systems side]
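To make the read path concrete, here is a minimal Scala sketch of reading an HPCC Systems logical file into Spark. The HpccFile class and its getDataframe method follow the connector's documented pattern, but the exact constructor arguments, the ESP URL, and the logical file name are illustrative assumptions; check the Spark-HPCC README for the current API.

```scala
import org.apache.spark.sql.SparkSession
// HpccFile is the connector's entry point for reads. The constructor
// arguments below (logical file name, ESP endpoint) are assumptions
// based on the documented usage pattern, not a verified signature.
import org.hpccsystems.spark.HpccFile

val spark = SparkSession.builder()
  .appName("spark-hpcc-read-sketch")
  .getOrCreate()

// Resolve the logical file through ESP, then build a DataFrame of Rows.
val hpccFile = new HpccFile("example::demo::accidents", "http://localhost:8010")
val df = hpccFile.getDataframe(spark)

df.printSchema()                 // schema derived from the HPCC record definition
println(s"rows: ${df.count()}")  // triggers a distributed read
```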
10. Spark-HPCC Connector - Features
Writing data to HPCC Systems
• Co-located writes only
• RDD<Row> of Spark SQL data-types
• Integer, Long, Float, Double, BigDecimal, String, Sequence, Row, byte[]
• Automated Row to Record translation
• Distributed writing
• Creation of new datasets only
[Diagram: automated Row to Record translation]
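A corresponding write sketch, assuming the connector's HpccFileWriter pattern. The constructor arguments, cluster name, and saveToHPCC signature are assumptions here, so consult the repo for the exact API.

```scala
import org.apache.spark.sql.{Row, SparkSession}
// HpccFileWriter handles the automated Row-to-Record translation and the
// distributed, co-located write. Its constructor and saveToHPCC signature
// below are assumptions based on the documented pattern.
import org.hpccsystems.spark.HpccFileWriter

val spark = SparkSession.builder()
  .appName("spark-hpcc-write-sketch")
  .getOrCreate()

// An RDD[Row] of supported Spark SQL types (Long, String, Double, ...).
val rows = spark.sparkContext.parallelize(Seq(
  Row(1L, "alpha", 3.14),
  Row(2L, "beta", 2.72)
))

val writer = new HpccFileWriter("http://localhost:8010", "user", "password")
// Creates a new logical file; "mythor" and the target name are illustrative.
writer.saveToHPCC(spark.sparkContext, rows, "mythor", "example::demo::spark_output")
```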
11. Spark-HPCC Connector - Features
PySpark Support
• Utilizes Scala / Java API
• Reading & Writing
• Same features & limitations
• Py4J library used to construct JavaRDDs
• PySpark picklers used for Serialization / Deserialization
• Uses PySpark configured pickler
• PySpark introduces additional overhead
• RDD Serialization and Deserialization required
12. Spark-HPCC Connector - Demo
13. Spark-HPCC Connector - Results
• Read and write data from Spark ✔
• Reliable and easy to use ✔
• Performant ✔
• Memory usage: no additional row/field overhead
• I/O throughput: 1.5 GBit/s
• Row construction cost: 30 million rows/s
• Co-location of Spark and HPCC Systems ✔
• HPCC Systems data with Spark MLlib ✔
• RDD and DataFrame support ✔
14. Spark-HPCC Connector - Roadmap
• Remote writes
• Improved performance
• Better data locality planning
• Scala/Java generic RDDs: RDD<YourClass>
• Automated field mapping
• Automated type conversion
• Automated filtering
• DataFrameReader and DataFrameWriter support, with no intermediate RDD (see the sketch below)
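Purely illustrative: a sketch of what the planned DataFrameReader / DataFrameWriter path might look like once implemented. The "hpcc" format name and the option keys below are hypothetical, not a shipped API.

```scala
// Hypothetical future API: none of the format/option names here exist yet.
// Assumes an active SparkSession `spark`.
val df = spark.read
  .format("hpcc")                              // hypothetical data source name
  .option("espUrl", "http://localhost:8010")   // hypothetical option key
  .option("file", "example::demo::accidents")  // hypothetical option key
  .load()

df.write
  .format("hpcc")
  .option("file", "example::demo::output")
  .mode("errorifexists")                       // creation of new datasets only
  .save()
```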
15. Spark-HPCC Connector - Availability
• HPCC Systems GitHub:
• https://github.com/hpcc-systems/Spark-HPCC
• We are open for feedback & feature requests
• We want to hear about your use cases!
• Pull requests welcome!
17. HDFS Connector - Goals
• Read and write HDFS data from HPCC Systems
• Reliable and easy to use
• Performant
• Memory usage
• I/O throughput
• Row construction cost
• Few dependencies
• Allow co-location of HPCC Systems and HDFS
• Support multiple file formats
18. HDFS Connector - Features
Reading from HDFS
• Thor and CSV files
• Thor files support all HPCC Systems record layouts, including variable-length records and records with children
• CSV files support only scalar datatypes: Integers, Reals, Decimals, Strings, Varstrings, UTF8
• Automatic field mapping and filtering
• Distributed Reading
• Dynamically splits datasets down to the HDFS block size (64 MiB)
[Diagram: distributed reading across Node 1 and Node 2]
19. HDFS Connector - Features
Writing to HDFS
• Supports Thor and CSV files
• Thor files support all HPCC Systems record layouts
• CSV files support only scalar datatypes: Integers, Reals, Decimals, Strings, Varstrings, UTF8
• Distributed writing
• Aware of HPCC Systems cluster topology
• Additional metadata added to Thor Files
• Record structure validation and dynamic splitting
• Multiple write modes: Create Only, Overwrite, or Append
[Diagram: distributed writing across Node 1 and Node 2]
20. HDFS Connector - Demo
21. HDFS Connector - Results
• Read and write HDFS data ✔
• Reliable and easy to use ✔
• Performant ✔
• Memory usage: no additional memory overhead
• I/O throughput: TBD
• Record construction cost: TBD
• Few dependencies ✔
• Co-location of HPCC Systems and HDFS ✔
• Support multiple file formats ✔
22. HDFS Connector - Roadmap
• Parquet support
• Reading & Writing
• Automatic column filtering
• Expand library support to Hadoop libhdfs
• Statically linking to Apache HAWQ libhdfs3
• Support Hadoop HDFS add-ons
• S3A client
• Performance tuning
23. HDFS Connector - Availability
• Available as Technical Preview
• HPCC Systems GitHub:
• https://github.com/hpcc-systems/HDFS-Connector
• We are open for feedback & feature requests
• We want to hear about your use cases!
• Pull requests welcome!
24. Closing Thoughts
• Integration work between HPCC Systems, Spark and Hadoop is ongoing
• Our goal: allow you to do more with your data
• Spark-HPCC and HDFS Connector available now
• Feedback, feature requests, and PRs wanted
• Tell us about your use cases!
Speaker Notes

Why integrate HPCC Systems, Spark and Hadoop?
• Our goal is to allow you to do more: integrating the strengths of HPCC Systems with the Spark and Hadoop ecosystems lets you do more with your data.
• Still in early stages: our team and other teams are exploring the potential of these integrations.
• More than one compelling use case:
• There are some great ML and statistical libraries being built in Python, and PySpark allows us to easily leverage them without having to add support ourselves.
• Support for different data formats (Parquet, Avro, ORC); many third parties are building Spark connectors for different data formats.
• Support for different ways of consuming data, such as Spark Streaming and third-party connectors we don't have to build.
Spark-HPCC Connector reading notes:
• Co-located and remote reads
• HPCC Systems record definitions
• All scalar types: strings, numeric values, data blobs, records
• Child datasets
• Sets of scalars
• Construct an RDD of Rows
• Easy translation from RDD to DataFrame
• Easy translation from RDD to ML datasets
• Field filtering on the HPCC Systems side
Spark-HPCC Connector demo script:
1. Create a dataset in HPCC Systems (showing multiple scalar fields and child datasets).
2. Read that dataset in Spark / Scala.
3. Print out the schema.
4. Count the number of rows.
5. Print out the first and last rows.
6. Modify the dataset in Spark via string concatenation (see the sketch below).
7. Write the dataset back to HPCC Systems.
8. Show the original dataset alongside Spark's changes.
9. Demonstrate reading a dataset in PySpark.
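Step 6 might look like the following in Spark SQL, assuming the dataset read in step 2 is the DataFrame df; the column name is hypothetical.

```scala
import org.apache.spark.sql.functions.{col, concat, lit}

// Append a suffix to a string column; "city" is a hypothetical column
// in the demo dataset (df) read from HPCC Systems.
val modified = df.withColumn("city", concat(col("city"), lit(" - updated")))
modified.show(5)
```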
HDFS Connector demo script:
1. Create a dataset in HPCC Systems.
2. Read it from HPCC Systems and write it to HDFS as a Thor file.
3. Show the dataset in the HDFS web app.
4. Read the dataset from HDFS and compare it to the HPCC Systems dataset.
5. Create another dataset in HPCC Systems and append it to the Thor dataset in HDFS.
6. Show that the dataset has expanded in the HDFS web app.
7. Read the Thor dataset from HDFS and write it back as a CSV dataset.
8. Read the CSV dataset in Spark?