As part of the 2018 HPCC Systems Summit Community Day event:
The HPCC Systems platform team continues to expand interoperability with third-party systems, which broadens the platform's feature set and facilitates custom solutions. James will share an update on the latest connectors available, including the Spark-HPCC Connector and the upcoming HDFS Connector plugin.
James McMullan has a broad range of software engineering experience, from developing low-level system drivers for X-ray fluorescence equipment to mobile video games and web applications. He is a recent addition to the LexisNexis team and is part of the HPCC Systems platform team, where he has been working on connectors integrating HPCC Systems with the Spark and Hadoop ecosystems.
2. Overview
• Why Integrate HPCC Systems, Spark and Hadoop?
• Spark-Thor Component
• Goals, Features
• Spark-HPCC Connector
• Goals, Features, Demo
• HDFS Connector
• Goals, Features, Demo
• Closing Thoughts
3. Why Integrate HPCC Systems, Spark and Hadoop?
• Our goal: Allow you to do more
• Combine strengths of the different ecosystems
• Still in early stages
• Exploring the potential of these integrations
• More than one compelling use case
• Python statistical and ML libraries through PySpark
• New data formats
• New methods of consuming data
5. HPCC Systems Spark-Thor Component - Goals
• Easy setup of co-located Spark & HPCC Systems
• Easy configuration
• Allow custom configuration
• Unified startup of Spark & HPCC Systems
• Default configuration that works with HPCC Systems
• Log directories
• Work directories
• Resource allocation
6. HPCC Systems Spark-Thor Component - Features
• Spark-Thor Component Installation
• Packaged as a plugin
• Platform build option
• Easy Configuration through configmgr
• Spark cluster mirrors Thor cluster
• Resource allocation settings
• Custom configuration through spark-env.sh
• Default configuration
• Fixes common issues
• Works with HPCC Systems configuration
• Easily Start & Stop Spark
• Unified startup
• Uses existing HPCC Systems scripts
8. Spark-HPCC Connector - Goals
• Read and write data from Spark
• Reliable and easy to use
• Performant
• Memory usage
• I/O throughput
• Row construction cost
• Allow co-location of Spark and HPCC
• Use HPCC Systems data with Spark MLlib
• RDD and DataFrame support
9. Spark-HPCC Connector - Features
Reading data from HPCC Systems
• Co-located and remote
• HPCC Systems record definitions
• All scalar types, child datasets, and sets of scalars
• Construct RDD of Rows
• Easy translation from RDD to Dataframe
• Easy translation from RDD to ML Datasets
• Field & row filtering on HPCC Systems side
• Distributed reading
[Figure: distributed read with a row filter, e.g. "Accidents < 3", applied on the HPCC Systems side]
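To make the read path concrete, here is a minimal Scala sketch of reading an HPCC Systems logical file into Spark. The HpccFile class and its getDataframe method follow the connector's documented pattern, but the exact constructor arguments, the ESP URL, and the logical file name are illustrative assumptions; check the Spark-HPCC README for the current API.

```scala
import org.apache.spark.sql.SparkSession
// HpccFile is the connector's entry point for reads. The constructor
// arguments below (logical file name, ESP endpoint) are assumptions
// based on the documented usage pattern, not a verified signature.
import org.hpccsystems.spark.HpccFile

val spark = SparkSession.builder()
  .appName("spark-hpcc-read-sketch")
  .getOrCreate()

// Resolve the logical file through ESP, then build a DataFrame of Rows.
val hpccFile = new HpccFile("example::demo::accidents", "http://localhost:8010")
val df = hpccFile.getDataframe(spark)

df.printSchema()                 // schema derived from the HPCC record definition
println(s"rows: ${df.count()}")  // triggers a distributed read
```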
10. Spark-HPCC Connector - Features
Writing data to HPCC Systems
• Co-located writes only
• RDD<Row> of Spark SQL data-types
• Integer, Long, Float, Double, BigDecimal, String, Sequence, Row, byte[]
• Automated Row to Record translation
• Distributed writing
• Creation of new datasets only
[Diagram: automated Row to Record translation]
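A corresponding write sketch, assuming the connector's HpccFileWriter pattern. The constructor arguments, cluster name, and saveToHPCC signature are assumptions here, so consult the repo for the exact API.

```scala
import org.apache.spark.sql.{Row, SparkSession}
// HpccFileWriter handles the automated Row-to-Record translation and the
// distributed, co-located write. Its constructor and saveToHPCC signature
// below are assumptions based on the documented pattern.
import org.hpccsystems.spark.HpccFileWriter

val spark = SparkSession.builder()
  .appName("spark-hpcc-write-sketch")
  .getOrCreate()

// An RDD[Row] of supported Spark SQL types (Long, String, Double, ...).
val rows = spark.sparkContext.parallelize(Seq(
  Row(1L, "alpha", 3.14),
  Row(2L, "beta", 2.72)
))

val writer = new HpccFileWriter("http://localhost:8010", "user", "password")
// Creates a new logical file; "mythor" and the target name are illustrative.
writer.saveToHPCC(spark.sparkContext, rows, "mythor", "example::demo::spark_output")
```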
11. Spark-HPCC Connector - Features
PySpark Support
• Utilizes Scala / Java API
• Reading & Writing
• Same features & limitations
• Py4J library used to construct JavaRDDs
• PySpark picklers used for Serialization / Deserialization
• Uses PySpark configured pickler
• PySpark introduces additional overhead
• RDD Serialization and Deserialization required
12. Spark-HPCC Connector - Demo
13. Spark-HPCC Connector - Results
• Read and write data from Spark ✔
• Reliable and easy to use ✔
• Performant ✔
• Memory usage: no additional row/field overhead
• I/O throughput: 1.5 GBit/s
• Row construction cost: 30 million rows/s
• Co-location of Spark and HPCC Systems ✔
• HPCC Systems data with Spark MLlib ✔
• RDD and DataFrame support ✔
14. Spark-HPCC Connector - Roadmap
• Remote writes
• Improved performance
• Better data locality planning
• Scala/Java generic RDDs: RDD<YourClass>
• Automated field mapping
• Automated type conversion
• Automated filtering
• DataFrameReader and DataFrameWriter support, with no intermediate RDD (see the sketch below)
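Purely illustrative: a sketch of what the planned DataFrameReader / DataFrameWriter path might look like once implemented. The "hpcc" format name and the option keys below are hypothetical, not a shipped API.

```scala
// Hypothetical future API: none of the format/option names here exist yet.
// Assumes an active SparkSession `spark`.
val df = spark.read
  .format("hpcc")                              // hypothetical data source name
  .option("espUrl", "http://localhost:8010")   // hypothetical option key
  .option("file", "example::demo::accidents")  // hypothetical option key
  .load()

df.write
  .format("hpcc")
  .option("file", "example::demo::output")
  .mode("errorifexists")                       // creation of new datasets only
  .save()
```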
15. Spark-HPCC Connector - Availability
• HPCC Systems GitHub:
• https://github.com/hpcc-systems/Spark-HPCC
• We are open for feedback & feature requests
• We want to hear about your use cases!
• Pull requests welcome!
17. HDFS Connector - Goals
• Read and write HDFS data from HPCC Systems
• Reliable and easy to use
• Performant
• Memory usage
• I/O throughput
• Row construction cost
• Few dependencies
• Allow co-location of HPCC Systems and HDFS
• Support multiple file formats
18. HDFS Connector - Features
Reading from HDFS
• Thor and CSV files
• Thor files support all HPCC Systems record layouts, including variable-length records and records with children
• CSV files support only scalar datatypes: Integers, Reals, Decimals, Strings, Varstrings, UTF8
• Automatic field mapping and filtering
• Distributed Reading
• Dynamically splits datasets down to the HDFS block size (64 MiB)
[Diagram: distributed reading across Node 1 and Node 2]
19. HDFS Connector - Features
Writing to HDFS
• Supports Thor and CSV files
• Thor files support all HPCC Systems record layouts
• CSV files support only scalar datatypes: Integers, Reals, Decimals, Strings, Varstrings, UTF8
• Distributed writing
• Aware of HPCC Systems cluster topology
• Additional metadata added to Thor Files
• Record structure validation and dynamic splitting
• Multiple write modes: Create Only, Overwrite, or Append
[Diagram: distributed writing across Node 1 and Node 2]
20. HDFS Connector - Demo
21. HDFS Connector - Results
• Read and write HDFS data ✔
• Reliable and easy to use ✔
• Performant ✔
• Memory usage: no additional memory overhead
• I/O throughput: TBD
• Record construction cost: TBD
• Few dependencies ✔
• Co-location of HPCC Systems and HDFS ✔
• Support multiple file formats ✔
22. HDFS Connector - Roadmap
• Parquet support
• Reading & Writing
• Automatic column filtering
• Expand library support to Hadoop libhdfs
• Statically linking to Apache HAWQ libhdfs3
• Support Hadoop HDFS add-ons
• S3A client
• Performance tuning
23. HDFS Connector - Availability
• Available as Technical Preview
• HPCC Systems GitHub:
• https://github.com/hpcc-systems/HDFS-Connector
• We are open for feedback & feature requests
• We want to hear about your use cases!
• Pull requests welcome!
24. Closing Thoughts
• Integration work between HPCC Systems, Spark and Hadoop is ongoing
• Our goal: allow you to do more with your data
• Spark-HPCC and HDFS Connector available now
• Feedback, feature requests, and PRs wanted
• Tell us about your use cases!
Speaker Notes

Why integrate HPCC Systems, Spark and Hadoop?
• Our goal is to allow you to do more: integrating the strengths of HPCC Systems with the Spark and Hadoop ecosystems lets you do more with your data.
• Still in early stages: our team and other teams are exploring the potential of these integrations.
• More than one compelling use case:
• There are some great ML and statistical libraries being built in Python, and PySpark allows us to easily leverage them without having to add support ourselves.
• Support for different data formats (Parquet, Avro, ORC); many third parties are building Spark connectors for different data formats.
• Support for different ways of consuming data, such as Spark Streaming and third-party connectors we don't have to build.
Spark-HPCC Connector reading notes:
• Co-located and remote reads
• HPCC Systems record definitions
• All scalar types: strings, numeric values, data blobs, records
• Child datasets
• Sets of scalars
• Construct an RDD of Rows
• Easy translation from RDD to DataFrame
• Easy translation from RDD to ML datasets
• Field filtering on the HPCC Systems side
Spark-HPCC Connector demo script:
1. Create a dataset in HPCC Systems (showing multiple scalar fields and child datasets).
2. Read that dataset in Spark / Scala.
3. Print out the schema.
4. Count the number of rows.
5. Print out the first and last rows.
6. Modify the dataset in Spark via string concatenation (see the sketch below).
7. Write the dataset back to HPCC Systems.
8. Show the original dataset alongside Spark's changes.
9. Demonstrate reading a dataset in PySpark.
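Step 6 might look like the following in Spark SQL, assuming the dataset read in step 2 is the DataFrame df; the column name is hypothetical.

```scala
import org.apache.spark.sql.functions.{col, concat, lit}

// Append a suffix to a string column; "city" is a hypothetical column
// in the demo dataset (df) read from HPCC Systems.
val modified = df.withColumn("city", concat(col("city"), lit(" - updated")))
modified.show(5)
```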
HDFS Connector demo script:
1. Create a dataset in HPCC Systems.
2. Read it from HPCC Systems and write it to HDFS as a Thor file.
3. Show the dataset in the HDFS web app.
4. Read the dataset from HDFS and compare it to the HPCC Systems dataset.
5. Create another dataset in HPCC Systems and append it to the Thor dataset in HDFS.
6. Show that the dataset has expanded in the HDFS web app.
7. Read the Thor dataset from HDFS and write it back as a CSV dataset.
8. Read the CSV dataset in Spark?