This document provides a summary of HPCC Systems, including:
1. A brief history and overview of the architecture with a use case example of calculating insurance policy data within a specified radius.
2. Descriptions of the main components of HPCC Systems - Thor for batch processing, Roxie for real-time queries, and ECL as the data-oriented programming language.
3. Information on how HPCC Systems can be integrated with other systems and technologies through connectors, drivers, and the ability to embed other languages.
4. A case study on using relationships between people to predict risk.
Introduction to HPCC Systems
4. The Challenge
For a given X and Y coordinate, calculate the following within a specified radius:
• Total number of policies
• Total value of policies
Update each record with this information.
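The challenge above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the deck's implementation: the `Policy` fields, the function name, the use of planar Euclidean distance, and the choice to count a policy as its own neighbour are all hypothetical. The brute-force loop compares every record with every other record, which is what makes this an N² problem at scale.

```python
from dataclasses import dataclass
from math import hypot
from typing import List


@dataclass
class Policy:
    # Hypothetical record layout; the slide only mentions X, Y and a policy value.
    policy_id: str
    x: float
    y: float
    value: float
    nearby_count: int = 0
    nearby_value: float = 0.0


def annotate_within_radius(policies: List[Policy], radius: float) -> List[Policy]:
    """Update each record with the count and total value of policies within `radius`.

    Assumption: distance is planar Euclidean and a policy counts itself.
    """
    for p in policies:
        neighbours = [q for q in policies if hypot(p.x - q.x, p.y - q.y) <= radius]
        p.nearby_count = len(neighbours)
        p.nearby_value = sum(q.value for q in neighbours)
    return policies


sample = annotate_within_radius(
    [
        Policy("A", 0.0, 0.0, 100.0),
        Policy("B", 3.0, 4.0, 200.0),
        Policy("C", 10.0, 10.0, 50.0),
    ],
    radius=5.0,
)
```

On a cluster, the same comparison would be partitioned across nodes rather than run in a single loop.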
5. Data Flow Oriented Big Data Platform
[Diagram labels: Raw data from several sources; Batch requests for scoring and analytics; Batch Processed Data; ESP Middleware Services; Batch, Subscribers, Portal]
Thor (data refinery)
• Shared Nothing MPP Architecture
• Commodity Hardware
• Batch ETL and Analytics
ECL
• Easy to use
• Implicitly Parallel
• Compiles to C++
ROXIE (data delivery)
• Shared Nothing MPP Architecture
• Commodity Hardware
• Real-time Indexed Based Query
• Low Latency, Highly Concurrent and Highly Redundant
6. Thor – The Batch Processing Analytics Engine
[Diagram labels: Raw data from several sources; Thor; ECL; ROXIE; Batch reporting requests; Batch, Subscribers, Reporting]
Massively Parallel Extract Transform and Load (ETL) engine
• Built from the ground up as a parallel data environment
Enables data integration on a scale not previously available
• Current LexisNexis person data build process generates 350 billion intermediate results at peak
Suitable for:
• Massive joins/merges
• Massive sorts and transformations
• Any N² problem
“Identify and catalog all the stars in the Milky Way galaxy”
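To make the idea of a massive join on a shared-nothing cluster concrete, here is a minimal Python sketch of a hash-partitioned join. It is a generic textbook technique, not Thor's actual implementation: the function names, the dictionary record layout, and the `num_nodes` parameter are assumptions made for illustration. Both inputs are partitioned on the join key so that matching rows land on the same (simulated) node, and each node then joins only its own partition.

```python
from collections import defaultdict


def partition_by_key(rows, key, num_nodes):
    """Assign each row to a simulated node by hashing its join key."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[hash(row[key]) % num_nodes].append(row)
    return parts


def local_hash_join(left, right, key):
    """Join two partitions that live on the same node."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def distributed_join(left, right, key, num_nodes=4):
    """Partition both inputs the same way, then join partition by partition."""
    left_parts = partition_by_key(left, key, num_nodes)
    right_parts = partition_by_key(right, key, num_nodes)
    joined = []
    for node in range(num_nodes):  # each iteration stands in for one independent node
        joined.extend(local_hash_join(left_parts[node], right_parts[node], key))
    return joined


people = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}, {"id": 3, "name": "C"}]
claims = [{"id": 1, "claim": "x"}, {"id": 1, "claim": "y"}, {"id": 3, "claim": "z"}]
result = distributed_join(people, claims, "id")
```

Because no node needs data held by another node once partitioning is done, the per-node joins could run in parallel; the loop here runs them one after another only to keep the sketch short.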
7. ROXIE – The Real-Time Analytics Delivery Engine
A massively parallel, high throughput, structured query response engine
Ultra fast due to its read-only nature
Allows indices to be built onto data for efficient multi-user retrieval of data
Suitable for:
• Volumes of structured queries
• Full text ranked Boolean search
“I want the star Alpha Centauri”
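The indexed, read-only lookup described above can be illustrated with a small Python sketch. It is not ROXIE's implementation: the `build_index` and `query` helpers and the star records are hypothetical. The point is that the index is built once over data that does not change, so every query becomes a key lookup instead of a scan.

```python
from collections import defaultdict


def build_index(records, field):
    """Build the index once, ahead of query time, over read-only data."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[record[field]].append(position)
    return dict(index)


def query(records, index, value):
    """Answer a query with an index lookup rather than a full scan."""
    return [records[i] for i in index.get(value, [])]


# Immutable sample data (hypothetical catalog entries).
stars = (
    {"name": "Sirius", "constellation": "Canis Major"},
    {"name": "Alpha Centauri", "constellation": "Centaurus"},
    {"name": "Vega", "constellation": "Lyra"},
)
name_index = build_index(stars, "name")
hits = query(stars, name_index, "Alpha Centauri")
```

Since neither the records nor the index are modified after the build, many readers can query them concurrently without locking.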
8. ECL – The Data Flow Oriented Programming Language
• An easy to use, data-centric programming language optimized for large-scale data management and query processing
• Highly efficient: automatically distributes workload across all nodes
• Industry analysts: “80% more efficient than C++, Java and SQL – 1/3 reduction in programmer time to maintain/enhance existing applications”
• Benchmark against SQL (5 times more efficient) for code generation
• Automatic parallelization and synchronization of sequential algorithms for parallel and distributed processing. Compiles to C++
• Large library of built-in modules to handle common data manipulation tasks. Can embed/import: C++, Python, JavaScript, R, Java
Declarative programming language … powerful, extensible, implicitly parallel, maintainable, complete and homogeneous
10. A Robust and Proven Platform for IoT
HPCC Systems Platform: Distributed Massively Parallel Architecture
[Diagram labels: ROXIE, Thor, Cassandra, Real-time Services, Data Collection, Rules Execution, Alert Delivery, Search, BI]
• Real-time indexed based search
• Real-time rules execution
• Alert call back
• Real-time store
• Real-time analytics on real-time data
• Long term store
• Batch analytics
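As a rough illustration of real-time rules execution with an alert call back, the following Python sketch evaluates a set of rules against each incoming reading and invokes a callback for every rule that fires. All names here (`Rule`, `process_reading`, the sensor fields and thresholds) are hypothetical and are not part of any HPCC Systems API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Reading = Dict[str, float]


@dataclass
class Rule:
    name: str
    condition: Callable[[Reading], bool]


def process_reading(
    reading: Reading,
    rules: List[Rule],
    on_alert: Callable[[str, Reading], None],
) -> List[str]:
    """Evaluate every rule against one reading; call back for each rule that fires."""
    fired = []
    for rule in rules:
        if rule.condition(reading):
            on_alert(rule.name, reading)
            fired.append(rule.name)
    return fired


# Hypothetical rules and a callback that simply records the alerts it receives.
rules = [
    Rule("overheat", lambda r: r["temperature"] > 80.0),
    Rule("low_battery", lambda r: r["battery"] < 0.2),
]
alerts = []
fired_hot = process_reading(
    {"temperature": 95.0, "battery": 0.9}, rules, lambda name, r: alerts.append(name)
)
fired_ok = process_reading(
    {"temperature": 20.0, "battery": 0.9}, rules, lambda name, r: alerts.append(name)
)
```

In a deployed system the callback would deliver the alert to a subscriber instead of appending to a list.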
13. HPCC: Internet of Things Architecture
ROXIE
• REST
• SOAP
• Websocket
• IPv6
• 6LoWPAN
• UDP
• uIP
• DTLS
• MQTT
• CoAP
• ROLL
• XMPP-IoT
• Mihini/M3DA
Thor
Index Updates
• AMQP
• DDS
• LLAP
• LWM2M
• SSI
• IOTDB
• SensorML
• IPSO
• Telehash
• TSMP
• NanoIP
• ONS 2.0
Adapter
[Figure: sample dashboard charts; axes AMT and DATE; status levels Good, Fair, Danger]
14. HPCC Systems Technology: Big Data Is Our Core Competency
SPEED
• Scales to extreme workloads quickly and easily
• Increases speed of development, leading to faster production/delivery
• Improves developer productivity
CAPACITY
• Enables massive joins, merges, transformations, sorts, or tough N² problems
• Increases business responsiveness
• Accelerates creation of new services via rapid prototyping capabilities
• Offers a platform for collaboration and innovation leading to better results
COST SAVINGS
• Leverages commodity hardware so fewer people can do much more in less time
• Uses IT resources efficiently via sharing and higher system utilization
• Open source since 2011
17. Our Solutions Are Powered by HPCC at Their Core
• Grid computing
• Data-centric language (ECL)
• Integrated delivery system that offers data plus analytics
[Diagram columns: Unstructured and Structured Content; High Performance Computing Cluster Platform (HPCC); Analysis Applications; Key Capabilities]
Content: Big Data, Structured Records, Unstructured Records, News Articles, Proprietary Data, Public Records
• Over 4 petabytes of content
• 50 billion records
• 20,000 sources
• 8.9 billion unique name and address combinations
• Multi-bureau/multi-source models and bureau roll-over support
• Extensive experience leveraging atomic level data, combining and leveraging disparate data
• Approximately 400 models deployed (custom and flagship)
• Data and analytics
• Identity verification and authentication
• Fraud detection and prevention
• Investigation
• Screening
• Receivables management
Platform: Fusion, Linking, Refinery
Industries: Financial Services, Government, Health Care, Insurance, Legal, Retail
Open Source Components
Analysis: Complex Analysis, Clustering Analysis, Link Analysis, Entity Resolution
18. Example: Understanding People Relations Helps Us Predict Risk
Four Petabytes of Information:
• 50 billion records
• 20,000 sources
• Several million records added daily
Data holdings:
• 8.9 B unique name/address combos
• 4 B property records
• 37 M unique businesses
• 417 M criminal records
• 269 M auto and home claim records
• 188.5 M unique cell phones
• 16.5 B consumer records
• 3.7 B motor vehicle registrations
Example relationship graph:
• Aliases: K.R. Jones, Kathy Jones, Kathy R. Jones, Kathy Schroeder
• Personal info: SSN xxx-xx-xxxxx; Mobile Phone 630.555.9876
• Lived at: 321 High St., Chicago, IL 60540 (2000 – 2013); 123 Avenue, San Francisco, CA 94107 (2013 – Present)
• Owns: Boat License #414567; Car VIN #RGSWA04A87B1xxxxx
• Involved in: DUI Case #4859xxx-xxx; Felony Indictment Chicago C#0404-xxx
• Filed for: Bankruptcy (September 12, 2013); Loan Application (January 30, 2015)
19. Example: Understanding People Relations Helps Us Predict Risk
• Collect largest, broadest, deepest, most accurate, up-to-date repository of public record and contributory data
• Clean and standardize the data
• Identify unique entities using sophisticated learning techniques
• Create the social relationships
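The step of identifying unique entities across name variants can be illustrated with a deliberately simple Python sketch. The slide refers to sophisticated learning techniques; the code below swaps in a much cruder stand-in, exact matching on shared identifiers combined with a union-find structure, purely to show the shape of the problem. The function name, record layout, and the unrelated "John Doe" record are hypothetical.

```python
def resolve_entities(records, link_fields):
    """Group records that share any identifier value, directly or transitively."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    first_seen = {}  # (field, value) -> index of the first record carrying it
    for i, record in enumerate(records):
        for field in link_fields:
            value = record.get(field)
            if value is None:
                continue
            key = (field, value)
            if key in first_seen:
                union(i, first_seen[key])
            else:
                first_seen[key] = i

    clusters = {}
    for i, record in enumerate(records):
        clusters.setdefault(find(i), []).append(record["name"])
    return sorted(clusters.values())


records = [
    {"name": "K.R. Jones", "phone": "630.555.9876"},
    {"name": "Kathy Jones", "phone": "630.555.9876", "ssn": "xxx-xx-xxxxx"},
    {"name": "Kathy Schroeder", "ssn": "xxx-xx-xxxxx"},
    {"name": "John Doe", "phone": "000.000.0000"},  # hypothetical unrelated record
]
entities = resolve_entities(records, ["phone", "ssn"])
```

Here "K.R. Jones" and "Kathy Schroeder" share no identifier directly, yet they end up in the same cluster because each links to "Kathy Jones". Real entity resolution also has to cope with typos, missing values, and conflicting evidence, which exact matching does not handle.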
23. Why HPCC?
• Efficient MPP + sub-second queries
• Consistent support, all in one platform
• Scales out to thousands of nodes
• Great learning curve
• Fast development
• Open source since 2011 : Apache 2.0
• Reliable, mature : 10+ years in production
24. Next steps
• Virtual Machine image
• Online training : vouchers available
• Documentation
• Forum : online community
• External testimonies and use cases
• Meetups
25. Useful Links
• HPCC Meetups : http://www.meetup.com/HPCC-Dublin-Big-Data
• HPCC Systems: https://hpccsystems.com/
• Community forums: https://hpccsystems.com/bb
• The HPCC Systems blog: https://hpccsystems.com/resources/blog
• Online training: learn.lexisnexis.com/hpcc
• Summit: https://hpccsystems.com/community/events/2015-hpcc-systems-engineering-summit-community-day
• HPCC on YouTube: https://www.youtube.com/user/HPCCSystems/videos
• GitHub: https://github.com/hpcc-systems
• Lambda architecture : http://cdn.hpccsystems.com/whitepapers/Lambda.pdf
• Performance : https://hpccsystems.com/resources/blog/lchapman/look-whats-coming-soon-hpcc-systems-600-beta-2
• JDBC Driver : https://hpccsystems.com/download/third-party-integrations/hpcc-jdbc-driver
• HDFS to HPCC Connector : http://cdn.hpccsystems.com/install/h2h/1.4.4-1/docs/HDFS_to_HPCC_Connector-1.4.4-1.pdf
• HPCC on AWS : https://aws.hpccsystems.com/aws/getting_started/
HPCC Systems - Online Resources