Join us as we continue this series of webinars designed for the community, by the community, with the goal of sharing knowledge, sparking innovation, and strengthening relationships within our HPCC Systems community.
Episode 12 includes Tech Talks featuring speakers from our community on topics covering exploratory data analysis, geospatial solutions and ECL Tips leveraging the HPCC Systems platform.
1) Itauma Itauma, PhD Candidate, Keiser University - Conducting exploratory data analysis in educational research using HPCC Systems®
2) Ignacio Calvo, LexisNexis Risk Solutions - Big Data and Geospatial with HPCC Systems®
3) Bob Foreman, Senior Software Engineer, HPCC Systems, LexisNexis Risk Solutions - ECL Tip of the Month
2. Welcome!
• Please share: Let others know you are here with #HPCCTechTalks
• Ask questions! We will answer as many questions as we can following each speaker.
• Look for polls at the bottom of your screen. Exit full-screen mode or refresh your screen if
you don’t see them.
• We welcome your feedback - please rate us before you leave today and visit our blog for
information after the event.
• Want to be one of our featured speakers? Let us know! techtalks@hpccsystems.com
The Download: Tech Talks #HPCCTechTalks
3. Community announcements
Watch for details, announced soon!
Dr. Flavio Villanustre
VP Technology
RELX Distinguished Technologist
LexisNexis® Risk Solutions
Flavio.Villanustre@lexisnexisrisk.com
• HPCC Systems® Platform updates
• 6.4.12 is the latest gold version / Community Changelog
• 7.0.0 Beta planned for early Q2 – among the key features:
• Spark integration
• Indexer
• Record Translation
• Session Management Improvements
• VS Code Beta version
• Roadmap items for 2018 and beyond
• New Case Study
• 3LOQ leverages HPCC Systems in their Habitual AI solution
• Latest Blogs
• Tips and Tricks for ECL – Part 2 - PARSE
• Fly on the wall at our first Hackathon
• Reminder: 2018 Summer Internship Proposal Period Open Through April 6, 2018
• Interested candidates can submit proposals from the Ideas List
• Program runs late May through mid August
• Visit the Student Wiki for more details
2018 HPCC Systems
Community Day
4. Coming soon - 10K Trees Campaign for Earth Day
World Planting Day, March 21
through Earth Day on April 22
• Help us help the environment on behalf of our
community!
• HPCC Systems is dedicated to the environment
and is giving you the opportunity to take
action and be a small part of a big impact.
• HPCC Systems, partnering with the National
Forest Foundation, is growing and promoting
awareness of environmental sustainability with
their 10,000 Trees challenge.
5. Today’s speakers
Itauma Itauma
PhD Candidate,
Keiser University
amightyo@gmail.com
Itauma Itauma is a doctoral candidate at Keiser University and a computer science
instructor at Wayne State University. His interests lie in learning analytics and utilizing
HPCC Systems for educational research. He has an undergraduate degree in Electrical
Engineering from the University of Ilorin and two master's degrees: a Master of Science
in Computer Engineering from Istanbul Technical University, majoring in human-robot
interaction, and a Master of Science in Computer Science from Wayne State University,
where his thesis was based on leveraging HPCC Systems for Big Data analytics.
Featured Community Speaker
6. Today’s speakers
Ignacio Calvo
Software Engineering Lead
LexisNexis Risk Solutions
Ignacio.Calvo@lexisnexisrisk.com
Ignacio is a Software Engineering Lead with 17 years of experience in the
development of IT projects for different markets (insurance, finance, telecom,
retail). He has worked at LexisNexis for 5 years, creating Big Data solutions
with geospatial capabilities using HPCC Systems. He is the organizer of the HPCC
Systems meetup group in Dublin and a CoderDojo mentor.
Bob Foreman
Senior Software Engineer
LexisNexis Risk Solutions
Robert.Foreman@lexisnexisrisk.com
Bob Foreman has worked with the HPCC Systems technology platform and
the ECL programming language for over 5 years, and has been a technical
trainer for over 25 years. He is the developer and designer of the HPCC
Systems Online Training Courses, and is the Senior Instructor for all
classroom and Webex/Lync based training.
7. Conducting exploratory data analysis in
educational research using HPCC Systems®
Itauma Itauma
PhD Candidate
Keiser University
8. Quick poll:
How strongly correlated do you think identification
with math, and confidence in the ability to succeed
in math are?
See poll on bottom of presentation screen
9. Outline
• What is Exploratory Data Analysis (EDA)?
• Why is EDA Important?
• Techniques, Types, and Steps
• Role in Educational Research
• The HPCC Systems Advantage in Educational Research
• Data Visualization Examples
• Exploring the HSLS:09 Dataset
10. What is Exploratory Data Analysis (EDA)?
• Broad, open-minded overview of data
• Converts data from its raw form to a form
that makes sense
• Allows the data to speak for itself with no
assumptions made
• No rigidity with rules
• An important first step in data analysis
11. Exploratory Data Analysis
• Consists of:
• Organizing and summarizing raw data
• Looking for important features and patterns in
the data
• Looking for any striking deviations from any
pattern found
• Interpreting findings in the context of the
research question
12. Why is Exploratory Data Analysis Important?
• Gain new insight
• Explore data structures
• Detect missing data
• Check significant variables
• Examine relationships between
variables
• Select an appropriate model
• Check model assumptions
13. Importance of Exploratory Data Analysis
• Summarizes data
• Often reveals new ways to think about data.
• Helps in refining research questions and
sometimes reveals new questions.
• After EDA, we are able to ask specific questions
of our data
14. Techniques of EDA
• Usually graphical
• May be combined with quantitative techniques.
• Visualization helps to discover data patterns.
• raw data plots such as traces, histograms, and
probability plots;
• simple statistics plots such as mean plots, standard
deviation plots, and box plots.
• No limitation to these techniques
• A researcher can develop novel ways to visualize
data
15. Types and Steps in Exploratory Data Analysis
• Graphical vs Non-graphical
• Univariate vs Multivariate
• Examine one variable at a time
• Summarize and then examine the distribution of variable(s) of interest.
• What values the variables take
• How often the variables take those values.
• Can come up with different research questions and choose to analyze the
data in different ways.
• Data is so awesome and having a tool that makes it very easy to analyze
makes it fun and exciting.
16. Exploratory Data Analysis
• Statistics: collects data, summarizes data, and interprets data
• Statistics plays a significant role in social sciences which includes the field of
education. Converts data into useful information.
• EDA= Data Visualization + Statistics = Better data decision making
17. Educational Research
• Systematic and organized inquiry applied to
collecting, analyzing, and reporting
information that addresses educational
problems and questions (McMillan, 2015)
• Describe
• Predict
• Improve
• Explain
• Important for the advancement of knowledge
in the field of education
18. Machine Learning vs Statistical Learning: The HPCC Systems
Advantage in Educational Research
• Era of big data, learning analytics and
personalized learner experience
• Machine learning needed to build systems that
learn from data
• Learning analytics: The process of quantifying,
analyzing, and reporting learner data to discover
patterns and enhance learning to improve
learner performance (Siemens & Baker, 2012)
• Growing data collection from learning platforms
and devices to create personalized data-driven
learning programs for learner success
19. Machine Learning vs Statistical Learning: The HPCC Systems
Advantage in Educational Research
• Educational researchers need tools that can
handle big data, and will benefit from the
use of HPCC Systems.
• Statistical learning is limited because of the
need to develop a hypothesis and make
assumptions about the data before building
a model.
• In machine learning, algorithms are flexible,
run directly on the model, and output the
requested features with the data speaking
for itself.
20. Machine Learning vs Statistical Learning: The HPCC Systems
Advantage in Educational Research
• Common statistical tools widely used in educational research, such as SPSS, are
limited in terms of scalability and big data.
• HPCC Systems is open source, and can handle both data visualization and
statistical analysis, all integrated in the platform.
• The HPCC Systems Visualization Bundle provides visual representations of
data analysis.
• HPCC Systems can also perform simple descriptive & inferential statistics.
https://github.com/hpcc-systems/Visualization
21. Data Visualization in HPCC Systems
• Visualization bundle is an open-source add-
on to the HPCC Systems platform to allow
the creation of visualizations from the
results of queries written in ECL
• Important means of conveying information
from massive datasets
• Pie Charts, Line graphs, Maps, and other
visual graphs
• Simplifies the complex
• In addition, the underlying visualization
framework supports advanced features to
allow the combination of graphs to make
interactive dashboards
• Integration of Tableau in HPCC Systems
(Alternative)
22. Data Visualization Examples
• In a previous study, HPCC Systems
ML correlation and regression
modules were used to determine
the strength of the correlation
between chocolate consumption,
life expectancy, and happiness.
23. Exploring the HSLS:09 Dataset
• The US High School Longitudinal
Study of 2009 (HSLS:09) is a national
cohort study that follows over 23,000
students, who were ninth graders at 944
schools in 2009, through their secondary
and post-secondary years.
• Focus of the HSLS:09 includes
students’ trajectories from high
school, and how students choose
college majors and careers.
24. Exploring the HSLS:09 Dataset
• Research Question: Is Math Identity associated
with Math Self-efficacy?
• Math identity: the level of a student's
identification with math, represented by
agreement with the statements "You see
yourself as a math person" and/or "Others see
me as a math person".
• Self-efficacy: the level of confidence a student
has about the ability to succeed.
• STEM (Science, Technology, Engineering,
Mathematics)
• Let’s find out! Remember, EDA helps in refining
research questions and sometimes reveals new
questions.
25. Exploring the HSLS:09 Dataset
• CSV file sprayed into the HPCC Systems
cluster
• Recordset filtering
• Three features projected
• X1MTHID – Math Identity
• X1MTHEFF – Math Self-efficacy
• X2SEX – Gender
26. Exploring the HSLS:09 Dataset
• Next, query the dataset
• Descriptive Statistics
27. Exploring the HSLS:09 Dataset
• Is Math Identity associated
with Math Self-efficacy?
• Sub-question identified: Is
this association different
between males and females?
Correlation Coefficient
Correlation between Math Identity and Math Self Efficacy 0.6303
Correlation between Math Identity and Math Self Efficacy of Males 0.6237
Correlation between Math Identity and Math Self Efficacy of Females 0.6381
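A minimal sketch of how such coefficients can be computed, assuming they are Pearson correlations (the slides do not name the method). The talk used HPCC Systems, not Python. The function names and the sample scores below are hypothetical and made up for illustration; they are not HSLS:09 data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for illustration only (NOT taken from HSLS:09).
records = [
    {"sex": "M", "math_identity": -0.5, "math_efficacy": -0.2},
    {"sex": "M", "math_identity": 0.3, "math_efficacy": 0.1},
    {"sex": "M", "math_identity": 1.1, "math_efficacy": 0.9},
    {"sex": "F", "math_identity": -0.8, "math_efficacy": -0.6},
    {"sex": "F", "math_identity": 0.0, "math_efficacy": 0.2},
    {"sex": "F", "math_identity": 0.9, "math_efficacy": 0.7},
]


def correlation_for(rows):
    """Correlation between the two scores over the given rows."""
    return pearson(
        [r["math_identity"] for r in rows],
        [r["math_efficacy"] for r in rows],
    )


# Overall coefficient, then one coefficient per group (the sub-question).
overall = correlation_for(records)
by_sex = {
    sex: correlation_for([r for r in records if r["sex"] == sex])
    for sex in ("M", "F")
}
```

Filtering the rows before calling `correlation_for` mirrors the sub-question on the slide: the same calculation is repeated on the male and female subsets.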
29. Data Interpretations
• Seeing oneself as a math person, or being seen as one, is associated with
increased confidence in math
• The association is slightly stronger for females (0.6381) than for males (0.6237)
• The higher the level of a student’s identification with math, the higher the
confidence to succeed in math. This can be a strong factor in students’
decisions to enroll in STEM programs.
• *Correlation does not imply causation*
30. Quick poll:
Would you consider using HPCC Systems
for exploratory data analysis?
See poll on bottom of presentation screen
43. What is a projection?
Projections are used to represent the world in ways
we can process
• The Earth is round and maps are flat
• Physical Maps
• Computer Maps
Have I seen projections before?
• Peters vs Mercator vs Winkel tripel
• GPS (latitude/longitude)
• Google Maps
46. Projections to know about
WGS84
• Latitude and longitude
• Our best approximation of the world
• Not always the best for a specific region
• Not technically a projection
Mercator
• Many different ones, choose one based on your location
• Reduces the area it covers to a simple Cartesian plane
• Good near the central axis, bad far away from it:
• Web Mercator covers the whole world: good near the equator, gets worse as you travel north or
south
• NAD83 / Georgia East, British National Grid, Irish National Grid…
Very good for that territory, awful anywhere else
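A minimal sketch of the Web Mercator point, using the standard spherical Web Mercator (EPSG:3857) formulas. The function names are hypothetical and this is not the code used in the talk. It converts WGS84 latitude/longitude to planar metres and shows how the stretch factor grows away from the equator.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 semi-major axis, used as the sphere radius
MAX_LATITUDE = 85.051129    # conventional Web Mercator latitude cut-off


def wgs84_to_web_mercator(lat_deg, lon_deg):
    """Project WGS84 degrees to Web Mercator (x, y) in metres."""
    if abs(lat_deg) > MAX_LATITUDE:
        raise ValueError("latitude outside the Web Mercator range")
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2)
    )
    return x, y


def mercator_scale_factor(lat_deg):
    """Linear stretch relative to the equator: 1 / cos(latitude)."""
    return 1.0 / math.cos(math.radians(lat_deg))


# Distances are true to scale at the equator and doubled at 60 degrees north.
equator_scale = mercator_scale_factor(0.0)
scale_at_60 = mercator_scale_factor(60.0)
```

Because lengths are stretched by `1 / cos(latitude)`, areas are stretched by the square of that factor, which is why high-latitude regions look oversized on a Web Mercator map.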
50. Bringing Geospatial into HPCC Systems
GOAL
Bring our geospatial processes
into the realm of Big Data
51. STEPS
Extend HPCC Systems and ECL to support the following
main capabilities:
• Spatial filtering of vector geometries
• Spatial operations using vector geometries
• Spatial reference projection and transformation
• Reading of compressed geo-raster files
53. Ingesting vector data
It’s a CSV file.
1. Policy data
2. Geocode address
3. Peril tag
Data ready to ingest
Id  Name            Geometry                           Projection  Value
1   Alice’s place   POINT (53.78925462 -6.08354321)    4326*       €5,973,000
2   Bob’s place     POINT (-34.78925462 7.08354321)    4326        €872,000
3   Celine’s place  POINT (102.78925462 -6.08354321)   4326        €9,324,000
* WGS84 (Lat/Lon)
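A minimal sketch of reading rows like the ones on this slide. It assumes a comma-delimited file with quoted values, which is a guess about the layout, and the function and key names are hypothetical. The two POINT coordinates are kept in the order they appear, without deciding which is latitude and which is longitude.

```python
import csv
import io
import re

# Sample text mirroring the slide's rows; the delimiter and quoting are assumed.
SAMPLE = '''Id,Name,Geometry,Projection,Value
1,Alice's place,POINT (53.78925462 -6.08354321),4326,"€5,973,000"
2,Bob's place,POINT (-34.78925462 7.08354321),4326,"€872,000"
3,Celine's place,POINT (102.78925462 -6.08354321),4326,"€9,324,000"
'''

# Matches a two-dimensional WKT point such as "POINT (1.5 -2)".
POINT_RE = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$"
)


def parse_point(wkt):
    """Return the two coordinates of a WKT POINT, in source order."""
    match = POINT_RE.match(wkt)
    if match is None:
        raise ValueError(f"not a 2D WKT point: {wkt!r}")
    return float(match.group(1)), float(match.group(2))


def parse_value(text):
    """Turn a string like '€5,973,000' into an integer amount."""
    return int(text.replace("€", "").replace(",", "").strip())


def load_rows(text):
    """Parse the CSV text into a list of typed dictionaries."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({
            "id": int(rec["Id"]),
            "name": rec["Name"],
            "coords": parse_point(rec["Geometry"]),
            "srid": int(rec["Projection"]),  # 4326 is WGS84
            "value_eur": parse_value(rec["Value"]),
        })
    return rows


rows = load_rows(SAMPLE)
```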
54. Ingesting vector data
It’s a GML / XML file.
1. Shape data
2. Parse XPATH
3. Process and index
Data ready to query
58. Indexing vector data
R-tree
• Outline Box: Biggest rectangle
• Boxes contain boxes
• Bottom box in the tree contains actual
geometries
• Here, 3 levels pictured
• Boxes can overlap (entries are only in one)
59. Querying vector data
Searching an R-Tree: e.g. finding all buildings (points) inside a flood zone (polygon)
1. Does the query polygon overlap our box?
• N: Return empty list
• Y: Go to step 2
2. Is it a leaf node?
• Y: Return all nodes for verification
• N: Search our boxes’ children
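The search described on this slide can be sketched as a short recursive function. This is an illustration, not the implementation from the talk: the `Node` class, the helper names and the sample boxes are hypothetical, and the query polygon is simplified to its bounding box. The returned entries are only candidates, which is why the slide says they still need verification against the real polygon.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def boxes_overlap(a: Box, b: Box) -> bool:
    """True when two axis-aligned boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Node:
    box: Box
    children: List["Node"] = field(default_factory=list)
    entries: List[str] = field(default_factory=list)  # filled on leaves only

    @property
    def is_leaf(self) -> bool:
        return not self.children


def search(node: Node, query_box: Box) -> List[str]:
    """Collect candidate entries whose leaf box overlaps the query box."""
    if not boxes_overlap(node.box, query_box):
        return []  # no overlap: nothing below this box can match
    if node.is_leaf:
        return list(node.entries)  # candidates, still to be verified
    candidates: List[str] = []
    for child in node.children:
        candidates.extend(search(child, query_box))
    return candidates


# Two overlapping leaf boxes under one root; each entry lives in one leaf.
leaf_a = Node(box=(0, 0, 5, 5), entries=["building-1", "building-2"])
leaf_b = Node(box=(4, 4, 10, 10), entries=["building-3"])
root = Node(box=(0, 0, 10, 10), children=[leaf_a, leaf_b])

flood_zone_hits = search(root, (6, 6, 8, 8))
```

Pruning whole boxes at the first overlap test is what keeps the search from touching every geometry in the index.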
60. Ingesting raster data
It’s a raster / TIFF file. Bitmap image
1. Raster data
2. Tile and spray
3. Process and index
Data ready to query
61. Ingesting raster data
Tiling divides raster images into
small manageable areas of known
dimensions.
These tiles have their own
metadata:
• Bounding box
• Grid position
62. Ingesting raster data
1. Figure out which grid position the
geometry needs
2. Extract the required pixel
3. Interrogate the pixel for its value
4. Interpret its value
5. Return to user
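The five steps above can be sketched as follows. This is a toy illustration under stated assumptions, not the talk's implementation: tiles are square, keyed by grid position, and stored as small lists of rows with the top row first. The constants, the legend and all function names are hypothetical.

```python
import math

TILE_SPAN = 10.0      # width and height of one tile in map units (assumed)
PIXELS_PER_SIDE = 2   # pixels along each tile edge (tiny, for the demo)
LEGEND = {0: "no flood risk", 1: "low", 2: "high"}  # hypothetical meanings

# Tiles keyed by grid position (column, row); each tile lists its pixel rows
# from top (north) to bottom (south).
tiles = {
    (0, 0): [[0, 1],
             [2, 0]],
    (1, 0): [[1, 1],
             [1, 2]],
}


def grid_position(x, y):
    """Step 1: which tile of the grid holds the point?"""
    return math.floor(x / TILE_SPAN), math.floor(y / TILE_SPAN)


def pixel_value(tile_store, x, y):
    """Steps 2 and 3: find the pixel inside the tile and read its value."""
    tile_col, tile_row = grid_position(x, y)
    tile = tile_store.get((tile_col, tile_row))
    if tile is None:
        return None  # no tile was ingested for this grid position
    pixel_size = TILE_SPAN / PIXELS_PER_SIDE
    local_x = x - tile_col * TILE_SPAN
    local_y = y - tile_row * TILE_SPAN
    col = min(int(local_x / pixel_size), PIXELS_PER_SIDE - 1)
    row_from_bottom = min(int(local_y / pixel_size), PIXELS_PER_SIDE - 1)
    row = PIXELS_PER_SIDE - 1 - row_from_bottom  # stored rows start at the top
    return tile[row][col]


def query_point(tile_store, x, y):
    """Steps 4 and 5: interpret the raw value and return it to the caller."""
    value = pixel_value(tile_store, x, y)
    if value is None:
        return None
    return LEGEND.get(value, "unknown")
```

Only the one tile that covers the point is read, which is the benefit of tiling: the rest of the raster never has to be loaded for a single lookup.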
65. Bringing it all together
* Andrew Farrell
In pursuit of perils: Geo-spatial risk analysis through HPCC Systems
https://hpccsystems.com/resources/blog/afarrell/pursuit-perils-geo-spatial-risk-analysis-through-hpcc-systems
68. Why Geospatial with HPCC Systems?
• Efficient parallel processing
• Ability to import libraries from different languages
• Good coverage of functions and spatial predicates
• Fast ingestion
• Support for different formats
• Sub-second queries
71. ECL Tip: The Top Ten Common ECL
Compiler/Runtime Errors, and how to correct them
Bob Foreman
Senior Software Engineer
LexisNexis Risk Solutions
72. Quick poll:
What do you think about the ECL
Compiler messages?
See poll on bottom of presentation screen
73. Background
• During the many years of ECL training classes, it was discovered that many
developers encounter the same errors while learning ECL.
• Many of these errors are easy fixes, but it is important to understand what
the error message is saying and what in turn needs to be corrected.
• Errors fall into two categories, compiler and runtime.
• Compiler errors are related to syntax or improper references to other definitions.
• Runtime (or system) errors are errors that prevent a submitted workunit from
completing, and these are often easily corrected.
• Presenting the Top Ten ECL Compiler/Runtime (System) Errors:
74. Number 10 – The Workunit Assassin
Text:
Error: System error: 10056: THOR ABORT
Type:
Runtime (System)
Cause:
Somebody killed (aborted) your workunit!
Fix:
Find out who killed you and why, then restart your workunit when all clear
75. Number 9 – Unfriended Node
Text:
Error: System error: 4: MP link closed (<ip address>:<port>)
Type:
Runtime (System), MP is Message Passing
Cause:
Out of memory (OOM), network issue, hardware fault, or version bug.
Fix:
Review your slave log and syslog, check your configuration, and look for a
C++ memory leak. If the problem persists, open an issue in Jira.
76. Number 8 – Local Limbo
Text:
Error: Compile/Link failed for <pathL<workunit number>
Type:
Compiler
Cause:
You lost connection with your cluster, and the target has reverted to a
Local target.
Fix:
Restart your ECL IDE and verify cluster connection.
77. Number 7 – Missing Data Pieces (TIE)
Text:
Error: Need to supply a value for field <fieldname>
Error: Transform does not supply a value for field "SELF.<fieldname>"
Type:
Compiler
Cause:
In TABLE, your field is missing or field requires a default value.
In TRANSFORM, one or more SELF.field definition(s) missing.
Fix:
Add the default value to the TABLE, and make sure every field is assigned
(SELF.<fieldname>) in the TRANSFORM
78. Number 6 – Divide and Conquer
Text:
System error: 0: Graph graph1[1], dedup[3]: Global DEDUP,ALL is not
supported
Type:
Runtime (System)
Cause:
Some intensive ECL operations require breaking down the job into smaller
pieces to run more efficiently.
Fix:
GROUP your target DEDUP recordset
79. Number 5 – Dataset Hide and Seek
Text:
System error: 10001: Graph graph1[1], Missing logical file <filename>
Type:
Runtime (System)
Cause:
The filename you entered in the DATASET declaration does not match the
name of the file you sprayed.
Fix:
Find and correct your typo, check for proper use of the tilde (~).
80. Number 4 – No Dataset to Read!
Text:
Error: file.<fieldname> - no specified row for Table file
Type:
Compiler
Cause:
The code is trying to reference a field value from a single record when the
only thing in scope is the entire dataset, or a field may be out of scope in a
parent/child denormalized dataset.
Fix:
Definition needs to be modified to retrieve a single record in scope.
81. Number 3 – Data Imposters! (TIE)
Text:
Error: System error: 0: Dataset layout does not match published layout
for file <filename>
Error: System error: 0: Published record size # for file <filename> does
not match coded record size #
Type:
Runtime (System)
Cause:
Your RECORD structure definition does not exactly match the metadata
RECORD structure the DFU has for that dataset.
Fix:
Correct field name, position, or value type.
82. Number 2 - Action Retraction
Text:
Error: Definition contains actions after the EXPORT has been defined
Type:
Compiler
Cause:
Your ECL code contains an action (explicit or implicit) following an
EXPORTed definition.
Fix:
Remove either the action or the EXPORT.
83. Number 1 – MODULE Mayhem!
Text:
Warning: (1,0): error C2386: Module <module name> does not EXPORT
an attribute main()
Type:
Runtime (System)
Cause:
Your MODULE has multiple exports. You need to tell the compiler which
one you want to run.
Fix:
Use a Builder window or BWR file to explicitly drill down to the definition
you need. You could also rename one EXPORT definition as “Main” (not
recommended).
84. Honorable Mention – Warning Worries
Text:
WARNING: Compiler/Server mismatch:
Compiler: 6.4.2 community_6.4.2-1
Server: community_6.4.8-
Cause:
Compiler referenced in ECL IDE does not match the server version.
Fix:
Update your ECL IDE or your cluster version as appropriate.
WARNING: SOAP 1.1 fault: SOAP-ENV:Client[no subcode]
"An HTTP processing error occurred"
Detail: [no detail]
Cause:
Your cluster is not using a shared repository.
Fix:
This warning can be safely ignored if you know you are using a local repository.
85. Summary – The Top Ten
1. Warning: (1,0): error C2386: Module <module name> does not EXPORT an attribute main() (0, 0), 0,
2. Error: Definition contains actions after the EXPORT has been defined (2, 1), 2325,
3. Error: System error: 0: Dataset layout does not match published layout for file <filename> (0, 0), 0,
3. Error: System error: 0: Published record size 29 for file <filename> does not match coded record size 32 (0, 0), 0,
4. Error: file.<fieldname> - no specified row for Table file (4, 1), 2131, <ECL File and Local Path>
5. Error: System error: 10001: Graph graph1[1], Missing logical file <filename> (0, 0), 10001,
6. Error: System error: 0: Graph graph1[1], dedup[3]: Global DEDUP,ALL is not supported (0, 0), 0,
7. Error: Need to supply a value for field <fieldname> (9, 50), 2170, (tables)
7. Error: Transform does not supply a value for field "SELF.<fieldname>" (15, 1), 2111,
8. Error: Compile/Link failed for <pathL<workunit number>
9. Error: System error: 4: MP link closed (10.194.96.16:6600)
10. Error: System error: 10056: THOR ABORT
Honorable mention:
WARNING: Compiler/Server mismatch:
Compiler: 6.4.2 community_6.4.2-1
Server: community_6.4.8-
WARNING: SOAP 1.1 fault: SOAP-ENV:Client[no subcode]
"An HTTP processing error occurred"
Detail: [no detail]
86. Summary
• Many compiler errors are common to everyone and can be easily analyzed.
• As time goes on, your exposure to these common errors will point to quick
and easy solutions.
• Knowing what to do and where to go when you can’t decipher a message is
critical for productivity.
87. Quick poll:
Out of the top ten messages just
presented, how many have you
personally experienced?
See poll on bottom of presentation screen
89. Submit a talk for an upcoming episode!
• Have a new success story to share?
• Want to pitch a new use case?
• Have a new HPCC Systems application you want to demo?
• Want to share some helpful ECL tips and sample code?
• Have a new suggestion for the roadmap?
• Be a featured speaker for an upcoming episode! Email your idea to
Techtalks@hpccsystems.com
• Visit The Download Tech Talks wiki for more information:
https://wiki.hpccsystems.com/display/hpcc/HPCC+Systems+Tech+Talks
Mark your calendar for the April 19 Tech Talk!
Topics include Developing A Custom, Pluggable HPCC Systems Security Manager
Watch our Events page for details.
90. A copy of this presentation will be made available soon on our blog:
hpccsystems.com/blog
Thank You!
Editor's notes
Picture of Peters vs Mercator: one for coastline, one for area. Check out the sizes of Greenland and Africa.
Lesson: Projections distort the data!