Big data appliance ecosystem - in memory db, hadoop, analytics, data mining, business intelligence with multiple data source charts, twitter support and analysis.
Hpdw 2015-v10-paper
1. Dr. Seah Boon Keong
MIMOS BHD
seahbk2006@yahoo.com
Using High Performance Parallel Data Warehouse (HPDW) Big Data Analytical Platform for Big Data Analysis
3. Harness Big Data to improve decision making
Before Big Data: decisions based upon transactional data
After Big Data: decisions based upon all data
• Social data
• Information on video and images
• Machine-generated data (sensors, etc.)
4. Challenges/Problems for Data Scientists or Analysts
Setup tasks required before any analysis:
• Hardware setup and configuration
• Big Data setup
• Streaming setup
• Integration work and testing
• Selecting and testing the multiple tools required
• Analytics setup
• Visualization setup
Only then can analytics be performed (estimated effort: 1000 man-hours for tasks 1-8)
5. Challenges/Problems of RDBMS for processing big data
• Bringing a combination of Big Data sources into a data warehouse is a challenge
• Existing RDBMS technology is not built for handling large data sets
• In addition, it is difficult to perform join queries between historical and streaming data
6. How can HPDW address the pains of data scientists and data analysts?
HPDW Appliance
• Integrated Big Data platform for batch and stream
• Hides the complexity of developing and integrating the various components from scratch
• Enables data scientists and data analysts to focus on analysing data and not on big data setup
• Provides integrated R tools for data analysis with HPDW data access
• Provides a data visualization tool
• Additional service for data warehouse migration to Big Data
• Enables stream analysis of sources such as IoT devices through a RESTful service in JSON
HPDW allows analysts to focus on analyzing data, not on managing infrastructure
9. HPDW Big Data Analytics Architecture
[Architecture diagram: HPDW Big Data Analytics Platform]
Data sources: business data, data streams, social, log, enterprise DB
Platform sections: Data Streaming, Data Platform, Data Exploration, Analytics
Components: API (REST+JSON), JDBC, ODBC, Data Migration Plugin, InMemory Fast Data, Join SQL (Batch and Stream, Data Lakes), Hadoop, R, Spark, Python, Tableau, other BI tools, multi data source exploration, charts, drill down
Output: reports, sentiments, IoT trends, charts, dashboard, drill-down reports
12. Data Platform
HPDW Appliance
• Fast SQL query
• Join query for historical data and data streams
• RDBMS data migration plugin
• JDBC support
• ODBC Big Data support for BI integration such as Tableau
13. HPDW Appliance
• Fast SQL query
• Unified query for historical data and data streams
• Analytics of multiple data sources for immediate data exploration
• RDBMS data migration connector
• Supports data mining tools (R package, etc.)
• Additional service for data warehouse migration to Big Data
• Integrates with 3rd party BI tools (Tableau, etc.)
HPDW Sample Query and Unify Query
Sample query on historical data:
SELECT d.monthname_part || '-' || CAST(d.yearpart AS VARCHAR) AS monthyear,
       r.referencesourcedesc, ag.agegroupdesc, g.gendermalaydesc,
       SUM(f.encounter_cnt) AS encounter_cnt
FROM fact_patientencounter_100000000 f
JOIN dim_lk_reference r ON r.sk_dim_reference = f.sk_dim_reference
JOIN dim_lk_agegroup ag ON ag.sk_dim_agegroup = f.sk_dim_agegroup
JOIN dim_lk_gender g ON g.sk_dim_gender = f.sk_dim_gender
JOIN dim_date d ON d.sk_dim_date = f.sk_dim_date
WHERE d.yearpart = 2013
GROUP BY d.monthname_part || '-' || CAST(d.yearpart AS VARCHAR),
         r.referencesourcedesc, ag.agegroupdesc, g.gendermalaydesc
Sample query on a data stream:
SELECT * FROM hpdw.stream.tweets WHERE text LIKE '%malaysia%'
Sample join query (historical data and data stream):
SELECT dim_lk_gender.*, hpdw.stream.tweets.*
FROM dim_lk_gender, hpdw.stream.tweets
WHERE text LIKE '%malaysia%'
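The join query above lists a dimension table and a stream in the same FROM clause without a join condition, so each dimension row is paired with each matching tweet. The sketch below illustrates that behaviour only; it uses Python's built-in sqlite3 as a stand-in for HPDW, a plain table named tweets in place of hpdw.stream.tweets, and made-up sample rows, all of which are assumptions and not part of the original slides.

```python
import sqlite3

# Stand-in for HPDW: an in-memory SQLite database (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_lk_gender (sk_dim_gender INTEGER, gendermalaydesc TEXT);
    CREATE TABLE tweets (id INTEGER, text TEXT);  -- stands in for hpdw.stream.tweets
    -- Hypothetical sample rows, not taken from the benchmark data.
    INSERT INTO dim_lk_gender VALUES (1, 'Lelaki'), (2, 'Perempuan');
    INSERT INTO tweets VALUES
        (10, 'hello malaysia'),
        (11, 'unrelated post'),
        (12, 'visit malaysia today');
""")

# Stream-only filter, same shape as the sample stream query.
stream_rows = conn.execute(
    "SELECT * FROM tweets WHERE text LIKE '%malaysia%'"
).fetchall()

# Join query with no join condition: a Cartesian product of the
# dimension rows and the matching tweets.
joined_rows = conn.execute(
    "SELECT dim_lk_gender.*, tweets.* FROM dim_lk_gender, tweets "
    "WHERE text LIKE '%malaysia%'"
).fetchall()

print(len(stream_rows), "matching tweets")
print(len(joined_rows), "joined rows")
```

With two dimension rows and two matching tweets, the joined result has four rows, which is worth keeping in mind when reading join output from a large stream.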
14. DB Viewer (Aqua Studio)
Viewing data in HPDW: the DB viewer connects to the HPDW Appliance through the JDBC connector
15. Tableau
Use of HPDW data in Tableau: Tableau connects to the HPDW Appliance through the ODBC connector
Note: Compared with the Hortonworks ODBC (Hive) benchmark, HPDW ODBC is much faster for data access:
• <1 sec for HPDW ODBC direct
• 30-40 sec for Hortonworks Hive ODBC
20. Supports Python, R, Spark, etc
Data analysis on HPDW data sources and others
Transform and aggregate data for further data understanding
Data Exploration
24. HPDW Benchmark
HPDW test environment
Nodes: 4 x physical nodes
CPU: Intel Xeon Ten-Core E5-2660v3 2.60 GHz processors, 20 cores
RAM: 128 GB
Storage: HDD 4 TB (RAID 10)
OS: Ubuntu (64-bit)
Query 1 (Total number of patients)
SELECT count(*) FROM fact_patientencounter_100000000
Query 2 (Total Encounters by month & year, servicetype, nationality, agegroup, gender)
SELECT d.monthname_part || '-' || CAST(d.yearpart AS VARCHAR) AS monthyear,
       st.servicetypedesc, n.nationalitydesc, ag.agegroupdesc, g.gendermalaydesc,
       SUM(f.encounter_cnt) AS encounter_cnt
FROM fact_patientencounter_100000000 f
JOIN dim_lk_servicetype st ON st.sk_dim_servicetype = f.sk_dim_servicetype
JOIN dim_lk_agegroup ag ON ag.sk_dim_agegroup = f.sk_dim_agegroup
JOIN dim_lk_gender g ON g.sk_dim_gender = f.sk_dim_gender
JOIN dim_lk_nationality n ON n.sk_dim_nationality = f.sk_dim_nationality
JOIN dim_date d ON d.sk_dim_date = f.sk_dim_date
WHERE d.yearpart = 2013
GROUP BY d.monthname_part || '-' || CAST(d.yearpart AS VARCHAR),
         st.servicetypedesc, n.nationalitydesc, ag.agegroupdesc, g.gendermalaydesc
Query 3 (Total Encounters by month & year, reference hospital, agegroup, gender)
SELECT d.monthname_part || '-' || CAST(d.yearpart AS VARCHAR) AS monthyear,
       r.referencesourcedesc, ag.agegroupdesc, g.gendermalaydesc,
       SUM(f.encounter_cnt) AS encounter_cnt
FROM fact_patientencounter_100000000 f
JOIN dim_lk_reference r ON r.sk_dim_reference = f.sk_dim_reference
JOIN dim_lk_agegroup ag ON ag.sk_dim_agegroup = f.sk_dim_agegroup
JOIN dim_lk_gender g ON g.sk_dim_gender = f.sk_dim_gender
JOIN dim_date d ON d.sk_dim_date = f.sk_dim_date
WHERE d.yearpart = 2013
GROUP BY d.monthname_part || '-' || CAST(d.yearpart AS VARCHAR),
         r.referencesourcedesc, ag.agegroupdesc, g.gendermalaydesc
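To make the shape of the benchmark queries concrete, here is a minimal sketch that runs Query 3 against a tiny star schema. It assumes Python's built-in sqlite3 as a stand-in engine, a shortened fact table name (fact_patientencounter), and a handful of made-up rows; the ORDER BY clause is an addition so the output order is deterministic. None of this reproduces the actual MOH data or the HPDW engine.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (sk_dim_date INTEGER, monthname_part TEXT, yearpart INTEGER);
    CREATE TABLE dim_lk_reference (sk_dim_reference INTEGER, referencesourcedesc TEXT);
    CREATE TABLE dim_lk_agegroup (sk_dim_agegroup INTEGER, agegroupdesc TEXT);
    CREATE TABLE dim_lk_gender (sk_dim_gender INTEGER, gendermalaydesc TEXT);
    CREATE TABLE fact_patientencounter (
        sk_dim_date INTEGER, sk_dim_reference INTEGER,
        sk_dim_agegroup INTEGER, sk_dim_gender INTEGER, encounter_cnt INTEGER);
    -- Hypothetical sample rows.
    INSERT INTO dim_date VALUES (1, 'Jan', 2013), (2, 'Feb', 2013), (3, 'Jan', 2014);
    INSERT INTO dim_lk_reference VALUES (1, 'Hospital A');
    INSERT INTO dim_lk_agegroup VALUES (1, '20-29');
    INSERT INTO dim_lk_gender VALUES (1, 'Lelaki');
    INSERT INTO fact_patientencounter VALUES
        (1, 1, 1, 1, 2), (1, 1, 1, 1, 3), (2, 1, 1, 1, 4), (3, 1, 1, 1, 9);
""")

# Query 3 from the slide, with a shortened fact table name and an added ORDER BY.
QUERY_3 = """
    SELECT d.monthname_part || '-' || CAST(d.yearpart AS VARCHAR) AS monthyear,
           r.referencesourcedesc, ag.agegroupdesc, g.gendermalaydesc,
           SUM(f.encounter_cnt) AS encounter_cnt
    FROM fact_patientencounter f
    JOIN dim_lk_reference r ON r.sk_dim_reference = f.sk_dim_reference
    JOIN dim_lk_agegroup ag ON ag.sk_dim_agegroup = f.sk_dim_agegroup
    JOIN dim_lk_gender g ON g.sk_dim_gender = f.sk_dim_gender
    JOIN dim_date d ON d.sk_dim_date = f.sk_dim_date
    WHERE d.yearpart = 2013
    GROUP BY d.monthname_part || '-' || CAST(d.yearpart AS VARCHAR),
             r.referencesourcedesc, ag.agegroupdesc, g.gendermalaydesc
    ORDER BY monthyear
"""

rows = conn.execute(QUERY_3).fetchall()
for row in rows:
    print(row)
```

The 2014 fact row is filtered out by the WHERE clause, and the two January 2013 rows are summed into a single group.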
Evaluation
1. Migrated the MOH Data Warehouse from PostgreSQL to HPDW
2. Ran 3 different queries against 100M, 200M and 300M rows
3. Compared HPDW against a well-known relational database (PostgreSQL Enterprise 9.4)
PostgreSQL test environment (HPDW used the 4-node environment listed above)
CPU: Intel Xeon Ten-Core E5-2660v3 2.60 GHz processors, 20 cores
RAM: 96 GB
Storage: HDD 4 TB (RAID 5)
OS: Ubuntu (64-bit)
25. Test Case: Total number of patients
Execution time (in s) over 5 runs, by number of records (in millions)
Records (M)  System      Run 1  Run 2  Run 3  Run 4  Run 5  Average
100          HPDW        3      1      1      1      1      1.4
100          PostgreSQL  101.6  11.1   10.7   10.7   10.7   29
200          HPDW        3      1      2      1      1      1.6
200          PostgreSQL  208.2  130.4  28.3   28.4   28.4   84.7
300          HPDW        4      3      4      3      2      3.2
300          PostgreSQL  432.6  423.6  345.6  315.1  313.7  366.1
[Bar chart, TC 1: Total number of patients. Execution time (in s) by rows. HPDW: 1.4 (100 M), 1.6 (200 M), 3.2 (300 M). PostgreSQL: 10.7 (100 M), 28.4 (200 M), 313.7 (300 M). Callout: 100x]
In Test Case 1:
PostgreSQL takes about 313.7 seconds to run the query over 300 M rows (fifth run; the five-run average is 366.1 seconds).
HPDW takes just 3.2 seconds on average. It is about 100 times faster than PostgreSQL (313.7 / 3.2 ≈ 98).
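The averages in the Test Case 1 table can be rechecked from the per-run figures. The sketch below does that with the numbers copied from the table; the variable names are illustrative only. It also prints the first and fifth PostgreSQL runs side by side, since the first run is much slower than the later ones and pulls the average up.

```python
from statistics import mean

# Per-run execution times (s) from the Test Case 1 table, keyed by millions of rows.
hpdw_runs = {
    100: [3, 1, 1, 1, 1],
    200: [3, 1, 2, 1, 1],
    300: [4, 3, 4, 3, 2],
}
postgres_runs = {
    100: [101.6, 11.1, 10.7, 10.7, 10.7],
    200: [208.2, 130.4, 28.3, 28.4, 28.4],
    300: [432.6, 423.6, 345.6, 315.1, 313.7],
}

for rows in (100, 200, 300):
    hpdw_avg = mean(hpdw_runs[rows])
    pg_avg = mean(postgres_runs[rows])
    pg_first, pg_fifth = postgres_runs[rows][0], postgres_runs[rows][-1]
    print(
        f"{rows} M rows: HPDW avg {hpdw_avg:.1f} s, "
        f"PostgreSQL avg {pg_avg:.1f} s "
        f"(first run {pg_first} s, fifth run {pg_fifth} s)"
    )
```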
26. Test Case: Total Encounters by month & year, service type, nationality, age group, gender
In Test Case 2, PostgreSQL takes about 21057 seconds to run the query over 300 M rows.
HPDW takes just 92.4 seconds. It is 228 times faster than PostgreSQL.
Execution time (in s), by number of records (in millions)
Records (M)  HPDW run 1  Run 2  Run 3  Run 4  Run 5  HPDW average  PostgreSQL*
100          26          25     25     26     25     25.4          5757
200          47          46     51     49     47     48            12682
300          92          103    89     89     89     92.4          21057
Note: *The PostgreSQL test was carried out only once for each row count due to time constraints
[Bar chart, TC 2. Execution time (in s) by rows. HPDW: 25.4 (100 M), 48 (200 M), 92.4 (300 M). PostgreSQL: 5757 (100 M), 12682 (200 M), 21057 (300 M). Callout: 228x]
27. Test Case: Total Encounters by month & year, reference hospital, age group, gender
In Test Case 3, PostgreSQL takes about 508.6 seconds to run the query over 300 M rows.
HPDW takes just 46.8 seconds. It is 11 times faster than PostgreSQL.
[Bar chart, TC 3. Execution time (in s) by rows. HPDW: 17.8 (100 M), 31.2 (200 M), 46.8 (300 M). PostgreSQL: 87.9 (100 M), 276 (200 M), 508.6 (300 M). Callout: 11x]
Execution time (in s) over 5 runs, by number of records (in millions)
Records (M)  System      Run 1  Run 2  Run 3  Run 4  Run 5  Average
100          HPDW        17     17     17     18     20     17.8
100          PostgreSQL  125.1  87.7   87.8   87.9   87.9   95.28
200          HPDW        35     34     32     30     25     31.2
200          PostgreSQL  277    285    279    277.7  276.6  279.06
300          HPDW        49     44     46     47     48     46.8
300          PostgreSQL  509.6  507.8  507.9  508.8  508.6  508.5
28. Overview of Benchmark Results of HPDW vs PostgreSQL
• Performance improvement at 300M rows: 11x (Test Case 3) to 228x (Test Case 2)
• Data size: 100M rows = 8 GB, 200M rows = 16 GB, 300M rows = 24 GB
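The speedup figures quoted for each test case are ratios of PostgreSQL time to HPDW time. A small sketch of that arithmetic follows, using the values plotted in the three bar charts; the dictionary layout and names are illustrative assumptions.

```python
# Execution times (s) as plotted in the three bar charts: (HPDW, PostgreSQL),
# keyed by test case and by millions of rows.
chart_times = {
    "TC1": {100: (1.4, 10.7), 200: (1.6, 28.4), 300: (3.2, 313.7)},
    "TC2": {100: (25.4, 5757), 200: (48, 12682), 300: (92.4, 21057)},
    "TC3": {100: (17.8, 87.9), 200: (31.2, 276), 300: (46.8, 508.6)},
}

# Speedup = PostgreSQL time / HPDW time for the same query and row count.
speedups = {
    tc: {rows: pg / hpdw for rows, (hpdw, pg) in by_rows.items()}
    for tc, by_rows in chart_times.items()
}

for tc, by_rows in speedups.items():
    line = ", ".join(f"{rows} M: {s:.1f}x" for rows, s in by_rows.items())
    print(f"{tc}: {line}")
```

The headline figures correspond to the 300 M row column; the ratios at 100 M and 200 M rows differ, so it helps to state the row count alongside any speedup that is quoted.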
29. High Performance with Fewer Cores and Nodes
HPDW Appliance: 40 sec
PostgreSQL: 3 hours
Over 11-200x faster
31. Conclusion and Future Work
Summary
• Successfully developed the HPDW Big Data Analytical Platform
• Consists of 4 major sections: Data Streaming, Data Platform, Data Exploration and Analytics
• Provides an end-to-end solution for storing and analyzing both historical and streaming data through a unified query
• HPDW processes data in memory and uses InfiniBand/10 Gbps high-speed networking to interconnect all the data nodes
• Incorporates RESTful JSON for easy stream data insertion
• Provides JDBC and ODBC connections for integration with 3rd party tools
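As an illustration of the RESTful JSON stream insertion mentioned in the summary, the sketch below builds a JSON POST request for one stream record using only the Python standard library. The slides do not document the real HPDW API, so the URL, the field names and the helper name build_stream_insert are all hypothetical; the request is built but not sent.

```python
import json
import urllib.request

# Hypothetical endpoint; the slides do not give the real HPDW API address.
HPDW_STREAM_URL = "http://hpdw.example/api/stream/tweets"


def build_stream_insert(record: dict, url: str = HPDW_STREAM_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying one stream record."""
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical record shape, loosely modelled on the tweets stream in the sample queries.
request = build_stream_insert({"id": 1, "text": "hello malaysia"})
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# Sending it would be a call to urllib.request.urlopen(request).
```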
Future Work
• Support more SQL commands, including the UPDATE statement
• Add real-time streaming data visualisation to the HPDW data analytics section, and support more data sources such as OData, Excel, etc.