Why Hadoop is important to Syncsort
1. Why was it so important to us to open the MapReduce framework?
12/11/2013
Syncsort Confidential and Proprietary - do not copy or distribute
2. Agenda
Who are we?
What did we do?
Why did we do that?
With whom did we do it?
With what results?
3. Agenda
Who are we?
What did we do?
Why did we do that?
With whom did we do it?
With what results?
4. Syncsort
For 40 years we have been helping companies solve their big data
issues…even before they knew the name Big Data!
Integrating Big Data…
Smarter!
Our customers are achieving the
impossible, every day!
• 50% of all mainframes run Syncsort
• 1,500 Mainframe Customers: Most
used & trusted 3rd party mainframe
software
• Speed leader for ETL & Sort
• A history of innovation
• 25+ Issued & Pending Patents
• Large global customer base
• 15,000+ deployments in 68 countries
• First-to-market, fully integrated
approach to Hadoop ETL
Key Partners
5. Agenda
Who are we?
What did we do?
Why did we do that?
With whom did we do it?
With what results?
6. Smart Contributions to Improve Hadoop
Augmenting Critical Batch Processing Capabilities
JIRA    Description
4807    Allow MapOutputBuffer to be pluggable
4808    Allow Reduce-side merge to be pluggable
4809    Make classes required for 2454 public
4812    Create reduce input merger plug-in
4842    Shuffle race can hang reducer
2461    HDFS file name globbing in libhdfs
4482    Backport of 2454 to MapReduce 1 & 1.2
Plugin shipping on CDH 4.2 and later
7. Opening the MapReduce Framework
[Diagram: the MapReduce pipeline as Mapper, Output Sorter, Shuffle, Input Sorter, Reducer]
Output Sorter and Input Sorter: here and here to replace MapReduce native sort
Mapper: here to perform functional logic on our engine
Reducer: here to perform functional logic on our engine
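The idea of a replaceable sort stage can be sketched with a toy, single-process model of the pipeline. This is not the Hadoop API and not the DMX-h plug-in: `run_job`, `default_sort`, `default_merge`, and `traced_sort` are hypothetical names used only for illustration, and the sketch assumes a single reducer.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def default_sort(pairs):
    """Stand-in for the framework's native map-side sort (by key)."""
    return sorted(pairs, key=itemgetter(0))


def default_merge(runs):
    """Stand-in for the native reduce-side merge of key-sorted runs."""
    return heapq.merge(*runs, key=itemgetter(0))


def run_job(splits, mapper, reducer, sorter=default_sort, merger=default_merge):
    """Toy map, sort, shuffle, merge, reduce flow with pluggable sorter and merger."""
    runs = []
    for split in splits:
        # Map side: emit (key, value) pairs, then sort them with the plugged-in sorter.
        pairs = [kv for record in split for kv in mapper(record)]
        runs.append(sorter(pairs))
    # Reduce side: merge the sorted runs, then reduce each group of equal keys.
    merged = merger(runs)
    return [reducer(key, [value for _, value in group])
            for key, group in groupby(merged, key=itemgetter(0))]


def word_mapper(line):
    return [(word, 1) for word in line.split()]


def count_reducer(word, counts):
    return (word, sum(counts))


SPLITS = [["b a", "a"], ["c a b"]]
sort_calls = []


def traced_sort(pairs):
    """A replacement sorter; it records its calls to show it was used."""
    sort_calls.append(len(pairs))
    return sorted(pairs, key=itemgetter(0))


baseline = run_job(SPLITS, word_mapper, count_reducer)
swapped = run_job(SPLITS, word_mapper, count_reducer, sorter=traced_sort)
print(baseline)
print(swapped == baseline, sort_calls)
```

The mapper and reducer are untouched when the sorter changes, which is the property the slide attributes to making the sort stages pluggable.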
8. Agenda
Who are we?
What did we do?
Why did we do that?
With whom did we do it?
With what results?
9. Syncsort: Just integrating data … faster
A simple DI engine, easy to deploy, operate, and administer
ETL-like development GUI
Auto-tuning
Best patented algorithms: Sort, Join, Aggregate, Copy, Merge
Fast, fast, faster than any other
The more data the better
10. From Data to Big Data
[Figure: the path from data to big data, plotted by decade (60s, 70s, 80s, 90s, 2000s, 2010s, Next?) through four eras (Mainframe, PC, Internet Revolution, Mobile & Social Media Revolution). Axes: Volume, Variety, Velocity, each marked $$$. Velocity scale: Quarterly, Monthly, Weekly, Daily, Intra-day, Right / Real-time.]
11. Smart Architecture
Hadoop Integration… for Real
(No Code Generation. No Compiling. No Bolts. No Nuts!)
Runs natively within MapReduce
Small footprint installs on every node
Open source contributions extend capabilities of MapReduce
Hadoop Cluster
Unleash Hadoop’s Potential
Pluggable sort
Expanded use cases (i.e. “No sort” option)
Vertical scalability
Design flexibility (MapMapReduceReduce)
No need to worry about this…
12. Agenda
Who are we?
What did we do?
Why did we do that?
With whom did we do it?
With what results?
13. Cloudera + Syncsort: Smarter Connectivity… Also for Mainframe
Because Mainframe Is Big Data Too!
Connect
• Read files directly from mainframe
• No software required on mainframe
• Already installed on 50% of mainframes
Translate
• Parse & transform: packed decimal, EBCDIC/ASCII, multi-format
• No coding required
Load & Process
• Load directly to HDFS
• Offload batch data processing
• Find more insights
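To make the Translate step concrete, here is a minimal sketch of the two conversions named above: EBCDIC text to Unicode and packed decimal to a number. It is a generic illustration, not how DMX-h does it. The helper names, the choice of code page `cp037`, and the 8-byte record layout are assumptions; real mainframe files may use other code pages and sign conventions.

```python
from decimal import Decimal


def ebcdic_to_text(raw: bytes, codec: str = "cp037") -> str:
    """Decode EBCDIC bytes; cp037 (US/Canada) is an assumed code page."""
    return raw.decode(codec)


def unpack_packed_decimal(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed decimal field: two BCD digits per byte, sign in the last nibble."""
    if not raw:
        raise ValueError("empty packed decimal field")
    digits = []
    for byte in raw[:-1]:
        digits.extend((byte >> 4, byte & 0x0F))
    digits.append(raw[-1] >> 4)
    sign = raw[-1] & 0x0F
    if any(d > 9 for d in digits):
        raise ValueError("invalid digit nibble")
    # This sketch handles the common sign nibbles only: C/F positive, D negative.
    if sign not in (0x0C, 0x0D, 0x0F):
        raise ValueError("unsupported sign nibble")
    value = int("".join(str(d) for d in digits))
    if sign == 0x0D:
        value = -value
    # scale is the number of implied decimal places.
    return Decimal(value).scaleb(-scale)


# Hypothetical fixed-length record: 5 bytes of EBCDIC text, then a 3-byte packed amount.
record = b"\xc8\xc5\xd3\xd3\xd6" + b"\x12\x34\x5c"
name = ebcdic_to_text(record[:5])
amount = unpack_packed_decimal(record[5:], scale=2)
print(name, amount)
```

The field offsets and the number of implied decimal places come from the record layout (typically a COBOL copybook), which is why the layout is passed in rather than detected.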
14. Syncsort DMX-h + Cloudera Manager
[Diagram: Cloudera Manager (CDH Cluster + ISV software) connected through its API to Syncsort DMX-h. Labels: Support Integration, Monitoring, Management, Installation. CDH Nodes: DMX-h on every CDH node.]
15. Agenda
Who are we?
What did we do?
Why did we do that?
With whom did we do it?
With what results?
16. Test cases
Sort Acceleration
– Terasort
• Run terasort with DMX-h and without DMX-h in various configurations to
compare performance.
ETL
– Use DMX-h to perform several different ETL jobs and compare against
equivalent jobs in Pig (Apache Pig version 0.9.2-gphd-1.2.0.0).
• File Change Data Capture (CDC)
• Web Log Aggregation
19. Cluster Configuration – DMX-h Ran on 763 Nodes!
Cluster Specs:
– 763 node cluster
• 1 node – job tracker
• 1 node – name node
• 1 node – secondary name node
• 760 data and task nodes
Hadoop cluster configuration changes (from defaults):
– 128 MB HDFS block size (file.blocksize)
– 1.5 GB map / 4 GB reduce task JVM memory (mapred.child.java.opts)
– Maximum 22 map tasks and 4 reduce tasks per node (mapred.tasktracker.map.tasks.maximum & mapred.tasktracker.reduce.tasks.maximum)
Cluster Node Specs:
– 12 cores - Dual Intel Westmere (Hexcore) CPUs, 2.93 GHz, 12 MB Cache
– 48 GB DDR3 RDIMM Memory
– 12 x 2TB 3.5” drives Seagate 7200rpm.
– Disk 0 + Disk 1 are RAID1 (mirrored) for OS.
• 100 MB/Sec write
• 115 MB/Sec read
– 10 single disk JBOD
– Mellanox ConnectX®-3 VPI NIC (Supported data rates 40GbE;10GbE)
– RHEL 6.1 64-bit
– Java 1.6 (jdk.x86_64-2000:1.6.0_29fcs)
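The configuration figures above imply some simple capacity arithmetic, worked through below. The constant names are mine, and the sketch assumes the 1.5 GB and 4 GB figures are the maximum heap of each task JVM, which the slide does not state explicitly.

```python
# Figures taken from the slide; names are illustrative.
WORKER_NODES = 760          # data and task nodes
MAP_SLOTS_PER_NODE = 22
REDUCE_SLOTS_PER_NODE = 4
MAP_HEAP_GB = 1.5           # assumed to be the max heap per map task JVM
REDUCE_HEAP_GB = 4.0        # assumed to be the max heap per reduce task JVM
NODE_RAM_GB = 48


def peak_task_heap_per_node_gb() -> float:
    """Heap demand on one node if every slot runs a task at its full heap size."""
    return (MAP_SLOTS_PER_NODE * MAP_HEAP_GB
            + REDUCE_SLOTS_PER_NODE * REDUCE_HEAP_GB)


cluster_map_slots = WORKER_NODES * MAP_SLOTS_PER_NODE
cluster_reduce_slots = WORKER_NODES * REDUCE_SLOTS_PER_NODE

print(peak_task_heap_per_node_gb(), "GB peak task heap vs", NODE_RAM_GB, "GB RAM per node")
print(cluster_map_slots, "map slots,", cluster_reduce_slots, "reduce slots")
```

Under that assumption the slots add up to 49 GB of potential heap on a 48 GB node, which would only matter if every slot reached its full heap at the same time.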
20. Sort Acceleration - Terasort
All rows: use case TERASORT, ETL or Sort Acceleration = Sort Acceleration, Native/Alternative = Native.
Elapsed times are h:mm:ss. "Improv." is the improvement of DMX-h over the Native/Alternative run.

Data Size  Native    DMX-h     Elapsed   Native    DMX-h Phys.  Memory    Native       DMX-h        CPU       Native       DMX-h
(GB)       Elapsed   Elapsed   Improv.   Mem (GB)  Mem (GB)     Improv.   CPU Time     CPU Time     Improv.   MB/Sec/Node  MB/Sec/Node
512        0:01:47   0:01:45   2%        12,863    12,873       0%        114,297      62,491       45%       6.5          6.6
1,024      0:02:29   0:01:11   52%       14,512    14,522       0%        194,896      98,972       49%       9.3          19.4
1,536      0:04:02   0:01:23   66%       14,684    14,694       0%        287,055      143,759      50%       8.6          25.0
4,096      0:03:31   0:02:29   29%       31,520    31,549       0%        927,379      380,442      59%       26.2         37.0
10,242     0:08:51   0:05:14   41%       47,935    47,951       0%        2,835,927    1,460,101    49%       26.4         44.6
20,484     0:14:55   0:12:28   16%       106,153   105,239      1%        6,112,296    3,696,727    40%       31.0         37.4
102,400    1:12:12   0:51:59   28%       387,262   387,211      0%        30,436,624   16,589,332   45%       32.3         44.9
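The derived columns in the Terasort figures above can be recomputed from the raw ones. The sketch below does this for the 1,024 GB run. The formulas are my reading of the column names, not something the deck states: improvement as the relative reduction against the native run, and throughput as data size over elapsed time, spread across the 760 data and task nodes with 1 GB taken as 1,024 MB. With those assumptions the published values for this row are reproduced after rounding; other rows may differ slightly because the elapsed times are given only to whole seconds.

```python
def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def improvement_pct(baseline: float, dmx: float) -> float:
    """Relative reduction versus the baseline, in percent (assumed definition)."""
    return 100.0 * (baseline - dmx) / baseline


def mb_per_sec_per_node(size_gb: float, elapsed: str, nodes: int = 760) -> float:
    """Throughput per node, assuming 1 GB = 1024 MB and 760 worker nodes."""
    return size_gb * 1024 / to_seconds(elapsed) / nodes


# Values for the 1,024 GB Terasort run, copied from the slide.
elapsed_gain = improvement_pct(to_seconds("0:02:29"), to_seconds("0:01:11"))
cpu_gain = improvement_pct(194_896, 98_972)
native_rate = mb_per_sec_per_node(1024, "0:02:29")
dmx_rate = mb_per_sec_per_node(1024, "0:01:11")

print(f"elapsed improvement {elapsed_gain:.0f}%, CPU improvement {cpu_gain:.0f}%")
print(f"native {native_rate:.1f} MB/s/node, DMX-h {dmx_rate:.1f} MB/s/node")
```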
21. File CDC
All rows: use case FileCDC, ETL or Sort Acceleration = ETL, Native/Alternative = Pig.
Elapsed times are h:mm:ss. "Improv." is the improvement of DMX-h over the Native/Alternative run.

Data Size  Pig       DMX-h     Elapsed   Pig       DMX-h Phys.  Memory    Pig          DMX-h        CPU       Pig          DMX-h
(GB)       Elapsed   Elapsed   Improv.   Mem (GB)  Mem (GB)     Improv.   CPU Time     CPU Time     Improv.   MB/Sec/Node  MB/Sec/Node
148        0:05:31   0:01:33   72%       79,876    79,559       0%        79,876       79,559       0%        0.6          2.2
450        0:05:11   0:01:58   62%       243,834   182,869      25%       243,834      182,869      25%       1.9          5.3
1,515      0:07:49   0:03:44   52%       845,263   557,226      34%       845,263      557,226      34%       4.4          9.4
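For readers unfamiliar with the workload, file change data capture compares two snapshots of a file and emits the records that were inserted, updated, or deleted. The deck does not describe the actual DMX-h or Pig jobs, so the following is only a generic in-memory sketch; `file_cdc`, the record shape, and the change labels are illustrative assumptions.

```python
def file_cdc(old_records, new_records, key):
    """Compare two snapshots and list changes as (change_type, record) tuples."""
    old = {key(record): record for record in old_records}
    new = {key(record): record for record in new_records}
    changes = []
    # Walk the union of keys in sorted order so the output is deterministic.
    for k in sorted(old.keys() | new.keys()):
        if k not in old:
            changes.append(("INSERT", new[k]))
        elif k not in new:
            changes.append(("DELETE", old[k]))
        elif old[k] != new[k]:
            changes.append(("UPDATE", new[k]))
    return changes


# Hypothetical snapshots: (id, name) records keyed by id.
yesterday = [(1, "alice"), (2, "bob"), (3, "carol")]
today = [(1, "alice"), (2, "robert"), (4, "dave")]
changes = file_cdc(yesterday, today, key=lambda record: record[0])
for change in changes:
    print(change)
```

At cluster scale the same comparison is usually expressed as a join of the two snapshots on the key, which is why it exercises sort and join performance.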
22. Web Log Aggregation
All rows: use case WebLogAggregation Split Size & fixes, Native/Alternative = Pig.
Elapsed times are h:mm:ss. "Improv." is the improvement of DMX-h over the Native/Alternative run.

Data Size  Pig       DMX-h     Elapsed   Pig       DMX-h Phys.  Memory    Pig          DMX-h        CPU       Pig          DMX-h
(GB)       Elapsed   Elapsed   Improv.   Mem (GB)  Mem (GB)     Improv.   CPU Time     CPU Time     Improv.   MB/Sec/Node  MB/Sec/Node
2,067      0:01:12   0:00:58   19%       13,499    7,813        42%       145,972      56,496       61%       40.1         49.8
4,135      0:01:42   0:01:23   19%       18,003    15,579       13%       300,627      152,390      49%       56.1         69.6
10,240     0:05:16   0:02:04   61%       40,773    39,091       4%        807,473      335,537      58%       45.3         115.4
20,480     0:07:54   0:06:58   12%       78,654    78,128       1%        1,339,453    568,107      58%       60.4         68.4
23. Test Drive DMX-h:
Bridge the Gap Between
Big Iron & Big Data!
• Self-contained image
• Use case accelerators for mainframe, Hadoop and more!
Running on CDH
A Smarter Approach…
www.syncsort.com/try
…and Quite Possibly The Only Approach!
Editor's Notes
The ability to process and analyze mainframe data with Hadoop can open up a wealth of opportunities by delivering deeper analytics, at lower cost. Unfortunately, there are no native Hadoop ETL capabilities for mainframe. Simply ingesting mainframe data involves lots of manual effort and coding, plus a combination of mainframe and Hadoop skills that are nearly impossible to find. The Use Case Accelerators for Mainframe Connectivity and Translation combine decades of mainframe expertise with state-of-the-art Hadoop capabilities to provide a painless and seamless approach to leverage mainframe data. Read files directly from the mainframe, parse and transform the data – packed decimal, COMP, EBCDIC/ASCII, multi-format records, and more - without installing any software on the mainframe and without writing any code. SAMPLE EBCDIC data!