BIG
HANDLING LARGE DATA
The CloverETL Cluster Architecture Explained
Wednesday, August 14, 13
The Reality:
You have a really big pile to deal with.
One traditional digger might not be enough.
Really Big Data
You could get a really big, expensive digger...
…or several smaller ones and get the job done faster & cheaper.
But what if the one big one suffers a mechanical failure?
With small diggers, failure of one does not affect the rest.
Which one do you choose: one big digger vs. several smaller ones?
CloverETL Cluster resiliency features
Optimizing for robustness...
Fault resiliency – HW & SW
automatic fail-over
[Figure: Node 1 and Node 2, before and after fail-over]
Load Balancing
automatic load balancing
[Figure: a new task arriving at Node 1 and Node 2, before and after load balancing]
CloverETL Cluster - BIG DATA features
Optimizing for speed...
Traditionally, data transformations were run on a single, big server
with multiple CPUs and plenty of RAM.
And it was expensive.
Then the CloverETL team
developed the concept of a data
transformation cluster.
The CloverETL
Cluster was born
It creates a powerful data transformation beast from a set of low-cost
commodity hardware machines.
Now, one data transformation can be set to run in parallel on
all available nodes of the CloverETL Cluster.
Each cluster node executing the
transformation is automatically fed with a
different portion of the input data.
[Figure: the input data split into Part 1, Part 2, and Part 3]
[Figure: Before vs. Now, with the input split into Part 1, Part 2, and Part 3]
Working in parallel, the nodes finish the job faster,
and each one needs fewer resources.
That sounds nice and simple.
But how is it really done?
CloverETL allows certain
transformation components to be
assigned to multiple cluster nodes.
Such components then run in multiple instances.
We call this Allocation.
[Figure: a transformation on the CloverETL Cluster; two components run 1x and one runs 3x, with the instances allocated to Node 1, Node 2, and Node 3]
Special components allow
incoming data to be split
and sent in parallel flows to
multiple nodes where the
processing flow continues.
[Figure: serial data on Node 1 is split into partitioned data for the 1st, 2nd, and 3rd instances on Node 1, Node 2, and Node 3]
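The split described above can be sketched in a few lines of Python. This is only an illustration of two common partitioning strategies (round-robin and by key), not CloverETL's own components or API; the function names and the sample company records are hypothetical.

```python
import zlib


def partition_round_robin(records, n):
    """Deal records out to n parallel streams in turn, like dealing cards."""
    streams = [[] for _ in range(n)]
    for i, record in enumerate(records):
        streams[i % n].append(record)
    return streams


def partition_by_key(records, n, key):
    """Send every record that shares a key value to the same stream."""
    streams = [[] for _ in range(n)]
    for record in records:
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        index = zlib.crc32(str(key(record)).encode("utf-8")) % n
        streams[index].append(record)
    return streams


# Hypothetical sample data: (company, state)
records = [("Acme", "NY"), ("Bolt", "CA"), ("Core", "NY"), ("Dyn", "TX"), ("Eon", "CA")]
round_robin = partition_round_robin(records, 3)
by_state = partition_by_key(records, 3, key=lambda r: r[1])
```

Round-robin spreads the rows evenly but scatters each state across streams; partitioning by key keeps each state in one stream at the cost of possibly uneven stream sizes.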
Other components gather
data from parallel flows back
into a single, serial one.
[Figure: partitioned data from the 1st, 2nd, and 3rd instances on Node 1, Node 2, and Node 3 is gathered back into serial data on Node 1]
The original transformation is automatically
“rewritten” into several smaller ones, which
are executed by cluster nodes in parallel.
Which nodes will be used is determined by
Allocation.
[Figure: serial data is partitioned, processed by the 1st, 2nd, and 3rd instances on Node 1, Node 2, and Node 3, and gathered back into serial data]
Let’s take a look
at an example.
Serial processing
In this example, we read data about company addresses. There are 10,499,849 records in total.
We also calculate the number of companies residing in each US state.
We get a total of 51 records: one record per US state (the 50 states plus DC).
Here, we’re processing the same input data, but in parallel now.
Split: each parallel stream gets a portion of the input data.
Work in 3 parallel streams, producing partial results.
Gather: we get a total of 51 records again.
Go parallel in 1 minute.
[Figure: the serial graph is turned into the parallel graph with two drag & drop steps]
What’s the Trick?
☞ Split the input data into parallel streams.
☞ Do the heavy lifting on smaller data portions in parallel.
☞ Bring the individual pieces of results together at the end.
DONE
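The three steps of the trick can be sketched in Python. This is a toy model under stated assumptions: threads stand in for cluster nodes, `heavy_lifting` is a hypothetical record-by-record transformation, and none of the names below come from CloverETL.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, n):
    """Step 1: split the input into n parallel streams (round-robin)."""
    return [records[i::n] for i in range(n)]


def heavy_lifting(stream):
    """Step 2: stand-in transformation; here it just normalizes names."""
    return [name.strip().upper() for name in stream]


def gather(partial_results):
    """Step 3: bring the pieces back together into one serial stream."""
    return [record for part in partial_results for record in part]


def run_parallel(records, n=3):
    streams = split(records, n)
    # Threads only imitate nodes here; pool.map keeps the stream order.
    with ThreadPoolExecutor(max_workers=n) as pool:
        partial_results = list(pool.map(heavy_lifting, streams))
    return gather(partial_results)


result = run_parallel([" acme", "bolt ", "core", "dyn"], n=3)
```

Note that the gathered output arrives stream by stream, so the original record order is not preserved. For a per-record transformation like this one, plain concatenation is enough; aggregations, sorts, and joins need more care.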
Let’s continue.
More on allocation and partitioned sandboxes
A Sandbox
We assume you are familiar with the CloverETL Server’s concept of a SANDBOX.
A SANDBOX is a logical name for a file directory structure managed by the Server. It
allows individual projects on the Server to be separated into logical units. Each CloverETL
data transformation can access multiple sandboxes, either locally or remotely.
Let’s look at a special type of sandbox: the partitioned sandbox.
In a partitioned sandbox, the input file is split into subfiles,
each residing on a different node of the Cluster in a similarly
structured folder.
The sandbox presents the “originals”, that is, the combined data.
[Figure: partitioned sandbox “SboxP” with Part 1, Part 2, and Part 3 stored on Node 1, Node 2, and Node 3]
Partitioned
Sandboxes
A partitioned sandbox is a
logical abstraction on top of
similarly structured folders
on different Cluster nodes.
[Figure: the sandbox’s logical structure, with a unified view of folders & files]
[Figure: the sandbox’s physical structure, with listed locations/nodes of the files’ portions]
Partitioned Sandbox: defines how data is partitioned across nodes of the CloverETL Cluster.
Allocation: defines how a transformation’s run is distributed across nodes of the CloverETL Cluster.
The allocation can be set to derive from the sandbox layout.
Data processing happens where data resides.
We tell the cluster to run our transformation
components on nodes that also contain portions of
the data we want to process.
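To make "data processing happens where data resides" concrete, here is a minimal Python model. Everything in it is an assumption made for illustration (the `PartitionedSandbox` class, the node ids, the folder paths, and the file name); it does not reflect how CloverETL Server stores or exposes sandbox configuration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PartitionedSandbox:
    """Hypothetical model: one logical name, one physical folder per node."""
    name: str
    locations: Dict[str, str]  # node id -> folder holding that node's part

    def physical_path(self, node: str, logical_file: str) -> str:
        """Resolve a logical file name to the part stored on one node."""
        return f"{self.locations[node].rstrip('/')}/{logical_file}"


def allocation_from_sandbox(sandbox: PartitionedSandbox) -> List[str]:
    """Derive the allocation: one instance per node that holds a part."""
    return sorted(sandbox.locations)


def execution_plan(sandbox: PartitionedSandbox, logical_file: str) -> Dict[str, str]:
    """Each allocated node reads only its own local part of the file."""
    return {
        node: sandbox.physical_path(node, logical_file)
        for node in allocation_from_sandbox(sandbox)
    }


sbox = PartitionedSandbox(
    name="SboxP",
    locations={"node1": "/data/sboxp", "node2": "/data/sboxp", "node3": "/data/sboxp/"},
)
plan = execution_plan(sbox, "companies.csv")
```

The point of the sketch is the direction of the dependency: the list of nodes is computed from the sandbox layout rather than configured separately, so adding a partition adds a parallel instance.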
Allocation Determined By a
Partitioned Sandbox:
4 partitions ➔ 4 parallel transformations.
There’s no gathering at the end: partitioned results are
stored directly to the partitioned sandbox. Allocation for the
aggregator is derived from the sandbox being used.
Allocation Determined By an
Explicit Number:
8 parallel transformations.
Partitioning at the beginning and gathering at
the end are both necessary, as we need to cross the
serial⇿parallel boundary twice.
A Data Skew
Data is not uniformly distributed across partitions.
This is called a data skew.
It indicates that the chosen partitioning key is not
the best one for maximum performance.
However, the chosen key allows us to perform a
single-pass aggregation (no semi-results), so it’s a
good tradeoff.
The busiest worker will have to process 2.5 million rows, whereas the least busy one
gets only 0.67 million; that is, roughly 3.7x fewer.
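A skew like this can be measured before committing to a partitioning key. The following Python sketch counts how many rows each partition would receive and compares the busiest with the least busy one. The state-to-partition mapping and the sample rows are made up for illustration.

```python
from collections import Counter


def partition_sizes(keys, assign):
    """Count how many records land in each partition for a key assignment."""
    return Counter(assign(key) for key in keys)


def skew_ratio(sizes):
    """Rows on the busiest partition divided by rows on the least busy one."""
    counts = list(sizes.values())
    return max(counts) / min(counts)


# Hypothetical input: one state code per company record.
states = ["CA"] * 6 + ["NY"] * 3 + ["TX"] * 2 + ["WY"] * 1

# Hypothetical key-based assignment of states to 3 partitions.
state_to_partition = {"CA": 0, "NY": 1, "TX": 1, "WY": 2}

sizes = partition_sizes(states, state_to_partition.get)
ratio = skew_ratio(sizes)
```

Here partition 0 receives 6 rows and partition 2 only 1, so the ratio is 6.0; a ratio close to 1.0 would indicate an even spread.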
Parallel Pitfalls
When processing data in parallel, a few things should be considered.
Aggregating, Sorting, Joining…
Working in parallel means producing “parallel” (semi-)results.
These partial results have to be processed further to get the final result.
First, we produce 4 aggregated semi-results. Then we aggregate the
semi-results to get the final result.
record stream 1 ➔ semi-result 1
record stream 2 ➔ semi-result 2
record stream 3 ➔ semi-result 3
record stream 4 ➔ semi-result 4
semi-results 1, 2, 3, 4 ➔ final result
The good news: when increasing or changing the
number of parallel streams, we don’t have to
change the transformation.
Parallel Pitfalls
Aggregating, Sorting, Joining…
Full transformation: parallel aggregation & post-processing of semi-results
[Figure: the transformation graph, with count() used in Step 1 and sum() used in Step 2]
Why?
Example: a parallel count of companies per state using count().
Step 1: we produce partial results. Because records are
partitioned round-robin, data for one state may appear
in multiple parallel streams.
For example, we might get data for NY as 4 partial results
in 4 different streams.
Step 2: we merge all the partial results from
the 4 parallel streams into one sequence and then
aggregate again to get the final numbers.
At this step the aggregation function is sum():
we sum the partial counts.
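The count-then-sum pattern can be shown with Python's `collections.Counter`. This is a sketch of the idea only; the function names and the sample records are hypothetical, and plain lists stand in for the parallel streams.

```python
from collections import Counter


def count_per_state(stream):
    """Step 1: count() inside one parallel stream, giving a partial result."""
    return Counter(state for _company, state in stream)


def sum_partial_counts(partials):
    """Step 2: sum() the partial counts per state to get the final numbers."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts; it does not overwrite
    return total


# Hypothetical round-robin streams of (company, state) records.
streams = [
    [("Acme", "NY"), ("Bolt", "NY"), ("Core", "CA")],
    [("Dyn", "NY"), ("Eon", "TX")],
]

partials = [count_per_state(stream) for stream in streams]
final = sum_partial_counts(partials)

# The pitfall: applying count() again in step 2 counts partial records,
# not companies. NY appears in 2 partial results but has 3 companies.
wrong_ny = sum(1 for partial in partials if "NY" in partial)
```

With this data `final["NY"]` is 3, while re-counting the partial records would report 2.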
Parallel Pitfalls
Aggregating, Sorting, Joining…
Parallel sorting
[Figure: the transformation graph, marking where to sort and where to merge]
Why?
Sorting in parallel ➔ records are sorted within
individual parallel streams, but not across all
streams.
Bringing parallel sorted streams together
into a serial stream ➔ records have to be
merged according to the same key as was
used in parallel sorting ➔ this produces an
overall sorted serial result.
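The sort-then-merge rule maps directly onto Python's standard library: sort each stream on its own, then combine them with `heapq.merge` using the same key. The sample records and function names below are hypothetical; this illustrates the technique, not CloverETL's sort or merge components.

```python
import heapq
from operator import itemgetter

by_state = itemgetter(1)  # the sort key: the state field of (company, state)


def sort_streams(streams, key):
    """Sort every parallel stream independently."""
    return [sorted(stream, key=key) for stream in streams]


def merge_sorted(streams, key):
    """Merge already-sorted streams into one sorted serial stream."""
    return list(heapq.merge(*streams, key=key))


# Hypothetical parallel streams of (company, state) records.
streams = [
    [("Core", "NY"), ("Acme", "AL")],
    [("Bolt", "CA"), ("Zed", "WY")],
    [("Dyn", "TX")],
]

sorted_streams = sort_streams(streams, by_state)

# Plain concatenation keeps each stream's order but not the global order.
concatenated = [record for stream in sorted_streams for record in stream]

# Merging on the same key restores a globally sorted result.
merged = merge_sorted(sorted_streams, by_state)
```

`heapq.merge` assumes its inputs are already sorted by the given key, which is why the merge key must match the key used in the parallel sort.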
Parallel Pitfalls
Aggregating, Sorting, Joining…
Parallel joining
Why?
Joining in parallel ➔ master & slave(s) records
must be partitioned by the same key/field. The
same key must be used for joining the records.
! Otherwise, there is a danger that records
from master & slave with the same key will not
join, as they end up in different parallel streams.
! The Joiner joins only within one stream, not
across streams.
Parallel Pitfalls
Aggregating, Sorting, Joining…
Example: parallel joining - 3 parallel streams - partitioning by state
Slave records:
stream 1 ⥤ [AL AK AZ AR CA CO CT DC DE FL]
stream 2 ⥤ [GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND]
stream 3 ⥤ [OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY]
Master records:
stream 1 ⥤ [AK AZ DE]
stream 2 ⥤ [IL MD NY]
stream 3 ⥤ [OR PA VA]
Result (all master records joined):
stream 1 ⥤ [AK AZ DE]
stream 2 ⥤ [IL MD NY]
stream 3 ⥤ [OR PA VA]
Parallel Pitfalls
Aggregating, Sorting, Joining…
Example: parallel joining - 3 parallel streams - partitioning round robin
Slave records:
stream 1 ⥤ [AL AR CT FL GA HI IA LA MA MS NE NJ NC OH PA SD UT WA WY]
stream 2 ⥤ [AK CA DC IL KS ME MI MO NV NM ND OK RI TN VT WV]
stream 3 ⥤ [AZ CO DE ID IN KY MD MN MT NH NY OR SC TX VA WI]
Master records:
stream 1 ⥤ [AK IL OR]
stream 2 ⥤ [AZ MD VA]
stream 3 ⥤ [DE NY PA]
Result (some master records joined):
stream 1 ⥤ []
stream 2 ⥤ []
stream 3 ⥤ [DE NY]
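The two joining examples can be replayed in Python using the stream contents shown on the slides. The join itself is a simplified stand-in (an inner join on the state code, evaluated stream by stream); the function name is hypothetical, and the assignment of the short lists to "master" and the long lists to "slave" is read off the slides' result captions.

```python
def join_within_streams(master_streams, slave_streams):
    """Inner-join master to slave records one stream at a time.

    A master record joins only if its key exists in the slave data of the
    same stream; there are no lookups across streams.
    """
    results = []
    for masters, slaves in zip(master_streams, slave_streams):
        slave_keys = set(slaves)
        results.append([m for m in masters if m in slave_keys])
    return results


# Partitioned by state: master and slave use the same key ranges.
slave_by_state = [
    "AL AK AZ AR CA CO CT DC DE FL".split(),
    "GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND".split(),
    "OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY".split(),
]
master_by_state = ["AK AZ DE".split(), "IL MD NY".split(), "OR PA VA".split()]

# Partitioned round robin: matching keys may land in different streams.
slave_round_robin = [
    "AL AR CT FL GA HI IA LA MA MS NE NJ NC OH PA SD UT WA WY".split(),
    "AK CA DC IL KS ME MI MO NV NM ND OK RI TN VT WV".split(),
    "AZ CO DE ID IN KY MD MN MT NH NY OR SC TX VA WI".split(),
]
master_round_robin = ["AK IL OR".split(), "AZ MD VA".split(), "DE NY PA".split()]

joined_by_state = join_within_streams(master_by_state, slave_by_state)
joined_round_robin = join_within_streams(master_round_robin, slave_round_robin)
```

Partitioning both inputs by state joins all 9 master records, while round robin joins only DE and NY; the other 7 master records are silently lost because their matching slave records sit in a different stream.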
Bringing it all together…
Going parallel is easy!
Try it out for yourself.
☞ BIG DATA problems are handled through the Cluster’s scalability
☞ Existing transformations can be easily converted to parallel
☞ There’s no magic: users have full control over what’s happening
☞ CloverETL Cluster has built-in fault resiliency and load balancing
If you have any questions, check out:
www.cloveretl.com
forum.cloveretl.com
blog.cloveretl.com
CloverETL Cluster - Big Data Parallel Processing Explained

  • 1. BIG HANDLING LARGE DATA The CloverETL Cluster Architecture Explained Wednesday, August 14, 13
  • 2. The Reality: You have a really big pile to deal with. One traditional digger might not be enough. Really Big Data Wednesday, August 14, 13
  • 3. You could get a really big, expensive digger... Really Big Data Wednesday, August 14, 13
  • 4. …or several smaller ones and get the job done faster & cheaper. Really Big Data Wednesday, August 14, 13
  • 5. But what if the one big one suffers a mechanical failure? Really Big Data Wednesday, August 14, 13
  • 6. With small diggers, failure of one does not affect the rest. Really Big Data Wednesday, August 14, 13
  • 7. Which one do you choose ? vs Wednesday, August 14, 13
  • 8. CloverETL Cluster resiliency features Optimizing for robustness... Wednesday, August 14, 13
  • 9. Fault resiliency – HW & SW automatic fail-over Before After Node 2 Node 1 Node 2Node 1 Wednesday, August 14, 13
  • 10. automatic load balancing Load Balancing N ew task Before After Node 2 Node 1 Node 1 Node 2 Wednesday, August 14, 13
  • 11. CloverETL Cluster - BIG DATA features Optimizing for speed... Wednesday, August 14, 13
  • 12. Traditionally, data transformations were run on a single, big server with multiple CPUs and plenty of RAM. And it was expensive. Wednesday, August 14, 13
  • 13. Then the CloverETL team developed the concept of a data transformation cluster. The CloverETL Cluster was born It creates a powerful data transformation beast from a set of low-cost commodity hardware machines. Wednesday, August 14, 13
  • 14. Now, one data transformation can be set to run in parallel on all available nodes of the CloverETL Cluster. Wednesday, August 14, 13
  • 15. Each cluster node executing the transformation is automatically fed with a different portion of the input data. Part 1 Part 2 Part 3 Wednesday, August 14, 13
  • 16. Part 1 Part 2 Part 3 Now Before = = Working in parallel, they finish the job faster, with less resources needed individually. Wednesday, August 14, 13
  • 17. That sounds nice and simple. But how is it really done? Wednesday, August 14, 13
  • 18. CloverETL allows certain transformation components to be assigned to multiple cluster nodes. runs 1x runs 1x runs 3x Allocated to Allocated to Allocatedto Allocatedto Node 1 Node 2 Node 3 CloverETL Cluster Such components then run in multiple instances. We call this Allocation. Allocated to Wednesday, August 14, 13
  • 19. Special components allow incoming data to be split and sent in parallel flows to multiple nodes where the processing flow continues. Node 1 Node 2 Node 3 Serial data Partitioned data Node 1 1st instance 2nd instance 3rd instance Wednesday, August 14, 13
  • 20. Other components gather data from parallel flows back into a single, serial one. Node 1 Node 2 Node 3 Serial dataPartitioned data Node 1 1st instance 2nd instance 3rd instance Wednesday, August 14, 13
  • 21. The original transformation is automatically “rewritten” into several smaller ones, which are executed by cluster nodes in parallel. Which nodes will be used is determined by Allocation. Node 1 Node 2 Node 3 2nd instance 3rd instance Serial data Serial dataPartitioned data 1st instance Node 3 Wednesday, August 14, 13
  • 22. Let’s take a look at an example. Wednesday, August 14, 13
  • 23. In this example, we’ll read data about company addresses.There are 10,499,849 records in total. We also calculate statistics of the number of companies residing in each US state. We get a total of 51 records – one record per US state. serial processing Wednesday, August 14, 13
  • 24. Here, we’re processing the same input data, but now in parallel. Each parallel stream gets a portion of the input data and produces partial results. We get a total of 51 records again. [Diagram: Split, work in 3 parallel streams, Gather.]
  • 25. Go parallel in 1 minute. [Diagram: two drag&drop steps turn the serial transformation into a parallel one.]
  • 26. What’s the Trick? (1) Split the input data into parallel streams. (2) Do the heavy lifting on smaller data portions in parallel. (3) Bring the individual pieces of results together at the end. Done.
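The three steps of the trick can be sketched in plain Python. This is only an illustration, not CloverETL code: the function names, the sample records, and the use of threads as stand-ins for cluster nodes are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_round_robin(records, streams):
    """Step 1: deal records out to `streams` lists, one record at a time."""
    parts = [[] for _ in range(streams)]
    for i, record in enumerate(records):
        parts[i % streams].append(record)
    return parts


def heavy_lifting(part):
    """Step 2: stand-in for the per-stream work (here: normalize state codes)."""
    return [code.strip().upper() for code in part]


def gather(results):
    """Step 3: concatenate the per-stream results into one serial list."""
    return [record for part in results for record in part]


# Hypothetical sample input; the real example reads millions of address records.
records = [" ny", "ca ", "tx", "ny", "wa", "ca", "ny"]

parts = split_round_robin(records, 3)
with ThreadPoolExecutor(max_workers=3) as pool:  # threads play the role of nodes
    results = list(pool.map(heavy_lifting, parts))
output = gather(results)
```

Note that the gathered output contains every record exactly once, but in stream order rather than in the original input order.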
  • 27. Let’s continue. More on allocation and partitioned sandboxes.
  • 28. A Sandbox: We assume you are familiar with the CloverETL Server’s concept of a sandbox. A sandbox is a logical name for a file directory structure managed by the Server. It allows individual projects on the Server to be separated into logical units. Each CloverETL data transformation can access multiple sandboxes, either locally or remotely. Let’s look at a special type of sandbox: the partitioned sandbox.
  • 29. In a partitioned sandbox, the input file is split into subfiles, each residing on a different node of the Cluster in a similarly structured folder. The sandbox presents the “originals”, that is, the combined data. [Diagram: partitioned sandbox “SboxP” with Part 1, Part 2, and Part 3 on Node 1, Node 2, and Node 3.]
  • 30. Partitioned Sandboxes: A partitioned sandbox is a logical abstraction on top of similarly structured folders on different Cluster nodes. [Screenshots: the sandbox’s logical structure, with a unified view of folders and files; the sandbox’s physical structure, listing the locations/nodes of the files’ portions.]
  • 31. Partitioned Sandbox: defines how data is partitioned across nodes of the CloverETL Cluster. Allocation: defines how a transformation’s run is distributed across nodes of the CloverETL Cluster. The allocation can be set to derive from the sandbox layout, so data processing happens where the data resides: we tell the cluster to run our transformation components on the nodes that also contain the portions of data we want to process.
  • 32. Allocation Determined By a Partitioned Sandbox: 4 partitions, 4 parallel transformations. There’s no gathering at the end; partitioned results are stored directly in the partitioned sandbox. Allocation for the aggregator is derived from the sandbox being used.
  • 33. Allocation Determined By an Explicit Number: 8 parallel transformations. Partitioning at the beginning and gathering at the end are necessary because we need to cross the serial⇿parallel boundary twice.
  • 34. A Data Skew: Data is not uniformly distributed across partitions. This is called a data skew. It indicates that the chosen partitioning key is not the best one for maximum performance. However, the chosen key allows us to perform a single-pass aggregation (no semi-results), so it’s a good tradeoff. The busiest worker has to process 2.5 million rows, whereas the least busy one processes only 0.67 million, that is, approximately 3.5x fewer.
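The tradeoff described on this slide can be sketched as follows. This is a toy model under stated assumptions: the sample records, the stream count, and the CRC32-based key hashing are hypothetical choices for the illustration and say nothing about how CloverETL assigns keys to partitions.

```python
import zlib
from collections import Counter


def partition_by_key(records, streams):
    """Send every record with the same state code to the same stream."""
    parts = [[] for _ in range(streams)]
    for state in records:
        # Deterministic hash of the key (an assumption for this sketch).
        index = zlib.crc32(state.encode("utf-8")) % streams
        parts[index].append(state)
    return parts


# Hypothetical, deliberately uneven input: some states have far more companies.
records = ["CA"] * 6 + ["NY"] * 3 + ["TX"] * 2 + ["WY"]
parts = partition_by_key(records, 4)

# Data skew: streams receive different numbers of rows.
sizes = [len(part) for part in parts]
busiest, least_busy = max(sizes), min(sizes)

# Single-pass aggregation: each state lives in exactly one stream, so the
# per-stream counts are already final and only need to be collected.
final_counts = {}
for counts in (Counter(part) for part in parts):
    final_counts.update(counts)
```

Because a key never spans two streams, no second aggregation step is needed; the price is that the stream holding the most popular keys does more work than the others.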
  • 35. Parallel Pitfalls (Aggregating, Sorting, Joining…): When processing data in parallel, a few things should be considered. Working in parallel means producing “parallel” semi-results. First, we produce 4 aggregated semi-results (record stream 1 ➔ semi-result 1, record stream 2 ➔ semi-result 2, record stream 3 ➔ semi-result 3, record stream 4 ➔ semi-result 4). These partial results have to be processed further to get the final result: we aggregate semi-results 1, 2, 3, and 4 ➔ final result. The good news: when increasing or changing the number of parallel streams, we don’t have to change the transformation.
  • 36. Parallel Pitfalls (Aggregating, Sorting, Joining…): Full transformation with parallel aggregation and post-processing of semi-results: count() in Step 1, sum() in Step 2. Why? Example: a parallel count of companies per state using count(). In Step 1, we produce partial results. Because records are partitioned round-robin, data for one state may appear in multiple parallel streams. For example, we might get data for NY as 4 partial results in 4 different streams. In Step 2, we merge all the partial results from the 4 parallel streams into a single sequence and then aggregate again to get the final numbers. At this step, the aggregation function is sum(): we sum the partial counts.
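The count-then-sum pattern can be sketched in a few lines of Python. The sample records and helper name are assumptions for the illustration; `collections.Counter` stands in for the aggregator component.

```python
from collections import Counter


def split_round_robin(records, streams):
    """Deal records out to `streams` lists, one record at a time."""
    parts = [[] for _ in range(streams)]
    for i, record in enumerate(records):
        parts[i % streams].append(record)
    return parts


# Hypothetical sample: one state code per company record.
records = ["NY", "NY", "NY", "NY", "CA", "TX", "CA", "NY"]
streams = split_round_robin(records, 4)

# Step 1: count() inside each parallel stream gives partial results.
partial_counts = [Counter(stream) for stream in streams]

# Step 2: gather the partial results and sum() them per state.
final_counts = Counter()
for partial in partial_counts:
    final_counts.update(partial)  # Counter.update adds counts together
```

In this sample, NY shows up in all 4 streams, so applying count() again in Step 2 would report 4 for NY instead of the correct 5; summing the partial counts gives the right answer.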
  • 37. Parallel Pitfalls (Aggregating, Sorting, Joining…): Parallel sorting: sort in the parallel streams (1), merge where they come together (2). Why? (1) Sorting in parallel ➔ records are sorted within individual parallel streams, but not across all streams. (2) Bringing the sorted parallel streams together into a serial stream ➔ records have to be merged according to the same key used in the parallel sort ➔ this produces an overall sorted serial result.
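A minimal sketch of sort-then-merge, using Python's standard `heapq.merge` as a stand-in for the merging component. The records, the stream count, and the helper name are assumptions for the illustration.

```python
import heapq
from operator import itemgetter


def split_round_robin(records, streams):
    """Deal records out to `streams` lists, one record at a time."""
    parts = [[] for _ in range(streams)]
    for i, record in enumerate(records):
        parts[i % streams].append(record)
    return parts


# Hypothetical (state, company) records; the sort key is the state code.
records = [("TX", "c1"), ("AL", "c2"), ("NY", "c3"), ("CA", "c4"),
           ("WA", "c5"), ("DE", "c6"), ("FL", "c7")]
sort_key = itemgetter(0)

# (1) Sort inside each parallel stream.
sorted_streams = [sorted(stream, key=sort_key)
                  for stream in split_round_robin(records, 3)]

# Simply concatenating the streams does NOT give a globally sorted result.
concatenated = [record for stream in sorted_streams for record in stream]

# (2) Merge the sorted streams using the same key as the parallel sort.
merged = list(heapq.merge(*sorted_streams, key=sort_key))
```

The merge only works because every input stream is already sorted by the same key; merging on a different key would silently produce a wrongly ordered result.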
  • 38. Parallel Pitfalls (Aggregating, Sorting, Joining…): Parallel joining. Why? When joining in parallel, master and slave(s) records must be partitioned by the same key/field, and the same key must be used for joining records. Otherwise, there is a danger that master and slave records with the same key will not join because they end up in different parallel streams. The joiner joins only within one stream, not across streams.
  • 39. Parallel Pitfalls (Aggregating, Sorting, Joining…): Example of parallel joining with 3 parallel streams, partitioning by state. Result: all master records joined.
      Input with all 51 state codes: stream 1 [AL AK AZ AR CA CO CT DC DE FL]; stream 2 [GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND]; stream 3 [OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY]
      Input with 9 state codes: stream 1 [AK AZ DE]; stream 2 [IL MD NY]; stream 3 [OR PA VA]
      Joined result: stream 1 [AK AZ DE]; stream 2 [IL MD NY]; stream 3 [OR PA VA]
  • 40. Parallel Pitfalls (Aggregating, Sorting, Joining…): Example of parallel joining with 3 parallel streams, partitioning round robin. Result: only some master records joined.
      Input with all 51 state codes: stream 1 [AL AR CT FL GA HI IA LA MA MS NE NJ NC OH PA SD UT WA WY]; stream 2 [AK CA DC IL KS ME MI MO NV NM ND OK RI TN VT WV]; stream 3 [AZ CO DE ID IN KY MD MN MT NH NY OR SC TX VA WI]
      Input with 9 state codes: stream 1 [AK IL OR]; stream 2 [AZ MD VA]; stream 3 [DE NY PA]
      Joined result: stream 1 []; stream 2 []; stream 3 [DE NY]
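The two joining examples can be replayed with the state codes from the slides. The sketch below is an assumption-laden model, not CloverETL code: the function name is hypothetical, and it simply keeps the records of the 9-code input that find a match in the same-numbered stream of the 51-code input.

```python
def join_within_streams(large_streams, small_streams):
    """Join stream by stream; records never see the other streams."""
    results = []
    for large, small in zip(large_streams, small_streams):
        keys = set(large)
        results.append([code for code in small if code in keys])
    return results


# Partitioning by state: both inputs use the same key-to-stream mapping.
large_by_state = [
    "AL AK AZ AR CA CO CT DC DE FL".split(),
    "GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND".split(),
    "OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY".split(),
]
small_by_state = [["AK", "AZ", "DE"], ["IL", "MD", "NY"], ["OR", "PA", "VA"]]

# Round-robin partitioning: matching keys may land in different streams.
large_round_robin = [
    "AL AR CT FL GA HI IA LA MA MS NE NJ NC OH PA SD UT WA WY".split(),
    "AK CA DC IL KS ME MI MO NV NM ND OK RI TN VT WV".split(),
    "AZ CO DE ID IN KY MD MN MT NH NY OR SC TX VA WI".split(),
]
small_round_robin = [["AK", "IL", "OR"], ["AZ", "MD", "VA"], ["DE", "NY", "PA"]]

joined_by_state = join_within_streams(large_by_state, small_by_state)
joined_round_robin = join_within_streams(large_round_robin, small_round_robin)
```

With key partitioning all 9 records find their match; with round robin only DE and NY do, because the other 7 matching records sit in a different stream.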
  • 41. Bringing it all together… ☞ BIG DATA problems are handled through the Cluster’s scalability. ☞ Existing transformations can be easily converted to parallel. ☞ There’s no magic: users have full control over what’s happening. ☞ CloverETL Cluster has built-in fault resiliency and load balancing. Going parallel is easy! Try it out for yourself.
  • 42. If you have any questions, check out: www.cloveretl.com, forum.cloveretl.com, blog.cloveretl.com