Hadoop Distributions: Bottlenecks and Tuning

This presentation by Alexey Diomin, R&D Engineer at Altoros, explains how to spot performance bottlenecks in Hadoop and overviews five approaches to eliminating them.

Distribution     OpenSource   Monitoring   Target Group
Apache Hadoop    Yes          X            Developers
Cloudera         Yes          Good         All
Hortonworks      Yes          Good         All
MapR             No           Bad          Enterprise
PivotalHD        No           Bad          Enterprise

1. Increase size of cluster
2. Increase input block size
3. Increase buffer size

Wordcount

Reduce function as Combine

combine 1:

<a, 1> <b, 1> <a, 1> => <a, 2> <b, 1>

combine 2:

<a, 1> <b, 1> => <a, 1> <b, 1>

Reduce:

<a, {1, 2}> <b, {1, 1}> => <a, 3> <b, 2>
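The wordcount case above can be sketched in plain Python. This is only an illustration of why the reduce function can double as the combiner (summation is associative and commutative), not Hadoop API code; the function name `sum_counts` and the variable names are assumptions made for the example.

```python
from collections import defaultdict


def sum_counts(pairs):
    """Wordcount reduce function: sum the values seen for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Output of two map tasks, as on the slide.
map_output_1 = [("a", 1), ("b", 1), ("a", 1)]
map_output_2 = [("a", 1), ("b", 1)]

# The same function runs as the combiner on each map's local output...
combined_1 = sum_counts(map_output_1)
combined_2 = sum_counts(map_output_2)

# ...and again as the reducer on the merged combiner outputs.
reduced = sum_counts(list(combined_1.items()) + list(combined_2.items()))
print(combined_1, combined_2, reduced)
```

Because partial sums can be summed again without changing the result, the combiner reduces the data sent over the network without affecting the final counts.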

Mean (incorrect: the reduce function reused as Combine)

combine 1:

<k, 40> <k, 30> <k, 20> => <k, 30>

combine 2:

<k, 2> <k, 8> => <k, 5>

Reduce:

<k, {30, 5}> => <k, 17.5>

The true mean is (40 + 30 + 20 + 2 + 8)/5 = 20, not 17.5.

Mean (correct: Combine carries sum and count)

combine 1:

<k, <40,1>> <k, <30,1>> <k, <20,1>> => <k, <90,3>>

combine 2:

<k, <2,1>> <k, <8,1>> => <k, <10,2>>

Reduce:

<k, {<90,3>, <10,2>}> => <k, 20>
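A minimal Python sketch of the (sum, count) approach, using the numbers from the slides. It is not Hadoop code: `merge_pairs` and `reduce_mean` are hypothetical names chosen for the illustration. The last lines reproduce the mean-of-means result for comparison.

```python
def merge_pairs(pairs):
    """Combiner: collapse (sum, count) pairs into a single (sum, count) pair."""
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return (total, count)


def reduce_mean(pairs):
    """Reducer: merge the partial (sum, count) pairs, then divide once."""
    total, count = merge_pairs(pairs)
    return total / count


# Each map emits <k, <value, 1>> for every input number.
map_output_1 = [(40, 1), (30, 1), (20, 1)]
map_output_2 = [(2, 1), (8, 1)]

partial_1 = merge_pairs(map_output_1)
partial_2 = merge_pairs(map_output_2)
correct_mean = reduce_mean([partial_1, partial_2])

# For comparison: averaging the per-combiner means loses the group sizes.
mean_1 = partial_1[0] / partial_1[1]
mean_2 = partial_2[0] / partial_2[1]
mean_of_means = (mean_1 + mean_2) / 2
print(correct_mean, mean_of_means)
```

The division happens only once, in the reducer, so the result does not depend on how the values were split between combiners.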

Hadoop Distributions: Bottlenecks and Tuning

1. Diomin Aliaksey, R&D, 2014, Minsk

4. How to find the bottleneck? 4

21. 1. Compression 21

22. 1. Compression 2. Combiner 22

Editor's notes

Map output: if it does not fit into the buffer, it is spilled to disk and then merge-sorted. At a certain point, disk usage is twice the size of the map output.
Data is moved across the network; the I/O load is disk reads plus network traffic.
Exercise: how many disk writes and reads can a map output of size X cause?
Ideal case: X is written by the map and X is read at the fetch stage.
Harsh reality:
write: X (spill) + X (merge-sort) + X (fetch/spill) = 3X
read: X (merge-sort) + X (fetch) + X (to reducer) = 3X
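The accounting in this note can be written out as a small calculation. This is a sketch of the note's own arithmetic, not a measurement of a real cluster; the function name `shuffle_disk_traffic` is hypothetical.

```python
def shuffle_disk_traffic(x, worst_case=True):
    """Disk bytes written and read for a map output of size x."""
    if not worst_case:
        # Ideal: the map writes its output once, the fetch reads it once.
        return {"write": x, "read": x}
    writes = {"spill": x, "merge-sort": x, "fetch/spill": x}
    reads = {"merge-sort": x, "fetch": x, "to reducer": x}
    return {"write": sum(writes.values()), "read": sum(reads.values())}


ideal = shuffle_disk_traffic(10, worst_case=False)
worst = shuffle_disk_traffic(10)
print(ideal, worst)
```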
1) Double the number of machines and, along with that, set twice as many map and reduce tasks in the parameters. Map and reduce tasks connect to each other (every reduce fetches from every map) => 4 times more connections for fetching data => the limits on request handlers on the DataNode itself are hit. CONCLUSION: the number of concurrently running map/reduce instances should be determined primarily by the job; linear scaling is a fairy tale.
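The fourfold growth follows from the shuffle being all-to-all. A short sketch, where the task counts 100 and 20 are made-up values for illustration only:

```python
def shuffle_connections(num_maps, num_reduces):
    """Every reduce task fetches a partition from every map task."""
    return num_maps * num_reduces


# Hypothetical baseline job, then the same job with both counts doubled.
baseline = shuffle_connections(num_maps=100, num_reduces=20)
doubled = shuffle_connections(num_maps=200, num_reduces=40)
growth = doubled / baseline
print(baseline, doubled, growth)
```

Doubling both factors doubles the hardware but quadruples the number of fetch connections the cluster has to serve.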
2) Increase the input data block for map => the buffer size is exceeded => an extra spill to disk => more disk I/O => everything gets slower. CONCLUSION: the input block for a map should be large enough to fill the buffer, but no larger; otherwise there is extra disk activity.
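A rough way to see when the extra spill appears is to compare the map output size with the spill threshold of the sort buffer. The sketch below is a simplification: it assumes the Hadoop 2.x defaults for `mapreduce.task.io.sort.mb` (100 MB) and `mapreduce.map.sort.spill.percent` (0.80), ignores per-record metadata, and uses the hypothetical helper name `estimated_spills`.

```python
import math


def estimated_spills(map_output_mb, sort_buffer_mb=100, spill_percent=0.80):
    """Approximate number of spill files written by one map task."""
    threshold_mb = sort_buffer_mb * spill_percent
    # The map output is flushed to disk at least once, even if it fits.
    return max(1, math.ceil(map_output_mb / threshold_mb))


fits_in_buffer = estimated_spills(64)
exceeds_buffer = estimated_spills(128)
far_exceeds_buffer = estimated_spills(256)
print(fits_in_buffer, exceeds_buffer, far_exceeds_buffer)
```

Once the output no longer fits under the threshold, each additional spill file has to be written and later merged, which is the extra disk activity the note warns about.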
3) Increase the cache (buffer) size for map/reduce => the JVM limits the buffer size (an array larger than 2 GB cannot be allocated). Nothing can be done about this; keep in mind that map/reduce functions have their own limits and that these are easy to reach.
Compression => trading CPU for disk I/O => Snappy, a reasonably fast solution for stream compression.
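As an illustration, map output compression with Snappy is usually switched on through two job properties. The property names below are the Hadoop 2.x ones; the helper `to_generic_options` is hypothetical and only renders them as `-D key=value` arguments, which take effect only for jobs that parse Hadoop's generic options (for example via `ToolRunner`).

```python
# Hadoop 2.x property names for compressing intermediate map output.
SNAPPY_SHUFFLE_PROPS = {
    "mapreduce.map.output.compress": "true",
    "mapreduce.map.output.compress.codec":
        "org.apache.hadoop.io.compress.SnappyCodec",
}


def to_generic_options(props):
    """Render properties as '-D key=value' command-line arguments."""
    args = []
    for key, value in props.items():
        args += ["-D", f"{key}={value}"]
    return args


snappy_args = to_generic_options(SNAPPY_SHUFFLE_PROPS)
print(" ".join(snappy_args))
```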
Combiner: it cannot always be applied directly (for example, when the computation is done with Hive/Pig), or the function may be a tricky one.
Incorrect: averaging the per-combiner means gives the wrong result.
The correct solution, but it requires additional changes at every level: 1) change the MapOutputFormat (the value is not just a number but the sum of the collapsed numbers plus the count of numbers that produced the current sum); 2) write a separate function for Combine.
