In-situ MapReduce for Log Processing


          Speaker: LIN Qian
 http://www.comp.nus.edu.sg/~linqian
Log analytics
• Data centers with 1000s of
  servers

• Data-intensive computing:
  Store and analyze TBs of logs

Examples:
• Click logs
   – ad-targeting, personalization
• Social media feeds
   – brand monitoring
• Purchase logs
   – fraud detection
• System logs
   – anomaly detection, debugging
Log analytics today
• “Store-first-query later”

Problems:
• Scale
   – Stress network and disks
• Failures
   – Delay analysis or process incomplete data
• Timeliness
   – Hinder real-time apps

[Figure: servers feed a dedicated MapReduce cluster; labels “Store first ...” and “... query later”]
In-situ MapReduce (iMR)
Idea:
• Move analysis to the servers
• MapReduce for continuous data
• Ability to trade fidelity for latency

Optimized for:
• Highly selective workloads
   – e.g., up to 80% data filtered or summarized!
• Online analytics
   – e.g., ad re-targeting based on most recent clicks

[Figure: diagram labeled Servers, MapReduce, and Dedicated cluster]
An iMR query
The same:
• MapReduce API
  – map(r) → {k,v} : extract/filter data
  – reduce({k,v[]}) → v’ : data aggregation
  – combine({k,v[]}) → v’ : early, partial aggregation

The new:
• Provides continuous results
  – because logs are continuous
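The three functions named on this slide can be sketched in a few lines of Python. This is only an illustration of the map/combine/reduce contract, not iMR's actual API: the log format (`host status`), the 5xx filter, and the helpers `run_node` and `run_reduce` are assumptions made for the example.

```python
from collections import defaultdict

def map_fn(record):
    """Extract/filter: emit (host, 1) only for lines whose status starts with 5."""
    fields = record.split()
    if len(fields) >= 2 and fields[1].startswith("5"):
        yield fields[0], 1

def combine_fn(key, values):
    """Early, partial aggregation on the node that produced the data."""
    return sum(values)

def reduce_fn(key, values):
    """Final aggregation over the partial values for one key."""
    return sum(values)

def run_node(records):
    """Hypothetical helper: run map and combine over one node's records."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: combine_fn(key, values) for key, values in groups.items()}

def run_reduce(partials):
    """Hypothetical helper: merge the per-node partial results with reduce."""
    groups = defaultdict(list)
    for partial in partials:
        for key, value in partial.items():
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

node_a = run_node(["web1 500", "web1 200", "web2 503"])
node_b = run_node(["web1 502", "web2 200"])
result = run_reduce([node_a, node_b])
print(result)
```

Because the map step filters out most lines and the combine step collapses the rest, each node ships only a small dictionary instead of its raw log.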
Continuous MapReduce
• Input
   – An infinite stream of logs
• Bound input with sliding windows
   – Range of data (R)
   – Update frequency (S)
• Output
   – Stream of results, one for each window

[Figure: log entries on a timeline (0’’, 30’’, 60’’, 90’’) pass through Map, Combine, and Reduce]
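A minimal sketch of how a range R and an update frequency S bound a stream, assuming timestamps in seconds and a finite `horizon_s` so the example terminates. The function name and its parameters are hypothetical, chosen only for illustration.

```python
def sliding_windows(entries, range_s, slide_s, horizon_s):
    """Yield (start, end, records) for each window [end - R, end), advancing by S.

    entries: iterable of (timestamp_seconds, record) pairs.
    horizon_s: assumed end of the observed stream, so the loop stops.
    """
    entries = list(entries)
    end = range_s
    while end <= horizon_s:
        start = end - range_s
        yield start, end, [rec for ts, rec in entries if start <= ts < end]
        end += slide_s

entries = [(5, "a"), (35, "b"), (65, "c"), (95, "d")]
windows = list(sliding_windows(entries, range_s=60, slide_s=30, horizon_s=120))
for start, end, records in windows:
    print(start, end, records)
```

With R = 60 and S = 30, every record falls into two consecutive windows, which is the overlap the following slides set out to avoid reprocessing.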
Processing windows in-network
[Figure: windows with overlapping data on a timeline (0’’, 30’’, 60’’, 90’’); Map and Combine feed the user’s reduce function (Reduce)]

Aggregation tree for efficiency
Efficient processing with panes
• Divide window into panes (sub-windows)
  – Each pane is processed and sent only once
  – Root combines panes to produce window
• Eliminate redundant work
  – Save CPU & network resources, faster analysis

[Figure: panes P1 to P5 on a timeline (0’’, 30’’, 60’’, 90’’) pass through Map and Combine, then Reduce]
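The pane idea can be sketched as follows, assuming the pane length equals the update frequency and the aggregate is a simple count per key. The names `pane_summaries` and `windows_from_panes` are hypothetical; this is a single-process illustration, not the distributed implementation.

```python
from collections import Counter

def pane_summaries(entries, pane_s, horizon_s):
    """Map and combine each pane exactly once: one Counter per pane."""
    panes = [Counter() for _ in range(horizon_s // pane_s)]
    for ts, key in entries:
        if 0 <= ts < horizon_s:
            panes[ts // pane_s][key] += 1
    return panes

def windows_from_panes(panes, panes_per_window):
    """Root: merge consecutive pane summaries into one result per window."""
    results = []
    for first in range(len(panes) - panes_per_window + 1):
        merged = Counter()
        for pane in panes[first:first + panes_per_window]:
            merged.update(pane)
        results.append(dict(merged))
    return results

entries = [(5, "GET"), (35, "GET"), (40, "POST"), (65, "GET"), (95, "POST")]
panes = pane_summaries(entries, pane_s=30, horizon_s=120)
results = windows_from_panes(panes, panes_per_window=2)
print(results)
```

Each raw entry is touched once when its pane is summarized; overlapping windows reuse the stored pane summaries instead of re-reading the log.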
Impact of data loss on analysis
• Servers may get overloaded or fail

Challenges:
• Characterize incomplete results
• Allow users to trade fidelity for latency

[Figure: panes P1 to P5 with one input lost (X) ahead of Map, Combine, and Reduce; the result is marked “?”]
Quantifying data fidelity
• Data are naturally
  distributed
  – Space (server nodes)
  – Time (processing window)


• C2 metric
  – Annotates result windows
    with a “scoreboard”
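One way to picture such a scoreboard is a grid with one row per node (space) and one column per pane (time). The sketch below is an assumption about the shape of the annotation, not the format iMR actually uses; `scoreboard`, `render`, and `fraction_included` are hypothetical names.

```python
def scoreboard(nodes, panes, received):
    """Rows are nodes, columns are panes; True marks data that reached the result."""
    return {node: [(node, pane) in received for pane in panes] for node in nodes}

def render(board):
    """Draw the grid with '#' for included data and '.' for missing data."""
    return "\n".join(
        f"{node}: " + "".join("#" if ok else "." for ok in row)
        for node, row in board.items()
    )

def fraction_included(board):
    """Share of (node, pane) cells that contributed to the result window."""
    cells = [ok for row in board.values() for ok in row]
    return sum(cells) / len(cells)

nodes = ["n1", "n2", "n3"]
panes = [0, 1, 2, 3]
received = {("n1", p) for p in panes} | {("n2", 0), ("n2", 1), ("n3", 3)}
board = scoreboard(nodes, panes, received)
print(render(board))
print(fraction_included(board))
```

Two results with the same fraction of included data can have very different grids, which is why the annotation keeps both dimensions rather than a single percentage.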
Trading fidelity for latency
• Use C2 to trade fidelity for
  latency
  – Maximum latency requirement
  – Minimum fidelity requirement


• Different ways to meet
  minimum fidelity
  – 4 useful classes of C2
    specifications

Minimizing result latency




• Minimum fidelity with earlier results
• Give freedom to decrease latency
  – Return the earliest data available
• Appropriate for uniformly distributed
  events
Sampling non-uniform events




• Minimum fidelity with random sampling
• Less freedom to decrease latency
  – Included data may not be the first
    available
• Appropriate even for non-uniform data
Correlating events across time and space

Leverage knowledge about data distribution
• Temporal completeness
  – Include all data from a
    node or no data at all


• Spatial completeness
  – Each pane contains data
    from all nodes
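The two completeness notions can be written as predicates over a node-by-pane grid of booleans. This is a sketch under assumptions: the grid representation is hypothetical, and spatial completeness is read here as "every pane that is included has data from all nodes", so panes dropped entirely are allowed.

```python
def temporally_complete(board):
    """Every node contributes all of its panes or none of them."""
    return all(all(row) or not any(row) for row in board.values())

def spatially_complete(board):
    """Every included pane has data from all nodes (assumed reading)."""
    columns = list(zip(*board.values()))
    return all(all(col) or not any(col) for col in columns)

# Node n2 is missing entirely: complete in time, incomplete in space.
board_a = {"n1": [True, True, True, True], "n2": [False, False, False, False]}
# The last two panes are missing on every node: complete in space, not in time.
board_b = {"n1": [True, True, False, False], "n2": [True, True, False, False]}

print(temporally_complete(board_a), spatially_complete(board_a))
print(temporally_complete(board_b), spatially_complete(board_b))
```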
Prototype
• Built upon Mortar
  – Sliding windows
  – In-network aggregation trees

• Extended to support:
  – MapReduce API
  – Pane-based processing
  – Fault tolerance mechanisms
Processing data in-situ
• Useful when ...
• Goal: use available resources intelligently

• Load shedding mechanism
  – Nodes monitor local processing rate
  – Shed panes that cannot be processed on
    time
• Increase result fidelity under time and
  resource constraints
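The load shedding rule above can be sketched as a simple admission check: estimate when a pane would finish at the measured processing rate and skip it if that is past its deadline. The function `plan_panes`, the per-pane deadlines, and the assumption that all panes are already available locally are simplifications for illustration, not the prototype's policy.

```python
def plan_panes(panes, rate_per_s, now_s=0.0):
    """Decide which panes to process and which to shed.

    panes: list of (pane_id, num_records, deadline_s) in arrival order.
    rate_per_s: locally measured processing rate in records per second.
    Returns (processed_ids, shed_ids).
    """
    processed, shed = [], []
    clock = now_s
    for pane_id, num_records, deadline_s in panes:
        finish = clock + num_records / rate_per_s
        if finish <= deadline_s:
            processed.append(pane_id)
            clock = finish  # time is only spent on panes that are kept
        else:
            shed.append(pane_id)
    return processed, shed

panes = [("P1", 1000, 30), ("P2", 4000, 60), ("P3", 1000, 90), ("P4", 500, 120)]
processed, shed = plan_panes(panes, rate_per_s=50)
print(processed, shed)
```

Shedding the oversized pane P2 early frees the time needed to finish P3 and P4 by their deadlines, so the result contains more panes than it would if the node had stalled on P2.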
Evaluation
• System scalability
• Usefulness of C2 metric
  – Understanding incomplete results
  – Trading fidelity for latency
• Processing data in-situ
  – Improving fidelity under load with load
    shedding
  – Minimizing impact on services
Scaling
• Synthetic input data, word-count reducer
• 3 reducers provide sufficient processing
  to handle the 30 map tasks
Exploring fidelity-latency trade-offs
Data loss affects accuracy of distribution
• Temporal completeness
• Spatial completeness and random sampling

[Figure: plot annotated “100% accuracy” and “>25% decrease”]

C2 allows trading fidelity for lower latency
In-situ performance
• iMR side-by-side with a real service (Hadoop)
• Vary CPU allocated to iMR
  – Result fidelity
  – Hadoop performance (job throughput)

[Figure: plot annotated “560%” and “<11% overhead”]
Conclusion
• In-situ architecture processes logs at the
  sources, avoids bulk data transfers, and
  reduces analysis time
• Model allows incomplete data under failures
  or server load, provides timely analysis
• C2 metric helps understand incomplete data
  and trade fidelity for latency
• Pro-actively sheds load, improves data fidelity
  under resource and time constraints