2. Motivation
• Big Data is more opaque than small data
– Spreadsheets choke
– BI tools can’t scale
– Small samples often fail to replicate issues
• Engineers, data scientists, analysts need:
– Faster “time to answer” on Big Data
– Rapid “find, quantify, extract”
• Solve “I don’t know what I don’t know”
• This is NOT about looking up items in a product
catalog (i.e. not a consumer search problem)
4. Classic “side system” approach
• Definition of KLUDGE: “a system and
especially a computer system made up of
poorly matched components” –Merriam-Webster
[Diagram: a Hadoop Cluster on one side, a Search system on the other, joined by some unspecified glue ("?????")]
5. Classic “search toolkit”
• Built around fulltext use case
• Inverted Indexes optimized for on-the-fly
ranking of results
– TF-IDF
– Okapi BM-25 (see the sketch after this list)
• Yet never able to fully realize Google-style search capability
• Issues:
– Phrase detection
– Pseudo synonymy
– Open loop architecture
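For reference, a minimal sketch (not from the talk) of the on-the-fly relevance scoring such toolkits are built around; the toy corpus and the k1/b parameter values are illustrative assumptions.

    import math
    from collections import Counter

    # Toy corpus standing in for the fulltext case these toolkits target.
    docs = {
        "d1": "hadoop cluster search index".split(),
        "d2": "search results ranked by relevance".split(),
        "d3": "big data on a hadoop cluster".split(),
    }
    N = len(docs)
    avgdl = sum(len(t) for t in docs.values()) / N
    df = Counter(term for toks in docs.values() for term in set(toks))

    def bm25(query, doc_id, k1=1.5, b=0.75):
        """Okapi BM25 score of one document for a bag-of-words query."""
        toks = docs[doc_id]
        tf = Counter(toks)
        score = 0.0
        for term in query.split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        return score

    # Rank documents for a query on the fly.
    print(sorted(docs, key=lambda d: bm25("hadoop search", d), reverse=True))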
6. Big data ad-hoc query
• Not typically a fulltext “document search” problem
• Data is structured, semi-structured, and denormalized
– Log lines
– JSON records
– CSV files
– Hadoop native formats (SequenceFile)
• Ranking is explicit (ORDER BY), not relevance-based (see the sketch after this list)
• Sometimes “needle in haystack” (support,
debugging)
• Sometimes “haystack in haystack” (summary
analytics, segmentation)
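To make the contrast concrete, a small hedged sketch of the ad-hoc query pattern meant here: scan denormalized records, filter, and rank explicitly by a field rather than by a relevance score. The file name and field names are invented.

    import json

    def slow_requests(path, min_ms=500, limit=10):
        """Needle-in-haystack ad-hoc query over newline-delimited JSON records."""
        rows = []
        with open(path) as f:
            for line in f:
                rec = json.loads(line)
                if rec.get("latency_ms", 0) >= min_ms:
                    rows.append(rec)
        # Explicit ORDER BY latency_ms DESC -- no TF-IDF / BM25 involved.
        rows.sort(key=lambda r: r["latency_ms"], reverse=True)
        return rows[:limit]

    print(slow_requests("requests.json"))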
8. Finer points of Dremel architecture
• MapReduce friendly
• In-Situ approach is DFS friendly
• Excels at aggregation. Not so much for needle-in-
haystack.
• Column storage format accelerates MapReduce
(less extraneous data pushed through)
• But in some regards still a “side system”
• Applications must explicitly store their data in a
columnar format
• “massive” is both a benefit and a hazard
– Complex (operationally and WRT query execution)
– Queries can execute quickly…on huge clusters
9. Crawled In-Situ Index Architecture
[Diagram: the Application writes data to HDFS as usual; a MapReduce crawl job then builds an in-situ SimpleSearch index next to the data inside Hadoop]
10. Benefits to crawled In-Situ index
• No changes to application data format
– CSV
– JSON
– SequenceFile
• Clear “separation of concerns” between data
and index
• Indexes become “disposable”: easily built,
easily thrown away
• There is no “side system” that needs to be
maintained
• Use the MapReduce “hammer” to pound a nail (sketched below)
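A rough sketch of what swinging that hammer could look like, written as a Hadoop Streaming style mapper in Python; the posting format, tokenization, and offset tracking are illustrative assumptions, not the actual indexer described in the talk.

    import os
    import sys

    # Hadoop Streaming style mapper: crawl CSV records where they already
    # live and emit "term \t file,offset" postings; the records themselves
    # are never copied into a separate search system.
    input_file = os.environ.get("mapreduce_map_input_file", "unknown")  # name varies by Hadoop version

    offset = 0                      # byte offset within this mapper's input split
    for line in sys.stdin:
        for field in line.rstrip("\n").split(","):
            token = field.strip().lower()
            if token:
                print("%s\t%s,%d" % (token, input_file, offset))
        offset += len(line)

A companion reducer would simply concatenate the posting lists per term; the resulting index sits next to the data in HDFS and, being cheap to rebuild, can be thrown away at will.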
11. Architect for Elasticity
[Diagram: application data lives in AWS S3; an Elastic MapReduce crawl on EC2 (m1.large instances) builds the index, reading S3 over HTTP via JetS3t]
Interesting: you don’t actually need to have Hadoop installed…
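A minimal sketch of why no Hadoop install is needed on the query side: the in-situ index is just an object reachable over HTTP. The sketch uses boto3 rather than the JetS3t library shown on the slide, and the bucket and key names are invented.

    import boto3

    # The crawl wrote the index as plain objects in S3; any HTTP client can
    # pull them back -- no Hadoop installation required on this side.
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="example-crawl-bucket", Key="indexes/part-00000")

    index = {}
    for line in obj["Body"].read().decode("utf-8").splitlines():
        term, postings = line.split("\t", 1)
        index.setdefault(term, []).append(postings)

    print(index.get("athens", []))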
12. Declarative Crawl Indexing
{
  "filter": "column[4]==\"athens\""
}
[Diagram: the Application writes data plus a parse.json instruction file to HDFS; the MapReduce crawl reads parse.json and builds the in-situ SimpleSearch index]
• Indexer reads declarative instructions from in-situ file
• “pull” vs. traditional “push” indexing approach
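A hypothetical, fleshed-out version of that instruction file: only the "filter" field appears on the slide, the other fields are invented to illustrate the pull model.

    import json

    # Hypothetical parse.json written by the application next to its data;
    # only "filter" comes from the slide, the rest is made up.
    instructions = {
        "input": "/data/requests/*.csv",
        "format": "csv",
        "filter": 'column[4]=="athens"',
        "index_columns": [0, 4],
    }
    with open("parse.json", "w") as f:
        json.dump(instructions, f, indent=2)

    # The indexer later "pulls" its configuration from the in-situ file,
    # rather than having the application "push" documents at it.
    with open("parse.json") as f:
        spec = json.load(f)
    print(spec["filter"])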
13. Thin index
[Diagram: a MapReduce crawl over data in HDFS produces a small in-situ index; the data itself stays where it is and is not copied into the index]
• Index size is small because data is a holistic
part of the system
– Data does not need to be “put into” the search system and replicated in the index
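One way to read “thin”: postings hold only coordinates back into the data (file, byte offset, length), never a copy of the record. A minimal sketch with a made-up posting layout and a hypothetical CSV file:

    from collections import defaultdict

    def build_index(path):
        """Thin in-situ index: each posting is a (file, offset, length) pointer."""
        index = defaultdict(list)
        offset = 0
        with open(path, "rb") as f:
            for line in f:
                for token in line.decode("utf-8").lower().split(","):
                    token = token.strip()
                    if token:
                        index[token].append((path, offset, len(line)))
                offset += len(line)
        return index

    def fetch(posting):
        """Follow a posting back to the untouched source record."""
        path, offset, length = posting
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    idx = build_index("events.csv")              # hypothetical data file
    for posting in idx.get("athens", []):
        print(fetch(posting))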
14. Lazy data loading
[Diagram: at query time the execution runtime lazily pulls both data blocks and index blocks from HDFS through an LRU cache]
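A sketch of the lazy-pull idea under stated assumptions: a block of data or index is fetched from the DFS only the first time the execution runtime touches it, and recently used blocks are kept in an LRU cache.

    from functools import lru_cache

    @lru_cache(maxsize=64)
    def pull_block(path, block_no, block_size=4 * 1024 * 1024):
        """Lazily pull one block of data or index; cached after first use."""
        print("fetching %s block %d" % (path, block_no))   # happens once per block
        with open(path, "rb") as f:                        # stand-in for a DFS read
            f.seek(block_no * block_size)
            return f.read(block_size)

    # First call pulls the block; the repeat is served from the LRU cache.
    first = pull_block("events.csv", 0)
    again = pull_block("events.csv", 0)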