2. Common Crawl
Large repository of archival web page data on the Internet
November 2015 crawl has more than 150 terabytes of data
(150,000,000,000,000 bytes, 1.2 billion URLs)
Could it be used to broadly gauge brand awareness and name recognition?
➔ Cons: False positives (e.g., same name, different person)
➔ Pros: Dataset is large and fairly complete
4. Goals
Functional:
● Parse through data, counting websites that mention Donald Trump, Ted
Cruz, Hillary Clinton, Bernie Sanders
Engineering:
● Process the entire corpus as quickly and efficiently as possible
● Learn Scala
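The counting goal can be sketched in plain Scala (the candidate list comes from the goals above; the record text and helper name are made up for illustration — the real pipeline runs this logic per record over Spark RDDs):

```scala
// Candidate names to tally, per the project goals
val candidates = Seq("Donald Trump", "Ted Cruz", "Hillary Clinton", "Bernie Sanders")

// Count how many records (web pages) mention each candidate at least once
def countMentions(records: Seq[String]): Map[String, Int] =
  candidates.map { name =>
    name -> records.count(_.contains(name))
  }.toMap

// Tiny usage example with invented record text
val records = Seq(
  "Donald Trump rallies in Iowa",
  "Bernie Sanders and Hillary Clinton debate",
  "Local sports scores"
)
val tally = countMentions(records)
```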
5. Challenges
● 35,700 zipped text files of modest size on Amazon S3
● Each file on average holds data from 34,000 URIs
● Data from one URI (one record) spans multiple lines
8. Coding challenges
1. Spark prefers to ingest files in which one record spans a single line
➔ sc.textFile(filename)
2. For multi-line records, we must set a custom record delimiter:
➔ config.set("textinputformat.record.delimiter", "WARC-Target-URI: ")
➔ ingestMe = sc.newAPIHadoopFile(filename, classOf[TextInputFormat],
classOf[LongWritable], classOf[Text], config)
The first method allows bulk loading of files; the second is limited to one file at
a time
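The effect of the custom record delimiter can be illustrated without Spark: splitting raw crawl text on "WARC-Target-URI: " yields one chunk per record, each beginning with that record's URI (the sample text below is invented):

```scala
// A made-up fragment of a crawl file: two multi-line records
val raw =
  "WARC-Target-URI: http://example.com/a\nbody line 1\nbody line 2\n" +
  "WARC-Target-URI: http://example.com/b\nanother body\n"

// Split on the same string passed to textinputformat.record.delimiter;
// drop the empty leading chunk before the first delimiter
val recordChunks = raw.split("WARC-Target-URI: ").filter(_.nonEmpty)
// Each chunk now starts with its record's URI, followed by the body lines
```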
9. How much time?
Original estimate: 21 days
Helpful:
✔ Eliminate debug printlns
✔ Limit “filter” and “map” functions
✔✔ Union RDDs (data sets) to trigger distributed computing
Not so helpful:
❗ Pool database calls (must use sparingly)
❌ Multiple spark-submit jobs (held promise but resource intensive; crashed the JVM)
Revised estimate: 18-35 hours
10. Results: How the candidates stack up
Check out which candidate got the most mentions:
http://namecrawler.xyz
12. What else would have boosted speed?
● Amp up cluster computing power
○ Upgrade from m4.large (8GB RAM) to r3.large (15.25GB) or r3.xlarge (30.5GB)
● Concatenate files prior to processing
○ Eliminates having to manually join datasets
○ Pros: Java libraries exist to do so
○ Cons: Must make room for 150 terabytes of files
● Split batch processing into multiple jobs
13. Optimizations: Union data
// Grab a crawl file off Amazon's S3
val hdFile = sc.newAPIHadoopFile(fullCrawlName, classOf[TextInputFormat],
classOf[LongWritable], classOf[Text], localConfig)
// Hold on to the RDD until there are enough for a trio
hdFiles(i - 1) = hdFile
if (i % 3 == 0) { // Act only on batches of three RDDs
  // Union the three RDDs so Spark processes them as one distributed job
  val batched = hdFiles(i - 3).union(hdFiles(i - 2)).union(hdFiles(i - 1))
  // Send the three-wide RDD for saving
  saveCrawlData(crawlFileID, batched)
  // Reset batch counter
  i = 0
}
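The trio-batching above can be expressed more idiomatically with Scala's grouped, avoiding the manual counter. A plain-Scala sketch over a hypothetical list of stand-in values (a real job would reduce each batch of RDDs with union):

```scala
// Stand-ins for the loaded crawl RDDs; in the real job these are Hadoop RDDs
val parts = Seq("rdd1", "rdd2", "rdd3", "rdd4", "rdd5", "rdd6")

// grouped(3) yields batches of three, mirroring the i % 3 == 0 logic;
// each batch would then be collapsed with _.union(_) and saved
val batches = parts.grouped(3).toSeq
```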