SPARKLER
Nov 15th, 2016 @ Apache Big Data EU 2016, Seville, Spain
Thamme Gowda, Karanjeet Singh, Chris Mattmann
Information Retrieval and Data Science
ABOUT: USC INFORMATION RETRIEVAL AND DATA SCIENCE GROUP
● Established in August 2012 at the University of Southern California (USC)
● Dr. Chris Mattmann, Director of IRDS and our Advisor
● Funding from NSF, DARPA, NASA, DHS, private industry and other agencies
○ In collaboration with NASA JPL
● 3 postdocs, 30+ Master's and PhD students, and 20+ JPLers over the past 7 years
● Recent topical research in the DARPA XDATA/MEMEX program
Email : irds-L@mymaillists.usc.edu
Website : http://irds.usc.edu/
GitHub : https://github.com/USCDataScience/
ABOUT: US
Karanjeet Singh
Graduate Student at the University of Southern California, USA
Research Interest: Information Retrieval & Natural Language Processing
Research Affiliate at NASA Jet Propulsion Laboratory
Committer and PMC member of Apache Nutch
Thamme Gowda
Graduate Student at the University of Southern California, USA
Research Intern at NASA Jet Propulsion Laboratory, Co-Founder at Datoin
Research Interest: NLP, Machine Learning and Information Retrieval
Committer and PMC member of Apache Nutch, Tika, and Joshua (Incubating)
Dr. Chris Mattmann
Director & Vice Chairman, Apache Software Foundation
Research Interest: Data Science, Open Source, Information Retrieval & NLP
Committer and PMC member of Apache Nutch, Tika, (former) Lucene, OODT, Incubator
OVERVIEW
● About Sparkler
● Motivations for building Sparkler
● Quick intro to Apache Spark
● Sparkler technology stack, internals
● Features of Sparkler
● Comparison with Nutch
● Going forward
ABOUT: SPARKLER
● New Open Source Web Crawler
○ A bot program that can fetch resources from the web
● Name: Spark Crawler
● Inspired by Apache Nutch
● Like Nutch: Distributed crawler that can scale horizontally
● Unlike Nutch: Runs on top of Apache Spark
● Easy to deploy and easy to use
MOTIVATION #1
● Challenges in DARPA MEMEX
○ Intro: the MEMEX system has crawlers that fetch deep and dark web data to assist law enforcement agencies
○ Crawls are something of a black box; we wanted real-time progress reports
● Dr. Chris Mattmann had been considering an upgrade for 3 years
● Technology upgrade needed
WHY A NEW CRAWLER?
Modern Hadoop cluster has no Hadoop (Map-Reduce) left in it!
https://twitter.com/cutting/status/796566255830503424
MOTIVATION #2
● Challenges at DATOIN
○ Intro: Datoin is a distributed text analytics platform
○ Late 2014: migrated the infrastructure from Hadoop MapReduce to Apache Spark
○ But the crawler component (powered by Apache Nutch) was left behind
● Met Dr. Chris Mattmann at USC in the Web Search Engines class
○ Asked for his thoughts on running Nutch on Spark
○ Agreed to work on it
KEY FEATURES
● High performance & fault tolerance
● Real-time crawl analysis
● Easy to customize
[Illustration captions: "Is the food ready?", "How is it going?", "I want less salt."]
APACHE SPARK: OVERVIEW
● Introduction
● Resilient Distributed Dataset (RDD)
● Driver, Workers & Executors
APACHE SPARK: INTRODUCTION
● Fast and general engine for large scale data processing
● Started at UC Berkeley in 2009
● One of the most popular distributed computing frameworks
● Provides high level APIs in Scala, Java, Python, R
● Integration with Hadoop and its ecosystem
● Open sourced in 2010; licensed under the Apache License 2.0
● Mattmann helped to bring Spark to Apache under the DARPA XDATA effort
Resilient Distributed Dataset (RDD)
● A basic abstraction in Spark
● Immutable, partitioned collection of elements that can be operated on in parallel
● Data in persistent store (HDFS, Cassandra) or in cache (memory, disk)
● Partitions are recomputed on failure or cache eviction
● Two classes of operations
○ Transformations
○ Actions
● Custom RDDs can also be implemented - we have one!
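The split between transformations and actions can be illustrated with a small toy class. This is only a sketch in plain Python, not the Spark API: `ToyRDD`, `square`, and the sample numbers are hypothetical names chosen for illustration. It assumes that transformations merely record lineage and that each action recomputes partitions from that lineage.

```python
class ToyRDD:
    """Toy stand-in for an RDD: fixed partitions plus a lazily applied lineage."""

    def __init__(self, partitions, lineage=iter):
        self._partitions = tuple(tuple(p) for p in partitions)  # immutable source data
        self._lineage = lineage  # function: partition -> iterator of results

    def _derive(self, step):
        parent = self._lineage
        return ToyRDD(self._partitions, lambda part: step(parent(part)))

    # Transformations: describe work, return a new ToyRDD, compute nothing.
    def map(self, fn):
        return self._derive(lambda items: (fn(x) for x in items))

    def filter(self, keep):
        return self._derive(lambda items: (x for x in items if keep(x)))

    # Actions: run the lineage over every partition and return a result.
    def collect(self):
        return [x for part in self._partitions for x in self._lineage(part)]

    def count(self):
        return sum(1 for part in self._partitions for _ in self._lineage(part))


calls = []


def square(x):
    calls.append(x)  # record when the work actually happens
    return x * x


numbers = ToyRDD([[1, 2], [3, 4]])
big_squares = numbers.map(square).filter(lambda x: x > 4)
calls_before_action = list(calls)  # still empty: transformations are lazy
result = big_squares.collect()     # the action triggers the computation
recount = big_squares.count()      # a second action recomputes from lineage
```

Nothing is cached in this toy, so every action replays the lineage, which mirrors how a partition can be rebuilt after a failure or cache eviction.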
Driver, Workers & Executors
* Photo credit - spark.apache.org
SPARKLER: TECH STACK
● Batch crawling (similar to Apache Nutch)
● Apache Solr as crawl database
● Multi module Maven project with OSGi bundles
● Stream crawled content through Apache Kafka
● Parses everything using Apache Tika
● Crawl visualization - Banana
SPARKLER: INTERNALS & WORKFLOW
SPARKLER: FEATURES
SPARKLER #1: Lucene/Solr powered Crawldb
● Crawldb needed indexing
○ For real-time analytics
○ For instant visualizations
● This is an internal data structure of Sparkler
○ Exposed over a REST API
○ Used by sparkler-ui, the web application
● We chose Apache Solr
● Standalone Solr server or SolrCloud?
● Glued the crawldb and Spark together using CrawldbRDD
SPARKLER #2: Partitioning by host
● Politeness
○ Doesn't hit the same server too many times in distributed mode
● First version
○ Group by: host name
○ Sort by: depth, score
● Customization is easy
○ Write your own Solr query
○ Take advantage of boosting to alter the ranking
● Partitions the dataset based on the above criteria
● Lazy evaluation and delay between requests
○ Performs parsing instead of waiting
○ Inserts a delay only when necessary
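The grouping and delay ideas above can be sketched in plain Python. This is not Sparkler's code: `partition_by_host`, `crawl_partition`, the record fields, and the `FakeClock` helper are hypothetical, and the sort direction (shallow depth first, higher score first) is an assumption. The point it tries to show is that time spent parsing counts toward the politeness delay, so a sleep is inserted only for whatever part of the delay remains.

```python
import time
from collections import defaultdict
from urllib.parse import urlparse


def partition_by_host(records):
    """Group records by host; order each group by depth, then by score (highest first)."""
    groups = defaultdict(list)
    for rec in records:
        groups[urlparse(rec["url"]).hostname].append(rec)
    for recs in groups.values():
        recs.sort(key=lambda r: (r["depth"], -r["score"]))
    return dict(groups)


def crawl_partition(urls, fetch, parse, delay_sec=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Lazily fetch one host's URLs, sleeping only for the unspent part of the delay."""
    last_fetch = None
    for url in urls:
        if last_fetch is not None:
            remaining = delay_sec - (clock() - last_fetch)
            if remaining > 0:
                sleep(remaining)
        content = fetch(url)
        last_fetch = clock()
        yield parse(content)  # parsing time is deducted from the next wait


records = [
    {"url": "http://a.example/x", "depth": 1, "score": 0.2},
    {"url": "http://b.example/", "depth": 0, "score": 1.0},
    {"url": "http://a.example/", "depth": 0, "score": 1.0},
    {"url": "http://a.example/y", "depth": 1, "score": 0.9},
]
partitions = partition_by_host(records)
a_urls = [r["url"] for r in partitions["a.example"]]


class FakeClock:
    """Hypothetical clock so the demo runs instantly and deterministically."""

    def __init__(self):
        self.now = 0.0
        self.sleeps = []

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.sleeps.append(seconds)
        self.now += seconds


fake = FakeClock()
parse_cost = {"http://a.example/": 0.5, "http://a.example/y": 2.0, "http://a.example/x": 0.0}


def fake_parse(url):
    fake.now += parse_cost[url]  # pretend parsing took this long
    return url.upper()


parsed = list(crawl_partition(a_urls, fetch=lambda u: u, parse=fake_parse,
                              clock=fake.time, sleep=fake.sleep))
```

In the demo, the first parse takes half the delay, so only the other half is slept; the second parse takes longer than the delay, so the third fetch proceeds without any sleep.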
SPARKLER #3: OSGI Plugins
● Plugin interfaces are inspired by Nutch
● Plugins are developed per the OSGi (Open Services Gateway initiative) specification
● We chose the Apache Felix implementation of OSGi
● Migrated a plugin from Nutch
○ Regex URL Filter Plugin → the most used plugin in Nutch
● Added a JavaScript plugin (described in the next slide)
● //TODO: Migrate more plugins from Nutch
○ Mavenize Nutch [NUTCH-2293]
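A regex URL filter of this kind can be sketched as follows. This is a plain-Python illustration loosely modeled on the `+`/`-` rule-file style used by Nutch's regex URL filter, not the plugin itself: `load_rules`, `accept`, and the rules shown are hypothetical, and the semantics (first matching rule decides, no match means reject) are an assumption.

```python
import re

# Hypothetical rule file: '+' accepts, '-' rejects, '#' starts a comment.
RULES_TEXT = r"""
# Skip common static-asset suffixes.
-\.(gif|jpg|png|css|js|zip)$
# Accept pages on example.org and its subdomains.
+^https?://([a-z0-9-]+\.)*example\.org/
"""


def load_rules(text):
    """Parse rule lines into (allow, compiled_pattern) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accept(url, rules):
    """Return the verdict of the first rule that matches; reject if none match."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


rules = load_rules(RULES_TEXT)
verdicts = {
    url: accept(url, rules)
    for url in (
        "http://example.org/index.html",
        "https://www.example.org/a",
        "http://example.org/logo.png",
        "http://other.com/",
    )
}
```

Because the first matching rule wins, the order of the rules matters: the suffix rule rejects the image even though it is on an accepted host.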
SPARKLER #4: JavaScript Rendering
● JavaScript execution* has first-class support
● Distributable on a Spark cluster without pain
○ Pure JVM-based JavaScript engine
● This is an implementation of FetchFunction
● FetchFunction
○ Stream<URL> → Stream<Content>
○ Note: URLs are grouped by host
○ It preserves cookies and reuses sessions for each iteration
* JBrowserDriver by MachinePublishers
Thanks to: Madhav Sharan, member of USC IRDS
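The Stream<URL> → Stream<Content> shape can be sketched as a generator that holds one session for a whole group of URLs. This is an illustration in plain Python, not Sparkler's interface: `Content`, `Session`, `fetch_function`, and `DemoSession` are hypothetical names, and the in-memory session only imitates cookie handling so the example runs without network access.

```python
from typing import Callable, Iterable, Iterator, NamedTuple, Protocol


class Content(NamedTuple):
    url: str
    status: int
    body: bytes


class Session(Protocol):
    def fetch(self, url: str) -> Content: ...
    def close(self) -> None: ...


def fetch_function(urls: Iterable[str],
                   new_session: Callable[[], Session]) -> Iterator[Content]:
    """Lazily map a stream of URLs to a stream of Content using one shared session."""
    session = new_session()  # cookies and connections live for the whole group
    try:
        for url in urls:
            yield session.fetch(url)
    finally:
        session.close()


class DemoSession:
    """Hypothetical in-memory session that remembers a cookie after the first request."""

    opened = 0

    def __init__(self):
        DemoSession.opened += 1
        self.cookies = {}

    def fetch(self, url):
        had_cookie = "sid" in self.cookies
        self.cookies.setdefault("sid", "demo")
        body = b"returning visitor" if had_cookie else b"first visit"
        return Content(url, 200, body)

    def close(self):
        self.cookies.clear()


host_group = ["http://a.example/", "http://a.example/x", "http://a.example/y"]
pages = list(fetch_function(host_group, DemoSession))
bodies = [page.body for page in pages]
```

Since the URLs arrive grouped by host, a single session per group is enough for later requests to see the cookies set by earlier ones.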
SPARKLER #5: Output in Kafka Streams
● The crawler is sometimes an input for applications that do deeper analysis
○ Can't fit all of that deeper analysis into the crawler
● Integration with such applications is made easy via queues
● We chose Apache Kafka
○ Suits our needs
■ Distributable, scalable, fault tolerant
● FIXME: Larger messages such as videos
● This is optional; the default output goes to a shared file system (such as HDFS) and is compatible with Nutch
Thanks to: Rahul Palamuttam (MS CS @ Stanford University; Intern @ NASA JPL)
SPARKLER #6: Tika, the universal parser
● Apache Tika
○ Is a toolkit of parsers
○ Detects and extracts metadata, text, and URLs
○ Supports over a thousand different file types
● Main application is to discover outgoing links
● The default implementation for our ParseFunction
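Outgoing-link discovery can be sketched without Tika. The example below swaps in Python's standard `html.parser` as a stand-in, so it handles only HTML anchors and none of Tika's other formats; `LinkCollector` and `parse_function` are hypothetical names, not Sparkler or Tika APIs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href targets of <a> tags, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def parse_function(base_url, html):
    """Return the absolute outgoing links found in one fetched HTML page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


page = '<a href="b.html">b</a> <a href="/c">c</a> <a href="http://other.org/">o</a>'
outlinks = parse_function("http://example.org/a/", page)
```

The resolved links are what a crawler would feed back into its crawl database as candidates for the next iteration.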
SPARKLER #7: Visual Analytics
● Charts and graphs provide a nice summary of the crawl job
● Real-time analytics
● Example:
○ Distribution of URLs across hosts/domains
○ Temporal activities
○ Status reports
● Customizable in real time
● Uses the Banana dashboard from Lucidworks
● Sparkler has a subcomponent named sparkler-ui
Thanks to: Manish Dwibedy, MS CS, University of Southern California
SPARKLER #Next: what’s coming?
● Interactive UI
● More plugins
● Scoring Crawled Pages
● Focussed Crawling
● Crawl Graph Analysis
● Domain Discovery (another research challenge)
● Other useful plugins from Nutch
● Detailed documentation and tutorials on wiki
HOW FAST IT RUNS - Comparison with Nutch
● Crawl Iterations : 5
● Fetch Delay : 1 sec
Nutch Configuration
● Version : 1.12
● topN : 50,000
● Fetcher Thread : 1
Hadoop Configuration
● Version : 2.6.0-cdh5.8.2
● Slaves : 2
● Memory : 8G (Map), 16G (Reduce)
● 22 Mappers, 11 Reducers
Sparkler Configuration
● Version : 0.1-SNAPSHOT
● topGroups : 252
● topN : 1000
Spark Configuration
● Version : 1.6.1 with Scala v2.11
● Slaves : 2
● 22 Worker Instances with 210G memory
DIVERSIFIED - Comparison with Nutch
Sparkler Dashboard
SPARKLER IS COMING TO APACHE
Look for the proposal later this week!
● Get involved in our journey through the Incubator
● Get started: check out the README and wiki at https://github.com/USCDataScience/sparkler
Questions?
THANK YOU

Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 

Kürzlich hochgeladen (20)

PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
How To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTROHow To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTRO
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identity
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 

Sparkler - Spark Crawler

  • 1. SPARKLER
      Nov 15th 2016 @ Apache Big Data EU 2016, Seville, Spain
      Thamme Gowda, Karanjeet Singh, Chris Mattmann
      Information Retrieval and Data Science
  • 2. ABOUT: USC INFORMATION RETRIEVAL AND DATA SCIENCE GROUP
      ● Established in August 2012 at the University of Southern California (USC)
      ● Dr. Chris Mattmann, Director of IRDS and our Advisor
      ● Funding from NSF, DARPA, NASA, DHS, private industry and other agencies, in collaboration with NASA JPL
      ● 3 Postdocs, 30+ Masters and PhD students, and 20+ JPLers in the past 7 years
      ● Recent topical research in the DARPA XDATA/MEMEX program
      Email: irds-L@mymaillists.usc.edu
      Website: http://irds.usc.edu/
      GitHub: https://github.com/USCDataScience/
  • 3. ABOUT: US
      ● Karanjeet Singh
          ○ Graduate Student at the University of Southern California, USA
          ○ Research Interest: Information Retrieval & Natural Language Processing
          ○ Research Affiliate at NASA Jet Propulsion Laboratory
          ○ Committer and PMC member of Apache Nutch
      ● Thamme Gowda
          ○ Graduate Student at the University of Southern California, USA
          ○ Research Intern at NASA Jet Propulsion Laboratory, Co-Founder at Datoin
          ○ Research Interest: NLP, Machine Learning and Information Retrieval
          ○ Committer and PMC member of Apache Nutch, Tika, and Joshua (Incubating)
      ● Dr. Chris Mattmann
          ○ Director & Vice Chairman, Apache Software Foundation
          ○ Research Interest: Data Science, Open Source, Information Retrieval & NLP
          ○ Committer and PMC member of Apache Nutch, Tika, (former) Lucene, OODT, Incubator
  • 4. OVERVIEW
      ● About Sparkler
      ● Motivations for building Sparkler
      ● Quick intro to Apache Spark
      ● Sparkler technology stack, internals
      ● Features of Sparkler
      ● Comparison with Nutch
      ● Going forward
  • 5. ABOUT: SPARKLER
      ● New Open Source Web Crawler
          ○ A bot program that can fetch resources from the web
      ● Name: Spark Crawler
      ● Inspired by Apache Nutch
      ● Like Nutch: Distributed crawler that can scale horizontally
      ● Unlike Nutch: Runs on top of Apache Spark
      ● Easy to deploy and easy to use
  • 6. MOTIVATION #1
      ● Challenges in DARPA MEMEX
          ○ Intro: The MEMEX system has crawlers that fetch deep and dark web data to assist law enforcement agencies
          ○ Crawls are something of a black box; we wanted real-time progress reports
      ● Dr. Chris Mattmann had been considering an upgrade for 3 years
      ● Technology upgrade needed
  • 7. WHY A NEW CRAWLER?
      ● "Modern Hadoop cluster has no Hadoop (Map-Reduce) left in it!"
      ● Source: https://twitter.com/cutting/status/796566255830503424
  • 8. MOTIVATION #2
      ● Challenges at DATOIN
          ○ Intro: Datoin is a distributed text analytics platform
          ○ Late 2014: migrated the infrastructure from Hadoop MapReduce to Apache Spark
          ○ But the crawler component (powered by Apache Nutch) was left behind
      ● Met Dr. Chris Mattmann at USC in the Web Search Engines class
          ○ Asked for his thoughts on running Nutch on Spark
          ○ Agreed to work on it
  • 9. KEY FEATURES
      ● High performance & Fault tolerance
      ● Real time crawl analysis
      ● Easy to customize
      Captions on the slide: "Is the food ready?", "How is it going?", "I want less salt."
  • 10. APACHE SPARK: OVERVIEW
      ● Introduction
      ● Resilient Distributed Dataset (RDD)
      ● Driver, Workers & Executors
  • 11. APACHE SPARK: INTRODUCTION
      ● Fast and general engine for large scale data processing
      ● Started at UC Berkeley in 2009
      ● The most popular distributed computing framework
      ● Provides high level APIs in Scala, Java, Python, R
      ● Integration with Hadoop and its ecosystem
      ● Open sourced in 2010 under Apache v2.0 license
      ● Mattmann helped to bring Spark to Apache under the DARPA XDATA effort
  • 12. Resilient Distributed Dataset (RDD)
      ● A basic abstraction in Spark
      ● Immutable, partitioned collection of elements operated on in parallel
      ● Data in persistent store (HDFS, Cassandra) or in cache (memory, disk)
      ● Partitions are recomputed on failure or cache eviction
      ● Two classes of operations
          ○ Transformations
          ○ Actions
      ● Custom RDDs can also be implemented - we have one!
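The split between transformations and actions can be illustrated with a toy, single-machine sketch. This is not the Spark API: `LazyDataset`, `fetch_length` and the example URLs are hypothetical names invented here, and the sketch only mimics two RDD properties from the slide, namely that transformations are lazy and that results are recomputed from the source when nothing is cached.

```python
from typing import Callable, Iterable, List


class LazyDataset:
    """Toy stand-in for an RDD: immutable, and evaluated only by actions."""

    def __init__(self, source: Callable[[], Iterable]):
        # `source` is re-invoked on every action, mimicking recomputation
        # from lineage when nothing is cached.
        self._source = source

    # Transformations: return a new dataset and compute nothing yet.
    def map(self, fn: Callable) -> "LazyDataset":
        return LazyDataset(lambda: (fn(x) for x in self._source()))

    def filter(self, pred: Callable) -> "LazyDataset":
        return LazyDataset(lambda: (x for x in self._source() if pred(x)))

    # Actions: trigger evaluation and return a concrete result.
    def collect(self) -> List:
        return list(self._source())

    def count(self) -> int:
        return sum(1 for _ in self._source())


calls: List[str] = []


def fetch_length(url: str) -> int:
    calls.append(url)  # record when work actually happens
    return len(url)


urls = LazyDataset(lambda: ["http://a.example/", "http://b.example/page"])
lengths = urls.map(fetch_length).filter(lambda n: n > 17)

work_before_action = len(calls)  # transformations alone did no work
result = lengths.collect()       # the action runs the whole pipeline
```

Calling `lengths.count()` afterwards would run `fetch_length` again for every element, which is the behaviour a cache is meant to avoid.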
  • 13. Driver, Workers & Executors
      [Figure: Spark cluster architecture. Photo credit: spark.apache.org]
  • 14. SPARKLER: TECH STACK
      ● Batch crawling (similar to Apache Nutch)
      ● Apache Solr as crawl database
      ● Multi module Maven project with OSGi bundles
      ● Stream crawled content through Apache Kafka
      ● Parses everything using Apache Tika
      ● Crawl visualization: Banana
  • 15. SPARKLER: INTERNALS & WORKFLOW
  • 17. SPARKLER #1: Lucene/Solr powered Crawldb
      ● The crawldb needed indexing
          ○ For real time analytics
          ○ For instant visualizations
      ● The crawldb is an internal data structure of Sparkler
          ○ Exposed over REST API
          ○ Used by sparkler-ui, the web application
      ● We chose Apache Solr
      ● Standalone Solr Server or Solr Cloud?
      ● Glued the crawldb and Spark together using CrawldbRDD
  • 18. SPARKLER #2: Partitioning by host
      ● Politeness: doesn't hit the same server too many times in distributed mode
      ● First version
          ○ Group by: host name
          ○ Sort by: depth, score
      ● Customization is easy
          ○ Write your own Solr query
          ○ Take advantage of boosting to alter the ranking
      ● Partitions the dataset based on the above criteria
      ● Lazy evaluation and delay between requests
          ■ Performs parsing instead of waiting
          ■ Inserts delay only when it is necessary
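The grouping and sorting described on this slide can be sketched in plain Python. This is an illustration under stated assumptions, not Sparkler's implementation (which issues a Solr query): `CrawlEntry` and `partition_by_host` are hypothetical names, the `top_groups` and `top_n` parameters borrow the `topGroups` and `topN` setting names from the comparison slide, and the sort direction (shallower depth first, then higher score) as well as the alphabetical choice of hosts are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List
from urllib.parse import urlparse


@dataclass(frozen=True)
class CrawlEntry:
    url: str
    depth: int
    score: float


def partition_by_host(
    entries: Iterable[CrawlEntry], top_groups: int, top_n: int
) -> Dict[str, List[CrawlEntry]]:
    """Group entries by host name, then rank each group by depth and score."""
    groups: Dict[str, List[CrawlEntry]] = defaultdict(list)
    for entry in entries:
        groups[urlparse(entry.url).hostname or ""].append(entry)

    selected: Dict[str, List[CrawlEntry]] = {}
    # Assumption: hosts are taken in alphabetical order to keep the sketch
    # deterministic; a real crawler would rank the groups some other way.
    for host in sorted(groups)[:top_groups]:
        ranked = sorted(groups[host], key=lambda e: (e.depth, -e.score))
        selected[host] = ranked[:top_n]
    return selected


entries = [
    CrawlEntry("http://a.example/deep", depth=2, score=0.9),
    CrawlEntry("http://a.example/", depth=0, score=0.1),
    CrawlEntry("http://a.example/about", depth=1, score=0.5),
    CrawlEntry("http://b.example/", depth=0, score=0.3),
    CrawlEntry("http://c.example/", depth=0, score=0.7),
]
partitions = partition_by_host(entries, top_groups=2, top_n=2)
```

Because each partition holds URLs for a single host, one worker can space out its requests to that host without coordinating with the others.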
  • 19. SPARKLER #3: OSGi Plugins
      ● Plugin interfaces are inspired by Nutch
      ● Plugins are developed as per the Open Services Gateway initiative (OSGi) specification
      ● We chose the Apache Felix implementation of OSGi
      ● Migrated a plugin from Nutch
          ○ Regex URL Filter Plugin → the most used plugin in Nutch
      ● Added a JavaScript plugin (described in the next slide)
      ● //TODO: Migrate more plugins from Nutch
          ○ Mavenize Nutch [NUTCH-2293]
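A regex URL filter of the kind mentioned above can be sketched as follows. This is not the plugin's code: `parse_rules`, `filter_url` and the sample rules are hypothetical. The rule convention follows Nutch's regex URL filter files, where each rule is `+` or `-` followed by a regular expression, the first matching rule decides, and a URL that matches no rule is rejected.

```python
import re
from typing import List, Optional, Tuple

Rule = Tuple[bool, re.Pattern]


def parse_rules(text: str) -> List[Rule]:
    """Parse '+regex' (accept) and '-regex' (reject) lines, skipping comments."""
    rules: List[Rule] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def filter_url(url: str, rules: List[Rule]) -> Optional[str]:
    """Return the URL if accepted, or None if rejected or unmatched."""
    for accept, pattern in rules:
        if pattern.search(url):  # first matching rule decides
            return url if accept else None
    return None


# Hypothetical rule set for illustration only.
rules = parse_rules(r"""
# skip common image suffixes
-\.(gif|jpg|png)$
# accept anything under example.org
+^https?://([a-z0-9-]+\.)*example\.org/
""")

accepted = filter_url("http://www.example.org/index.html", rules)
rejected_image = filter_url("http://www.example.org/logo.png", rules)
rejected_other = filter_url("http://other.test/", rules)
```

Rule order matters: placing the image rule first means an image on an accepted host is still rejected.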
  • 20. SPARKLER #4: JavaScript Rendering
      ● JavaScript execution* has first class support
      ● Distributable on a Spark cluster without pain
          ○ Pure JVM based JavaScript engine
      ● This is an implementation of FetchFunction
      ● FetchFunction
          ○ Stream<URL> → Stream<Content>
          ○ Note: URLs are grouped by host
          ○ It preserves cookies and reuses sessions for each iteration
      * JBrowserDriver by MachinePublishers
      Thanks to: Madhav Sharan, Member of USC IRDS
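The `Stream<URL> → Stream<Content>` shape can be sketched with a Python generator. Sparkler's actual FetchFunction is JVM code and its signature is not shown on the slide, so everything here is an assumption for illustration: `Session`, `Content`, `fetch_function` and `fake_fetch` are hypothetical names, and the fetcher is injected so the sketch runs without network access.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator


@dataclass
class Session:
    """Hypothetical per-host session state shared across requests."""
    cookies: Dict[str, str] = field(default_factory=dict)


@dataclass
class Content:
    url: str
    body: str


def fetch_function(
    urls: Iterable[str], fetch_one: Callable[[str, Session], str]
) -> Iterator[Content]:
    """Lazily map a stream of URLs (one host) to a stream of contents."""
    session = Session()  # one session reused for the whole stream
    for url in urls:
        yield Content(url, fetch_one(url, session))


def fake_fetch(url: str, session: Session) -> str:
    # Stand-in for a real HTTP or browser fetch: the first call sets a
    # cookie, and later calls can observe that it was preserved.
    seen = "sid" in session.cookies
    session.cookies.setdefault("sid", "abc123")
    return f"{url} (cookie reused: {seen})"


contents = list(
    fetch_function(["http://a.example/", "http://a.example/next"], fake_fetch)
)
```

Because the function is a generator, nothing is fetched until the consumer asks for the next item, which leaves room for parsing or politeness delays between requests.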
  • 21. SPARKLER #5: Output in Kafka Streams
      ● The crawler is sometimes an input to applications that do deeper analysis
          ○ Can't fit all of that deeper analysis into the crawler
      ● Integration with such applications is made easy via queues
      ● We chose Apache Kafka
          ○ Suits our need
              ■ Distributable, Scalable, Fault Tolerant
      ● FIXME: Larger messages such as videos
      ● This is optional; the default output is on a shared file system (such as HDFS), compatible with Nutch
      Thanks to: Rahul Palamuttam (MS CS @ Stanford University; Intern @ NASA JPL)
  • 22. SPARKLER #6: Tika, the universal parser
      ● Apache Tika
          ○ Is a toolkit of parsers
          ○ Detects and extracts metadata, text, and URLs
          ○ Over a thousand different file types
      ● Main application is to discover outgoing links
      ● The default implementation for our ParseFunction
  • 23. SPARKLER #7: Visual Analytics
      ● Charts and graphs provide a nice summary of the crawl job
      ● Real time analytics
      ● Example:
          ○ Distribution of URLs across hosts/domains
          ○ Temporal activities
          ○ Status reports
      ● Customizable in real time
      ● Using the Banana Dashboard from Lucidworks
      ● Sparkler has a subcomponent named sparkler-ui
      Thanks to: Manish Dwibedy (MS CS, University of Southern California)
  • 24. SPARKLER #Next: what's coming?
      ● Interactive UI
      ● More plugins
      ● Scoring crawled pages
      ● Focused crawling
      ● Crawl graph analysis
      ● Domain discovery (another research challenge)
      ● Other useful plugins from Nutch
      ● Detailed documentation and tutorials on wiki
  • 25. HOW FAST IT RUNS - Comparison with Nutch
      ● Crawl Iterations: 5; Fetch Delay: 1 sec
      ● Nutch Configuration
          ○ Version: 1.12
          ○ topN: 50,000
          ○ Fetcher Thread: 1
      ● Hadoop Configuration
          ○ Version: 2.6.0-cdh5.8.2
          ○ Slaves: 2
          ○ Memory: 8G (Map), 16G (Reduce)
          ○ 22 Mappers, 11 Reducers
      ● Sparkler Configuration
          ○ Version: 0.1-SNAPSHOT
          ○ topGroups: 252
          ○ topN: 1000
      ● Spark Configuration
          ○ Version: 1.6.1 with Scala v2.11
          ○ Slaves: 2
          ○ 22 Worker Instances with 210G memory
  • 26. DIVERSIFIED - Comparison with Nutch
  • 27. Sparkler Dashboard
  • 28. SPARKLER IS COMING TO APACHE
      Look for the proposal later this week!
  • 29. THANK YOU
      ● Get involved in our journey to the Incubator
      ● Get started: check out the README and wiki at https://github.com/USCDataScience/sparkler
      Questions?