TO INFINITY AND BEYOND
Pranav Prakash
in.linkedin.com/in/prakashpranav
Search @LinkedIn
Hari Prasanna
in.linkedin.com/in/mostlycached
BigData @LinkedIn
The story of how solving one problem the OpenSource way
opened doors to so much more
OpenSource Chain Reaction
How “it” begins
How “it” grows
How “it” contributes
LUCENE
Information retrieval library
Started in 1999 as a SourceForge.net project
Joined Apache in 2001 as part of the Jakarta family
Became a top-level project in 2005
Used by LinkedIn, Twitter, Comcast
LUCENE
IR requirements
What would you do next?
Be better at searching
Crawl the web
Web wrapper around Lucene
Full-text search, near-real-time (NRT) indexing
Faceted search, clustering
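Full-text search in Lucene rests on an inverted index: a map from each term to the documents that contain it. The sketch below is a toy version of that idea in plain Java. The class name `TinyInvertedIndex` and its methods are hypothetical and are not Lucene's API; tokenization is reduced to lowercasing and splitting on non-alphanumeric characters, and there is no scoring.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index: maps each term to the sorted set of document ids containing it.
// Illustrative only; this is not Lucene's API or index format.
final class TinyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Index one document under the given id.
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // AND query: ids of the documents that contain every term of the query.
    SortedSet<Integer> search(String query) {
        SortedSet<Integer> result = null;
        for (String term : tokenize(query)) {
            SortedSet<Integer> docs =
                postings.getOrDefault(term, Collections.<Integer>emptySortedSet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }
}
```

Looking up a term is a single map access regardless of how many documents were indexed, which is why this structure sits at the core of search libraries.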
NUTCH
Web Crawler
Billions of pages on the internet
Alternative to commercial search engines
From a single tool to an ecosystem
• Breaking away from the initial problem statement
• The Google factor - GFS (2003), BigTable (2006), Pregel (2009) leading to HDFS, HBase and Giraph
• The thrill and chaos of working with alpha software - from dealing with compatibility issues to being a part of active development
• Interoperability between various systems
• Ever-widening scope of the project and leveraging other tools in the ecosystem
Ecosystem
Hadoop
• Features:
  • Distributed storage - HDFS
  • Distributed processing - MapReduce
  • Fault tolerance
  • Horizontal scalability
• Comparisons
  • RDBMS
  • Grid computing
• Use cases
  • Analytics (trends, predictions, summaries, etc.)
  • Searching and indexing
HBase
• Features:
  • Column-based storage
  • Horizontal scalability
  • Low-latency reads
  • MapReduce support
  • SQL support with Phoenix
  • Coprocessors and secondary indexes
• RDBMS vs HBase
• Use cases
  • Facebook messages
  • Monitoring with OpenTSDB
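HBase follows the BigTable data model, which can be pictured as a sparse, sorted map from row key to column to value. The sketch below imitates that shape in memory with nested sorted maps. The class name `TinySortedTable` and its methods are hypothetical and are not the HBase client API; timestamps, versions, regions and on-disk layout are all left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory picture of a sorted, sparse table. Illustrative only; not HBase code.
final class TinySortedTable {
    // row key -> ("family:qualifier" -> value), both levels kept in sorted order
    private final NavigableMap<String, NavigableMap<String, String>> rows = new TreeMap<>();

    // Write one cell; a row only stores the columns that were actually written.
    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // Point lookup by row key and column; returns null when the cell is absent.
    String get(String rowKey, String column) {
        NavigableMap<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(column);
    }

    // Range scan: row keys in [startRow, stopRow), returned in key order.
    List<String> scanRowKeys(String startRow, String stopRow) {
        return new ArrayList<>(rows.subMap(startRow, true, stopRow, false).keySet());
    }
}
```

Because rows are kept in key order, related rows that share a key prefix (for example, all messages of one user) sit next to each other and can be read with a single range scan.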
Vanilla MapReduce
Higher Abstractions
• Pig - data flow language
• Hive - SQL to MapReduce adapter
• Cascading - Pipeline primitives and other powerful abstractions
• Even higher abstractions with Cascalog (Cascading + Prolog), PigPen (Clojure for Pig) and Pig libraries like DataFu
Java MapReduce
Having run through how the MapReduce program works, the next step is to express it
in code. We need three things: a map function, a reduce function, and some code to
run the job. The map function is represented by the Mapper class, which declares an
abstract map() method. Example 2-3 shows the implementation of our map method.
Example 2-3. Mapper for maximum temperature example
import java.io.IOException;
Figure 2-1. MapReduce logical data flow
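The excerpt names three pieces: a map function, a reduce function, and code to run the job. The sketch below imitates that flow in plain Java without any Hadoop dependency, so it is not the book's Example 2-3 and does not use Hadoop's Mapper class. The class name `MaxTemperatureSketch` is hypothetical, and the input format `year,temperature` is a simplifying assumption made for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process imitation of the map -> group -> reduce flow. Not Hadoop code.
final class MaxTemperatureSketch {
    // Map step: turn one input line "year,temperature" into a (year, temperature) pair.
    static Map.Entry<String, Integer> map(String line) {
        String[] parts = line.split(",");
        return new AbstractMap.SimpleEntry<>(parts[0].trim(), Integer.parseInt(parts[1].trim()));
    }

    // Reduce step: collapse all temperatures seen for one year into their maximum.
    static int reduce(String year, List<Integer> temperatures) {
        return Collections.max(temperatures);
    }

    // Driver: map every line, group values by key (the "shuffle"), then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Integer> kv = map(line);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1949,111", "1950,0", "1950,22", "1949,78", "1950,-11");
        System.out.println(run(lines)); // prints {1949=111, 1950=22}
    }
}
```

In a real cluster the map and reduce calls run on different machines and the grouping step moves data across the network; the division of work between the three functions stays the same.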
Data Processing
• Data collection, aggregation and forwarding with Kafka, Flume, Scribe
• Real-time stream processing with Storm, enabling online machine learning and real-time analytics at Twitter and Groupon
• Graph processing of a trillion edges at Facebook with Apache Giraph
Leveraging “Big Data”
• Quickstarting with the Cloudera distribution
• Getting one step through the door - SlideShare’s journey
• Can your app survive without it? - Raising your bar
• Programmer, Administrator, DBA, Data Scientist - what hat are you wearing today?
• The road ahead
• Keeping track of the developments and giving back
In the Wild
• Scientific research - SciHadoop, decoding DNA
• Finance - fraud detection, algorithmic trading, risk management
• Web - network analysis, recommendation engines, personalization
• Government - election campaigns, intelligence systems
• Supply chain optimization, weather forecasting
How an open source project led to new opportunities and an entire ecosystem

How an open source project led to new opportunities and an entire ecosystem

  • 1. TO INFINITY AND BEYOND. Pranav Prakash (in.linkedin.com/in/prakashpranav), Search @LinkedIn; Hari Prasanna (in.linkedin.com/in/mostlycached), BigData @LinkedIn. The story of how solving one problem the OpenSource way opened doors to so much more.
  • 3. OpenSource Chain Reaction: how “it” begins; how “it” grows.
  • 4. OpenSource Chain Reaction: how “it” begins; how “it” grows; how “it” contributes.
  • 9. LUCENE: Information Retrieval Library. Started in 1999 as a SourceForge.net project; joined Apache in 2001 as part of the Jakarta family; became a Top Level Project in 2005. LinkedIn, Twitter, Comcast.
  • 10. LUCENE: IR requirements. What would you do next? Be better at searching; crawl the web.
  • 11. Web Wrapper around Lucene: full-text search, near-real-time (NRT) indexing, faceted search, clustering.
  • 12. NUTCH: Web Crawler. Billions of pages on the internet. An alternative to commercial engines.
  • 13. From a single tool to an ecosystem: breaking away from the initial problem statement; the Google factor, with GFS (2003), BigTable (2006) and Pregel (2009) leading to HDFS, HBase and Giraph; the thrill and chaos of working with alpha software, from dealing with compatibility issues to being part of active development; interoperability between various systems; the ever-widening scope of the project and leveraging other tools in the ecosystem.
  • 15. Hadoop. Features: distributed storage (HDFS), distributed processing (MapReduce), fault tolerance, horizontal scalability. Comparisons: RDBMS, grid computing. Use cases: analytics (trends, predictions, summaries, etc.), searching and indexing.
  • 16. HBase. Features: column-based storage, horizontal scalability, low-latency reads, MapReduce support, SQL support with Phoenix, coprocessors and secondary indexes. RDBMS vs HBase. Use cases: Facebook messages, monitoring with OpenTSDB.
  • 17. Data Processing. Vanilla MapReduce. Higher abstractions: Pig (data flow language); Hive (SQL to MapReduce adapter); Cascading (pipeline primitives and other powerful abstractions). Even higher abstractions with Cascalog (Cascading + Datalog), PigPen (Clojure for Pig) and Pig libraries like DataFu. Book excerpt shown on the slide, under the heading "Java MapReduce": "Having run through how the MapReduce program works, the next step is to express it in code. We need three things: a map function, a reduce function, and some code to run the job. The map function is represented by the Mapper class, which declares an abstract map() method. Example 2-3 shows the implementation of our map method." Captions visible in the excerpt: "Example 2-3. Mapper for maximum temperature example" (the listing is cut off after "import java.io.IOException;") and "Figure 2-1. MapReduce logical data flow".
  • 18. Data collection, aggregation and forwarding with Kafka, Flume, Scribe. Real-time stream processing with Storm to enable online machine learning and real-time analytics at Twitter and Groupon. Graph processing of a trillion edges at Facebook with Apache Giraph.
  • 19. Leveraging “Big Data”. Quickstarting with the Cloudera distribution. Getting one step through the door: SlideShare’s journey. Can your app survive without it? Raising your bar. Programmer, Administrator, DBA, Data Scientist: what hat are you wearing today? The road ahead. Keeping track of the developments and giving back.
  • 20. In the Wild. Scientific research: SciHadoop, decoding DNA. Finance: fraud detection, algorithmic trading, risk management. Web: network analysis, recommendation engines, personalization. Government: election campaigns, intelligence systems. Supply chain optimization, weather forecasting.
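
Slide 17 names the three pieces a MapReduce program needs: a map function, a reduce function, and some code to run the job. The sketch below imitates that split in plain Python for a maximum-temperature job. It is not the Hadoop API and not the book's Example 2-3. The "year,temperature" record format and the names map_fn, reduce_fn and run_job are assumptions made up for illustration, and the in-memory grouping step only stands in for the shuffle that a real cluster performs between map and reduce.

```python
from collections import defaultdict


def map_fn(line):
    """Map: parse one 'year,temperature' record (assumed format) and emit (year, temp)."""
    year, temp = line.split(",")
    yield year.strip(), int(temp)


def reduce_fn(year, temps):
    """Reduce: collapse every temperature seen for one year to its maximum."""
    return year, max(temps)


def run_job(lines):
    """Driver: map each record, group values by key (stand-in for the shuffle), reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce each key's values, visiting keys in sorted order.
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


# Hypothetical sample records, not real weather data.
records = ["1949,111", "1949,78", "1950,0", "1950,22", "1950,-11"]
result = run_job(records)
print(result)
```

Because map_fn looks at one record at a time and reduce_fn at one key at a time, both can in principle run in parallel across machines, which is the property the higher abstractions on the same slide (Pig, Hive, Cascading) build on.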