SlideShare ist ein Scribd-Unternehmen logo
1 von 8
Hadoop
Ecosystem
ACM Bay Area Data Mining Camp 2011
Patrick Nicolas
September 19, 2011
http://patricknicolas.blogspot.com
http://www.slideshare.net/pnicolas
https://github.com/prnicolas

Copyright 2011 Patrick Nicolas - All rights reserved

1
Overview
Beside providing developers and analysts with an open source
implementation of map-reduce functional model, the Hadoop
ecosystem incorporates analytical algorithms, tasks/workflow
managers and NoSQL stores.
Client code, Scripts
NoSQL

Analytics

Key-Values stores Mahout
Document stores
Multi-column stores
Graph databases

Configuration
Zookeeper

Workflow
Hive
Pig
Cascading

Map/Reduce framework
HDFS
Java Virtual Machine

Copyright 2011 Patrick Nicolas - All rights reserved

2
Key Components
The Hadoop ecosystem can be described as a data centric
taxonomy to analyze, aggregate, store and report data.
Admin.
File System

GFS,HDFS

MapReduce

K-V Stores

Redis, Memcache, Kyoto Cabinet

Doc Stores

Hadoop

Zookeeper

MongoDB, CouchDB

NoSQL

Multi-column
stores

HBase, Hypertable, BigData,
Cassandra, BerkeleyDB

Graph DB
Script
Workflow

Neo4j, GraphDB, InfiniteGraph
Pig
Cascading

SQL
Analytics

API

Hive

Mahout, Chunkwa

Copyright 2011 Patrick Nicolas - All rights reserved

3
NoSQL: Overview

Non relational data stores allow large amount of data to be
collected very efficiently. Contrary to RDBMS, NoSQL
schemas are optimized for sequential writes and therefore are
not appropriate for querying and reporting.

Key

Value

Column families, nested structures

NoSQL storages share the same basic key-value schema but
provide different method to describe values.

Copyright 2011 Patrick Nicolas - All rights reserved

4
NoSQL: Document Stores
Key-Value files (HDFS)
<key, value>
Distributed replicable blocks of sequential key-value string pairs

Key-Value stores (Redis, Memcache)
<key*, value>
Language independent, distributed, sorted key value pairs (keys
are list, sets or hashes) with in-memory caching and support for
atomic operations.

Document stores (MongoDB, CouchDB)
{ “k1”:val1, “k2”:val2 }
Fault-tolerant, document centric using dynamic schema of sorted
javascript objects and supports limited SQL like syntax.

Copyright 2011 Patrick Nicolas - All rights reserved

5
NoSQL: Tuples & Graphs

Sorted, ordered tuples(Cassandra, HBase,..)
{ name:x value: { key1: {name:key1, value:v1, tstamp:x}, key2:x}}

Fault-tolerant, distributed sorted, ordered, grouped (family)
‘super-column’ (map of unbounded number of columns)

Graph databases(Neo4j, GraphDB, InfiniteGraph,..)
Efficient transactional, traversal & storage of entity (vertice),
attribute & relationship (edge)

Copyright 2011 Patrick Nicolas - All rights reserved

6
Data Flow Managers
Map & Reduce tasks can be abstracted to a tasks or workflow
managers using high level language such as scripts, SQL or
UNIX-pipe like API. Those data flow tools hide the functional
complexity of Map-Reduce from domain experts.
Scripting

Pig

SQL

Hive

API: Pipes & flows

Cascading

API

Map
Map
Map
Map
Map

Combine
Combine

Reduce
Reduce
Reduce
Reduce

Copyright 2011 Patrick Nicolas - All rights reserved

7
Data Flow Code Samples
Pig Latin
A = LOAD „mydata' USING PigStorage() AS (f1:int, name:string);
B = GROUP A BY f1;
C = FOREACH B GENERATE COUNT ($0);

Hive
LOAD DATA LOCAL INPATH „xxx' OVERWRITE INTO TABLE z;
INSERT OVERWRITE TABLE z SELECT count(*) FROM y GROUP BY f1;

Cascading
Scheme srcScheme = new TextLine( new Fields( “line”));
Tap src = new Hfs(srcScheme, inpath);
Pipe counter = new Pipe (“count”);
counter = new GroupBy( counter, new Fields(“f1”);
FlowConnector connector = new FlowConnector(props);
Flow flow = connector.connect( “count”, src, sink, pipe);
flow.complete();

Copyright 2011 Patrick Nicolas - All rights reserved

8

Weitere ähnliche Inhalte

Was ist angesagt?

HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...Cloudera, Inc.
 
Geospatial Analytics at Scale with Deep Learning and Apache Spark with Tim hu...
Geospatial Analytics at Scale with Deep Learning and Apache Spark with Tim hu...Geospatial Analytics at Scale with Deep Learning and Apache Spark with Tim hu...
Geospatial Analytics at Scale with Deep Learning and Apache Spark with Tim hu...Databricks
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferretAndrii Gakhov
 
Payment Gateway Live hadoop project
Payment Gateway Live hadoop projectPayment Gateway Live hadoop project
Payment Gateway Live hadoop projectKamal A
 
Proud to be Polyglot - Riviera Dev 2015
Proud to be Polyglot - Riviera Dev 2015Proud to be Polyglot - Riviera Dev 2015
Proud to be Polyglot - Riviera Dev 2015Tugdual Grall
 
Summer Shorts: Big Data Integration
Summer Shorts: Big Data IntegrationSummer Shorts: Big Data Integration
Summer Shorts: Big Data Integrationibi
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyRohit Kulkarni
 
Cloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQLCloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQLliuknag
 
Hadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouseHadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouseAsis Mohanty
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightGert Drapers
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Spark Summit
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksDataWorks Summit
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesKelly Technologies
 
Summary machine learning and model deployment
Summary machine learning and model deploymentSummary machine learning and model deployment
Summary machine learning and model deploymentNovita Sari
 
An introduction to Apache Hadoop Hive
An introduction to Apache Hadoop HiveAn introduction to Apache Hadoop Hive
An introduction to Apache Hadoop HiveMike Frampton
 
Big Data on the Microsoft Platform
Big Data on the Microsoft PlatformBig Data on the Microsoft Platform
Big Data on the Microsoft PlatformAndrew Brust
 
Hadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and FutureHadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and FutureDataWorks Summit
 

Was ist angesagt? (20)

Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
 
Geospatial Analytics at Scale with Deep Learning and Apache Spark with Tim hu...
Geospatial Analytics at Scale with Deep Learning and Apache Spark with Tim hu...Geospatial Analytics at Scale with Deep Learning and Apache Spark with Tim hu...
Geospatial Analytics at Scale with Deep Learning and Apache Spark with Tim hu...
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Payment Gateway Live hadoop project
Payment Gateway Live hadoop projectPayment Gateway Live hadoop project
Payment Gateway Live hadoop project
 
Cloudera Hadoop Distribution
Cloudera Hadoop DistributionCloudera Hadoop Distribution
Cloudera Hadoop Distribution
 
Proud to be Polyglot - Riviera Dev 2015
Proud to be Polyglot - Riviera Dev 2015Proud to be Polyglot - Riviera Dev 2015
Proud to be Polyglot - Riviera Dev 2015
 
Apache drill
Apache drillApache drill
Apache drill
 
Summer Shorts: Big Data Integration
Summer Shorts: Big Data IntegrationSummer Shorts: Big Data Integration
Summer Shorts: Big Data Integration
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Cloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQLCloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQL
 
Hadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouseHadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouse
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 
Summary machine learning and model deployment
Summary machine learning and model deploymentSummary machine learning and model deployment
Summary machine learning and model deployment
 
An introduction to Apache Hadoop Hive
An introduction to Apache Hadoop HiveAn introduction to Apache Hadoop Hive
An introduction to Apache Hadoop Hive
 
Big Data on the Microsoft Platform
Big Data on the Microsoft PlatformBig Data on the Microsoft Platform
Big Data on the Microsoft Platform
 
Hadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and FutureHadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and Future
 

Andere mochten auch

Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Senthil Kumar
 
Creating an Ecosystem Platform with Vertical PaaS
Creating an Ecosystem Platform with Vertical PaaSCreating an Ecosystem Platform with Vertical PaaS
Creating an Ecosystem Platform with Vertical PaaSWSO2
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop EcosystemJ Singh
 
Understanding the Online Advertising Technology Landscape
Understanding the Online Advertising Technology Landscape Understanding the Online Advertising Technology Landscape
Understanding the Online Advertising Technology Landscape Karina Sanz
 
Business Ecosystem Design
Business Ecosystem DesignBusiness Ecosystem Design
Business Ecosystem DesignJan Schmiedgen
 

Andere mochten auch (6)

Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview
 
Creating an Ecosystem Platform with Vertical PaaS
Creating an Ecosystem Platform with Vertical PaaSCreating an Ecosystem Platform with Vertical PaaS
Creating an Ecosystem Platform with Vertical PaaS
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
Media Buying Platform Ecosystem
Media Buying Platform EcosystemMedia Buying Platform Ecosystem
Media Buying Platform Ecosystem
 
Understanding the Online Advertising Technology Landscape
Understanding the Online Advertising Technology Landscape Understanding the Online Advertising Technology Landscape
Understanding the Online Advertising Technology Landscape
 
Business Ecosystem Design
Business Ecosystem DesignBusiness Ecosystem Design
Business Ecosystem Design
 

Ähnlich wie Hadoop Ecosystem

Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Milos Milovanovic
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoMark Kromer
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
Big data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosqlBig data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosqlKhanderao Kand
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Darko Marjanovic
 
Pivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache HadoopPivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache Hadoopmarklpollack
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irdatastack
 
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scalaSunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scalaMopuru Babu
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Rajan Kanitkar
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderDmitry Makarchuk
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big pictureJ S Jodha
 
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)Claudiu Barbura
 

Ähnlich wie Hadoop Ecosystem (20)

Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Big data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosqlBig data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosql
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014
 
Pivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache HadoopPivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache Hadoop
 
Meetup Oracle Database BCN: 2.1 Data Management Trends
Meetup Oracle Database BCN: 2.1 Data Management TrendsMeetup Oracle Database BCN: 2.1 Data Management Trends
Meetup Oracle Database BCN: 2.1 Data Management Trends
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scalaSunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris Schneider
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
 

Mehr von Patrick Nicolas

Autonomous medical coding with discriminative transformers
Autonomous medical coding with discriminative transformersAutonomous medical coding with discriminative transformers
Autonomous medical coding with discriminative transformersPatrick Nicolas
 
Open Source Lambda Architecture for deep learning
Open Source Lambda Architecture for deep learningOpen Source Lambda Architecture for deep learning
Open Source Lambda Architecture for deep learningPatrick Nicolas
 
AI for electronic health records
AI for electronic health recordsAI for electronic health records
AI for electronic health recordsPatrick Nicolas
 
Monadic genetic kernels in Scala
Monadic genetic kernels in ScalaMonadic genetic kernels in Scala
Monadic genetic kernels in ScalaPatrick Nicolas
 
Scala for Machine Learning
Scala for Machine LearningScala for Machine Learning
Scala for Machine LearningPatrick Nicolas
 
Stock Market Prediction using Hidden Markov Models and Investor sentiment
Stock Market Prediction using Hidden Markov Models and Investor sentimentStock Market Prediction using Hidden Markov Models and Investor sentiment
Stock Market Prediction using Hidden Markov Models and Investor sentimentPatrick Nicolas
 
Advanced Functional Programming in Scala
Advanced Functional Programming in ScalaAdvanced Functional Programming in Scala
Advanced Functional Programming in ScalaPatrick Nicolas
 
Adaptive Intrusion Detection Using Learning Classifiers
Adaptive Intrusion Detection Using Learning ClassifiersAdaptive Intrusion Detection Using Learning Classifiers
Adaptive Intrusion Detection Using Learning ClassifiersPatrick Nicolas
 
Data Modeling using Symbolic Regression
Data Modeling using Symbolic RegressionData Modeling using Symbolic Regression
Data Modeling using Symbolic RegressionPatrick Nicolas
 
Semantic Analysis using Wikipedia Taxonomy
Semantic Analysis using Wikipedia TaxonomySemantic Analysis using Wikipedia Taxonomy
Semantic Analysis using Wikipedia TaxonomyPatrick Nicolas
 
Taxonomy-based Contextual Ads Targeting
Taxonomy-based Contextual Ads TargetingTaxonomy-based Contextual Ads Targeting
Taxonomy-based Contextual Ads TargetingPatrick Nicolas
 
Multi-tenancy in Private Clouds
Multi-tenancy in Private CloudsMulti-tenancy in Private Clouds
Multi-tenancy in Private CloudsPatrick Nicolas
 

Mehr von Patrick Nicolas (12)

Autonomous medical coding with discriminative transformers
Autonomous medical coding with discriminative transformersAutonomous medical coding with discriminative transformers
Autonomous medical coding with discriminative transformers
 
Open Source Lambda Architecture for deep learning
Open Source Lambda Architecture for deep learningOpen Source Lambda Architecture for deep learning
Open Source Lambda Architecture for deep learning
 
AI for electronic health records
AI for electronic health recordsAI for electronic health records
AI for electronic health records
 
Monadic genetic kernels in Scala
Monadic genetic kernels in ScalaMonadic genetic kernels in Scala
Monadic genetic kernels in Scala
 
Scala for Machine Learning
Scala for Machine LearningScala for Machine Learning
Scala for Machine Learning
 
Stock Market Prediction using Hidden Markov Models and Investor sentiment
Stock Market Prediction using Hidden Markov Models and Investor sentimentStock Market Prediction using Hidden Markov Models and Investor sentiment
Stock Market Prediction using Hidden Markov Models and Investor sentiment
 
Advanced Functional Programming in Scala
Advanced Functional Programming in ScalaAdvanced Functional Programming in Scala
Advanced Functional Programming in Scala
 
Adaptive Intrusion Detection Using Learning Classifiers
Adaptive Intrusion Detection Using Learning ClassifiersAdaptive Intrusion Detection Using Learning Classifiers
Adaptive Intrusion Detection Using Learning Classifiers
 
Data Modeling using Symbolic Regression
Data Modeling using Symbolic RegressionData Modeling using Symbolic Regression
Data Modeling using Symbolic Regression
 
Semantic Analysis using Wikipedia Taxonomy
Semantic Analysis using Wikipedia TaxonomySemantic Analysis using Wikipedia Taxonomy
Semantic Analysis using Wikipedia Taxonomy
 
Taxonomy-based Contextual Ads Targeting
Taxonomy-based Contextual Ads TargetingTaxonomy-based Contextual Ads Targeting
Taxonomy-based Contextual Ads Targeting
 
Multi-tenancy in Private Clouds
Multi-tenancy in Private CloudsMulti-tenancy in Private Clouds
Multi-tenancy in Private Clouds
 

Kürzlich hochgeladen

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Kürzlich hochgeladen (20)

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

Hadoop Ecosystem

  • 1. Hadoop Ecosystem ACM Bay Area Data Mining Camp 2011 Patrick Nicolas September 19, 2011 http://patricknicolas.blogspot.com http://www.slideshare.net/pnicolas https://github.com/prnicolas Copyright 2011 Patrick Nicolas - All rights reserved 1
  • 2. Overview Beside providing developers and analysts with an open source implementation of map-reduce functional model, the Hadoop ecosystem incorporates analytical algorithms, tasks/workflow managers and NoSQL stores. Client code, Scripts NoSQL Analytics Key-Values stores Mahout Document stores Multi-column stores Graph databases Configuration Zookeeper Workflow Hive Pig Cascading Map/Reduce framework HDFS Java Virtual Machine Copyright 2011 Patrick Nicolas - All rights reserved 2
  • 3. Key Components The Hadoop ecosystem can be described as a data centric taxonomy to analyze, aggregate, store and report data. Admin. File System GFS,HDFS MapReduce K-V Stores Redis, Memcache, Kyoto Cabinet Doc Stores Hadoop Zookeeper MongoDB, CouchDB NoSQL Multi-column stores HBase, Hypertable, BigData, Cassandra, BerkeleyDB Graph DB Script Workflow Neo4j, GraphDB, InfiniteGraph Pig Cascading SQL Analytics API Hive Mahout, Chunkwa Copyright 2011 Patrick Nicolas - All rights reserved 3
  • 4. NoSQL: Overview Non relational data stores allow large amount of data to be collected very efficiently. Contrary to RDBMS, NoSQL schemas are optimized for sequential writes and therefore are not appropriate for querying and reporting. Key Value Column families, nested structures NoSQL storages share the same basic key-value schema but provide different method to describe values. Copyright 2011 Patrick Nicolas - All rights reserved 4
  • 5. NoSQL: Document Stores Key-Value files (HDFS) <key, value> Distributed replicable blocks of sequential key-value string pairs Key-Value stores (Redis, Memcache) <key*, value> Language independent, distributed, sorted key value pairs (keys are list, sets or hashes) with in-memory caching and support for atomic operations. Document stores (MongoDB, CouchDB) { “k1”:val1, “k2”:val2 } Fault-tolerant, document centric using dynamic schema of sorted javascript objects and supports limited SQL like syntax. Copyright 2011 Patrick Nicolas - All rights reserved 5
  • 6. NoSQL: Tuples & Graphs Sorted, ordered tuples(Cassandra, HBase,..) { name:x value: { key1: {name:key1, value:v1, tstamp:x}, key2:x}} Fault-tolerant, distributed sorted, ordered, grouped (family) ‘super-column’ (map of unbounded number of columns) Graph databases(Neo4j, GraphDB, InfiniteGraph,..) Efficient transactional, traversal & storage of entity (vertice), attribute & relationship (edge) Copyright 2011 Patrick Nicolas - All rights reserved 6
  • 7. Data Flow Managers Map & Reduce tasks can be abstracted to a tasks or workflow managers using high level language such as scripts, SQL or UNIX-pipe like API. Those data flow tools hide the functional complexity of Map-Reduce from domain experts. Scripting Pig SQL Hive API: Pipes & flows Cascading API Map Map Map Map Map Combine Combine Reduce Reduce Reduce Reduce Copyright 2011 Patrick Nicolas - All rights reserved 7
  • 8. Data Flow Code Samples Pig Latin A = LOAD „mydata' USING PigStorage() AS (f1:int, name:string); B = GROUP A BY f1; C = FOREACH B GENERATE COUNT ($0); Hive LOAD DATA LOCAL INPATH „xxx' OVERWRITE INTO TABLE z; INSERT OVERWRITE TABLE z SELECT count(*) FROM y GROUP BY f1; Cascading Scheme srcScheme = new TextLine( new Fields( “line”)); Tap src = new Hfs(srcScheme, inpath); Pipe counter = new Pipe (“count”); counter = new GroupBy( counter, new Fields(“f1”); FlowConnector connector = new FlowConnector(props); Flow flow = connector.connect( “count”, src, sink, pipe); flow.complete(); Copyright 2011 Patrick Nicolas - All rights reserved 8