Liferay & Big Data 
Getting value from your data 
Miguel Ángel Pastor Olivar 
miguel.pastor@liferay.com
Who am I?
• Some random guy
• Member of the Liferay core infrastructure team
• Disclaimer: not a computer scientist
• @miguelinlas3
What are we going to talk about?
• Big Data: what is this about?
• Simple architecture proposal
• Use cases
• Questions (and hopefully answers)
Big Data?
• Data is so big that regular solutions are:
– Extremely slow
– Too small
– Really expensive
• How we use all the data we already own
• Volume
– Transactions, data streaming from social media, …
• Velocity
– Torrents of data in real time
• Variety
– Numerical data, text, email, video, audio, …
Popular usages
• Recommender systems
• Predicting the future:
– Netflix autoscales based on past network traffic data
• Churn models
– Big telco companies build social networks to reduce churn
• Sentiment analysis
– Are people talking about you on the Internet?
• Real Time Bidding
– Optimise advertising
• Health care
– Improve patients' health while reducing costs
– Improve quality of life of multiple sclerosis patients
Terminology
• Storage models
– How to store relevant information
• Computation models
– Process and transform all the information
• Analytics
– How we can take actions based on the previous steps
Big Data Architectures
Data storage
Hadoop Distributed File System (HDFS)
• Java-based file system
• Scalable, fault-tolerant, distributed storage
• Designed to run on commodity hardware
• Closely related to MapReduce
Source: http://hortonworks.com/
NoSQL storage
• Semistructured data
• Focused on
– Horizontal scalability
– Availability
• Different trade-offs: CAP, BASE, …
NewSQL storage
• Modern relational databases
• Same scalable performance as NoSQL for OLTP
• Maintain ACID guarantees
• A few alternatives: VoltDB, Google Spanner, FoundationDB, …
Computation and analytics
Apache Hadoop
Apache Hadoop MapReduce
• Distributed processing
• Large datasets
• Clusters of computers
• Simple programming model
• Verbose and hard to use API
#LRNAS2014
[Diagram: MapReduce word count]
Input: "Liferay projects is the best Open Source project"
Map input: (index, "…") records
Sort and shuffle: (best, [1]) (is, [1]) (Liferay, [1]) (Open, [1]) (project, [1,1]) (Source, [1]) (the, [1])
Output: best: 1, is: 1, Liferay: 1, Open: 1, project: 2, Source: 1, the: 1
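The word count flow above can be sketched in plain Python. This is only an illustration of the map, sort-and-shuffle and reduce steps in a single process, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and the input sentence is adapted slightly so that "project" really appears twice.

```python
from collections import defaultdict


def map_phase(index, line):
    """Map: emit a (word, 1) pair for every word of one (index, line) record."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Sort and shuffle: group all emitted values by key, ordered by key."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return sorted(groups.items())


def reduce_phase(word, counts):
    """Reduce: collapse the list of ones for a word into a single total."""
    return word, sum(counts)


def word_count(lines):
    pairs = []
    for index, line in enumerate(lines):
        pairs.extend(map_phase(index, line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs))


# Assumed input, split into two records to mimic two input splits.
lines = ["Liferay project is the best", "Open Source project"]
counts = word_count(lines)
print(counts)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the sketch keeps only the data flow.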
• Batch model data crunching
• Not so good at event stream processing
• But …
• Many algorithms are hard to implement using MapReduce
• Cascading, Scalding, Cascalog, Impala, …
Apache Storm
• Distributed realtime computation system
• Easy to reliably process unbounded streams of data
• Multi-language support
• Realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, …
[Diagram: Storm topology in which two spouts feed three bolts]
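The spout and bolt roles can be pictured with Python generators. This is a single-process sketch under simplifying assumptions, not Storm's API: sentence_spout, split_bolt and count_bolt are hypothetical names, and the spout emits a finite list so that the example terminates, whereas a real spout would emit an unbounded stream.

```python
from collections import Counter


def sentence_spout():
    """Spout: the source of the stream (finite here, unbounded in practice)."""
    yield from [
        "liferay is open source",
        "storm processes streams",
        "liferay and storm",
    ]


def split_bolt(sentences):
    """Bolt: consume sentences and emit one tuple per word."""
    for sentence in sentences:
        for word in sentence.split():
            yield word


def count_bolt(words):
    """Bolt: keep a running count and emit (word, count) after every tuple."""
    counts = Counter()
    for word in words:
        counts[word] += 1
        yield word, counts[word]


# Wire the topology: spout -> split bolt -> count bolt.
running = list(count_bolt(split_bolt(sentence_spout())))
final = dict(running)  # the last emitted count for each word wins
print(final)
```

Each tuple is processed as it arrives, so the counts are always up to date, which is the main contrast with the batch model of MapReduce.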
Apache Spark
• Fast and general-purpose cluster computing
• Developed at UC Berkeley's AMPLab
• High-level APIs (not MapReduce)
• Optimised engine:
– Supports general execution graphs
• Higher-level tools:
– Spark SQL, MLlib, Spark Streaming, GraphX
Apache Mahout
• Scalable machine learning library
• Built on top of Hadoop
• Some algorithms don’t require Hadoop at all
R language
• Focused on:
– Data visualisation
– Statistical computations
– Analysis of data
• Tons of built-in packages
• Connect to Hadoop through Hadoop Streaming
• Not a fast language
Reference Architecture
[Diagram: reference architecture. Labels: RDBMS, Event Broker, Hadoop, User Tracking, NoSQL Storage, System Events, Search, Data, Logs, Monitoring, Data Warehouse, Streaming, Social Graph]
Datasources
• System events
• User tracking (client side)
– Clicks, navigation, activities, …
• Monitoring (transactions, page load times, …)
• Models (message boards, blogs, wiki, …)
• Custom developments …
Event broker
[Diagram: a data source writes to the end of a log (offsets 0 to 9); System A and System B each read from the log]
Apache Kafka
• Publish-subscribe as distributed commit log
• Fast
• Scalable
• Durable
• Distributed by design
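The "publish-subscribe as a commit log" idea can be sketched as an append-only list in which every reader keeps its own offset. This is an in-memory illustration only, not Kafka's client API: the CommitLog class, its append and poll methods, and the system names are hypothetical.

```python
class CommitLog:
    """Append-only log; each consumer tracks its own read position."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # consumer name -> next offset to read

    def append(self, record):
        """Write a record at the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Return the next batch for this consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


log = CommitLog()
for n in range(10):
    log.append(f"event-{n}")

# Two systems read the same log at their own pace.
a_first = log.poll("system-a", max_records=4)
b_all = log.poll("system-b")
a_rest = log.poll("system-a")
print(a_first, len(b_all), a_rest[0])
```

Because records are never removed by a read, a slow consumer does not block a fast one, and a new consumer can start from offset 0 and replay the history.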
[Diagram: Kafka cluster with Broker A, Broker B and Broker C between a Producer and a Consumer, plus ZooKeeper]
Computation and analytics
Batch processing?
Real time processing?
Machine learning algorithms?
Graph analysis?
Unified programming model?
Apache Spark
• Fast and general engine for large-scale data processing
• Write your apps in Java, Scala or Python
• Runs on the YARN cluster manager
• Can read any existing Hadoop data (HDFS)
• In memory or on disk
Apache Spark Main Components
[Diagram: Apache Spark core with Spark SQL, Spark Streaming, MLlib and GraphX on top]
Spark Core
• A driver program runs the main function and executes various parallel operations on a cluster
• Resilient Distributed Datasets (RDD), created from:
– HDFS (or any Hadoop file system)
– A Scala collection
• Second abstraction: shared variables
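The way a driver builds up parallel operations on an RDD and only runs them when a result is needed can be illustrated with a toy class. This is not Spark's API and runs in a single process: MiniRDD and its methods are hypothetical stand-ins that only mimic the idea of lazy transformations (map, filter) followed by actions (collect, reduce).

```python
import functools


class MiniRDD:
    """Toy stand-in for an RDD: records transformations, computes on demand."""

    def __init__(self, source, lineage=()):
        self._source = source    # any re-iterable collection
        self._lineage = lineage  # tuple of (kind, function) steps

    def map(self, fn):
        return MiniRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return MiniRDD(self._source, self._lineage + (("filter", fn),))

    def _compute(self):
        # Replay the recorded lineage over the source, one item at a time.
        for item in self._source:
            keep = True
            for kind, fn in self._lineage:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def collect(self):
        """Action: run the lineage and return all results."""
        return list(self._compute())

    def reduce(self, fn):
        """Action: run the lineage and fold the results into one value."""
        return functools.reduce(fn, self._compute())


calls = []


def traced_square(x):
    calls.append(x)  # lets us observe when the work actually happens
    return x * x


numbers = MiniRDD(range(1, 6))
even_squares = numbers.map(traced_square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # nothing has been computed yet

result = even_squares.collect()
total = even_squares.reduce(lambda a, b: a + b)
print(calls_before_action, result, total)
```

Keeping the lineage rather than the data is also what makes recomputation possible after a failure, which is the "resilient" part of the name.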
Spark SQL
• Mix SQL queries with Spark programs
• Unified Data Access
• Hive compatibility
• Standard JDBC or ODBC connectivity
• Same engine for both interactive and long-running queries
Spark Streaming
• Build your apps using high-level operators
• Fault tolerance: exactly-once semantics out of the box
• Combine streaming with batch and interactive queries
• Can read from HDFS, Flume, Kafka, Twitter and ZeroMQ
• Define your own custom data sources
Spark MLlib
• Basic statistics
– Summary statistics
– Correlations
– …
• Classification and regression
– Linear models
– Decision trees
– Naive Bayes
• Clustering
– K-Means
• Collaborative filtering
– Alternating least squares
• Dimensionality reduction
– Singular value decomposition
– Principal component analysis
Spark GraphX
• Graphs API and graph-parallel computation
• Growing scale and importance
– From social networks to language modelling
• Directed multigraph with properties attached to each vertex and edge
• Growing collection of graph algorithms and builders
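A directed multigraph with properties on each vertex and edge can be modelled with ordinary Python collections. The sketch below is not GraphX's API: the vertex ids, names and the "relation" property are made-up sample data, and the computations (in-degree, followers) are simple single-machine loops.

```python
from collections import Counter

# Vertices: id -> properties (hypothetical sample data).
vertices = {
    1: {"name": "ana"},
    2: {"name": "bob"},
    3: {"name": "eve"},
}

# Edges: (source id, destination id, properties).
# Two edges share the pair (1, 2), which a multigraph allows.
edges = [
    (1, 2, {"relation": "follows"}),
    (1, 2, {"relation": "messages"}),
    (2, 3, {"relation": "follows"}),
    (3, 1, {"relation": "follows"}),
]

# In-degree: number of incoming edges per vertex, parallel edges included.
in_degree = Counter(dst for _, dst, _ in edges)

# Followers per user name, using the edge property to filter.
followers = {
    props["name"]: sorted(
        vertices[src]["name"]
        for src, dst, edge_props in edges
        if dst == vertex_id and edge_props["relation"] == "follows"
    )
    for vertex_id, props in vertices.items()
}
print(in_degree, followers)
```

A graph-parallel engine distributes the vertices and edges across machines and runs this kind of per-vertex aggregation in parallel.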
Live demo!
Building a message classifier
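The demo code itself is not part of these slides. As a rough idea of what a message classifier involves, here is a minimal multinomial Naive Bayes sketch (one of the MLlib algorithms listed earlier) in plain Python. Everything in it is an assumption made for illustration: the train and classify functions, the labels and the four training messages.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Count words per label and messages per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in samples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Pick the label with the highest log-probability (add-one smoothing)."""
    vocabulary = {word for counts in word_counts.values() for word in counts}
    total_messages = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_messages)  # prior
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training messages.
training = [
    ("how do I deploy a portlet", "question"),
    ("how can I configure the cluster", "question"),
    ("thanks this release is great", "praise"),
    ("great work thanks a lot", "praise"),
]
model = train(training)
label_a = classify("how do I configure search", *model)
label_b = classify("thanks great job", *model)
print(label_a, label_b)
```

With a real data set the same counting would be distributed across a cluster, for example over message board posts collected through the event broker.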
Takeaways
• Not about data size, but how you use it
• You already own tons of data, you just need to get value from it
• There is no silver bullet: you have plenty of alternatives
• JVM-based Big Data technologies are usually a great choice
• Try it yourself!!
References
• Apache Kafka
• Apache Spark
• Apache Storm
• Apache Hadoop
• Big Data definition at Wikipedia
• Liferay Kafka Bridge
• What every software engineer should know about a log
Thank you!!
Questions (and hopefully answers)

Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...gragchanchal546
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样wsppdmt
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRajesh Mondal
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangeThinkInnovation
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...SOFTTECHHUB
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...gajnagarg
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...kumargunjan9515
 

Kürzlich hochgeladen (20)

Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 

Liferay & Big Data Dev Con 2014

  • 1. Liferay & Big Data: Getting value from your data. Miguel Ángel Pastor Olivar, miguel.pastor@liferay.com
  • 2. Who am I? • Some random guy • Member of the Liferay core infrastructure team • Disclaimer: Not a computer scientist • @miguelinlas3
  • 3. What are we going to talk about? • Big Data: what is this about? • Simple architecture proposal • Use cases • Questions (and hopefully answers)
  • 5. • Data is so big that regular solutions are: – Extremely slow – Too small – Really expensive • How we use all the data we already own
  • 6. • Volume – Transactions, data streaming from social media, … • Velocity – Torrents of data in real time • Variety – Numerical data, text, email, video, audio, …
  • 8. • Recommender systems • Predicting the future: – Netflix does autoscaling based on past network traffic data • Churn models – Big telco companies build social networks to reduce churn
  • 9. • Sentiment analysis – Are people talking about you on the Internet? • Real Time Bidding – Optimise advertising • Health care – Improve patients' health while reducing costs – Improve quality of life of multiple sclerosis patients
  • 11. • Storage models • How to store relevant information • Computation models • Process and transform all the information • Analytics • How we can take actions based on the previous steps
  • 14. Hadoop Distributed File System (HDFS) • Java-based file system • Scalable, fault-tolerant, distributed storage • Designed to run on commodity hardware • Closely related to MapReduce
  • 17. • Semistructured data • Focused on: • Horizontal scalability • Availability • Different trade-offs: CAP, BASE, …
  • 19. • Modern relational databases • Same scalable performance as NoSQL for OLTP • Maintain ACID guarantees • A few alternatives: VoltDB, Google Spanner, FoundationDB, …
  • 22. Apache Hadoop MapReduce • Distributed processing • Large datasets • Clusters of computers • Simple programming model • Verbose and hard to use API
  • 23. Word count example. Input: “Liferay projects is the best Open Source project”. Map input: (index, “…”) (index, “…”) (index, “…”) (index, “…”) (index, “…”). Sort and shuffle: (best, [1]) (is, [1]) (Liferay, [1]) (Open, [1]) (project, [1,1]) (Source, [1]) (the, [1]). Output: best: 1, is: 1, Liferay: 1, Open: 1, project: 2, Source: 1, the: 1
  • 24. • Batch model data crunching • Not so good at event stream processing • But … • Many algorithms are hard to implement using MapReduce • Cascading, Scalding, Cascalog, Impala, …
  • 26. • Distributed realtime computation system • Easy to reliably process unbounded streams of data • Multi-language support • Realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, …
  • 27. Spout Spout Bolt Bolt Bolt
  • 29. • Fast and general-purpose cluster computing • Developed by Berkeley AMPLab • High-level APIs (not MapReduce) • Optimised engine: • supports general execution graphs • Higher-level tools: • Spark SQL, MLlib, Spark Streaming, GraphX
  • 31. • Scalable machine learning library • Built on top of Hadoop • Some algorithms don’t require Hadoop at all
  • 33. • Focused on: • Data visualisation • Statistical computations • Analysis of data • Tons of built-in packages • Connect to Hadoop through Hadoop Streaming • Not a fast language
  • 35. RDBMS Event Broker Hadoop User Tracking NoSQL Storage System Events Search Data Logs Monitoring Data Warehouse Streaming Social Graph
  • 37. RDBMS Event Broker Hadoop User Tracking NoSQL Storage System Events Search Data Logs Monitoring Data Warehouse Streaming Social Graph
  • 38. • System events • User tracking (client side) • Clicks, navigation, activities, … • Monitoring (transactions, page load times, …) • Models (message boards, blogs, wiki …) • Custom developments …
  • 40. RDBMS Event Broker Hadoop User Tracking NoSQL Storage System Events Search Data Logs Monitoring Data Warehouse Streaming Social Graph
  • 41. Data Source 0 1 2 3 4 5 6 7 8 Writes 9 Reads Reads System A System B
  • 42. Apache Kafka • Publish-subscribe as distributed commit log • Fast • Scalable • Durable • Distributed by design
  • 43. Broker A Broker B Producer Consumer Broker C ZooKeeper
  • 45. RDBMS Event Broker Hadoop User Tracking NoSQL Storage System Events Search Data Logs Monitoring Data Warehouse Streaming Social Graph
  • 46. Batch processing? Real time processing? Machine learning algorithms? Graph analysis? Unified programming model?
  • 48. • Fast and general engine for large-scale data processing • Write your apps in Java, Scala or Python • Runs on the YARN cluster manager • Can read any existing Hadoop data (HDFS) • In memory or on disk
  • 49. Apache Spark Main Components: Apache Spark, Spark SQL, Spark Streaming, MLlib, GraphX
  • 51. • Driver program runs the main function and executes various parallel operations on a cluster • Resilient Distributed Datasets (RDD) • HDFS (or any Hadoop file system) • Scala collection • Second abstraction: shared variables
  • 53. • Mix SQL queries with Spark programs • Unified Data Access • Hive compatibility • Standard JDBC or ODBC connectivity • Same engine for both interactive and long running queries
  • 55. • Build your apps using high-level operators • Fault tolerance: exactly-once semantics out of the box • Combine streaming with batch and interactive queries • Can read from HDFS, Flume, Kafka, Twitter and ZeroMQ • Define your own custom data sources
  • 57. • Basic statistics • Summary statistics • Correlations • … • Classification and regression • Linear models • Decision trees • Naive Bayes
  • 58. • Clustering • K-Means • Collaborative filtering • Alternating least squares • Dimensionality reduction • Singular value decomposition • Principal component analysis
  • 60. • Graphs API and graph-parallel computation • Growing scale and importance • From social networks to language modelling • Directed multigraph with properties attached to each vertex and edge • Growing collection of graph algorithms and builders
  • 61. Live demo! Building a messages classifier
  • 63. • Not about data size, but how you use it • You already own tons of data, you just need to get value from it • There is no silver bullet: you have plenty of alternatives • JVM-based Big Data technologies are usually a great choice • Try it yourself!!
  • 65. • Apache Kafka • Apache Spark • Apache Storm • Apache Hadoop • Big Data definition at Wikipedia • Liferay Kafka Bridge • What every software engineer should know about a log
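
The word count walked through on slide 23 can be sketched in plain Python, without Hadoop, to make the map, sort-and-shuffle and reduce steps concrete. This is only an illustration of the programming model: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the two input lines are assumptions, and the input uses "project" twice so that the counts match those on the slide.

```python
from collections import defaultdict


def map_phase(index, line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Sort and shuffle: group the emitted values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Case-insensitive ordering, mirroring the order shown on the slide.
    return sorted(groups.items(), key=lambda item: item[0].lower())


def reduce_phase(key, values):
    """Reduce step: sum the list of ones collected for a single word."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    mapped = []
    for index, line in enumerate(lines):
        mapped.extend(map_phase(index, line))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped))


# Hypothetical input, split into two (index, line) records.
counts = word_count(["Liferay project is the best", "Open Source project"])
print(counts)
```

In a real cluster each step runs in parallel on different machines and the shuffle moves data over the network; here everything happens in one process, which is enough to see why "project" ends up grouped as (project, [1, 1]) before being reduced to 2.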