Starschema
Experience and Innovation
Topics today
• Who we are and what we are doing
• Big Data era
• BSP (Bulk synchronous parallel)
• Apache Giraph
• Storm
• Our use case
• Conclusion
Facts about Starschema
Company data
• Continuous growth: 25 FTE plus external resources, over $1.5 million EBIT
• Open source projects: sharing knowledge with the public through open source projects in the ETL and data warehousing fields
• Founded in 2006: the company was founded by private owners with a decade of BI and data warehouse background
• R&D: cooperation with Obuda University and NKE, EU co-funded technology research and development
Big Data era: The rise of Hadoop
Google               Year of white paper   Apache      Year
GFS                  2003                  HDFS        2007
MapReduce            2004                  Hadoop MR   2007
BigTable             2006                  HBase       2007
Chubby Lock Service  2006                  ZooKeeper   2007
Pregel               2009                  Giraph      2011
Dremel               2010                  Drill       2012 ?
Which is next? (Curator, Falcon, MRQL, etc.)
BSP (Bulk synchronous parallel): What is it? What is it good for?
• Leslie Valiant, article in Nov. 1990
• Supersteps
• Data stored in local memory
• Asynchronous data processing
• Barrier sync
• Optimal load balancing (more logical processes than physical processors, random allocation of processes)
• Solution differences (protocols, buffer management, routing strategies)
• No deadlocks or race conditions (since there are no circular dependencies)
• Use cases
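The superstep and barrier ideas on this slide can be illustrated with a small Python sketch. It is not taken from the talk: the function name `run_bsp_max`, the ring layout, and the choice of "find the maximum" as the workload are assumptions made for illustration. Each worker keeps its own local value, sends it to its right-hand neighbour, waits at a barrier, and then folds the received messages into its local state.

```python
import threading


def run_bsp_max(values, supersteps):
    """Spread the maximum around a ring of workers, one hop per superstep."""
    n = len(values)
    local = list(values)             # local[i] is only written by worker i
    inbox = [[] for _ in range(n)]   # inbox[i] is only read by worker i
    barrier = threading.Barrier(n)

    def worker(i):
        for _ in range(supersteps):
            # Communication phase: send the local value to the right neighbour.
            inbox[(i + 1) % n].append(local[i])
            barrier.wait()           # all messages of this superstep are delivered
            # Computation phase: fold received messages into local state.
            local[i] = max([local[i]] + inbox[i])
            inbox[i].clear()
            barrier.wait()           # nobody sends again before inboxes are cleared

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return local


if __name__ == "__main__":
    print(run_bsp_max([3, 9, 1, 4], supersteps=3))
```

With four workers, three supersteps are enough for every worker to hold the global maximum. The second barrier is a consequence of using shared Python lists as mailboxes; a message-passing implementation would typically need only one barrier per superstep.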
Storm and Apache Giraph
Apache Giraph: What is it?
• A loose implementation of Pregel
• Avery Ching: "We can't use it at Yahoo, that's too bad"
• Developed at Yahoo
• Runs on existing MapReduce infrastructure
• Netty-based communication instead of Hadoop RPC
• In-memory
• Fault tolerant
• Internal state is saved at user-defined intervals
• Master/slave architecture
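To give a feel for the Pregel-style, vertex-centric model that Giraph follows, here is a minimal single-process sketch in Python. It does not use the Giraph API; the function name `pregel_sssp`, the graph format, and the choice of single-source shortest paths as the example are assumptions for illustration. In every superstep, a vertex that received messages updates its value and sends new messages along its outgoing edges; vertices without messages stay halted, and the run ends when no messages are in flight.

```python
import math


def pregel_sssp(edges, source):
    """Single-source shortest paths in a vertex-centric, superstep-based style.

    edges maps every vertex to a list of (neighbour, weight) pairs.
    Returns the distance map and the number of supersteps executed.
    """
    dist = {v: math.inf for v in edges}
    inbox = {v: [] for v in edges}
    inbox[source] = [0]
    supersteps = 0
    while any(inbox.values()):
        outbox = {v: [] for v in edges}
        for vertex, messages in inbox.items():
            if not messages:
                continue              # no messages: the vertex stays halted
            best = min(messages)
            if best < dist[vertex]:
                dist[vertex] = best
                for neighbour, weight in edges[vertex]:
                    outbox[neighbour].append(best + weight)
        inbox = outbox                # barrier: messages become visible next superstep
        supersteps += 1
    return dist, supersteps


if __name__ == "__main__":
    graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
    print(pregel_sssp(graph, "a"))
```

Swapping `inbox` for `outbox` at the end of the loop plays the role of the barrier: messages sent in one superstep are only read in the next.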
Storm: What is it?
• Storm is a free and open source distributed real-time computation system
• Developed at BackType, open-sourced by Twitter in 2011
• Guaranteed data processing
• Horizontal scalability
• Fault tolerance
• ZeroMQ for message passing
• Processes unbounded sequences of tuples
• Groupings
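The tuple and grouping concepts can be sketched without the Storm API. The Python below is a single-process illustration only: `sentence_spout`, `split_bolt`, `fields_grouping`, and `run_topology` are hypothetical names, the input is a short fixed list rather than an unbounded stream, and the hash-modulo routing is an assumption about how a fields grouping might be realised. The point it shows is that a grouping keyed on a field sends tuples with equal field values to the same downstream task.

```python
import zlib
from collections import Counter

SENTENCES = [
    "storm processes streams",
    "giraph processes graphs",
    "storm scales out",
]


def sentence_spout():
    """Stand-in for a spout; a real one would keep emitting tuples."""
    for sentence in SENTENCES:
        yield (sentence,)


def split_bolt(tup):
    """Bolt that turns one sentence tuple into one tuple per word."""
    (sentence,) = tup
    for word in sentence.split():
        yield (word,)


def fields_grouping(key, num_tasks):
    """Pick a task from a stable hash of the key, so equal keys share a task."""
    return zlib.crc32(key.encode("utf-8")) % num_tasks


def run_topology(num_count_tasks=3):
    # One Counter per counting task; each stands in for a parallel bolt instance.
    counters = [Counter() for _ in range(num_count_tasks)]
    for tup in sentence_spout():
        for (word,) in split_bolt(tup):
            counters[fields_grouping(word, num_count_tasks)][word] += 1
    return counters


if __name__ == "__main__":
    for task_id, counter in enumerate(run_topology()):
        print(task_id, dict(counter))
```

Because every occurrence of a word lands on the same task, each task can keep its counts locally without coordinating with the others. A shuffle grouping would instead spread tuples evenly with no such guarantee.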
Storm: What is it for?
• Analyze, clean, normalize
• Real-time calculation
• Real-time ETL
• Failure detection from log files
• Machine data analysis
• IT early-warning systems, security and fraud detection
• Traffic information, DoS attacks
• Stream processing, continuous computation, distributed RPC
Our use case
POC: Processing machine data from sensors
Requirements:
• Real-time calculation
• Error detection
• Horizontal scalability
• Fast implementation
• High-availability
• Error prediction
Our use case, part 2
POC: Processing machine data from sensors
Solution:
• Chosen tool: Storm
• One spout for each sensor
• Spouts can be added and removed dynamically
• Error detection based on statistical calculations
• ~200 lines
• HA capability of Storm
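The slides do not say which statistics the POC used, so the following Python sketch is only one plausible reading of "error detection based on statistical calculations": a per-sensor rolling window where a reading is flagged if it lies more than a few standard deviations from the recent mean. The class `SensorMonitor`, the function `process`, and the window, warm-up, and threshold values are all assumptions.

```python
import statistics
from collections import deque


class SensorMonitor:
    """Flags readings that deviate strongly from a rolling window of recent values."""

    def __init__(self, window=20, threshold=3.0, warmup=5):
        self.readings = deque(maxlen=window)
        self.threshold = threshold
        self.warmup = warmup

    def observe(self, value):
        """Return True if the value is an outlier against the recent window."""
        is_error = False
        if len(self.readings) >= self.warmup:
            mean = statistics.mean(self.readings)
            spread = statistics.pstdev(self.readings)
            is_error = spread > 0 and abs(value - mean) > self.threshold * spread
        if not is_error:
            self.readings.append(value)   # outliers do not pollute the baseline
        return is_error


monitors = {}   # one monitor per sensor, created on first use


def process(sensor_id, value):
    """Route a reading to its sensor's monitor, creating the monitor if needed."""
    monitor = monitors.setdefault(sensor_id, SensorMonitor())
    return monitor.observe(value)


if __name__ == "__main__":
    for reading in [20.0, 20.5, 19.5, 20.2, 19.8, 20.1, 35.0, 20.0]:
        print(reading, process("sensor-1", reading))
```

Keeping one monitor per sensor id mirrors the "one spout for each sensor" idea: adding a sensor only adds an entry to the dictionary, and removing one is a `del monitors[sensor_id]`. A window with zero spread never raises an error in this sketch, which a real detector would probably want to handle separately.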
Conclusion
• Extend existing infrastructure
• Answers to new questions
• Re-think old problems
• New solutions, new features
• Happy customers/users
• $$$
What else to use?
• Yahoo S4 (Apache Incubator project)
• Apache Hama (Top level Apache project)
• GoldenOrb
• Signal/Collect
QUESTIONS & ANSWERS
borosg@starschema.net
www.starschema.net

Budapest Big Data Meetup Real-time stream processing
