Systems for Big Data

      Srihari Srinivasan
       ThoughtWorks
Web 2.x

Platform for software development
It's Distributed and Service Oriented

It's Machine Learning

It's Cloud Computing
Coming up


•   Going distributed - When and Why?

•   The landscape of Big Data systems - What are the apps?
When do we go distributed?

•   A truly distributed design is usually a second/third generation solution

    •   Amazon started off as a simple web application talking to a
        database 15+ years ago

    •   Twitter started out as a simple Ruby on Rails application talking to
        MySQL in 2006
When do we go distributed?

•   As the application’s / organization’s complexity grows

•   Data or request volume is too large for a single machine

•   Your software needs to be deployed in multiple data centers

•   Your teams deliver software in the form of services


    Courtesy: Jeff Dean’s LADIS 2009 keynote
Designing systems for scale


•   Many production grade systems have been built and written about in
    recent times

•   Need for a taxonomy that describes the big data systems landscape
A taxonomy for distributed systems

•   Distributed Storage Systems

•   Distributed Applications

•   Monitoring & Management

•   Personalization & Recommendation
Distributed Storage


•   Distributed Filesystems

•   Distributed/Parallel Databases

•   Messaging and Notification engines
Distributed Filesystems


•   Allow multiple networked hosts to access shared files

•   Clients don’t access the underlying block storage directly; they go
    through a network protocol

•   Modern DFSs are good at providing replication & fault tolerance
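
The split between a metadata service and replicated block storage can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any real DFS is implemented: the class `MetadataServer`, its methods `allocate` and `locate`, the node names, and the round-robin placement policy are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataServer:
    """Toy metadata service: records which data nodes hold each file block."""

    data_nodes: list[str]
    replication: int = 3
    block_map: dict[tuple[str, int], list[str]] = field(default_factory=dict)
    _next: int = 0  # round-robin cursor (assumed placement policy)

    def allocate(self, path: str, num_blocks: int) -> None:
        """Assign every block of `path` to `replication` distinct data nodes."""
        n = len(self.data_nodes)
        copies = min(self.replication, n)
        for block in range(num_blocks):
            self.block_map[(path, block)] = [
                self.data_nodes[(self._next + i) % n] for i in range(copies)
            ]
            self._next = (self._next + 1) % n

    def locate(self, path: str, block: int,
               failed: frozenset[str] = frozenset()) -> list[str]:
        """Return the replicas of a block that are still reachable."""
        return [node for node in self.block_map[(path, block)]
                if node not in failed]


meta = MetadataServer(data_nodes=["dn1", "dn2", "dn3", "dn4"])
meta.allocate("/logs/day1", num_blocks=2)
print(meta.locate("/logs/day1", 0))                             # all replicas
print(meta.locate("/logs/day1", 0, failed=frozenset({"dn1"})))  # dn1 is down
```

A client asks the metadata server where a block lives and then reads it from any live replica, so losing one data node does not make the file unavailable.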
Distributed Databases
•   A database engine that allows storage and retrieval across different
    machines in a network. The term is often used for NoSQL databases,
    although not every NoSQL database is distributed.

•   Apache Hive, Amazon Dynamo, HadoopDB, FB Cassandra, Google
    Bigtable

•   They tend to be non-relational, distributed, open-source and
    horizontally scalable

•   Often schema-free, with easy replication support and eventual
    consistency (BASE over ACID)
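
Horizontal scaling by primary key can be sketched as follows. This is a hedged illustration only: `owners` and `ToyCluster` are hypothetical names, the store is in memory, and the simple modulo placement is an assumption chosen for brevity. Production systems such as Dynamo use consistent hashing instead, so that adding a node does not remap most keys. The sketch shows partitioning and replication and does not model eventual consistency.

```python
import hashlib
from typing import Optional


def owners(key: str, nodes: list[str], replicas: int = 2) -> list[str]:
    """Pick the nodes responsible for `key` by hashing the primary key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    count = min(replicas, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(count)]


class ToyCluster:
    """In-memory stand-in for a partitioned, replicated key-value store."""

    def __init__(self, nodes: list[str], replicas: int = 2) -> None:
        self.nodes = nodes
        self.replicas = replicas
        self.storage: dict[str, dict[str, str]] = {n: {} for n in nodes}

    def put(self, key: str, value: str) -> None:
        # Write the value to every replica that owns the key.
        for node in owners(key, self.nodes, self.replicas):
            self.storage[node][key] = value

    def get(self, key: str, down: frozenset[str] = frozenset()) -> Optional[str]:
        # Read from the first owning replica that is still reachable.
        for node in owners(key, self.nodes, self.replicas):
            if node not in down and key in self.storage[node]:
                return self.storage[node][key]
        return None


cluster = ToyCluster(["node-a", "node-b", "node-c"])
cluster.put("user:42", "Ada")
primary = owners("user:42", cluster.nodes)[0]
print(cluster.get("user:42", down=frozenset({primary})))  # served by a replica
```

Because every key maps to a small set of nodes, capacity grows by adding machines, and a read can still succeed when the first owner is down.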
Distributed Apps

•   Data parallel programming frameworks

•   Graph processing engines

•   P2P content delivery

•   Multi-tenant SaaS applications

•   Content delivery networks
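
To make the first item concrete, here is a minimal single-machine sketch of the MapReduce style of data parallel programming, using word count as the example. It is an illustration under assumptions, not a real framework: the function names are hypothetical, and a thread pool stands in for the cluster of workers that an actual system would schedule.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk: str) -> list[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs_per_chunk: list[list[tuple[str, int]]]) -> dict[str, list[int]]:
    """Group all emitted values by key, as the framework would between phases."""
    groups: dict[str, list[int]] = defaultdict(list)
    for pairs in pairs_per_chunk:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups: dict[str, list[int]]) -> dict[str, int]:
    """Combine the values collected for each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks: list[str], workers: int = 4) -> dict[str, int]:
    # Each chunk is mapped independently, which is what makes the job parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


counts = word_count(["big data systems", "Big distributed systems"])
print(counts)
```

The map step needs no coordination between chunks, so the same program shape scales from threads on one machine to many machines once the shuffle moves data over the network.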
Monitoring and Management


•   Distributed debuggers, tracers and profiling applications

•   Monitoring systems
Personalization & Recommendation


•   Recommendation engines

•   Sentiment analyzers

•   Personalized news & content discovery systems
</presentation>
         Visit
www.systemswemake.com

   Follow on Twitter
  @systems_we_make


Editor's notes

  7. A distributed design is usually a second- or third-generation architectural choice. Most systems begin as a simple application running on a web server and talking to an RDBMS.

     Examples:

     Amazon started out more than 15 years ago as an application built on this simple web-app-talking-to-a-database architecture. This C++ application, called Obidos, evolved to hold all the business logic, all the display logic, and all the functionality that Amazon is now famous for: similarities, recommendations, reviews, etc. For years the scaling efforts at Amazon focused on making the back-end databases scale to hold more items, more customers, and more orders, and to support multiple international sites. This went on until 2001, when it became clear that this monolithic application couldn't scale any further.

     Twitter launched in 2006 and hit the mainstream in 2008. It started out as a simple Ruby on Rails application talking to a MySQL database.
  8. The watershed moment in the history of most large-scale systems comes when their teams stop thinking of the system as a simple web app and start viewing it as a fully distributed, decentralized platform.

     The benefits of this approach are many. First, service orientation offers a whole new level of isolation. By clearly articulating ownership boundaries, this isolation lets teams take greater control and ownership of the services they develop and operate. It also allows dependencies to be clearly specified, makes testing easier, and allows small teams to work independently.

     The move to SOA also prevents direct access to the underlying data store used by a service, which leaves us free to change the service's underlying implementation without modifying the rest of the system. This comes in really handy when you want to make reliability and scalability improvements without involving your clients. As long as the information contracts and SLAs are adhered to, the clients of a service do not have to be involved in the process.
  9. The best way to learn more about distributed systems is to look at the real, production-grade systems that have been built in recent times. In the remainder of this talk I am going to walk through the landscape of Big Data systems. Along the way I'll also describe some of the key systems in each category, and hopefully one of them will arouse your curiosity and spark further experimentation in the area of your choice.
  12. A DFS is any file system that allows files to be accessed from multiple hosts over a computer network. This makes it possible for multiple users on multiple machines to share files and storage resources. Clients don't have direct access to the underlying block storage; instead they interact with it over the network using a protocol. Some of the more modern DFSs have started to include facilities for replication and fault tolerance. In fact, replication and fault tolerance account for the bulk of their features.

      Trends in this space:

      Several research-ware DFSs have been built and continue to be redesigned in light of feedback and experience. But the systems that brought DFS into the mainstream were first the Google File System (GFS) and, following closely on its heels, the Hadoop Distributed File System (HDFS). The differentiating aspect of GFS is that its design was motivated by actual application workload characteristics. Both GFS and HDFS (an open-source implementation inspired by GFS) are designed around a master-slave model. The master, which is responsible for managing metadata, is called the Name Node in HDFS terminology, and the slaves, which actually store the data, are called Data Nodes. The whole system has only one Name Node, with which multiple Data Nodes coordinate.

      Most distributed file systems have, or are exploring, truly distributed implementations of the namespace/name node. For instance, the Ceph filesystem implements the metadata service as a cluster of name nodes. GFS has also moved to a distributed namespace implementation with hundreds of namespace servers, each master managing about 100 million files. So exploring newer ways to scale the name node/metadata service could be a good topic for active research.
  13. Just as with a DFS, a distributed database is an engine that allows storage and retrieval of records across different machines in a network rather than on just one node. Nowadays the term "distributed database" is also used to describe the NoSQL class of databases, although only some NoSQL databases are distributed. Examples of distributed databases are Hive, HadoopDB, Amazon's Dynamo, Apache Cassandra and Google Bigtable/Megastore.

      These databases mostly share the traits of being non-relational, distributed, open-source and horizontally scalable. The movement began in early 2009 and is growing rapidly. Often more characteristics apply, such as: schema-free, easy replication support, a simple API, eventually consistent / BASE (not ACID), a huge amount of data, and more.

      Trends in this space:

      By and large this space is quite hyperactive. Both the industry and the open-source community are creating more domain-specific or, in some cases, data-access-pattern-specific storage engines. Amazon's Dynamo was motivated by the fact that about 70% of data across the whole platform was accessed by primary key. The Dynamo paper describes an absolutely fabulous and seminal piece of work. Strongly recommended for distributed systems people.

      While this trend continues, the current NoSQL market satisfies the three characteristics of a monopolistically competitive market: the barriers to entry and exit are low; there are many small suppliers; and these suppliers produce technically heterogeneous, highly differentiated products. So, as you can see, the conditions are not ripe for perfect competition to occur. Hence in the long run monopolistically competitive firms will make zero economic profit.

      In the early 1970s, the database world was in a similar sorry state. The landscape changed radically when Ted Codd proposed a new data model, and with it the basis for a structured query language (SQL), built on the mathematical concept of relations and foreign-/primary-key relationships. Codd's relational model and SQL allowed implementations from different vendors to be (near) perfect substitutes, and hence provided the conditions for perfect competition.

      Today, the relational database market is a classic example of an oligopoly. The market has a few large players (Oracle, IBM, Microsoft, MySQL), the barriers to entry are high, and all existing SQL-based relational database products are largely indistinguishable. Oligopolies can retain high profits in the long run; today the database industry is worth an estimated $32 billion and is still growing in the double digits.

      So the million-dollar question is: can someone come up with a mathematical model for NoSQL databases? There is already work out there: "A co-Relational Model of Data for Large Shared Data Banks", Erik Meijer and Gavin Bierman, Microsoft.

      We need more such models for the different categories of NoSQL databases.
  14. This is another hyperactive area that is witnessing unprecedented growth. Some themes include:

      1) Distributed crawlers
      2) Data parallel programming frameworks such as MapReduce, Dryad, Haystack etc.
      3) Large-scale graph processing engines: Pregel from Google and HipG lead the pack
      4) Peer-to-peer architectures: Spotify uses a peer-to-peer architecture to deliver large-scale, low-latency, on-demand music streaming. The service is not web-based, but instead uses a proprietary client and protocol. At the heart of the system is a custom music streaming protocol that is optimized for accessing a large library of tracks.
      5) Multi-tenant SaaS applications: Salesforce.com
      6) Content delivery networks