Systems for Big Data Processing

Srihari Srinivasan

Web 2.x
Platform for software development

• It’s Machine Learning

• It’s Cloud Computing

• It’s Distributed and Service Oriented

Coming up

• Going distributed - When and Why?

• The landscape of Big Data systems - What are the apps?

When do we go distributed?

• A truly distributed design is usually a second/third generation solution

• Amazon started off as a simple web application talking to a database
15+ years ago

• Twitter started out as a simple Ruby on Rails application talking to
MySQL in 2006

When do we go distributed?

• As the application’s / organization’s complexity grows

• Data or request volume is too large for a single machine

• Your software needs to be deployed in multiple data centers

• Your teams deliver software in the form of services

Courtesy: Jeff Dean’s LADIS 2009 keynote

Designing systems for scale

• Many production grade systems have been built and written about in
recent times

• Need for a taxonomy that describes the big data systems landscape

A taxonomy for distributed systems

• Distributed Storage Systems

• Distributed Applications

• Monitoring & Management

• Personalization & Recommendation

Distributed Storage

• Distributed Filesystems

• Distributed/Parallel Databases

• Messaging and notification engines

Distributed Filesystems

• Allow clients to access files from multiple networked hosts

• Clients don’t access the underlying block storage directly; they go through
network protocols

• Modern DFSs are good at providing replication & fault tolerance

Distributed Databases

• A database engine that allows storage and retrieval across different
machines in a network; the term is now also used for many NoSQL databases

• Apache Hive, Amazon Dynamo, HadoopDB, FB Cassandra, Google
Bigtable

• They tend to be non-relational, distributed, open-source and
horizontally scalable

• They are schema-free, offer easy support for replication, and are eventually
consistent (BASE over ACID)
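
To make "horizontally scalable" concrete, here is a minimal sketch of consistent hashing, one widely used way to spread primary keys over a set of nodes (the Dynamo paper describes a variant of it). The slides do not prescribe this technique; the `HashRing` class, the node names and the `vnodes` parameter are all assumptions made for illustration.

```python
import bisect
import hashlib
from typing import List


def _hash(value: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring (hypothetical helper, not a real library API)."""

    def __init__(self, nodes: List[str], vnodes: int = 64) -> None:
        # Each node owns several points ("virtual nodes") to even out load.
        self._ring = sorted(
            (_hash(f"{node}:{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Return the node owning the first ring point after the key's hash."""
        index = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[index][1]


if __name__ == "__main__":
    before = HashRing(["node-a", "node-b", "node-c"])
    after = HashRing(["node-a", "node-b", "node-c", "node-d"])
    keys = [f"user-{i}" for i in range(1000)]
    moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
    print(f"{len(moved)} of {len(keys)} keys moved after adding node-d")
```

The point of the design is that adding a node relocates only the keys that land on the new node; every other key stays where it was, which is what lets such a store grow one machine at a time.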

Distributed Apps

• Data-parallel programming frameworks

• Graph processing engines

• P2P content delivery

• Multi-tenanted SaaS applications

• Content delivery networks
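
As an illustration of the first bullet, the following is a single-process sketch of the map, shuffle and reduce phases that data-parallel frameworks in the MapReduce style run across many machines. It is not tied to any real framework; the function names are hypothetical and everything runs in memory.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(document: str) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    # In a real cluster each document (or split) would be mapped on a
    # different worker; here the loop stands in for that parallelism.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data systems", "Big systems"]))
```

Because `map_phase` looks at one document at a time and `reduce_phase` at one key at a time, both phases can be spread over machines without the user code changing.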

Monitoring and Management

• Distributed debuggers, tracers and profiling applications

• Monitoring systems

Personalization & Recommendation

• Recommendation engines

• Sentiment analyzers

• Personalized news & content discovery systems

</presentation>
Visit
www.systemswemake.com

Follow on Twitter
@systems_we_make

Srihari Srinivasan, ThoughtWorks

Editor's notes

A distributed design is usually a second- or third-generation architectural choice. Most systems begin their journey as a simple application running on a web server talking to an RDBMS.

Examples:

Amazon started out more than 15 years ago as an application based on this simple web-app-talking-to-a-database architecture. This C++ application, called Obidos, evolved to hold all the business logic, all the display logic, and all the functionality that Amazon is now famous for: similarities, recommendations, reviews, etc. For years the scaling efforts at Amazon focused on making the back-end databases scale to hold more items, more customers and more orders, and to support multiple international sites. This went on until 2001, when it became clear that this monolithic application couldn't scale anymore.

Twitter launched in 2006 and hit the mainstream in 2008. It started out as a simple Ruby on Rails application talking to a MySQL database.

The watershed moment in the history of most large-scale systems is when their teams stop thinking of the system as just a simple web app and instead shift to a view of the system as a fully distributed, decentralized platform.

The benefits of this approach are many. Firstly, service orientation offers a whole new level of isolation. By clearly articulating ownership boundaries, this isolation enables teams to take greater control and ownership of the services they develop and operate. It also allows dependencies to be clearly specified, makes testing easier, and allows small teams to work independently.

The other aspect facilitated by the move to SOA is that preventing direct access to the underlying data store used by a service gives us the freedom to change the underlying implementation of the service without modifying the rest of the system. This comes in really handy when you want to make reliability and scalability improvements without having to involve your clients. As long as the information contracts and SLAs are adhered to, the clients of a service do not have to be involved in the process.

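
The point about hiding the data store behind a service can be sketched as follows. This is only an illustration under assumed names (`OrderStore`, `InMemoryStore`, `ShardedStore`, `OrderService` and `client_flow` are all hypothetical): the client code stays identical while the storage backend is swapped.

```python
from typing import Dict, List, Optional, Protocol


class OrderStore(Protocol):
    """The storage contract the service depends on (hypothetical)."""

    def put(self, order_id: str, item: str) -> None: ...

    def get(self, order_id: str) -> Optional[str]: ...


class InMemoryStore:
    """First-generation backend: everything in a single dictionary."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}

    def put(self, order_id: str, item: str) -> None:
        self._rows[order_id] = item

    def get(self, order_id: str) -> Optional[str]:
        return self._rows.get(order_id)


class ShardedStore:
    """Later backend: same contract, data spread over several shards."""

    def __init__(self, shard_count: int) -> None:
        self._shards: List[Dict[str, str]] = [{} for _ in range(shard_count)]

    def _shard(self, order_id: str) -> Dict[str, str]:
        # Deterministic toy placement: sum of the key's bytes modulo shard count.
        return self._shards[sum(order_id.encode("utf-8")) % len(self._shards)]

    def put(self, order_id: str, item: str) -> None:
        self._shard(order_id)[order_id] = item

    def get(self, order_id: str) -> Optional[str]:
        return self._shard(order_id).get(order_id)


class OrderService:
    """Clients call the service; they never touch the store directly."""

    def __init__(self, store: OrderStore) -> None:
        self._store = store

    def place_order(self, order_id: str, item: str) -> None:
        self._store.put(order_id, item)

    def lookup(self, order_id: str) -> Optional[str]:
        return self._store.get(order_id)


def client_flow(service: OrderService) -> Optional[str]:
    """Client code: unchanged regardless of which backend the service uses."""
    service.place_order("o-1", "book")
    return service.lookup("o-1")


if __name__ == "__main__":
    print(client_flow(OrderService(InMemoryStore())))
    print(client_flow(OrderService(ShardedStore(4))))
```

Swapping `InMemoryStore` for `ShardedStore` is invisible to `client_flow`, which is the freedom the notes describe: the service owner can change the implementation as long as the contract holds.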
The best way to learn more about distributed systems is by looking at the real, production-grade systems that have been built in recent times. In the remainder of this talk I am going to walk through what the landscape of Big Data systems looks like. Along the way I'll also describe some of the key systems in each of the categories, and hopefully one of them will arouse your curiosity and spark further experimentation in the area of your choice.

A DFS is any file system that allows access to files from multiple hosts over a computer network. This makes it possible for multiple users on multiple machines to share files and storage resources. Clients don't have direct access to the underlying block storage and instead interact over the network using a protocol. Some of the more modern DFSs have started to include facilities for replication and fault tolerance. In fact, replication and fault tolerance account for the bulk of their features.

Trends in this space:

Several research DFSs have been built and continue to be redesigned in light of feedback and experience. But the systems that brought DFSs into the mainstream were first the Google File System (GFS) and, following closely on its heels, the Hadoop Distributed File System (HDFS). The differentiating aspect of GFS is that its design was motivated by actual application workload characteristics. Both GFS and HDFS (an open-source implementation inspired by GFS) are designed around a master-slave model. The master, which is responsible for managing metadata, is called the Name Node in HDFS terminology, and the slaves, which actually store the data, are called Data Nodes. The whole system has only one Name Node, with which multiple Data Nodes coordinate.

Most distributed file systems have, or are exploring, truly distributed implementations of the namespace/name node. For instance, the Ceph file system implements the metadata service as a cluster of metadata servers. GFS has also moved to a distributed namespace implementation with hundreds of namespace servers, each master managing about 100 million files. Exploring newer ways of scaling the name node/metadata service could therefore be a good topic for active research.

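
The master-slave split described above can be sketched in a few lines. This is a toy, in-memory model and not the HDFS or GFS API: the `NameNode` and `DataNode` classes, the round-robin placement policy, the 4-byte block size and the `write_file`/`read_file` helpers are all assumptions chosen to keep the example small.

```python
from typing import Dict, FrozenSet, List, Tuple


class DataNode:
    """Worker that stores block contents; it knows nothing about file names."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[str, bytes] = {}


Layout = List[Tuple[str, List[DataNode]]]  # (block id, replica nodes) per block


class NameNode:
    """Single master that holds metadata only: file -> blocks -> replicas."""

    def __init__(self, data_nodes: List[DataNode], replication: int = 2) -> None:
        if replication > len(data_nodes):
            raise ValueError("replication factor exceeds number of data nodes")
        self.data_nodes = data_nodes
        self.replication = replication
        self.files: Dict[str, Layout] = {}
        self._cursor = 0  # round-robin placement, a toy policy

    def allocate(self, path: str, block_count: int) -> Layout:
        """Decide which data nodes hold each block of a new file."""
        layout: Layout = []
        for index in range(block_count):
            replicas = [
                self.data_nodes[(self._cursor + i) % len(self.data_nodes)]
                for i in range(self.replication)
            ]
            self._cursor += 1
            layout.append((f"{path}#{index}", replicas))
        self.files[path] = layout
        return layout

    def lookup(self, path: str) -> Layout:
        return self.files[path]


def write_file(master: NameNode, path: str, data: bytes, block_size: int = 4) -> None:
    """Client write: ask the master for placement, then send blocks to data nodes."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for (block_id, replicas), chunk in zip(master.allocate(path, len(chunks)), chunks):
        for node in replicas:
            node.blocks[block_id] = chunk


def read_file(master: NameNode, path: str, failed: FrozenSet[str] = frozenset()) -> bytes:
    """Client read: fetch each block from the first replica that is still alive."""
    data = b""
    for block_id, replicas in master.lookup(path):
        live = [node for node in replicas if node.name not in failed]
        if not live:
            raise IOError(f"all replicas of {block_id} are unavailable")
        data += live[0].blocks[block_id]
    return data


if __name__ == "__main__":
    nodes = [DataNode(f"dn{i}") for i in range(3)]
    master = NameNode(nodes, replication=2)
    write_file(master, "/logs/a.txt", b"hello world")
    print(read_file(master, "/logs/a.txt", failed=frozenset({"dn1"})))
```

With two replicas per block the file survives the loss of any one data node, while the single `NameNode` remains the scalability bottleneck that the distributed-namespace work mentioned above tries to remove.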
Just as with a DFS, a distributed database is an engine that allows storage and retrieval of records across different machines in a network rather than on just one node. Nowadays the term "distributed database" is also used to describe the NoSQL class of databases, although only some NoSQL databases are distributed. Examples of distributed databases are Hive, HadoopDB, Amazon's Dynamo, Apache Cassandra and Google Bigtable/Megastore.

These databases mostly share some of the following traits: they are non-relational, distributed, open-source and horizontally scalable. The movement began in early 2009 and is growing rapidly. Often more characteristics apply, such as being schema-free, easy replication support, a simple API, eventual consistency (BASE, not ACID), support for huge amounts of data, and more.

Trends in this space:

By and large this space is quite hyperactive. Both the industry and the open-source community are creating more domain-specific, or in some cases data-access-pattern-specific, storage engines. Amazon's Dynamo was motivated by the fact that about 70% of data across the whole platform was accessed by primary key. The Dynamo paper describes an absolutely fabulous and seminal piece of work. Strongly recommended for DS people.

While this trend continues, the current NoSQL market satisfies the three characteristics of a monopolistically competitive market: the barriers to entry and exit are low; there are many small suppliers; and these suppliers produce technically heterogeneous, highly differentiated products. So, as you can see, the conditions are not ripe for perfect competition to occur. Hence in the long run monopolistically competitive firms will make zero economic profit.

In the early 1970s, the database world was in a similar sorry state. The landscape changed radically when Ted Codd proposed a new data model, later paired with a structured query language (SQL), based on the mathematical concept of relations and foreign-/primary-key relationships. The relational model and SQL allowed implementations from different vendors to be (near) perfect substitutes, and hence provided the conditions for perfect competition.

Today, the relational database market is a classic example of an oligopoly. The market has a few large players (Oracle, IBM, Microsoft, MySQL), the barriers to entry are high, and all existing SQL-based relational database products are largely indistinguishable. Oligopolies can retain high profits in the long run; today the database industry is worth an estimated $32 billion and still growing in the double digits.

So the million-dollar question is: can someone come up with a mathematical model for NoSQL databases? There is already work out there: "A co-Relational Model of Data for Large Shared Data Banks" by Erik Meijer and Gavin Bierman, Microsoft.

We need more such models for different categories of NoSQL databases.

This is another hyperactive area that is witnessing unprecedented growth. Some themes include:

1) Distributed crawlers
2) Data-parallel programming frameworks such as MapReduce, Dryad, Haystack etc.
3) Large-scale graph processing engines: Pregel from Google and HipG lead the pack
4) Peer-to-peer architectures: Spotify uses a peer-to-peer architecture to deliver large-scale, low-latency, on-demand music streaming. The service is not web-based, but instead uses a proprietary client and protocol. At the heart of the system is a custom music streaming protocol that is optimized for accessing a large library of tracks.
5) Multi-tenanted SaaS applications: Salesforce.com
6) Content delivery networks
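
For theme 3, the vertex-centric "superstep" style associated with Pregel can be sketched on a single machine. The example below propagates the maximum value through a graph; it is an assumption-laden toy (the `propagate_max` function and the sample graph are hypothetical), not Pregel's actual API, and it requires every vertex to appear as a key of `graph`.

```python
from typing import Dict, List


def propagate_max(graph: Dict[str, List[str]], values: Dict[str, int]) -> Dict[str, int]:
    """Run supersteps until no messages remain; each vertex ends with the
    largest value among itself and every vertex that can reach it."""
    values = dict(values)

    # Superstep 0: every vertex sends its value along its outgoing edges.
    inbox: Dict[str, List[int]] = {vertex: [] for vertex in graph}
    for vertex, neighbours in graph.items():
        for neighbour in neighbours:
            inbox[neighbour].append(values[vertex])

    # Later supersteps: a vertex that learns a larger value adopts it and
    # forwards it; otherwise it stays silent (it "votes to halt").
    while any(inbox.values()):
        outbox: Dict[str, List[int]] = {vertex: [] for vertex in graph}
        for vertex, messages in inbox.items():
            if messages and max(messages) > values[vertex]:
                values[vertex] = max(messages)
                for neighbour in graph[vertex]:
                    outbox[neighbour].append(values[vertex])
        inbox = outbox
    return values


if __name__ == "__main__":
    chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
    print(propagate_max(chain, {"a": 3, "b": 6, "c": 2, "d": 1}))
```

Each vertex only reads its own inbox and writes to its neighbours' inboxes, so in a distributed engine the vertices can be partitioned across machines and the supersteps synchronized with a barrier.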