3. whoami
Computer Science & Engineering at Ohio State: Artificial Intelligence, Programming Languages, Systems Engineering
Applied Technical Systems: Hierarchical, non-relational data storage and analysis systems (no-sql before there was NoSQL); Information Retrieval, Wire Serialization/RPC (before there was Thrift/Avro), Data Visualization (GB's)
Visible Technologies: Social Media Storage, Processing, Analytics; Monitoring, Engagement, Warehousing, and BI (TB's)
Drawn to Scale: Big Data Storage, Processing, Retrieval, Analytics (TB's, PB's)
22. vertical partitioning
[diagram: the user base split into "villages of people", each village owning its own data server and fronted by its own app servers]
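a minimal sketch of that routing idea in Python - users hashed into fixed "villages", each owning its users' data (server names and the modulo scheme are illustrative assumptions, not from the deck):

VILLAGES = ["data-server-a", "data-server-b", "data-server-c", "data-server-d"]

def route(user_id: int) -> str:
    # every request for a given user lands on the same self-contained shard
    return VILLAGES[user_id % len(VILLAGES)]

assert route(42) == route(42)  # stable: user 42 always hits the same village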
no central point of organization
no committee or standardizing body
no plan/strategy/illuminati to take down the RDBMS; lots of "in-fighting"
central tenet - there IS NO one-size-fits-all
unlike RDBMS assumptions, each engineering effort must be evaluated for data needs
is it “anti-RDBMS”?
not so much
will not magically solve all your data or performance problems
applications won’t magically stop crashing or corrupting data
Big Data is still hard. These tools make it possible/affordable/approachable
data persistence comes down to guarantees
why are we here?
"web scale"
more users, content, connections
more trends, insight, knowledge
Atomicity: fault-tolerance is moving to the application layer - smaller atomic units
Consistency: yes! but not necessarily immediate - "availability" (latency, reads) is more important.
Isolation: smaller atomic units (multi-step transaction vs. compare-and-swap; see the sketch after this list), greater availability, denormalization => reduced dependency on isolation
Durability: some things are more important than getting every last detail, e.g. latency of response, view in aggregate
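a hedged sketch of that compare-and-swap idea - one tiny atomic section plus a retry loop, instead of one wide multi-step transaction (class and function names are illustrative):

import threading

class Register:
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_swap(self, expected, new):
        # the only atomic unit: succeed iff nobody changed the value under us
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def increment(reg):
    while True:  # retry on conflict instead of holding a long transaction
        seen = reg._value
        if reg.compare_and_swap(seen, seen + 1):
            return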
Basically Available: is the data layer up or not? are we serving content to our users or not?
Soft State: shifting burden of "correctness" up to application layer. availability is more important than precision. accuracy (correct) vs. precision (repeatable).
Eventual Consistency: all operations are recorded and ordered. played back as resources permit.
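eventual consistency as code - a hedged sketch where every write lands in an ordered log and each replica replays it when resources permit (all structures here are illustrative):

log = []                      # globally ordered operation log

def record(op):
    log.append(op)            # writes are always accepted ("available")

class Replica:
    def __init__(self):
        self.state = {}
        self.applied = 0      # how far into the log we've replayed

    def catch_up(self):       # run whenever resources permit
        while self.applied < len(log):
            key, value = log[self.applied]
            self.state[key] = value
            self.applied += 1

record(("user:1", "alice"))
record(("user:1", "alicia"))  # later write wins once replayed

a, b = Replica(), Replica()
a.catch_up()                  # a is current; b lags...
b.catch_up()                  # ...but converges to the same state
assert a.state == b.state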
agile dev moves too fast for schema and constraints - this isn’t waterfall
data models change quickly
up-front schema modeling is akin to waterfall development - not always practical/feasible/possible
data is messy - record what you have and leave constraints up to the application
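a sketch of "constraints up to the application": records of any shape are stored as-is, and the app filters and validates at read time (field names are assumptions for illustration):

import json

store = []  # stand-in for a schemaless document store

def save(record: dict):
    store.append(json.dumps(record))   # no schema enforced on write

def load_users():
    for raw in store:
        doc = json.loads(raw)
        if "name" in doc:              # the constraint lives in app code
            yield doc

save({"name": "ada", "email": "ada@example.com"})
save({"name": "lin", "twitter": "@lin"})  # different shape, still fine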
at scale, data services look like a DHT anyway! (see the consistent-hashing sketch after this list)
isolated independent services
introduced caching layers
partitioned data by logical and range boundaries.
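the consistent-hashing sketch referenced above - keys hash onto a ring and belong to the next node clockwise, so partition boundaries move minimally as nodes come and go (node names are illustrative):

import bisect
import hashlib

def _hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self._points = sorted((_hash(n), n) for n in nodes)
        self._keys = [h for h, _ in self._points]

    def node_for(self, key: str) -> str:
        # first node clockwise from the key's position on the ring
        i = bisect.bisect(self._keys, _hash(key)) % len(self._points)
        return self._points[i][1]

ring = Ring(["cache-1", "cache-2", "cache-3"])
print(ring.node_for("user:1234"))  # same key -> same node, no central lookup table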
webapp
app servers/session self-contained - load-balanced
data’s in one spot - what do you do?
37signals approach - DHH: “scaling is a good thing because scaling => users => $$$”
more users, more instances. easy!
doesn’t work for social applications:
- users cannot interact
- old MMO’s vs. new social games
redesign data server as “data services”
separate independent logical components
knowing each service by name becomes “vexing”
configuration/logistical nightmare!
abstractions!
wouldn’t it be nice if...
Distributed Computing Made Easy Less Hard
programming model/API for parallel computing
Google's MapReduce paper
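the canonical word-count example from the MapReduce paper, collapsed into a single-process sketch - the programmer supplies map and reduce; the real framework handles distribution, the shuffle, and fault tolerance:

from collections import defaultdict

def map_fn(document):
    for word in document.split():
        yield word, 1

def reduce_fn(word, counts):
    return word, sum(counts)

def run(documents):
    groups = defaultdict(list)
    for doc in documents:            # "map" phase
        for k, v in map_fn(doc):
            groups[k].append(v)      # "shuffle": group values by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(run(["to be or not to be"]))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}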
replicated, high throughput, fairly UNIX-y (not POSIX).
Google File System (GFS) paper
Distributed Group Services - coordination, synchronization, configuration, naming.
Google Chubby Paper
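a hedged sketch using ZooKeeper (the open-source counterpart to Chubby) via the kazoo client - naming and locking in a few lines; assumes a ZooKeeper server on 127.0.0.1:2181 and the kazoo package installed, with paths and ids made up for illustration:

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# naming/configuration: register this service under a well-known path;
# the ephemeral node disappears automatically if this process dies
zk.ensure_path("/services/user-store")
zk.create("/services/user-store/node", b"10.0.0.5:9090",
          ephemeral=True, sequence=True)

# synchronization: a distributed lock, like Chubby's coarse-grained locks
lock = zk.Lock("/locks/schema-migration", "worker-1")
with lock:
    pass  # only one process in the cluster runs this block at a time

zk.stop()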
efficient, cross-language messaging
Facebook/Apache Thrift
Google Protobufs
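not actual Thrift or Protobuf wire format - just a hand-rolled sketch of the idea behind both: typed fields packed in network byte order with a length prefix, decodable from any language (the record layout is invented for illustration):

import struct

def encode_user(user_id: int, name: str) -> bytes:
    name_bytes = name.encode("utf-8")
    # > network (big-endian) order; q signed 64-bit id; H 16-bit name length
    return struct.pack(">qH", user_id, len(name_bytes)) + name_bytes

def decode_user(buf: bytes):
    user_id, name_len = struct.unpack_from(">qH", buf)
    name = buf[10:10 + name_len].decode("utf-8")  # fixed header is 10 bytes
    return user_id, name

assert decode_user(encode_user(42, "ada")) == (42, "ada")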
Google BigTable
addresses limitations of raw M/R and HDFS access
request by key vs. HDFS sequential reads
low-latency (ms) response times vs. high-latency M/R
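a toy illustration of that point-lookup win: rows kept sorted by key make a read a binary search rather than a sequential scan or an M/R pass (keys and values here are made up):

import bisect

keys   = ["row:0001", "row:0500", "row:0999"]   # sorted row keys
values = ["alice",    "bob",      "carol"]

def get(key):
    i = bisect.bisect_left(keys, key)           # O(log n), not O(n)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None                                  # absent key

assert get("row:0500") == "bob"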