3. whoami
Computer Science & Engineering at Ohio State: Artificial Intelligence, Programming Languages, Systems Engineering
Applied Technical Systems: Hierarchical, non-relational data storage and analysis systems (no-sql before there was NoSQL); Information Retrieval, Wire Serialization/RPC (before there was Thrift/Avro), Data Visualization (GB's)
Visible Technologies: Social Media Storage, Processing, Analytics; Monitoring, Engagement, Warehousing, and BI (TB's)
Drawn to Scale: Big Data Storage, Processing, Retrieval, Analytics (TB's, PB's)
22. vertical partitioning
[diagram: the user base split into "villages of people", each village owning its own data server and fronted by its own app servers]
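a minimal sketch of that routing idea in Python - users hashed into fixed "villages", each owning its users' data (server names and the modulo scheme are illustrative assumptions, not from the deck):

VILLAGES = ["data-server-a", "data-server-b", "data-server-c", "data-server-d"]

def route(user_id: int) -> str:
    # every request for a given user lands on the same self-contained shard
    return VILLAGES[user_id % len(VILLAGES)]

assert route(42) == route(42)  # stable: user 42 always hits the same village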
no central point of organization
no committee or standardizing body
no plan/strategy/illuminati to take down the RDBMS; lots of "in-fighting"
central tenet - there IS NO one-size-fits-all
unlike RDBMS assumptions, each engineering effort must be evaluated for data needs
is it “anti-RDBMS”?
not so much
will not magically solve all your data or performance problems
applications won’t magically stop crashing or corrupting data
Big Data is still hard. These tools make it possible/affordable/approachable
data persistence comes down to guarantees
why are we here?
"web scale"
more users, content, connections
more trends, insight, knowledge
Atomicity: fault-tolerance is moving to the application layer - smaller atomic units
Consistency: yes! but not necessarily immediate - "availability" (latency, reads) is more important.
Isolation: smaller atomic units (multi-step transaction vs. compare-and-swap; see the sketch after this list), greater availability, denormalization => reduced dependency on isolation
Durability: some things are more important than getting every last detail, e.g. latency of response, view in aggregate
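a hedged sketch of that compare-and-swap idea - one tiny atomic section plus a retry loop, instead of one wide multi-step transaction (class and function names are illustrative):

import threading

class Register:
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_swap(self, expected, new):
        # the only atomic unit: succeed iff nobody changed the value under us
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def increment(reg):
    while True:  # retry on conflict instead of holding a long transaction
        seen = reg._value
        if reg.compare_and_swap(seen, seen + 1):
            return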
Basically Available: is the data layer up or not? are we serving content to our users or not?
Soft State: shifting burden of "correctness" up to application layer. availability is more important than precision. accuracy (correct) vs. precision (repeatable).
Eventual Consistency: all operations are recorded and ordered. played back as resources permit.
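eventual consistency as code - a hedged sketch where every write lands in an ordered log and each replica replays it when resources permit (all structures here are illustrative):

log = []                      # globally ordered operation log

def record(op):
    log.append(op)            # writes are always accepted ("available")

class Replica:
    def __init__(self):
        self.state = {}
        self.applied = 0      # how far into the log we've replayed

    def catch_up(self):       # run whenever resources permit
        while self.applied < len(log):
            key, value = log[self.applied]
            self.state[key] = value
            self.applied += 1

record(("user:1", "alice"))
record(("user:1", "alicia"))  # later write wins once replayed

a, b = Replica(), Replica()
a.catch_up()                  # a is current; b lags...
b.catch_up()                  # ...but converges to the same state
assert a.state == b.state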
agile dev moves too fast for schema and constraints - this isn’t waterfall
data models change quickly
up-front schema modeling is akin to waterfall development - not always practical/feasible/possible
data is messy - record what you have and leave constraints up to the application
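a sketch of "constraints up to the application": records of any shape are stored as-is, and the app filters and validates at read time (field names are assumptions for illustration):

import json

store = []  # stand-in for a schemaless document store

def save(record: dict):
    store.append(json.dumps(record))   # no schema enforced on write

def load_users():
    for raw in store:
        doc = json.loads(raw)
        if "name" in doc:              # the constraint lives in app code
            yield doc

save({"name": "ada", "email": "ada@example.com"})
save({"name": "lin", "twitter": "@lin"})  # different shape, still fine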
at scale, data services look like a DHT anyway! (see the consistent-hashing sketch after this list)
isolated independent services
introduced caching layers
partitioned data by logical and range boundaries.
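the consistent-hashing sketch referenced above - keys hash onto a ring and belong to the next node clockwise, so partition boundaries move minimally as nodes come and go (node names are illustrative):

import bisect
import hashlib

def _hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self._points = sorted((_hash(n), n) for n in nodes)
        self._keys = [h for h, _ in self._points]

    def node_for(self, key: str) -> str:
        # first node clockwise from the key's position on the ring
        i = bisect.bisect(self._keys, _hash(key)) % len(self._points)
        return self._points[i][1]

ring = Ring(["cache-1", "cache-2", "cache-3"])
print(ring.node_for("user:1234"))  # same key -> same node, no central lookup table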
webapp
app servers/session self-contained - load-balanced
data’s in one spot - what do you do?
37signals approach - DHH: “scaling is a good thing because scaling => users => $$$”
more users, more instances. easy!
doesn’t work for social applications:
- users cannot interact
- old MMO’s vs. new social games
redesign data server as “data services”
separate independent logical components
knowing each service by name becomes “vexing”
configuration/logistical nightmare!
abstractions!
wouldn’t it be nice if...
Distributed Computing Made Easy Less Hard
programming model/API for parallel computing
Google's MapReduce paper
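the canonical word-count example from the MapReduce paper, collapsed into a single-process sketch - the programmer supplies map and reduce; the real framework handles distribution, the shuffle, and fault tolerance:

from collections import defaultdict

def map_fn(document):
    for word in document.split():
        yield word, 1

def reduce_fn(word, counts):
    return word, sum(counts)

def run(documents):
    groups = defaultdict(list)
    for doc in documents:            # "map" phase
        for k, v in map_fn(doc):
            groups[k].append(v)      # "shuffle": group values by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(run(["to be or not to be"]))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}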
replicated, high throughput, fairly UNIX-y (not POSIX).
Google File System (GFS) paper
Distributed Group Services - coordination, synchronization, configuration, naming.
Google Chubby Paper
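a hedged sketch using ZooKeeper (the open-source counterpart to Chubby) via the kazoo client - naming and locking in a few lines; assumes a ZooKeeper server on 127.0.0.1:2181 and the kazoo package installed, with paths and ids made up for illustration:

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# naming/configuration: register this service under a well-known path;
# the ephemeral node disappears automatically if this process dies
zk.ensure_path("/services/user-store")
zk.create("/services/user-store/node", b"10.0.0.5:9090",
          ephemeral=True, sequence=True)

# synchronization: a distributed lock, like Chubby's coarse-grained locks
lock = zk.Lock("/locks/schema-migration", "worker-1")
with lock:
    pass  # only one process in the cluster runs this block at a time

zk.stop()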
efficient, cross-language messaging
Facebook/Apache Thrift
Google Protobufs
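not actual Thrift or Protobuf wire format - just a hand-rolled sketch of the idea behind both: typed fields packed in network byte order with a length prefix, decodable from any language (the record layout is invented for illustration):

import struct

def encode_user(user_id: int, name: str) -> bytes:
    name_bytes = name.encode("utf-8")
    # > network (big-endian) order; q signed 64-bit id; H 16-bit name length
    return struct.pack(">qH", user_id, len(name_bytes)) + name_bytes

def decode_user(buf: bytes):
    user_id, name_len = struct.unpack_from(">qH", buf)
    name = buf[10:10 + name_len].decode("utf-8")  # fixed header is 10 bytes
    return user_id, name

assert decode_user(encode_user(42, "ada")) == (42, "ada")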
Google BigTable
addresses limitations of raw M/R and HDFS access
request by key vs. HDFS sequential reads
low-latency (ms) response times vs. high-latency M/R
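a toy illustration of that point-lookup win: rows kept sorted by key make a read a binary search rather than a sequential scan or an M/R pass (keys and values here are made up):

import bisect

keys   = ["row:0001", "row:0500", "row:0999"]   # sorted row keys
values = ["alice",    "bob",      "carol"]

def get(key):
    i = bisect.bisect_left(keys, key)           # O(log n), not O(n)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None                                  # absent key

assert get("row:0500") == "bob"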