2. Unix
•
When Unix appeared in the early 1970s, it was not just a
new system, but a new way of thinking about systems
•
Instead of a sealed monolith, the operating system was
a collection of small, easily understood programs
•
First Edition Unix (1971) contained many programs that
we still use today (ls, rm, cat, mv)
•
Its very name conveyed this minimalist aesthetic: Unix is
a homophone of “eunuchs” — a castrated Multics
We were a bit oppressed by the big system mentality. Ken
wanted to do something simple. — Dennis Ritchie
3. Unix: Let there be light
•
In 1969, Doug McIlroy had the idea of connecting
different components:
At the same time that Thompson and Ritchie were sketching
out a file system, I was sketching out how to do data
processing on the blackboard by connecting together
cascades of processes
•
This was the primordial pipe, but it took three years to
persuade Thompson to adopt it:
And one day I came up with a syntax for the shell that went
along with the piping, and Ken said, “I’m going to do it!”
4. Unix: ...and there was light
And the next morning we had this
orgy of one-liners. — Doug McIlroy
5. The Unix philosophy
•
The pipe — coupled with the small-system aesthetic —
gave rise to the Unix philosophy, as articulated by Doug
McIlroy:
•
Write programs that do one thing and do it well
•
Write programs to work together
•
Write programs that handle text streams, because
that is a universal interface
•
Four decades later, this philosophy remains the single
most important revolution in software systems thinking!
6. Doug McIlroy v. Don Knuth: FIGHT!
•
In 1986, Jon Bentley posed the challenge that became
the Epic Rap Battle of computer science history:
Read a file of text, determine the n most frequently used
words, and print out a sorted list of those words along with
their frequencies.
•
Don Knuth’s solution: an elaborate program in WEB, a
Pascal-like literate programming system of his own
invention, using a purpose-built algorithm
•
Doug McIlroy’s solution shows the power of the Unix
philosophy:
tr -cs A-Za-z '\n' | tr A-Z a-z |
sort | uniq -c | sort -rn | sed ${1}q
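The pipeline reads text on standard input and takes n as its first argument. A minimal sketch of how it might be wrapped and exercised follows; the function name topwords and the sample sentence are illustrative assumptions, not part of the original solution.

```shell
# topwords N: print the N most frequent words on stdin, with counts.
# The function name and the sample input are illustrative assumptions.
topwords() {
    tr -cs A-Za-z '\n' |   # turn every run of non-letters into one newline
    tr A-Z a-z |           # fold upper case to lower case
    sort |                 # bring identical words together
    uniq -c |              # collapse each run into "count word"
    sort -rn |             # highest count first
    sed "${1}q"            # quit after N lines
}

printf 'the cat and the dog and the bird\n' | topwords 2
```

Each stage does one small job and hands a text stream to the next, which is why the whole program fits in six commands.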
7. Big Data: History repeats itself?
•
The original Google MapReduce paper (Dean and Ghemawat,
OSDI ’04) poses a problem disturbingly similar to
Bentley’s challenge nearly two decades prior:
Count of URL Access Frequency: The map function processes
logs of web page requests and outputs ⟨URL, 1⟩. The
reduce function adds together all values for the same URL
and emits a ⟨URL, total count⟩ pair
•
But the solutions do not adhere to the Unix philosophy...
•
...and nor do they make use of the substantial Unix
foundation for data processing
•
e.g., Appendix A of the OSDI ’04 paper has a 71-line
word count in C++, with nary a wc in sight
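For comparison, the URL access frequency problem quoted above can be sketched on a single machine with the standard toolset. This assumes a Common Log Format style input in which the requested URL is the seventh whitespace-separated field; the function name urlfreq and the sample log lines are made up for illustration.

```shell
# urlfreq: count requests per URL from access-log lines on stdin.
# Assumes the URL is field 7 (Common Log Format); the function name
# and the sample log lines below are illustrative assumptions.
urlfreq() {
    awk '{ print $7 }' |   # "map": emit the URL of each request
    sort |                 # group identical URLs together
    uniq -c |              # "reduce": total count per URL
    sort -rn               # most requested first
}

urlfreq <<'EOF'
10.0.0.1 - - [01/Jan/2013:00:00:01 +0000] "GET /index.html HTTP/1.1" 200 512
10.0.0.2 - - [01/Jan/2013:00:00:02 +0000] "GET /about.html HTTP/1.1" 200 128
10.0.0.1 - - [01/Jan/2013:00:00:03 +0000] "GET /index.html HTTP/1.1" 200 512
EOF
```

The sort between the two stages plays the role of the shuffle: it guarantees that all records for one key arrive together at the counting step.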
8. Big Data: Challenges
•
Must be able to scale storage to allow for “big data” —
quantities of data that dwarf a single machine
•
Must allow for massively parallel execution
•
Must allow for multi-tenancy
•
To make use of both the Unix philosophy and its toolset,
must be able to virtualize the operating system
9. Scaling storage
•
There are essentially three protocols for scalable
storage: block, file and object
•
Block (i.e., a SAN) is far too low an abstraction — and
notoriously expensive to scale
•
File (i.e., NAS) is too permissive an abstraction — it
implies a coherent store for arbitrary (partial) writes,
trying (and failing) to be both C and A in CAP
•
Object (e.g., S3) is similar “enough” to a file-based
abstraction, but by not allowing partial writes, allows for
proper CAP tradeoffs
10. Object storage
•
Object storage systems do not allow for partial updates
•
For both durability and availability, objects are generally
erasure encoded across spindles on different nodes
•
A different approach is to have a highly reliable local file
system that erasure encodes across local spindles,
with entire objects duplicated across nodes for
availability
•
ZFS pioneered both reliability and efficiency of this
model with RAID-Z, and has refined it over the past
decade of production use
•
ZFS is one of the four foundational technologies in
Joyent’s open source SmartOS
11. Virtualizing the operating system?
•
Historically — since the 1960s — systems have been
virtualized at the level of hardware
•
Hardware virtualization has its advantages, but it’s
heavyweight: operating systems are not designed to
share resources like DRAM, CPU, I/O devices, etc.
•
One can instead virtualize at the level of the operating
system: a single OS kernel that creates lightweight
containers — on the metal, but securely partitioned
•
Pioneered by BSD’s jails; taken to a logical extreme by
zones found in Joyent’s SmartOS
12. Idea: ZFS + Zones?
•
Can we combine the efficiency and reliability of ZFS
with the abstraction provided by zones to develop an
object store that has compute as a first-class citizen?
•
ZFS rollback allows for zones to be trashed: simply
roll back the zone after compute completes on an object
•
Add a job scheduling system that allows for both map
and reduce phases of distributed work
•
Would allow for the Unix toolset to be used on arbitrarily
large amounts of data, unlocking big data one-liners
•
If it perhaps seems obvious now, it wasn’t at the time...
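The rollback idea above can be sketched with the zfs command line: snapshot a dataset, let a task scribble on it, then roll it back. This is only an illustration of the pattern, not how any particular system implements it. The dataset name, the snapshot name, and the helper run_then_rollback are hypothetical, and the ZFS variable defaults to a dry run that merely prints the commands.

```shell
# Dry run by default: print the zfs commands instead of executing them.
# Set ZFS=zfs on a system with a real pool to run them for real.
ZFS=${ZFS:-echo zfs}

# run_then_rollback DATASET COMMAND...: snapshot the dataset, run the
# command, then roll the dataset back so the command's changes vanish.
# The helper name and the "pristine" snapshot name are hypothetical.
run_then_rollback() {
    dataset=$1; shift
    $ZFS snapshot "$dataset@pristine" || return 1
    "$@"; status=$?
    $ZFS rollback "$dataset@pristine"
    return "$status"
}

# Hypothetical dataset backing a compute zone.
run_then_rollback zones/compute0/data echo "compute runs here"
```

Because rollback discards everything written since the snapshot, the task needs no cleanup logic of its own.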
14. Manta: ZFS + Zones!
•
Building a sophisticated distributed system on top of
ZFS and zones, we have built Manta, an internet-facing
object storage system offering in situ compute
•
That is, the description of compute can be brought to
where objects reside instead of having to backhaul
objects to transient compute
•
The abstractions made available for computation are
anything that can run on the OS...
•
...and as a reminder, the OS — Unix — was built around
the notion of ad hoc unstructured data processing, and
allows for remarkably terse expressions of computation
15. Manta: Unix for Big Data
•
Manta allows for an arbitrarily scalable variant of
McIlroy’s solution to Bentley’s challenge:
mfind -t o /bcantrill/public/v7/usr/man |
mjob create -o -m "tr -cs A-Za-z '\n' |
tr A-Z a-z | sort | uniq -c" -r \
"awk '{ x[\$2] += \$1 }
END { for (w in x) { print x[w] \" \" w } }' |
sort -rn | sed ${1}q"
•
This description is not only terse, it is high performing: data
is left at rest, with the “map” phase doing heavy
reduction of the data stream
•
As such, Manta — like Unix — is not merely syntactic
sugar; it converges compute and data in a new way
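The shape of the job above can be tried on one machine by running the map command once per file and feeding the concatenated results to a single reduce. The sketch below is a local simulation only and does not talk to Manta; map_phase, reduce_phase, and the two sample files are illustrative names.

```shell
# Local, single-machine simulation of the job's two phases. This is not
# Manta itself; the function names and sample files are illustrative.
map_phase() {      # runs once per object: per-file word counts
    tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c
}

reduce_phase() {   # runs once over all map output: sum counts per word
    awk '{ x[$2] += $1 } END { for (w in x) print x[w] " " w }' |
    sort -rn | sed "${1}q"
}

dir=$(mktemp -d)
printf 'the pipe and the shell\n' > "$dir/a.txt"
printf 'the kernel and the pipe and the file\n' > "$dir/b.txt"

# One map per file, then one reduce over everything the maps emitted.
for f in "$dir"/*.txt; do
    map_phase < "$f"
done | reduce_phase 2

rm -rf "$dir"
```

The map output is already one line per distinct word per file, so the reduce sees far less data than the raw text, which is the "heavy reduction of the data stream" the slide refers to.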
16. Manta: CAP tradeoffs
•
Eventual consistency represents the wrong CAP
tradeoffs for most; we prefer consistency over
availability for writes (but still availability for reads)
•
Many more details:
http://dtrace.org/blogs/dap/2013/07/03/fault-tolerance-in-manta/
•
Celebrity endorsement:
17. Manta: Other design principles
•
Hierarchical storage is an excellent idea (ht: Multics);
Manta implements proper directories, delimited with a
forward slash
•
Manta implements a snapshot/link hybrid dubbed a
snaplink; can be used to effect versioning
•
Manta has full support for CORS headers
•
Manta uses SSH-based HTTP auth for client-side
tooling (IETF draft-cavage-http-signatures-00)
•
Manta SDKs exist for node.js, Java, Ruby, Python
•
“npm install manta” for command line interface
18. Manta and the future of big data
•
We believe compute/data convergence to be the future
of big data: stores of record must support computation
as a first-class, in situ operation
•
We believe that Unix is a natural way of expressing this
computation — and that the OS is the right level at
which to virtualize to support this securely
•
We believe that ZFS is the only sane storage
substrate for such a system
•
Manta will surely not be the only system to represent the
confluence of these — but it is the first
•
We are actively retooling our software stack in terms of
Manta — Manta is changing the way we develop
software!
19. Manta: More information
•
Product page:
http://joyent.com/products/manta
•
node.js module:
https://github.com/joyent/node-manta
•
Manta documentation:
http://apidocs.joyent.com/manta/
•
IRC, e-mail, Twitter, etc.:
#manta on freenode, manta@joyent.com, @mcavage,
@dapsays, @yunongx, @joyent
•
Here’s to the orgy of big data one-liners!