8. Blob Storage
[Diagram: a BlobStore account, adrian@googlestorage, with containers "Love Letters" and "Movies"; putBlob stores blobs such as Tron, The One, Shrek, Goonies, and The Blob, with user metadata like 3d = true and url = http://disney.go.com/tron]
11. Big data pipelines with scale-out on the cloud
@tiborkisstibor
12. bioinformatic pipelines
- Usually require high CPU
- Continuously increasing data volumes
- Complex algorithms on top of large datasets
14. challenges of SaaS building
Hadoop cluster startup/shutdown
- Cluster startup problems
- Automatic cluster shutdown strategies
Hadoop cluster monitoring on the cloud
- System monitoring
- Consumption-based monitoring
Data transfer paths
- AWS Import -> S3 -> hdfs -> S3 -> AWS Export
- ACL settings for clients' buckets
- S3 <=> hdfs transfers
15. where did we start?
30GB file @max 16MB/s upload to S3
32 minutes
1TB file @max 16MB/s upload to S3
18.2 hours
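(The arithmetic, in binary units: 30GB ÷ 16MB/s = 1,920s ≈ 32 minutes; 1TB ÷ 16MB/s = 65,536s ≈ 18.2 hours.)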
16. where did we end up?
30GB file @max 100MB/s upload to S3
5 minutes (down from 32)
1TB file @max 100MB/s upload to S3
2.9 hours (down from 18.2)
17. How did we get there?
Add multi-part upload support
Optimize slicing
Optimize parallel upload strategy
Find big guns (bigger instances with faster networks)
18. Multi-Part upload
Large Blobs cannot be sent in a single request in most
BlobStores (ex. 5GB max per request in S3).
Large transfers are likely to fail at inconvenient
positions, and without resume.
Multi-part uploads allow you to send slices of a
payload, which the server assembles later.
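A minimal sketch of that flow, with the three helper methods as placeholders for illustration (initiate/uploadPart/complete are not a real client API):

    import java.util.ArrayList;
    import java.util.List;

    public class MultipartFlowSketch {
        // Placeholders, not a real client API.
        static String initiate(String container, String name) { return "upload-id"; }
        static String uploadPart(String uploadId, int n, byte[] slice) { return "etag-" + n; }
        static void complete(String uploadId, List<String> etags) { /* server assembles parts in order */ }

        public static void main(String[] args) {
            String uploadId = initiate("movies", "tron");          // step 1: open the upload
            byte[][] slices = { new byte[32], new byte[32] };      // pretend 32-byte "parts"
            List<String> etags = new ArrayList<>();
            for (int n = 0; n < slices.length; n++)
                etags.add(uploadPart(uploadId, n + 1, slices[n])); // step 2: each part can be retried alone
            complete(uploadId, etags);                             // step 3: commit; server stitches the parts
        }
    }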
19. Slicing
Each upload part must advance to the appropriate
position in the source payload efficiently.
Payload slice(Payload input, long offset, long length);
ex. NettyPayloadSlicer uses ChunkedFileInputStream
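A sketch of efficient slicing for a file-backed payload, assuming Guava's ByteStreams.limit (which jclouds bundles) to bound the stream; illustrative only, not the jclouds implementation:

    import com.google.common.io.ByteStreams;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.channels.Channels;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class SliceSketch {
        // Position the channel directly at the offset rather than
        // reading-and-discarding bytes, then cap the stream at `length`.
        static InputStream slice(Path file, long offset, long length) throws IOException {
            FileChannel ch = FileChannel.open(file, StandardOpenOption.READ).position(offset);
            return ByteStreams.limit(Channels.newInputStream(ch), length);
        }
    }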
20. Slicing Algorithm
A Blob can be sliced into a maximum number of parts,
and these parts have min and max sizes.
up to 3.2GB, converge on 32MB parts
then increase part size approaching max (5GB)
then continue at max part size or overflow
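A simplified sketch of that heuristic (constants follow the slide; the real jclouds slicing algorithm is more involved):

    public class PartSizeSketch {
        static final long MB = 1024L * 1024, GB = 1024 * MB;
        static final long DEFAULT_PART = 32 * MB; // default 32MB parts
        static final long MAX_PART = 5 * GB;      // S3's per-part ceiling

        static long choosePartSize(long blobSize) {
            if (blobSize <= 100 * DEFAULT_PART)   // up to 3.2GB: stay at 32MB parts
                return DEFAULT_PART;
            long part = blobSize / 100;           // hold ~100 parts, growing the part size
            return Math.min(part, MAX_PART);      // past ~500GB, continue at the 5GB max
        }

        public static void main(String[] args) {
            System.out.println(choosePartSize(30 * GB));   // ~307MB parts
            System.out.println(choosePartSize(1024 * GB)); // capped at 5GB parts
        }
    }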
21. Upload Strategy
Start sequential, stabilize, then parallelize
SequentialMultipartUploadStrategy
Simpler, less likely to fail, easier to retry, little to optimize outside chunk size
ParallelMultipartUploadStrategy
Much better throughput, but needs tuning: degree of
parallelism, retries & error handling
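A hedged sketch of the parallel side: a bounded pool with per-part retries. The names here (Slice, uploadPart, DEGREE, MAX_RETRIES) are illustrative, not the jclouds ParallelMultipartUploadStrategy API:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelUploadSketch {
        static final int DEGREE = 4, MAX_RETRIES = 3;   // the tuning knobs named above

        record Slice(long offset, long length) {}

        static String uploadPart(Slice s) throws IOException {
            return "etag-" + s.offset();                // stand-in for the real part PUT
        }

        public static void main(String[] args) throws Exception {
            List<Slice> slices = List.of(new Slice(0, 32), new Slice(32, 32));
            ExecutorService pool = Executors.newFixedThreadPool(DEGREE); // bounds the degree
            List<Future<String>> etags = new ArrayList<>();
            for (Slice slice : slices)
                etags.add(pool.submit(() -> {
                    for (int attempt = 1; ; attempt++) {        // per-part retry loop
                        try { return uploadPart(slice); }
                        catch (IOException e) { if (attempt == MAX_RETRIES) throw e; }
                    }
                }));
            for (Future<String> etag : etags)
                System.out.println(etag.get());  // completing the upload needs every part's ETag
            pool.shutdown();
        }
    }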
24. Is this as good as it gets?
10GigE should be able to do 1280MB/s
cc1.4xlarge has been measured up to ~560MB/s local
but we’re only getting ~100MB/s sustained
25. So, where do we go now?
zero-copy transfer
more work on slice algorithms
tools and integrations (ex. hdfs)
add implementations for other blobstores
26. Wanna play?
blobStore.putBlob("movies", blob, multipart());
(put-blob *blobstore* "movies" blob
  :multipart? true)
or just visit the jclouds-examples repo on GitHub:
- blobstore-largeblob
- blobstore-hdfs
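For context around that one-liner, a minimal end-to-end sketch assuming the jclouds 1.x API of the era (BlobStoreContextFactory, blobBuilder, PutOptions.Builder.multipart); the file path and credentials handling are hypothetical:

    import static org.jclouds.blobstore.options.PutOptions.Builder.multipart;

    import java.io.File;
    import org.jclouds.blobstore.BlobStore;
    import org.jclouds.blobstore.BlobStoreContext;
    import org.jclouds.blobstore.BlobStoreContextFactory;
    import org.jclouds.blobstore.domain.Blob;

    public class LargeBlobSketch {
        public static void main(String[] args) {
            // "aws-s3" is the provider id; credentials passed in args for brevity
            BlobStoreContext context = new BlobStoreContextFactory()
                .createContext("aws-s3", args[0], args[1]);
            try {
                BlobStore blobStore = context.getBlobStore();
                Blob blob = blobStore.blobBuilder("tron")
                    .payload(new File("/tmp/tron.mp4")) // large local file (hypothetical path)
                    .build();
                blobStore.putBlob("movies", blob, multipart()); // slices + uploads in parts
            } finally {
                context.close();
            }
        }
    }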