3. 3
Open source software for reliable,
scalable, distributed computing
HDFS - A distributed file system for
“large” I/O
YARN – A framework for resource
scheduling and management
MapReduce – Popular paradigm for
parallel batch processing
Apache Hadoop
4. 4
Enterprise grade Hadoop distribution
• Cluster Management and Monitoring
• Bulk Data Loader
• Extensions for Virtualization
Advanced Database Services
• World’s Fastest SQL on Hadoop
• 100% SQL Compliance
Pivotal Hadoop
5. 5
Provision Hadoop resources to power data-centric Cloud
Foundry Apps
• Park unstructured data on HDFS.
• Execute batch processing via MapReduce.
• Perform deep, complex analytics in SQL using HAWQ.
Pivotal HD for Cloud Foundry
24. 24
Pivotal HD
• Deployable by BOSH
• Exposed as a Cloud Foundry service
Data intensive apps are coming
Only possible through extensibility of Cloud Foundry
Conclusion
As a user of Cloud Foundry, you’re probably aware of its openness as it pertains to avoiding vendor lock-in. Equally fundamental is the notion that the Cloud Foundry PaaS itself can be extended to enhance its value proposition.
It’s through exactly this extensibility that Pivotal CF allows Hadoop to exist as a complementary service for application developers. With this “enhanced” PaaS, the application developer gets more than hosting support, domain names, and single-node services like Postgres or MySQL: he or she can now leverage large Hadoop clusters for analytics. But how exactly does this work? What is this extensibility we’re talking about?
At the core of this extensibility is the communication between the Cloud Controller and a Service Broker, whose responsibility is to negotiate service capabilities on behalf of the other nodes comprising the service, whatever that service may be. This communication establishes three exchanges:
• Catalog Management – Declaring which services are available and which variants of them CF administrators can request. Think of shared MySQL servers versus dedicated MySQL servers.
• Provisioning – The act of reserving resources on the cluster.
• Binding – The act of granting particular apps access to the cluster.
What’s key to note about this communication is the flexibility of the protocol, which allows provisioning to be service-defined. This becomes critical as we start to look at what it means to treat Hadoop as a service. For a complex, distributed service like Hadoop, there are many possible configurations and use cases in an enterprise. We’ll start with what’s likely the most accessible and straightforward approach to provisioning Hadoop.
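Before turning to that approach, the catalog, provisioning, and binding exchanges described above can be sketched as a toy in-memory broker. This is only an illustration under stated assumptions: the class name `HadoopServiceBroker`, the service name `hadoop`, the plan names, and the credential fields are all hypothetical, and there is no HTTP layer or real cluster behind it. In the Cloud Foundry v2 Service Broker API, the three methods would roughly correspond to `GET /v2/catalog`, `PUT /v2/service_instances/:instance_id`, and `PUT /v2/service_instances/:instance_id/service_bindings/:binding_id`.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class HadoopServiceBroker:
    """Toy broker: keeps state in memory, talks to no real cluster."""

    instances: Dict[str, dict] = field(default_factory=dict)
    bindings: Dict[str, dict] = field(default_factory=dict)

    def catalog(self) -> dict:
        # Catalog management: advertise the service and its variants (plans).
        # Service and plan names here are hypothetical.
        return {
            "services": [
                {
                    "name": "hadoop",
                    "plans": [{"name": "shared"}, {"name": "dedicated"}],
                }
            ]
        }

    def provision(self, instance_id: str, plan: str) -> dict:
        # Provisioning: reserve resources for a new service instance.
        known_plans = {
            p["name"] for s in self.catalog()["services"] for p in s["plans"]
        }
        if plan not in known_plans:
            raise ValueError(f"unknown plan: {plan}")
        if instance_id in self.instances:
            raise ValueError(f"instance already exists: {instance_id}")
        self.instances[instance_id] = {"plan": plan}
        return self.instances[instance_id]

    def bind(self, instance_id: str, binding_id: str, app_id: str) -> dict:
        # Binding: hand a particular app the details it needs for access.
        if instance_id not in self.instances:
            raise KeyError(instance_id)
        credentials = {"hdfs_path": f"/services/{instance_id}"}  # hypothetical
        self.bindings[binding_id] = {
            "instance_id": instance_id,
            "app_id": app_id,
            "credentials": credentials,
        }
        return {"credentials": credentials}
```

What `provision` actually reserves is left entirely to the broker, which is the service-defined flexibility the protocol allows.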
Our first of many variants of Hadoop-as-a-service consists of a shared, static HDFS cluster that gets BOSH-deployed, along with the service broker, on the same infrastructure that your Cloud Foundry PaaS was deployed on.
In this model, the provision request is received by the Service Broker and propagated to the various sub-components of the cluster.
Ultimately, the act of provisioning will have reserved resources on each of the Hadoop components. For example, on HDFS, some amount of space will have been reserved on the filesystem; and with HAWQ, a database will have been created to house SQL data. The ensuing bind requests will allow apps to gain access to the HDFS subfolder, to that HAWQ database, and so on.
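A minimal sketch of that propagation, assuming each Hadoop component exposes a small provision/bind interface to the broker. The class names, the `/services/<id>` folder layout, the quota value, and the database naming scheme are hypothetical stand-ins; a real broker would issue HDFS and HAWQ commands where this code only returns dictionaries.

```python
from typing import Dict, List


class HdfsComponent:
    name = "hdfs"

    def __init__(self, quota_gb: int = 10):
        self.quota_gb = quota_gb  # hypothetical per-instance space reservation

    def provision(self, instance_id: str) -> dict:
        # Stand-in for creating a subfolder and reserving space on HDFS.
        return {"path": f"/services/{instance_id}", "quota_gb": self.quota_gb}

    def bind(self, instance_id: str, app_id: str) -> dict:
        # Stand-in for granting the app access to the instance's subfolder.
        return {"hdfs_path": f"/services/{instance_id}"}


class HawqComponent:
    name = "hawq"

    def _database(self, instance_id: str) -> str:
        return "db_" + instance_id.replace("-", "_")

    def provision(self, instance_id: str) -> dict:
        # Stand-in for creating a database to house SQL data.
        return {"database": self._database(instance_id)}

    def bind(self, instance_id: str, app_id: str) -> dict:
        # Stand-in for granting the app access to that database.
        return {"hawq_database": self._database(instance_id)}


def provision_all(components: List, instance_id: str) -> Dict[str, dict]:
    # The broker fans one provision request out to every sub-component.
    return {c.name: c.provision(instance_id) for c in components}


def bind_all(components: List, instance_id: str, app_id: str) -> dict:
    # Each component contributes its share of the credentials for the app.
    credentials: dict = {}
    for c in components:
        credentials.update(c.bind(instance_id, app_id))
    return credentials
```

Keeping each component behind the same two methods means another part of the cluster could be added to the list without changing the fan-out logic.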
Shared cluster that is BOSH deployed side-by-side with your CF.
Stepping stone to dynamic MapReduce queries, namely the ability to, through a simple API, spin up the cluster, send the MapReduce job, execute it, return the analysis, and tear down the cluster.
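The spin-up, run, and tear-down lifecycle described here could look roughly like the sketch below. Everything in it is hypothetical: `EphemeralCluster` only flips a flag instead of deploying anything, and the "job" is an ordinary Python function run locally, standing in for a MapReduce job. The point is only the shape of the API, where teardown is guaranteed even if the job fails.

```python
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, List


class EphemeralCluster:
    """Hypothetical stand-in for a cluster created on demand."""

    def __init__(self, nodes: int):
        self.nodes = nodes
        self.running = False

    def start(self) -> None:
        self.running = True  # a real version would deploy the cluster here

    def stop(self) -> None:
        self.running = False  # a real version would release the resources

    def run_job(self, job: Callable, data: List[str]):
        if not self.running:
            raise RuntimeError("cluster is not running")
        return job(data)


@contextmanager
def cluster(nodes: int) -> Iterator[EphemeralCluster]:
    c = EphemeralCluster(nodes)
    c.start()
    try:
        yield c
    finally:
        c.stop()  # tear down even when the job raises


def word_count(lines: List[str]) -> Dict[str, int]:
    # Local stand-in for a MapReduce word-count job.
    counts: Dict[str, int] = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def run_once(job: Callable, data: List[str], nodes: int = 3):
    # Spin up, send the job, execute, return the analysis, tear down.
    with cluster(nodes) as c:
        return c.run_job(job, data)
```

A caller would only see `run_once(word_count, lines)`; the cluster's lifetime stays hidden behind that single call.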