Hw09 Next Steps For Hadoop

Next Steps for Hadoop

Doug Cutting
Cloudera

Proviso
● Linus Torvalds:
● “Whatever they contribute.”
● diverse set of contributors
● central planning impossible

The Dream
● faster, more reliable, available
● of course
● spreadsheet-like interfaces
● provide non-programmers
● with powerful, interactive tools
● easier sharing
● of data & hardware resources

Requirements
● security
● facilitate sharing of resources
● stable cross-language APIs
● facilitate diverse tools & apps
● expressive, inter-operable data
● facilitates sharing of datasets
● facilitates dynamic analyses

Data Formats
● today in Hadoop:
● text
– pro: inter-operable
– con: not expressive, inefficient
● Java Writable
– pro: expressive, efficient
– con: platform-specific, fragile
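
The size side of that trade-off can be seen with a single integer. The sketch below is illustrative only and is not Hadoop code. It assumes the text form is one decimal number per line, and it mimics the four-byte big-endian layout that Java's `DataOutput.writeInt` produces, which is what an `IntWritable` writes. The function names are made up for this example.

```python
import struct

def encode_text(n):
    # Text form: decimal digits plus a newline. Any tool can read it,
    # but nothing in the bytes says "this is a 32-bit integer".
    return f"{n}\n".encode("ascii")

def decode_text(data):
    # Reading text means parsing the digits back into a number.
    return int(data.decode("ascii"))

def encode_binary(n):
    # Binary form: four bytes, big-endian, signed, the same layout
    # as Java's DataOutput.writeInt.
    return struct.pack(">i", n)

def decode_binary(data):
    return struct.unpack(">i", data)[0]

n = 1234567890
as_text = encode_text(n)
as_binary = encode_binary(n)
print(len(as_text), len(as_binary))
```

The binary form is smaller and needs no parsing, but a reader has to know out of band that these four bytes are an integer, which is the fragility the slide points to.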

Protocol Buffers & Thrift
● expressive
● efficient (small & fast)
● but not very dynamic
● cannot browse arbitrary data
● no DESCRIBE or SHOW
● viewing a new dataset
– requires code generation & load
● writing a new dataset
– requires generating schema text
– plus code generation & load

Avro Data
● as expressive
● smaller and faster
● dynamic
● schema stored with data
– but factored out of instances
● API permits reading & creating
– arbitrary datatypes
– without generating & loading code
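
A rough sketch of what "schema stored with data, but factored out of instances" can look like. It uses only the Python standard library, not the Avro library. The zigzag varint encoding of longs and the length-prefixed UTF-8 strings follow Avro's binary encoding, but the file layout here (schema, then record count, then records) is a simplified assumption and is not Avro's actual container format. The `PageView` schema and all function names are hypothetical.

```python
import io
import json

# Hypothetical schema, written as JSON the way Avro schemas are.
SCHEMA = {
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "url", "type": "string"},
        {"name": "hits", "type": "long"},
    ],
}

def write_long(out, n):
    # Zigzag-encode the signed value, then emit it as a base-128 varint.
    z = (n << 1) ^ (n >> 63)
    while z & ~0x7F:
        out.write(bytes([(z & 0x7F) | 0x80]))
        z >>= 7
    out.write(bytes([z]))

def read_long(inp):
    shift = 0
    z = 0
    while True:
        b = inp.read(1)[0]
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)  # undo zigzag

def write_string(out, s):
    data = s.encode("utf-8")
    write_long(out, len(data))  # length prefix, then the bytes
    out.write(data)

def read_string(inp):
    length = read_long(inp)
    return inp.read(length).decode("utf-8")

WRITERS = {"long": write_long, "string": write_string}
READERS = {"long": read_long, "string": read_string}

def write_file(schema, records):
    out = io.BytesIO()
    write_string(out, json.dumps(schema))  # schema stored once, up front
    write_long(out, len(records))
    for rec in records:
        for field in schema["fields"]:
            # Instances carry values only: no field names, no type tags.
            WRITERS[field["type"]](out, rec[field["name"]])
    return out.getvalue()

def read_file(data):
    inp = io.BytesIO(data)
    schema = json.loads(read_string(inp))  # discover the schema at read time
    count = read_long(inp)
    records = [
        {f["name"]: READERS[f["type"]](inp) for f in schema["fields"]}
        for _ in range(count)
    ]
    return schema, records

data = write_file(SCHEMA, [{"url": "/a", "hits": 3}, {"url": "/b", "hits": -1}])
schema, records = read_file(data)
print([f["name"] for f in schema["fields"]], records)
```

`read_file` needs no knowledge of `PageView` ahead of time: it learns the field names and types from the header, which is what makes browsing an unfamiliar dataset possible without a code generation step.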

Avro Data
● includes a file format
● includes a textual encoding
● handles versioning
● if schema changes
● can still process data
● Hadoop apps can
● upgrade from text
● and standardize on Avro for data
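
One way to picture "if schema changes, can still process data": the reader keeps the schema the data was written with and maps each record onto the schema it expects. The sketch below is a simplified assumption that covers only added and removed record fields. Avro's own resolution rules cover more cases, and the `resolve` function and the schemas are hypothetical names for illustration.

```python
def resolve(record, writer_schema, reader_schema):
    """Map a record decoded with the writer's schema onto the reader's schema."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            out[name] = record[name]          # field known to both sides
        elif "default" in field:
            out[name] = field["default"]      # field added later: use its default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    # Fields only the writer knew about are dropped.
    return out

# Hypothetical schema versions.
V1 = {"type": "record", "name": "PageView", "fields": [
    {"name": "url", "type": "string"},
    {"name": "hits", "type": "long"},
]}
V2 = {"type": "record", "name": "PageView", "fields": [
    {"name": "url", "type": "string"},
    {"name": "hits", "type": "long"},
    {"name": "referrer", "type": "string", "default": "unknown"},
]}
V3 = {"type": "record", "name": "PageView", "fields": [
    {"name": "url", "type": "string"},
]}

old_record = {"url": "/a", "hits": 3}
upgraded = resolve(old_record, V1, V2)   # new reader, old data
trimmed = resolve(old_record, V1, V3)    # reader that no longer wants "hits"
print(upgraded, trimmed)
```

Because the writer's schema travels with the data, old files stay readable after the application's schema moves on.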

Avro RPC
● leverage versioning support
● to permit different versions of services to interoperate
● for Hadoop services, will
● provide cross-language access
● let apps talk to clusters running different versions

Avro Status
● 1.1 release out
● added JSON and comparators
● 1.2 soon
● adds HTTP & UDP-based RPC
● will first appear in Hadoop 0.21
● as format for job history
● in sequence files

Avro Near Future
● full MapReduce support
● used for RPC in Hadoop 0.22 (1.0)?

Thanks!
What are your next steps?