© Cloudera, Inc. All rights reserved.
Introducing RecordService
Lenni Kuff
RecordService is a distributed,
scalable, data access service for
unified authorization in Hadoop.
Motivation
• As the Hadoop ecosystem expands, new components continue to be added
• Speaks to the overall flexibility of Hadoop
• This is good - more functionality, more workloads, more use cases.
• As use cases for Hadoop mature, user requirements and expectations increase:
• Security
• Performance
• Compatibility
• The flexibility of Hadoop has come at the cost of increased complexity
[Figure: Storage and Compute layers]
Example: Security
Challenge: Provide unified fine-grained security across compute frameworks
• Integrating a consistent security layer into every component is not scalable.
• Securing data at the file level precludes fine-grained access control (column/row)
• File ACLs are not enough - a user can view all or nothing.
• Currently, must split files and duplicate data - large operational cost.
Solution: Add a level of abstraction - a secure service to access datasets in “record” format
• Can now apply fine-grained constraints on projection of dataset
• Same access control policy can be applied uniformly across compute frameworks; decoupled from the underlying storage layer
Introducing RecordService
RecordService - Overview
• Simplifies
• Provides a higher-level, logical abstraction for data (i.e. tables or views)
• Returns schemed objects (instead of paths and bytes). No need for applications to worry about storage APIs and file formats.
• HCatalog? Similar concept - RecordService is secure and performant. Plan to support HCatalog as a data model on RecordService.
• Secures
• Central location for all authorization checks using Sentry metadata.
• Secure service that does not execute arbitrary user code
• Accelerates
• Unified data access path allows platform-wide performance improvements.
Architecture
• Runs as a distributed service: Planner Servers & Worker Servers
• Servers do not store any state
• Easy HA, fault tolerance.
• Planner Servers responsible for request planning
• Retrieve and combine metadata (NN, HMS, Sentry)
• Split generation -> Creates tasks for workers
• Performs authorization
• Worker Servers read from storage and construct records.
• IO, file parsing, predicate evaluation
• Runs as the “source” for a DAG computation
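The division of labor above can be sketched in Java. This is an illustration only: `PlannerWorkerSketch`, `Task`, `plan`, and `runTask` are hypothetical names, not the RecordService API. The `grants` map stands in for Sentry metadata, `blockHosts` for NameNode block locations, and `storage` for files on disk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch of the planner/worker split; not the RecordService API.
class PlannerWorkerSketch {
    // One unit of work for a worker: which block to read and where it lives.
    static final class Task {
        final String table;
        final int blockId;
        final String preferredHost;

        Task(String table, int blockId, String preferredHost) {
            this.table = table;
            this.blockId = blockId;
            this.preferredHost = preferredHost;
        }
    }

    // Planner: authorize the request, then create one task per block.
    static List<Task> plan(String user, String table,
                           Map<String, Set<String>> grants,
                           Map<Integer, String> blockHosts) {
        Set<String> allowed = grants.get(user);
        if (allowed == null || !allowed.contains(table)) {
            throw new SecurityException(user + " may not read " + table);
        }
        List<Task> tasks = new ArrayList<>();
        for (Map.Entry<Integer, String> e : blockHosts.entrySet()) {
            tasks.add(new Task(table, e.getKey(), e.getValue()));
        }
        return tasks;
    }

    // Worker: read the block, parse each line into fields, apply the predicate.
    static List<String[]> runTask(Task task,
                                  Map<Integer, List<String>> storage,
                                  Predicate<String[]> predicate) {
        List<String[]> records = new ArrayList<>();
        List<String> lines = storage.get(task.blockId);
        if (lines == null) {
            return records;
        }
        for (String line : lines) {
            String[] fields = line.split(",");
            if (predicate.test(fields)) {
                records.add(fields);
            }
        }
        return records;
    }
}
```

Neither method keeps state between calls in this sketch, which mirrors the "servers do not store any state" point: any planner can plan any request, and any worker can run any task.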
Architecture – Server APIs
• Planner and Worker services expose thrift APIs
• PlanRequest(), Exec(), Fetch()
• PlanRequest()
• Accepts SQL to specify the request: supports SELECT and PROJECT
• Access to tables and views stored in HMS
• Does not run operators that require data exchange; “map only”
• Generates a list of tasks which contain the request, each with locality
• Exec()/Fetch()
• Returns records in a canonical, optimized columnar format.
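A client would presumably call these three operations in sequence: plan once, then exec and fetch each task until it is drained. The sketch below shows that loop against an in-memory stand-in. Only the call names come from the slide. The signatures, the `FakeWorker` class, the batch size, and the use of a table name in place of a SQL string are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the PlanRequest()/Exec()/Fetch() call sequence.
// Signatures are assumed; this is not the real Thrift API.
class FetchLoopSketch {
    static final int BATCH_SIZE = 2;

    // In-memory stand-in for a worker: exec() opens a scan, fetch() pages through it.
    static class FakeWorker {
        private final Map<Integer, List<String>> openScans = new HashMap<>();
        private final Map<Integer, Integer> cursors = new HashMap<>();
        private int nextHandle = 0;

        int exec(List<String> taskRows) {
            int handle = nextHandle++;
            openScans.put(handle, taskRows);
            cursors.put(handle, 0);
            return handle;
        }

        // Returns the next batch; an empty batch signals the end of the task.
        List<String> fetch(int handle) {
            List<String> rows = openScans.get(handle);
            int from = cursors.get(handle);
            int to = Math.min(from + BATCH_SIZE, rows.size());
            cursors.put(handle, to);
            return new ArrayList<>(rows.subList(from, to));
        }
    }

    // Stand-in for PlanRequest(): looks up the pre-split tasks for a table.
    static List<List<String>> planRequest(String table,
                                          Map<String, List<List<String>>> catalog) {
        List<List<String>> tasks = catalog.get(table);
        if (tasks == null) {
            throw new IllegalArgumentException("unknown table: " + table);
        }
        return tasks;
    }

    // Client side: plan once, then exec and fetch each task to completion.
    static List<String> readAll(String table,
                                Map<String, List<List<String>>> catalog,
                                FakeWorker worker) {
        List<String> out = new ArrayList<>();
        for (List<String> task : planRequest(table, catalog)) {
            int handle = worker.exec(task);
            List<String> batch;
            while (!(batch = worker.fetch(handle)).isEmpty()) {
                out.addAll(batch);
            }
        }
        return out;
    }
}
```

In a real deployment each task would normally be handed to a different framework task (for example one MR mapper per task) rather than read in a single loop.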
Architecture – Fault tolerance
• Cluster state persisted in ZK
• Membership, delegation tokens, secret keys
• Servers do not communicate with each other directly => scalability
• Planner services
• Expected to run a few (e.g. 3) for HA
• Fault tolerance handled with clients getting a list of planners and failing over
• Plan requests are short
• Worker services
• Expected to run on each node in the cluster with data
• Fault tolerance handled by framework (e.g. MR) rescheduling task
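Client-side planner failover can be as simple as trying each known planner in turn. The following is a minimal sketch under assumptions: `planWithFailover` is a hypothetical helper, and the `planOn` function stands in for whatever RPC the client actually makes to a planner.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of client-side planner failover; not the RecordService client.
class PlannerFailoverSketch {
    // Try each planner in order and return the first successful plan.
    static <T> T planWithFailover(List<String> planners, Function<String, T> planOn) {
        RuntimeException last = null;
        for (String host : planners) {
            try {
                return planOn.apply(host);
            } catch (RuntimeException e) {
                // Planner unreachable or failed. Plan requests are short,
                // so the whole request is simply retried on the next planner.
                last = e;
            }
        }
        throw new IllegalStateException("no planner reachable", last);
    }
}
```

Because planners hold no state, retrying on another planner needs no hand-off between servers.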
Architecture – Security
• Authentication using Kerberos and delegation tokens
• Planner authorizes request using metadata in Sentry
• Column level ACLs
• Row level ACLs – create a view with a predicate
• Masking – create a view with the masking function in the select list
• Tasks generated by the planner are signed with a shared key
• Worker runs generated tasks.
• Does not authorize, relies on signed tasks
• Runs as user with full access to data, does not run user code
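One way to picture the signed-task handshake is with a message authentication code. The slide says only that tasks are signed with a shared key. The choice of HMAC-SHA256 and the `sign`/`verify` helpers below are assumptions for illustration, built on the standard `javax.crypto.Mac` class.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch of task signing; the actual scheme is not given in the slides.
class SignedTaskSketch {
    // Planner side: compute an HMAC-SHA256 tag over the serialized task.
    static byte[] sign(byte[] sharedKey, String task) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(sharedKey, "HmacSHA256"));
            return mac.doFinal(task.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC unavailable", e);
        }
    }

    // Worker side: recompute the tag and compare in constant time.
    static boolean verify(byte[] sharedKey, String task, byte[] signature) {
        return MessageDigest.isEqual(sign(sharedKey, task), signature);
    }
}
```

A worker that accepts only tasks whose tag verifies does not need to repeat the authorization check: a client cannot alter a task (for example to add a column) without invalidating the tag.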
Architecture – Security example
CREATE VIEW v AS
SELECT mask(credit_card_number) AS ccn,
       name, balance, region
FROM data WHERE region = 'Europe';
1. Restrict access to the data set: disable access to the ‘data’ table and its underlying files in HDFS.
2. Grant access by creating the view v.
3. Set column-level permissions on v per user if necessary.
Write path (ingest) unchanged. Job expected to run as privileged user.
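To make the effect of such a view concrete, the Java sketch below applies the same two restrictions to an in-memory list: a row filter on region and a mask on the card number. The `Row` class and `viewV` method are hypothetical, and the masking rule (keep the last four characters) is an assumption, since the slides do not define what `mask` does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of what a reader of the view would see.
class ViewSketch {
    static final class Row {
        final String ccn;
        final String name;
        final double balance;
        final String region;

        Row(String ccn, String name, double balance, String region) {
            this.ccn = ccn;
            this.name = name;
            this.balance = balance;
            this.region = region;
        }
    }

    // Assumed masking rule: hide everything except the last four characters.
    static String mask(String value) {
        int keep = Math.min(4, value.length());
        return "*".repeat(value.length() - keep) + value.substring(value.length() - keep);
    }

    // Row-level restriction (the region predicate) plus masking in the select list.
    static List<Row> viewV(List<Row> data) {
        List<Row> out = new ArrayList<>();
        for (Row r : data) {
            if ("Europe".equals(r.region)) {
                out.add(new Row(mask(r.ccn), r.name, r.balance, r.region));
            }
        }
        return out;
    }
}
```

Readers of the view never receive rows from other regions or the unmasked number, which is why access to the base table and its files has to be disabled first.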
Client APIs – Integration with ecosystem
• Similar APIs designed to integrate with MapReduce and Spark
• Client APIs make things simpler
• Don’t need to interact with HMS
• Don’t need to care about the underlying storage format: the worker always returns records in a canonical format.
• Don’t need to know storage engine details (e.g. S3)
Client Integration APIs
• Drop-in replacements for common existing InputFormats
• Text, Avro
• Can be used with Spark as well
• SparkSQL: integration with the Data Sources API
• Predicate pushdown, projection
• Migration should be easy
MR Example
// Before: read Avro files directly from a path.
//FileInputFormat.setInputPaths(job, new Path(args[0]));
//job.setInputFormatClass(AvroKeyInputFormat.class);
// After: read the table named by args[0] through RecordService.
RecordServiceConfig.setInputTable(configuration, null, args[0]);
job.setInputFormatClass(
    com.cloudera.recordservice.avro.mapreduce.AvroKeyInputFormat.class);
Spark Example
// Comment out one or the other
val file = sc.recordServiceTextFile(path)
//val file = sc.textFile(path)
Spark SQL Example
ctx.sql(s"""
|CREATE TEMPORARY TABLE $tbl
|USING com.cloudera.recordservice.spark.DefaultSource
|OPTIONS (
| RecordServiceTable '$db.$tbl',
| RecordServiceTableSize '$size'
|)
""".stripMargin)
Performance
• Shares some core components with Impala
• IO management, optimized C++ code, runtime code generation, use of low-level storage APIs
• Highly efficient implementation of the scan functionality
• Optimized columnar on-wire format
• Inspired by Apache Parquet
• Accelerates performance for many workloads
Terasort
• ~Worst case scenario. Minimal schema: a single STRING column
• Custom RecordServiceTeraInputFormat (similar to TeraInputFormat)
• 78 Node cluster (12 cores/24 Hyper-Threaded, 12 disks)
• Ran on 1 billion, 50 billion and 1 trillion (~100TB) scales
• See the GitHub repo for more details and runnable examples.
TeraChecksum
[Figure: TeraChecksum, normalized job time without vs. with RecordService at the 1B, 50B, and 1T scales on MapReduce and Spark. Data labels shown in the chart: 1, 0.48, 0.23, 1.03, 0.8, 0.85.]
Spark SQL
• Represents a more typical use case
• Data is fully schemed
• TPC-DS
• 500GB scale factor, on Parquet
• Cluster
• 5 node cluster
[Figure: TPC-DS, SparkSQL vs. SparkSQL with RecordService]
~15% improvement in query times; queries are not scan bound
[Figure: Spark SQL, SparkSQL vs. SparkSQL with RecordService on a 2% selective scan and on Sum(col). Data labels shown in the chart: 29.5, 31, 14, 23.5.]
State of the project
• Available in v0.2 beta:
• Integration with Spark, MR, Pig (via HCatalog)
• Planner HA
• Apache 2.0 Licensed
• Sentry Column-Level Privilege Support
• Mini Roadmap:
• Improved multi-tenancy
• Complex types
• More InputFormat support / integration options
• Intend to donate to the Apache Software Foundation
Conclusion
• RecordService provides a schemed data access service for Hadoop
• Logical data access instead of physical
• Much more powerful abstraction
• Demonstrated security enforcement, improved performance
• Simpler: clients don’t need to worry about low-level details such as storage APIs and file formats
• Opens the door for future improvements
Contributing!
• Mailing list: recordservice-user@googlegroups.com
• Discussion forum: http://community.cloudera.com/t5/Beta-Releases/bd-p/Beta
• Contributions: http://github.com/cloudera/RecordServiceClient/
• Documentation: http://cloudera.github.io/RecordServiceClient/
• Bug Reporting: https://issues.cloudera.org/projects/RS
• Beta Download: http://www.cloudera.com/downloads/beta/record-service/0-2-0.html
Thank you

Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path for Compute Frameworks

  • 1. 1© Cloudera, Inc. All rights reserved. Introducing RecordService Lenni Kuff
  • 2. RecordService is a distributed, scalable data access service for unified authorization in Hadoop.
  • 3. Motivation • As the Hadoop ecosystem expands, new components continue to be added • Speaks to the overall flexibility of Hadoop • This is good - more functionality, more workloads, more use cases. • As use cases for Hadoop mature, user requirements and expectations increase: • Security • Performance • Compatibility • The flexibility of Hadoop has come at the cost of increased complexity
  • 4. Storage Compute
  • 5. Storage Compute …
  • 6. Example: Security Challenge: Provide unified fine-grained security across compute frameworks • Integrating a consistent security layer into every component is not scalable. • Securing data at file level precludes fine-grained access control (column/row) • File ACLs are not enough - a user can view all or nothing. • Currently, must split files, duplicate data – large operational cost. Solution: Add a level of abstraction - a secure service to access datasets in “record” format • Can now apply fine-grained constraints on projection of dataset • Same access control policy can be applied uniformly across compute frameworks; uncoupled from underlying storage layer
  • 7. Introducing RecordService
  • 8. RecordService - Overview • Simplifies • Provides a higher level, logical abstraction for data (i.e. Tables or Views) • Returns schemed objects (instead of paths and bytes). No need for applications to worry about storage APIs and file formats. • HCatalog? Similar concept - RecordService is secure, performant. Plan to support HCatalog as a data model on RecordService. • Secures • Central location for all authorization checks using Sentry metadata. • Secure service that does not execute arbitrary user code • Accelerates • Unified data access path allows platform-wide performance improvements.
  • 9. Architecture
  • 10. Architecture • Runs as a distributed service: Planner Servers & Worker Servers • Servers do not store any state • Easy HA, fault tolerance. • Planner Servers are responsible for request planning • Retrieve and combine metadata (NN, HMS, Sentry) • Split generation -> Creates tasks for workers • Performs authorization • Worker Servers read from storage and construct records. • IO, file parsing, predicate evaluation • Runs as the “source” for a DAG computation
  • 11. Architecture – Server APIs • Planner and Worker services expose Thrift APIs • PlanRequest(), Exec(), Fetch() • PlanRequest() • Accepts SQL to specify request: supports SELECT and PROJECT • Access to tables and views stored in HMS • Does not run operators that require data exchange; “map only” • Generates a list of tasks which contain the request, each with locality • Exec()/Fetch() • Returns records in a canonical, optimized columnar format.
  • 12. Architecture – Fault tolerance • Cluster state persisted in ZK • Membership, delegation tokens, secret keys • Servers do not communicate with each other directly => scalability • Planner services • Expected to run a few (e.g. 3) for HA • Fault tolerance handled by clients getting a list of planners and failing over • Plan requests are short • Worker services • Expected to run on each node in the cluster with data • Fault tolerance handled by the framework (e.g. MR) rescheduling the task
  • 13. Architecture – Security • Authentication using Kerberos and delegation tokens • Planner authorizes request using metadata in Sentry • Column level ACLs • Row level ACLs – create a view with a predicate • Masking – create a view with the masking function in the select list • Tasks generated by the planner are signed with a shared key • Worker runs generated tasks. • Does not authorize, relies on signed tasks • Runs as user with full access to data, does not run user code
  • 14. Architecture – Security example
      CREATE VIEW v AS
        SELECT mask(credit_card_number) AS ccn, name, balance, region
        FROM data
        WHERE region = 'Europe'
      1. Restrict access to the data set: disable access to the ‘data’ table and underlying files in HDFS.
      2. Give access by creating the view, v.
      3. Set column level permissions on v per user if necessary.
      Write path (ingest) unchanged. Job expected to run as privileged user.
  • 15. Client APIs – Integration with ecosystem • Similar APIs designed to integrate with MapReduce and Spark • Client APIs make things simpler. Clients don’t need to: • Interact with HMS • Care about the underlying storage format: the worker always returns records in a canonical format. • Know storage engine details (e.g. S3)
  • 16. Client Integration APIs • Drop-in replacements for common existing InputFormats • Text, Avro • Can be used with Spark as well • SparkSQL: integration with the Data Sources API • Predicate pushdown, projection • Migration should be easy
  • 17. MR Example
      //FileInputFormat.setInputPaths(job, new Path(args[0]));
      //job.setInputFormatClass(AvroKeyInputFormat.class);
      RecordServiceConfig.setInputTable(configuration, null, args[0]);
      job.setInputFormatClass(
          com.cloudera.recordservice.avro.mapreduce.AvroKeyInputFormat.class);
  • 18. Spark Example
      // Comment out one or the other
      val file = sc.recordServiceTextFile(path)
      //val file = sc.textFile(path)
  • 19. Spark SQL Example
      ctx.sql(s"""
        |CREATE TEMPORARY TABLE $tbl
        |USING com.cloudera.recordservice.spark.DefaultSource
        |OPTIONS (
        |  RecordServiceTable '$db.$tbl',
        |  RecordServiceTableSize '$size'
        |)
      """.stripMargin)
  • 20. Performance • Shares some core components with Impala • IO management, optimized C++ code, runtime code generation, uses low level storage APIs • Highly efficient implementation of the scan functionality • Optimized columnar on-wire format • Inspired by Apache Parquet • Accelerates performance for many workloads
  • 21. Terasort • ~Worst case scenario. Minimal schema: a single STRING column • Custom RecordServiceTeraInputFormat (similar to TeraInputFormat) • 78 node cluster (12 cores/24 Hyper-Threaded, 12 disks) • Ran on 1 billion, 50 billion and 1 trillion (~100TB) scales • See the GitHub repo for more details and runnable examples.
  • 22. TeraChecksum [Chart: normalized job time, without RecordService vs. with RecordService, for 1B, 50B and 1T rows on MapReduce and on Spark. Data labels as extracted: 1, 0.48, 0.23, 1.03, 0.8, 0.85]
  • 23. Spark SQL • Represents a more expected use case • Data is fully schemed • TPC-DS • 500GB scale factor, on Parquet • Cluster • 5 node cluster
  • 24. Spark SQL [Chart: TPC-DS on SparkSQL, SparkSQL vs. SparkSQL with RecordService] ~15% improvement in query times; queries are not scan bound
  • 25. Spark SQL [Chart: SparkSQL vs. SparkSQL with RecordService for a 2% selective scan and for Sum(col). Data labels as extracted: 29.5, 31, 14, 23.5]
  • 26. State of the project • Available in v0.2 beta: • Integration with Spark, MR, Pig (via HCatalog) • Planner HA • Apache 2.0 Licensed • Sentry Column-Level Privilege Support • Mini Roadmap: • Improved multi-tenancy • Complex types • More InputFormat support / integration options • Intend to donate to the Apache Software Foundation
  • 27. Conclusion • RecordService provides a schemed data access service for Hadoop • Logical data access instead of physical • Much more powerful abstraction • Demonstrated security enforcement, improved performance • Simpler: clients don’t need to worry about low level details: storage APIs, file formats • Opens the door for future improvements
  • 28. Contributing! • Mailing list: recordservice-user@googlegroups.com • Discussion forum: http://community.cloudera.com/t5/Beta-Releases/bd-p/Beta • Contributions: http://github.com/cloudera/RecordServiceClient/ • Documentation: http://cloudera.github.io/RecordServiceClient/ • Bug Reporting: https://issues.cloudera.org/projects/RS • Beta Download: http://www.cloudera.com/downloads/beta/record-service/0-2-0.html
  • 29. Thank you
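
Slide 11 describes a planner that turns a request into a list of tasks, each carrying locality, and workers that return only the requested records. The following is a toy, in-memory sketch of that flow in Python. It is not the RecordService Thrift API: `Task`, `plan_request`, `exec_task`, and the sample blocks and host names are all hypothetical, chosen only to illustrate the "map only" split-and-scan idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for a planner-generated task."""
    table: str
    columns: tuple
    block_id: int
    hosts: tuple  # nodes assumed to hold a local replica of the block


def plan_request(table, columns, block_locations):
    """Planner side: one task per storage block, each carrying its locality."""
    return [
        Task(table, tuple(columns), block_id, tuple(hosts))
        for block_id, hosts in sorted(block_locations.items())
    ]


def exec_task(task, blocks):
    """Worker side: read one block and yield only the projected columns."""
    for row in blocks[task.block_id]:
        yield {col: row[col] for col in task.columns}


# Toy "storage": two blocks of one table, with made-up replica locations.
BLOCKS = {
    0: [{"name": "ann", "balance": 10, "region": "Europe"}],
    1: [{"name": "bob", "balance": 20, "region": "Asia"}],
}
LOCATIONS = {0: ["node-a", "node-b"], 1: ["node-c"]}

tasks = plan_request("data", ["name", "balance"], LOCATIONS)
records = [rec for task in tasks for rec in exec_task(task, BLOCKS)]
```

Because no task depends on another, a framework such as MapReduce or Spark could schedule each one independently, ideally on one of the hosts listed in the task.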
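
Slide 12 says planner fault tolerance is handled by clients getting a list of planners and failing over. A minimal Python sketch of that client-side loop follows, under the assumption that a failed planner surfaces as an exception. `PlannerUnavailable`, `plan_with_failover`, and the stub planners are hypothetical names for illustration, not part of the RecordService client.

```python
class PlannerUnavailable(Exception):
    """Hypothetical error raised when a planner cannot be reached."""


def plan_with_failover(planners, request):
    """Try each (name, plan_fn) pair in order; return the first successful plan."""
    errors = []
    for name, plan_fn in planners:
        try:
            return name, plan_fn(request)
        except PlannerUnavailable as exc:
            errors.append((name, str(exc)))
    raise PlannerUnavailable(f"all planners failed: {errors}")


def down(request):
    # Stub for a planner that is not reachable.
    raise PlannerUnavailable("connection refused")


def up(request):
    # Stub for a healthy planner; returns a one-task plan.
    return [f"task for {request}"]


used, plan = plan_with_failover(
    [("planner-1", down), ("planner-2", up), ("planner-3", up)],
    "SELECT name FROM v",
)
```

Retrying the whole request on another planner is cheap here because, as the slide notes, plan requests are short and the servers keep no state.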
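
Slide 13 states that tasks generated by the planner are signed with a shared key and that the worker relies on the signature instead of re-authorizing. The slides do not say which signing scheme is used; the sketch below assumes HMAC-SHA256 over a JSON payload, using only the Python standard library. `sign_task`, `verify_and_load`, and the demo key are hypothetical.

```python
import hashlib
import hmac
import json


def sign_task(task, key):
    """Planner side: serialize the task deterministically and attach an HMAC."""
    payload = json.dumps(task, sort_keys=True).encode("utf-8")
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"payload": payload.decode("utf-8"), "signature": signature}


def verify_and_load(signed, key):
    """Worker side: refuse any task whose signature does not match the payload."""
    expected = hmac.new(
        key, signed["payload"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    if not hmac.compare_digest(expected, signed["signature"]):
        raise PermissionError("task signature mismatch")
    return json.loads(signed["payload"])


SHARED_KEY = b"demo-key-not-for-production"  # assumption: distributed out of band
signed = sign_task(
    {"table": "v", "columns": ["name", "balance"], "block_id": 0}, SHARED_KEY
)
task = verify_and_load(signed, SHARED_KEY)
```

A client that edits the payload, for example to point the task at the underlying table instead of the view, cannot produce a matching signature without the key, so the worker rejects the task.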
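
The view on slide 14 combines three controls: a row filter (`WHERE region = ...`), masking in the select list, and column-level permissions on the view. The Python sketch below mimics those semantics on in-memory rows. The behaviour of `mask` is an assumption (the slides do not define it; here it hides all but the last four characters), and `view_v`, `select`, and the sample rows are hypothetical.

```python
def mask(value, visible=4):
    """Assumed masking rule: replace all but the last `visible` characters."""
    text = str(value)
    keep = min(visible, len(text))
    return "*" * (len(text) - keep) + text[len(text) - keep:]


def view_v(rows):
    """Mimic the view: keep only Europe rows and expose a masked card number."""
    for row in rows:
        if row["region"] == "Europe":
            yield {
                "ccn": mask(row["credit_card_number"]),
                "name": row["name"],
                "balance": row["balance"],
                "region": row["region"],
            }


def select(rows, allowed_columns):
    """Mimic column-level permissions: project only the columns a user may see."""
    return [{col: row[col] for col in allowed_columns} for row in rows]


DATA = [
    {"credit_card_number": "4111111111111111", "name": "ann",
     "balance": 10, "region": "Europe"},
    {"credit_card_number": "5500000000000004", "name": "bob",
     "balance": 20, "region": "Asia"},
]

visible = list(view_v(DATA))
restricted = select(visible, ["name", "region"])
```

The raw `credit_card_number` column never leaves `view_v`, which is the point of denying access to the base table and granting it only through the view.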

Editor's notes

  1. In this talk we will be introducing RecordService. In short, RecordService is a highly scalable, distributed data access service for Hadoop that provides unified authorization while also simplifying the platform.
  2. Before digging into the details of RecordService, let’s take a step back and look at the current state of the Hadoop ecosystem. What we have seen is more components being added to the stack at an accelerating rate.
  3. RecordService provides a layer of abstraction over storage, so compute frameworks don’t need to care about where data is stored. It provides a platform for uniform, fine-grained security across all compute engines. It helps to simplify Hadoop with a unified data access path.
  4. Single place for performance enhancements