Presto @ Netflix: Interactive Queries at Petabyte Scale


DataWorks Summit

Hadoop Summit 2014

Nezih Yigitbasi and Zhenxiao Luo

Outline
Our big data platform
Presto @ Netflix
Netflix integration
Our contributions
What’s next?

Netflix Data Pipeline
Event Data (500 bn/day, 15m): Cloud Apps -> Suro/Kafka -> Ursula -> S3
Dimension Data (Daily): Cassandra -> SSTables -> Aegisthus -> S3

Our Big Data Platform
[Diagram labels: Data Warehouse; Service; Tools; Gateways; Clients; Clusters: Prod, Query, Test; Big Data API/Portal; Metacat]

Our Use Cases
Batch jobs (Pig, Hive)
ETL jobs
reporting and other analysis
Interactive jobs

What is Presto?
An open source distributed SQL engine for running interactive queries against large datasets

[Chart: query completion time in seconds (axis 0 to 800), Presto vs. Hive, for three query types: Group By, Join + Group By, Needle in Haystack]

Why We Love Presto
Fast
Scalable
ANSI SQL
Open source
Works well on AWS
Hadoop friendly

Our Deployment
Presto v0.100
Java 8
1 coordinator (r3.4xlarge)
~220 workers (r3.4xlarge)
Clients: presto-cli, Python, R, BI tools (ODBC/JDBC), etc.

Total data size: 15+ PB
Queries/day: 2.5K
Presto users: 300+
[Chart: Data Size, % of queries (axis 0 to 100) at 100MB, 1GB, 1TB, 10TB]
[Chart: Query Runtime, % of queries (axis 0 to 100) at 4s, 1m, 5m, 10m]

[Diagram labels: S3, Atlas, Sidecar, Presto, Amazon EMR, Amazon RDS, HCat Server, Coordinator, Worker]

Data Lineage
[Diagram labels: S3, Atlas, Sidecar, Presto, Amazon EMR; query completion events]

Monitoring
[Diagram labels: S3, Atlas, Sidecar, Presto, Amazon EMR; metrics]

BI Tools
[Diagram labels: S3, Suro, Atlas, Sidecar, Presto, Amazon EMR]

S3 Filesystem: multipart upload, instance credentials, role support, reliability
Query Optimizer: single distinct => group by, joins with similar subqueries
Parquet File Format: schema evolution, Parquet 1.6
Complex Types: various new functions, comparability

[Diagram labels: presto-cli, ODBC/JDBC, other clients; Coordinator: Parser, Optimizer, Distributed Planner, Scheduler; Functions; Type System; Worker (x3); S3; numbered steps 1 to 7]

Single Distinct => Group By

Original query:
select count(distinct c)
from t

Rewritten query:
select count(*)
from (select c
      from t
      group by c)

Original plan (top to bottom):
Output
Count Aggregation, masks = {column$distinct}
Distinct, marker = column$distinct
Table Scan

Rewritten plan (top to bottom):
Output
Count Aggregation, masks = {}
Group By Aggregation, count
Table Scan
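The equivalence behind this rewrite can be checked on a small table. The sketch below is only an illustration: it uses SQLite through Python's standard library rather than Presto, and the sample rows are made up. It also assumes column c contains no NULLs, since count(distinct c) ignores NULLs while a group by produces a NULL group that count(*) would count.

```python
import sqlite3

# Hypothetical sample data: one column c with repeated, non-NULL values.
conn = sqlite3.connect(":memory:")
conn.execute("create table t (c text)")
conn.executemany(
    "insert into t values (?)",
    [("a",), ("b",), ("a",), ("c",), ("b",)],
)

# Original form: a single distinct aggregate.
distinct_count = conn.execute(
    "select count(distinct c) from t"
).fetchone()[0]

# Rewritten form: group by c, then count the resulting groups.
group_by_count = conn.execute(
    "select count(*) from (select c from t group by c)"
).fetchone()[0]

print(distinct_count, group_by_count)
```

Both queries count the three distinct values in the sample table.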

Joins with Similar Subqueries

Original query:
select *
from (select k, agg1, agg2
      from t
      group by k) a
join (select k, agg3, agg4
      from t
      group by k) b
on (a.k = b.k)

Original plan (top to bottom):
Output
Join, key = k
  Left input: Group By Aggregation, key = k, agg1, agg2; Table Scan, table = t
  Right input: Group By Aggregation, key = k, agg3, agg4; Table Scan, table = t

Joins with Similar Subqueries: rewritten query and plan

Rewritten query:
select k, agg1, agg2, agg3, agg4
from t
group by k

Rewritten plan (top to bottom):
Output
Group By Aggregation, key = k, agg1, agg2, agg3, agg4
Table Scan, table = t
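A small check of why the two subqueries can be merged into one scan. This is a sketch under assumptions, not the optimizer's implementation: it runs on SQLite via Python's standard library, the table contents are invented, and sum, min, max and count stand in for the slide's unspecified agg1 to agg4. It also assumes k is never NULL, because the join form drops NULL keys while the single group by keeps them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (k text, v integer)")
conn.executemany(
    "insert into t values (?, ?)",
    [("x", 1), ("x", 5), ("y", 2), ("y", 2), ("z", 7)],
)

# Join form: two aggregations over the same table, joined on the grouping key.
join_form = conn.execute("""
    select a.k, a.agg1, a.agg2, b.agg3, b.agg4
    from (select k, sum(v) as agg1, min(v) as agg2 from t group by k) a
    join (select k, max(v) as agg3, count(v) as agg4 from t group by k) b
    on (a.k = b.k)
    order by a.k
""").fetchall()

# Single-scan form: all four aggregates computed in one group by.
single_scan_form = conn.execute("""
    select k, sum(v) as agg1, min(v) as agg2, max(v) as agg3, count(v) as agg4
    from t
    group by k
    order by k
""").fetchall()

print(join_form == single_scan_form)
```

The join form reads t twice and the single-scan form reads it once, which is where the saving comes from.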

Complex Type Support
map<K,V>: map_agg(), map_keys(), map_values(); =, !=
row(F T): bug fixes
array<T>: array_join(), sort_array(), concat(); =, !=, <, >

[Diagram labels: presto-cli, ODBC/JDBC, other clients; Coordinator: Parser, Optimizer, Distributed Planner, Scheduler; Functions; Type System; Worker (x3), each with S3 Filesystem; S3; numbered steps 1 to 7]

Presto S3 FileSystem
(multipart upload, instance/static credentials, assume role, reliability, etc.)
Filesystem calls: open(), seek(), list()
S3 requests: Get Object, Get Object Metadata, List Objects
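One way a filesystem layer can serve open() and seek() on top of an object store is to keep a cursor locally and issue a ranged read only when data is requested. The sketch below illustrates that idea only. It is not the Presto S3 FileSystem code; the class name and the fetch_range callback are hypothetical stand-ins for a ranged Get Object request, and an in-memory byte string plays the role of the S3 object.

```python
class RangedObjectStream:
    """Seekable read-only stream over a store that supports ranged reads.

    fetch_range(start, end) is a hypothetical stand-in for a ranged
    Get Object request; it must return the bytes in [start, end).
    """

    def __init__(self, fetch_range, size):
        self._fetch_range = fetch_range
        self._size = size
        self._pos = 0
        self.requests = 0  # number of ranged reads issued, for illustration

    def seek(self, pos):
        # Seeking only moves the cursor; no request is sent until data is read.
        if not 0 <= pos <= self._size:
            raise ValueError("seek position out of range")
        self._pos = pos

    def read(self, n):
        # Clip the requested range to the end of the object.
        end = min(self._pos + n, self._size)
        if end == self._pos:
            return b""
        data = self._fetch_range(self._pos, end)
        self.requests += 1
        self._pos = end
        return data


# In-memory stand-in for an S3 object of 256 bytes.
blob = bytes(range(256))
stream = RangedObjectStream(lambda start, end: blob[start:end], len(blob))
stream.seek(250)
tail = stream.read(100)  # only 6 bytes remain after position 250
print(len(tail), stream.requests)
```

Deferring the request until read() means a seek followed by a read costs one request.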

[Diagram: Parquet file layout]
RowGroup: three Column Chunks, each made up of Pages
Footer: schema, version, etc.
RowGroup Metadata: codec, encoding, etc.
Column Metadata (one per Column Chunk): value count, size, min, max

What’s next?
Parquet optimizations
vectorized reader
predicate pushdown
lazy load
lazy decompression/decoding
Better resource management
Better BI tool integration
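Predicate pushdown relies on the per-column min and max values stored in the Parquet metadata: a row group whose value range cannot match the filter does not need to be read. The sketch below shows only that range-overlap test. The RowGroupStats type, the function name and the sample statistics are all hypothetical, and it is not the Presto or Parquet reader code.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group statistics for one integer column."""
    row_group: int
    min_value: int
    max_value: int


def row_groups_to_read(stats, lower, upper):
    """Return ids of row groups whose [min, max] range overlaps [lower, upper]."""
    return [
        s.row_group
        for s in stats
        if s.max_value >= lower and s.min_value <= upper
    ]


# Made-up statistics for three row groups of the same column.
stats = [
    RowGroupStats(0, 1, 100),
    RowGroupStats(1, 101, 200),
    RowGroupStats(2, 201, 300),
]

# Filter: column between 150 and 180. Only row group 1 can contain matches.
selected = row_groups_to_read(stats, 150, 180)
print(selected)
```

Row groups that pass the test still have to be scanned, because min and max only rule out groups and do not confirm matches.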

Weitere ähnliche Inhalte

Was ist angesagt?

Learn how to deploy a managed Presto environment to interactively query log data on AWS Organizations often need to quickly analyze large amounts of data, such as logs, generated from a wide variety of sources and formats. However, traditional approaches require a lot of time and effort designing complex data transformation and loading processes; and configuring data warehouses. Using AWS, you can start querying your datasets within minutes In this webinar you will learn how you can deploy a managed Presto environment in minutes to interactively query log data using plain ANSI SQL. Presto is a popular open source SQL engine for running interactive analytic queries against data sources of all sizes. We will talk about common use cases and best practices for running Presto on Amazon EMR. Learning Objectives: • Learn how to deploy a managed Presto environment running on Amazon EMR • Understand best practices for running Presto on Amazon EMR, including use of Amazon EC2 Spot instances • Learn how other customers are using Presto to analyze large data sets

Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...

Amazon Web Services

Running Presto and Spark on the Netflix Big Data Platform

Eva Tse

AWS provides a wide set of services to manage your data, which allow our customers to choose the right tool to the right workload. Learn how to make your databases up to 10x faster and less expensive with Amazon ElastiCache for Redis and utilize DynamoDB Accelerator (DAX) to access your data on DynamoDB faster with no additional development efforts. If you need fast access to your data, these services might be the right services for your workload.

Fast Data at Scale - AWS Summit Tel Aviv 2017

Amazon Web Services

A few years ago, Netflix had a fairly classic business intelligence tech stack. Now, things have changed. Netflix is a heavy user of AWS for much of its ongoing operations, and Data Science & Engineering (DSE) is no exception. In this talk, we dive into the Netflix DSE architecture: what and why. Key topics include their use of Big Data technologies (Cassandra, Hadoop, Pig + Python, and Hive); their Amazon S3 central data hub; their multiple persistent Amazon EMR clusters; how they benefit from AWS elasticity; their data science-as-a-service approach, how they made a hybrid AWS/data center setup work well, their open-source Hadoop-related software, and more.

Data Science at Netflix with Amazon EMR (BDT306) | AWS re:Invent 2013

Amazon Web Services

Querying and Analyzing Data in Amazon S3

Amazon Web Services

"Growing data is a massive computational challenge across the enterprise. The opportunity to draw insights from huge data sets is wide open, but traditional computing environments often can’t scale to those volumes. In this session, Intel Chief Data Scientist Bob Rogers PhD explains how developers can take advantage of technologies from Intel with the AWS platform. Also in this session, AOL Systems Architect Durga Nemani provides insights into how AOL was able to reduce the time and cost to process massive amounts of clickstream data by leveraging big data technologies in AWS. AOL can process data as fast as possible or as cheaply as possible, depending on the SLA, by choosing the number and types of instances without any changes to the code. Session sponsored by Intel."

(BDT210) Building Scalable Big Data Solutions: Intel & AOL

Amazon Web Services

Join the principal engineer of Citus Cloud for a brief overview of Citus, best use cases for it, and a drill down into how it's run and managed as a hosted service on top of AWS. The orchestration of Citus is homegrown, but comes from years of experience of running millions of PostgreSQL databases on top of AWS. Even if you aren't looking to leverage Citus to help you scale out, in this session you'll gain insights applicable to running and managing your stateful services on top of AWS. Citus is a PostgreSQL extension that transforms the database into a distributed, horizontally scalable database. Companies like Cloudflare use Citus to process 40 TB per day. With Citus MX, applications can take advantage of every node in the cluster for writes and yielding near-linear write scaling. Citus MX provide up to 500,000 durable writes per second.

AWS re:Invent 2016: How Citus Enables Scalable PostgreSQL on AWS (DAT207)

Amazon Web Services

Amazon Kinesis Firehose is a fully-managed, elastic service to deliver real-time data streams to Amazon S3, Amazon Redshift, and other destinations. In this session, we start with overviews of Amazon Kinesis Firehose and Amazon Kinesis Analytics. We then discuss how Amazon Kinesis Firehose makes it even easier to get started with streaming data, without writing a stream processing application or provisioning a single resource. You learn about the key features of Amazon Kinesis Firehose, including its companion agent that makes emitting data from data producers even easier. We walk through capture and delivery with an end-to-end demo, and discuss key metrics that will help developers and architects understand their streaming data flow. Finally, we look at some patterns for data consumption as the data streams into S3. We show two examples: using AWS Lambda, and how you can use Apache Spark running within Amazon EMR to query data directly in Amazon S3 through EMRFS.

(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose

Amazon Web Services

An overview of Amazon Athena

Julien SIMON

Want to get ramped up on how to use Amazon's big data web services and launch your first big data application on AWS? Join us on our journey as we build a big data application in real-time using Amazon EMR, Amazon Redshift, Amazon Kinesis, Amazon DynamoDB, and Amazon S3. We review architecture design patterns for big data solutions on AWS, and give you access to a take-home lab so that you can rebuild and customize the application yourself.

(BDT205) Your First Big Data Application On AWS

Amazon Web Services

Netflix running Presto in the AWS Cloud

Zhenxiao Luo

Amazon EMR Deep Dive & Best Practices

Amazon Web Services

Traditional data warehouses become expensive and slow down as the volume of your data grows. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it easy to analyze all of your data using existing business intelligence tools for as low as $1000/TB/year. This webinar will provide an introduction to Amazon Redshift and cover the essentials you need to deploy your data warehouse in the cloud so that you can achieve faster analytics and save costs. Learning Objectives: • Get an introduction to Amazon Redshift's massively parallel processing, columnar, scale-out architecture • Learn how to configure your data warehouse cluster, optimize schema, and load data efficiently • Get an overview of all the latest features including interleaved sorting and user-defined functions

Getting Started with Amazon Redshift - AWS July 2016 Webinar Series

Amazon Web Services

"In this session, you will learn how to easily access your data on S3, and how to visualize and generate insights from Amazon Athena and other data sources through Amazon QuickSight. In addition we will share some tips & best practices for using Athena & QuickSight. Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Amazon QuickSight is a fast, cloud-powered business analytics service that makes it easy to build visualizations, perform ad-hoc analysis, and quickly get business insights from various data sources (Amazon Redshift, Amazon Athena, Amazon EMR, Amazon RDS and more)."

Interactive Analytics on AWS - AWS Summit Tel Aviv 2017

Amazon Web Services

Introduction to AWS Glue

Amazon Web Services

Amazon EMR provides a managed framework which makes it easy, cost effective, and secure to run data processing frameworks such as Apache Hadoop, Apache Spark, and Presto on AWS. In this session, you learn the key design principles behind running these frameworks on the cloud and the feature set that Amazon EMR offers. We discuss the benefits of decoupling compute and storage and strategies to take advantage of the scale and the parallelism that the cloud offers, while lowering costs. In this session, you learn the benefits of decoupling storage and compute and allowing them to scale independently; how to run Hadoop, Spark, Presto and other supported Hadoop Applications on Amazon EMR; how to use Amazon S3 as a persistent data-store and process data directly from Amazon S3; Deployment strategies and how to avoid common mistakes when deploying at scale; and how to use Spot instances to scale your transient infrastructure effectively.

Big data with amazon EMR - Pop-up Loft Tel Aviv

Amazon Web Services

Elasticsearch has quickly become the leading open source technology for scaling search and building document services on. Many software providers have come to rely on it to serve the needs of high-performance, production applications. In this talk, we’ll go deep on lessons learned from three years in production scaling from a few shards to more than 100 spread across 100s of nodes on AWS--to serve real-time queries against 100s of millions of documents. Attendees will learn: * How to capacity plan for ES on AWS * How to scale and reshard on AWS with zero downtime * What AWS and ES metrics to collect and alert on * Tips on day to day ES operations Session sponsored by SignalFx.

AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)

Amazon Web Services

Amazon EMR is one of the largest Hadoop operators in the world. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We will also share best practices to keep your Amazon EMR cluster cost-efficient. Finally, we dive into some of our recent launches to keep you current on our latest features.

(BDT305) Amazon EMR Deep Dive and Best Practices

Amazon Web Services

Apache Spark and the Hadoop Ecosystem on AWS

Amazon Web Services

If you are interested to know more about AWS Chicago Summit, please use the following to register: http://amzn.to/1RooPPL Many AWS customers store vast amounts of data in Amazon S3, a low cost, scalable, and durable object store; Amazon DynamoDB, a NoSQL database; or Amazon Kinesis, a real time data stream processing service. With large datasets in various AWS services, how do you derive value from this information in a cost-effective way? Using Amazon Elastic MapReduce (Amazon EMR) with applications in the Apache Hadoop ecosystem, you can directly interact with data in each of these storage services for scalable analytics workloads or ad hoc queries. You can quickly and easily launch an Amazon EMR cluster from the AWS Management Console, and scale your cluster to match the compute and memory resources needed for your workflow, independent from the storage capacity used in your AWS storage services. The webinar will accelerate your use of Amazon EMR by showing you how to create and monitor Amazon EMR clusters, and provide several use cases and architectures for using Amazon EMR with different AWS data stores. Learning Objectives: • Recognize when to use Amazon EMR • Understand the steps required to set up and monitor an Amazon EMR cluster • Architect applications that effectively use Amazon EMR • Understand how to use HUE for ad hoc query of data in Amazon S3 Who Should Attend: • Developers, LOB owners, Continuous Integration & Continuous Delivery (CICD) practitioners

AWS May Webinar Series - Getting Started with Amazon EMR

Amazon Web Services

Was ist angesagt? (20)

Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...

Running Presto and Spark on the Netflix Big Data Platform

Fast Data at Scale - AWS Summit Tel Aviv 2017

Data Science at Netflix with Amazon EMR (BDT306) | AWS re:Invent 2013

Querying and Analyzing Data in Amazon S3

(BDT210) Building Scalable Big Data Solutions: Intel & AOL

AWS re:Invent 2016: How Citus Enables Scalable PostgreSQL on AWS (DAT207)

(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose

An overview of Amazon Athena

(BDT205) Your First Big Data Application On AWS

Netflix running Presto in the AWS Cloud

Amazon EMR Deep Dive & Best Practices

Getting Started with Amazon Redshift - AWS July 2016 Webinar Series

Interactive Analytics on AWS - AWS Summit Tel Aviv 2017

Introduction to AWS Glue

Big data with amazon EMR - Pop-up Loft Tel Aviv

AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)

(BDT305) Amazon EMR Deep Dive and Best Practices

Apache Spark and the Hadoop Ecosystem on AWS

AWS May Webinar Series - Getting Started with Amazon EMR

Ähnlich wie Presto @ Netflix: Interactive Queries at Petabyte Scale

presto-at-netflix-hadoop-summit-15

Zhenxiao Luo

Tapad's data pipeline is an elastic combination of technologies (Kafka, Hadoop, Avro, Scalding) that forms a reliable system for analytics, realtime and batch graph-building, and logging. In this talk, I will speak about the creation and evolution of the pipeline, and a concrete example – a day in the life of an event tracking pixel. We'll also talk about common challenges that we've overcome such as integrating different pieces of the system, schema evolution, queuing, and data retention policies.

Data Pipeline at Tapad

Toby Matejovsky

Talk at Scala Up North Jul 21 2017 We will talk about Spotify's story with Scala big data and our journey to migrate our entire data infrastructure to Google Cloud and how Justin Bieber contributed to breaking it. We'll talk about Scio, a Scala API for Apache Beam and Google Cloud Dataflow, and the technology behind it, including macros, algebird, chill and shapeless. There'll also be a live coding demo.

Sorry - How Bieber broke Google Cloud at Spotify

Neville Li

"Kafka Connect, the framework for building scalable and reliable data pipelines, has gained immense popularity in the data engineering landscape. This session will provide a comprehensive guide to creating Kafka connectors using Kotlin, a language known for its conciseness and expressiveness. In this session, we will explore a step-by-step approach to crafting Kafka connectors with Kotlin, from inception to deployment using an simple use case. The process includes the following key aspects: Understanding Kafka Connect: We'll start with an overview of Kafka Connect and its architecture, emphasizing its importance in real-time data integration and streaming. Connector Design: Delve into the design principles that govern connector creation. Learn how to choose between source and sink connectors and identify the data format that suits your use case. Building a Source Connector: We'll start with building a Kafka source connector, exploring key considerations, such as data transformations, serialization, deserialization, error handling and delivery guarantees. You will see how Kotlin's concise syntax and type safety can simplify the implementation. Testing: Learn how to rigorously test your connector to ensure its reliability and robustness, utilizing best practices for testing in Kotlin. Connector Deployment: go through the process of deploying your connector in a Kafka Connect cluster, and discuss strategies for monitoring and scaling. Real-World Use Cases: Explore real-world examples of Kafka connectors built with Kotlin. By the end of this session, you will have a solid foundation for creating and deploying Kafka connectors using Kotlin, equipped with practical knowledge and insights to make your data integration processes more efficient and reliable. Whether you are a seasoned developer or new to Kafka Connect, this guide will help you harness the power of Kafka and Kotlin for seamless data flow in your applications."

Building Kafka Connectors with Kotlin: A Step-by-Step Guide to Creation and D...

HostedbyConfluent

Traditionally database systems were optimized either for OLAP either for OLTP workloads. Such mainstream DBMSes like Postgres,MySQL,... are mostly used for OLTP, while Greenplum, Vertica, Clickhouse, SparkSQL,... are oriented on analytic queries. But right now many companies do not want to have two different data stores for OLAP/OLTP and need to perform analytic queries on most recent data. I want to discuss which features should be added to Postgres to efficiently handle HTAP workload.

OLTP+OLAP=HTAP

EDB

Deploying your Data Warehouse on AWS

Amazon Web Services

Interactively querying Google Analytics reports from R using ganalytics

Johann de Boer

Presentation_BigData_NenaMarin

n5712036

Spark Sql and DataFrame

Prashant Gupta

Data visualization in python/Django

kenluck2001

10 Reasons to Start Your Analytics Project with PostgreSQL

Satoshi Nagayasu

Slides of the Barcelona Spark meetup of the 24th of October 2019. The recording is available at https://www.youtube.com/watch?v=eCoCcBH4hIU. Abstract One of the key strengths of Spark is its flexibility as it integrates with dozens of different storage systems and file formats. However, it is not the same reading from a CSV file, or a SQL database, or an exotic stratified sampled multidimensional database. And finding the right balance between modularity and flexibility is not easy! In this presentation, we will talk about the evolution of Spark's DataSource API, and how it integrates with the SQL optimizer, highlighting how we can make much faster queries with logical and the physical plans that better integrates with the storage. From theory to practise, we will then discuss how we extended the Spark's internals, and we built a new source integration that allows the push-down of both sampling and multidimensional filtering. About the speakers: Paola Pardo is a Computer Engineer from Barcelona. She graduated in Computer engineer this last summer at the Technical University of Catalunya with a thesis focused on Data storage push down optimization based on Apache Spark. She is, and she is currently working at Barcelona Supercomputing Center and in its spin-off Qbeast developing a Qbeast-Spark connector. Cesare Cugnasco is a PhD in Computer Architecture and a researcher at the Barcelona Supercomputing Center. His research focuses on NoSQL databases, distributed computing and High-performance storage. He invented and patented a new database architecture for Big Data, and he is building a spin-off for its commercialization.

Extending Spark for Qbeast's SQL Data Source with Paola Pardo and Cesare Cug...

Qbeast

A talk given by Julian Hyde at DataCouncil SF on April 18, 2019 How do you organize your data so that your users get the right answers at the right time? That question is a pretty good definition of data engineering — but it is also describes the purpose of every DBMS (database management system). And it’s not a coincidence that these are so similar. This talk looks at the patterns that reoccur throughout data management — such as caching, partitioning, sorting, and derived data sets. As the speaker is the author of Apache Calcite, we first look at these patterns through the lens of Relational Algebra and DBMS architecture. But then we apply these patterns to the modern data pipeline, ETL and analytics. As a case study, we look at how Looker’s “derived tables” blur the line between ETL and caching, and leverage the power of cloud databases.

Tactical data engineering

Julian Hyde

R basics

Sagun Baijal

MongoDB.local Berlin: Building a GraphQL API with MongoDB, Prisma and Typescript

MongoDB

Ryan Blue explains how Netflix is building on Parquet to enhance its 40+ petabyte warehouse, combining Parquet’s features with Presto and Spark to boost ETL and interactive queries. Information about tuning Parquet is hard to find. Ryan shares what he’s learned, creating the missing guide you need. Topics include: * The tools and techniques Netflix uses to analyze Parquet tables * How to spot common problems * Recommendations for Parquet configuration settings to get the best performance out of your processing platform * The impact of this work in speeding up applications like Netflix’s telemetry service and A/B testing platform

Parquet performance tuning: the missing guide

Ryan Blue

ScalaTo July 2019 - No more struggles with Apache Spark workloads in production

Chetan Khatri

Building the next generation Spark SQL engine at speed poses new challenges to both automation and testing. At Databricks, we are implementing a new testing framework for assessing the quality and performance of new developments as they produced. Having more than 1,200 worldwide contributors, Apache Spark follows a rapid pace of development. At this scale, new testing tooling such as random query and data generation, fault injection, longevity stress, and scalability tests are essential to guarantee a reliable and performance Spark later in production. By applying such techniques, we will demonstrate the effectiveness of our testing infrastructure by drilling-down into cases where correctness and performance regressions have been found early. In addition, showing how they have been root-caused and fixed to prevent regressions in production and boosting the continuous delivery of new features.

Fast and Reliable Apache Spark SQL Engine

Databricks

Apache Kafka, and the Rise of Stream Processing

Guozhang Wang

GridSQL is commonly thought of as a replication solution along the likes of Slony and Bucardo, but the open source GridSQL project actually allows PostgreSQL queries to be parallelized across many servers allowing performance to scale nearly linearly. In this session, we will discuss the advantages to using GridSQL for large multi-terabyte data warehouses and how to design your PostgreSQL schemas and queries to leverage GridSQL. We will dig into how GridSQL plans a query capable of spanning multiple PostgreSQL servers and executes across those nodes. We will delve into some performance expectations and where GridSQL should be deployed.

Scaling PostgreSQL With GridSQL

Jim Mlodgenski

Ähnlich wie Presto @ Netflix: Interactive Queries at Petabyte Scale (20)

presto-at-netflix-hadoop-summit-15

Data Pipeline at Tapad

Sorry - How Bieber broke Google Cloud at Spotify

Building Kafka Connectors with Kotlin: A Step-by-Step Guide to Creation and D...

OLTP+OLAP=HTAP

Deploying your Data Warehouse on AWS

Interactively querying Google Analytics reports from R using ganalytics

Presentation_BigData_NenaMarin

Spark Sql and DataFrame

Data visualization in python/Django

10 Reasons to Start Your Analytics Project with PostgreSQL

Extending Spark for Qbeast's SQL Data Source with Paola Pardo and Cesare Cug...

Tactical data engineering

R basics

MongoDB.local Berlin: Building a GraphQL API with MongoDB, Prisma and Typescript

Parquet performance tuning: the missing guide

ScalaTo July 2019 - No more struggles with Apache Spark workloads in production

Fast and Reliable Apache Spark SQL Engine

Apache Kafka, and the Rise of Stream Processing

Scaling PostgreSQL With GridSQL

Mehr von DataWorks Summit

Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL). Format: An introductory lecture on several supervised and unsupervised ML techniques followed by light introduction to DL and short discussion what is current state-of-the-art. Several python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW). Objective: To provide a quick and short hands-on introduction to ML with python’s scikit-learn library. The environment in CDSW is interactive and the step-by-step guide will walk you through setting up your environment, to exploring datasets, training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models. Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of python highly recommended.

Data Science Crash Course

DataWorks Summit

In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used are the specific durability requirements of HBase's write-ahead log (WAL) and HDFS providing that guarantee correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort. This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.

Floating on a RAFT: HBase Durability with Apache Ratis

DataWorks Summit

Utilizing Apache NiFi we read various open data REST APIs and camera feeds to ingest crime and related data real-time streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables as well as Hive external tables to HBase. Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs. Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables. Resources: https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html

Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi

DataWorks Summit

Whilst HBase is the most logical answer for use cases requiring random, realtime read/write access to Big Data, it may not be so trivial to design applications that make most of its use, neither the most simple to operate. As it depends/integrates with other components from Hadoop ecosystem (Zookeeper, HDFS, Spark, Hive, etc) or external systems ( Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables are to be considered when observing anomalies or even outages. Adding to the equation there's also the fact that HBase is still an evolving product, with different release versions being used currently, some of those can carry genuine software bugs. On this presentation, we'll go through the most common HBase issues faced by different organisations, describing identified cause and resolution action over my last 5 years supporting HBase to our heterogeneous customer base.

HBase Tales From the Trenches - Short stories about most common HBase operati...

DataWorks Summit

LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.

Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...

DataWorks Summit

Managing the Dewey Decimal System

DataWorks Summit

Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL. Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist). In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.

Practical NoSQL: Accumulo's dirlist Example

DataWorks Summit

Data serves as the platform for decision-making at Uber. To facilitate data-driven decisions, many datasets at Uber are ingested into a Hadoop data lake and exposed for querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber. Data ingestion, in its most basic form, is about organizing data to balance efficient reading with efficient writing of newer data. Organizing data for efficient reading involves factoring in query patterns to partition data so that read amplification stays low. Organizing data for efficient writing involves factoring in the nature of the input data: whether it is append-only or updatable. At Uber we ingest terabytes of data for many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for analytical use cases across the entire company. Datasets such as trips constantly receive updates in addition to inserts. To ingest such datasets we need a critical component that keeps track of the data layout and annotates each incoming change with the location in HDFS where the data should be written. This component is called Global Indexing. Without it, all records are treated as inserts and rewritten to HDFS instead of being updated, which duplicates data and breaks data correctness and user queries. The component must provide strong consistency and high throughput for index writes and reads. At Uber, we have chosen HBase as the backing store for the Global Indexing component, which is key to scaling our jobs: our current ingestion systems now handle more than 500 billion writes a day.
In this talk, we will discuss data@Uber and expand on why we built the global index using Apache HBase and how this helps to scale out our cluster usage. We’ll give details on why we chose HBase over other storage systems; how and why we came up with a creative solution to automatically load HFiles directly into the backend, bypassing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints; and other lessons we learned bringing this system up in production at the scale of data that Uber encounters daily.

HBase Global Indexing to support large-scale data ingestion at Uber

DataWorks Summit

Recently, Apache Phoenix has been integrated with Apache (incubator) Omid transaction processing service, to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions. These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.

Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix

DataWorks Summit

Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real-time. This is a challenging endeavor when considering the variety of data sources which need to be collected and analyzed. Everything from application logs, network events, authentications systems, IOT devices, business events, cloud service logs, and more need to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms. To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.

Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi

DataWorks Summit

Supporting Apache HBase : Troubleshooting and Supportability Improvements

DataWorks Summit

In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”

Security Framework for Multitenant Architecture

DataWorks Summit

Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores. With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.

Presto: Optimizing Performance of SQL-on-Anything Engine

DataWorks Summit

Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. By adding just a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results are logged automatically as a byproduct of those lines of code, even if the person doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow offers a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Projects and Models components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.

Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...

DataWorks Summit

Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery and management, along with various tools and libraries that help users with both batch and real-time analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extended our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale in the cloud, and present our current solution. Extending Twitter's Data Platform to the cloud was a complex task, which we dive into in this presentation.

Extending Twitter's Data Platform to Google Cloud

DataWorks Summit

At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.

Event-Driven Messaging and Actions using Apache Flink and Apache NiFi

DataWorks Summit

Companies are increasingly moving to the cloud to store and process data. One of the challenges they face is securing data across hybrid environments with an easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both on-premises and in cloud environments. We will go into detail on the challenges of hybrid environments and how Ranger can address them. We will also talk through how companies can further enhance security by using Ranger to anonymize or tokenize data while moving it into the cloud and de-anonymize it dynamically using Apache Hive or Apache Spark, or when accessing data from cloud storage systems. We will also take a deep dive into Ranger’s integration with AWS S3, AWS Redshift and other cloud-native systems. We will wrap up with an end-to-end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data, and track where data is flowing.

Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger

DataWorks Summit

Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.

Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...

DataWorks Summit

Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases, but increasingly it is being used in physical stores in a variety of new and exciting ways, such as: ● Optimizing merchandising execution, in-stocks and sell-thru ● Enhancing operational efficiencies and enabling real-time customer engagement ● Enhancing loss prevention capabilities and response time ● Creating frictionless experiences for shoppers. Abstract: This talk will cover the use of Computer Vision in Retail and its implications for the broader Consumer Goods industry, and share the business drivers, use cases and benefits that are unfolding as Computer Vision becomes an integral component in the remaking of an age-old industry. We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey. Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings and to collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables. We will cover the basics of object detection, then move into more advanced image processing, describing possible ways that a retail store of the near future could operate. A deep learning system attached to a camera stream can identify various storefront situations, such as item stocks on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance. We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today: deep learning tools for research and development; production tools to distribute that intelligence to the entire inventory of cameras situated around a retail location; and tools for exploring and understanding the new data streams produced by the computer vision systems. By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, along with the key use cases, techniques and considerations that leaders are exploring and implementing today.

Computer Vision: Coming to a Store Near You

DataWorks Summit

Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual gene or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved a near linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from the next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.

Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark

DataWorks Summit

Presto @ Netflix: Interactive Queries at Petabyte Scale

1. Presto @ Netflix: Interactive Queries at Petabyte Scale Nezih Yigitbasi and Zhenxiao Luo

2. Outline Our big data platform Presto @ Netflix Netflix integration Our contributions What’s next?

3. Netflix Data Pipeline. Event data (500 bn/day, 15m): Cloud Apps → Suro/Kafka → Ursula → S3. Dimension data (daily): Cassandra → SSTables → Aegisthus → S3.

4. Our Big Data Platform [diagram]: Data Warehouse, Service, Tools, Gateways, Prod, Clients, Clusters, Query, Prod, Test, Prod, Big Data API/Portal, Metacat

5. Our Use Cases: batch jobs (Pig, Hive) for ETL jobs, reporting and other analysis; interactive jobs

6. Presto @ Netflix

7. What is Presto? An open source distributed SQL engine for running interactive queries against large datasets

8. Why we love Presto? Fast

9. [Chart: query completion time in seconds (axis 0 to 800), Presto vs. Hive, for three query types: Group By, Join + Group By, Needle in Haystack]

10. Why we Love Presto? Fast; Scalable; ANSI SQL; Open source; Works well on AWS; Hadoop friendly

11. Our Deployment: v 0.100; Java 8; 1 coordinator (r3.4xlarge); ~220 workers (r3.4xlarge). Clients: presto-cli, Python, R, BI tools (ODBC/JDBC), etc.

12. 15+ PB total data size; 2.5K queries/day; 300+ Presto users. [Charts: % of queries by data size (100MB, 1GB, 1TB, 10TB) and % of queries by query runtime (4s, 1m, 5m, 10m)]

13. Netflix Integration

14. [Diagram] S3, Atlas, Sidecar, Presto, Amazon EMR, Amazon RDS, HCat Server, Coordinator, Worker

15. [Diagram] S3, Atlas, Sidecar, Presto, Amazon EMR, Data Lineage, query completion events

16. [Diagram] S3, Atlas, Sidecar, Presto, Amazon EMR, Monitoring, metrics

17. [Diagram] S3, Suro, Atlas, Sidecar, Presto, Amazon EMR, BI Tools

18. Our Contributions

19. S3 Filesystem: multipart upload, instance credentials, role support, reliability. Query Optimizer: single distinct => group by, joins with similar subqueries. Parquet File Format: schema evolution, Parquet 1.6. Complex Types: various new functions, comparability.

20. [Architecture diagram] Clients: presto-cli, ODBC/JDBC, other clients. Coordinator: Parser, Optimizer, Distributed Planner, Scheduler. Functions, Type System. Three Workers. S3. Steps numbered 1 to 7.

21. Single Distinct => Group By. Original query: select count(distinct c) from t. Rewritten query: select count(*) from (select c from t group by c). Original plan (top down): Output; Count Aggregation, masks = {column$distinct}; Distinct marker = column$distinct; Table Scan. Rewritten plan (top down): Output; Count Aggregation, masks = {}; Group By Aggregation count; Table Scan.
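The rewrite on slide 21 can be tried on a toy table. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Presto; the table contents and the helper name `scalar` are made up for illustration. The third query, which uses count(c) instead of count(*), is not from the slides: it is included only because count(distinct c) ignores NULLs while GROUP BY produces a NULL group, so the two slide queries can differ when c contains NULLs.

```python
import sqlite3

# In-memory toy table; the values are illustrative, not Netflix data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c INTEGER)")
conn.executemany("INSERT INTO t (c) VALUES (?)",
                 [(1,), (2,), (2,), (3,), (3,), (3,)])


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


# The two queries from the slide, plus a NULL-safe variant (an assumption,
# not taken from the slides).
DISTINCT_SQL = "SELECT count(DISTINCT c) FROM t"
GROUP_BY_SQL = "SELECT count(*) FROM (SELECT c FROM t GROUP BY c)"
NULL_SAFE_SQL = "SELECT count(c) FROM (SELECT c FROM t GROUP BY c)"

# Without NULLs all three forms agree.
no_nulls = (scalar(DISTINCT_SQL), scalar(GROUP_BY_SQL), scalar(NULL_SAFE_SQL))

# With a NULL row, count(*) over the grouped subquery also counts the NULL group.
conn.execute("INSERT INTO t (c) VALUES (NULL)")
with_null = (scalar(DISTINCT_SQL), scalar(GROUP_BY_SQL), scalar(NULL_SAFE_SQL))

print(no_nulls)   # (3, 3, 3)
print(with_null)  # (3, 4, 3)
```

The point of the rewrite is that the distinct count becomes an ordinary group-by followed by a plain count, which a distributed engine can already execute in parallel.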

22. Joins with Similar Subqueries. Query: select * from (select k, agg1, agg2 from t group by k) a join (select k, agg3, agg4 from t group by k) b on (a.k = b.k). Plan (top down): Output; Join, key = k; two inputs, each a Group By Aggregation with key = k (agg1, agg2 on one side, agg3, agg4 on the other) over a Table Scan of table = t.

23. Joins with Similar Subqueries (rewritten). Query: select k, agg1, agg2, agg3, agg4 from t group by k. Plan (top down): Output; Group By Aggregation, key = k, agg1, agg2, agg3, agg4; Table Scan, table = t.
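The equivalence behind slides 22 and 23 can be checked the same way. This is a minimal sketch using Python's sqlite3 as a stand-in for Presto; the column `v` and the choice of min, max, sum and count for agg1 to agg4 are assumptions made for illustration, since the slides leave the aggregates abstract.

```python
import sqlite3

# Toy table: key column k and a value column v (v is a hypothetical column).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k TEXT, v INTEGER)")
conn.executemany("INSERT INTO t (k, v) VALUES (?, ?)",
                 [("a", 1), ("a", 5), ("b", 2), ("b", 4), ("b", 6)])

# Original shape: two aggregations over the same table, joined on the key.
# This scans t twice.
joined = conn.execute("""
    SELECT a.k, a.agg1, a.agg2, b.agg3, b.agg4
    FROM (SELECT k, min(v) AS agg1, max(v) AS agg2 FROM t GROUP BY k) a
    JOIN (SELECT k, sum(v) AS agg3, count(v) AS agg4 FROM t GROUP BY k) b
      ON a.k = b.k
    ORDER BY a.k
""").fetchall()

# Rewritten shape: one scan and one aggregation computing all four values.
merged = conn.execute("""
    SELECT k, min(v) AS agg1, max(v) AS agg2, sum(v) AS agg3, count(v) AS agg4
    FROM t
    GROUP BY k
    ORDER BY k
""").fetchall()

print(joined)  # [('a', 1, 5, 6, 2), ('b', 2, 6, 12, 3)]
print(merged)  # [('a', 1, 5, 6, 2), ('b', 2, 6, 12, 3)]
```

One caveat worth keeping in mind: an inner join on a.k = b.k drops rows whose key is NULL, while a single GROUP BY keeps a NULL group, so the two forms agree only when k has no NULLs or the rewrite accounts for them.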

24. [Architecture diagram] Clients: presto-cli, ODBC/JDBC, other clients. Coordinator: Parser, Optimizer, Distributed Planner, Scheduler. Functions, Type System. Three Workers. S3. Steps numbered 1 to 7.

25. Complex Type Support. map<K,V>: map_agg(), map_keys(), map_values(). row(F T): =, !=, bug fixes. array<T>: array_join(), sort_array(), concat(), =, !=, <, >.

26. [Architecture diagram] Clients: presto-cli, ODBC/JDBC, other clients. Coordinator: Parser, Optimizer, Distributed Planner, Scheduler. Functions, Type System. Three Workers, with S3 Filesystem. S3. Steps numbered 1 to 7.

27. Presto S3 FileSystem (multipart upload, instance/static credentials, assume role, reliability, etc.). FileSystem calls: open(), seek(), list(). S3 operations: Get Object, Get Object Metadata, List Objects.

28. [Architecture diagram] Clients: presto-cli, ODBC/JDBC, other clients. Coordinator: Parser, Optimizer, Distributed Planner, Scheduler. Functions, Type System. Three Workers, with S3 Filesystem and Parquet Cursor. S3. Steps numbered 1 to 7.

29. [Parquet file layout diagram] RowGroup: metadata (codec, encoding, etc.) and three Column Chunks, each made of Pages. Footer: schema, version, etc., plus Column Metadata for each column (value count, size, min, max).

30. What’s next? Parquet optimizations: vectorized reader, predicate pushdown, lazy load, lazy decompression/decoding. Better resource management. Better BI tool integration.
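To make the predicate pushdown item concrete, here is a minimal sketch of how the per-column min and max values from slide 29 let a reader skip whole row groups. It is not Presto's or Parquet's actual reader: the classes `ColumnStats` and `RowGroup` and the functions `make_row_group` and `scan_between` are hypothetical names invented for this illustration, and rows are plain Python dicts rather than encoded pages.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ColumnStats:
    """Per-column statistics, mirroring the value count, min, max on slide 29."""
    value_count: int
    min_value: int
    max_value: int


@dataclass
class RowGroup:
    """A toy row group: its rows plus statistics for every column."""
    rows: List[Dict[str, int]]
    stats: Dict[str, ColumnStats]


def make_row_group(rows: List[Dict[str, int]]) -> RowGroup:
    """Build a row group and compute min/max statistics for each column."""
    stats = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        stats[column] = ColumnStats(len(values), min(values), max(values))
    return RowGroup(rows, stats)


def scan_between(row_groups: List[RowGroup], column: str,
                 low: int, high: int) -> Tuple[List[Dict[str, int]], int]:
    """Return rows with low <= row[column] <= high and how many groups were read."""
    matches = []
    groups_read = 0
    for group in row_groups:
        stats = group.stats[column]
        # If the statistics prove no row can match, skip the group unread.
        if stats.max_value < low or stats.min_value > high:
            continue
        groups_read += 1
        matches.extend(row for row in group.rows if low <= row[column] <= high)
    return matches, groups_read


row_groups = [
    make_row_group([{"id": i} for i in range(0, 100)]),
    make_row_group([{"id": i} for i in range(100, 200)]),
    make_row_group([{"id": i} for i in range(200, 300)]),
]
matches, groups_read = scan_between(row_groups, "id", 150, 152)
print(len(matches), groups_read)  # 3 1
```

In this toy run only one of the three row groups is read; the benefit depends on the data being laid out so that value ranges in different row groups do not overlap much.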

31. THANK YOU

Editor's notes

----- Meeting Notes (6/3/15 10:31) ----- more story telling here of why we chose presto
