Big Data Day LA 2016 Keynote - Reynold Xin/ Databricks

Big Data Day LA 2016 Keynote - Reynold Xin, Co-Founder of Databricks


Scaling Big Data, a Spark perspective
Reynold Xin
@rxin
2016-07-09 Big Data LA

Scaling Big Data
Early adopters
Data Scientists
Statisticians
Physicists
R users
PyData
…
Citizen data scientists
Sophisticated engineering teams

Spark Philosophy
1. Unified engine
Support end-to-end applications
SQL, Streaming, ML, Graph, …
2. High-level APIs
Easy to use, rich optimizations
3. Integrate broadly
Storage systems, libraries, etc.

Apache Spark 2.0
Next major release, coming out in the next few weeks
• Unstable preview release at spark.apache.org
• 2.0.0-rc2 available on dev@spark mailing list
Remains highly compatible with Apache Spark 1.X
17k patches (2500 for 2.0) from 1200+ contributors

New in 2.0
Structured API improvements
(DataFrame, Dataset, SparkSession)
Structured Streaming
MLlib model export
R bindings
SQL 2003
Performance improvements
Broader Community
Deep learning libraries
(Baidu, Yahoo!, Berkeley, Databricks)
GraphFrames
PyData integration
Reactive streams
C# bindings: Mobius
JS bindings: EclairJS

Growing the Community
New initiatives from Databricks

The largest challenge in applying big data is
the skills gap.
StackOverflow Developer Survey 2016

Massive Open Online Courses
Free 5-course series on big data with Apache Spark
dbricks.co/mooc16
Introduction to Apache Spark™
Distributed Machine Learning with Apache Spark™
Big Data Analysis with Apache Spark™
Advanced Apache Spark™ for Data Science and Data Engineering
Advanced Machine Learning with Apache Spark™

Databricks Community Edition
Free version of Databricks with:
• Interactive tutorials
• Apache Spark and popular data science libraries
• Visualization & debug tools
databricks.com/ce

Demo
Link to demo: http://tinyurl.com/big-data-la-2016-demo

2016 Apache Spark Survey
http://tinyurl.com/spark2016survey
