SlideShare ist ein Scribd-Unternehmen logo
1 von 46
Downloaden Sie, um offline zu lesen
SPARK SUMMIT
EUROPE2016
HOW TO CONNECT SPARK TO
YOUR OWN DATASOURCE
Ross Lawley
MongoDB
{	name:	"Ross	Lawley",	role:	"Senior	Software	Engineer",	twitter:	"@RossC0"	}
MongoDB Spark connector timeline
•  Initial interest
–  Skunkworks project in March 2015
–  Introduction to Big Data with Apache Spark
–  Intern project in Summer 2015
•  Official Project started in Jan 2016
–  Written in Scala with very little Java needed
–  Python and R support via SparkSQL
–  Much interest internally and externally
Spark 101 - the core
RDDs maintain lineage information that can be used to
reconstruct lost partitions
val searches = spark.textFile("hdfs://...")
.filter(_.contains("Search"))
.map(_.split("t")(2)).cache()
.filter(_.contains("MongoDB"))
.count()
Mapped
RDD
Filtered
RDD
HDFS
RDD
Cached
RDD
Filtered
RDD Count
. . .
Spark Driver
Worker 1 Worker nWorker 2
Cluster
Manager
Data source
Spark topology
SPARK SUMMIT
EUROPE2016
&
Prior to the Spark Connector
HDFS HDFS HDFS
MongoDB
Hadoop
Connector
The MongoDB Spark Connector
MongoDB
Spark
Connector
Roll your own connector
The Golden rule
Learn from other connectors
1. Connecting to your data source
https://www.flickr.com/photos/memake/5506361372
Making a connection
•  Has a cost
–  The Mongo Java Driver runs a connection pool
Authenticates connections, replica set discovery etc..
•  There are two modes to support
–  Reading
–  Writing
Connections need configuration
•  Read Configuration
•  URI, database name and collection name
•  Partitioner
•  Sample Size (for inferring the schema)
•  Read Preference, Read Concern
•  Local threshold (for choosing the MongoS)
•  Write Configuration
•  URI, database name and collection name
•  Write concern
•  Local threshold (for choosing the MongoS)
Connector
case class MongoConnector(mongoClientFactory: MongoClientFactory) extends !
Logging with Serializable with Closeable {!
!
def withMongoClientDo[T](code: MongoClient => T): T = {!
val client = acquireClient()!
try {!
code(client)!
} finally {!
releaseClient(client)!
}!
}!
!
def withDatabaseDo[T](config: MongoCollectionConfig, code: MongoDatabase => T): T =!
withMongoClientDo({ client => code(client.getDatabase(config.databaseName)) })!
!
def withCollectionDo[D, T](config: MongoCollectionConfig, code: MongoCollection[D] => T) …!
!
private[spark] def acquireClient(): MongoClient = mongoClientCache.acquire(mongoClientFactory)!
private[spark] def releaseClient(client: MongoClient): Unit = mongoClientCache.release(client)!
}
Connection Optimization
•  MongoClientCache
–  A lockless cache for MongoClients.
–  Allowing multiple tasks access to a MongoClient.
–  Keeps clients alive for a period of time.
–  Timeouts and closes old.
Making a connection
•  Should be cheap as possible
–  Broadcast it so it can be reused.
–  Use a timed cache to promote reuse and ensure
closure of resources.
•  Configuration should be flexible
–  Spark Configuration
–  Options – Map[String, String]
–  ReadConfig / WriteConfig instances
2. Read data
https://www.flickr.com/photos/ericejohnson/4408073661/
Implement RDD[T] class
•  Partition the collection
•  Optionally provide preferred locations of a
partition
•  Compute the partition into an Iterator[T]
Partition
/**!
* An identifier for a partition in an RDD.!
*/!
trait Partition extends Serializable {!
/**!
* Get the partition's index within its parent RDD!
*/!
def index: Int!
!
// A better default implementation of HashCode!
override def hashCode(): Int = index!
!
override def equals(other: Any): Boolean = super.equals(other)!
}!
!
MongoPartition
Very simple, stores the information on how to get the data.
	
case	class	MongoPartition(	
				index:	Int,		
				queryBounds:	BsonDocument,		
				locations:	Seq[String])	extends	Partition
MongoPartitioner
/**!
* The MongoPartitioner provides the partitions of a collection!
*/!
trait MongoPartitioner extends Logging with Serializable {!
!
/**!
* Calculate the Partitions!
*!
* @param connector the MongoConnector!
* @param readConfig the [[com.mongodb.spark.config.ReadConfig]]!
* @return the partitions!
*/!
def partitions(connector: MongoConnector, !
readConfig: ReadConfig, !
pipeline: Array[BsonDocument]): Array[MongoPartition]!
!
}!
!
MongoSamplePartitioner
•  Over samples the collection
–  Calculate the number of partitions.
Uses the average document size and the configured partition size.
–  Samples the collection, sampling n number of documents per partition
–  Sorts the data by partition key
–  Takes each n partition
–  Adds a min and max key partition split at the start and end of the collection	
	
	{$gte: {_id: minKey}, $lt: {_id: 1}}
{$gte: {_id: 1}, $lt: {_id: 100}} {$gte: {_id: 5000}, $lt: {_id: maxKey}}
{$gte: {_id: 100}, $lt: {_id: 200}}
{$gte: {_id: 4900}, $lt: {_id: 5000}}
MongoShardedPartitioner
•  Examines the shard config database
–  Creates partitions based on the shard chunk min and max ranges
–  Stores the Shard location data for the chunk, to help promote locality
–  Adds a min and max key partition split at the start and end of the collection	
	
	
{$gte: {_id: minKey}, $lt: {_id: 1}} {$gte: {_id: 1000}, $lt: {_id: maxKey}}
Alternative Partitioners
•  MongoSplitVectorPartitioner
A partitioner for standalone or replicaSets. Command requires special privileges.
•  MongoPaginateByCountPartitioner
Creates a maximum number of partitions
Costs a query to calculate each partition
•  MongoPaginateBySizePartitioner
As above but using average document size to determine the partitions.
•  Create your own
Just implement the MongoPartitioner trait and add the full path to the config
Partitions
•  They are the foundations for RDD's
•  Super simple concept
•  Challenges for mutable data sources as not a
snapshot in time
RDD
Reads under the hood
MongoSpark.load(sparkSession).count()	
1.  Create a MongoRDD[Document]
2.  Partition the data
3.  Calculate the Partitions .
4.  Get the preferred locations and allocate workers
5.  For each partition:
i.  Queries and returns the cursor
ii.  Iterates the cursor and sums up the data
6.  Finally, the Spark application returns the sum of the sums.
Spark Driver
Each Spark Worker
Reads
•  Data must be serializable
•  Partitions provide parallelism
•  Partitioners should be configurable
No one size fits all
•  Partitioning strategy may be non obvious
–  Allow users to solve partitioning in their own way
Read Performance
•  MongoDB Usual Suspects
Document design
Indexes
Read Concern
•  Spark Specifics
Partitioning Strategy
Data Locality
. . .
Data locality
Data locality
MongoS	 MongoS	 MongoS	 MongoS	 MongoS	
. . .
Data locality
Configure: LocalThreshold, MongoShardedPartitioner
MongoS	 MongoS	 MongoS	 MongoS	 MongoS	
. . .
Data locality
MongoD	 MongoD	 MongoD	 MongoD	 MongoD	
MongoS	 MongoS	 MongoS	 MongoS	 MongoS	
. . .
Configure: ReadPreference, LocalThreshold, MongoShardedPartitioner
3. Writing data
https://www.flickr.com/photos/knightjh/3415926549
Writes under the hood
MongoSpark.save(dataFrame)	
1.  Create a connector
2.  For each partition:
1.  Group the data in batches
2.  Insert into the collection
* DataFrames / Datasets will upsert if there is an `_id`
4. Structured data
https://www.flickr.com/photos/fabola/15524427452
Why support structured data?
•  RDD's are the core to Spark
You can convert RDDs to Datasets – but the API can be painful for users
•  Fastest growing area of Spark
40% of users use DataFrames in production & 67% in prototypes*
•  You can provide Python and R support
62% of users use Spark with Python, behind Scala but gaining fast*
•  Performance improvements
Can passing filters and projections down to the data layer
* Figures from the Spark 2016 survey
Structured data in Spark
•  DataFrame == Dataset[Row]
–  RDD[Row]
•  Represents a row of data
•  Optionally define the schema of the data
•  Dataset
–  Efficiently decode and encode data
Create a DefaultSource
•  DataSourceRegister
Provide a shortname for the datasource (broken? not sure how its used)
•  RelationProvider
Produce relations for your data source – inferring the schema (read)
•  SchemaRelationProvider
Produce relations for your data source using the provided schema (read)
•  CreatableRelationProvider
Creates a relation based on the contents of the given DataFrame (write)
DefaultSource continued…
•  StreamSourceProvider (experimental)
Produce a streaming source of data
Not implemented - In theory could tail the OpLog or a capped collection
as a source of data.
•  StreamSinkProvider (experimental)
Produce a streaming sink for data
Inferring Schema
•  Provide it via a Case Class or Java bean
Uses Sparks reflection to provide the Schema
•  Alternatively Sample the collection
–  Sample 1000 documents by default
–  Convert each document to a StructType
For unsupported Bson Types we use extended json format
Would like support for User Defined Types
–  Use treeAggregate to find the compatible types
Uses TypeCoercion.findTightestCommonTypeOfTwo to coerce types
Conflicting types log a warning and downgrade to StringType
Create a BaseRelation
•  TableScan
Return all the data
•  PrunedScan
Return all the data but only the selected columns
•  PrunedFilteredScan
Returns filtered data and the selected columns
•  CatalystScan
Experimental access to logical execution plan
•  InsertableRelation
Allows data to be inserted via the INSERT INTO
Multi-language support!
// Scala!
sparkSession.read.format("com.mongodb.spark.sql").load()!
// Python!
sqlContext.read.format("com.mongodb.spark.sql").load()!
// R!
read.df(sqlContext, source = "com.mongodb.spark.sql")!
Rolling your own Connector
•  Review existing connectors
https://github.com/mongodb/mongo-spark
•  Partitioning maybe tricky but is core for parallelism.
•  DefaultSource opens the door to other languages
•  Schema inference is painful for Document
databases and users
•  Structured data is the future for Speedy Spark
Free, online training
http://university.mongodb.com
SPARK SUMMIT
EUROPE2016
THANK YOU.
Any questions come see us at the booth!

Weitere ähnliche Inhalte

Was ist angesagt?

The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemDatabricks
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceDatabricks
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDatabricks
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkDataWorks Summit
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark DownscalingDatabricks
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Icebergkbajda
 
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...Databricks
 
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...Flink Forward
 
Tuning PostgreSQL for High Write Throughput
Tuning PostgreSQL for High Write Throughput Tuning PostgreSQL for High Write Throughput
Tuning PostgreSQL for High Write Throughput Grant McAlister
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalDatabricks
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDatabricks
 
Under the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database ArchitectureUnder the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database ArchitectureScyllaDB
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 

Was ist angesagt? (20)

The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache Flink
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark Downscaling
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
 
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
 
Tuning PostgreSQL for High Write Throughput
Tuning PostgreSQL for High Write Throughput Tuning PostgreSQL for High Write Throughput
Tuning PostgreSQL for High Write Throughput
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
MyRocks Deep Dive
MyRocks Deep DiveMyRocks Deep Dive
MyRocks Deep Dive
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
 
Key-Value NoSQL Database
Key-Value NoSQL DatabaseKey-Value NoSQL Database
Key-Value NoSQL Database
 
Presto overview
Presto overviewPresto overview
Presto overview
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Under the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database ArchitectureUnder the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database Architecture
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 

Andere mochten auch

Spark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit
 
Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Spark Summit
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit
 
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Spark Summit
 
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei ZahariaTrends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei ZahariaSpark Summit
 
Spark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena LazovikSpark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena LazovikSpark Summit
 
Spark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit
 
Spark Summit EU talk by Larisa Sawyer
Spark Summit EU talk by Larisa SawyerSpark Summit EU talk by Larisa Sawyer
Spark Summit EU talk by Larisa SawyerSpark Summit
 
Spark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus GoehausenSpark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus GoehausenSpark Summit
 
Spark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub HavaSpark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub HavaSpark Summit
 
Spark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til PifflSpark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til PifflSpark Summit
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit
 
Spark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean WamplerSpark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean WamplerSpark Summit
 
How Apache Spark Is Helping Tame the Wild West of Wi-Fi
How Apache Spark Is Helping Tame the Wild West of Wi-FiHow Apache Spark Is Helping Tame the Wild West of Wi-Fi
How Apache Spark Is Helping Tame the Wild West of Wi-FiSpark Summit
 
Spark Summit EU talk by Pat Patterson
Spark Summit EU talk by Pat PattersonSpark Summit EU talk by Pat Patterson
Spark Summit EU talk by Pat PattersonSpark Summit
 
Spark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit
 
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit
 
Spark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim HunterSpark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim HunterSpark Summit
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit
 
Vital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent AppsVital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent AppsVital.AI
 

Andere mochten auch (20)

Spark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit EU talk by John Musser
Spark Summit EU talk by John Musser
 
Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
 
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
 
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei ZahariaTrends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
 
Spark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena LazovikSpark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena Lazovik
 
Spark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar Castaneda
 
Spark Summit EU talk by Larisa Sawyer
Spark Summit EU talk by Larisa SawyerSpark Summit EU talk by Larisa Sawyer
Spark Summit EU talk by Larisa Sawyer
 
Spark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus GoehausenSpark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus Goehausen
 
Spark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub HavaSpark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub Hava
 
Spark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til PifflSpark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til Piffl
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
 
Spark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean WamplerSpark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean Wampler
 
How Apache Spark Is Helping Tame the Wild West of Wi-Fi
How Apache Spark Is Helping Tame the Wild West of Wi-FiHow Apache Spark Is Helping Tame the Wild West of Wi-Fi
How Apache Spark Is Helping Tame the Wild West of Wi-Fi
 
Spark Summit EU talk by Pat Patterson
Spark Summit EU talk by Pat PattersonSpark Summit EU talk by Pat Patterson
Spark Summit EU talk by Pat Patterson
 
Spark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital Kedia
 
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
 
Spark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim HunterSpark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim Hunter
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
 
Vital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent AppsVital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent Apps
 

Ähnlich wie Spark Summit EU talk by Ross Lawley

How sitecore depends on mongo db for scalability and performance, and what it...
How sitecore depends on mongo db for scalability and performance, and what it...How sitecore depends on mongo db for scalability and performance, and what it...
How sitecore depends on mongo db for scalability and performance, and what it...Antonios Giannopoulos
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
Spring data presentation
Spring data presentationSpring data presentation
Spring data presentationOleksii Usyk
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasMapR Technologies
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
MongoDB Europe 2016 - Big Data meets Big Compute
MongoDB Europe 2016 - Big Data meets Big ComputeMongoDB Europe 2016 - Big Data meets Big Compute
MongoDB Europe 2016 - Big Data meets Big ComputeMongoDB
 
Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkBuilding highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkMartin Toshev
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterDon Drake
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to SparkLi Ming Tsai
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupFrens Jan Rumph
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2Gal Marder
 
MongoDB Knowledge Shareing
MongoDB Knowledge ShareingMongoDB Knowledge Shareing
MongoDB Knowledge ShareingPhilip Zhong
 
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)Kai Chan
 
ESWC SS 2013 - Tuesday Keynote Steffen Staab: Programming the Semantic Web
ESWC SS 2013 - Tuesday Keynote Steffen Staab: Programming the Semantic WebESWC SS 2013 - Tuesday Keynote Steffen Staab: Programming the Semantic Web
ESWC SS 2013 - Tuesday Keynote Steffen Staab: Programming the Semantic Webeswcsummerschool
 
Staab programming thesemanticweb
Staab programming thesemanticwebStaab programming thesemanticweb
Staab programming thesemanticwebAneta Tu
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptbhargavi804095
 

Ähnlich wie Spark Summit EU talk by Ross Lawley (20)

How sitecore depends on mongo db for scalability and performance, and what it...
How sitecore depends on mongo db for scalability and performance, and what it...How sitecore depends on mongo db for scalability and performance, and what it...
How sitecore depends on mongo db for scalability and performance, and what it...
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Spring data presentation
Spring data presentationSpring data presentation
Spring data presentation
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
 
MongoDB by Tonny
MongoDB by TonnyMongoDB by Tonny
MongoDB by Tonny
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
MongoDB Europe 2016 - Big Data meets Big Compute
MongoDB Europe 2016 - Big Data meets Big ComputeMongoDB Europe 2016 - Big Data meets Big Compute
MongoDB Europe 2016 - Big Data meets Big Compute
 
Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkBuilding highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache Spark
 
Masterclass Live: Amazon EMR
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMR
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark Meetup
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
MongoDB Knowledge Shareing
MongoDB Knowledge ShareingMongoDB Knowledge Shareing
MongoDB Knowledge Shareing
 
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
 
ESWC SS 2013 - Tuesday Keynote Steffen Staab: Programming the Semantic Web
ESWC SS 2013 - Tuesday Keynote Steffen Staab: Programming the Semantic WebESWC SS 2013 - Tuesday Keynote Steffen Staab: Programming the Semantic Web
ESWC SS 2013 - Tuesday Keynote Steffen Staab: Programming the Semantic Web
 
Staab programming thesemanticweb
Staab programming thesemanticwebStaab programming thesemanticweb
Staab programming thesemanticweb
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
 
Spark sql
Spark sqlSpark sql
Spark sql
 

Mehr von Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

Mehr von Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Kürzlich hochgeladen

Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Onlineanilsa9823
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlkumarajju5765
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 

Kürzlich hochgeladen (20)

Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 

Spark Summit EU talk by Ross Lawley

  • 1. SPARK SUMMIT EUROPE2016 HOW TO CONNECT SPARK TO YOUR OWN DATASOURCE Ross Lawley MongoDB
  • 3. MongoDB Spark connector timeline •  Initial interest –  Skunkworks project in March 2015 –  Introduction to Big Data with Apache Spark –  Intern project in Summer 2015 •  Official Project started in Jan 2016 –  Written in Scala with very little Java needed –  Python and R support via SparkSQL –  Much interest internally and externally
  • 4. Spark 101 - the core RDDs maintain lineage information that can be used to reconstruct lost partitions val searches = spark.textFile("hdfs://...") .filter(_.contains("Search")) .map(_.split("t")(2)).cache() .filter(_.contains("MongoDB")) .count() Mapped RDD Filtered RDD HDFS RDD Cached RDD Filtered RDD Count
  • 5. . . . Spark Driver Worker 1 Worker nWorker 2 Cluster Manager Data source Spark topology
  • 7. Prior to the Spark Connector HDFS HDFS HDFS MongoDB Hadoop Connector
  • 8. The MongoDB Spark Connector MongoDB Spark Connector
  • 9. Roll your own connector
  • 10. The Golden rule Learn from other connectors
  • 11. 1. Connecting to your data source https://www.flickr.com/photos/memake/5506361372
  • 12. Making a connection •  Has a cost –  The Mongo Java Driver runs a connection pool Authenticates connections, replica set discovery etc.. •  There are two modes to support –  Reading –  Writing
  • 13. Connections need configuration •  Read Configuration •  URI, database name and collection name •  Partitioner •  Sample Size (for inferring the schema) •  Read Preference, Read Concern •  Local threshold (for choosing the MongoS) •  Write Configuration •  URI, database name and collection name •  Write concern •  Local threshold (for choosing the MongoS)
  • 14. Connector case class MongoConnector(mongoClientFactory: MongoClientFactory) extends ! Logging with Serializable with Closeable {! ! def withMongoClientDo[T](code: MongoClient => T): T = {! val client = acquireClient()! try {! code(client)! } finally {! releaseClient(client)! }! }! ! def withDatabaseDo[T](config: MongoCollectionConfig, code: MongoDatabase => T): T =! withMongoClientDo({ client => code(client.getDatabase(config.databaseName)) })! ! def withCollectionDo[D, T](config: MongoCollectionConfig, code: MongoCollection[D] => T) …! ! private[spark] def acquireClient(): MongoClient = mongoClientCache.acquire(mongoClientFactory)! private[spark] def releaseClient(client: MongoClient): Unit = mongoClientCache.release(client)! }
  • 15. Connection Optimization •  MongoClientCache –  A lockless cache for MongoClients. –  Allowing multiple tasks access to a MongoClient. –  Keeps clients alive for a period of time. –  Timeouts and closes old.
  • 16. Making a connection •  Should be cheap as possible –  Broadcast it so it can be reused. –  Use a timed cache to promote reuse and ensure closure of resources. •  Configuration should be flexible –  Spark Configuration –  Options – Map[String, String] –  ReadConfig / WriteConfig instances
  • 18. Implement RDD[T] class •  Partition the collection •  Optionally provide preferred locations of a partition •  Compute the partition into an Iterator[T]
  • 19. Partition /**! * An identifier for a partition in an RDD.! */! trait Partition extends Serializable {! /**! * Get the partition's index within its parent RDD! */! def index: Int! ! // A better default implementation of HashCode! override def hashCode(): Int = index! ! override def equals(other: Any): Boolean = super.equals(other)! }! !
  • 20. MongoPartition Very simple, stores the information on how to get the data. case class MongoPartition( index: Int, queryBounds: BsonDocument, locations: Seq[String]) extends Partition
  • 21. MongoPartitioner /**! * The MongoPartitioner provides the partitions of a collection! */! trait MongoPartitioner extends Logging with Serializable {! ! /**! * Calculate the Partitions! *! * @param connector the MongoConnector! * @param readConfig the [[com.mongodb.spark.config.ReadConfig]]! * @return the partitions! */! def partitions(connector: MongoConnector, ! readConfig: ReadConfig, ! pipeline: Array[BsonDocument]): Array[MongoPartition]! ! }! !
  • 22. MongoSamplePartitioner •  Over samples the collection –  Calculate the number of partitions. Uses the average document size and the configured partition size. –  Samples the collection, sampling n number of documents per partition –  Sorts the data by partition key –  Takes each n partition –  Adds a min and max key partition split at the start and end of the collection {$gte: {_id: minKey}, $lt: {_id: 1}} {$gte: {_id: 1}, $lt: {_id: 100}} {$gte: {_id: 5000}, $lt: {_id: maxKey}} {$gte: {_id: 100}, $lt: {_id: 200}} {$gte: {_id: 4900}, $lt: {_id: 5000}}
  • 23. MongoShardedPartitioner •  Examines the shard config database –  Creates partitions based on the shard chunk min and max ranges –  Stores the Shard location data for the chunk, to help promote locality –  Adds a min and max key partition split at the start and end of the collection {$gte: {_id: minKey}, $lt: {_id: 1}} {$gte: {_id: 1000}, $lt: {_id: maxKey}}
  • 24. Alternative Partitioners •  MongoSplitVectorPartitioner A partitioner for standalone or replicaSets. Command requires special privileges. •  MongoPaginateByCountPartitioner Creates a maximum number of partitions Costs a query to calculate each partition •  MongoPaginateBySizePartitioner As above but using average document size to determine the partitions. •  Create your own Just implement the MongoPartitioner trait and add the full path to the config
  • 25. Partitions •  They are the foundations for RDD's •  Super simple concept •  Challenges for mutable data sources as not a snapshot in time RDD
  • 26. Reads under the hood MongoSpark.load(sparkSession).count() 1.  Create a MongoRDD[Document] 2.  Partition the data 3.  Calculate the Partitions . 4.  Get the preferred locations and allocate workers 5.  For each partition: i.  Queries and returns the cursor ii.  Iterates the cursor and sums up the data 6.  Finally, the Spark application returns the sum of the sums. Spark Driver Each Spark Worker
  • 27. Reads •  Data must be serializable •  Partitions provide parallelism •  Partitioners should be configurable No one size fits all •  Partitioning strategy may be non obvious –  Allow users to solve partitioning in their own way
  • 28. Read Performance •  MongoDB Usual Suspects Document design Indexes Read Concern •  Spark Specifics Partitioning Strategy Data Locality
  • 29. . . . Data locality
  • 30. Data locality MongoS MongoS MongoS MongoS MongoS . . .
  • 31. Data locality Configure: LocalThreshold, MongoShardedPartitioner MongoS MongoS MongoS MongoS MongoS . . .
  • 32. Data locality MongoD MongoD MongoD MongoD MongoD MongoS MongoS MongoS MongoS MongoS . . . Configure: ReadPreference, LocalThreshold, MongoShardedPartitioner
  • 34. Writes under the hood MongoSpark.save(dataFrame) 1.  Create a connector 2.  For each partition: 1.  Group the data in batches 2.  Insert into the collection * DataFrames / Datasets will upsert if there is an `_id`
  • 35.
  • 37. Why support structured data? •  RDD's are the core to Spark You can convert RDDs to Datasets – but the API can be painful for users •  Fastest growing area of Spark 40% of users use DataFrames in production & 67% in prototypes* •  You can provide Python and R support 62% of users use Spark with Python, behind Scala but gaining fast* •  Performance improvements Can passing filters and projections down to the data layer * Figures from the Spark 2016 survey
  • 38. Structured data in Spark •  DataFrame == Dataset[Row] –  RDD[Row] •  Represents a row of data •  Optionally define the schema of the data •  Dataset –  Efficiently decode and encode data
  • 39. Create a DefaultSource •  DataSourceRegister Provide a shortname for the datasource (broken? not sure how its used) •  RelationProvider Produce relations for your data source – inferring the schema (read) •  SchemaRelationProvider Produce relations for your data source using the provided schema (read) •  CreatableRelationProvider Creates a relation based on the contents of the given DataFrame (write)
  • 40. DefaultSource continued… •  StreamSourceProvider (experimental) Produce a streaming source of data Not implemented - In theory could tail the OpLog or a capped collection as a source of data. •  StreamSinkProvider (experimental) Produce a streaming sink for data
  • 41. Inferring Schema •  Provide it via a Case Class or Java bean Uses Sparks reflection to provide the Schema •  Alternatively Sample the collection –  Sample 1000 documents by default –  Convert each document to a StructType For unsupported Bson Types we use extended json format Would like support for User Defined Types –  Use treeAggregate to find the compatible types Uses TypeCoercion.findTightestCommonTypeOfTwo to coerce types Conflicting types log a warning and downgrade to StringType
  • 42. Create a BaseRelation •  TableScan Return all the data •  PrunedScan Return all the data but only the selected columns •  PrunedFilteredScan Returns filtered data and the selected columns •  CatalystScan Experimental access to logical execution plan •  InsertableRelation Allows data to be inserted via the INSERT INTO
  • 43. Multi-language support! // Scala! sparkSession.read.format("com.mongodb.spark.sql").load()! // Python! sqlContext.read.format("com.mongodb.spark.sql").load()! // R! read.df(sqlContext, source = "com.mongodb.spark.sql")!
  • 44. Rolling your own Connector •  Review existing connectors https://github.com/mongodb/mongo-spark •  Partitioning maybe tricky but is core for parallelism. •  DefaultSource opens the door to other languages •  Schema inference is painful for Document databases and users •  Structured data is the future for Speedy Spark
  • 46. SPARK SUMMIT EUROPE2016 THANK YOU. Any questions come see us at the booth!