Copyright © 2016, Oracle and/or its affiliates. All rights reserved.
Performance in Spark 2.0
…and onward to the next level
Brad Carlile
Sr. Director, Strategic Applications Engineering (SAE)
Oracle Systems Group
August 18, 2016

Additional info on system performance results:
http://blogs.oracle.com/bestperf

Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for
information purposes only, and may not be incorporated into any contract. It is not a
commitment to deliver any material, code, or functionality, and should not be relied upon
in making purchasing decisions. The development, release, and timing of any features or
functionality described for Oracle’s products remains at the sole discretion of Oracle.

About me
Sr. Director, Strategic Applications Engineering, Oracle Systems Group
•  Oracle Hardware Systems Performance & benchmarks: x86 & SPARC products
•  30 years of parallel programming, performance optimization & benchmarks: attached processors, Hypercube, MPPs, SMPs, NUMA…
•  Floating Point Systems, Cray, Sun, Oracle
•  Skier
•  Traveler
•  Big Wall Climber
•  I Drive an Art Car

Spark is Exciting – But You Already Know That!
Spark’s Whole Ecosystem totally appeals to the performance expert in me!
•  Great Ecosystem for Analytics
– Spark forms a continuum with other technologies (Kafka, Solr, …)
•  Spark is fast & scalable because of clean design
– Spark’s in-memory focus
•  Spark speaks SQL & DataFrames
– Perfect marriage for Data Scientists & Hardware Acceleration
– Oracle’s Software in Silicon can be used on Apache Spark
•  Lots of Oracle customers using Apache Spark

Apache Spark
•  Spark API: Scala, Python, R, Java
•  Spark SQL (SQL & DSL in Scala/Java/Python…), DataFrames / Datasets
•  Spark Streaming (streaming analytics, micro-batches)
•  Spark MLlib (machine learning & statistics routines)
•  GraphX (graph)
•  Spark Core (implemented in Scala on the JVM, operates on RDDs)
•  Data Sources: JSON, CSV, Hadoop, Cassandra, Hive, HBase, Postgres, MySQL, Elasticsearch, ...

Apache Spark 2.0.0 Great New Features
Best new features - Starting with Spark SQL
•  Spark SQL
– SQL 2003 and Unified DataFrames/Datasets API
– Tungsten Phase 2: Whole-Stage CodeGen
•  Spark MLlib & GraphX – Large Scale Machine Learning on Apache Spark
•  Structured Streaming

SQL is a Powerful Feature of Apache Spark
•  SQL is a powerful language for the wide range of Data Scientists
– Expresses set operations on data of any size: sort, filter, manipulate, etc.
•  SQL concisely expresses data manipulation at scale in a readable way
– ETL (Extract, Transform, and Load)
– Feature selection for ML
– Feature creation/generation for ML
– Report generation
•  Many well-known techniques to efficiently optimize SQL
– Apache Spark “Catalyst optimizer” reorganizes the query for fastest execution
– Extensible: you can contribute your own optimizations

Spark SQL & DataFrames: Perfect for Data Scientists
Amazing work which also efficiently puts Data In Memory for fast access
•  SQL is the easiest way to write code to filter, merge, scan, …
scala> val dfr = sparkSession.sql("""SELECT firstName, age, gender
           FROM cust
           WHERE age > 20
           ORDER BY age DESC""")
scala> dfr.show()
•  DataFrames are columnar and typed // schema: firstName:String, lastName:String, gender:String, age:Int
– DataFrames are better for analytics than generic RDDs

RDD: row-based, original Apache Spark data structure
Jay,Lock,M,81
Rosa,Ruiz,F,14
Clair,Bride,F,23

DataFrame: columnar
Jay    Lock   M  81
Rosa   Ruiz   F  14
Clair  Bride  F  23

Columns: what is in them? Why not just row-store?
Different characteristics of a data point are stored in different columns
•  Each customer data point, event, etc. has many kinds of possible data
•  A Spark schema defines the columns and their types;
there can be huge numbers of columns, 10s to 100s (very wide tables)

Name | Addr   | St | City | Zip   | Age | Id’ed Gender | Edu | Sal | Marital | Loyalty plan | Cust Year | Buy Freq | Fav Devices | Com mode | …
Duma | 6 nw 7 | Or | PDX  | 97223 | 25  | F            | MA  | 60k | M       | Lev2b        | 5.7       | .2       | iPh         | txt      |

val	loSchema	=	StructType(	
								StructField("lo_orderkey",	IntegerType,	true)	::	
								StructField("lo_linenumber",	IntegerType,	true)	::	
								StructField("lo_custkey",	IntegerType,	true)	::	
								StructField("lo_partkey",	IntegerType,	true)	::	
								StructField("lo_suppkey",	IntegerType,	true)	::	
								StructField("lo_orderdate",	StringType,	true)	::	
								StructField("lo_orderpriority",	StringType,	true)	::	
								StructField("lo_shippriority",	StringType,	true)	::	
        StructField("lo_quantity", IntegerType, true) ::
								StructField("lo_extprice",	IntegerType,	true)	::	
								StructField("lo_ordtotalprice",	IntegerType,	true)	::	
								StructField("lo_discount",	IntegerType,	true)	::	
								StructField("lo_revenue",	IntegerType,	true)	::	
								StructField("lo_supplycost",	IntegerType,	true)	::	
								StructField("lo_tax",	IntegerType,	true)	::	
								StructField("lo_commitdate",	IntegerType,	true)	::	
								StructField("lo_shipmode",	StringType,	true)	::		
								StructField("lo_ordermode",	StringType,	true)	::		
        StructField("lo_webtime", FloatType, true) ::
        StructField("lo_shopcarttime", FloatType, true) ::
							.	.	.		
								StructField("lo_loyaltyPercnt",	FloatType,		true)	::		Nil)	
An analysis may only need to explore a small subset of these columns.
It is much faster to access only the needed columns, and there are also many server benefits!
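
The benefit of reading only the needed columns can be sketched in plain Python. This is an illustration of the two layouts, not Spark's internal format: the three sample records reuse the names from the RDD/DataFrame example, and `rows`, `columns`, and the two counting functions are hypothetical names introduced here.

```python
# Row layout: one tuple per record; a scan touches every full record.
rows = [
    ("Jay", "Lock", "M", 81),
    ("Rosa", "Ruiz", "F", 14),
    ("Clair", "Bride", "F", 23),
]

# Column layout: one list per column; a scan touches only the columns it needs.
names = ["firstName", "lastName", "gender", "age"]
columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}


def count_older_than_rows(rows, limit):
    """Row store: every record is visited just to read one field."""
    return sum(1 for row in rows if row[3] > limit)


def count_older_than_columns(columns, limit):
    """Column store: only the contiguous 'age' column is read."""
    return sum(1 for age in columns["age"] if age > limit)


print(count_older_than_rows(rows, 20), count_older_than_columns(columns, 20))
```

Both functions return the same count; the difference is how much data each one has to walk past, which grows with the number of columns in a wide table.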

Apache Spark 2.0.0
New SparkSession replacing the SQLContext
https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html

OLD
import org.apache.spark.{SparkConf, SparkContext}

case class Customer(c_custkey: Int, c_name: String,
  c_address: String, c_city: String, c_phone: String,
  c_mktsegment: String)
...
object RCDBTestProgram {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf()
      .setAppName("RCDB")
    val sc = new SparkContext(sparkConf)
    val sqlContext =
      new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._  // needed for .toDF()
    ...
    val dataDir = "file:/Users/bc/datasets/R_SF1000/"
    // Parse each ';'-separated line into the six Customer fields
    val df1 = sc.textFile(dataDir + "customer.csv")
      .map(_.split(";"))
      .map(p => Customer(p(0).trim.toInt, p(1).trim,
        p(2).trim, p(3).trim, p(4).trim, p(5).trim))
      .toDF()
    df1.registerTempTable("customer")
    ...
    val query = """SELECT COUNT(*) FROM customer"""
    val count = sqlContext.sql(query).take(1)
  }
}

New
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.storage.StorageLevel

object RCDBTestProgram {
  def main(args: Array[String]) {
    val sparkSession = SparkSession.builder
      .appName("RCDB").getOrCreate()
    ...
    val dataDir = "file:///Users/bc/datasets/R_SF1000/"
    // Explicit schema: column names and types for the CSV file
    val custSchema = StructType(
      StructField("c_custkey", IntegerType, true) ::
      StructField("c_name", StringType, true) ::
      StructField("c_address", StringType, true) ::
      StructField("c_city", StringType, true) ::
      StructField("c_phone", StringType, true) ::
      StructField("c_mktsegment", StringType, true) :: Nil)
    val df3 = sparkSession.read.option("sep", ";")
      .option("header", "false").schema(custSchema)
      .csv(dataDir + "customer.csv").toDF().repartition(256)
    df3.createOrReplaceTempView("customer")
    df3.persist(StorageLevel.OFF_HEAP)
    ...
    val query = """SELECT COUNT(*) FROM customer"""
    val count = sparkSession.sql(query).take(1)
    ...
    sparkSession.sql("SHOW TABLES").show()
    sparkSession.catalog.listTables.show()
  }
}

SQL and DSL Both Optimized by Catalyst Optimizer
Ex: “Select all books by authors born after 1980 named ‘Paulo’ from books & authors”

SQL
SELECT *
FROM author a
JOIN book b ON a.id = b.author_id
WHERE a.year_of_birth > 1980
     AND a.first_name = 'Paulo'
ORDER BY b.title

Query DSL (Domain Specific Language)
val joinDF =
  author.as('a)
    .join(book.as('b), $"a.id" === $"b.author_id")
    .filter($"a.year_of_birth" > 1980)
    .filter($"a.first_name" === "Paulo")
    .orderBy($"b.title")

SQL / Query DSL -> Catalyst Optimizer -> Optimized Plan
(can I reorder various operations?)

Apache Spark SQL: Catalyst Optimizes SQL/DSL Execution
Qbet:  SELECT count(*)
            FROM lineorder
            WHERE lo_quantity BETWEEN 10 and 20
You can look at the optimized plan with explain. The plan below was created by using:
sparkSession.sql(query).explain()

*HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)])
      +- *Project
         +- *Filter ((isnotnull(lo_quantity#89) && (lo_quantity#89 >= 10)) && (lo_quantity#89 <= 20))
            +- InMemoryTableScan [lo_quantity#89], [isnotnull(lo_quantity#89), (lo_quantity#89 >= 10), (lo_quantity#89 <= 20)]
               :  +- InMemoryRelation [lo_orderkey#81, lo_linenumber#82, lo_custkey#83, lo_partkey#84, lo_suppkey#85, lo_orderdate#86,
                       lo_orderpriority#87, lo_shippriority#88, lo_quantity#89, lo_extendedprice#90, lo_ordtotalprice#91, lo_discount#92,
                       lo_revenue#93, lo_supplycost#94, lo_tax#95, lo_commitdate#96, lo_shipmode#97],
                       false, 16384, StorageLevel(disk, memory, offheap, 1 replicas)
               :     :  +- Exchange RoundRobinPartitioning(16)
               :     :     +- *Scan csv [lo_orderkey#81,lo_linenumber#82,lo_custkey#83,lo_partkey#84,lo_suppkey#85,lo_orderdate#86,
                                  lo_orderpriority#87,lo_shippriority#88,lo_quantity#89,lo_extendedprice#90,lo_ordtotalprice#91,lo_discount#92,
                                  lo_revenue#93, lo_supplycost#94,lo_tax#95,lo_commitdate#96,lo_shipmode#97] Format: CSV,
                                  InputPaths: file:/Users/bradcarlile/datasets/RCDB_SF1/lineorder-new.csv,
                                  PushedFilters: [],
                                  ReadSchema: struct<lo_orderkey:int,lo_linenumber:int,lo_custkey:int,lo_partkey:int,lo_suppkey:int,lo_orderdat...

SQL and DSL Both Optimized by Catalyst Optimizer
The same SQL and Query DSL example (books by authors born after 1980 named ‘Paulo’) flows through:
SQL / Query DSL -> Catalyst Optimizer -> Optimized Plan (can I reorder various operations?) -> Whole-Stage CodeGen (new in Spark 2.0)

Spark 2.0: A Big Performance Increase with Tungsten Phase 2
SELECT count(*) FROM lineorder WHERE lo_quantity BETWEEN 10 and 20
1.  Volcano Iterator model: SQL plan interpretation
–  Open;
–  Next data element;
–  Perform predicate;
–  Close
–  …Iterate
2.  “College Freshman” Java code: Tungsten Whole-Stage CodeGen pipelined code
–  for (lo_quantity in lineorder) {
–    if (lo_quantity >= 10 and lo_quantity <= 20)
–    { count += 1 }
–  }
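
The contrast between the two execution styles can be sketched in plain Python. This is a toy model, not Spark's implementation: the `Scan` and `Filter` operator classes and both counting functions are hypothetical names, and the sample `lo_quantity` values are made up. The interpreted version pays one chain of method calls per row, while the fused version is the single loop that whole-stage code generation aims to produce.

```python
class Scan:
    """Leaf operator: hands out one value per next() call."""

    def __init__(self, values):
        self.values = values
        self.pos = 0

    def open(self):
        self.pos = 0

    def next(self):
        if self.pos >= len(self.values):
            return None  # end of input
        value = self.values[self.pos]
        self.pos += 1
        return value

    def close(self):
        self.pos = 0


class Filter:
    """Pulls from its child until a value satisfies the predicate."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def open(self):
        self.child.open()

    def next(self):
        while True:
            value = self.child.next()
            if value is None or self.predicate(value):
                return value

    def close(self):
        self.child.close()


def count_volcano(values, low, high):
    """Interpreted plan: open, repeated next with predicate, close."""
    plan = Filter(Scan(values), lambda q: low <= q <= high)
    plan.open()
    count = 0
    while plan.next() is not None:
        count += 1
    plan.close()
    return count


def count_fused(values, low, high):
    """Whole-stage style: scan, filter, and count collapsed into one loop."""
    count = 0
    for q in values:
        if low <= q <= high:
            count += 1
    return count


lo_quantity = [5, 10, 15, 20, 25, 12]
print(count_volcano(lo_quantity, 10, 20), count_fused(lo_quantity, 10, 20))
```

Both functions give the same answer for the BETWEEN 10 and 20 predicate (inclusive on both ends); only the amount of per-row dispatch differs.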

Sample Whole-Stage CodeGen
val codeDf = sparkSession.sql("explain codegen " + query)
codeDf.show(false)
Subtree 1/2 Generated code:
/* 033 */   private void agg_doAggregateWithoutKey() throws java.io.IOException {
/* 034 */     // initialize aggregation buffer
/*	035	*/					agg_bufIsNull	=	false;	
/*	036	*/					agg_bufValue	=	0L;	
/*	037	*/	
/*	038	*/					while	(inputadapter_input.hasNext())	{	
/*	039	*/							InternalRow	inputadapter_row	=	(InternalRow)	inputadapter_input.next();	
/*	040	*/							long	inputadapter_value	=	inputadapter_row.getLong(0);	
/*	041	*/	
/*	042	*/							//	do	aggregate	
/*	043	*/							//	common	sub-expressions	
/* 045 */       // evaluate aggregate function
/*	046	*/							boolean	agg_isNull3	=	false;	
/*	047	*/	
/*	048	*/							long	agg_value3	=	-1L;	
/*	049	*/							agg_value3	=	agg_bufValue	+	inputadapter_value;	
/* 050 */       // update aggregation buffer
/*	051	*/							agg_bufIsNull	=	false;	
/*	052	*/							agg_bufValue	=	agg_value3;	
/*	053	*/							if	(shouldStop())	return;	
/*	054	*/					}	
/*	056	*/			}	
/* 058 */   protected void processNext() throws java.io.IOException {
/*	059	*/					while	(!agg_initAgg)	{	
/*	060	*/							agg_initAgg	=	true;	
/*	061	*/							long	agg_beforeAgg	=	System.nanoTime();	
/*	062	*/							agg_doAggregateWithoutKey();	
/*	063	*/							agg_aggTime.add((System.nanoTime()	-	agg_beforeAgg)	/	1000000);	
/*	064	*/	
/*	065	*/							//	output	the	result	
/*	067	*/							agg_numOutputRows.add(1);	
/*	068	*/							agg_rowWriter.zeroOutNullBytes();	
/*	069	*/	
/*	070	*/							if	(agg_bufIsNull)	{	
/*	071	*/									agg_rowWriter.setNullAt(0);	
/*	072	*/							}	else	{	
/*	073	*/									agg_rowWriter.write(0,	agg_bufValue);	
/*	074	*/							}	
/*	075	*/							append(agg_result);	}	}	}	
Subtree	2/2	Generated	code:	
/* 033 */   private void agg_doAggregateWithoutKey() throws java.io.IOException {
/* 034 */     // initialize aggregation buffer
/*	035	*/					agg_bufIsNull	=	false;	
/*	036	*/					agg_bufValue	=	0L;	
/*	037	*/	
/*	038	*/					while	(inputadapter_input.hasNext())	{	
/*	039	*/							InternalRow	inputadapter_row	=	(InternalRow)	inputadapter_input.next();	
/*	040	*/							//	do	aggregate	
/*	041	*/							//	common	sub-expressions	
/* 043 */       // evaluate aggregate function
/*	044	*/							boolean	agg_isNull1	=	false;	
/*	045	*/	
/*	046	*/							long	agg_value1	=	-1L;	
/*	047	*/							agg_value1	=	agg_bufValue	+	1L;	
/* 048 */       // update aggregation buffer
/*	049	*/							agg_bufIsNull	=	false;	
/*	050	*/							agg_bufValue	=	agg_value1;	
/*	051	*/							if	(shouldStop())	return;	
/*	052	*/					}	
/*	054	*/			}	
/* 056 */   protected void processNext() throws java.io.IOException {
/*	057	*/					while	(!agg_initAgg)	{	
/*	058	*/							agg_initAgg	=	true;	
/*	059	*/							long	agg_beforeAgg	=	System.nanoTime();	
/*	060	*/							agg_doAggregateWithoutKey();	
/*	061	*/							agg_aggTime.add((System.nanoTime()	-	agg_beforeAgg)	/	1000000);	
/*	062	*/	
/*	063	*/							//	output	the	result	
/*	065	*/							agg_numOutputRows.add(1);	
/*	066	*/							agg_rowWriter.zeroOutNullBytes();	
/*	067	*/	
/*	068	*/							if	(agg_bufIsNull)	{	
/*	069	*/									agg_rowWriter.setNullAt(0);	
/*	070	*/							}	else	{	
/*	071	*/									agg_rowWriter.write(0,	agg_bufValue);	
/*	072	*/							}	
/*	073	*/							append(agg_result);	}	}	}	
query = "SELECT count(*) FROM lineorder WHERE lo_quantity BETWEEN 10 and 20"

The Basic Analytics Flow
A lot of time spent in Data Munging

•  Data can come from many sources
– Databases, NoSQL, csv, feeds…
•  We need to prepare it
– Data Munging of all sorts!
•  Analyze the data
– Find the “right way” to analyze it
• ML, Graph, SQL…

Flow: Databases; NoSQL, Search, …; Streaming (Kafka, Storm…) -> Data Munging -> Analytics -> RESULTS

Continuous Analytics Cycle
Results Often Enrich Transactions
Analytics is more than one pipeline stream
•  Continuous iterations around the data analytics wheel
– Save, catalog, and re-use all things: data, SQL, code, and analytics
•  In-memory advantages at each stage
– SPARC’s DAX & leading bandwidth is key
•  Many sources of data
– Internal proprietary, public data, external streaming, archives
•  All modern Apps (Uber, Netflix, Amazon, FB …) are enhancing transactions with real-time analytics

Diagram stages: Streaming (Kafka, Storm, …), Databases, NoSQL; ETL (SQL); Feature Extract, Generate & Transform (SQL); In-Memory ML & Graph (FP); Result Delivery (SQL); Enhance Transactions; Reports; Ad Hoc

Selecting Features & Generating Features
Kaggle Titanic Challenge: SQL & Scala

SQL
sqlContext.udf.register("newTitle", titles.getOrElse((_: String), "Other"))
sqlContext.udf.register("toString", (_: Int).toString)
val avgAge = dataDFRaw.select("Age").agg(avg("Age")).first().getDouble(0)
val avgFare = dataDFRaw.select("Fare").agg(avg("Fare")).first().getDouble(0)

val query = s"""SELECT
    PassengerId,
    toString(Survived) AS SurvivedString,
    Pclass,
    Name,
    newTitle(regexp_extract(Name, ".*, (.*?)..*",1)) AS Title,
    Sex,
    NVL(Age,$avgAge) AS Age,
    IF(SibSp + Parch > 3, 1, 0) AS WithFamily,
    NVL(Fare,$avgFare) AS Fare,
    NVL(Embarked,'S') AS Embarked
FROM dataDFRaw"""

…Same with Scala
def preprocess(data: DataFrame, sqlContext: SQLContext, train: Boolean): DataFrame = {
  var dataTrain = data
  val avgAge = dataTrain.select(mean("Age")).first()(0).asInstanceOf[Double]
  println("avgAge = " + avgAge)
  val avgFare = dataTrain.select(mean("Fare")).first()(0).asInstanceOf[Double]

  // UDF: 1.0 when siblings/spouses plus parents/children exceed 3
  val withFamily = sqlContext.udf.register("withFamily", (sib: Int, par: Int) =>
    if (sib + par > 3) 1.0 else 0.0)

  // UDF: an imputed age (== avgAge) for "Master."/"Miss." titles becomes 14.1
  val fillAge = sqlContext.udf.register("fillAge", (age: Double, title: String) =>
    if ((age == avgAge) && (title.equals("Master.") || title.equals("Miss."))) 14.1 else age)

  // UDF: passengers under 15 are relabeled "mChild" / "fChild"
  val addChild = sqlContext.udf.register("addChild", (sex: String, age: Double) =>
    if (age < 15) { if (sex == "male") "mChild" else "fChild" } else sex)

  val toDouble = sqlContext.udf.register("toDouble", (n: Int) => n.toDouble)

  // findTitle is a UDF that is not shown on the slide
  dataTrain = dataTrain.withColumn("Title", findTitle(dataTrain("Name")))
  dataTrain = dataTrain.na.fill(avgAge, Seq("Age"))
  dataTrain = dataTrain.withColumn("fixAge", fillAge(dataTrain("Age"), dataTrain("Title")))
  dataTrain = dataTrain.withColumn("Pclass", toDouble(dataTrain("Pclass")))
  dataTrain = dataTrain.na.fill(avgFare, Seq("Fare"))
  dataTrain = dataTrain
    .withColumn("withFamily", withFamily(dataTrain("SibSp"), dataTrain("Parch")))
  dataTrain.withColumn("sexMod", addChild(dataTrain("Sex"), dataTrain("Age")))
}

ETL (Extract, Transform, Load) & Feature Creation
val query = """SELECT x,
				year	(x)	AS	Year,	
				month	(x)	AS	Month,	
				dayofmonth(x)	AS	DoM,	
				dayofyear(x)	AS	DoY,	
				date_format(x,"EEEE")	AS	longDoW,		
				date_format(x,"EE")	AS	shortDoW,	
				date_format(x,"u")	AS	DoW,	
				IF	(date_format(x,"u")	<	6	,	0,	1)	AS	weekendFlag,			
				IF	(date_format(x,"u")	<	6,	datediff(next_day(x,	"Sat"),x),	0)	AS	daystoWeekend,		
				floor(months_between(current_date(),x)/12)	AS	curAge,	
				IF	(dayofyear(CONCAT(year(x),"-12-31"))	>	365,	1,	0)	AS	leapYearFlag,	
				IF	(month(x)	=	12	AND	dayofmonth(x)	>	25,		months_between(CONCAT(year(x)+1,"-12-25"),x),	
												months_between(CONCAT(year(x),"-12-25"),x))	AS	monthstoXmas,		
				IF	(month(x)	=	12	AND	dayofmonth(x)	>	25,	datediff(CONCAT(year(x)+1,"-12-25"),x),	
												datediff(CONCAT(year(x),"-12-25"),x))	AS	daystoXmas,	
				IF	((datediff(CONCAT(year(x),"-12-25"),	x)	<=	14)	AND	(datediff(CONCAT(year(x),"-12-25"),	x)	>=	0),1,	0)	AS	x14dayBefore,	
				quarter(add_months(x,-2))	AS	Season,	
				quarter(x)	AS	Qtr	
				FROM	inputDate""".trim()		
You can use SQL to create new features from data, for example generating time features
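
To make a few of these derived features concrete, here is a plain-Python sketch that computes the same values for a single date using only the standard library. It is not Spark code: `time_features` is a hypothetical helper, only a subset of the query's columns is reproduced, and the `Season` formula is my reading of `quarter(add_months(x,-2))` (March to May maps to 1, December to February maps to 4).

```python
from datetime import date
import calendar


def time_features(x):
    """Derive a few of the calendar features from the query for one date."""
    dow = x.isoweekday()  # 1 = Monday ... 7 = Sunday, like date_format(x, "u")
    # After December 25, count toward next year's Christmas.
    xmas_year = x.year + 1 if (x.month == 12 and x.day > 25) else x.year
    return {
        "Year": x.year,
        "Month": x.month,
        "DoM": x.day,
        "DoY": x.timetuple().tm_yday,
        "DoW": dow,
        "weekendFlag": 0 if dow < 6 else 1,
        "leapYearFlag": 1 if calendar.isleap(x.year) else 0,
        "daystoXmas": (date(xmas_year, 12, 25) - x).days,
        "Qtr": (x.month - 1) // 3 + 1,
        "Season": ((x.month - 3) % 12) // 3 + 1,
    }


features = time_features(date(2016, 8, 18))
print(features)
```

In Spark the SQL form has the advantage that every row of `inputDate` is processed by the engine's built-in functions rather than by per-row Python calls.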

Apache Spark 2.0.0 Great New Features
Best new features - MLlib & Structured Streaming
•  Spark SQL
– SQL 2003 and Unified DataFrames/Datasets API
– Tungsten Phase 2: Whole-Stage CodeGen
•  Spark MLlib & GraphX – Large Scale Machine Learning on Apache Spark
– DataFrame API is the primary API for MLlib (RDD-based API is in maintenance mode)
– ML persistence: create a model, then save and redeploy it
•  https://databricks.com/blog/2016/05/31
•  Structured Streaming
– Integration of DataFrames/Datasets & Streaming
– Power of SQL

ML - Machine Learning
•  Automatically sifting through large amounts of data
– to find previously hidden patterns,
– to discover valuable new insights and make predictions
•  Examples:
• Identify most important factors (Attribute Importance)
• Predict customer behaviors (Classification)
• Predict or estimate a value (Regression)
• Segment a population (Clustering)
• Find fraudulent or “rare events” (Anomaly Detection)
• Determine co-occurring items in a “basket” (Associations)
• Find profiles of targeted people or items (Decision Trees)

Machine Learning (ML):
Scoring/Prediction versus Training/Learning Characteristics
Prediction/Scoring operates on huge amounts of data with low compute intensity

Diagram: Training Set -> Training/Learning/Spark’s Fit -> Model; Data to Evaluate + Model -> Prediction/Scoring/Spark’s Transform -> Results
Can train/fit on one server then move the model to a prediction server (ex: StubHub)

                                  | ML Prediction      | ML Train
% of activity                     | Most Data          | *periodic
Computation                       | O(n)               | O(n^3) Matrix-matrix
Data                              | O(n)               | O(n^2)
Compute Intensity (Compute/Data)  | Low constant       | O(n)
Memory Bandwidth Requirement      | 3x to 6x per core  | Up to 1.3x per core
*periodic updates to models: quarterly, monthly, weekly, nightly
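
The compute-intensity row follows from dividing the computation row by the data row. The short sketch below works that out numerically under two simplifying assumptions of mine, not the deck's: training is modeled as one dense n x n matrix multiply (2n^3 floating-point operations over 3n^2 matrix elements), and scoring as one multiply-add per input value. The function names are hypothetical.

```python
def training_intensity(n):
    """Dense n x n matrix multiply: 2*n^3 flops over 3*n^2 matrix elements."""
    flops = 2 * n ** 3
    elements = 3 * n ** 2
    return flops / elements


def scoring_intensity(n):
    """One multiply-add per input value: 2*n flops over n elements."""
    flops = 2 * n
    elements = n
    return flops / elements


for n in (100, 1000, 10000):
    print(n, scoring_intensity(n), round(training_intensity(n), 1))
```

Under these assumptions the scoring ratio stays constant while the training ratio grows linearly with n, which is why scoring leans on memory bandwidth and training leans on compute.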

Saving the ML model in Apache Spark
Model can be moved between languages:
•  Train/Fit a Random Forest Classifier in Python, save it
# Training (Python)
trainingData = sqlContext.read...  # data: features, label
rf = RandomForestClassifier(numTrees=20)
model = rf.fit(trainingData)
model.save("myModelPath")
•  Can load it back in Python
sameModel = RandomForestClassificationModel.load("myModelPath")
•  Can load it into Scala to Predict/Transform
// Load the model in Scala
val sameModel = RandomForestClassificationModel.load("myModelPath")
val predictions = sameModel.transform(mybigdata)
MLlib also allows users to save/load entire ML pipelines
https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html

How can we make Spark 2.x.x even Faster?
It can be made a LOT FASTER: 8x to 20x!

“You can only compute as fast as you can move data” – Brad Carlile

Let’s Explore First Principles Thinking
What are the true ways to get at efficiency – A quick Analogy
•  Where to Locate a Factory?    Example: Elon Musk’s Tesla GigaFactory

                            | Tesla GigaFactory                 | Apache Spark Performance
Conventional Wisdom         | 1) Tax Incentives                 | 1) Many cheap cores & Cloud
First Principle Thinking:   | 1) Nevada’s Energy costs          | 1) Use all of Delivered Bandwidth
closely look at each        |    a) Best Geothermal near        | 2) JVM performance optimizations
component; what are all     |    b) Sun (2100 kWh/kW-yr)        | 3) Innovations for scanning operations
of the issues?              |    c) Wind (~8-10 m/s)            |
                            | 2) Nevada’s Raw Materials         |
                            |    a) Only Lithium Mine in US     |
                            |    b) Lithium salts nearby        |
                            | 3) Few hours from BIG MARKETS     |
                            | 4) Tax incentives (don’t hurt)    |
http://fortune.com/2015/12/11/nevada-energy-tech-hub/

Whole-stage CodeGen Performance
System: 2-chip x86 E5 v3 (Haswell), total 36 cores, 72 threads/VCPU
•  Let’s evaluate performance on a Full Table Scan of 600M rows
– Time to Scan = 0.16 sec       Impressive: 104.7 Million Rows/sec per core
•  Going back to 1st principles: Scanning is about data movement
– What bandwidth does this in-memory Scan deliver?    15 GB/s (Spark 2.0.0)
– What is the memory bandwidth of the system?    114 GB/s (Stream Triad)
– 7.6x more bandwidth available!
•  Other Queries:   “SELECT count(*) FROM lineorder WHERE lo_quantity BETWEEN 10 and 20”
– 19x more bandwidth available! (only delivering 6 GB/s on the system)
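
The headroom figures are simple ratios of the numbers on this slide, worked out below. The variable names are mine, and the inputs are the rounded values as printed, so the per-core rate comes out near 104 million rows/s rather than exactly the 104.7 quoted (presumably because the scan time is rounded to 0.16 s).

```python
rows = 600e6          # rows in the full table scan
scan_seconds = 0.16   # reported time to scan
cores = 36            # 2-chip system, 36 cores total

rows_per_sec_per_core = rows / scan_seconds / cores

stream_triad_gbs = 114.0   # measured system memory bandwidth (Stream Triad)
scan_gbs = 15.0            # bandwidth delivered by the in-memory scan
between_query_gbs = 6.0    # bandwidth delivered by the BETWEEN query

scan_headroom = stream_triad_gbs / scan_gbs
between_headroom = stream_triad_gbs / between_query_gbs

print(f"{rows_per_sec_per_core / 1e6:.1f} M rows/s per core")
print(f"{scan_headroom:.1f}x and {between_headroom:.1f}x bandwidth headroom")
```

The two ratios reproduce the 7.6x and 19x headroom claims: the memory system could move that much more data than these queries currently consume.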

After Spark 2.0: Big Leap in Performance Possible
Select count(*) from store_sales where ss_item_sk > 100 and ss_item_sk < 1000
1.  Volcano Iterator model: plan interpretation
–  Open; Next data element; perform predicate; Close
2.  “College Freshman” Java code: Whole-Stage CodeGen pipelined code
–  for (ss_item_sk in store_sales) { if (ss_item_sk > 100 and ss_item_sk < 1000) { count += 1 } }
3.  Tuned library: True Vectorization, highly tuned code that operates on a whole column
–  vectorRangeFilter(n, VECTOR_OP_GT, 100, VECTOR_OP_LT, 1000, store_sales, result, result_cnt)
4.  Hardware Acceleration further accelerates scanning
–  vectorRangeFilter(n, VECTOR_OP_GT, 100, VECTOR_OP_LT, 1000, store_sales, result, result_cnt)
•  x86 AVX2 (SIMD vector instructions) – Can achieve 70 GB/s per chip
•  Oracle’s SPARC M7/S7 processors – even faster
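
The call shape of step 3 can be sketched in plain Python: one call receives the whole column and the two bounds, and returns the matching positions and their count. This is only an illustration of the column-at-a-time interface. `vector_range_filter` is a hypothetical function, not the libdax or any vendor API, and a Python loop does not show the SIMD or hardware speedups the slide refers to.

```python
from array import array


def vector_range_filter(column, low, high):
    """Return (matching row positions, match count) for low < value < high.

    The whole column is handed over in one call, so the per-row work is a
    tight loop with no operator dispatch between rows.
    """
    result = array("l", (i for i, v in enumerate(column) if low < v < high))
    return result, len(result)


# Made-up sample values for the ss_item_sk column.
ss_item_sk = array("l", [50, 101, 999, 1000, 100, 500])
result, result_cnt = vector_range_filter(ss_item_sk, 100, 1000)
print(list(result), result_cnt)
```

Because the bounds are exclusive, the values 100 and 1000 themselves are not counted, matching the `>` and `<` in the query.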

libdax Open API for Free & Open Source Software (FOSS)
Libdax is designed for key scan & dictionary operations to accelerate a variety of software
•  Designed to accelerate a wide variety of Oracle & FOSS software
– Examples:
• SQL acceleration for Oracle Database in-memory
• Apache Spark SQL
– DataFrames (Python, Scala, Java, R)
– Parquet experiments in progress
• Published sample codes
•  Open API for libdax – sign up for free systems to actually develop/try code
– https://swisdev.oracle.com
– x86 & SPARC versions (x86 & generic out soon!)

What else after one uses all of the bandwidth?
It can be made a LOT FASTER: 8x to 20x!

In-Memory Performance Optimizations
•  Columnar Format
•  Vector Processing
– Directly on Dictionary-encoded Columns
•  Operation pushdown
•  Join Processing accelerations
– Bloom Filters: bit vector set membership testing
•  In-Memory Storage Index
•  Predicate Optimization
– Using Dictionary values, Min, Max, …
T. Lahiri et al. Oracle Database In-Memory: A Dual Format In-Memory Database. Proceedings of ICDE 2015.
http://www.oracle.com/technetwork/database/in-memory/overview/twp-oracle-database-in-memory-2245633.html
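
As an illustration of the Bloom filter item in the list above, here is a minimal sketch of bit-vector set membership testing used as a join pre-filter. It is not taken from Oracle Database or Spark: the `BloomFilter` class, its sizes, and the salted SHA-256 hashing are assumptions chosen to keep the example self-contained.

```python
import hashlib


class BloomFilter:
    """Bit-vector membership test: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as the bit vector

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))


# Build side of a join: keys of the smaller table.
build_keys = [3, 17, 42]
bloom = BloomFilter()
for key in build_keys:
    bloom.add(key)

# Probe side: rows whose key cannot match are dropped before the join.
probe_keys = [1, 3, 5, 17, 99, 42]
candidates = [k for k in probe_keys if bloom.might_contain(k)]
print(candidates)
```

Every key from the build side is guaranteed to pass the filter; an occasional non-matching key may also pass, and the join itself then rejects it.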

In-Memory Columnar Format: Faster for Analytics
Row format requires skipping over data, which is slower for Analytics
SELECT COL4 FROM MYTABLE
In-memory Column Format (IM Column Store) -> RESULT
With columnar we only scan the data required by the query
Spark does push-down predicates for Parquet files
… We need columnar in-memory for Spark’s internal format

Lower Cardinality Data is Usually Most Interesting Data
Analytics analyzing features of lower cardinality data
Analytics often distills data by grouping according to combinations of features
In ML, we often “Bucketize” to reduce cardinality
Most interesting data has fewer dictionary bits

[Figure: example features placed along an axis of dictionary bits, from 2-bit to 19-bit and beyond (n bits encode a cardinality of up to 2^n, “Entropy”). Features shown: Gender, Season, 5-point Scale, Marital Status, Top10, 10 ranking, Month, Hour, State, Weeks, Minutes, Age, Test Score, Country, US City >100k, Days, Top 500, Job Classification, Area code, Nasdaq, NYSE, Top 5,000, School districts, zipcode, DOB last 150 years, Temperature, Rainfall, Wind direction, Region, Price, Delivery status, Make, Model, and Unique or Random Data.]

In-Memory Columnar Compression – Persist in Memory
•  Efficient In-Memory Columnar Table Scan
–  Contiguous storage per column
•  Dictionary encoding: huge compression
– The 50 US states need only 6 bits (<1 byte)
– Spelling out the state name is much longer
•  “South Dakota” needs 12 characters, or 24 bytes as 2-byte Unicode (192 bits vs. 6 bits)
•  Innovations:
– Directly scan dictionary-encoded data!
–  Save Min & Max for scan elimination
–  Additional compression on the column also possible (Zip + RLE)
–  Can use the dictionary for “featurization” of data for ML

Column value list | Dictionary-encoded column
South Dakota      | 0
Tennessee         | 1
Utah              | 3
Texas             | 2
South Dakota      | 0
Texas             | 2
Utah              | 3

Dictionary
VALUE          ID
South Dakota   0
Tennessee      1
Texas          2
Utah           3

Column Min: South Dakota, Max: Utah
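
The example on this slide can be reproduced with a short plain-Python sketch: build a sorted dictionary, encode the column as small integer IDs, and evaluate a predicate directly on the IDs. This illustrates the technique only; it is not the Oracle Database In-Memory or Spark implementation, and `bits_needed` and `count_equal` are hypothetical helper names.

```python
import math

column = ["South Dakota", "Tennessee", "Utah", "Texas",
          "South Dakota", "Texas", "Utah"]

# Build a sorted dictionary so that ID order matches value order.
dictionary = sorted(set(column))
value_to_id = {value: i for i, value in enumerate(dictionary)}
encoded = [value_to_id[value] for value in column]


def bits_needed(cardinality):
    """Bits per ID for a column with this many distinct values."""
    return max(1, math.ceil(math.log2(cardinality)))


def count_equal(encoded, value_to_id, value):
    """Scan the IDs directly: translate the predicate once, compare ints."""
    code = value_to_id.get(value)
    if code is None:
        return 0  # value is not in the dictionary, so no scan is needed
    return sum(1 for c in encoded if c == code)


# Min and Max of the column allow skipping scans for out-of-range predicates.
col_min, col_max = dictionary[0], dictionary[-1]

print(encoded, bits_needed(50), count_equal(encoded, value_to_id, "Texas"))
```

The encoded list matches the ID column on the slide, and 50 distinct states fit in 6 bits per value, compared with 192 bits for the spelled-out string.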

Apache Spark: Objects & Tungsten In-memory Columnar
Oracle incorporating libdax & fixing misalignments in Apache Spark
•  Typesafe operations in Scala access the JVM object-formatted representation
•  SQL & DataFrame operations need to access the in-memory column format
–  Tungsten: SQL operates on internal format
–  Need to persist in column format
–  Currently does not use dictionary encoding to speed SQL/DataFrame execution
•  Apache Spark currently doesn’t store columnar data for reuse; it regenerates the columns on the fly each time they are used

Diagram labels: Scala, Python, R, … access SALES in JVM Object Format (JVM); DataFrames and SQL execution use SALES in a Temp Internal Column Format (Off-heap Memory); Encode/Decode between the two; Future? SPARK-15687

Developers: Areas that need contribution
Shout out to all Developers
•  Contribute innovative new algorithms throughout Apache Spark!
•  Contribute third-party packages that integrate with Apache Spark
  –  https://sparkhub.databricks.com/ “Free App-store for Apache Spark”
•  Improve single-node performance & scalability
  –  Improve performance of large-core systems
    •  x86 processor contains 22 cores per chip, SPARC processor contains 32 cores per chip
    •  This trend keeps increasing – HUGE BANDWIDTH on CHIP
  –  Network bandwidth not keeping up
•  Fix misalignments which hurt performance
  –  Example: 4-byte signed ints prepended to strings caused misalignment; convert to longs
    •  https://issues.apache.org/jira/browse/SPARK-16962

Apache Spark is Exciting – More Great Work to Come!
Spark’s whole ecosystem totally appeals to the performance expert in me!
•  Great ecosystem for analytics
  –  Spark forms a continuum with other technologies (Kafka, Solr, …)
•  Spark is fast & scalable because of clean design
  –  Spark’s in-memory focus
•  Spark can be much faster
  –  Poised for many innovations that take advantage of hardware systems
  –  Oracle’s Software in Silicon can be used on Apache Spark
•  Keep Using, Contributing, and Sharing!

Oracle SPARC M7 & S7: Innovations for Cloud & Analytics
Deep innovations differentiate SPARC from generic computing
•  32 cores @ 4.13 GHz, 512 GB memory per chip
  –  >160 GB/s delivered memory bandwidth per chip
•  Java/JVM, Database, Applications, etc.: SPARC 1.6x to 2.0x faster core vs. x86
  –  Little growth in x86 per-core performance
•  Software in Silicon features
  –  In-Memory SQL Acceleration & Decompression
  –  Hardware-accelerated encryption
  –  Silicon Secured Memory
https://blogs.oracle.com/bestperf
[Chart: Core Performance vs. x86 E5 v2, 2012 to 2016, y-axis from 0.8x to 2.0x. Series: Java/JVM, OLTP, Mem GB/s; x86 shown dashed.]

First Principle Thinking: Bandwidth for Apache Spark
In-memory scan ultimately determined by memory bandwidth
•  “…no matter how high performance my engine is, if I need to scan a Terabyte of data to answer my query it’s going to be slow even if you are reading from memory”
  –  Patrick Wendell, Databricks (June 4, 2015 on O’Reilly Data Show Podcast with Ben Lorica)
•  Let’s say I have 1 TB in memory that I want to scan in 1 second
  –  We need 31 C4.8xlarge instances (62 chips) to scan in 1 second (1024 GB / 33.6 GB/s)
  –  SPARC M7-8 (8-chip) server has 1.2 TB/s delivered bandwidth in 10 RU
[Diagram: processor interconnects for IBM Power8 E880, SPARC M7-8, x86 E7 v3 Haswell, and 4x E5 v3 Haswell (10GbE connected). Circles show processors; inter-chip bandwidths are to scale. Topology labels: Fully connected, 2-hop, 2-hop.]
http://browser.primatelabs.com/geekbench3/5105516
http://browser.primatelabs.com/geekbench3/1694602
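The sizing argument on this slide is plain division, sketched below with the slide's own figures. The variable names are illustrative, and treating 1.2 TB/s as 1.2 × 1024 GB/s is an assumption made only to stay consistent with the slide's use of 1024 GB for 1 TB.

```python
import math

data_gb = 1024               # 1 TB held in memory
target_seconds = 1.0         # desired full-scan time
c4_8xlarge_gb_per_s = 33.6   # per-instance bandwidth figure quoted on the slide

# Instances needed so that their combined bandwidth scans the data in time.
instances = math.ceil(data_gb / (c4_8xlarge_gb_per_s * target_seconds))
chips = instances * 2        # the slide counts 2 chips per instance (31 -> 62)

# Assumption: 1.2 TB/s delivered bandwidth expressed as 1.2 * 1024 GB/s.
m7_8_gb_per_s = 1.2 * 1024
m7_8_seconds = data_gb / m7_8_gb_per_s

print(instances, chips)
print(round(m7_8_seconds, 2))
```

The division gives roughly 30.5, which rounds up to the 31 instances quoted; the same 1 TB scan at 1.2 TB/s takes a little under one second on a single server.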

SPARC DAX Software in Silicon for Analytics (Oracle DB & Spark)
Radical innovation: Integrated Offload offers 10x faster performance!
•  Integrated Offload
  –  Data Analytics Acceleration (DAX)
  –  OPEN! Add to your own applications
    •  https://swisdev.oracle.com
•  It’s more important how you use transistors than Moore’s Law (# transistors you make)
•  Oracle Database analytic queries
•  SPARC M7 10.8x faster per chip than x86 E5 v3
•  Same techniques apply to Apache Spark
[Diagram labels: x86 E5 v3: “x86 100% Utilized”, “NO OFFLOAD!”, “NO OPEN Cores!”; SPARC M7: “DAX”, “OFFLOAD”; memory labels: “Half BW”, “Bandwidth”.]
https://blogs.oracle.com/bestperf/entry/20151025_imdb_t7_1

Questions?

Backup Slides
Oracle Database 12c In-memory Database Innovations
…all innovations that can apply to Apache Spark

Learning From Oracle Database In-Memory?
Quick digression (which explains how we can make Spark faster)
•  OLTP uses proven row format
•  Analytics & reporting use new in-memory column format
•  Analytics compression means huge amounts of the database can now fit in memory
•  The Oracle Database stores BOTH row and column formats for the same table
•  Simultaneously active and transactionally consistent
[Diagram: Oracle DB holds the SALES table in memory twice, once in row format and once in column format.]

Operation Pushdown: Reduce Rows Processed by Plan
•  When possible, push operations down to the in-memory scan
  –  Greatly reduces # rows flowing up through the plan
•  For example:
  –  Predicate evaluation (for qualifying predicates: equality, range, etc.):
    •  Inline predicate evaluation within the scan
    •  Each IMCU (In-Memory Compression Unit) scan returns only qualifying rows instead of all rows
•  Another example:
  –  Aggregation (for qualifying aggregates, e.g. sum(), min/max(), etc.):
    •  IMCU is aggregated during the scan
    •  Each IMCU scan returns only the aggregate (e.g. the sum)
    •  Upper plan nodes aggregate the aggregates (e.g. sum of sums)
[Diagram: the predicates SALES > 1000 and STATE = CA are pushed down into the IM scan of Products (Sales > 1000) and the IM scan of Stores (State = CA).]
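Both kinds of pushdown described above can be sketched on a toy table. The sketch assumes a single table split into two hypothetical IMCUs with invented rows; the slide's diagram scans two tables, so this only illustrates the idea that each scan returns qualifying rows or one partial aggregate, and that the upper node combines partials.

```python
# Hypothetical IMCUs: each is a small columnar chunk of the same table.
imcus = [
    {"state": ["CA", "NV", "CA"], "sales": [1500, 2000, 800]},
    {"state": ["OR", "CA", "CA"], "sales": [3000, 1200, 4000]},
]


def scan_filter(imcu):
    """Predicate pushdown: return only rows with STATE = CA and SALES > 1000."""
    return [
        (state, sales)
        for state, sales in zip(imcu["state"], imcu["sales"])
        if state == "CA" and sales > 1000
    ]


def scan_sum(imcu):
    """Aggregation pushdown: return one partial sum over the qualifying rows."""
    return sum(sales for _, sales in scan_filter(imcu))


# Rows that flow up the plan when only the predicate is pushed down.
qualifying_rows = [row for imcu in imcus for row in scan_filter(imcu)]

# With aggregation pushed down, each scan returns a single number.
partial_sums = [scan_sum(imcu) for imcu in imcus]
total = sum(partial_sums)  # upper plan node: sum of sums

print(qualifying_rows)
print(partial_sums, total)
```

Six rows are scanned, three flow up with predicate pushdown, and only two numbers flow up once the aggregate is pushed down as well.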

Operation Pushdown: Bloom Filter
•  Bloom filter:
  •  Compact bit vector for set membership testing
  •  10g optimizer feature
•  Bloom filter pushdown:
  •  Filtering pushed down to IMCU scan
  •  Returns only rows that are likely to be join candidates
  •  Joins tables 10x faster
Example: Find total sales in outlet stores
[Diagram: the Stores table is filtered on Type = ‘Outlet’, giving StoreID in 15, 38, 64. A Bloom filter on Store ID is applied to the scan of the Sales table (Store ID, Amount), and the surviving Amount values are summed.]
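A Bloom filter join of this shape can be sketched in a few lines. This is a toy implementation under stated assumptions: the `BloomFilter` class, its size, and its SHA-256-based hashing are choices made for the sketch, not Oracle's or Spark's implementation; store IDs 15, 38 and 64 come from the slide, while the other stores and all sales amounts are invented.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit vector probed at several hashed positions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as the bit vector

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


# Build side: Stores filtered on Type = 'Outlet'.
stores = [(15, "Outlet"), (22, "Mall"), (38, "Outlet"), (64, "Outlet"), (70, "Mall")]
outlet_ids = {store_id for store_id, kind in stores if kind == "Outlet"}
bloom = BloomFilter()
for store_id in outlet_ids:
    bloom.add(store_id)

# Probe side: the Sales scan drops rows whose Store ID cannot be a join match.
sales = [(15, 100), (22, 250), (38, 75), (64, 300), (70, 40), (15, 60)]
candidates = [(sid, amount) for sid, amount in sales if bloom.might_contain(sid)]

# The join still checks candidates exactly, so false positives never reach the sum.
total = sum(amount for sid, amount in candidates if sid in outlet_ids)
print(total)
```

The filter can return false positives but never false negatives, which is why the scan returns rows that are "likely" join candidates and the join itself remains exact.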

In-Memory Storage Index: Eliminate IMCUs from Scan
•  Min-max pruning
  –  Min/max values serve as storage index
  –  Check predicate against min/max values
  –  Skip entire IMCU if predicate not satisfied
  –  Eliminates processing unnecessary IMCUs
  –  Can prune for predicates including equality, range, inlist, …
•  Dictionary pruning
  –  Dictionaries also serve as storage index
  –  Check predicate against dictionary values
    •  “Find sales from stores in Nevada”
  –  Skip entire IMCU if predicate not satisfied
Example: Find stores with sales greater than $10,000
[Diagram: three IMCUs with Min $4000 / Max $7000, Min $8000 / Max $12000, and Min $13000 / Max $15000.]
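Both pruning checks above reduce to a test against per-IMCU metadata before any row is read. In the sketch below the min/max sales ranges are the ones shown on the slide, while the per-IMCU state dictionaries and the function names are hypothetical.

```python
# Per-IMCU metadata. Sales ranges are from the slide; the state
# dictionaries are invented for illustration.
imcus = [
    {"min": 4000, "max": 7000, "states": {"Oregon", "Utah"}},
    {"min": 8000, "max": 12000, "states": {"Nevada", "Utah"}},
    {"min": 13000, "max": 15000, "states": {"Nevada", "Texas"}},
]


def may_contain_greater_than(imcu, threshold):
    """Min/max pruning for `sales > threshold`: only the max matters."""
    return imcu["max"] > threshold


def may_contain_state(imcu, state):
    """Dictionary pruning for `state = ...`: the value must be in the dictionary."""
    return state in imcu["states"]


# "Find stores with sales greater than $10,000"
scan_for_sales = [i for i, imcu in enumerate(imcus)
                  if may_contain_greater_than(imcu, 10000)]

# "Find sales from stores in Nevada"
scan_for_nevada = [i for i, imcu in enumerate(imcus)
                   if may_contain_state(imcu, "Nevada")]

print(scan_for_sales)
print(scan_for_nevada)
```

For the sales predicate the first IMCU (max $7000) is skipped without being read; the other two still have to be scanned.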

Predicate Optimization: Reduce Predicate Evaluations
•  Avoid evaluating predicates against every column value
  –  Check range predicate against min/max values
    •  As before, skip IMCUs where min/max disqualifies predicate
  –  If min/max indicates all rows will qualify, no need to evaluate predicates on column values
Example: Find stores with sales between $8000 and $14000
[Diagram: three IMCUs with Min $4000 / Max $7000, Min $8000 / Max $13000, and Min $13000 / Max $15000. Outcome labels: NO ROWS (skip IMCU), SOME ROWS (needs evaluation), ALL ROWS (skip evaluation).]
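The three outcomes on this slide follow from comparing each IMCU's min/max with the predicate's range. A minimal sketch, assuming the "between" predicate is inclusive at both ends; the function name `classify` is hypothetical.

```python
def classify(imcu_min, imcu_max, low, high):
    """Classify an IMCU for the predicate low <= value <= high."""
    if imcu_max < low or imcu_min > high:
        return "NO ROWS"    # ranges do not overlap: skip the IMCU
    if low <= imcu_min and imcu_max <= high:
        return "ALL ROWS"   # IMCU range lies inside the predicate: skip evaluation
    return "SOME ROWS"      # partial overlap: evaluate the predicate on the values


# Min/max pairs from the slide; predicate: sales between $8000 and $14000.
imcu_ranges = [(4000, 7000), (8000, 13000), (13000, 15000)]
results = [classify(lo, hi, 8000, 14000) for lo, hi in imcu_ranges]

for (lo, hi), outcome in zip(imcu_ranges, results):
    print(f"min ${lo} / max ${hi}: {outcome}")
```

Only the third IMCU needs per-value evaluation; the first is skipped and every row of the second qualifies.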

Predicate Optimization: Reduce Predicate Evaluations
•  If min/max cannot eliminate predicate
  –  Evaluate predicate once per dictionary value
  –  Create list of qualifying dictionary values
•  Use vector instructions to find qualifying values in column
•  Greatly reduces predicate evaluations
  –  Once per distinct value instead of once per value
•  Also for more complex predicates …
  –  LIKE predicates, e.g. find sales of product names containing “mustard”
Example: Find stores with sales between $8000 and $14000
Dictionary (ID: value): 0: $13,000; 1: $13,500; 2: $13,800; 3: $13,900; 4: $14,500; 5: $15,000
Column CU (dictionary IDs): 5, 1, 4, 3, 3, 4, 4, 5, 5, 3, 0, 1
Vector compare against qualifying IDs {0,1,2,3}
Dictionary for the LIKE example (ID: value): 0: Apple; 1: English Mustard; 2: Mustard Greens
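The slide's example can be worked through directly: the predicate runs once per dictionary entry, and the column scan then only tests ID membership. In this sketch a plain Python loop stands in for the vector compare, the inclusive bounds are an assumption, and the case-insensitive match for “mustard” is also an assumption.

```python
# Dictionary and encoded column from the slide (ID = position in the list).
dictionary = [13000, 13500, 13800, 13900, 14500, 15000]
column_cu = [5, 1, 4, 3, 3, 4, 4, 5, 5, 3, 0, 1]

# Evaluate the range predicate once per distinct value: 6 evaluations, not 12.
qualifying_ids = {value_id for value_id, value in enumerate(dictionary)
                  if 8000 <= value <= 14000}

# Scan the encoded column; a real engine would use a vector compare here.
matching_rows = [row for row, value_id in enumerate(column_cu)
                 if value_id in qualifying_ids]

# The same idea applies to LIKE: match each dictionary string once.
product_dictionary = ["Apple", "English Mustard", "Mustard Greens"]
mustard_ids = {value_id for value_id, name in enumerate(product_dictionary)
               if "mustard" in name.lower()}

print(sorted(qualifying_ids))
print(matching_rows)
print(sorted(mustard_ids))
```

The qualifying set comes out as {0, 1, 2, 3}, the same set the slide feeds into the vector compare.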

Backup Slides
Oracle’s SPARC Processor

SPARC T7 & M7 Systems - All Shipping Now
“-#” indicates how many chips in server

                 T7-1     T7-2     T7-4      M7-8               M7-16
Processors       1        2        2 or 4    Up to 8 (1)        Up to 16 (2)
Max Cores        32       64       128       256                512
Max Threads      256      512      1,024     2,048              4,096
Max Memory (3)   .5 TB    1 TB     2 TB      4 TB               8 TB
Form Factor      2U       3U       5U        Rack / 10U         Rack
Domaining        LDOMs    LDOMs    LDOMs     LDOMs, PDOMs (1)   LDOMs, PDOMs (2)

(1) Factory configured with one (up to 8 processors) or two (up to 4 processors each) static physical domains
(2) 1, 2, 3 or 4 reconfigurable physical domains
(3) Maximum memory capacity is based on 32 GB DIMMs; capacity can double in future with 64 GB DIMMs
Confidential: Oracle Restricted

SPARC S7 Servers

                       SPARC S7-2 Server   SPARC S7-2L Server
Processors             1 or 2              2
Max Cores/Threads      16 / 128            16 / 128
Max Memory (1)         1 TB                1 TB
Form Factor            1U                  2U
Max Disk Drives        8                   26
PCIe Slots Available   3                   6
Integrated Ethernet    4x 10GBase-T        4x 10GBase-T

(1) Maximum memory capacity is based on 64 GB DIMMs.
[Image labels: S7-2L, Storage]
Performance in Spark 2.0, PDX Spark Meetup 8/18/16

Weitere ähnliche Inhalte

Was ist angesagt?

OOW16 - Oracle E-Business Suite 12 Upgrade Experience for a 14 TB Oracle E-Bu...
OOW16 - Oracle E-Business Suite 12 Upgrade Experience for a 14 TB Oracle E-Bu...OOW16 - Oracle E-Business Suite 12 Upgrade Experience for a 14 TB Oracle E-Bu...
OOW16 - Oracle E-Business Suite 12 Upgrade Experience for a 14 TB Oracle E-Bu...vasuballa
 
OOW16 - Maintenance Strategies for Oracle E-Business Suite [CON6725]
OOW16 - Maintenance Strategies for Oracle E-Business Suite [CON6725]OOW16 - Maintenance Strategies for Oracle E-Business Suite [CON6725]
OOW16 - Maintenance Strategies for Oracle E-Business Suite [CON6725]vasuballa
 
OOW16 - Migrating and Managing Customizations for Oracle E-Business Suite 12....
OOW16 - Migrating and Managing Customizations for Oracle E-Business Suite 12....OOW16 - Migrating and Managing Customizations for Oracle E-Business Suite 12....
OOW16 - Migrating and Managing Customizations for Oracle E-Business Suite 12....vasuballa
 
Jfokus 2017 Oracle Dev Cloud and Containers
Jfokus 2017 Oracle Dev Cloud and ContainersJfokus 2017 Oracle Dev Cloud and Containers
Jfokus 2017 Oracle Dev Cloud and ContainersMika Rinne
 
OOW16 - Testing Oracle E-Business Suite Best Practices [CON6713]
OOW16 - Testing Oracle E-Business Suite Best Practices [CON6713]OOW16 - Testing Oracle E-Business Suite Best Practices [CON6713]
OOW16 - Testing Oracle E-Business Suite Best Practices [CON6713]vasuballa
 
Why MySQL High Availability Matters
Why MySQL High Availability MattersWhy MySQL High Availability Matters
Why MySQL High Availability MattersMark Swarbrick
 
No sql from the web’s favourite relational database MySQL
No sql from the web’s favourite relational database MySQLNo sql from the web’s favourite relational database MySQL
No sql from the web’s favourite relational database MySQLMark Swarbrick
 
OOW16 - Personalizing Oracle E-Business Suite: The Next Generation [CON6716]
OOW16 - Personalizing Oracle E-Business Suite: The Next Generation [CON6716]OOW16 - Personalizing Oracle E-Business Suite: The Next Generation [CON6716]
OOW16 - Personalizing Oracle E-Business Suite: The Next Generation [CON6716]vasuballa
 
Java EE 8 - February 2017 update
Java EE 8 - February 2017 updateJava EE 8 - February 2017 update
Java EE 8 - February 2017 updateDavid Delabassee
 
OOW15 - Installation, Cloning, and Configuration of Oracle E-Business Suite 12.2
OOW15 - Installation, Cloning, and Configuration of Oracle E-Business Suite 12.2OOW15 - Installation, Cloning, and Configuration of Oracle E-Business Suite 12.2
OOW15 - Installation, Cloning, and Configuration of Oracle E-Business Suite 12.2vasuballa
 
Oow MySQL Whats new in security overview sept 2017 v1
Oow MySQL Whats new in security overview sept 2017 v1Oow MySQL Whats new in security overview sept 2017 v1
Oow MySQL Whats new in security overview sept 2017 v1Mark Swarbrick
 
P6 Release 8 Installation Orientation
P6 Release 8 Installation OrientationP6 Release 8 Installation Orientation
P6 Release 8 Installation Orientationp6academy
 
How to Prepare Your Toolbox for the Future of SharePoint Development
How to Prepare Your Toolbox for the Future of SharePoint DevelopmentHow to Prepare Your Toolbox for the Future of SharePoint Development
How to Prepare Your Toolbox for the Future of SharePoint DevelopmentProgress
 
Migrating your infrastructure to OpenStack - Avi Miller, Oracle
Migrating your infrastructure to OpenStack - Avi Miller, OracleMigrating your infrastructure to OpenStack - Avi Miller, Oracle
Migrating your infrastructure to OpenStack - Avi Miller, OracleOpenStack
 
Related OSS Projects - Peter Rowe, Flexera Software
Related OSS Projects - Peter Rowe, Flexera SoftwareRelated OSS Projects - Peter Rowe, Flexera Software
Related OSS Projects - Peter Rowe, Flexera SoftwareOpenStack
 
Navigating Your Product's Growth with Embedded Analytics
Navigating Your Product's Growth with Embedded Analytics Navigating Your Product's Growth with Embedded Analytics
Navigating Your Product's Growth with Embedded Analytics Progress
 

Was ist angesagt? (20)

OOW16 - Oracle E-Business Suite 12 Upgrade Experience for a 14 TB Oracle E-Bu...
OOW16 - Oracle E-Business Suite 12 Upgrade Experience for a 14 TB Oracle E-Bu...OOW16 - Oracle E-Business Suite 12 Upgrade Experience for a 14 TB Oracle E-Bu...
OOW16 - Oracle E-Business Suite 12 Upgrade Experience for a 14 TB Oracle E-Bu...
 
OOW16 - Maintenance Strategies for Oracle E-Business Suite [CON6725]
OOW16 - Maintenance Strategies for Oracle E-Business Suite [CON6725]OOW16 - Maintenance Strategies for Oracle E-Business Suite [CON6725]
OOW16 - Maintenance Strategies for Oracle E-Business Suite [CON6725]
 
Java EE Next
Java EE NextJava EE Next
Java EE Next
 
REST in an Async World
REST in an Async WorldREST in an Async World
REST in an Async World
 
JAX-RS 2.1 Reloaded
JAX-RS 2.1 ReloadedJAX-RS 2.1 Reloaded
JAX-RS 2.1 Reloaded
 
OOW16 - Migrating and Managing Customizations for Oracle E-Business Suite 12....
OOW16 - Migrating and Managing Customizations for Oracle E-Business Suite 12....OOW16 - Migrating and Managing Customizations for Oracle E-Business Suite 12....
OOW16 - Migrating and Managing Customizations for Oracle E-Business Suite 12....
 
Jfokus 2017 Oracle Dev Cloud and Containers
Jfokus 2017 Oracle Dev Cloud and ContainersJfokus 2017 Oracle Dev Cloud and Containers
Jfokus 2017 Oracle Dev Cloud and Containers
 
OOW16 - Testing Oracle E-Business Suite Best Practices [CON6713]
OOW16 - Testing Oracle E-Business Suite Best Practices [CON6713]OOW16 - Testing Oracle E-Business Suite Best Practices [CON6713]
OOW16 - Testing Oracle E-Business Suite Best Practices [CON6713]
 
Why MySQL High Availability Matters
Why MySQL High Availability MattersWhy MySQL High Availability Matters
Why MySQL High Availability Matters
 
No sql from the web’s favourite relational database MySQL
No sql from the web’s favourite relational database MySQLNo sql from the web’s favourite relational database MySQL
No sql from the web’s favourite relational database MySQL
 
OOW16 - Personalizing Oracle E-Business Suite: The Next Generation [CON6716]
OOW16 - Personalizing Oracle E-Business Suite: The Next Generation [CON6716]OOW16 - Personalizing Oracle E-Business Suite: The Next Generation [CON6716]
OOW16 - Personalizing Oracle E-Business Suite: The Next Generation [CON6716]
 
Oracle Solaris Overview
Oracle Solaris OverviewOracle Solaris Overview
Oracle Solaris Overview
 
Java EE 8 - February 2017 update
Java EE 8 - February 2017 updateJava EE 8 - February 2017 update
Java EE 8 - February 2017 update
 
OOW15 - Installation, Cloning, and Configuration of Oracle E-Business Suite 12.2
OOW15 - Installation, Cloning, and Configuration of Oracle E-Business Suite 12.2OOW15 - Installation, Cloning, and Configuration of Oracle E-Business Suite 12.2
OOW15 - Installation, Cloning, and Configuration of Oracle E-Business Suite 12.2
 
Oow MySQL Whats new in security overview sept 2017 v1
Oow MySQL Whats new in security overview sept 2017 v1Oow MySQL Whats new in security overview sept 2017 v1
Oow MySQL Whats new in security overview sept 2017 v1
 
P6 Release 8 Installation Orientation
P6 Release 8 Installation OrientationP6 Release 8 Installation Orientation
P6 Release 8 Installation Orientation
 
How to Prepare Your Toolbox for the Future of SharePoint Development
How to Prepare Your Toolbox for the Future of SharePoint DevelopmentHow to Prepare Your Toolbox for the Future of SharePoint Development
How to Prepare Your Toolbox for the Future of SharePoint Development
 
Migrating your infrastructure to OpenStack - Avi Miller, Oracle
Migrating your infrastructure to OpenStack - Avi Miller, OracleMigrating your infrastructure to OpenStack - Avi Miller, Oracle
Migrating your infrastructure to OpenStack - Avi Miller, Oracle
 
Related OSS Projects - Peter Rowe, Flexera Software
Related OSS Projects - Peter Rowe, Flexera SoftwareRelated OSS Projects - Peter Rowe, Flexera Software
Related OSS Projects - Peter Rowe, Flexera Software
 
Navigating Your Product's Growth with Embedded Analytics
Navigating Your Product's Growth with Embedded Analytics Navigating Your Product's Growth with Embedded Analytics
Navigating Your Product's Growth with Embedded Analytics
 

Andere mochten auch

Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s OptimizerDeep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s OptimizerDatabricks
 
Docker 基本概念與指令操作
Docker  基本概念與指令操作Docker  基本概念與指令操作
Docker 基本概念與指令操作NUTC, imac
 
Spark Solution for Rank Product
Spark Solution for Rank ProductSpark Solution for Rank Product
Spark Solution for Rank ProductMahmoud Parsian
 
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)Alexey Zinoviev
 
使用 CLI 管理 OpenStack 平台
使用 CLI 管理 OpenStack 平台使用 CLI 管理 OpenStack 平台
使用 CLI 管理 OpenStack 平台NUTC, imac
 
Use of spark for proteomic scoring seattle presentation
Use of spark for  proteomic scoring   seattle presentationUse of spark for  proteomic scoring   seattle presentation
Use of spark for proteomic scoring seattle presentationlordjoe
 
Java BigData Full Stack Development (version 2.0)
Java BigData Full Stack Development (version 2.0)Java BigData Full Stack Development (version 2.0)
Java BigData Full Stack Development (version 2.0)Alexey Zinoviev
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015Chris Fregly
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerSpark Summit
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopDatabricks
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit
 
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Chris Fregly
 
Spark 巨量資料處理基礎教學
Spark 巨量資料處理基礎教學Spark 巨量資料處理基礎教學
Spark 巨量資料處理基礎教學NUTC, imac
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in SparkDatabricks
 
Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0Knoldus Inc.
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Robert "Chip" Senkbeil
 
Robust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache SparkRobust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache SparkDatabricks
 

Andere mochten auch (20)

Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s OptimizerDeep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
 
Docker 基本概念與指令操作
Docker  基本概念與指令操作Docker  基本概念與指令操作
Docker 基本概念與指令操作
 
Spark Solution for Rank Product
Spark Solution for Rank ProductSpark Solution for Rank Product
Spark Solution for Rank Product
 
Apache Spark Essentials
Apache Spark EssentialsApache Spark Essentials
Apache Spark Essentials
 
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
 
Meetup Spark 2.0
Meetup Spark 2.0Meetup Spark 2.0
Meetup Spark 2.0
 
使用 CLI 管理 OpenStack 平台
使用 CLI 管理 OpenStack 平台使用 CLI 管理 OpenStack 平台
使用 CLI 管理 OpenStack 平台
 
Use of spark for proteomic scoring seattle presentation
Use of spark for  proteomic scoring   seattle presentationUse of spark for  proteomic scoring   seattle presentation
Use of spark for proteomic scoring seattle presentation
 
Java BigData Full Stack Development (version 2.0)
Java BigData Full Stack Development (version 2.0)Java BigData Full Stack Development (version 2.0)
Java BigData Full Stack Development (version 2.0)
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
 
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
 
Spark 巨量資料處理基礎教學
Spark 巨量資料處理基礎教學Spark 巨量資料處理基礎教學
Spark 巨量資料處理基礎教學
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
 
Robust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache SparkRobust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache Spark
 

Ähnlich wie Performance in Spark 2.0, PDX Spark Meetup 8/18/16

MySQL Group Replication
MySQL Group ReplicationMySQL Group Replication
MySQL Group ReplicationMark Swarbrick
 
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark Summit
 
MySQL Enterprise Monitor 3
MySQL Enterprise Monitor 3MySQL Enterprise Monitor 3
MySQL Enterprise Monitor 3Mark Swarbrick
 
MySQL Enterprise Cloud
MySQL Enterprise Cloud MySQL Enterprise Cloud
MySQL Enterprise Cloud Mark Swarbrick
 
MySQL Enterprise Cloud
MySQL Enterprise CloudMySQL Enterprise Cloud
MySQL Enterprise CloudMark Swarbrick
 
Develop Oracle Virtual Box and deploy to Cloud
Develop Oracle Virtual Box and deploy to CloudDevelop Oracle Virtual Box and deploy to Cloud
Develop Oracle Virtual Box and deploy to CloudInprise Group
 
MySQL Enterprise Edition
MySQL Enterprise EditionMySQL Enterprise Edition
MySQL Enterprise EditionMark Swarbrick
 
Trivadis TechEvent 2017 Leveraging the Oracle Cloud by Kris Bhanushali tech_e...
Trivadis TechEvent 2017 Leveraging the Oracle Cloud by Kris Bhanushali tech_e...Trivadis TechEvent 2017 Leveraging the Oracle Cloud by Kris Bhanushali tech_e...
Trivadis TechEvent 2017 Leveraging the Oracle Cloud by Kris Bhanushali tech_e...Trivadis
 
Oracle Solaris Cloud Management and Deployment with OpenStack
Oracle Solaris Cloud Management and Deployment with OpenStackOracle Solaris Cloud Management and Deployment with OpenStack
Oracle Solaris Cloud Management and Deployment with OpenStackOTN Systems Hub
 
Percona Live - Dublin 03 ee + cloud
Percona Live - Dublin 03 ee + cloudPercona Live - Dublin 03 ee + cloud
Percona Live - Dublin 03 ee + cloudMark Swarbrick
 
MOUG17 Keynote: What's New from Oracle Database Development
MOUG17 Keynote: What's New from Oracle Database DevelopmentMOUG17 Keynote: What's New from Oracle Database Development
MOUG17 Keynote: What's New from Oracle Database DevelopmentMonica Li
 
Oracle RAC 12c Rel. 2 Best Practices - UKOUG Tech17 Version
Oracle RAC 12c Rel. 2 Best Practices - UKOUG Tech17 VersionOracle RAC 12c Rel. 2 Best Practices - UKOUG Tech17 Version
Oracle RAC 12c Rel. 2 Best Practices - UKOUG Tech17 VersionMarkus Michalewicz
 
Oracle Real Application Clusters (RAC) 12c Rel. 2 - Operational Best Practices
Oracle Real Application Clusters (RAC) 12c Rel. 2 - Operational Best PracticesOracle Real Application Clusters (RAC) 12c Rel. 2 - Operational Best Practices
Oracle Real Application Clusters (RAC) 12c Rel. 2 - Operational Best PracticesMarkus Michalewicz
 
Oracle RAC - Roadmap for New Features
Oracle RAC - Roadmap for New FeaturesOracle RAC - Roadmap for New Features
Oracle RAC - Roadmap for New FeaturesMarkus Michalewicz
 
Pitfalls of migrating projects to JDK 9
Pitfalls of migrating projects to JDK 9Pitfalls of migrating projects to JDK 9
Pitfalls of migrating projects to JDK 9Pavel Bucek
 
Next Generation Data Center Strategies
Next Generation Data Center StrategiesNext Generation Data Center Strategies
Next Generation Data Center StrategiesVenkat Nambiyur
 
NZOUG-GroundBreakers-2018 - Troubleshooting and Diagnosing 18c RAC
NZOUG-GroundBreakers-2018 - Troubleshooting and Diagnosing 18c RACNZOUG-GroundBreakers-2018 - Troubleshooting and Diagnosing 18c RAC
NZOUG-GroundBreakers-2018 - Troubleshooting and Diagnosing 18c RACSandesh Rao
 
3° Sessione Oracle - CRUI: Mobile&Conversational Interface
3° Sessione Oracle - CRUI: Mobile&Conversational Interface3° Sessione Oracle - CRUI: Mobile&Conversational Interface
3° Sessione Oracle - CRUI: Mobile&Conversational InterfaceJürgen Ambrosi
 

Ähnlich wie Performance in Spark 2.0, PDX Spark Meetup 8/18/16 (20)

MySQL Group Replication
MySQL Group ReplicationMySQL Group Replication
MySQL Group Replication
 
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
 
MySQL Enterprise Monitor 3
MySQL Enterprise Monitor 3MySQL Enterprise Monitor 3
MySQL Enterprise Monitor 3
 
MySQL Enterprise Cloud
MySQL Enterprise Cloud MySQL Enterprise Cloud
MySQL Enterprise Cloud
 
MySQL Enterprise Cloud
MySQL Enterprise CloudMySQL Enterprise Cloud
MySQL Enterprise Cloud
 
Develop Oracle Virtual Box and deploy to Cloud
Develop Oracle Virtual Box and deploy to CloudDevelop Oracle Virtual Box and deploy to Cloud
Develop Oracle Virtual Box and deploy to Cloud
 
MySQL Enterprise Edition
MySQL Enterprise EditionMySQL Enterprise Edition
MySQL Enterprise Edition
 
Trivadis TechEvent 2017 Leveraging the Oracle Cloud by Kris Bhanushali tech_e...
Performance in Spark 2.0, PDX Spark Meetup 8/18/16