With the right combination of open source projects, you can run high-concurrency, low-latency Spark jobs for data analysis. We'll show both REST and JDBC access to data held in a persistent Spark context, and then show how the combination of Spark Job Server, Spark Thrift Server, and Apache Kudu can create a scalable backend for low-latency analytics.
30. Reads CSV
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("cars.csv")
Writes CSV
val selectedData = df.select("year", "model")
selectedData.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("newcars.csv")
31. But these are often simplified as:
val parquetDataframe = sqlContext.read.parquet("people.parquet")
parquetDataframe.write.parquet("people.parquet")
32. I wrote the current version of the Kudu Datasource/Spark integration
60. // Use KuduContext to create, delete, or write to Kudu tables
val kuduContext = new KuduContext("kudu.master:7051")
61. // Insert data
kuduContext.insertRows(df, "test_table")
// Delete data
kuduContext.deleteRows(filteredDF, "test_table")
// Upsert data
kuduContext.upsertRows(df, "test_table")
// Update data
val alteredDF = df.select($"id", ($"count" + 1).as("count"))
kuduContext.updateRows(alteredDF, "test_table")
http://kudu.apache.org/docs/developing.html
62. Upserts are handled server side for performance
Upserts can also be handled through the datasource API:
df.write
  .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "test_table"))
  .mode("append")
  .kudu
63. You can also create, check the existence of, and delete tables through the API
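A minimal sketch of what that can look like with KuduContext (the schema, hash partitioning, and replica count here are made-up examples; check the kudu-spark docs for the exact signatures in your version):

import org.apache.kudu.client.CreateTableOptions
import org.apache.spark.sql.types._
import scala.collection.JavaConverters._

// Example schema for the illustration table
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType)))

// Create the table only if it doesn't exist yet
if (!kuduContext.tableExists("test_table")) {
  kuduContext.createTable("test_table", schema, Seq("id"),
    new CreateTableOptions()
      .addHashPartitions(List("id").asJava, 3)
      .setNumReplicas(1))
}

// Drop it again when no longer needed
// kuduContext.deleteTable("test_table")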
64. Additional notes:
The Kudu datasource currently works with Spark 1.x
The next release will support both 1.x and 2.x
It's being improved on a regular basis
65. The number of partitions on the DataFrame corresponds to how many tablets/partitions match the filter.
Partition scans are parallel and have locality awareness in Spark
66. Be sure to set spark.locality.wait to something small for low latency (3 seconds is the Spark default)
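For example (a sketch; 100ms is just an assumed starting point, tune it for your cluster):

// Lower the wait before Spark gives up on node-local scheduling
sparkConf.set("spark.locality.wait", "100ms")
// or at submit time: --conf spark.locality.wait=100ms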
70. REST-based API to:
Run jobs
Create contexts
Check the status of a job, both async and sync
71. Creating a context calls spark-submit (in separate-JVM mode)
Akka is used to communicate between the REST layer and the Spark driver
72. To create a persistent context you need:
CPU cores + memory footprint
A name to reference it by
A factory to use for the context,
e.g. HiveContextFactory vs SqlContextFactory
73. Our average job time is 30 ms for simpler retrievals coming through the API
74. Jobs need to implement an interface
context will be passed in
DON’T CREATE YOUR OWN
SQLCONTEXT!!
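A minimal sketch of such a job against Spark Job Server's Spark 1.x extras API (the object name and the "sql" config key are made up for illustration; trait and method names may differ slightly across SJS versions):

import com.typesafe.config.Config
import org.apache.spark.sql.SQLContext
import spark.jobserver.{SparkJobValid, SparkJobValidation, SparkSqlJob}

object MyQueryJob extends SparkSqlJob {
  // SJS passes in the shared SQLContext -- never build your own
  override def validate(sqlContext: SQLContext, config: Config): SparkJobValidation =
    SparkJobValid

  override def runJob(sqlContext: SQLContext, config: Config): Any =
    sqlContext.sql(config.getString("sql")).collect()
}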
81. Due to the JVM classloader, contexts need to be restarted on deploy to pick up new code
82. Some settings:
spark.files.overwrite = true
context-per-jvm = true
spray-can: parsing.max-content-length = 256m
spray-can: idle-timeout = 600 s
spray-can: request-timeout = 540 s
spark.serializer = "org.apache.spark.serializer.KryoSerializer"
FileDAO vs SqlDAO backend
You have to build SJS from source (no binary release)
85. I run the following on a persistent context:
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

sc.getConf.set("spark.sql.hive.thriftServer.singleSession", "true")
sqlContext.setConf("hive.server2.thrift.port", port) // port to run the Thrift server on
HiveThriftServer2.startWithContext(sqlContext)
86. Now I can connect using
Hive JDBC
ODBC (Microsoft or Simba)
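A JDBC connection sketch (host, port, credentials, and table name are assumptions; the port is whatever was set via hive.server2.thrift.port above):

import java.sql.DriverManager

// Load the Hive JDBC driver and open a connection to the Thrift server
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT COUNT(*) FROM test_table")
while (rs.next()) println(rs.getLong(1))
conn.close()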
87. Run a job with joins, or even just a basic DataFrame through the datasource API, and registerTempTable
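A sketch of the basic-DataFrame case (the format string and kudu-spark package name changed between releases, so treat it as an assumption; the master address and table name are the same placeholders used earlier):

// Read a Kudu table through the datasource API ...
val kuduDF = sqlContext.read
  .format("org.apache.kudu.spark.kudu")
  .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "test_table"))
  .load()
// ... and register it so Thrift/JDBC clients can query it by name
kuduDF.registerTempTable("test_table")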
89. You could also potentially cache/persist via Spark and register the result that way, assuming the joins are expensive
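A sketch of that pattern (factDF, dimDF, and the join key are hypothetical):

// Join once, cache the result, and expose it as a temp table
val joined = factDF.join(dimDF, Seq("id"))
joined.cache()
joined.registerTempTable("joined_table")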
90. Now you can run queries as if it were a traditional database
91. Hey, that's great, but how fast?
500 ms average response time
200 concurrent complex queries
1+ billion rows with 200+ columns
SQL queries with 5 predicates; min, max, and count on some values; and a group by on 5 columns
No Spark caching
92. We take this a step further and build complex DataFrames, which are made available as registered temp tables