These are the slides from my talk at Hulu in March 2015 on Apache Spark & Cassandra. I cover the evolution of data systems from a single machine, to an RDBMS (MySQL is the primary example), to big data systems.
On the Spark side, I cover batch jobs, streaming, Apache Kafka, and an introduction to machine learning: clustering, logistic regression, and recommendation systems (collaborative filtering).
The talk was recorded and is available on YouTube: https://www.youtube.com/watch?v=_gFgU3phogQ
2. Small Data
• 100s of MB to low GB, single user
• sed, awk, grep are great
• Python & Ruby
• sqlite
• Iterate quickly
• Limitations:
• bad for multiple concurrent users (file sharing!)
3. Medium Data
• Fits on 1 machine
• Most current web apps
• Content-driven web apps (answerbag.com)
• RDBMS is fine
• postgres
• mysql
• Supports hundreds of concurrent users
• ACID makes us feel good
• Scales vertically
5. Replication: ACID is a lie
[Diagram: Client, Master, Slave; replication lag]
Consistent results? Nope!
6. Third Normal Form Doesn't Scale
• Queries are unpredictable
• Users are impatient
• If data > memory, you = history
• Disk seeks are the worst
• Data must be denormalized
(SELECT
    CONCAT(city_name,', ',region) value,
    latitude,
    longitude,
    id,
    population,
    ( 3959 * acos( cos( radians($latitude) ) *
      cos( radians( latitude ) ) * cos( radians( longitude )
      - radians($longitude) ) + sin( radians($latitude) ) *
      sin( radians( latitude ) ) ) )
    AS distance,
    CASE region
        WHEN '$region' THEN 1
        ELSE 0
    END AS region_match
FROM `cities`
$where and foo_count > 5
ORDER BY region_match desc, foo_count desc
limit 0, 11)
UNION
(SELECT
    CONCAT(city_name,', ',region) value,
    latitude,
    longitude,
    id,
    population,
    ( 3959 * acos( cos( radians($latitude) ) *
      cos( radians( latitude ) ) * cos( radians( longitude )
      - radians($longitude) ) + sin( radians($latitude) ) *
      sin( radians( latitude ) ) ) )
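The distance expression buried in the query above is the spherical law of cosines, with 3959 as the Earth's radius in miles. Pulled out on its own it is easier to read. This is a minimal sketch in plain Scala; the object and function names are made up for illustration and are not from the talk.

```scala
object GreatCircle {
  // Earth radius in miles, the same constant (3959) used in the query above.
  val EarthRadiusMiles = 3959.0

  /** Spherical law of cosines: miles between two points given in degrees. */
  def distanceMiles(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
    val p1 = math.toRadians(lat1)
    val p2 = math.toRadians(lat2)
    val dLon = math.toRadians(lon2 - lon1)
    val cosAngle = math.cos(p1) * math.cos(p2) * math.cos(dLon) +
      math.sin(p1) * math.sin(p2)
    // Clamp to [-1, 1] so floating-point rounding cannot push acos into NaN.
    EarthRadiusMiles * math.acos(math.max(-1.0, math.min(1.0, cosAngle)))
  }
}
```

Running this per row for every candidate city is exactly the kind of work that cannot use an index, which is why the query hurts once the data outgrows memory.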
7. Sharding is a Nightmare
• Data is all over the place
• No more joins
• No more aggregations
• Denormalize all the things
• Querying secondary indexes requires hitting every shard
• Adding shards requires manually moving data
• Schema changes
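To see why adding shards means moving data, here is a rough sketch of naive modulo sharding. It assumes integer user ids and a `shard = id mod N` placement rule, which is one common manual scheme and not necessarily what any given deployment uses; all names are hypothetical.

```scala
object ReshardSketch {
  // Naive placement: shard index is the id modulo the number of shards.
  def shardFor(userId: Int, shards: Int): Int =
    Math.floorMod(userId, shards)

  // Count ids whose shard changes when the shard count goes from `before` to `after`.
  def movedKeys(ids: Seq[Int], before: Int, after: Int): Int =
    ids.count(id => shardFor(id, before) != shardFor(id, after))

  def main(args: Array[String]): Unit = {
    val ids = 0 until 10000
    val moved = movedKeys(ids, before = 4, after = 5)
    // prints "8000 of 10000 ids change shards"
    println(s"$moved of ${ids.size} ids change shards")
  }
}
```

Under this scheme, going from 4 to 5 shards relocates 80% of the rows, and every one of those moves is a manual migration.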
8. High Availability… not really
• Master failover… who's responsible?
• Another moving part…
• Bolted on hack
• Downtime is frequent
• Change database settings (InnoDB buffer pool, etc.)
• Drive, power supply failures
• OS updates
• Multi-DC? Yeah ok buddy…
9. Summary of Failure
• Scaling is a pain
• ACID is naive at best
• You aren't consistent
• Re-sharding is a manual process
• We're going to denormalize for performance
• High availability is complicated, requires additional operational overhead
10. Lessons Learned
• Consistency is not practical
• So we give it up
• Manual sharding & rebalancing is hard
• So let's build it in
• Every moving part makes systems more complex
• So let's simplify our architecture - no more master / slave
• Scaling up is expensive
• We want commodity hardware
• Scatter / gather no good
• We denormalize for real time query performance
• Goal is to always hit 1 partition
11. What is Apache Cassandra?
• Fast Distributed Database
• High Availability
• Linear Scalability
• Predictable Performance
• No SPOF
• Multi-DC
• Commodity Hardware
• Easy to manage operationally
12. Masterless, Peer to Peer
• Data distributed automatically
• Data replicated automatically (replication factor)
• Rack / AZ aware
• Queries routed internally to correct server
• Tune consistency as needed
• ONE, QUORUM, ALL
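A quick way to reason about tunable consistency: QUORUM means a majority of replicas (RF / 2 + 1), and a read is guaranteed to overlap the latest acknowledged write when read replicas + write replicas > replication factor. The sketch below just encodes that arithmetic in plain Scala; the type and function names are illustrative and not part of any driver API.

```scala
object ConsistencySketch {
  sealed trait Level
  case object One extends Level
  case object Quorum extends Level
  case object All extends Level

  // Number of replicas that must respond at each level for a given replication factor.
  def required(level: Level, rf: Int): Int = level match {
    case One    => 1
    case Quorum => rf / 2 + 1
    case All    => rf
  }

  // Read and write replica sets must overlap: R + W > RF.
  def readsSeeLatestWrite(read: Level, write: Level, rf: Int): Boolean =
    required(read, rf) + required(write, rf) > rf
}
```

With RF = 3, QUORUM reads plus QUORUM writes overlap (2 + 2 > 3), while ONE plus ONE does not (1 + 1 ≤ 3), which is the trade you make for lower latency.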
13. Data Structures
• Like an RDBMS, Cassandra uses a Table to store data
• But that’s where the similarities end
• Partitions within tables
• Rows within partitions (or a single row)
• CQL to create tables & query data
• Partition keys determine where a partition is found
• Clustering keys determine ordering of rows within a partition
[Diagram: Keyspace > Table > Partition > Row]
14. Embrace Denormalization
• Fastest way to query data is to ask for a list of things, stored contiguously
group_name   user_id   name    age
ninjas       1         Jon     33
ninjas       2         Luke    33
ninjas       3         Pete    103
ninjas       4         Sarah   22
group_name: partition key (grouping)
user_id: clustering key (sorting)
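One way to picture the ninjas table is as a map from partition key to a sorted map of rows keyed by clustering key. The sketch below is an in-memory analogy only, assuming that mental model; it says nothing about Cassandra's actual storage engine, and the names are hypothetical.

```scala
import scala.collection.immutable.SortedMap

object PartitionSketch {
  case class User(name: String, age: Int)

  // partition key (group_name) -> rows sorted by clustering key (user_id)
  type Table = Map[String, SortedMap[Int, User]]

  def insert(table: Table, groupName: String, userId: Int, user: User): Table = {
    val partition = table.getOrElse(groupName, SortedMap.empty[Int, User])
    table.updated(groupName, partition + (userId -> user))
  }

  // Reading one partition returns its rows already ordered by clustering key.
  def readPartition(table: Table, groupName: String): Seq[(Int, User)] =
    table.get(groupName).map(_.toSeq).getOrElse(Seq.empty)
}
```

Rows can be inserted in any order; reading the "ninjas" partition still yields user_id 1, 2, 3, 4, which is the "list of things, stored contiguously" idea from the slide.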
15. Model Tables to Answer Queries
• This is not 3NF!!
• We always query by partition key
• Create many tables aka materialized views
• Manage in your app code
• Denormalize!!
user   age
jon    33
luke   33
JCVD   53

age   user   user
33    jon    luke
53    JCVD
CREATE TABLE age_to_user (
age int,
user text,
primary key (age, user)
);
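"Manage in your app code" means the application derives the second table from the first. Here is a small in-memory sketch of that derivation using the data from the slide. It uses plain Scala collections rather than a Cassandra driver, sorts with Scala's default string ordering as a stand-in for clustering order, and the names are illustrative.

```scala
object AgeIndexSketch {
  // Base data keyed by user, as in the first table above.
  val userToAge: Map[String, Int] = Map("jon" -> 33, "luke" -> 33, "JCVD" -> 53)

  // Derived table keyed by age: one partition per age, users sorted within it,
  // mirroring PRIMARY KEY (age, user).
  def ageToUser(users: Map[String, Int]): Map[Int, Seq[String]] =
    users.toSeq
      .groupBy { case (_, age) => age }
      .map { case (age, rows) => age -> rows.map(_._1).sorted }
}
```

In a real application, each write to the user table would be paired with a write to `age_to_user` so both stay queryable by their own partition key.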
16. Limitations
• No aggregations yet
• (coming in 3.0)
• No joins
• Select rows by partition key
• Manage your own secondary indexes
• No arbitrary, full cluster queries
18. Apache Spark
• Batch processing
• Functional constructs
• map / reduce / filter
• Fully distributed SQL
• RDD is a collection of data
• Scala or Python
• Streaming
• Machine learning
• Graph analytics (GraphX)
19. Spark on Cassandra
• Use the DataStax Spark connector (OSS)
• Data locality - run Spark locally on each node
• Dedicated DC - different workload
• Jobs write results back to Cassandra
21. Data Migrations
import org.apache.spark.{SparkContext, SparkConf}
import com.datastax.spark.connector._

object DataMigration {

  // Row shape of the target table. Defined at object level rather than inside
  // main, since local case classes can trip up reflection-based column mapping.
  case class FoodToUserIndex(food: String, user: String)

  def main(args: Array[String]): Unit = {

    // Point the connector at a local Cassandra node
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "127.0.0.1")

    val sc = new SparkContext("local", "test", conf)

    // Read every row of tutorial.user
    val user_table = sc.cassandraTable("tutorial", "user")

    // Re-key each user by favorite food
    val food_index = user_table.map(r =>
      FoodToUserIndex(r.getString("favorite_food"),
                      r.getString("name")))

    // Write the new index table back to Cassandra
    food_index.saveToCassandra("tutorial", "food_to_user_index")
  }
}
22. SparkSQL
• Register an RDD as a table
• Query as if it were an RDBMS
• Joins, aggregations, etc
• Join across data sources (Postgres to Cassandra)
• Supports JDBC & ODBC
case class Person(name: String, age: Int)
sc.cassandraTable[Person]("test", "persons").registerAsTable("persons")
val adults = sql("SELECT * FROM persons WHERE age > 17")
24. Overview
• Read data from a streaming source
• ZeroMQ, Kafka, Raw Socket
• Data is read in batches
• Streaming is at best an approximation
• val ssc = new StreamingContext(sc, Seconds(1))
Time   1.1    1.5    1.7    2.1    2.4    2.8    3.4
Data   (1,2)  (4,2)  (6,2)  (9,1)  (3,5)  (7,1)  (3,10)
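The timeline above can be read as events being bucketed into one-second batches, matching `Seconds(1)`. The following is a simplified model of that bucketing in plain Scala, assuming each event lands in the batch whose interval contains its timestamp; real Spark Streaming batches by arrival time at the receiver, and the names here are made up.

```scala
object MicroBatchSketch {
  // (timestamp in seconds, payload) pairs taken from the timeline above.
  val events: Seq[(Double, (Int, Int))] = Seq(
    (1.1, (1, 2)), (1.5, (4, 2)), (1.7, (6, 2)),
    (2.1, (9, 1)), (2.4, (3, 5)), (2.8, (7, 1)),
    (3.4, (3, 10))
  )

  // Assign each event to the batch whose interval contains its timestamp,
  // then return the batches in time order.
  def batches(evts: Seq[(Double, (Int, Int))],
              intervalSeconds: Double): Seq[(Int, Seq[(Int, Int)])] =
    evts
      .groupBy { case (ts, _) => math.floor(ts / intervalSeconds).toInt }
      .toSeq
      .sortBy(_._1)
      .map { case (batch, rows) => (batch, rows.map(_._2)) }
}
```

With a one-second interval, the seven events fall into three batches of sizes 3, 3 and 1, and each batch is then processed like a small RDD.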
25. What is Apache Kafka?
• Distributed, partitioned pub/sub
• Messages are sent to “topics” (sort of like a queue)
• Multiple subscribers can read from the same partition
• Each subscriber maintains its own position
• Scales massively
• We can use this to message both ways
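The "each subscriber maintains its own position" idea is what separates a log from a classic queue: reading does not remove messages. Here is a toy in-memory sketch of one partition with independent subscribers. It is not the Kafka client API; the class names are hypothetical.

```scala
object LogSketch {
  // An append-only partition log; messages are never removed by reads.
  final class Partition {
    private var messages = Vector.empty[String]

    def append(msg: String): Unit =
      messages = messages :+ msg

    // Return everything from `offset` onward, plus the next offset to read from.
    def readFrom(offset: Int): (Seq[String], Int) =
      (messages.drop(offset), messages.size)
  }

  // Each subscriber tracks its own offset into the shared partition.
  final class Subscriber(partition: Partition) {
    private var offset = 0

    def poll(): Seq[String] = {
      val (batch, next) = partition.readFrom(offset)
      offset = next
      batch
    }
  }
}
```

Two subscribers on the same partition each see every message, at their own pace, because neither one's reads affect the other's offset.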
26. Streaming Snippet
val rawEvents: ReceiverInputDStream[(String, String)] =
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_ONLY)

// Parse each message value (JSON) into a PageView
val parsed: DStream[PageView] = rawEvents.map { case (_, v) =>
  parse(v).extract[PageView]
}

case class PageViewsPerSite(site_id: String, ts: UUID, pageviews: Int)

// One (site_id, 1) pair per page view
val pairs = parsed.map(event => (event.site_id, 1))

// Sum the hits per site within each batch
val hits_per_site: DStream[PageViewsPerSite] = pairs.reduceByKey(_ + _).map {
  case (site_id, hits) => PageViewsPerSite(site_id, UUIDs.timeBased(), hits)
}

hits_per_site.saveToCassandra("killranalytics", "real_time_data")
Full source: https://github.com/rustyrazorblade/killranalytics/blob/master/spark/src/main/scala/RawEventProcessing.scala
27. Things to keep in mind….
• Streaming aggregations are an approximation
• Best practice is to come up with a rough number in streaming
• You’ll need to reaggregate the original data if you want precision
• If it's time series, use DateTieredCompaction with consistent TTLs
29. Clustering
• Unsupervised learning
• Batch or streaming
• reevaluate clusters as new data arrives
• K-means
• Puts points into predefined # of clusters
• Power iteration clustering
• clustering vertices of a graph
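For a feel of what K-means does with a predefined number of clusters, here is a tiny one-dimensional version of Lloyd's algorithm in plain Scala. It is a teaching sketch, not Spark MLlib's implementation: the initial centroids are passed in by hand (their count is k), and the names are made up.

```scala
import scala.annotation.tailrec

object KMeansSketch {
  // One Lloyd iteration: assign each point to its nearest centroid, then move
  // each centroid to the mean of its points (a centroid with no points stays put).
  def step(points: Seq[Double], centroids: Seq[Double]): Seq[Double] = {
    val assigned = points.groupBy(p => centroids.minBy(c => math.abs(p - c)))
    centroids.map(c => assigned.get(c).map(ps => ps.sum / ps.size).getOrElse(c))
  }

  // Repeat until the centroids stop moving or the iteration budget runs out.
  @tailrec
  def fit(points: Seq[Double], centroids: Seq[Double], maxIter: Int = 100): Seq[Double] = {
    val next = step(points, centroids)
    if (maxIter <= 1 || next == centroids) next
    else fit(points, next, maxIter - 1)
  }
}
```

For example, fitting the points 1, 2, 3, 10, 11, 12 from starting centroids 0 and 5 settles on centroids 2 and 11. Re-running `fit` as new points arrive is the "reevaluate clusters" step from the slide.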
31. Collaborative Filtering
• Recommendation engine
• Algo: Alternating least squares
• Movies, music, etc
• Perfect match for Cassandra
• Source of truth
• Hot, live data
• Spark generates recommendations (store in Cassandra)
• Feedback loops generate better recommendations over time
32. When is Spark + Cassandra right?
• Many storage alternatives
• HDFS / HBase
• RDBMS
• Spark talks to everyone
• CSV on S3
• Lots of live data
• Real time requirements - SLA
• Multi-DC
• Not for data at rest, not ETL
• Cassandra is not a data warehouse
33. Open Source
• Latest, bleeding edge features
• File JIRAs
• Support via mailing list & IRC
• Fix bugs
• cassandra.apache.org
• Perfect for hacking
34. DataStax Enterprise
• Integrated Multi-DC Solr
• Integrated Spark
• Free Startup Program
• <$3M revenue & <$30M funding
• Extended support
• Additional QA
• Focused on stable releases for enterprise
• Included on USB drive