
ETL with SPARK - First Spark London meetup

29,875 views


Supercharging ETL with Spark
Slides from the first Spark Meetup London

Published in: Technology

ETL with SPARK - First Spark London meetup

  1. Supercharging ETL with Spark - Rafal Kwasny, First Spark London Meetup, 2014-05-28
  2. Who are you?
  3. About me
     • Sysadmin/DevOps background
     • Worked as DevOps @Visualdna
     • Now building game analytics platform @Sony Computer Entertainment Europe
  4. Outline
     • What is ETL
     • How do we do it in the standard Hadoop stack
     • How can we supercharge it with Spark
     • Real-life use cases
     • How to deploy Spark
     • Lessons learned
  5. Standard technology stack: get the data
  6. Standard technology stack: load into HDFS / S3
  7. Standard technology stack: Extract & Transform & Load
  8. Standard technology stack: query, analyze, train ML models
  9. Standard technology stack: real-time pipeline
  10. Hadoop
      • Industry standard
      • Have you ever looked at Hadoop code and tried to fix something?
  11. How simple is simple?
      "Simple YARN application to run n copies of a unix command - deliberately kept simple (with minimal error handling etc.)"
      ➜ $ git clone https://github.com/hortonworks/simple-yarn-app.git
      (…)
      ➜ $ find simple-yarn-app -name "*.java" | xargs cat | wc -l
      232
  12. ETL Workflow
      • Get some data from S3/HDFS
      • Map
      • Shuffle
      • Reduce
      • Save to S3/HDFS
  13. ETL Workflow
      • Get some data from S3/HDFS
      • Map
      • Shuffle
      • Reduce
      • Save to S3/HDFS
      Repeat 10 times
  14. Issue: test run time
      • Job startup time: ~20 s to run a job that does nothing
      • Hard to test the code without a cluster (Cascading simulation mode != real life)
  15. Issue: new applications
      MapReduce is awkward for key big data workloads:
      • Low-latency dispatch (e.g. quick queries)
      • Iterative algorithms (e.g. ML, graphs)
      • Streaming data ingest
  16. Issue: hardware is moving on
      Hardware has advanced since Hadoop started:
      • Very large RAM, faster networks (10 Gb+)
      • Bandwidth to disk not keeping up
      • 1 GB of RAM ~ $0.75/month*
      *based on the spot price of an AWS r3.8xlarge instance
  17. How can we supercharge our ETL?
  18. Use Spark
      • Fast and expressive cluster computing engine
      • Compatible with Apache Hadoop
      • In-memory storage
      • Rich APIs in Java, Scala, Python
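      A minimal sketch of what that API looks like, using a classic word count in Scala (the input path and app name are illustrative, not from the deck):

        import org.apache.spark.{SparkConf, SparkContext}

        object WordCount {
          def main(args: Array[String]): Unit = {
            // local[*] runs Spark inside the current JVM; point setMaster at a real cluster in practice
            val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("wordcount"))
            val counts = sc.textFile("/tmp/input.txt")   // RDD of lines
              .flatMap(_.split("\\s+"))                  // split lines into words
              .map(word => (word, 1))                    // pair each word with a count of 1
              .reduceByKey(_ + _)                        // sum the counts per word
            counts.take(10).foreach(println)
            sc.stop()
          }
        }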
  19. Why Spark?
      • Up to 40x faster than Hadoop MapReduce (for some use cases, see: https://amplab.cs.berkeley.edu/benchmark/)
      • Jobs can be scheduled and run in <1 s
      • Typically less code (2-5x)
      • Seamless Hadoop/HDFS integration
      • REPL
      • Accessible source in terms of LOC and modularity
  20. Why Spark?
      • Berkeley Data Analytics Stack ecosystem: Spark, Spark Streaming, Shark, BlinkDB, MLlib
      • Deep integration into the Hadoop ecosystem:
        • Read/write Hadoop formats
        • Interoperability with other ecosystem components
        • Runs on Mesos & YARN, also MR1
        • EC2, EMR
        • HDFS, S3
  21. Why Spark?
  22. Using RAM for in-memory caching
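      In code, that caching boils down to calling cache() (or persist()) on an RDD; a small sketch, assuming the spark-shell where sc is already defined and an illustrative input path:

        val events = sc.textFile("file:///tmp/events").filter(_.nonEmpty)
        events.cache()    // keep the computed partitions in executor memory
        events.count()    // first action: reads from disk and fills the cache
        events.count()    // second action: served from memory, no re-read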
  23. Fault recovery
  24. Stack
      Also:
      • Shark (Hive on Spark)
      • Tachyon (off-heap caching)
      • SparkR (R wrapper)
      • BlinkDB (approximate queries)
  25. Real-life use
  26. Spark use cases
      • Next-generation ETL platform
      • No more "multiple chained MapReduce jobs" architecture
      • Fewer jobs to worry about
      • Better sleep for your DevOps team
  27. Sessionization: add session_id to events
  28. Why add a session id? Combine all user activity into user sessions
  29. Adding session ID
      user_id | timestamp  | Referrer             | URL
      user1   | 1401207490 | http://fb.com        | http://webpage/
      user2   | 1401207491 | http://twitter.com   | http://webpage/
      user1   | 1401207543 | http://webpage/      | http://webpage/login
      user1   | 140120841  | http://webpage/login | http://webpage/add_to_cart
      user2   | 1401207491 | http://webpage/      | http://webpage/product1
  30. Group by user
      user_id | timestamp  | Referrer             | URL
      user1   | 1401207490 | http://fb.com        | http://webpage/
      user1   | 1401207543 | http://webpage/      | http://webpage/login
      user1   | 140120841  | http://webpage/login | http://webpage/add_to_cart
      user2   | 1401207491 | http://twitter.com   | http://webpage/
      user2   | 1401207491 | http://webpage/      | http://webpage/product1
  31. Add unique session id
      user_id | timestamp  | session_id                       | Referrer             | URL
      user1   | 1401207490 | 8fddc743bfbafdc45e071e5c126ceca7 | http://fb.com        | http://webpage/
      user1   | 1401207543 | 8fddc743bfbafdc45e071e5c126ceca7 | http://webpage/      | http://webpage/login
      user1   | 140120841  | 8fddc743bfbafdc45e071e5c126ceca7 | http://webpage/login | http://webpage/add_to_cart
      user2   | 1401207491 | c00e7421525008584d9d1ff4201cbf65 | http://twitter.com   | http://webpage/
      user2   | 1401207491 | c00e7421525008584d9d1ff4201cbf65 | http://webpage/      | http://webpage/product1
  32. Join with external data
      user_id | timestamp  | session_id                       | new_user | Referrer             | URL
      user1   | 1401207490 | 8fddc743bfbafdc45e071e5c126ceca7 | TRUE     | http://fb.com        | http://webpage/
      user1   | 1401207543 | 8fddc743bfbafdc45e071e5c126ceca7 | TRUE     | http://webpage/      | http://webpage/login
      user1   | 140120841  | 8fddc743bfbafdc45e071e5c126ceca7 | TRUE     | http://webpage/login | http://webpage/add_to_cart
      user2   | 1401207491 | c00e7421525008584d9d1ff4201cbf65 | FALSE    | http://twitter.com   | http://webpage/
      user2   | 1401207491 | c00e7421525008584d9d1ff4201cbf65 | FALSE    | http://webpage/      | http://webpage/product1
  33. Sessionize user clickstream
      • Filter interesting events
      • Group by user
      • Add a unique sessionId
      • Join with external data sources
      • Write output
  34. val input = sc.textFile("file:///tmp/input")
      val rawEvents = input
        .map(line => line.split("\t"))

      val userInfo = sc.textFile("file:///tmp/userinfo")
        .map(line => line.split("\t"))
        .map(user => (user(0), user))

      val processedEvents = rawEvents
        .map(arr => (arr(0), arr))
        .cogroup(userInfo)
        .flatMapValues(k => {
          val new_user = k._2.length match {
            case x if x > 0 => "true"
            case _ => "false"
          }
          val session_id = java.util.UUID.randomUUID.toString
          k._1.map(line =>
            line.slice(0, 3) ++ Array(session_id) ++ Array(new_user) ++ line.drop(3)
          )
        })
        .map(k => k._2)
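      The snippet stops at the transformed RDD; the final "write output" step from slide 33 would be along these lines (the output path and tab-separated format are illustrative):

        processedEvents
          .map(fields => fields.mkString("\t"))        // one tab-separated line per event
          .saveAsTextFile("file:///tmp/output")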
  35. Why is it better?
      • Single Spark job
      • Easier to maintain than three consecutive MapReduce stages
      • Can be unit tested (see the sketch below)
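      Unit testing works because a SparkContext with a local master runs the same code inside one JVM; a minimal sketch (the transformation under test here is a stand-in for the job's own logic):

        import org.apache.spark.{SparkConf, SparkContext}

        object LocalModeTest {
          def main(args: Array[String]): Unit = {
            // no cluster needed: "local[2]" runs the driver and executors in this JVM
            val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("unit-test"))
            try {
              val input  = sc.parallelize(Seq("a\tb", "c\td"))
              val parsed = input.map(_.split("\t"))            // same parsing step as the ETL job
              assert(parsed.map(_(0)).collect().toSet == Set("a", "c"))
            } finally {
              sc.stop()
            }
          }
        }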
  36. From the DevOps perspective
  37. v1.0 - running on EC2
      • Start with the EC2 script:
        ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> --instance-type=c3.xlarge launch <cluster-name>
      • If it does not work for you, modify it - it's just simple Python + boto
  38. v2.0 - autoscaling on spot instances
      • 1x master - on-demand (c3.large)
      • XX slaves - spot instances, depending on usage patterns (r3.*)
      • No HDFS
      • Persistence in memory + S3 (sketched below)
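      With no HDFS in that setup, the working set stays cached in executor memory and only inputs/outputs live on S3; roughly like this (the bucket, paths and s3n:// scheme are illustrative, and credentials would normally come from the Hadoop configuration):

        val raw = sc.textFile("s3n://my-bucket/events/2014-05-28/")     // read input straight from S3
        val working = raw.map(_.split("\t")).cache()                    // keep the working set in memory
        working.map(_.mkString("\t")).saveAsTextFile("s3n://my-bucket/output/2014-05-28/")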
  39. Other options
      • Mesos
      • YARN
      • MR1
  40. Lessons learned
  41. JVM issues
      • java.lang.OutOfMemoryError: GC overhead limit exceeded
      • Add more memory?
        val sparkConf = new SparkConf()
          .set("spark.executor.memory", "120g")
          .set("spark.storage.memoryFraction", "0.3")
          .set("spark.shuffle.memoryFraction", "0.3")
      • Increase parallelism:
        sc.textFile("s3://..path", 10000)
        groupByKey(10000)
  42. Full GC
      2014-05-21T10:15:23.203+0000: 200.710: [Full GC 109G->45G(110G), 79.3771030 secs]
      2014-05-21T10:16:42.580+0000: 280.087: Total time for which application threads were stopped: 79.3773830 seconds
      We want to avoid this:
      • Use G1GC + Java 8
      • Store data serialized:
        set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        set("spark.kryo.registrator", "scee.SceeKryoRegistrator")
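      The scee.SceeKryoRegistrator referenced above is the project's own class; such a registrator is roughly this shape (the registered classes here are placeholders for the job's record types):

        import com.esotericsoftware.kryo.Kryo
        import org.apache.spark.serializer.KryoRegistrator

        class SceeKryoRegistrator extends KryoRegistrator {
          override def registerClasses(kryo: Kryo): Unit = {
            // registering classes up front lets Kryo write a small id instead of the full class name
            kryo.register(classOf[Array[String]])
            kryo.register(classOf[Array[Array[String]]])
          }
        }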
  43. Bugs
      • For example: CDH5 does not work with Amazon S3 out of the box (thanks to Sean, it will be fixed in the next release)
      • If in doubt, use the provided ec2/spark-ec2 script:
        ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> --instance-type=c3.xlarge launch <cluster-name>
  44. Tips & Tricks
      • You do not need to package the whole of Spark with your app, just mark the dependencies as "provided" in sbt:
        libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-cdh5.0.1" % "provided"
        libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0-cdh5.0.1" % "provided"
        Assembly jar size goes from 120 MB -> 5 MB
      • Always ensure you are compiling against the same version of the artifacts; if not, "bad things will happen"™
  45. Future - Spark 1.0
      • Voting in progress to release Spark 1.0.0 RC11
      • Spark SQL
      • History server
      • Job submission tool
      • Java 8 support
  46. Spark - Hadoop done right
      • Faster to run, less code to write
      • Deploying Spark can be easy and cost-effective
      • Still rough around the edges, but improving quickly
  47. Thank you for listening :)
