
Scala at Treasure Data

2,577 views

Scala at Treasure Data - Treasure Data Tech Talk @ Tokyo, June 13, 2017

Published in: Technology

  1. TREASURE DATA • Scala at Treasure Data • Taro L. Saito (GitHub: @xerial), Ph.D., Software Engineer at Treasure Data, Inc. • Treasure Data Tech Talk @ Tokyo, June 13, 2017
  2. Why Scala? • Scala is not an official programming language of Treasure Data • Three years ago, I was the only engineer at TD who could write Scala • Now all of my team members can write Scala • Fact: Java experts can quickly learn Scala • https://www.treasuredata.com/company/careers/
  3. Challenge: Increased Presto Usage at Treasure Data (2017) • Processing 15 Trillion Rows / Day (= 173 Million Rows / sec.) • 150,000+ Queries / Day • 1,500+ Users • How do we improve the service by utilizing this massive amount of query logs? • Diagram labels: Query Logs, Store, Analyze SQL, Improve & Optimize
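The rows-per-second figure on slide 3 follows directly from the daily total. A minimal sketch of the arithmetic (the object and value names are made up for this example):

```scala
object RowsPerSecond {
  // 15 trillion rows per day, as stated on the slide
  val rowsPerDay: Long = 15000000000000L
  // 24 hours * 60 minutes * 60 seconds = 86,400 seconds
  val secondsPerDay: Long = 24L * 60L * 60L
  // Integer division: roughly 173.6 million rows per second
  val rowsPerSecond: Long = rowsPerDay / secondsPerDay

  def main(args: Array[String]): Unit =
    println(s"$rowsPerSecond rows/sec")
}
```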
  4. A Success Story: Using Scala in Genome Science
  5. Scala Use Cases in TD • Analyzing Query Engine Logs • Data analytics workflows written in Scala • For finding effective optimization approaches • Prestobase • Management Base of Presto • Gateway to access Presto (Finagle + Presto) • Monitoring + Runtime Analysis • Spark Integration • Accessing Treasure Data from Spark
  6. Open-Source Scala Libraries Developed at TD • Libraries that make Scala programming fun • wvlet-log: handy logging library: https://github.com/wvlet/log • Airframe: Dependency Injection Library http://wvlet.org/airframe • Airframe Config: YAML-based configuration library (a module in Airframe) • Heavy use of meta-programming via Scalamacros • sbt plugins • Data analytics • sbt-sql: https://github.com/xerial/sbt-sql • Deployment • sbt-pack: https://github.com/xerial/sbt-pack • sbt-sonatype: https://github.com/xerial/sbt-sonatype
  7. What is Scalamacros? • Generates Scala code at compile-time • Meta-programming (writing a program that writes programs) • Experimental in Scala 2.10, 2.11, and 2.12 • Scalamacros will no longer be experimental • Productization planned within 2017 • https://github.com/scalamacros/scalamacros • Scala macro author (@xeno-by), IntelliJ team, EPFL Ph.D. student • Will support Scala 2.12 (and maybe Scala 2.11) and 3.0 • Announced at Scala Meetup at Twitter HQ, San Francisco
  8. What is Scala 3.x? • Scala 3.x • Replaces the compiler with Dotty for faster compilation and better IDE integration • Dotty: Compilers Are Databases (Martin Odersky, Scala’s creator) • https://www.youtube.com/watch?v=WxyyJyB_Ssc • Because the compiler needs to answer … • Q: What is the signature of method A.f at a given point in time? • class A[T] { def f(x: T): T = … } • The compiler itself, IDEs (e.g., IntelliJ), etc. need to know these temporal types (Denotations)
  9. Open-Source Scala Libraries in TD
  10. Logging Library: Hard to Use • Logging configuration is hard • slf4j, log4j, logback-classic, etc. • XML configuration, etc. • Need to have redundant getLogger calls • Figure: embulk log configuration with logback-classic
  11. Dependency Hell of slf4j • slf4j (Simple Logging Facade for Java) • The de facto standard logging library for Java • scala-logging: slf4j wrapper for Scala • Switches log outputs by placing a binding library on the classpath • slf4j-nop (no output) • slf4j-simple (console output) • slf4j-log4j (output to log4j) • Pitfall • Cannot have multiple binders • But must have 1 binder (!!!) • De facto standard = many badly behaved users • e.g., hadoop • Shows no consideration for downstream users: includes slf4j-log4j as a direct dependency • Need to exclude slf4j-log4j bindings from all hadoop-related projects
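The exclusion described on slide 11 might look like the following sbt build fragment. It is a sketch under assumptions: the Hadoop artifact and both version numbers are placeholders, and `slf4j-log4j12` is assumed to be the concrete artifact behind the "slf4j-log4j" binding that the slide names.

```scala
// build.sbt (fragment). Artifact choices and versions are illustrative placeholders.
libraryDependencies ++= Seq(
  // Drop the log4j binding that the Hadoop artifact brings in transitively
  ("org.apache.hadoop" % "hadoop-common" % "2.7.3")
    .exclude("org.slf4j", "slf4j-log4j12"),
  // Then supply exactly one binding of your own choosing
  "org.slf4j" % "slf4j-simple" % "1.7.25"
)
```

The same `exclude` call would have to be repeated on every Hadoop-related dependency, which is the pain point the slide describes.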
  12. wvlet-log github.com/wvlet/log • Favors Simplicity • Uses Scalamacros to simplify user code • Only need to extend LogSupport trait • No getLogger call • Using standard java.util.logging • No other dependency required • Features • Show source code locations of logs • Log format is configurable in the code (no XML or plugins!) • Changing log levels with files or JMX • log.properties • log-test.properties • Built-in log handlers • log-rotate handler, async handler • Works with Scala.js to show logs in the web browser console
  13. wvlet-log: Logging code generation with Scalamacros • Generate low-overhead logging code • Quasiquote • q"… scala code" • Just write a Scala code template in the macro
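The idea behind slides 12 and 13 can be approximated without macros. The sketch below is not wvlet-log's implementation: `SimpleLogSupport` and `Worker` are hypothetical names, and a by-name parameter stands in for the macro-generated code. It shows the two properties the slides emphasize: no explicit getLogger call in user code, and a message that is only constructed when its log level is enabled.

```scala
import java.util.logging.{Level, Logger}

// Hypothetical trait for illustration; wvlet-log's real LogSupport is macro-based.
trait SimpleLogSupport {
  // One logger per concrete class, created lazily, so user code never calls getLogger
  protected lazy val logger: Logger = Logger.getLogger(getClass.getName)

  // `message` is passed by name, so it is evaluated only when the level is enabled
  protected def debug(message: => String): Unit =
    if (logger.isLoggable(Level.FINE)) logger.log(Level.FINE, message)

  protected def info(message: => String): Unit =
    if (logger.isLoggable(Level.INFO)) logger.log(Level.INFO, message)
}

class Worker extends SimpleLogSupport {
  // Pin the level so the example does not depend on the JVM's logging configuration
  logger.setLevel(Level.INFO)

  var evaluations = 0
  private def expensiveSummary(): String = { evaluations += 1; "summary" }

  def run(): Unit = {
    info(s"started: ${expensiveSummary()}")   // INFO is enabled: the message is built
    debug(s"details: ${expensiveSummary()}")  // FINE is disabled: the message is never built
  }
}
```

A by-name parameter still allocates a closure at each call site. A macro can avoid even that by writing the level check directly into the calling code, which is presumably what "low-overhead" refers to on the slide.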
  14. Airframe: wvlet.org/airframe/ • Dependency Injection Library for Scala • Best practices of building objects in Scala • We needed Google Guice for Scala • But there is no good alternative • Guice, Dagger2, Scaldi, Macwire, etc. • http://wvlet.org/airframe/docs/comparison.html • Using Google Guice in Scala • PlayFramework • Weird syntax • Airframe uses Scalamacros to simplify DI in Scala
  15. Airframe • Three-step DI in Scala • Bind • Design • Build • Built-in life cycle manager • Session start/shutdown • e.g., connection open/close • Session • Manages singletons and binding rules
  16. Clear Separation of Concerns • Traditional Service Building: • With Airframe: • Clear separation of concerns: • How to build objects (design) • How to use objects (bind) • Simplest DI pattern for Scala • Callouts: How to build dependencies; Just use components!; Need to remember argument orders
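The split between "how to build objects" (design) and "how to use objects" (bind) from slides 15 and 16 can be sketched with a tiny hand-rolled container. Everything here is hypothetical and deliberately not Airframe's API: `Design`, `Session`, `DbConfig`, `Db`, `App`, and `Wiring` exist only for this example. The session keeps one instance per type, mirroring the slide's note that a session manages singletons and binding rules.

```scala
import scala.collection.mutable
import scala.reflect.ClassTag

// Hypothetical mini container for illustration; this is NOT Airframe's API.
final class Design private (rules: Map[Class[_], Session => Any]) {
  // Design step: describe how to build an A without building it yet
  def bind[A](factory: Session => A)(implicit tag: ClassTag[A]): Design =
    new Design(rules + (tag.runtimeClass -> factory))

  def newSession: Session = new Session(rules)
}

object Design {
  val empty: Design = new Design(Map.empty)
}

final class Session(rules: Map[Class[_], Session => Any]) {
  private val singletons = mutable.Map.empty[Class[_], Any]

  // Build step: create each object at most once per session
  def build[A](implicit tag: ClassTag[A]): A = {
    val key = tag.runtimeClass
    val instance = singletons.get(key) match {
      case Some(existing) => existing
      case None =>
        val factory = rules.getOrElse(
          key, throw new IllegalArgumentException(s"No binding for ${key.getName}"))
        val created = factory(this)
        singletons(key) = created
        created
    }
    instance.asInstanceOf[A]
  }
}

final case class DbConfig(url: String)
final class Db(val config: DbConfig)
final class App(val db: Db) // App only uses Db; it does not know how Db is built

object Wiring {
  // All construction knowledge lives in one place
  val design: Design = Design.empty
    .bind[DbConfig](_ => DbConfig("jdbc:sqlite::memory:"))
    .bind[Db](s => new Db(s.build[DbConfig]))
    .bind[App](s => new App(s.build[Db]))
}
```

Calling `Wiring.design.newSession.build[App]` then yields a fully wired `App`. Compare this with writing `new App(new Db(DbConfig(...)))` at every use site, where each caller has to remember the constructor arguments and their order.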
  17. Airframe Internals (Advanced) • Code generation with Scalamacros • Passing a Session when building App and A • http://wvlet.org/airframe/docs/internals.html
  18. Customizing Prestobase Filters with Airframe • Prestobase Proxy: Gateway to access Presto • Adding TD-specific bindings • Finagle filters -> Injecting TD-specific filters
  19. VCR Record/Replay for Testing Presto • Launching Presto requires a lot of memory (e.g., 2GB or more) • Often crashes CI service containers (TravisCI, CircleCI, etc.) • Recording Presto responses (prestobase-vcr) • with sqlite-jdbc: https://github.com/xerial/sqlite-jdbc • DB file for each test suite • Enables testing with a small memory footprint • Can run many Presto tests in CI
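The record/replay idea from slide 19 reduces to a cache in front of the expensive backend. The sketch below rests on assumptions: `RecordReplay` is a hypothetical name, and an in-memory map stands in for the per-suite SQLite file that prestobase-vcr uses according to the slide.

```scala
import scala.collection.mutable

// Hypothetical sketch. prestobase-vcr records into SQLite via sqlite-jdbc;
// this version keeps the "tape" in memory to stay dependency-free.
final class RecordReplay(backend: String => Seq[String]) {
  private val tape = mutable.Map.empty[String, Seq[String]]
  private var calls = 0

  // Number of times the real backend was actually invoked
  def backendCalls: Int = calls

  def query(sql: String): Seq[String] = tape.get(sql) match {
    case Some(recorded) => recorded // replay: no backend needed
    case None =>
      calls += 1
      val rows = backend(sql)       // record: first time this query is seen
      tape(sql) = rows
      rows
  }
}
```

Once the tape is persisted, a test run would need no Presto process at all, which is what keeps memory use small enough for CI containers.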
  20. Airframe Config • YAML is useful for configuring applications • Embedding YAML configurations inside docker images • Provide credentials separately • password, API keys, instance-specific params, etc. • properties file, environment variables, etc. • YAML + overrides + object mapping • http://wvlet.org/airframe/docs/config.html
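The "YAML + overrides + object mapping" pipeline from slide 20 can be illustrated with plain maps. This is a sketch under assumptions, not Airframe Config's API: `ServerConfig`, `ConfigSketch`, and the key names are made up, and a hard-coded map stands in for the values parsed from a YAML file.

```scala
// Hypothetical names throughout; Airframe Config itself reads YAML files.
final case class ServerConfig(host: String, port: Int, apiKey: Option[String])

object ConfigSketch {
  // Stand-in for non-secret values parsed from a YAML file baked into the image
  val defaults: Map[String, String] =
    Map("server.host" -> "localhost", "server.port" -> "8080")

  // Later sources win: overrides (e.g., a properties file or environment variables)
  // replace defaults with the same key
  def load(overrides: Map[String, String]): ServerConfig = {
    val merged = defaults ++ overrides
    // Object mapping: turn the flat key/value view into a typed case class
    ServerConfig(
      host = merged("server.host"),
      port = merged("server.port").toInt,
      apiKey = merged.get("server.apiKey")
    )
  }
}
```

Keeping the credential out of the defaults means the image can be shared while secrets arrive only at deployment time.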
  21. Airframe Internal: Surface • Surface: Object surface (shape) inspector library • https://github.com/wvlet/airframe/tree/master/surface • case class A(id:Int, name:String) • surface.of[A] • => Surface("A", Seq(Param("id", surface.of[Int]), Param("name", surface.of[String]))) • Extract object type parameters with Scala Runtime Reflection • Scala generates this type information at compile time • Used as Type Identifiers of Airframe and Airframe Config • e.g., [A], [Seq[B]], [Map[Int, String]], [A @@ Tag], etc. • Generating serializer/deserializer of Scala classes • Surface => Serialize object parameters => Encoding in MessagePack.gz => Embulk
  22. td-spark • Access TD from Spark • Binding components with Airframe • IO Manager, Presto Client, etc. • Passing Design through SparkContext • Integration • TD -> Spark DataFrame • TD Presto Query -> DataFrame
  23. Data Analytics with Scala
  24. New Directions Explored By Presto • Traditional Database Usage • Required Database Administrator (DBA) • DBA designs the schema and queries • DBA tunes query performance • After Presto • Schema is designed by data providers • 1st-party data (user’s customer data) • 3rd-party data sources • Analysts or Marketers explore the data with Presto • Don’t know the schema in advance • Many Analytical SQL queries
  25. Bridging Gaps Between SQL and Programming Language • Traditional Approach • OR mapper: app developers design objects and schema, then generate SQL • New Approach: SQL First • Need to manage various SQL results inside programming language • But How?
  26. An Instinct
  27. sbt-sql: https://github.com/xerial/sbt-sql • sbt plugin for generating model classes from SQL files • src/main/sql/presto/*.sql (Presto Queries) • Using SQL as a function • Read Presto SQL Results as Objects • Enables managing SQL queries in GitHub • Type-safe data analysis
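To make "using SQL as a function" and "read Presto SQL results as objects" from slide 27 concrete, here is a hand-written sketch of the kind of model class such a plugin could produce. It is an assumption for illustration only: the code sbt-sql actually generates may look different, and `TopPath`, the `access_log` table, and the column names are invented.

```scala
// Hypothetical hand-written equivalent; the code sbt-sql really generates may differ.
final case class TopPath(path: String, hits: Long)

object TopPath {
  // The query text becomes a function of its parameters
  def sql(minHits: Long): String =
    s"SELECT path, count(*) AS hits FROM access_log GROUP BY path HAVING count(*) >= $minHits"

  // Map one result row (column name -> value) to a typed object
  def fromRow(row: Map[String, Any]): TopPath =
    TopPath(row("path").toString, row("hits").asInstanceOf[Number].longValue)
}
```

With the query and its result type living together in source control, a renamed column surfaces as a compile error in the analysis code rather than a runtime failure.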
  28. Scala in Production
  29. Packaging • Do you need to install Scala? • No. Only JDK is required • sbt-pack • https://github.com/xerial/sbt-pack • Creates Scala code packages for release in the ./target/pack folder • Folder structure: • bin/ - launch scripts • lib/ - Scala/Java libraries • Makes it easier to create docker images • Also used for creating distributable packages of td-spark
  30. Deploying to Maven Central • Necessary Steps • Upload artifacts -> Close -> Release -> Drop • Painful • Need to log in to Nexus Web UI • Many manual steps • Bintray? • Uploading to Bintray -> Automatic sync to Maven Central
  31. sbt-sonatype plugin • Enables one-command release to Maven Central • Using REST APIs of Sonatype NEXUS Repository Manager • Developed during the 2015 New Year holiday • Jan 5: Test Nexus REST API • Jan 20: First release (Just 1 day effort) • Released sbt-sonatype using sbt-sonatype • 2,000+ projects are using sbt-sonatype • Supports sbt 0.13.x and 1.0.0 • Can be used for Java projects too • Nexus to Maven Central sync is now fast • Less than 10 minutes (June 2017)
  32. Summary • TD is a heavy user of Scala • Analytics pipelines • Production services • Many libraries helping development • Airframe, wvlet-log • sbt plugins • For details about Presto analysis • Join Presto Meetup on Thursday! • Presto Meetup Tokyo: June 15, 2017 (Thu)
  33. TREASURE DATA
