
Real-Time User Attribute Estimation with Spark Streaming


Presentation slides from Spark Meetup December 2015 http://connpass.com/event/23159/

Published in: Data & Analytics


  1. Real-Time User Attribute Estimation with Spark Streaming / @laclefyoshi <ysaeki@r.recruit.co.jp>
  2. • Spark Streaming • Spark Streaming Tips
  3. SAEKI Yoshiyasu • IT • Web • R&D: Hadoop, Kafka, Storm, Spark, Druid • RICOH Theta + Google Cardboard
  4. Spark Streaming http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html
  6. http://www.recruit.jp/company/about/structure.html
  7. OS etc.
  8. 1. Web (JavaScript) 2. fluentd → Kafka
  9. fluentd → Kafka • fluent-plugin-kafka • https://github.com/htgc/fluent-plugin-kafka • output type = kafka_buffered (on file) • Kafka 0.8.2.2 • 0.9.0 • ACL
  11. Suro • Netflix • https://github.com/Netflix/suro • Input: Kafka Consumer API, Thrift API • Output: HDFS, AWS S3, Kafka Producer, Elasticsearch • LinkedIn Gobblin
  12. Hadoop • HDFS • MLlib • Streaming linear regression (Classification) • Streaming k-means (Clustering)
  13. Spark Streaming
  14. Kafka • Direct Approach (>= Spark 1.3) • Exactly-once • Kafka Simple Consumer API
  15. Spark Streaming 1 http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html (figure: RDD @ time1, RDD @ time2, RDD @ time3, RDD @ time4)
  16. Spark Streaming 2 http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html
  17. Micro-batch • 1 Micro-batch (Cookie)
  18. Window-based micro-batch • 1 Micro-batch • 1 Micro-batch
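  19. Micro-batch • RDD → HBase
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.hbase.client.Put
      import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
      import org.apache.hadoop.hbase.util.Bytes
      import org.apache.hadoop.io.Text
      import org.apache.spark.rdd.PairRDDFunctions

      // Write every micro-batch RDD to HBase through the new Hadoop output API.
      dstream.foreachRDD { rdd =>
        // createHbaseConfiguration() is the author's helper (not shown on the slide).
        val hbaseConf = createHbaseConfiguration()
        val jobConf = new Configuration(hbaseConf)
        jobConf.set("mapreduce.job.output.key.class", classOf[Text].getName)
        jobConf.set("mapreduce.job.output.value.class", classOf[Text].getName)
        jobConf.set("mapreduce.outputformat.class", classOf[TableOutputFormat[Text]].getName)
        new PairRDDFunctions(rdd.map(hbaseConvert)).saveAsNewAPIHadoopDataset(jobConf)
      }

      // Convert RDD[(String, Map[K,V])] to RDD[(String, Put)]:
      // the key becomes the row key, each map entry a column in family "seg".
      def hbaseConvert(t: (String, Map[String, String])) = {
        val p = new Put(Bytes.toBytes(t._1))
        t._2.toSeq.foreach { m =>
          p.addColumn(Bytes.toBytes("seg"), Bytes.toBytes(m._1), Bytes.toBytes(m._2))
        }
        (t._1, p)
      }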
  21. Spark Streaming • DStream, RDD • Spark, Spark Streaming • http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html
  22. Spark Streaming • Fault Tolerance • Micro-batch • YARN • YARN Dynamic Resource Allocation
  23. Spark Streaming • RDD → RDD, DStream → DStream • 1 Micro-batch
      // RDD → RDD: build the test input directly
      val input: RDD[String] = sparkContext.makeRDD(Seq("a", "b", "c"))
      // DStream → DStream: each queued RDD is served as one micro-batch
      val queue = scala.collection.mutable.Queue(input)
      val dstream: DStream[String] = sparkStreamingContext.queueStream(queue)
  24. Spark Streaming • spark-testing-base • https://github.com/holdenk/spark-testing-base
      import com.holdenkarau.spark.testing.StreamingSuiteBase

      class JsonElementCountTest extends StreamingSuiteBase {
        test("simple") {
          // One inner list per micro-batch.
          val input = List(List("aa"), List("bb"))
          val expected = List(List("AA"), List("BB"))
          // converterMethod is the DStream[String] => DStream[String] operation under test.
          testOperation[String, String](
            input, converterMethod _, expected, useSet = true)
        }
      }
  25. Spark Streaming • Window-based micro-batch • o.a.spark.streaming.util.ManualClock • private class • Scala • http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/
  26. Spark Streaming • Scala, Java • Spark Streaming, Kafka, HBase: Scala • Java
      // api/java/JavaRDD.scala
      object JavaRDD {
        implicit def fromRDD[T: ClassTag](rdd: RDD[T]): JavaRDD[T] = new JavaRDD[T](rdd)
        implicit def toRDD[T](rdd: JavaRDD[T]): RDD[T] = rdd.rdd
      }
  27. • Spark Streaming • MLlib • GraphX
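
Slide 14 names the Kafka Direct Approach (available from Spark 1.3) without showing code. The following is a minimal sketch of what such an input stream might look like with the Spark 1.x `spark-streaming-kafka` artifact. The broker address, the topic name `access-log`, the 1-second batch interval and the object name are assumptions for illustration, not values from the talk.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectKafkaSketch {
  // Kafka parameters for the direct stream; it talks to the brokers directly.
  def kafkaParams(brokers: String): Map[String, String] =
    Map("metadata.broker.list" -> brokers)

  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val conf = new SparkConf().setAppName("DirectKafkaSketch")
    val ssc = new StreamingContext(conf, Seconds(1)) // assumed batch interval

    val topics = Set("access-log") // hypothetical topic name
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams("localhost:9092"), topics) // hypothetical broker

    // Each element is a (key, message) pair; count the messages per micro-batch.
    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With this approach there is no receiver: each micro-batch reads a fixed offset range from Kafka, and that is what makes the exactly-once processing mentioned on the slide achievable when the output is idempotent or transactional.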

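Slides 17 and 18 contrast processing a single micro-batch with window-based processing over several micro-batches. A rough sketch of the difference, assuming a 1-second batch interval, a 3-batch window and made-up cookie IDs as keys (none of these values come from the slides), could look like this. It uses `queueStream` as an in-memory source so that it runs locally without Kafka.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object WindowSketch {
  // Reduce function shared by the per-batch and per-window aggregations.
  val add: (Int, Int) => Int = _ + _

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowSketch")
    val ssc = new StreamingContext(conf, Seconds(1)) // assumed 1 s batch interval

    // In-memory source: each queued RDD becomes one micro-batch.
    val queue = mutable.Queue[RDD[(String, Int)]]()
    val events = ssc.queueStream(queue)

    // Aggregation over a single micro-batch.
    events.reduceByKey(add).print()
    // Aggregation over the last 3 micro-batches, re-evaluated every batch.
    events.reduceByKeyAndWindow(add, Seconds(3), Seconds(1)).print()

    ssc.start()
    for (_ <- 1 to 5) {
      queue.synchronized {
        // Hypothetical events: (cookie ID, 1) per page view.
        queue += ssc.sparkContext.makeRDD(
          Seq(("cookie-a", 1), ("cookie-b", 1), ("cookie-a", 1)))
      }
      Thread.sleep(1000)
    }
    ssc.stop()
  }
}
```

The per-batch counts stay constant while the windowed counts grow until the window is full, which is the behaviour the two slides illustrate.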
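
Slide 26 quotes the implicit conversions in Spark's `JavaRDD` companion object. As a small sketch of how they can be used, the example below converts a Scala `RDD` to a `JavaRDD`, passes it through a method with a Java-style signature, and converts the result back. `javaStyleApi` and `roundTrip` are hypothetical names standing in for whatever Java code the author called; the conversions are written out explicitly here, although the implicits would also apply them automatically.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.rdd.RDD

object JavaInteropSketch {
  // Hypothetical stand-in for a Java library method that works on JavaRDD.
  def javaStyleApi(in: JavaRDD[String]): JavaRDD[String] = in.distinct()

  // Scala RDD -> JavaRDD -> Java-style call -> back to a Scala RDD.
  def roundTrip(rdd: RDD[String]): RDD[String] = {
    val asJava: JavaRDD[String] = JavaRDD.fromRDD(rdd)
    JavaRDD.toRDD(javaStyleApi(asJava))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("JavaInteropSketch"))
    val out = roundTrip(sc.makeRDD(Seq("a", "b", "a"))).collect().sorted
    println(out.mkString(",")) // prints "a,b"
    sc.stop()
  }
}
```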