
Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

My Scala Matsuri 2016 presentation on the basics of using Spark's text-mining tools with Japanese text

  1. Japanese Text Mining with Scala and Spark (Eduardo Gonzalez, Scala Matsuri 2016)
  2. About Me • Eduardo Gonzalez • Japan Business Systems • Japanese System Integrator (SIer) • Social Systems Design Center (R&D) • University of Pittsburgh • Computer Science • Japanese • @wm_eddie
  3. Agenda • Intro to text mining with Spark • Pre-processing Japanese text • Japanese word breaking • Spark gotchas • Topic extraction with LDA • Intro to Word2Vec • Recommendation with word embedding
  4. Machine Learning Vocabulary • Feature: A number that represents something about a data point • Label: A feature of the data we want to predict • Document: A block of text with a unique ID • Model: A learned set of parameters that can be used for prediction • Corpus: A collection of documents (Machine learning presupposes the vocabulary Feature, Label, Document, Model, and Corpus.)
  5. What is Apache Spark • A library that defines a Resilient Distributed Dataset type and a set of transformations • RDDs are only representations of calculations • A runtime that can execute RDDs in a distributed manner • A master process that schedules and monitors executors • Executors actually do the calculations and can keep results in their memory • Spark SQL, MLlib and GraphX define special types of RDDs (Spark is a general-purpose distributed processing platform with SQL, machine-learning, and graph components.)
  6. Apache Spark Example (a WordCount application built with Spark looks like this)
     import org.apache.spark.{SparkConf, SparkContext}

     object Main extends App {
       val sc = new SparkContext(new SparkConf())
       val text = sc.textFile("hdfs:///kjb.txt")
       val counts = text.flatMap(line => line.split(" "))
         .map(word => (word, 1))
         .reduceByKey(_ + _)
       counts.collect().foreach(println)
     }
  7. Spark’s Text-Mining Tools • LDA for topic extraction • Word2Vec: an unsupervised way to turn words into features based on their meaning • CountVectorizer: turns documents into vectors based on word counts • HashingTF-IDF: calculates the important words of a document with respect to the corpus • And much more (Spark's text-mining tools include LDA, CountVectorizer, HashingTF-IDF, and others; a sketch of the last one follows.)
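A minimal sketch of the HashingTF-IDF bullet using Spark's RDD-based MLlib API. The SparkContext `sc`, the "documents" path, and the naive whitespace tokenization are assumptions for illustration (Japanese text needs a real tokenizer, as the later slides show):

     import org.apache.spark.mllib.feature.{HashingTF, IDF}
     import org.apache.spark.mllib.linalg.Vector
     import org.apache.spark.rdd.RDD

     // Documents as bags of words (naively split on spaces here).
     val documents: RDD[Seq[String]] = sc.textFile("documents").map(_.split(" ").toSeq)

     // Hash each word into a fixed-size term-frequency vector.
     val tf: RDD[Vector] = new HashingTF().transform(documents)
     tf.cache()

     // Re-weight term frequencies by inverse document frequency: high scores mark
     // words frequent in one document but rare across the corpus.
     val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)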
  8. How to use Spark LDA
     import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}
     import org.apache.spark.mllib.linalg.Vectors

     // Load and parse the data
     val data = sc.textFile("data/mllib/sample_lda_data.txt")
     val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
     // Index documents with unique IDs
     val corpus = parsedData.zipWithIndex.map(_.swap).cache()
     // Cluster the documents into three topics using LDA
     val ldaModel = new LDA().setK(3).run(corpus)
  9. sample_lda_data.txt (however, the LDA input data does not look like sentences)
     1 2 6 0 2 3 1 1 0 0 3
     1 3 0 1 3 0 0 2 0 0 1
     1 4 1 0 0 4 9 0 1 2 0
     2 1 0 3 0 0 5 0 2 3 9
     3 1 1 9 3 0 2 0 0 1 3
     4 2 0 3 4 5 1 1 1 4 0
     2 1 0 3 0 0 5 0 2 2 9
     1 1 1 9 2 1 2 0 0 1 3
     4 4 0 3 4 2 1 3 0 0 0
     2 8 2 0 3 0 2 0 2 7 2
     1 1 1 9 0 2 2 0 0 3 3
     4 1 0 0 4 5 1 3 0 1 0
     (´Д`) This does not look like text
  10. LDA Step 0: Get words (to run LDA, the words must be extracted first)
  11. Word Segmentation • Hard to actually get right • Simple in theory with English: str.split(" ") • But not enough for real data • (Take parens for example.) • ["(Take", "parens", "for", "example.)"] • Etc. (Real-world word extraction is hard; splitting on a delimiter alone does not work well, as the snippet below shows.)
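To make the parenthesis example concrete, this is what naive whitespace splitting does to the slide's sentence (plain Scala, standard library only):

     "(Take parens for example.)".split(" ")
     // Array("(Take", "parens", "for", "example.)") -- punctuation stays glued to the words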
  12. Word Segmentation • Since Japanese lacks spaces, it is hard even in theory • A probabilistic approach is necessary • Thankfully there are libraries that can help (Japanese has no word-delimiting characters, so a probabilistic approach is needed; libraries handle this efficiently.)
  13. Morphological Analyzers • Include POS tagging, pronunciation and stemming • MeCab • Written in C++ with SWIG bindings to pretty much everything • Kuromoji • Written in Java, available via Maven • Others (For morphological analysis, i.e. POS tagging, pronunciation, and stemming, there are libraries such as MeCab and Kuromoji.)
  14. JMecab & Spark/Hadoop • Not impossible but difficult • Add MeCab to each node • Add jar to classpaths • Include jar in project for compilation • Not too bad with your own hardware but painful with Amazon EMR or Azure HDInsight (JMecab requires installation up front, so it is manageable on-premises but hard to run in cloud environments.)
  15. Kuromoji & Spark/Hadoop • Easy • Include dependency in build.sbt (see the sketch below) • Include jar file in FatJar with sbt-assembly (Kuromoji is easy to use: just add the dependency and build a FatJar.)
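A minimal build.sbt sketch of those two bullets. The repository URL, artifact coordinates, and versions are assumptions and should be checked against the Kuromoji and sbt-assembly documentation:

     // build.sbt (coordinates and versions are assumptions)
     resolvers += "Atilika Open Source" at "https://www.atilika.org/nexus/content/repositories/atilika"
     libraryDependencies += "org.atilika.kuromoji" % "kuromoji" % "0.7.7"

     // project/plugins.sbt -- sbt-assembly builds the FatJar for spark-submit
     addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")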
  16. Using Kuromoji
     import org.atilika.kuromoji.Tokenizer

     object Main extends App {
       import scala.collection.JavaConverters.asScalaBufferConverter
       val tokenizer = Tokenizer.builder().build()
       val ex1 = "リストのような構造の物から条件を満たす物を探す"
       val res1 = tokenizer.tokenize(ex1).asScala
       for (token <- res1) {
         println(s"${token.getBaseForm}\t${token.getPartOfSpeech}")
       }
     }
  17. Using Kuromoji (Kuromoji recognizes the text like this)
     厚生 名詞,一般,*,*
     年金 名詞,一般,*,*
     基金 名詞,一般,*,*
     脱退 名詞,サ変接続,*,*
     に 助詞,格助詞,一般,*
     伴う 動詞,自立,*,*
     手続き 名詞,サ変接続,*,*
     について 助詞,格助詞,連語,*
     の 助詞,連体化,*,*
     リマ 名詞,固有名詞,地域,一般
     インド 名詞,固有名詞,地域,国
     です 助動詞,*,*,*

     リスト 名詞,一般,*,*
     の 助詞,連体化,*,*
     よう 名詞,非自立,助動詞語幹,*
     だ 助動詞,*,*,*
     構造 名詞,一般,*,*
     の 助詞,連体化,*,*
     物 名詞,非自立,一般,*
     から 助詞,格助詞,一般,*
     条件 名詞,一般,*,*
     を 助詞,格助詞,一般,*
     満たす 動詞,自立,*,*
     物 名詞,非自立,一般,*
     を 助詞,格助詞,一般,*
     探す 動詞,自立,*,*
  18. Step 1: Build Vocabulary
  19. Vocabulary
     lazy val tokenizer = Tokenizer.builder().build()
     val text = sc.textFile("documents")
     val words = for {
       line <- text
       token <- tokenizer.tokenize(line).asScala
     } yield token.getBaseForm
     val vocab = words.distinct().zipWithIndex().collectAsMap()
  20. Step 2: Create Corpus
  21. Corpus
     val documentWords: RDD[Array[String]] =
       text.map(line => tokenizer.tokenize(line).asScala.map(t => t.getBaseForm).toArray)
     val documentCounts: RDD[Array[(String, Int)]] =
       documentWords.map(words => words.distinct.map { word =>
         (word, words.count(_ == word))
       })
     val documentIndexAndCount: RDD[Seq[(Int, Double)]] =
       documentCounts.map(wordsAndCount => wordsAndCount.map {
         case (word, count) => (vocab(word).toInt, count.toDouble)
       })
     val corpus: RDD[(Long, Vector)] =
       documentIndexAndCount.map(Vectors.sparse(vocab.size, _)).zipWithIndex.map(_.swap)
  22. Step 3: Learn Topics
  23. Learn Topics
     val ldaModel = new LDA().setK(10).setMaxIterations(100).run(corpus)
     val topics = ldaModel.describeTopics(10).map {
       case (terms, weights) => terms.map(vocabulary(_)).zip(weights)
     }
     topics.zipWithIndex.foreach { case (topic, i) =>
       println(s"TOPIC $i")
       topic.foreach { case (term, weight) =>
         println(s"$term\t$weight")
       }
       println(s"==========")
     }
  24. Step 4: Evaluate
  25. Topics? (the results were particles and auxiliary words)
     Topic 0: です 0.10870545899718176, 。 0.09623411796419644, さん 0.06105040403724023
     Topic 1: の 0.11035671185240525, を 0.07860862808644907, する 0.05605566895190625
     Topic 2: お願い 0.07579177409154919, ご 0.04431117457391179, よろしく 0.032788330612439916
  26. Step 5: GOTO 2
  27. Filter Stopwords (removing stopwords)
     val popular = words
       .map(w => (w, 1))
       .reduceByKey(_ + _)
       .sortBy(-_._2)
       .take(50)
       .map(_._1)
       .toSet

     val vocabIndicies = words.distinct().filter(w => !popular.contains(w)).zipWithIndex()
     val vocab: Map[String, Long] = vocabIndicies.collectAsMap()
     val vocabulary = vocabIndicies.collect().map(_._1)
  28. Topics!
     Topic 0: 変更 0.032952997236706624, サーバー 0.03140777729144046, 設定 0.021643554361727567, エラー 0.017955380768330902
     Topic 1: ログ 0.028665774057609564, 時間 0.026686704628121154, 時 0.02404938565591628, 発生 0.020797622509804107
     Topic 2: 様 0.0474658820402456, 株式会社 0.026174292703953685, お世話 0.021939329774535308
  29. Using the LDA model • Prediction requires a LocalLDAModel • Use .toLocal if isInstanceOf[DistributedLDAModel] • Convert new documents to Vectors using the same steps • Be sure to filter out words not in the vocabulary • Call topicDistributions to see topic scores (the LDA model is used to predict the topics of new documents; see the sketch below)
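A minimal sketch of those bullets against the RDD-based MLlib API; `newCorpus` is a hypothetical RDD[(Long, Vector)] built with the same vectorization steps and vocabulary as the training corpus:

     import org.apache.spark.mllib.clustering.{DistributedLDAModel, LocalLDAModel}

     // Prediction needs a LocalLDAModel; the default EM optimizer returns a DistributedLDAModel.
     val localModel: LocalLDAModel = ldaModel match {
       case m: DistributedLDAModel => m.toLocal
       case m: LocalLDAModel       => m
     }

     // Score each new document against the learned topics.
     localModel.topicDistributions(newCorpus).collect().foreach(println)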
  30. Topic Prediction (each row shows a new document's scores for Topic 0, Topic 1, Topic 2, …)
     New document topics: 0.091084004103132, 0.1044111561202625, 0.09090943947509807, 0.11607354553753861, 0.10404284803971378, 0.09697071269561051, 0.09571658794577831, 0.0919546186785918, 0.09176248930132802, 0.11707459810294643
     New document topics: 0.09424474530277152, 0.1183270779577911, 0.09230776874419214, 0.09835759337114718, 0.13159581881630272, 0.09279638945611612, 0.094124104743527, 0.09295449996673977, 0.09291472297512052, 0.09237727866629193
  31. Now what? • Find the minimum logLikelihood in a set of documents you know are OK • Report an anomaly whenever a new document has a lower logLikelihood (compute the minimum log-likelihood over documents known to be normal; classify a new document as an anomaly when it scores below that value; a sketch follows)
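A minimal sketch of that thresholding idea; `knownGoodDocs` is a hypothetical Seq[String], and `stringToCountVector` is the helper whose body is elided on the next slide:

     // Lowest log-likelihood seen among documents known to be OK.
     val threshold = knownGoodDocs
       .map(doc => lda.logLikelihood(stringToCountVector(sc.parallelize(Seq(doc)))))
       .min

     // Flag new documents that score below the threshold.
     def isAnomaly(doc: String): Boolean =
       lda.logLikelihood(stringToCountVector(sc.parallelize(Seq(doc)))) < threshold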
  32. Anomaly Detection
     val newDoc = sc.parallelize(Seq(
       "平素は当社サービスをご利用いただき、誠にありがとうございます。"))

     def stringToCountVector(strings: RDD[String]) = {
       . . .
     }

     val score = lda.logLikelihood(stringToCountVector(newDoc))
     /* -2153492.694125671 */
  33. Word2Vec • Creates vectors that represent points in meaning space • Unsupervised, but requires a lot of data to generate good vectors • Google's sample vectors were trained on 100 billion words (~X00GB?) • Vectors trained on less data can show interesting similarities, but not consistently (Word2Vec turns words into vectors so they can be represented quantitatively and word-to-word similarity can be computed.)
  34. Word2Vec Intuition (an example of actual word vectorization; a sketch of the analogy arithmetic follows) • Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.
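The regularities in the cited paper are usually demonstrated with vector arithmetic. A sketch against a trained MLlib Word2VecModel (like the `model` built a few slides later); whether the analogy actually holds depends on the training data, and the three words are assumed to be in the vocabulary:

     import breeze.linalg.{DenseVector => BDV}
     import org.apache.spark.mllib.linalg.Vectors

     // vec("king") - vec("man") + vec("woman") should land near vec("queen").
     val analogy = BDV(model.transform("king").toArray) -
       BDV(model.transform("man").toArray) +
       BDV(model.transform("woman").toArray)
     model.findSynonyms(Vectors.dense(analogy.toArray), 5).foreach(println)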
  35. Vector Concatenation (diagram: the vectors of the words 営業, 活用, 営業, の, 情報, 共有, と, サポート … are combined into a single ITEM_01 vector)
  36. Step 1: Make vectors
  37. Making Word2VecModel
     val documentWords: RDD[Seq[String]] =
       text.map(line => tokenizer.tokenize(line).asScala.map(_.getSurfaceForm).toSeq)
     documentWords.cache()

     val model = new Word2Vec().setVectorSize(300).fit(documentWords)
  38. Step 2: Use vectors
  39. Using Word2VecModel (an example of word-similarity lookup; the results vary greatly with the source data, so a big dataset is very important)
     model.findSynonyms("日本", 5).foreach(println)
     /*
     (マイクロソフト,3.750299190465294)
     (ビジネス,3.7329870992662104)
     (株式会社,3.323983664186244)
     (システムズ,3.1331352923187987)
     (ビジネスプロダクティビティ,2.595931613590554)
     */
  40. Recommendation • Paragraph Vectors • Not available in Spark T_T (recommendation via document vectors is not available in Spark)
  41. Embedding with Vector Concatenation • Calculate the sum of the word vectors in an item's description • Add it to the vectors from Word2VecModel.getVectors under a special keyword (e.g. ITEM_1234) • Create a new Word2VecModel using the constructor • ※Not state of the art, but can produce reasonable recommendations without user rating data (embedding by vector concatenation: sum the vectors of the words contained in each item)
  42. Item Embedding (1/2)
     val embeds = Map(
       "ITEM_001_01" -> "営業部門の情報共有と活用をサポートし",
       "ITEM_001_02" -> "組織的な営業力・売れる仕組みを構築します",
       "ITEM_001_03" -> "営業情報のコミュニケーション基盤を構築する",
       "ITEM_002_01" -> "一般的なサーバ、ネットワーク機器やOSレベルの監視に加え",
       "ITEM_002_02" -> "またモニタリングポータルでは、アラームの発生状況",
       "ITEM_002_03" -> "監視システムにより取得されたパフォーマンス情報が逐次ダッシュボード形式",
       "ITEM_003_01" -> "IPネットワークインフラストラクチャを構築します",
       "ITEM_003_02" -> "導入にとどまらず、アプリケーションやOAシステムとの融合を図ったユニファイドコミュニケーション環境を構築",
       "ITEM_003_03" -> "企業内および企業外へのコンテンツの効果的な配信環境、閲覧環境をご提供します"
     )
  43. Item Embedding (2/2)
     def stringToVector(s: String): Array[Double] = {
       val words = tokenizer.tokenize(s).asScala.map(_.getSurfaceForm).toSeq
       val vectors = words.map(word =>
         Try(model.transform(word)).getOrElse(model.transform("は")))
       val breezeVectors: Seq[DenseVector[Double]] =
         vectors.map(v => new DenseVector(v.toArray))
       val concat = breezeVectors.foldLeft(DenseVector.zeros[Double](vectorLength))((a, b) => a :+ b)
       concat.toArray
     }

     val embedVectors: Map[String, Array[Float]] = embeds.map {
       case (key, value) => (key, stringToVector(value).map(_.toFloat))
     }

     val embedModel = new Word2VecModel(embedVectors ++ model.getVectors)
  44. Recommending Similar (computing similarity)
     embedModel.findSynonyms("ITEM_001_01", 5).foreach(println)
     /*
     (ITEM_001_03,12.577457221575695)
     (ITEM_003_03,12.542920930725996)
     (ITEM_003_02,12.315240961298104)
     (ITEM_001_02,12.260734177166485)
     (ITEM_002_01,10.866897938028856)
     */
  45. Recommending New (recommendation from a new example)
     val newSentence = stringToVector("会計・受発注及び生産管理を中心としたシステム")
     embedModel.findSynonyms(Vectors.dense(newSentence), 5).foreach(println)
     /*
     (ITEM_001_02,14.372981084681571)
     (ITEM_003_03,14.343473534848325)
     (ITEM_001_01,13.83593570884867)
     (ITEM_002_01,13.61507040314043)
     (ITEM_002_03,13.462141195072414)
     */
  46. Thank you • Questions? • Example source code at: https://github.com/wmeddie/spark-text
