Revolutionize Text Mining with Spark and Zeppelin

1,445 views

Revolutionize Text Mining with Spark and Zeppelin Slides

Published in: Technology

  1. Revolutionize Text Mining with Spark and Zeppelin
     April 2017
     Yanbo Liang, Apache Spark committer, Software engineer @ Hortonworks
     © Hortonworks Inc. 2011–2016. All Rights Reserved
  2. Agenda
     Text mining workflow on Big Data
     Text mining with Spark and MLlib
     Spark and Zeppelin as the platform
  3. Text Mining: Practical Applications
     • Text classification
       – Spam filtering
       – Fraud detection
     • Text clustering
     • Sentiment analysis
     • Entity extraction
     • Recommendations
     • Automatic labeling
     • Contextual advertising
  4. Traditional Text Mining
     • Commercial software
     • Open source software
       – Gensim, KNIME, NLTK, sklearn, R
  5. Traditional Text Mining
     • Commercial software
       – IBM SPSS, RapidMiner, SAS
     • Open source software
       – Gensim, KNIME, NLTK, sklearn, R
  6. Text Mining on Big Data
  7. Text Mining on Big Data
     Data Scientists | Software engineers
  8. Why Apache Spark MLlib
     • Scalable machine learning algorithms on top of Spark
       – Alternating Least Squares on Spotify data
         • 50+ million users x 30+ million songs, 50 billion ratings
         • For rank 10 with 10 iterations, ~1 hour running time
     • Workflow utilities
       – ML pipeline
       – Model import/export
       – Cross validation
  9. Text Mining workflow
     • Prototype (Python/R)
     • Create Pipeline
       – Load dataset
       – Extract raw features
       – Transform features
       – Select key features
       – Fit and choose best models
     • Re-implement Pipeline for production (Java/Scala)
     • Deploy Pipeline
     • Scoring
     Data Science | Software engineering
  10. Text Mining workflow
      • Prototype (Python/R)
      • Create Pipeline
        – Load dataset
        – Extract raw features
        – Transform features
        – Select key features
        – Fit and choose best models
      • Re-implement Pipeline for production (Java/Scala)
      • Deploy Pipeline
      • Scoring
      Data Science | Software engineering
  11. Load data
      Text                    Label
      I bought the game…      4
      Do NOT bother try…      1
      This shirt is awesome…  5
      never got it. Seller…   1
      I ordered this to…      3
      Dataset | Feature engineering | Model training | Model evaluation
  12. Extract features
      Text                    Label  Words                 Features
      I bought the game…      4      “i”, “bought”, …      [1, 0, 3, 9, …]
      Do NOT bother try…      1      “do”, “not”, …        [0, 0, 11, 0, …]
      This shirt is awesome…  5      “this”, “shirt”, …    [0, 2, 3, 1, …]
      never got it. Seller…   1      “never”, “got”, …     [1, 2, 0, 0, …]
      I ordered this to…      3      “i”, “ordered”, …     [1, 0, 0, 3, …]
      Dataset | Feature engineering | Model training | Model evaluation
  13. Fit a model
      Text                    Label  Words                 Features          Probability  Prediction
      I bought the game…      4      “i”, “bought”, …      [1, 0, 3, 9, …]   0.84
      Do NOT bother try…      1      “do”, “not”, …        [0, 0, 11, 0, …]  0.62
      This shirt is awesome…  5      “this”, “shirt”, …    [0, 2, 3, 1, …]   0.95
      never got it. Seller…   1      “never”, “got”, …     [1, 2, 0, 0, …]   0.71
      I ordered this to…      3      “i”, “ordered”, …     [1, 0, 0, 3, …]   0.74
      Dataset | Feature engineering | Model training | Model evaluation
  14. Evaluate
      Text                    Label  Words                 Features          Probability  Prediction
      I bought the game…      4      “i”, “bought”, …      [1, 0, 3, 9, …]   0.84
      Do NOT bother try…      1      “do”, “not”, …        [0, 0, 11, 0, …]  0.62
      This shirt is awesome…  5      “this”, “shirt”, …    [0, 2, 3, 1, …]   0.95
      never got it. Seller…   1      “never”, “got”, …     [1, 2, 0, 0, …]   0.71
      I ordered this to…      3      “i”, “ordered”, …     [1, 0, 0, 3, …]   0.74
      Dataset | Feature engineering | Model training | Model evaluation
  15. Key abstraction of Spark ML pipeline
      • Transformer
        – Feature transformers (e.g., HashingTF) and trained ML models (e.g., NaiveBayesModel).
      • Estimator
        – ML algorithms for training models (e.g., NaiveBayes).
      • Evaluator
        – These evaluate predictions and compute metrics, useful for tuning algorithm parameters (e.g., BinaryClassificationEvaluator).
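The three roles on this slide can be illustrated with a minimal plain-Python sketch. This is not Spark's API: the class names (`Tokenizer`, `MajorityClass`, `MajorityClassModel`, `AccuracyEvaluator`), the dictionary-based rows, and the toy data are all assumptions made for illustration. The point is only the shape of the contract: a Transformer maps data to data, an Estimator's `fit` returns a model that is itself a Transformer, and an Evaluator reduces predictions to one metric.

```python
from collections import Counter


class Tokenizer:
    """Transformer: adds a 'words' field to each row; nothing is learned."""

    def transform(self, rows):
        return [dict(r, words=r["text"].lower().split()) for r in rows]


class MajorityClassModel:
    """A trained model is also a Transformer: it adds a 'prediction' field."""

    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        return [dict(r, prediction=self.label) for r in rows]


class MajorityClass:
    """Estimator: fit() learns from the data and returns a model."""

    def fit(self, rows):
        label = Counter(r["label"] for r in rows).most_common(1)[0][0]
        return MajorityClassModel(label)


class AccuracyEvaluator:
    """Evaluator: reduces predictions to a single metric."""

    def evaluate(self, rows):
        return sum(r["prediction"] == r["label"] for r in rows) / len(rows)


# Toy rows invented for this sketch (1 = positive, 0 = negative).
train = [
    {"text": "This shirt is awesome", "label": 1},
    {"text": "Do NOT bother", "label": 0},
    {"text": "I love this game", "label": 1},
]

tokens = Tokenizer().transform(train)      # Transformer
model = MajorityClass().fit(tokens)        # Estimator -> model
scored = model.transform(tokens)           # model used as a Transformer
accuracy = AccuracyEvaluator().evaluate(scored)  # Evaluator
```

In Spark the same contract is what lets feature steps and learning algorithms be chained into one pipeline object; the majority-class "algorithm" here is only a placeholder for a real learner such as NaiveBayes.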
  16. Spark’s Text Mining algorithms
      • LDA for topic modeling
      • Word2Vec: an unsupervised way to turn words into features based on their meaning
      • CountVectorizer turns documents into vectors based on word count
      • HashingTF-IDF calculates important words of a document with respect to the corpus
      • And much more
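To make the difference between the count-based and hashing-based featurizers concrete, here is a small plain-Python sketch of the underlying ideas. It does not call Spark: the function names and the toy documents are assumptions, `zlib.crc32` stands in for the hash function (Spark's HashingTF uses MurmurHash3), and the IDF formula is the smoothed form log((N + 1) / (df + 1)) given in the Spark MLlib feature extraction guide.

```python
import math
import zlib

# Toy tokenized documents invented for this sketch.
docs = [
    ["spark", "makes", "text", "mining", "scale"],
    ["text", "mining", "with", "spark"],
    ["zeppelin", "notebooks", "for", "spark"],
]


def count_vectorize(docs):
    """Vocabulary-based term counts (the idea behind CountVectorizer)."""
    vocab = sorted({w for d in docs for w in d})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        v = [0] * len(vocab)
        for w in d:
            v[index[w]] += 1
        vectors.append(v)
    return vocab, vectors


def hashing_tf(doc, num_features=16):
    """Term counts via the hashing trick: fixed width, no vocabulary pass."""
    v = [0] * num_features
    for w in doc:
        v[zlib.crc32(w.encode("utf-8")) % num_features] += 1
    return v


def idf_weights(vectors):
    """Smoothed inverse document frequency per column."""
    n = len(vectors)
    width = len(vectors[0])
    df = [sum(1 for v in vectors if v[j] > 0) for j in range(width)]
    return [math.log((n + 1) / (d + 1)) for d in df]


vocab, counts = count_vectorize(docs)
idf = idf_weights(counts)
tfidf = [[tf * w for tf, w in zip(v, idf)] for v in counts]
hashed = [hashing_tf(d) for d in docs]
```

A word that occurs in every document ("spark" here) gets an IDF weight of zero, so it drops out of the TF-IDF vectors, while a word confined to one document ("zeppelin") is weighted up. The hashed vectors need no vocabulary pass over the corpus, at the cost of possible collisions between words.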
  17. MLlib Text Mining Pipeline - classification
      Diagram components: Dataset, RegexTokenizer, StopWordsRemover, CountVectorizer, HashingTF, IDF, StringIndexer, NaiveBayes, LogisticRegression, SVM, MLP, text classification
  18. MLlib Text Mining Pipeline - topic model
      Diagram components: Dataset, RegexTokenizer, StopWordsRemover, CountVectorizer, HashingTF, IDF, LDA, topic model
  19. MLlib Text Mining Pipeline - recommendation
      Diagram components: Dataset, RegexTokenizer, Word2Vec, recommendation
  20. MLlib Text Mining Pipeline
      Diagram components: Dataset, RegexTokenizer, StopWordsRemover, CountVectorizer, HashingTF, IDF, StringIndexer, NaiveBayes, LogisticRegression, SVM, MLP, LDA, Word2Vec, text classification, topic model, recommendation
  21. Demo
      • Load the file contents and the categories
      • Extract feature vectors suitable for machine learning
      • Train a linear model to perform categorization
      • Use a grid search strategy to find a good configuration of both the feature extraction components and the classifier
      https://github.com/yanboliang/dataworks-munich-2017
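The grid search step of the demo can be sketched without Spark as follows. This is an illustration under stated assumptions, not the code from the linked repository: a tiny multinomial Naive Bayes stands in for the classifier, a single train/validation split stands in for cross validation, and the function names, the two-parameter grid, and the toy reviews are all made up here. What it shows is the mechanism: one parameter belongs to feature extraction (`lowercase`), one to the classifier (`alpha`), and every combination is scored on held-out data.

```python
import itertools
import math
from collections import Counter, defaultdict

# Toy data invented for this sketch.
train = [
    ("Awesome shirt love it", "pos"),
    ("Great game", "pos"),
    ("Works fine", "pos"),
    ("Terrible seller never arrived", "neg"),
    ("Awful do not bother", "neg"),
]
valid = [
    ("awesome game", "pos"),
    ("terrible awful", "neg"),
    ("great shirt", "pos"),
    ("NEVER bother", "neg"),
]


def tokenize(text, lowercase):
    """Feature extraction step with one tunable parameter."""
    return (text.lower() if lowercase else text).split()


def fit(rows, lowercase):
    """Collect class and per-class word counts for multinomial Naive Bayes."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in rows:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text, lowercase))
    vocab = set().union(*word_counts.values())
    return class_counts, word_counts, vocab


def predict(model, text, lowercase, alpha):
    """Pick the class with the highest smoothed log-probability."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        score = math.log(count / total)
        for w in tokenize(text, lowercase):
            if w in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][w] + alpha) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


def accuracy(model, rows, lowercase, alpha):
    hits = sum(predict(model, t, lowercase, alpha) == y for t, y in rows)
    return hits / len(rows)


# Every combination of feature-extraction and classifier parameters.
grid = {"lowercase": [False, True], "alpha": [0.1, 1.0]}
results = {}
for lowercase, alpha in itertools.product(grid["lowercase"], grid["alpha"]):
    model = fit(train, lowercase)
    results[(lowercase, alpha)] = accuracy(model, valid, lowercase, alpha)

best_params = max(results, key=results.get)
best_score = results[best_params]
```

On this toy data the case-sensitive configurations misclassify "terrible awful", because neither token matches the capitalized training words and the prediction falls back to the class prior. In Spark ML the equivalent search is usually expressed with `ParamGridBuilder` and `CrossValidator` over a whole pipeline.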
  22. Customizing ML Pipelines
      • MLlib 2.1 includes:
        – 30+ feature transformers (Tokenizer, Word2Vec, …)
        – 25+ models (for classification, regression, clustering, …)
        – Model tuning & evaluation
      • But some applications require customized
        – Transformers & Models
  23. Options for customization
      • Existing use cases:
        – spark-corenlp
        – spark-vlbfgs
      • Extend abstractions
        – Transformer
        – Estimator & Model
        – Evaluator
  24. Spark virtual environment
      Diagram labels: Data Scientist A, Data Scientist B; Python 2.7 (shown five times)
  25. Spark virtual environment
      Diagram labels: Data Scientist A, Data Scientist B; Python 2.7 (shown five times), Python 3.5 (shown three times)
  26. Text Mining workflow
      • Prototype (Python/R)
      • Create Pipeline
        – Load dataset
        – Extract raw features
        – Transform features
        – Select key features
        – Fit and choose best models
      • Re-implement Pipeline for production (Java/Scala)
      • Deploy Pipeline
      • Scoring
      Data Science | Software engineering
      Annotation: Duplicated and error-prone
  27. ML persistence
      • Prototype (Python/R)
      • Create Pipeline
      • Load Pipeline (Java/Scala)
        – Model.load("s3n://…")
      • Deploy in production
      Data Science | Software engineering
      Persist model or Pipeline: model.save("s3n://…")
  28. Data scientists work with software engineers
      Data Scientists | Software engineers
      Explore data, Create pipeline, Find best params, Save model, Load model, Deploy in production, Scoring on batch/streaming data
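The save/load hand-off from the last two slides can be sketched in plain Python. This is a stand-in rather than Spark's persistence API: `KeywordModel`, its JSON file format, the word weights, and the local temporary path are all assumptions for illustration, whereas the slides call `model.save(…)` and `Model.load(…)` on Spark objects against a storage URL. The idea it demonstrates is that the fitted artifact is written once by the prototyping side and read back unchanged by the serving side, so the pipeline is not re-implemented.

```python
import json
import os
import tempfile


class KeywordModel:
    """Hypothetical toy model: sums per-word weights, thresholds at zero."""

    def __init__(self, weights):
        self.weights = dict(weights)

    def predict(self, text):
        score = sum(self.weights.get(w, 0.0) for w in text.lower().split())
        return 1 if score > 0 else 0

    def save(self, path):
        # Persist only the learned state, in a language-neutral format.
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"weights": self.weights}, f)

    @classmethod
    def load(cls, path):
        with open(path, encoding="utf-8") as f:
            return cls(json.load(f)["weights"])


# Data science side: fit (here, hand-set weights) and save the model.
model = KeywordModel({"awesome": 1.5, "love": 1.0, "never": -1.0, "bother": -2.0})
path = os.path.join(tempfile.mkdtemp(), "model.json")
model.save(path)

# Software engineering side: load the saved model and score new text.
served = KeywordModel.load(path)
texts = ["This shirt is awesome", "Do NOT bother"]
predictions = [served.predict(t) for t in texts]
```

Because the serving side loads exactly what was saved, its predictions match the prototype's by construction, which is the property the "Duplicated and error-prone" annotation on slide 26 says a manual re-implementation lacks.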
