[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR


Azure's HDInsight provides an easy way to process big data using Spark, and learn from it using Machine Learning. See SparkML in action, and learn how to use R and Python at scale, within Jupyter.

Products/Technologies: AI (Artificial Intelligence) / Deep Learning / Machine Learning / Microsoft Azure

Michael Lanzetta
Microsoft Corporation
Developer Experience and Evangelism
Principal Software Development Engineer


Slide transcript:

  1. Spark: a unified, open source, parallel data processing framework for big data analytics. On top of the Spark core engine sit:
     - Spark SQL: interactive queries
     - Spark Streaming: stream processing
     - Spark ML: machine learning
     - GraphX: graph computation
     The core engine runs on YARN, Mesos, or the standalone scheduler.
  2. Why Spark: unified engine, ecosystem, developer productivity, performance.
  3. Primary resource managers for Spark on Hadoop: Hadoop 1.0+ or Hadoop YARN. Alternative resource managers: Mesos or the Spark standalone resource manager.
  4. Spark is the 2014 Sort Benchmark winner, 3x faster than the 2013 winner (Hadoop):

                              Data size (TB)   Time (min)   Nodes   Cores
     2013 record (Hadoop)     102.5            72           2100    50400
     Spark                    100              23           206     6592

     (tinyurl.com/spark-sort)
     [Chart] Logistic regression on a 100-node cluster with 100 GB of data: Spark 0.9 vs. Hadoop, with Spark far faster.
  5. Hadoop MapReduce reads from HDFS and writes back to HDFS between every step (Step 1, Step 2, ...); a Spark job reads from and writes to HDFS once, keeping intermediate data in memory.
  6. Spark architecture: a driver program holding the SparkContext talks to a cluster manager, which allocates worker nodes; each worker node reads its data from HDFS.
  7. Spark workloads: machine learning, real-time stream processing, interactive analytics, high-performance batch computation, developer productivity.
  8. Core RDD operations:
     - .map()
     - .groupByKey()
     - .join()
     - .reduce()
     - .collect()
  9. Transformations take an RDD to another RDD (lazily); actions take an RDD to a value.
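The transformation/action split above can be sketched in plain Python (a local analogy, not the Spark API): transformations build a lazy generator chain, and only an action forces evaluation.

```python
from functools import reduce

# Local analogy for RDD transformations and actions.
# Generator expressions (like .map/.filter) do no work yet;
# actions (like .collect/.reduce) force the whole chain to run.
data = range(1, 6)                       # source "RDD": 1..5
mapped = (x * x for x in data)           # .map(lambda x: x * x) -- lazy
filtered = (x for x in mapped if x > 5)  # .filter(...) -- still lazy

collected = list(filtered)                     # .collect() -- action
total = reduce(lambda a, b: a + b, collected)  # .reduce(...) -- action

print(collected)  # [9, 16, 25]
print(total)      # 50
```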
  10. tweetDF = spark.read.json('wasb:///libya-sentences/*/*.json')
      tokenizer = Tokenizer(inputCol="Sentence", outputCol="Words")
      counter = CountVectorizer(inputCol="Words", outputCol="features",
                                vocabSize=10000, minDF=2.)
      tokenized = tokenizer.transform(tweetDF)
      countModel = counter.fit(tokenized)
      counted = countModel.transform(tokenized)
      lda = LDA(k=10, maxIter=10)
      model = lda.fit(counted)
      topics = model.describeTopics(3)
      topics.show(truncate=False)
  11. tweetDF = spark.read.json('wasb:///libya-sentences/*/*.json')
      tokenizer = Tokenizer(inputCol="Sentence", outputCol="Words")
      counter = CountVectorizer(inputCol="Words", outputCol="features",
                                vocabSize=10000, minDF=2.)
      lda = LDA(k=10, maxIter=10)
      pipeline = Pipeline(stages=[tokenizer, counter, lda])
      model = pipeline.fit(tweetDF)
      topics = model.describeTopics(3)
      topics.show(truncate=False)
      ldaScored = model.transform(tweetDF)
  12. Microsoft R Server (bullet text not captured in extraction)
  13. Microsoft R Server (bullet text not captured in extraction)
  14. mySparkCluster <- RxSpark()
      rxSetComputeContext(mySparkCluster)
      myData <- read.json('wasb:///creditfraud/*.json')
      # Run a logistic regression using RevoScaleR
      model <- rxLogit(Class ~ Amount + V1 + V2 + V3 + V4, data = myData)
      # Now run the same using SparkR
      model2 <- spark.logit(myData, Class ~ Amount + V1 + V2 + V3 + V4,
                            regParam = 0.3, elasticNetParam = 0.8)
      summary(model)
      summary(model2)
  15. tweetDF = spark.read.json('wasb:///libya-sentences/*/*.json')
      tweetDF = tweetDF[tweetDF.Language == 'en']
      tokenizer = Tokenizer(inputCol="Sentence", outputCol="Words")
      enSW = StopWordsRemover.loadDefaultStopWords('english') + ['rt', '-', '&amp;', '']
      swr = StopWordsRemover(inputCol="Words", outputCol="Filtered", stopWords=enSW)
      tokenized = tokenizer.transform(tweetDF)
      filtered = swr.transform(tokenized)
      counter = CountVectorizer(inputCol="Filtered", outputCol="rawFeatures",
                                vocabSize=10000, minDF=2.)
      countModel = counter.fit(filtered)
      counted = countModel.transform(filtered)
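What that tokenize-then-filter step does can be shown locally with plain Python (not the SparkML transformers): lowercase-split each sentence, then drop anything in the stop-word set, including the Twitter-specific additions like 'rt'.

```python
# Minimal local sketch of tokenize -> stop-word removal.
# The stop-word set mixes a few standard English words with the
# deck's Twitter-specific additions ('rt', '-', '&amp;', '').
stop_words = {"rt", "-", "&amp;", "", "the", "is"}

def tokenize(sentence):
    return sentence.lower().split()

def remove_stop_words(words):
    return [w for w in words if w not in stop_words]

sentences = ["RT the protest is growing", "the protest - growing fast"]
filtered_words = [remove_stop_words(tokenize(s)) for s in sentences]
print(filtered_words)  # [['protest', 'growing'], ['protest', 'growing', 'fast']]
```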
  16. (image-only slide, no extractable text)
  17. idf = IDF(inputCol="rawFeatures", outputCol="features")
      idfModel = idf.fit(counted)
      idfScaled = idfModel.transform(counted)
      gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=10)
      # 80/20 train/test split ('labeled' is the labeled feature DataFrame)
      train, test = labeled.randomSplit([0.8, 0.2], seed=1337)
      model = gbt.fit(train)
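The IDF step rescales raw term counts by how rare each term is across documents. A minimal local sketch, assuming the smoothed formula idf = ln((N + 1) / (df + 1)) (N = number of documents, df = documents containing the term), which is the form Spark's IDF estimator documents:

```python
import math

# Tiny corpus: three already-tokenized "documents".
docs = [["spark", "ml"], ["spark", "sql"], ["spark", "ml", "ml"]]
N = len(docs)
vocab = sorted({w for d in docs for w in d})          # ['ml', 'spark', 'sql']

# Document frequency: in how many docs does each term appear?
df = {w: sum(w in d for d in docs) for w in vocab}

# Smoothed inverse document frequency.
idf = {w: math.log((N + 1) / (df[w] + 1)) for w in vocab}

# TF-IDF vector per document: raw count scaled by idf.
tfidf = [[d.count(w) * idf[w] for w in vocab] for d in docs]
```

Note that 'spark' appears in every document, so its idf is ln(4/4) = 0 and it contributes nothing to the feature vectors, which is exactly the behavior you want from very common terms.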
  18. predictions = model.transform(test).select('prediction', 'label')
      metrics = BinaryClassificationMetrics(predictions.rdd)
      print('Area under PR = %s' % metrics.areaUnderPR)
      print('Area under ROC = %s' % metrics.areaUnderROC)
      # Set us up for plotting ROC
      predictions.registerTempTable('pred_and_labels')
      # In a new cell, use %%sql magic to pull results down to the local context
      %%sql -q -o predResults
      select * from pred_and_labels
  19. %%local
      %matplotlib inline
      import matplotlib.pyplot as plt
      from sklearn.metrics import roc_curve, auc
      prob = predResults['prediction']
      fpr, tpr, thresholds = roc_curve(predResults['label'], prob, pos_label=1)
      roc_auc = auc(fpr, tpr)
      plt.figure(figsize=(5, 5))
      plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
      plt.plot([0, 1], [0, 1], 'k--')
      plt.xlim([0.0, 1.0]); plt.ylim([0.0, 1.05])
      plt.xlabel('False Positive Rate'); plt.ylabel('True Positive Rate')
      plt.title('ROC Curve'); plt.legend(loc='lower right')
      plt.show()
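The quantity being plotted there, area under the ROC curve, has a simple interpretation worth keeping in mind: the probability that a randomly chosen positive example outscores a randomly chosen negative one. A minimal dependency-free sketch (not the sklearn or Spark implementation) computes it directly from that rank definition, counting ties as half:

```python
def roc_auc(labels, scores):
    """ROC AUC via the rank formula: fraction of (positive, negative)
    pairs where the positive gets the higher score; ties count as 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

On the example above, three of the four positive/negative pairs are ranked correctly, giving 0.75, which matches what `sklearn.metrics.roc_curve` plus `auc` would report for the same inputs.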
  20. Resources: HDInsight Spark, SparkML, HDInsight R Server, RevoScaleR, SparkR, GitHub, Channel 9, Microsoft Virtual Academy