Large-Scale Data Science in Apache Spark 2.0

Data science is one of the only fields where scalability can lead to fundamentally better results. Scalability allows users to train models on more data or to experiment with more types of models, both of which result in better models. It is no accident that the organizations most successful with AI have been those with huge distributed computing resources. In this talk, Matei Zaharia will describe how Apache Spark is democratizing large-scale data science to make it easier for more organizations to build high-quality data and AI products. He will also talk about the new structured APIs in Spark 2.0 that enable more optimization underneath familiar programming interfaces, as well as libraries that scale up deep learning and traditional machine learning on Apache Spark.

Speaker: Matei Zaharia

Large-Scale Data Science in Apache Spark 2.0

  1. Large-Scale Data Science in Apache Spark 2.0. Matei Zaharia (@matei_zaharia)
  2. Why Large-Scale? More data = better models. Faster iteration = better models. Scale is the key tool of effective data science and AI.
  3. Two Forms of Scale. Hardware scalability: distribute work onto parallel hardware and utilize it efficiently (e.g. fast, low-level code). User scalability: write applications quickly. The two are often at odds.
  4. What is Apache Spark? Designed to tackle both challenges: high-level APIs and libraries (SQL, Streaming, ML, Graph, ...) with efficient execution via parallelism and compilation. Largest open source project in big data: 1000+ contributors, 300+ packages, 3x user growth per year.
  5. Spark for Data Science. Spark-specific libraries (DataFrames, ML Pipelines, SQL, GraphFrames) parallelize common computations. Integration with existing libraries lets you call arbitrary Python, R, etc. libraries at scale. Both are expanding in Apache Spark 2.x.
  6. This Talk. Structured APIs in Spark 2.0; scaling deep learning; parallelizing traditional data science libraries.
  7. Original Spark API: functional operations on collections of Java/Python objects (RDDs). Expressive and concise, but hard to optimize automatically (a runnable sketch follows the transcript):
       lines = sc.textFile("hdfs://...")                   # RDD[String]
       points = lines.map(lambda line: parsePoint(line))   # RDD[Point]
       points.filter(lambda p: p.x > 100).count()
  8. Structured APIs: new APIs for data with a table-like schema, namely DataFrames (untyped), Datasets (typed), and SQL. They enable optimized storage and computation: binary storage based on the schema (e.g. columnar) and computation via SQL-like expressions that Spark can compile (a DataFrame-vs-SQL sketch follows the transcript).
  9. Structured API Example. DataFrame API:
       events = spark.read.json("/logs")
       stats = events.join(users).groupBy("loc", "status").avg("duration")
       errors = stats.where(stats.status == "ERR")
     Spark turns this into an optimized plan (SCAN logs, SCAN users, JOIN, AGG, FILTER) and then generates specialized code along the lines of:
       while (logs.hasNext) {
         e = logs.next
         if (e.status == "ERR") {
           u = users.get(e.uid)
           key = (u.loc, e.status)
           sum(key) += e.duration
           count(key) += 1
         }
       }
     (See the explain() sketch after the transcript for how to inspect these plans.)
  10. Structured API Performance. [Chart: time for an aggregation benchmark in seconds, comparing RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, and DataFrame SQL.] Higher-level and easier to optimize.
  11. Structured Streaming: incrementalize an existing DataFrame/Dataset query. Example batch job:
       logs = ctx.read.format("json").load("hdfs://logs")
       logs.groupBy("userid", "hour").avg("latency")
           .write.format("parquet")
           .save("s3://...")
  12. Structured Streaming: the same query expressed as a streaming job (a more complete sketch follows the transcript):
       logs = ctx.readStream.format("json").load("hdfs://logs")
       logs.groupBy("userid", "hour").avg("latency")
           .writeStream.format("parquet")
           .start("s3://...")
  13. Structured APIs Elsewhere. ML Pipelines on DataFrames: a pipeline API modeled on scikit-learn, with grid search, cross-validation, etc. GraphFrames for graph analytics: pattern matching à la Neo4j. Example pipeline (expanded into a runnable sketch after the transcript):
       tokenizer = Tokenizer()
       tf = HashingTF(numFeatures=1000)
       lr = LogisticRegression()
       pipe = Pipeline(stages=[tokenizer, tf, lr])
       model = pipe.fit(df)
     [Diagram: a DataFrame flows through the tokenizer, TF, and LR stages to produce a model.]
  14. This Talk. Structured APIs in Spark 2.0; scaling deep learning; parallelizing traditional data science libraries.
  15. Why Deep Learning on Spark? Scale out model application to large data, for transfer learning or model evaluation. Scale out parameter search: one model per machine (a sketch of this pattern follows the transcript). Distributed training: one model on multiple machines.
  16. Deep Learning Libraries. TensorFrames: TensorFlow model evaluation on DataFrames, for serving or transfer learning. TensorFlowOnSpark: run Caffe & TensorFlow on Spark data. BigDL: distributed model training on CPUs.
  17. This Talk. Structured APIs in Spark 2.0; scaling deep learning; parallelizing traditional data science libraries.
  18. Parallelizing Existing Libraries. Spark's Python/R APIs let you scale out existing libraries: spark-sklearn for arbitrary scikit-learn models, and SparkR dapply. Example with spark-sklearn (github.com/databricks/spark-sklearn); the single-machine equivalent is sketched after the transcript:
       from sklearn import svm, datasets
       from spark_sklearn import GridSearchCV

       iris = datasets.load_iris()
       model = svm.SVC()
       params = {'kernel': ['linear', 'rbf'], 'C': [1, 10]}
       gs = GridSearchCV(sc, model, params)
       gs.fit(iris.data, iris.target)
  19. spark-sklearn Execution. [Diagram: execution over the input data.]
  20. Coming Soon. Native APIs for zero-copy data transfer into C libraries. Streamlined installation in Python: pip install pyspark (a quick-start sketch follows the transcript).
  21. To Learn More. See hundreds of talks on use cases at Spark Summit (spark-summit.org), June 5-8 in San Francisco. Try interactive Spark tutorials in Databricks Community Edition (databricks.com/ce).
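
The RDD snippet on slide 7 leaves parsePoint undefined and mixes comment styles. A minimal runnable sketch of the same idea, assuming each input line holds two whitespace-separated coordinates; the Point record and parse_point helper are illustrative assumptions, and the HDFS path is left elided as on the slide:

    from collections import namedtuple

    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-example")
    Point = namedtuple("Point", ["x", "y"])

    def parse_point(line):
        # Assumes each input line holds two whitespace-separated numbers: "x y".
        x, y = line.split()
        return Point(float(x), float(y))

    lines = sc.textFile("hdfs://...")                   # RDD of raw text lines
    points = lines.map(parse_point)                     # RDD of Point records
    print(points.filter(lambda p: p.x > 100).count())   # count points with x > 100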
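
Slide 8's point that DataFrames and SQL share one engine can be seen directly: the same aggregation can be written either way and both go through the same optimizer. A small sketch, with made-up column names and inline sample data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("structured-apis").getOrCreate()

    # Hypothetical events data; in practice this would come from a file or table.
    events = spark.createDataFrame(
        [("us", "OK", 120), ("eu", "ERR", 340)],
        ["loc", "status", "duration"])

    # DataFrame API version of a simple aggregation.
    by_df = events.groupBy("loc", "status").avg("duration")

    # The identical query expressed in SQL runs through the same optimizer.
    events.createOrReplaceTempView("events")
    by_sql = spark.sql(
        "SELECT loc, status, AVG(duration) FROM events GROUP BY loc, status")

    by_df.show()
    by_sql.show()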
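
The optimized plan and generated code shown on slide 9 can be inspected from PySpark with DataFrame.explain(). A sketch using hypothetical events and users DataFrames that stand in for the slide's logs and users data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical inputs standing in for the "logs" and "users" data on the slide.
    events = spark.createDataFrame(
        [(1, "ERR", 340), (2, "OK", 120)], ["uid", "status", "duration"])
    users = spark.createDataFrame([(1, "us"), (2, "eu")], ["uid", "loc"])

    stats = (events.join(users, "uid")
                   .groupBy("loc", "status")
                   .avg("duration"))
    errors = stats.where(stats.status == "ERR")

    # Prints the parsed, analyzed, optimized, and physical plans that Spark
    # compiles this query into before generating specialized code.
    errors.explain(True)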
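
The streaming query on slide 12 is condensed; a running job also needs a schema for a file source and an output mode for the aggregation. A hedged sketch of a more complete version, using a guessed schema and a console sink so it runs without extra setup (the slide's parquet sink would additionally need a watermark to use append mode):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (DoubleType, IntegerType, StringType,
                                   StructField, StructType)

    spark = SparkSession.builder.getOrCreate()

    # File streams need an explicit schema; this one is a guess at the log layout.
    schema = StructType([
        StructField("userid", StringType()),
        StructField("hour", IntegerType()),
        StructField("latency", DoubleType()),
    ])

    logs = spark.readStream.format("json").schema(schema).load("hdfs://logs")

    # A streaming aggregation needs an output mode; "complete" re-emits the
    # full result table on every trigger.
    query = (logs.groupBy("userid", "hour").avg("latency")
                 .writeStream
                 .outputMode("complete")
                 .format("console")
                 .start())
    query.awaitTermination()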
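
The pipeline on slide 13 is abbreviated; in PySpark the feature stages need input/output column parameters. A runnable sketch on a tiny made-up dataset, including the grid search and cross-validation the slide mentions (the example texts, labels, and parameter grid are illustrative):

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import HashingTF, Tokenizer
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Tiny made-up training set: free text plus a binary label.
    df = spark.createDataFrame([
        ("spark is fast", 1.0), ("spark sql and dataframes", 1.0),
        ("structured streaming on spark", 1.0), ("spark ml pipelines", 1.0),
        ("legacy batch jobs", 0.0), ("hadoop map reduce", 0.0),
        ("nightly etl scripts", 0.0), ("mainframe reports", 0.0),
    ], ["text", "label"])

    # Same three stages as the slide, with explicit column wiring.
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    tf = HashingTF(inputCol="words", outputCol="features", numFeatures=1000)
    lr = LogisticRegression()
    pipe = Pipeline(stages=[tokenizer, tf, lr])

    # Grid search with cross-validation over the whole pipeline.
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
    cv = CrossValidator(estimator=pipe, estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(), numFolds=2)
    model = cv.fit(df)
    model.transform(df).select("text", "prediction").show()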
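
Slide 15's "one model per machine" parameter search can be sketched with plain Spark plus scikit-learn: broadcast a small training set and score one candidate configuration per task. This is an illustrative pattern, not a specific library API, and it assumes scikit-learn is installed on the workers:

    from itertools import product

    from pyspark import SparkContext
    from sklearn import datasets, svm
    from sklearn.model_selection import cross_val_score

    sc = SparkContext(appName="param-search")

    # Small dataset that fits on every worker; ship it once via a broadcast.
    iris = datasets.load_iris()
    data = sc.broadcast((iris.data, iris.target))

    # Candidate hyperparameter settings, one per task.
    params = [{"kernel": k, "C": c}
              for k, c in product(["linear", "rbf"], [1, 10])]

    def evaluate(p):
        # Train and score one candidate model inside a single task.
        X, y = data.value
        return p, cross_val_score(svm.SVC(**p), X, y, cv=3).mean()

    results = sc.parallelize(params, len(params)).map(evaluate).collect()
    best_params, best_score = max(results, key=lambda kv: kv[1])
    print(best_params, best_score)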
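
For comparison with the spark-sklearn code on slide 18, here is the single-machine scikit-learn equivalent; the only differences are the import and the extra SparkContext argument, which is what makes spark-sklearn a near drop-in replacement:

    from sklearn import datasets, svm
    from sklearn.model_selection import GridSearchCV

    iris = datasets.load_iris()
    params = {"kernel": ["linear", "rbf"], "C": [1, 10]}

    # Same grid search as the slide, but run on one machine.
    gs = GridSearchCV(svm.SVC(), params)
    gs.fit(iris.data, iris.target)
    print(gs.best_params_)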
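
Once slide 20's pip install pyspark is available, a local Spark session can be started straight from Python. A minimal sanity-check sketch (the app name is arbitrary):

    # After `pip install pyspark`:
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")      # run Spark inside this Python process
             .appName("local-test")
             .getOrCreate())
    spark.range(5).show()             # quick sanity check
    spark.stop()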
