SlideShare ist ein Scribd-Unternehmen logo
1 von 29
Apache Spark
Arnon Rotem-Gal-Oz
Demo
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
case class Data(InvoiceNo: String, StockCode: String, Description: String, Quantity: Long, InvoiceDate:
String, UnitPrice: Double, CustomerID: String, Country: String)
val schema = Encoders.product[Data].schema
val df=spark.read.option("header",true).schema(schema).csv("./data.csv")
val clean=df.na.drop(Seq("CustomerID")).dropDuplicates()
val data = clean.withColumn("total",when($"StockCode"!=="D",$"UnitPrice"*$"Quantity").otherwise(0))
.withColumn("Discount",when($"StockCode"==="D",$"UnitPrice"*$"Quantity").otherwise(0))
.withColumn("Postage",when($"StockCode"==="P",1).otherwise(0))
.withColumn("Invoice",regexp_replace($"InvoiceNo","^C",""))
.withColumn("Cancelled",when(substring($"InvoiceNo",0,1)==="C",1).otherwise(0))
val aggregated=data.groupBy($"Invoice",$"Country",$"CustomerID")
.agg(sum($"Discount").as("Discount"),sum($"total").as("Total"),max($"Cancelled").as("Cancelled"))
val customers =aggregated.groupBy($"CustomerID")
.agg(sum($"Total").as("Total"),sum($"Discount").as("Discount"),sum($"Cancelled").as("Cancelled"),count($"
Invoice").as("Invoices"))
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new
VectorAssembler().setInputCols(Array("Total","Discount","Cancelled","Invoices")).setOutputCol("features")
val features=assembler.transform(customers)
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
val Array(test,train)= features.randomSplit(Array(0.3,0.7))
val kmeans=new KMeans().setK(12).setFeaturesCol("features").setPredictionCol("prediction")
val model = kmeans.fit(train)
model.clusterCenters.foreach(println)
val predictions=model.transform(test)
predictions.groupBy($"prediction").count().show()
(reduce + (map #(+ % 2) (range 0 10)))
(0 to 10).map(_+2).reduce(_+_)
Enumerable.Range(0, 10).Select(x => x + 2).Aggregate(0, (acc, x) => acc
+ x);
2004
MapReduce:
Simplified Data Processing on Large Clusters
Jeff Dean, Sanjay Ghemawat
Google, Inc.
https://research.google.com/archive/mapreduce-osdi04-slides/index.html
• Re-execute on fail
• Skip bad-records
• Redundent execution (copies of tasks)
• Data locality optimization
• Combiners (map-side reduce)
• Compression of data
Sort is shuffle
Microsoft
DryadLINQ /
LINQ to HPC
(2009-2011)
• DAG
• Compile not run
directly
https://www.microsoft.com/en-us/research/project/dryadlinq/
AMPLabs Spark
• Born as a way to test Mesos
• Open sourced 2010
Spark Component
Resilient Distributed Dataset
Dataframe and DataSet
• Higher abstraction
• More like a database table than an array
• Adds Optimizers
parsed plan
logical plan
Optimized
plan
Physical
plan
Spark UI
With Batch all data is there
https://www2.slideshare.net/VadimSolovey/dataflow-a-unified-model-for-batch-and-streaming-data-processing
Streaming – event by event
Streaming challenges
Watermarks – describe event time
progress
Events earlier than watermark are
ignored
(too slow – delay, too fast – more
late events)
• Spark Streaming
• Spark Structured Streaming (unified code for batch & Streaming)
• Demo - Dimensio
Caveat emptor
• Also bugs (spark 1.6)
Bugs..
• https://issues.apache.org/jira/browse/SPARK-8406
Debugging
Out Of Memory
problems
Long DAGs
Data Skew
Spark
• Lots of things out of the box
• Batch (RDD, DataFrames, DataSets)
• Streaming
• Structured Streaming (unify batch and streaming)
• Graph
• (“Classic”) ML
• Runs on Hadoop, Mesos, Kubernetes
Lots of extensions
• Spark NLP - John Snow Labs
• Spark Deep Learning - Databricks, Intel (BidDL), DeepLearing4j, H2O
• Connectors to any DB that respects itself
• (Hades is WIP  )
Multiple languages• Scala, Java, R, Python, .NET (just
released)
• Currently Scala is favorite
• Python taking center-stage

Weitere ähnliche Inhalte

Was ist angesagt?

Aggregation Framework in MongoDB Overview Part-1
Aggregation Framework in MongoDB Overview Part-1Aggregation Framework in MongoDB Overview Part-1
Aggregation Framework in MongoDB Overview Part-1Anuj Jain
 
FleetDB A Schema-Free Database in Clojure
FleetDB A Schema-Free Database in ClojureFleetDB A Schema-Free Database in Clojure
FleetDB A Schema-Free Database in Clojureelliando dias
 
Google apps script database abstraction exposed version
Google apps script database abstraction   exposed versionGoogle apps script database abstraction   exposed version
Google apps script database abstraction exposed versionBruce McPherson
 
MongoDB Aggregation Framework
MongoDB Aggregation FrameworkMongoDB Aggregation Framework
MongoDB Aggregation FrameworkCaserta
 
array of object pointer in c++
array of object pointer in c++array of object pointer in c++
array of object pointer in c++Arpita Patel
 
Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt...
Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt...Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt...
Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt...NoSQLmatters
 
DBIx-DataModel v2.0 in detail
DBIx-DataModel v2.0 in detail DBIx-DataModel v2.0 in detail
DBIx-DataModel v2.0 in detail Laurent Dami
 
Mongodb Aggregation Pipeline
Mongodb Aggregation PipelineMongodb Aggregation Pipeline
Mongodb Aggregation Pipelinezahid-mian
 
Aggregation Framework MongoDB Days Munich
Aggregation Framework MongoDB Days MunichAggregation Framework MongoDB Days Munich
Aggregation Framework MongoDB Days MunichNorberto Leite
 
Scalding: Reaching Efficient MapReduce
Scalding: Reaching Efficient MapReduceScalding: Reaching Efficient MapReduce
Scalding: Reaching Efficient MapReduceLivePerson
 
Developing A Real World Logistic Application With Oracle Application - UKOUG ...
Developing A Real World Logistic Application With Oracle Application - UKOUG ...Developing A Real World Logistic Application With Oracle Application - UKOUG ...
Developing A Real World Logistic Application With Oracle Application - UKOUG ...Roel Hartman
 
Avro, la puissance du binaire, la souplesse du JSON
Avro, la puissance du binaire, la souplesse du JSONAvro, la puissance du binaire, la souplesse du JSON
Avro, la puissance du binaire, la souplesse du JSONAlexandre Victoor
 
Node js mongodriver
Node js mongodriverNode js mongodriver
Node js mongodriverchristkv
 
Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / OptiqJulian Hyde
 
Grails: a quick tutorial (1)
Grails: a quick tutorial (1)Grails: a quick tutorial (1)
Grails: a quick tutorial (1)Davide Rossi
 
Enterprise workflow with Apps Script
Enterprise workflow with Apps ScriptEnterprise workflow with Apps Script
Enterprise workflow with Apps Scriptccherubino
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Stamatis Zampetakis
 

Was ist angesagt? (20)

Aggregation Framework in MongoDB Overview Part-1
Aggregation Framework in MongoDB Overview Part-1Aggregation Framework in MongoDB Overview Part-1
Aggregation Framework in MongoDB Overview Part-1
 
FleetDB A Schema-Free Database in Clojure
FleetDB A Schema-Free Database in ClojureFleetDB A Schema-Free Database in Clojure
FleetDB A Schema-Free Database in Clojure
 
Dbabstraction
DbabstractionDbabstraction
Dbabstraction
 
Google apps script database abstraction exposed version
Google apps script database abstraction   exposed versionGoogle apps script database abstraction   exposed version
Google apps script database abstraction exposed version
 
MongoDB Aggregation Framework
MongoDB Aggregation FrameworkMongoDB Aggregation Framework
MongoDB Aggregation Framework
 
array of object pointer in c++
array of object pointer in c++array of object pointer in c++
array of object pointer in c++
 
Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt...
Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt...Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt...
Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt...
 
DBIx-DataModel v2.0 in detail
DBIx-DataModel v2.0 in detail DBIx-DataModel v2.0 in detail
DBIx-DataModel v2.0 in detail
 
Mongodb Aggregation Pipeline
Mongodb Aggregation PipelineMongodb Aggregation Pipeline
Mongodb Aggregation Pipeline
 
Aggregation Framework MongoDB Days Munich
Aggregation Framework MongoDB Days MunichAggregation Framework MongoDB Days Munich
Aggregation Framework MongoDB Days Munich
 
Scalding: Reaching Efficient MapReduce
Scalding: Reaching Efficient MapReduceScalding: Reaching Efficient MapReduce
Scalding: Reaching Efficient MapReduce
 
Developing A Real World Logistic Application With Oracle Application - UKOUG ...
Developing A Real World Logistic Application With Oracle Application - UKOUG ...Developing A Real World Logistic Application With Oracle Application - UKOUG ...
Developing A Real World Logistic Application With Oracle Application - UKOUG ...
 
Avro, la puissance du binaire, la souplesse du JSON
Avro, la puissance du binaire, la souplesse du JSONAvro, la puissance du binaire, la souplesse du JSON
Avro, la puissance du binaire, la souplesse du JSON
 
Node js mongodriver
Node js mongodriverNode js mongodriver
Node js mongodriver
 
Dm adapter
Dm adapterDm adapter
Dm adapter
 
Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / Optiq
 
Zend framework service
Zend framework serviceZend framework service
Zend framework service
 
Grails: a quick tutorial (1)
Grails: a quick tutorial (1)Grails: a quick tutorial (1)
Grails: a quick tutorial (1)
 
Enterprise workflow with Apps Script
Enterprise workflow with Apps ScriptEnterprise workflow with Apps Script
Enterprise workflow with Apps Script
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21
 

Ähnlich wie Apache spark

R (Shiny Package) - Server Side Code for Decision Support System
R (Shiny Package) - Server Side Code for Decision Support SystemR (Shiny Package) - Server Side Code for Decision Support System
R (Shiny Package) - Server Side Code for Decision Support SystemMaithreya Chakravarthula
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit
 
MapReduce with Scalding @ 24th Hadoop London Meetup
MapReduce with Scalding @ 24th Hadoop London MeetupMapReduce with Scalding @ 24th Hadoop London Meetup
MapReduce with Scalding @ 24th Hadoop London MeetupLandoop Ltd
 
How to ship customer value faster with step functions
How to ship customer value faster with step functionsHow to ship customer value faster with step functions
How to ship customer value faster with step functionsYan Cui
 
Open Source Search: An Analysis
Open Source Search: An AnalysisOpen Source Search: An Analysis
Open Source Search: An AnalysisJustin Finkelstein
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraRustam Aliyev
 
MongoDB World 2018: Keynote
MongoDB World 2018: KeynoteMongoDB World 2018: Keynote
MongoDB World 2018: KeynoteMongoDB
 
Performante Java Enterprise Applikationen trotz O/R-Mapping
Performante Java Enterprise Applikationen trotz O/R-MappingPerformante Java Enterprise Applikationen trotz O/R-Mapping
Performante Java Enterprise Applikationen trotz O/R-MappingSimon Martinelli
 
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIsBig Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIsMatt Stubbs
 
Agile Testing Days 2018 - API Fundamentals - postman collection
Agile Testing Days 2018 - API Fundamentals - postman collectionAgile Testing Days 2018 - API Fundamentals - postman collection
Agile Testing Days 2018 - API Fundamentals - postman collectionJoEllen Carter
 
Cutting Edge Data Processing with PHP & XQuery
Cutting Edge Data Processing with PHP & XQueryCutting Edge Data Processing with PHP & XQuery
Cutting Edge Data Processing with PHP & XQueryWilliam Candillon
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and MonoidsHugo Gävert
 
Building Analytics Applications with Streaming Expressions in Apache Solr - A...
Building Analytics Applications with Streaming Expressions in Apache Solr - A...Building Analytics Applications with Streaming Expressions in Apache Solr - A...
Building Analytics Applications with Streaming Expressions in Apache Solr - A...Lucidworks
 
Streaming Solr - Activate 2018 talk
Streaming Solr - Activate 2018 talkStreaming Solr - Activate 2018 talk
Streaming Solr - Activate 2018 talkAmrit Sarkar
 
Do more with less code in serverless
Do more with less code in serverlessDo more with less code in serverless
Do more with less code in serverlessjeromevdl
 
MongoDB World 2019: Just-in-time Validation with JSON Schema
MongoDB World 2019: Just-in-time Validation with JSON SchemaMongoDB World 2019: Just-in-time Validation with JSON Schema
MongoDB World 2019: Just-in-time Validation with JSON SchemaMongoDB
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionLightbend
 

Ähnlich wie Apache spark (20)

R (Shiny Package) - Server Side Code for Decision Support System
R (Shiny Package) - Server Side Code for Decision Support SystemR (Shiny Package) - Server Side Code for Decision Support System
R (Shiny Package) - Server Side Code for Decision Support System
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
 
A Shiny Example-- R
A Shiny Example-- RA Shiny Example-- R
A Shiny Example-- R
 
MapReduce with Scalding @ 24th Hadoop London Meetup
MapReduce with Scalding @ 24th Hadoop London MeetupMapReduce with Scalding @ 24th Hadoop London Meetup
MapReduce with Scalding @ 24th Hadoop London Meetup
 
How to ship customer value faster with step functions
How to ship customer value faster with step functionsHow to ship customer value faster with step functions
How to ship customer value faster with step functions
 
Open Source Search: An Analysis
Open Source Search: An AnalysisOpen Source Search: An Analysis
Open Source Search: An Analysis
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandra
 
MongoDB World 2018: Keynote
MongoDB World 2018: KeynoteMongoDB World 2018: Keynote
MongoDB World 2018: Keynote
 
Performante Java Enterprise Applikationen trotz O/R-Mapping
Performante Java Enterprise Applikationen trotz O/R-MappingPerformante Java Enterprise Applikationen trotz O/R-Mapping
Performante Java Enterprise Applikationen trotz O/R-Mapping
 
XQuery Rocks
XQuery RocksXQuery Rocks
XQuery Rocks
 
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIsBig Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
 
Agile Testing Days 2018 - API Fundamentals - postman collection
Agile Testing Days 2018 - API Fundamentals - postman collectionAgile Testing Days 2018 - API Fundamentals - postman collection
Agile Testing Days 2018 - API Fundamentals - postman collection
 
Cutting Edge Data Processing with PHP & XQuery
Cutting Edge Data Processing with PHP & XQueryCutting Edge Data Processing with PHP & XQuery
Cutting Edge Data Processing with PHP & XQuery
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and Monoids
 
DataMapper
DataMapperDataMapper
DataMapper
 
Building Analytics Applications with Streaming Expressions in Apache Solr - A...
Building Analytics Applications with Streaming Expressions in Apache Solr - A...Building Analytics Applications with Streaming Expressions in Apache Solr - A...
Building Analytics Applications with Streaming Expressions in Apache Solr - A...
 
Streaming Solr - Activate 2018 talk
Streaming Solr - Activate 2018 talkStreaming Solr - Activate 2018 talk
Streaming Solr - Activate 2018 talk
 
Do more with less code in serverless
Do more with less code in serverlessDo more with less code in serverless
Do more with less code in serverless
 
MongoDB World 2019: Just-in-time Validation with JSON Schema
MongoDB World 2019: Just-in-time Validation with JSON SchemaMongoDB World 2019: Just-in-time Validation with JSON Schema
MongoDB World 2019: Just-in-time Validation with JSON Schema
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 

Mehr von Arnon Rotem-Gal-Oz

Mehr von Arnon Rotem-Gal-Oz (20)

Taking ML to production - a journey
Taking ML to production - a journeyTaking ML to production - a journey
Taking ML to production - a journey
 
Fallacies of Distributed Computing
Fallacies of Distributed Computing Fallacies of Distributed Computing
Fallacies of Distributed Computing
 
Docker & Kubernetes intro
Docker & Kubernetes introDocker & Kubernetes intro
Docker & Kubernetes intro
 
Docker Intro
Docker IntroDocker Intro
Docker Intro
 
Data security @ the personal level
Data security @ the personal levelData security @ the personal level
Data security @ the personal level
 
Microservices - it's déjà vu all over again
Microservices  - it's déjà vu all over againMicroservices  - it's déjà vu all over again
Microservices - it's déjà vu all over again
 
Big data in the cloud - welcome to cost oriented design
Big data in the cloud - welcome to cost oriented designBig data in the cloud - welcome to cost oriented design
Big data in the cloud - welcome to cost oriented design
 
Distilling insights @ AppsFlyer
Distilling insights @ AppsFlyerDistilling insights @ AppsFlyer
Distilling insights @ AppsFlyer
 
Distilling Insights @ Appsflyer (Data Architecture)
Distilling Insights @ Appsflyer (Data Architecture)Distilling Insights @ Appsflyer (Data Architecture)
Distilling Insights @ Appsflyer (Data Architecture)
 
Big data Overview
Big data OverviewBig data Overview
Big data Overview
 
Hadoop YARN overview
Hadoop YARN overviewHadoop YARN overview
Hadoop YARN overview
 
SAF
SAFSAF
SAF
 
REST presentation
REST presentationREST presentation
REST presentation
 
SOA & Big Data
SOA & Big DataSOA & Big Data
SOA & Big Data
 
Why the JVM?
Why the JVM?Why the JVM?
Why the JVM?
 
Building reliable systems from unreliable components
Building reliable systems from unreliable componentsBuilding reliable systems from unreliable components
Building reliable systems from unreliable components
 
Azure migration
Azure migrationAzure migration
Azure migration
 
Things to think about while architecting azure solutions
Things to think about while architecting azure solutionsThings to think about while architecting azure solutions
Things to think about while architecting azure solutions
 
Soa
Soa Soa
Soa
 
Rest
RestRest
Rest
 

Kürzlich hochgeladen

My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Kürzlich hochgeladen (20)

My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 

Apache spark

  • 2. Demo import spark.implicits._ import org.apache.spark.sql.functions._ import org.apache.spark.sql._ case class Data(InvoiceNo: String, StockCode: String, Description: String, Quantity: Long, InvoiceDate: String, UnitPrice: Double, CustomerID: String, Country: String) val schema = Encoders.product[Data].schema val df=spark.read.option("header",true).schema(schema).csv("./data.csv") val clean=df.na.drop(Seq("CustomerID")).dropDuplicates() val data = clean.withColumn("total",when($"StockCode"!=="D",$"UnitPrice"*$"Quantity").otherwise(0)) .withColumn("Discount",when($"StockCode"==="D",$"UnitPrice"*$"Quantity").otherwise(0)) .withColumn("Postage",when($"StockCode"==="P",1).otherwise(0)) .withColumn("Invoice",regexp_replace($"InvoiceNo","^C","")) .withColumn("Cancelled",when(substring($"InvoiceNo",0,1)==="C",1).otherwise(0)) val aggregated=data.groupBy($"Invoice",$"Country",$"CustomerID") .agg(sum($"Discount").as("Discount"),sum($"total").as("Total"),max($"Cancelled").as("Cancelled")) val customers =aggregated.groupBy($"CustomerID") .agg(sum($"Total").as("Total"),sum($"Discount").as("Discount"),sum($"Cancelled").as("Cancelled"),count($" Invoice").as("Invoices")) import org.apache.spark.ml.feature.VectorAssembler val assembler = new VectorAssembler().setInputCols(Array("Total","Discount","Cancelled","Invoices")).setOutputCol("features") val features=assembler.transform(customers) import org.apache.spark.ml.clustering.KMeans import org.apache.spark.ml.evaluation.ClusteringEvaluator val Array(test,train)= features.randomSplit(Array(0.3,0.7)) val kmeans=new KMeans().setK(12).setFeaturesCol("features").setPredictionCol("prediction") val model = kmeans.fit(train) model.clusterCenters.foreach(println) val predictions=model.transform(test) predictions.groupBy($"prediction").count().show()
  • 3. (reduce + (map #(+ % 2) (range 0 10))) (0 to 10).map(_+2).reduce(_+_) Enumerable.Range(0, 10).Select(x => x + 2).Aggregate(0, (acc, x) => acc + x);
  • 4. 2004 MapReduce: Simplified Data Processing on Large Clusters Jeff Dean, Sanjay Ghemawat Google, Inc. https://research.google.com/archive/mapreduce-osdi04-slides/index.html
  • 5.
  • 6. • Re-execute on fail • Skip bad-records • Redundent execution (copies of tasks) • Data locality optimization • Combiners (map-side reduce) • Compression of data
  • 7.
  • 9. Microsoft DryadLINQ / LINQ to HPC (2009-2011) • DAG • Compile not run directly https://www.microsoft.com/en-us/research/project/dryadlinq/
  • 10. AMPLabs Spark • Born as a way to test Mesos • Open sourced 2010
  • 13. Dataframe and DataSet • Higher abstraction • More like a database table than an array • Adds Optimizers
  • 16. With Batch all data is there https://www2.slideshare.net/VadimSolovey/dataflow-a-unified-model-for-batch-and-streaming-data-processing
  • 18.
  • 19. Streaming challenges Watermarks – describe event time progress Events earlier than watermark are ignored (too slow – delay, too fast – more late events)
  • 20. • Spark Streaming • Spark Structured Streaming (unified code for batch & Streaming) • Demo - Dimensio
  • 21. Caveat emptor • Also bugs (spark 1.6)
  • 26.
  • 27. Spark • Lots of things out of the box • Batch (RDD, DataFrames, DataSets) • Streaming • Structured Streaming (unify batch and streaming) • Graph • (“Classic”) ML • Runs on Hadoop, Mesos, Kubernetes
  • 28. Lots of extensions • Spark NLP - John Snow Labs • Spark Deep Learning - Databricks, Intel (BidDL), DeepLearing4j, H2O • Connectors to any DB that respects itself • (Hades is WIP  )
  • 29. Multiple languages• Scala, Java, R, Python, .NET (just released) • Currently Scala is favorite • Python taking center-stage

Hinweis der Redaktion

  1. Higher abstraction More like a database table than an array Adds Optimizers