The 10 Apache Spark Features 
You (Unlikely) Didn't Hear About 
Roger Brinkley 
Technical Evangelist
The 10 Apache Spark Features You 
(Unlikely) Didn't Hear About 
• 10 minutes – 10 slides 
• Ignite Format 
• No stopping! 
• No going back! 
• Questions? Sure, but only while time 
remains on the slide (otherwise, save them for later) 
• Hire me, I’ll find 45 more
It’s Fast, Really Fast 
• 10–100x faster than MapReduce 
• 10–100x faster than Hive 
• Historical perspective 
MapReduce is listed as the most recent 
"most important software innovation" 
– JRuby: 2-3x faster with the JVM's invokedynamic 
– Hardware rarely improves by more than 10x/year
It’s Pure Open Source 
• Commons-based Peer Production 
– Apache Software Foundation Top Level Project 
– 200 people from 50 organizations contributing 
– 12 Organizations Committing 
– Peer Governance 
– Participative Decision Making 
"The very essence of free software 
consists in considering contributing 
roles as public trusts, bestowed for the 
good of the community, and not for 
the benefit of an individual or a party." 
– Modern FOSS John C. Calhoun 
"The very essence of a free government 
consists in considering offices as public 
trusts, bestowed for the good of the 
country, and not for the benefit of an 
individual or a party." 
– John C. Calhoun, 2/13/1835
Strong Enterprise Relationships 
• Spark is in every major Hadoop distribution 
• Vertical enterprise use 
– Internet companies, government, financials 
– Churn analysis, fraud detection, risk analytics 
• Used in other data stores 
– Datastax (Cassandra) 
– MongoDB 
• Databricks has a cloud-based implementation
Enhances Other Big Data 
Implementations 
• Hadoop – Replacement for MapReduce 
• Cassandra – Analytics 
• Hive – Faster SQL processing 
• SAP HANA – Faster interactive analysis
API Stability 
• Guaranteed stability of its core API for 1.X 
• Spark has always been conservative with API 
changes 
• Clearly defined annotations for future APIs 
– Experimental 
– Alpha 
– Developer
Don’t Need to Learn a New Language 
• Scala 
• Java – 25% 
• Python – 30% 
• And soon R
Java 8 Lambda Support 
// Imports needed by both versions; sc is assumed to be an existing JavaSparkContext
import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.*;
import scala.Tuple2;

Without lambdas (anonymous inner classes):
JavaRDD<String> lines = sc.textFile("hdfs://log.txt"); 
// Map each line to multiple words 
JavaRDD<String> words = lines.flatMap( 
  new FlatMapFunction<String, String>() { 
    public Iterable<String> call(String line) { 
      return Arrays.asList(line.split(" ")); 
    } 
}); 
// Turn the words into (word, 1) pairs 
JavaPairRDD<String, Integer> ones = words.mapToPair( 
  new PairFunction<String, String, Integer>() { 
    public Tuple2<String, Integer> call(String w) { 
      return new Tuple2<String, Integer>(w, 1); 
    } 
}); 
// Group up and add the pairs by key to produce counts 
JavaPairRDD<String, Integer> counts = ones.reduceByKey( 
  new Function2<Integer, Integer, Integer>() { 
    public Integer call(Integer i1, Integer i2) { 
      return i1 + i2; 
    } 
}); 
counts.saveAsTextFile("hdfs://counts.txt");

With Java 8 lambdas:
JavaRDD<String> lines = sc.textFile("hdfs://log.txt"); 
JavaRDD<String> words = 
  lines.flatMap(line -> Arrays.asList(line.split(" "))); 
JavaPairRDD<String, Integer> counts = 
  words.mapToPair(w -> new Tuple2<String, Integer>(w, 1)) 
       .reduceByKey((x, y) -> x + y); 
counts.saveAsTextFile("hdfs://counts.txt");
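The same flatMap, (word, 1), reduce-by-key shape can be tried locally without a cluster. The following is only a sketch using java.util.stream as a stand-in for the RDD calls; the class name LocalWordCount and the method count are hypothetical and are not part of Spark.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical local analogue of the Spark word count; not Spark API.
final class LocalWordCount {
    // Splits each line into words, then sums a 1 for every occurrence of a word.
    static Map<String, Integer> count(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" "))) // plays the role of lines.flatMap(...)
                .filter(word -> !word.isEmpty())                 // skip empty tokens from repeated spaces
                .collect(Collectors.toMap(
                        word -> word,      // key of the (word, 1) pair
                        word -> 1,         // value of the (word, 1) pair
                        (x, y) -> x + y,   // same merge function passed to reduceByKey
                        TreeMap::new));    // sorted keys so the printed order is stable
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or", "not to be");
        System.out.println(count(lines)); // prints {be=2, not=1, or=1, to=2}
    }
}
```

The merge lambda (x, y) -> x + y is the part that carries over unchanged from the Spark version; everything else here runs in a single JVM and says nothing about distributed performance.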
Real-Time Stream Processing 
Batch:
val file = sc.textFile("hdfs://.../pagecounts-*.gz") 
val counts = file.flatMap(line => line.split(" ")) 
  .map(word => (word, 1)) 
  .reduceByKey(_ + _) 
counts.saveAsTextFile("hdfs://.../word-count")

Streaming:
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(args(0), "NetworkHashCount", Seconds(10), 
  System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR"))) 
val lines = ssc.socketTextStream("localhost", 9999) 
// Keep only the words that start with "#"
val words = lines.flatMap(_.split(" ")).filter(_.startsWith("#")) 
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) 
wordCounts.print() 
ssc.start() 
ssc.awaitTermination() // keep the driver alive while batches are processed
Caching Interactive Algorithms 
val points = sc.textFile("...").map(parsePoint).cache() 
var w = Vector.random(D) // current separating plane 
for (i <- 1 to ITERATIONS) { 
  val gradient = points.map(p => 
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x 
  ).reduce(_ + _) 
  w -= gradient 
} 
println("Final separating plane: " + w)
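To make the update rule in the loop above concrete, here is a minimal single-JVM sketch in plain Java. It computes the same per-point term, (1 / (1 + exp(-y * (w . x))) - 1) * y * x, sums it over all points, and subtracts the sum from w. The class LocalLogisticRegression, its Point type, and the sample data are assumptions made for illustration; the in-memory list only stands in for the cached RDD, which is what lets every iteration reuse the parsed points.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical local sketch of the iteration on the slide; not Spark API.
final class LocalLogisticRegression {
    // A labeled point: feature vector x and label y, assumed to be -1 or +1.
    static final class Point {
        final double[] x;
        final double y;
        Point(double[] x, double y) { this.x = x; this.y = y; }
    }

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    // One pass over the data: the map(...) term summed as reduce(_ + _) would.
    static double[] gradient(List<Point> points, double[] w) {
        double[] g = new double[w.length];
        for (Point p : points) {
            double scale = (1.0 / (1.0 + Math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y;
            for (int i = 0; i < g.length; i++) g[i] += scale * p.x[i];
        }
        return g;
    }

    // Repeats w -= gradient, reading the same in-memory points on every iteration.
    static double[] train(List<Point> points, double[] w0, int iterations) {
        double[] w = w0.clone();
        for (int it = 0; it < iterations; it++) {
            double[] g = gradient(points, w);
            for (int i = 0; i < w.length; i++) w[i] -= g[i];
        }
        return w;
    }

    public static void main(String[] args) {
        // Made-up sample data, one point per class.
        List<Point> points = Arrays.asList(
                new Point(new double[] {1.0, 2.0}, 1.0),
                new Point(new double[] {-1.5, -0.5}, -1.0));
        double[] w = train(points, new double[] {0.0, 0.0}, 10);
        System.out.println("Final separating plane: " + Arrays.toString(w));
    }
}
```

Unlike the slide, this sketch starts from w = 0 rather than a random vector so that runs are repeatable.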
New Security Integration 
• Complete Integration with Hadoop/YARN Security 
Model 
– Authenticate Job Submissions 
– Securely transfer HDFS credentials 
– Authenticate communication between components 
• Other deployments supported 
val conf = new SparkConf 
conf.set("spark.authenticate", "true") 
conf.set("spark.authenticate.secret", "good")
And Lots More 
• Apache Spark Website 
• Databricks – making big data easy 
– Introduction to Apache Spark 
• Jul 28 – Austin, TX - More Info & Registration 
• Aug 25 – Chicago, IL - More Info & Registration

Editor's Notes

  1. There are a lot of features that you probably don’t know about, but you can find them at the Apache Spark Website or at Databricks, the company where a number of the leading contributors to Apache Spark work. Also be aware that Databricks offers an Introduction to Apache Spark, with events coming up on July 28 in Austin and August 25 in Chicago.