A Brief Look at Spark (Spark浅谈)

Jiahua Zhu

Spark is a distributed, resilient data computing framework.

# Word count in Python; `spark` here is a SparkContext
text_file = spark.textFile("hdfs://...")
counts = (text_file.flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
Spark offers more than 80 high-level operators, and it can be used interactively from the Scala, Python and R shells.
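
To make the three operators in the word count concrete, here is a rough plain-Python model of what flatMap, map and reduceByKey compute. It does not use Spark at all: flat_map and reduce_by_key are hypothetical helpers written for this illustration, and the two input lines stand in for the HDFS file.

```python
from collections import defaultdict
from functools import reduce


def flat_map(func, items):
    """Apply func to each item and flatten the resulting lists into one list."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Group (key, value) pairs by key and fold each group's values with func."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}


lines = ["to be or not", "to be"]  # hypothetical stand-in for the input file
words = flat_map(lambda line: line.split(), lines)   # flatMap step
pairs = [(word, 1) for word in words]                # map step
counts = reduce_by_key(lambda a, b: a + b, pairs)    # reduceByKey step
print(counts)
```

In Spark the same steps run per partition across the cluster, with a shuffle before the per-key reduction; the sketch only shows the single-machine result.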

Combine SQL, streaming, and complex analytics.
Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.

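Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
Spark can be run in its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. It can access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source.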

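Resilient Distributed Datasets (RDDs)
The core concept in Spark is the Resilient Distributed Dataset (RDD), a fault-tolerant collection of elements that can be operated on in parallel.
There are 2 ways to create RDDs: parallelize an existing collection in the driver program, or reference a dataset in an external storage system such as HDFS, HBase, or any other Hadoop data source.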
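
As a toy illustration of the first way (parallelizing a collection), the plain-Python sketch below splits a local list into contiguous partitions, processes each partition independently, and combines the partial results. The parallelize function here is a hypothetical stand-in written for this example, not Spark's implementation, and the slicing rule is an assumption chosen for simplicity.

```python
def parallelize(collection, num_slices):
    """Split a local collection into num_slices contiguous partitions."""
    items = list(collection)
    n = len(items)
    return [
        items[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]


partitions = parallelize([1, 2, 3, 4, 5], 2)

# Each partition can be reduced on its own (on a cluster: on its own worker),
# and the partial results are then combined into the final answer.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)
print(partitions, partial_sums, total)
```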

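Text-file RDDs can be created with the textFile method of SparkContext. The method takes a URI for the file (a local path, or a URI such as hdfs:// or s3n://).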
val distFile = sc.textFile("/usr/local/Cellar/apache-spark/1.5.2/README.md")
// Number of lines in the file
distFile.count()
// Total number of characters across all lines
distFile.map(s => s.length).reduce((a, b) => a + b)
// (word, count) pairs for every word in the file
val wordCounts = distFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)

/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}

"""SimpleApp.py"""
from pyspark import SparkContext

logFile = "YOUR_SPARK_HOME/README.md"  # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()

numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

/* SimpleApp.java */
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "YOUR_SPARK_HOME/README.md"; // Should be some file on your system
    SparkConf conf = new SparkConf().setAppName("Simple Application");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> logData = sc.textFile(logFile).cache();

    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    long numBs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
  }
}

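Spark Streaming
Spark Streaming is an extension of the core Spark API for processing live data streams.
Data can be ingested from sources such as Kafka, Flume, Twitter, ZeroMQ, and Kinesis,
and processed with high-level functions such as map, reduce, join, and window.
The processed data can also be fed into other Spark libraries.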
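
The window operation named above is the least obvious of the four, so here is a minimal plain-Python model of it. It assumes the stream has already been cut into micro-batches of words; windowed_counts, window_length and slide_interval are names invented for this sketch and are measured in batches rather than seconds. It is not the Spark Streaming API.

```python
from collections import Counter


def windowed_counts(batches, window_length, slide_interval):
    """Count words over a sliding window of micro-batches.

    batches: list of micro-batches, each a list of words
    window_length: number of most recent batches covered by each window
    slide_interval: number of batches to advance between windows
    """
    results = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        window = batches[start:end]
        counts = Counter(word for batch in window for word in batch)
        results.append(dict(counts))
    return results


batches = [["a", "b"], ["a"], ["b", "b"], ["c"]]  # hypothetical input stream
results = windowed_counts(batches, window_length=3, slide_interval=2)
print(results)
```

With a window of 3 batches sliding every 2 batches, consecutive windows overlap by one batch, which is why the word from the second batch is counted in both results.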

Source     Artifact
Kafka      spark-streaming-kafka_2.10
Flume      spark-streaming-flume_2.10
Kinesis    spark-streaming-kinesis-asl_2.10
Twitter    spark-streaming-twitter_2.10
ZeroMQ     spark-streaming-zeromq_2.10
MQTT       spark-streaming-mqtt_2.10

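Spark SQL
Spark SQL lets Spark run relational queries expressed in SQL, HiveQL, or Scala.
At the core of this module is a new type of RDD, the SchemaRDD.
SchemaRDDs are composed of Row objects together with a schema that describes the data type of each column in the row.
A SchemaRDD is similar to a table in a relational database.
It can be created from an existing RDD, a Parquet file, a JSON dataset,
or by running HiveQL against data stored in Apache Hive.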

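DataFrame
In Spark, a DataFrame is a distributed dataset built on top of RDDs.
The main difference between a DataFrame and an RDD is that a DataFrame carries schema information.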

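https://endymecy.gitbooks.io/spark-programming-guide-zh-cn/content/spark-sql/data-sources/rdds.html
Spark supports two ways of converting existing RDDs into SchemaRDDs:
1. Use reflection to infer the schema of an RDD that contains objects of a specific type.
2. Use a programmatic interface to construct a schema and then apply it to existing RDDs.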
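
The difference between the two approaches can be sketched in plain Python without Spark. In this illustration Person, infer_schema and apply_schema are hypothetical names: the first function reads column names and types off a record class (the reflection route), the second applies a hand-built schema to untyped rows (the programmatic route).

```python
from dataclasses import dataclass, fields


@dataclass
class Person:  # hypothetical record type used only for this example
    name: str
    age: int


def infer_schema(record_type):
    """Way 1: derive (column, type) pairs by inspecting the record class."""
    return [(f.name, f.type) for f in fields(record_type)]


def apply_schema(raw_rows, schema):
    """Way 2: convert untyped rows using an explicitly constructed schema."""
    return [
        {name: cast(value) for (name, cast), value in zip(schema, row)}
        for row in raw_rows
    ]


inferred = infer_schema(Person)

explicit_schema = [("name", str), ("age", int)]
rows = apply_schema([["Alice", "30"], ["Bob", "25"]], explicit_schema)
print(inferred)
print(rows)
```

Reflection is shorter when a record class already exists; the explicit schema is the fallback when column names and types are only known at run time.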

1. Spark (2015.12.25)

2. Spark: Apache Spark in Hadoop and Spark

15. Operations on an RDD are recorded as a DAG. DataFrame operations are likewise lazy and are executed as a DAG of RDD operations.
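
Lazy evaluation can be modeled in a few lines of plain Python. The LazyDataset class below is a hypothetical toy, not Spark code: map and filter only record a step, and nothing runs until the collect action replays the recorded steps. It keeps a linear chain of steps, which is the simplest special case of a DAG.

```python
class LazyDataset:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # recorded lineage, oldest step first

    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        """Action: replay the recorded steps over the source data."""
        data = list(self._source)
        for kind, func in self._steps:
            if kind == "map":
                data = [func(x) for x in data]
            else:
                data = [x for x in data if func(x)]
        return data


calls = []  # records every element the mapped function actually sees


def traced_double(x):
    calls.append(x)
    return x * 2


ds = LazyDataset([1, 2, 3]).map(traced_double).filter(lambda x: x > 2)
ran_before_action = len(calls)  # nothing has executed at this point
result = ds.collect()           # the action triggers the whole chain
print(ran_before_action, result)
```

Because the whole chain is known before anything runs, an engine is free to reorder or fuse steps, and a lost result can be rebuilt by replaying the recorded steps.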
