Processing Large Data with Apache Spark -- HasGeek
10,781 views

Published in: Technology

Apache Spark presentation at HasGeek FifthElephant

https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark

Covering a Big Data overview, a Spark overview, Spark internals, and its supported libraries
  1. Agenda: Big Data Overview, Spark Overview, Spark Internals, Spark Libraries
  2. BIG DATA OVERVIEW
  3. Big Data -- Digital Data growth…
  4. V-V-V (Volume, Velocity, Variety)
  5. Legacy Architecture Pain Points • Report arrival latency is quite high: hours to perform joins and aggregate data • Existing frameworks cannot do both • Either stream processing of 100s of MB/s with low latency • Or batch processing of TBs of data with high latency • Expressing business logic in Hadoop MR is challenging
  6. SPARK OVERVIEW Why Spark?
  7. Why Spark: A separate, fast, MapReduce-like engine. In-memory data storage for very fast iterative queries. Better fault tolerance. Combines SQL, streaming, and complex analytics. Runs on Hadoop, Mesos, standalone, or in the cloud. Data sources: HDFS, Cassandra, HBase, and S3.
  8. In Memory - Spark vs Hadoop: Improves efficiency over MapReduce: 100x in memory, 2-10x on disk. Up to 40x faster than Hadoop.
  9. Spark In & Out: RDBMS, Streaming, SQL, GraphX, BlinkDB, Hadoop Input Format, Apps. Distributions: CDH, HDP, MapR, DSE. Tachyon, MLlib. Ref: http://training.databricks.com/intro.pdf
  10. Spark Streaming + SQL
  11. Benchmarking & Best Facts
  12. SPARK INSIDE – AROUND RDD
  13. Resilient Distributed Dataset (RDD): Immutable + Distributed + Cacheable + Lazily evaluated. Distributed collections of objects. Can be cached in memory across cluster nodes. Manipulated through various parallel operations.
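The "lazily evaluated" property above can be illustrated without Spark itself. Below is a minimal pure-Python sketch (the `LazyRDD` class and its methods are hypothetical, for illustration only, not Spark's API): transformations such as `map` and `filter` only record work, and nothing executes until an action like `collect` is called.

```python
# Minimal sketch of an RDD-like lazily evaluated collection (not Spark's API).
class LazyRDD:
    def __init__(self, data, ops=()):
        self.data = data   # source "partition" (a plain list here)
        self.ops = ops     # recorded transformations, not yet applied

    def map(self, f):
        # Transformations only record work; nothing is computed yet.
        return LazyRDD(self.data, self.ops + (("map", f),))

    def filter(self, p):
        return LazyRDD(self.data, self.ops + (("filter", p),))

    def collect(self):
        # Actions trigger evaluation of the whole recorded pipeline.
        out = self.data
        for kind, f in self.ops:
            if kind == "map":
                out = [f(x) for x in out]
            else:
                out = [x for x in out if f(x)]
        return out

rdd = LazyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(rdd.collect())  # [20, 30, 40]
```

Note how building `rdd` touches no data at all; only `collect()` runs the pipeline, which is what lets a real engine optimize and schedule the whole chain at once.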
  14. RDD Types
  15. RDD Operations
  16. Memory and Persistence
  17. Dependency Types
  18. Spark Cluster Overview: Application, Driver program, Cluster manager, Worker node, Job, Stage, Executor, Task
  19. Job Flow
  20. Task Scheduler, DAG • Pipelines functions within a stage • Cache-aware data reuse & locality • Partitioning-aware to avoid shuffles: rdd1.map(splitlines).filter("ERROR"); rdd2.map(splitlines).groupBy(key); rdd2.join(rdd1, key).take(10)
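"Pipelines functions within a stage" means narrow transformations are fused and run per element in one pass, instead of materializing an intermediate collection between `map` and `filter`. A pure-Python sketch of the idea (no Spark, hypothetical helper names), using a generator chain:

```python
# Sketch: pipelined (fused) execution of map + filter in a single pass,
# analogous to how Spark runs narrow transformations within one stage.
def pipelined(lines):
    trace = []  # records evaluation order, for illustration
    def normalize(line):
        trace.append(("map", line))
        return line.upper()
    mapped = (normalize(l) for l in lines)        # generator: nothing runs yet
    errors = (l for l in mapped if "ERROR" in l)  # fused with the map above
    result = list(errors)  # one pass: each line is mapped, then filtered
    return result, trace

result, trace = pipelined(["error: disk", "ok", "error: net"])
print(result)    # ['ERROR: DISK', 'ERROR: NET']
print(trace[0])  # ('map', 'error: disk') -- elements flow through one at a time
```

No intermediate list of all mapped lines is ever built; each element flows through the whole stage before the next one starts, which is the property that makes per-stage pipelining cache-friendly.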
  21. Fault Recovery & Checkpoints • Efficient fault recovery using lineage • Log one operation to apply to many elements (lineage) • Recompute lost partitions on failure • Checkpoint RDDs to prevent long lineage chains during fault recovery
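Lineage-based recovery means a lost partition is rebuilt by replaying the logged transformation on its parent data, rather than by restoring a replica of the derived data. A pure-Python sketch of that idea (all names here are made up for illustration):

```python
# Sketch: recovering a lost partition by replaying lineage (the one logged
# operation that applies to many elements), instead of replicating results.
source_partitions = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
lineage = lambda part: [x * x for x in part]  # the single logged operation

derived = {i: lineage(p) for i, p in source_partitions.items()}
del derived[1]  # simulate losing partition 1 on a failed worker

# Recovery: recompute only the lost partition from its parent + lineage.
derived[1] = lineage(source_partitions[1])
print(derived[1])  # [9, 16]
```

This is cheap when the lineage chain is short; the "checkpoint RDDs" bullet exists precisely because replaying a very long chain becomes expensive, so a checkpoint truncates it.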
  22. QUICK DEMO
  23. SPARK STACK DETAILS
  24. Spark SQL • Seamlessly mix SQL queries with Spark programs • Load and query data from a variety of sources • Standard connectivity through JDBC/ODBC • Hive compatibility
  25. DataFrames • A distributed collection of data organized into named columns • Like a table in a relational database. [Figure 1: Interfaces to Spark SQL, and interaction with Spark. Spark SQL builds on Resilient Distributed Datasets; JDBC, the console, and user programs (Java, Scala, Python) access it through the DataFrame API and the Catalyst Optimizer.]
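The "named columns, like a table" idea can be shown with a minimal single-machine analogy; the `select`/`where` helpers below are hypothetical plain-Python stand-ins, loosely analogous to DataFrame operations, not Spark's API:

```python
# Sketch: a "table" as rows with named columns, with select/filter helpers
# loosely analogous to DataFrame operations (single-machine, not Spark).
rows = [
    {"name": "alice", "dept": "eng", "age": 34},
    {"name": "bob",   "dept": "ops", "age": 29},
    {"name": "carol", "dept": "eng", "age": 41},
]

def select(rows, *cols):
    # Keep only the named columns of each row.
    return [{c: r[c] for c in cols} for r in rows]

def where(rows, pred):
    # Keep only rows matching the predicate.
    return [r for r in rows if pred(r)]

eng_names = select(where(rows, lambda r: r["dept"] == "eng"), "name")
print(eng_names)  # [{'name': 'alice'}, {'name': 'carol'}]
```

Because operations are expressed over named columns rather than opaque objects, an engine like Catalyst can inspect and optimize the query plan before running it.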
  26. SparkR • R bindings for Spark and Spark SQL • Exposes existing Spark functionality in an R-friendly syntax via the DataFrame API
  27. Spark Streaming: Sources: Flume, HDFS, Kinesis, Kafka, Twitter. Sinks: file systems, databases, dashboards. High-level API (joins, windows, …), often 5x less code. Fault-tolerant: exactly-once semantics, even for stateful ops. Integrates with MLlib, SQL, DataFrames, GraphX. Chops up the live stream into batches of X seconds; a DStream is represented by a continuous series of RDDs.
  28. MLlib • Scalable machine learning library • Iterative computing -> high-quality algorithms, up to 100x faster than Hadoop
  29. MLlib Algorithms
  30. ML Pipeline • Feature extraction • Normalization • Dimensionality reduction • Model training
  31. GraphX • Spark’s API for graph and graph-parallel computation • Graph abstraction: a directed multigraph with properties attached to each vertex and edge • Works seamlessly with both graphs and collections
  32. GraphX Framework & Algorithms
  33. Spark Packages
  34. Users & Distributors…
  35. Thanks to Apache Spark by… Started using it in our projects… Contribute to their open source community… Socialize Spark…
  36. Backup Slides
  37. SPARK CLUSTER
  38. Cluster Support • Standalone: a simple cluster manager included with Spark that makes it easy to set up a cluster • Apache Mesos: a general cluster manager that can also run Hadoop MapReduce and service applications • Hadoop YARN: the resource manager in Hadoop 2
  39. Spark on Mesos
  40. Spark on YARN
  41. Data Science Process: Data Science in Practice • Data collection • Munging • Analysis • Visualization • Decision
  42. Real Time Feedback
  43. SQL Optimization (Catalyst)
  44. Project Tungsten • Memory management and binary processing: leveraging application semantics to manage memory explicitly and eliminate the overhead of the JVM object model and garbage collection • Cache-aware computation: algorithms and data structures to exploit the memory hierarchy • Code generation: using code generation to exploit modern compilers and CPUs
  45. BDAS - Berkeley Data Analytics Stack (https://amplab.cs.berkeley.edu/software/): BDAS, the Berkeley Data Analytics Stack, is an open source software stack that integrates software components being built by the AMPLab to make sense of Big Data.
  46. Optimization • groupByKey() is costlier: prefer reduceByKey() • RDD storage level MEMORY_ONLY is better
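The reduceByKey advantage comes from combining values per key inside each partition before the shuffle, so fewer records cross the network. A pure-Python sketch of both strategies (the partition data and record counts are illustrative, not Spark measurements):

```python
from collections import defaultdict

# Two partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

# groupByKey-style: every record is shuffled, then reduced on the receiver.
shuffled_group = [kv for part in partitions for kv in part]

def combine(part):
    # reduceByKey-style map-side combine: pre-aggregate within the partition.
    acc = defaultdict(int)
    for k, v in part:
        acc[k] += v
    return list(acc.items())

shuffled_reduce = [kv for part in partitions for kv in combine(part)]
print(len(shuffled_group), len(shuffled_reduce))  # 6 4 -- fewer records shuffled

# Both strategies produce the same final counts after the shuffle.
final = defaultdict(int)
for k, v in shuffled_reduce:
    final[k] += v
print(dict(sorted(final.items())))  # {'a': 3, 'b': 3}
```

With many duplicate keys per partition, the map-side combine shrinks shuffle traffic dramatically while leaving the result unchanged, which is why the slide recommends reduceByKey.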
  47. Optimization Code Example
  48. RDDs vs Distributed Shared Memory
  49. DAG Visualization
  50. Spark + Akka + Spray
  51. SparkR Architecture
  52. PySpark
  53. GraphX Representation
  54. Links & References • Spark • Spark Summit 2015 • Spark External Projects • Spark Central
  55. Project Tungsten Roadmap
  56. TACHYON • Tachyon is a memory-centric distributed storage system enabling reliable data sharing at memory speed across cluster frameworks such as Spark and MapReduce. It achieves high performance by leveraging lineage information and using memory aggressively. Tachyon caches working-set files in memory, thereby avoiding disk reads for datasets that are frequently accessed. This enables different jobs/queries and frameworks to access cached files at memory speed.
  57. BlinkDB
  58. Batches… • Chop up the live stream into batches of X seconds • Spark treats each batch of data as an RDD and processes it using RDD operations • Finally, the processed results of the RDD operations are returned in batches (micro-batch)
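The three bullets above (chop into fixed-interval batches, treat each batch as an RDD, return results per batch) can be sketched in plain Python; the batch interval and event data below are made up for illustration:

```python
# Sketch: micro-batching -- chop a stream of timestamped events into
# fixed-interval batches and apply the same (RDD-like) operation to each.
events = [(0.5, "a"), (1.2, "b"), (1.9, "a"), (2.4, "c"), (3.7, "a")]
interval = 2.0  # "batches of X seconds"

batches = {}
for t, value in events:
    # Each event lands in the batch covering its time window.
    batches.setdefault(int(t // interval), []).append(value)

# Process each batch independently, like one RDD per batch interval.
results = {n: sorted(set(vals)) for n, vals in sorted(batches.items())}
print(results)  # {0: ['a', 'b'], 1: ['a', 'c']}
```

Each batch is a small, complete dataset, so the same batch-oriented operations (and fault-recovery machinery) apply unchanged; the stream is just a sequence of them, which is exactly the DStream model on the next slide.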
  59. DStream (Discretized Stream): a DStream is represented by a continuous series of RDDs
  60. Window Operations & Checkpointing
  61. Streaming • Scalable, high-throughput stream processing of live data • Integrates with many sources • Fault-tolerant: stateful exactly-once semantics out of the box • Combines streaming with batch and interactive queries
  62. Spark Streaming: data streams -> receivers -> batches as RDDs -> results as RDDs
  63. Streaming Fault Tolerance
  64. Spark Streaming UI
  65. Micro Batch (Near Real Time)
  66. Spark with Storm
  67. Spark + Cassandra
  68. Big Data Landscape
  69. 100 Open Source Big Data Architecture Papers
