Lightning Fast Big Data Analytics using Apache Spark

  1. www.unicomlearning.com Lightning Fast Big Data Analytics using Apache Spark Manish Gupta Solutions Architect – Product Engineering and Development 30th Jan 2014 - Delhi www.bigdatainnovation.org
  2. www.unicomlearning.com www.bigdatainnovation.org Agenda Of The Talk: Hadoop – A Quick Introduction An Introduction To Spark & Shark Spark – Architecture & Programming Model Example & Demo Spark Current Users & Roadmap
  3. www.unicomlearning.com www.bigdatainnovation.org Agenda Of The Talk: Hadoop – A Quick Introduction An Introduction To Spark & Shark Spark – Architecture & Programming Model Example & Demo Spark Current Users & Roadmap
  4. www.unicomlearning.com www.bigdatainnovation.org What is Hadoop? It's open-source software for distributed storage of large datasets on commodity-class hardware in a highly fault-tolerant, scalable and flexible way (HDFS). It also provides a programming model/framework for processing these large datasets in a massively parallel, fault-tolerant and data-location-aware fashion (MR). (Diagram: Input → Map → Reduce → Output)
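As a purely conceptual illustration of the Map → Reduce dataflow in that diagram, here is a word count written with plain Scala collections (this is not the Hadoop MapReduce API, and the tiny input is made up; it only shows the shape of the computation):

// Map phase: turn each input record into (key, value) pairs
val input = Seq("a b a", "b c")
val mapped = input.flatMap(line => line.split(" ").map(word => (word, 1)))

// Shuffle: group the pairs by key
val grouped = mapped.groupBy(_._1)

// Reduce phase: aggregate the values for each key
val counts = grouped.map { case (word, pairs) => (word, pairs.map(_._2).sum) }
// counts == Map("a" -> 2, "b" -> 2, "c" -> 1)

Hadoop runs the same three phases, but with the map and reduce functions distributed across the cluster and the shuffle done over the network and disk.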
  5. www.unicomlearning.com www.bigdatainnovation.org Limitations of Map Reduce
(Diagram: each iteration is a full MapReduce pass; data goes Input → Map → Reduce → HDFS write, and the next iteration starts with another HDFS read)
 Slow due to replication, serialization, and disk IO
 Inefficient for:
 • Iterative algorithms (Machine Learning, Graphs & Network Analysis)
 • Interactive Data Mining (R, Excel, Adhoc Reporting, Searching)
  6. www.unicomlearning.com www.bigdatainnovation.org Approach: Leverage Memory?  Memory bus >> disk & SSDs  Many datasets fit into memory  1TB = 1 billion records @ 1 KB  Memory capacity also follows Moore's Law: a single 8GB stick of RAM is about $80 right now; in 2021, you'd be able to buy a single stick of RAM that contains 64GB for the same price.
  7. www.unicomlearning.com www.bigdatainnovation.org Agenda Of The Talk: Hadoop – A Quick Introduction An Introduction To Spark & Shark Spark – Architecture & Programming Model Example & Demo Spark Current Users & Roadmap
  8. www.unicomlearning.com www.bigdatainnovation.org Agenda Of The Talk: Hadoop – A Quick Introduction An Introduction To Spark & Shark Spark – Architecture & Programming Model Example & Demo Spark Current Users & Roadmap
  9. www.unicomlearning.com www.bigdatainnovation.org Spark "A big data analytics cluster-computing framework written in Scala."  Open source, originally developed in the AMPLab at UC Berkeley.  Provides in-memory analytics which is faster than Hadoop/Hive (up to 100x).  Designed for running iterative algorithms & interactive analytics.  Highly compatible with Hadoop's Storage APIs.  - Can run on your existing Hadoop cluster setup.  Developers can write driver programs using multiple programming languages. …
  10. www.unicomlearning.com www.bigdatainnovation.org Spark (Diagram: the Spark Driver (Master) talks to a Cluster Manager, which manages several Spark Workers; each Worker keeps a Cache and runs alongside an HDFS Datanode holding data Blocks)
  11. www.unicomlearning.com www.bigdatainnovation.org Spark (Diagram: the Hadoop pattern again; Input → HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → …)
  12. www.unicomlearning.com www.bigdatainnovation.org Spark (Diagram: one HDFS read of the Input, then iter. 1 → iter. 2 → … entirely in memory) Not tied to the 2-stage Map Reduce paradigm: 1. Extract a working set 2. Cache it 3. Query it repeatedly (Chart: Logistic regression running time in Hadoop vs. Spark)
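The logistic regression chart on this slide is the canonical "extract, cache, query repeatedly" workload. Below is a minimal sketch of what such an iterative job looks like in Spark's Scala API; the input path, input format (label followed by features, space-separated), feature count and iteration count are all illustrative assumptions, not taken from the deck. It assumes an existing SparkContext named sc (e.g. in spark-shell):

import scala.util.Random

case class Point(features: Array[Double], label: Double)

// Working set: parsed once, then kept in memory across all iterations
val points = sc.textFile("hdfs:///data/points.txt").map { line =>
  val cols = line.split(" ").map(_.toDouble)
  Point(cols.tail, cols.head)
}.cache()

val dims = 10                                      // illustrative feature count
var w = Array.fill(dims)(Random.nextDouble())      // initial weights

for (i <- 1 to 10) {                               // each pass reads the cached RDD, not HDFS
  val gradient = points.map { p =>
    val dot = w.zip(p.features).map { case (wi, xi) => wi * xi }.sum
    val scale = (1.0 / (1.0 + math.exp(-p.label * dot)) - 1.0) * p.label
    p.features.map(_ * scale)
  }.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  w = w.zip(gradient).map { case (wi, gi) => wi - gi }
}

In Hadoop the same loop would re-read the full dataset from HDFS on every iteration, which is exactly the overhead the chart is contrasting.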
  13. www.unicomlearning.com www.bigdatainnovation.org Spark A simple analytical operation:
1
val pagecount = spark.textFile("/wiki/pagecounts")
pagecount.count()
(Equivalent SQL: Select count(*) from pagecounts)
2
val englishPages = pagecount.filter(_.split(" ")(1) == "en")
englishPages.cache()
englishPages.count()
val englishTuples = englishPages.map(line => line.split(" "))
val englishKeyValues = englishTuples.map(line => (line(0), line(3).toInt))
englishKeyValues.reduceByKey(_ + _, 1).collect
(Equivalent SQL: Select Col1, sum(Col4) from pagecounts Where Col2 = "en" Group by Col1)
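A quick Scala note on the last line of the Spark snippet above: _ + _ is placeholder syntax for an anonymous two-argument sum, and the second argument to reduceByKey is the number of partitions in the result. Written out explicitly (behaviour unchanged):

// (a, b) => a + b is what _ + _ expands to; 1 is the number of result partitions
englishKeyValues.reduceByKey((a, b) => a + b, 1).collect()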
  14. www.unicomlearning.com www.bigdatainnovation.org Shark  HIVE on SPARK = SHARK  A large scale data warehouse system just like Apache Hive.  Highly compatible with Hive (HQL, metastore, serialization formats, and UDFs).  Built on top of Spark (thus a faster execution engine).  Provision for creating in-memory materialized tables (Cached Tables).  Cached tables use columnar storage instead of row storage.
Row Storage:      Column Storage:
1 ABC 4.1         1 2 3
2 XYZ 3.5         ABC XYZ PPP
3 PPP 6.4         4.1 3.5 6.4
  15. www.unicomlearning.com www.bigdatainnovation.org Shark (Diagram: Hive architecture; a Client (CLI, JDBC) talks to the Driver, which holds the Meta store, SQL Parser, Query Optimizer and Physical Plan Execution, and executes on Map Reduce over HDFS)
  16. www.unicomlearning.com www.bigdatainnovation.org Shark (Diagram: Shark architecture; a Client (CLI, JDBC) talks to the Driver, which holds the Meta store, SQL Parser, Query Optimizer, Cache Mgr. and Physical Plan Execution, and executes on Spark over HDFS)
  17. www.unicomlearning.com www.bigdatainnovation.org Agenda Of The Talk: Hadoop – A Quick Introduction An Introduction To Spark & Shark Spark – Architecture & Programming Model Example & Demo Spark Current Users & Roadmap
  18. www.unicomlearning.com www.bigdatainnovation.org Agenda Of The Talk: Hadoop – A Quick Introduction An Introduction To Spark & Shark Spark – Architecture & Programming Model Example & Demo Spark Current Users & Roadmap
  19. www.unicomlearning.com www.bigdatainnovation.org Spark Programming Model (Diagram: the User (Developer) writes a Driver Program: sc = new SparkContext; rdd = sc.textFile("hdfs://…"); rdd.filter(…); rdd.cache(); rdd.count(); rdd.map(…). The SparkContext talks to the Cluster Manager, which schedules Tasks on Executors running on Worker Nodes; each Worker Node keeps a Cache and sits next to an HDFS Datanode)
  20. www.unicomlearning.com www.bigdatainnovation.org Spark Programming Model (Diagram: the User (Developer) writes the same Driver Program as on the previous slide) RDD (Resilient Distributed Dataset):
• Immutable data structure
• In-memory (explicitly)
• Fault tolerant
• Parallel data structure
• Controlled partitioning to optimize data placement
• Can be manipulated using a rich set of operators
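To make the pseudocode in the diagram concrete, here is a minimal standalone driver program in Scala; the master URL, application name and HDFS path are illustrative placeholders, not values from the deck:

import org.apache.spark.SparkContext

object WikiDriver {
  def main(args: Array[String]): Unit = {
    // Connect to the cluster ("spark://master:7077" is a placeholder master URL)
    val sc = new SparkContext("spark://master:7077", "WikiDriver")

    // Base RDD backed by an HDFS file (path is illustrative)
    val pages = sc.textFile("hdfs:///wiki/pagecounts")

    // Transformation (lazy), then explicit caching of the working set
    val english = pages.filter(_.split(" ")(1) == "en")
    english.cache()

    // Actions trigger the distributed computation
    println(english.count())   // first count reads HDFS and fills the cache
    println(english.count())   // second count is served from memory

    sc.stop()
  }
}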
  21. www.unicomlearning.com www.bigdatainnovation.org RDD  Programming Interface: the programmer can perform 3 types of operations:
Transformations
• Create a new dataset from an existing one.
• Lazy in nature: they are executed only when some action is performed.
• Examples: map(func), filter(func), distinct()
Actions
• Return a value to the driver program, or export data to a storage system, after performing a computation.
• Examples: count(), reduce(func), collect(), take()
Persistence
• For caching datasets in memory for future operations.
• Option to store on disk or RAM or mixed (Storage Level).
• Examples: persist(), cache()
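A minimal sketch that exercises all three kinds of operations on one RDD; it assumes an existing SparkContext named sc (e.g. in spark-shell), and the data is made up for illustration:

import org.apache.spark.storage.StorageLevel

// Parallelized collection: a small in-memory dataset cut into 4 partitions
val nums = sc.parallelize(1 to 100000, 4)

// Transformations (lazy): nothing is computed yet
val evens   = nums.filter(_ % 2 == 0)
val squares = evens.map(n => n.toLong * n)

// Persistence: keep computed partitions in memory, spill to disk if needed
squares.persist(StorageLevel.MEMORY_AND_DISK)

// Actions: these trigger the actual computation
val howMany = squares.count()
val total   = squares.reduce(_ + _)
val sample  = squares.take(5)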
  22. www.unicomlearning.com www.bigdatainnovation.org Spark How Spark Works: RDD: a parallel collection with partitions.  User applications create RDDs, transform them, and run actions. This results in a DAG (Directed Acyclic Graph) of operators. The DAG is compiled into stages. Each stage is executed as a series of Tasks (one Task for each Partition).
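One way to see this DAG from the shell is RDD.toDebugString, which prints the lineage of operators behind an RDD before anything runs (the exact output format varies by Spark version; the path is the same one used in the examples that follow):

val pagecounts = sc.textFile("/wiki/pagecounts")
val counts = pagecounts.map(_.split(" "))
                       .map(r => (r(0), 1))
                       .reduceByKey(_ + _)

// Prints the RDD lineage that the DAG scheduler will compile into stages
println(counts.toDebugString)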
  23. www.unicomlearning.com www.bigdatainnovation.org Spark Example: sc.textFile("/wiki/pagecounts") textFile RDD[String]
  24. www.unicomlearning.com www.bigdatainnovation.org Spark Example: sc.textFile("/wiki/pagecounts") .map(line => line.split("\t")) textFile map RDD[String] RDD[List[String]]
  25. www.unicomlearning.com www.bigdatainnovation.org Spark Example: sc.textFile("/wiki/pagecounts") .map(line => line.split("\t")) .map(r => (r(0), r(1).toInt)) textFile map map RDD[String] RDD[List[String]] RDD[(String, Int)]
  26. www.unicomlearning.com www.bigdatainnovation.org Spark Example: sc.textFile("/wiki/pagecounts") .map(line => line.split("\t")) .map(r => (r(0), r(1).toInt)) .reduceByKey(_ + _, 3) textFile map map RDD[String] RDD[List[String]] RDD[(String, Int)] RDD[(String, Int)] reduceByKey
  27. www.unicomlearning.com www.bigdatainnovation.org Spark Example: sc.textFile("/wiki/pagecounts") .map(line => line.split("\t")) .map(r => (r(0), r(1).toInt)) .reduceByKey(_ + _, 3) .collect() RDD[String] RDD[List[String]] RDD[(String, Int)] RDD[(String, Int)] Array[(String, Int)] collect textFile map map reduceByKey
  28. www.unicomlearning.com www.bigdatainnovation.org Spark Execution Plan: collect textFile map map reduceByKey The above logical plan gets compiled by the DAG scheduler into a plan comprising Stages, as…
  29. www.unicomlearning.com www.bigdatainnovation.org Spark Execution Plan: Stage 1 Stage 2 collect textFile map map reduceByKey Stages are sequences of RDDs that don't have a Shuffle in between.
  30. www.unicomlearning.com www.bigdatainnovation.org Spark (Diagram: the plan split into Stage 1 (textFile, map, map, partial reduceByKey) and Stage 2 (final reduceByKey, collect))
Stage 1:
1. Read HDFS split
2. Apply both the maps
3. Start partial reduce
4. Write shuffle data
Stage 2:
1. Read shuffle data
2. Final reduce
3. Send result to driver program
  31. www.unicomlearning.com www.bigdatainnovation.org Spark Stage Execution: Stage 1: Task 1, Task 2, …  Create a task for each Partition in the new RDD  Serialize the Task  Schedule and ship Tasks to Slaves And all this happens internally (you don't need to do anything)
  32. www.unicomlearning.com www.bigdatainnovation.org Spark Task Execution: Task is the fundamental unit of execution in Spark. (Diagram: over time, a Task fetches its input from HDFS or an RDD, executes, and writes its output to HDFS, an RDD, or intermediate shuffle output)
  33. www.unicomlearning.com www.bigdatainnovation.org Spark Spark Executor (Slaves) (Diagram: each executor core, e.g. Core 1, Core 2, Core 3, repeatedly pipelines Fetch Input → Execute Task → Write Output)
  34. www.unicomlearning.com www.bigdatainnovation.org Spark Summary of Components  Task: the fundamental unit of execution in Spark  Stage: a set of Tasks that run in parallel  DAG: the logical graph of RDD operations  RDD: a parallel dataset with partitions
  35. www.unicomlearning.com www.bigdatainnovation.org Agenda Of The Talk: Hadoop – A Quick Introduction An Introduction To Spark & Shark Spark – Architecture & Programming Model Example & Demo Spark Current Users & Roadmap
  36. www.unicomlearning.com www.bigdatainnovation.org Agenda Of The Talk: Hadoop – A Quick Introduction An Introduction To Spark & Shark Spark – Architecture & Programming Model Example & Demo Spark Current Users & Roadmap
  37. www.unicomlearning.com www.bigdatainnovation.org Example & Demo Cluster Details:  6 m1.xlarge EC2 nodes  1 machine is the Master Node  5 worker node machines  64-bit, 4 vCPUs  15 GB RAM
  38. www.unicomlearning.com www.bigdatainnovation.org Example & Demo Dataset:  Wiki Page View Stats  20 GB of webpage view counts  3 days' worth of data  <date_time> <project_code> <page_title> <num_hits> <page_size>
Base RDD of all Wiki pages:
val allPages = sc.textFile("/wiki/pagecounts")
allPages.take(10).foreach(println)
allPages.count()
Transformed RDD of all English pages (cached):
val englishPages = allPages.filter(_.split(" ")(1) == "en")
englishPages.cache()
englishPages.count()
englishPages.count()
  39. www.unicomlearning.com www.bigdatainnovation.org Example & Demo Dataset:  Wiki Page View Stats  20 GB of webpage view counts  3 days' worth of data  <date_time> <project_code> <page_title> <num_hits> <page_size>
Select date, sum(pageviews) from pagecounts group by date
englishPages.map(line => line.split(" ")).map(line => (line(0).substring(0, 8), line(3).toInt)).reduceByKey(_+_, 1).collect.foreach(println)
Select date, count(distinct pageURL) from pagecounts group by date
englishPages.map(line => line.split(" ")).map(line => (line(0).substring(0, 8), line(2))).distinct().countByKey().foreach(println)
Select distinct(datetime) from pagecounts order by datetime
englishPages.map(line => line.split(" ")).map(line => (line(0), 1)).distinct().sortByKey().collect().foreach(println)
  40. www.unicomlearning.com www.bigdatainnovation.org Example & Demo Dataset:  Network Datasets  Directed and Bi-directed Graphs  One small Facebook Social Network  127 nodes (Friends)  1668 Edges (Friendships)  Bi-directed graph  Google’s internal site network  15713 Nodes (web pages)  170845 Edges (hyperlinks)  Directed Graph
  41. www.unicomlearning.com www.bigdatainnovation.org Example & Demo Page Rank Calculation:
• Estimate the node importance.
• Each directed link from A -> B is a vote to B from A.
• The more links to a page, the more important the page is.
• When a page with a higher PR points to something, its vote weighs more.
1. Start each page at a rank of 1
2. On each iteration, have page p contribute (rank of p) / (no. of neighbors of p) to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs
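Written as a single update formula (this is just the three steps above, with 0.15 and 0.85 as the damping terms from step 3):

$\mathrm{rank}_{i+1}(p) = 0.15 + 0.85 \sum_{q \in \mathrm{in}(p)} \frac{\mathrm{rank}_i(q)}{|\mathrm{out}(q)|}$

where in(p) is the set of pages linking to p and |out(q)| is the number of outgoing links of q. The Scala code on the next slide implements exactly this loop.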
  42. www.unicomlearning.com www.bigdatainnovation.org Example & Demo Scala Code:
var iters = 100
val lines = sc.textFile("/dataset/google/edges.csv", 1)
val links = lines.map { s =>
  val parts = s.split("\t")
  (parts(0), parts(1))
}.distinct().groupByKey().cache()
var ranks = links.mapValues(v => 1.0)
for (i <- 1 to iters) {
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    val size = urls.size
    urls.map(url => (url, rank / size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
val output = ranks.map(l => (l._2, l._1)).sortByKey(false).map(l => (l._2, l._1))
output.take(20).foreach(tup => println(tup._2 + " : " + tup._1))
  43. 2 seconds
  44. 38 seconds
Page Rank       Page URL
761.1985177     google
455.7028756     google/about.html
259.6052388     google/privacy.html
192.7257649     google/jobs/
144.0349154     google/support
134.1566312     google/terms_of_service.html
130.3546324     google/intl/en/about.html
123.4014613     google/imghp
120.0661165     google/accounts/Login
118.6884515     google/intl/en/options/
112.2309539     google/preferences
108.8375347     google/sitemap.html
106.9724799     google/press/
105.822426      google/language_tools
105.1554798     google/support/toolbar/
99.97741309     google/maps
97.90651416     google/advanced_search
90.7910291      google/intl/en/services/
90.70522689     google/intl/en/ads/
87.4353413      google/adsense/
  45. www.unicomlearning.com www.bigdatainnovation.org Agenda Of The Talk: Hadoop – A Quick Introduction An Introduction To Spark & Shark Spark – Architecture & Programming Model Example & Demo Spark Current Users & Roadmap
  46. www.unicomlearning.com www.bigdatainnovation.org Spark Current Users & Roadmap Source: Apache - Powered By Spark
  47. www.unicomlearning.com www.bigdatainnovation.org Roadmap
  48. www.unicomlearning.com www.bigdatainnovation.org Conclusion  Because of in-memory processing, computations are very fast. Developers can write iterative algorithms without writing out a result set after each pass through the data.  Suitable for scenarios where sufficient memory is available in your cluster.  It provides an integrated framework for advanced analytics like Graph Processing, Stream Processing, Machine Learning etc. This simplifies integration.  Its community is expanding and development is happening very aggressively.  It's comparatively newer than Hadoop and has only a few users so far.
  49. www.unicomlearning.com Topic: Thank You Speaker name: MANISH GUPTA Email ID: manish.gupta@globallogic.com www.bigdatainnovation.org Organized by UNICOM Trainings & Seminars Pvt. Ltd. contact@unicomlearning.com
  50. Backup Slides
  51. www.unicomlearning.com www.bigdatainnovation.org Spark Internal Components (Diagram: Spark core with Operators, Scheduler, Block manager, Networking, Accumulators and Broadcast; plus the Interpreter, Hadoop I/O, Mesos backend and Standalone backend)
  52. www.unicomlearning.com www.bigdatainnovation.org In-Memory But what if I run out of memory? (Chart: Iteration time (s) vs. % of working set in memory: Cache disabled 68.8, 25% 58.1, 50% 40.7, 75% 29.7, Fully cached 11.5)
  53. www.unicomlearning.com www.bigdatainnovation.org Benchmarks  AMPLab performed a quantitative and qualitative comparison of 4 systems: Hive, Impala, Redshift and Shark  Done on the Common Crawl Corpus dataset  81 TB in size  Consists of 3 tables:  Page Rankings  User Visits  Documents  Data was partitioned in such a way that each node had:  25 GB of User Visits  1 GB of Rankings  30 GB of Web Crawl (documents) Source: https://amplab.cs.berkeley.edu/benchmark/#
  54. www.unicomlearning.com www.bigdatainnovation.org Benchmarks
  55. www.unicomlearning.com www.bigdatainnovation.org Benchmarks Hardware Configuration
  56. www.unicomlearning.com www.bigdatainnovation.org Benchmarks • Redshift outperforms for on-disk data. • Shark and Impala outperform Hive by 3-4X. • For larger result-sets, Shark outperforms Impala.
  57. www.unicomlearning.com www.bigdatainnovation.org Benchmarks • Redshift columnar storage outperforms every time. • Shark in-memory is 2nd best in all cases.
  58. www.unicomlearning.com www.bigdatainnovation.org Benchmarks • Redshift's bigger cluster has an advantage. • Shark and Impala are competitive.
  59. www.unicomlearning.com www.bigdatainnovation.org Benchmarks • Impala & Redshift don't have UDFs. • Shark outperforms Hive.
  60. www.unicomlearning.com www.bigdatainnovation.org Roadmap
  61. www.unicomlearning.com www.bigdatainnovation.org Spark in the last 6 months of 2013

Editor's Notes

  1. Solutions Architect at GlobalLogic. Been working for the last 10 years on large databases, data warehouses, ETLs, data mining, and for around 2-3 years now on Big Data Analytics, Machine Learning & distributed systems. GlobalLogic is a 6000+ headcount company in full product life cycle services and one of the fastest growing R&D services firms. It provides advisory, professional services, engineering and support services to 250+ customers globally. Will speak about an in-memory cluster computing framework that can really nitrogen-boost your existing Hadoop-based Big Data setup for analytics.
  2. Quickly touch upon Hadoop, what it does, HDFS, Map Reduce, and some of its limitations. Introduce Spark and one of the tools built on top of Spark called Shark (the SQL interface to Spark). A little bit on Spark's architecture and its basic programming model. Showcase a demo of Spark and Shark's functionality. Will speak a bit about the future of Spark, where it's heading, and about some of its existing customers and contributors.
  3. Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure. Large is basically 10-100 GB and above. It is the driving force behind the big data industry's growth. Provides 2 basic components: HDFS, a large scale storage system, and Map Reduce, a distributed cluster computing framework. A typical Hadoop setup comprises: a cluster of a particular Hadoop distribution; tools like Hive, Pig and Mahout running on top of Hadoop (internally processing HDFS data using Map Reduce jobs); and a set of tools for importing/exporting data into HDFS from/to external systems like RDBMS or server logs.
  4. One of the reasons why Map Reduce is criticized is its restricted programming framework: MapReduce tasks must be written as acyclic dataflow programs, a stateless mapper followed by a stateless reducer, executed by a batch job scheduler. Repeated querying of datasets becomes difficult, and thus it is hard to write iterative algorithms. After each iteration of Map-Reduce, data has to be persisted on disk for the next iteration to proceed with processing.
  5. SparkContext: represents the connection to a Spark cluster and provides the entry point for interacting with Spark and distributing our jobs. Driver program: the process running the main() function of the application and creating the SparkContext. Cluster manager: an external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN). Worker node: any node that can run application code in the cluster. Executor: a process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them; each application has its own executors. Task: a unit of work that will be sent to one executor. Job: a parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs. Stage: each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
  6. Resilient Distributed Datasets, or RDDs, are the distributed memory abstraction that lets programmers perform in-memory parallel computations on large clusters, and that too in a highly fault tolerant manner. This is the main concept around which the whole Spark framework revolves. Currently there are 2 types of RDDs: Parallelized collections: created by calling the parallelize method on an existing Scala collection; the developer can specify the number of slices to cut the dataset into, ideally 2-3 slices per CPU. Hadoop datasets: distributed datasets created from any file stored on HDFS or other storage systems supported by Hadoop (S3, HBase, etc.); these are created using SparkContext's textFile method, and the default number of slices in this case is 1 slice per file block.
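A minimal shell sketch of the two RDD types this note describes (assumes an existing SparkContext named sc; the HDFS path is illustrative):

// Parallelized collection: slice an existing Scala collection into 8 partitions
val localData = sc.parallelize(1 to 1000000, 8)

// Hadoop dataset: by default one slice (partition) per HDFS block of the file
val hdfsData = sc.textFile("hdfs:///data/events.log")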
  7. Transformations: like map, which takes an RDD as input, passes each element through a function, and returns a new transformed RDD as output. By default, each transformed RDD is recomputed each time you run an action on it, unless you specify the RDD to be cached in memory; Spark will then try to keep the elements around the cluster for faster access. RDDs can be persisted on disk as well. Caching is the key tool for iterative algorithms. Using persist, one can specify the Storage Level for persisting an RDD; cache is just shorthand for the default storage level, which is MEMORY_ONLY.
MEMORY_ONLY: store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK: store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
MEMORY_ONLY_SER: store the RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER: similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
DISK_ONLY: store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: same as the levels above, but replicate each partition on two cluster nodes.
Which storage level is best? A few things to consider: keep as much in memory as possible; try not to spill to disk unless recomputing your datasets is expensive; use replication only if you want fault tolerance.
  8. PageRank is an algorithm used by Google Search to rank websites in their search engine results. PageRank was named after Larry Page,[1] one of the founders of Google. PageRank is a way of measuring the importance of website pages.PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
  9. Spark Streaming: for stream processing. Continuously executes various parallel operations on an input stream of data. The system receives a continuous stream of data and divides it into batches, and each batch is treated and processed as an RDD. GraphX: a distributed graph system designed to efficiently execute graph algorithms using Spark's parallel and in-memory computation framework. MLBase: the goal of MLBase is to make distributed machine learning easy. BlinkDB: an approximate query engine that allows a trade-off between accuracy and response time, staying highly interactive on very large datasets; in the process of being deployed at Facebook. AMPLab has demonstrated how complex queries on 17 TB of data (running on a 100 node cluster) can be completed in less than 2 seconds! You specify queries with a time bound, e.g. Select avg(SessionTime) from tblSession where UserGender='MALE' within 2 SECONDS
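A minimal Spark Streaming sketch of the "continuous input divided into small batches, each processed as an RDD" idea described above; the local master, socket source and 1-second batch interval are illustrative assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

// One StreamingContext per application; a new batch is formed every second
val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1))

// Each 1-second batch of lines from the socket becomes an RDD inside a DStream
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()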
  10. Interpreter: it's actually the Scala command line (interpreter) that has been modified for Spark. Hadoop I/O: for reading/writing from HDFS. Standalone: custom resource manager. Operators: map, join, group by etc. on RDDs. Networking: replication, caching, graph. Block Manager: a very simple key-value store used as a cache. Broadcaster: sending/receiving events, heartbeats etc.
  11. Used by a majority of Fortune 50 companies.