
Etu Solution Day 2014 Track-D: 掌握Impala和Spark


  1. 掌握Impala和Spark (Mastering Impala and Spark): Real-time Big Data Application Architecture Workshop. Etu Chief Consultant 陳昭宇, Oct 8, 2014
  2. Workshop Goal Let’s talk about the 3Vs of Big Data. Hadoop is good for Volume and Variety, but how about Velocity? That is why we are here.
  3. Target Audience • CTO • Architect • Software/Application Developer • IT
  4. Background Knowledge • Linux operating system • Basic Hadoop ecosystem knowledge • Basic knowledge of SQL • Java or Python programming experience
  5. Terminology • Hadoop: Open-source big data platform • HDFS: Hadoop Distributed File System • MapReduce: Parallel computing framework on top of HDFS • HBase: NoSQL database on top of Hadoop • Impala: MPP SQL query engine on top of Hadoop • Spark: In-memory cluster computing engine • Hive: SQL-to-MapReduce translator • Hive Metastore: Database that stores table schemas • HiveQL: A SQL subset
  6. Agenda • What is Hadoop, and what’s wrong with Hadoop in real time? • What is Impala? • Hands-on Impala • What is Spark? • Hands-on Spark • Spark and Impala working together • Q & A
  7. What is Hadoop? Apache Hadoop is an open-source platform for data storage and processing that is distributed, fault tolerant, and scalable. Core Hadoop system components: HDFS, a fault-tolerant, scalable clustered storage; and MapReduce, a distributed computing framework. Flexible for storing and mining any type of data: ask questions across structured and unstructured data; schema-less. Processes complex big data: a scale-out architecture divides workloads across nodes, and a flexible file system eliminates ETL bottlenecks. Scales economically: deploys on commodity hardware; open-source platform.
  8. Limitations of MapReduce • Batch oriented • High latency • Doesn’t fit all cases • Only for developers
  9. Pig and Hive • MR is hard and only for developers • High-level abstractions that convert declarative syntax to MR – SQL: Hive – Dataflow language: Pig • Built on top of MapReduce
  10. Goals • General-purpose SQL engine: – Works for both analytic and transactional/single-row workloads – Supports queries that take from milliseconds to hours • Runs directly within Hadoop: – Reads widely used Hadoop file formats – Runs on the same nodes that run Hadoop processes • High performance: – C++ instead of Java – Runtime code generation – Completely new execution engine – No MapReduce
  11. What is Impala • General-purpose SQL engine • Real-time queries in Apache Hadoop • In beta since Oct. 2012 • GA since Apr. 2013 • Apache licensed • Latest release: v1.4.2
  12. Impala Overview • Distributed service in the cluster: one Impala daemon on each data node • No SPOF • Users submit queries via ODBC/JDBC, the CLI, or HUE to any of the daemons • Queries are distributed to all nodes with data locality • Uses Hive’s metadata interface and connects to the Hive metastore • Supported file formats: – Uncompressed/LZO-compressed text files – Sequence files and RCFile with Snappy/gzip, Avro – Parquet columnar format
  13. Impala’s SQL • High compatibility with HiveQL • SQL support: – Essential SQL-92, minus correlated subqueries – INSERT INTO … SELECT … – Only equi-joins; no non-equi-joins, no cross products – ORDER BY requires LIMIT (no longer required after 1.4.2) – Limited DDL support – SQL-style authorization via Apache Sentry – UDFs and UDAFs are supported
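To make that subset concrete, here is a minimal sketch in Impala SQL; the tables sales, stores, and top_stores are hypothetical:

      -- Hypothetical tables, illustrating the supported subset.
      CREATE TABLE top_stores (store_id INT, total DOUBLE);

      INSERT INTO top_stores
      SELECT s.store_id, SUM(s.amount)
      FROM sales s
      JOIN stores t ON (s.store_id = t.id)   -- equi-joins only
      GROUP BY s.store_id;

      -- In releases before 1.4.2, ORDER BY required a LIMIT:
      SELECT store_id, total FROM top_stores ORDER BY total DESC LIMIT 10;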
  14. Impala’s SQL limitations • No custom file formats or SerDes • No beyond-SQL features (buckets, samples, transforms, arrays, structs, maps, XPath, JSON) • Broadcast joins and partitioned hash joins are supported (smaller tables have to fit in the aggregate memory of all executing nodes)
  15. Working with HBase • Functionality highlights: – Support for SELECT, INSERT INTO … SELECT …, and INSERT INTO … VALUES (…) – Predicates on rowkey columns are mapped into start/stop rows – Predicates on other columns are mapped into SingleColumnValueFilters • But the mapping between HBase tables and metastore tables is patterned after Hive: – All data is stored as scalars and in ASCII – The rowkey must be mapped into a single string column
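As a sketch of that Hive-patterned mapping (table and column names are hypothetical), the table is defined once via Hive DDL and then queried from Impala:

      -- Hive DDL; Impala reads the same metastore mapping.
      CREATE EXTERNAL TABLE hbase_users (
        rowkey STRING,   -- the HBase rowkey must map to a single string column
        name   STRING,
        email  STRING
      )
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:email")
      TBLPROPERTIES ("hbase.table.name" = "users");

      -- In Impala, a rowkey predicate becomes a start/stop row on the HBase scan:
      SELECT name FROM hbase_users WHERE rowkey = 'user123';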
  16. HBase in the Roadmap • Full support for UPDATE and DELETE • Storage of structured data to minimize storage and access overhead • Composite rowkey encoding, mapped into an arbitrary number of table columns
  17. Impala’s Architecture [Diagram: three nodes, each running an impalad with a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DataNode and HBase; shared Hive Metastore, HDFS NameNode, and Statestore; a SQL client connects via ODBC.] 1. A request arrives via ODBC/JDBC/Beeswax/shell.
  18. Impala’s Architecture 2. The planner turns the request into collections of plan fragments. 3. The coordinator initiates execution on the impalad(s) local to the data.
  19. Impala’s Architecture 4. Intermediate results are streamed between impalad(s). 5. Query results are streamed back to the client.
  20. Metadata Handling • Impala metadata – Hive’s metastore: logical metadata (table definitions, columns, CREATE TABLE parameters) – HDFS NameNode: directory contents and block replica locations – HDFS DataNode: block replicas’ volume IDs • Caches metadata: no synchronous metastore API calls during query execution • Impala instances read metadata from the metastore at startup • The Catalog Service relays metadata when you run DDL or update metadata on one of the impalads
  21. Metadata Handling – Cont. • REFRESH [<tbl>]: reloads that table’s metadata on all impalads (use it after adding new files via Hive) • INVALIDATE METADATA: reloads metadata for all tables
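For example (the table name is hypothetical):

      -- After adding files to an existing table outside Impala (e.g. via Hive):
      REFRESH weblogs;

      -- After creating or dropping tables outside Impala:
      INVALIDATE METADATA;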
  22. Comparing Impala to Dremel • What is Dremel? – Columnar storage for data with nested structures – Distributed, scalable aggregation on top of that • Columnar storage in Hadoop: Parquet – Stores data in appropriate native/binary types – Can also store nested structures similar to Dremel’s ColumnIO • Distributed aggregation: Impala • Impala plus Parquet: a superset of the published version of Dremel (which does not support joins)
  23. Comparing Impala to Hive • Hive: MapReduce as an execution engine – High-latency, low-throughput queries – Fault-tolerance model based on MapReduce’s on-disk checkpointing; materializes all intermediate results – Java runtime allows for easy late binding of functionality: file formats and UDFs – Extensive layering imposes high runtime overhead • Impala: – Direct, process-to-process data exchange – No fault tolerance – An execution engine designed for low runtime overhead
  24. Impala and Hive They share everything client-facing: • Metadata (table definitions) • ODBC/JDBC drivers • SQL syntax (Hive SQL) • Flexible file formats • Machine pool • GUI But they are built for different purposes: • Hive: runs on MapReduce; ideal for batch processing • Impala: native MPP query engine; ideal for interactive SQL [Diagram: a shared stack — data ingestion, resource management, and a common data store (HDFS and HBase; TEXT, RCFILE, PARQUET, AVRO, etc.); Hive is Hive SQL syntax on the MapReduce compute framework, while Impala is SQL syntax plus its own compute framework.]
  25. Typical Use Cases • Data warehouse offload • Ad-hoc analytics • SQL interoperability for HBase
  26. Hands-on Impala • Query a file on HDFS with Impala • Query a table on HBase with Impala
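A minimal sketch of the first exercise, assuming a hypothetical tab-delimited log file already sitting on HDFS (the table, columns, and path are placeholders):

      -- Map a schema onto the existing HDFS file, then query it in place.
      CREATE EXTERNAL TABLE weblogs (
        ip  STRING,
        ts  STRING,
        url STRING
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION '/user/etu/weblogs';

      SELECT url, COUNT(*) AS hits
      FROM weblogs
      GROUP BY url
      ORDER BY hits DESC
      LIMIT 10;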
  27. What is Spark? • MapReduce review… • Apache Spark… • How Spark works… • Fault tolerance and performance… • Examples… • Spark & more…
  28. MapReduce: Good The good: • Built-in fault tolerance • Optimized IO path • Scalable • Developer focuses on Map/Reduce, not infrastructure • Simple(?) API
  29. MapReduce: Bad The bad: • Optimized for disk IO – Does not leverage memory – Iterative algorithms go through the disk IO path again and again • Primitive API – Developers have to build on a very simple abstraction – Key/value in, key/value out – Even basic things like joins require extensive code • A common result is many output files that must be combined appropriately
  30. Apache Spark • Originally developed in 2009 in UC Berkeley’s AMP Lab • Fully open sourced in 2010 – now at the Apache Software Foundation
  31. Spark: Easy and Fast Big Data • Easy to develop – Rich APIs in Java, Scala, Python – Interactive shell • 2-5x less code • Fast to run – General execution graph – In-memory store
  32. How Spark Works – SparkContext [Diagram: the driver’s SparkContext talks to the Cluster Master; each Spark Worker runs an Executor with a cache and tasks, co-located with an HDFS Data Node.] Driver-side pseudo-code: sc = new SparkContext; rdd = sc.textFile("hdfs://…"); rdd.filter(…); rdd.cache(); rdd.count(); rdd.map(…)
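The driver-side pseudo-code above, written out as runnable Scala against the Spark 1.x API (the input path and filter predicate are placeholders):

      // Build a SparkContext, then chain transformations and an action.
      import org.apache.spark.{SparkConf, SparkContext}

      val sc = new SparkContext(new SparkConf().setAppName("Demo"))

      val lines  = sc.textFile("hdfs://...")                    // lazy: nothing is read yet
      val errors = lines.filter(line => line.contains("ERROR")) // placeholder predicate
      errors.cache()                                            // keep results in executor memory
      println(errors.count())                                   // action: triggers the job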
  33. How Spark Works – RDD RDD (Resilient Distributed Dataset) • Partitions of data, with dependencies between partitions • Fault tolerant • Controlled partitioning to optimize data placement • Manipulated using a rich set of operators • Storage types: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY
  34. RDD • Stands for Resilient Distributed Datasets • Spark revolves around RDDs • A fault-tolerant, read-only collection of elements that can be operated on in parallel • Cached in memory Reference: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  35. RDD • A read-only, partitioned collection of records [Diagram: an RDD of three partitions, D1/D2/D3; each coarse-grained operation produces a new three-partition RDD, and an action finally reduces the partitions to a value.] • Supports only coarse-grained operations – e.g., map and group-by transformations, reduce actions • Partitions of data • Dependencies between partitions
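A small sketch of these coarse-grained operations over an explicitly 3-way partitioned RDD, mirroring the D1/D2/D3 partitions in the figure:

      // Three partitions, like D1/D2/D3 above.
      val nums = sc.parallelize(1 to 9, 3)

      val doubled = nums.map(_ * 2)                        // transformation: new RDD, 3 partitions
      val grouped = nums.map(n => (n % 3, n)).groupByKey() // transformation: shuffles across partitions
      val total   = nums.reduce(_ + _)                     // action: collapses partitions to one value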
  36. RDD Operations
  37. RDD Operations – Expressive • Transformations – Create a new RDD from an existing one: • map, filter, distinct, union, sample, groupByKey, join, reduce, etc. • Actions – Return a value after running a computation: • collect, count, first, takeSample, foreach, etc. • Reference – http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
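Transformations are lazy and only describe a new RDD; actions trigger the actual computation. A short sketch (path elided as elsewhere in the deck):

      val words = sc.textFile("hdfs://...")
        .flatMap(_.split("\\s+"))                // transformation: recorded, not executed
        .distinct()                              // transformation
      val n    = words.count()                   // action: runs the whole lineage
      val some = words.takeSample(false, 5, 42)  // action: 5 random words, fixed seed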
  38. Word Count on Spark sparkContext.textFile("hdfs://…") — RDD[String] (textFile)
  39. Word Count on Spark sparkContext.textFile("hdfs://…").flatMap(line => line.split("\\s")) — RDD[String] → RDD[String] (textFile, flatMap)
  40. Word Count on Spark sparkContext.textFile("hdfs://…").flatMap(line => line.split("\\s")).map(word => (word, 1)) — RDD[String] → RDD[String] → RDD[(String, Int)] (textFile, flatMap, map)
  41. Word Count on Spark sparkContext.textFile("hdfs://…").flatMap(line => line.split("\\s")).map(word => (word, 1)).reduceByKey((a, b) => a + b) — RDD[String] → RDD[String] → RDD[(String, Int)] → RDD[(String, Int)] (textFile, flatMap, map, reduceByKey)
  42. Word Count on Spark sparkContext.textFile("hdfs://…").flatMap(line => line.split("\\s")).map(word => (word, 1)).reduceByKey((a, b) => a + b).collect() — RDD[String] → RDD[String] → RDD[(String, Int)] → RDD[(String, Int)] → Array[(String, Int)] (textFile, flatMap, map, reduceByKey, collect)
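The same pipeline packaged as a self-contained program (note flatMap rather than map in the first step, so the result is one record per word; the input path is elided as on the slides):

      import org.apache.spark.{SparkConf, SparkContext}

      object WordCount {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
          val counts = sc.textFile("hdfs://...")
            .flatMap(line => line.split("\\s+"))   // one record per word
            .map(word => (word, 1))                // pair each word with a count of 1
            .reduceByKey((a, b) => a + b)          // sum the counts per word
            .collect()                             // action: bring results to the driver
          counts.foreach(println)
          sc.stop()
        }
      }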
  43. Actions and Parallel Operations • map, reduce, sample, filter, count, take, groupBy, fold, first, sort, reduceByKey, partitionBy, union, groupByKey, mapWith, join, cogroup, pipe, leftOuterJoin, cross, save, rightOuterJoin, zip, …
  44. Stages textFile → map → map → reduceByKey → collect form a DAG (Directed Acyclic Graph); the DAG is split into Stage 1 and Stage 2 at the shuffle, and each stage is executed as a series of Tasks (one Task for each partition).
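The stage split can be inspected directly: everything up to the shuffle introduced by reduceByKey runs as one stage, and toDebugString prints the lineage with its shuffle boundary:

      val counts = sc.textFile("hdfs://...")
        .flatMap(line => line.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKey((a, b) => a + b)  // shuffle: splits the DAG into two stages
      println(counts.toDebugString)    // shows the lineage and the shuffle dependency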
  45. Tasks The Task is the fundamental unit of execution in Spark. Each task fetches input (from HDFS or an RDD), executes, and writes output (to HDFS, an RDD, or intermediate shuffle output); one such fetch-execute-write pipeline runs per core (Core 1, Core 2, …).
  46. Spark Summary • SparkContext • Resilient Distributed Dataset • Parallel operations • Shared variables – Broadcast variables – read-only – Accumulators
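A sketch of the two shared-variable types in the Spark 1.x API (the log format and status-code lookup are hypothetical):

      // Broadcast: a read-only value shipped once to each node.
      val statusNames = sc.broadcast(Map("200" -> "OK", "404" -> "Not Found"))
      // Accumulator: workers can only add to it; only the driver reads it.
      val unknown = sc.accumulator(0)

      sc.textFile("hdfs://...").foreach { line =>
        val code = line.split("\\s+").headOption.getOrElse("") // hypothetical: code in first field
        if (!statusNames.value.contains(code)) unknown += 1
      }
      println(s"lines with unknown status: ${unknown.value}")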
  47. Comparison (MapReduce / Impala / Spark) • Storage: HDFS / HDFS and HBase / HDFS • Scheduler: MapReduce job / query plan / computation graph • I/O: disk / in-memory with cache / in-memory with cache and shared data • Fault tolerance: duplication and disk I/O / no fault tolerance / hash partitioning and automatic reconstruction • Iterative workloads: bad / bad / good • Shared data: no / no / yes • Streaming: no / no / yes
  48. Hands-on Spark • Spark shell • Word count
  49. Spark Streaming • Takes the concept of RDDs and extends it to DStreams – Fault-tolerant like RDDs – Transformable like RDDs • Adds new “rolling window” operations – Rolling averages, etc. • But keeps everything else! – Regular Spark code works in Spark Streaming – Can still access HDFS data, etc. • Example use cases: – “On-the-fly” ETL as data is ingested into Hadoop/HDFS – Detecting anomalous behavior and triggering alerts – Continuous reporting of summary metrics for incoming data
  50. How Streaming Works
  51. Micro-batching for On-the-Fly ETL
  52. Window-based Transformations
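A minimal DStream sketch that combines these pieces against the Spark Streaming 1.x API: 1-second micro-batches, a word count, and a 30-second rolling window sliding every 10 seconds (the socket source is a stand-in for a real stream):

      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val ssc   = new StreamingContext(sc, Seconds(1))     // 1-second micro-batches
      val lines = ssc.socketTextStream("localhost", 9999)  // placeholder source
      val counts = lines.flatMap(_.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
      counts.print()                                       // rolling counts, refreshed every 10s
      ssc.start()
      ssc.awaitTermination()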
  53. Spark SQL • Spark SQL is one of Spark’s components – Executes SQL on Spark – Builds SchemaRDDs • Optimizes execution plans • Uses existing Hive metastores, SerDes, and UDFs
  54. Unified Data Access • Load and query data from a variety of sources • SchemaRDDs provide a single interface that works efficiently with structured data, including Hive tables, Parquet files, and JSON. Query and join different data sources: sqlCtx.jsonFile("s3n://...").registerAsTable("json"); schema_rdd = sqlCtx.sql(""" SELECT * FROM hiveTable JOIN json ...""")
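A Scala sketch of the same idea against the Spark 1.1-era API; HiveContext is used so the Hive table is visible, and the path, table, and columns are hypothetical:

      import org.apache.spark.sql.hive.HiveContext

      val sqlContext = new HiveContext(sc)        // sees tables in the Hive metastore
      val events = sqlContext.jsonFile("hdfs://.../events.json") // schema is inferred
      events.registerTempTable("json")            // registerAsTable in Spark 1.0

      val joined = sqlContext.sql("""
        SELECT h.id, j.payload
        FROM hiveTable h JOIN json j ON h.id = j.id
      """)
      joined.collect().foreach(println)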
  55. Hands-on Spark • Parse/transform a log on the fly with Spark Streaming • Aggregate with Spark SQL (top N) • Output from Spark to HDFS
  56. Spark & Impala working together [Diagram: data streams (click streams, machine data, logs, network traffic, etc.) flow into Spark Streaming on each node; Spark, Impala, an HDFS DataNode (DN), and an HBase RegionServer (RS) are co-located on every node.] • On-the-fly processing: ETL, transformation, filtering; pattern matching and alerts • Real-time analytics: machine learning (recommendation, clustering, …); iterative algorithms • Near-real-time query: ad-hoc query; reporting • Long-term data store: batch processing; offline analytics; historical mining
  57. Etu makes Hadoop easier A fully automated, high-performance, easy-to-manage software appliance: automated bare-metal deployment, performance optimization, and full-cluster management; the only local Hadoop professional services; runs on mainstream x86 commodity servers.
  58. ESA Software Stack Cloudera Manager and Etu Manager on CentOS (64-bit), plus Etu value-added modules: security management, performance optimization, configuration synchronization, network management, monitoring and alerting, package management, HA management, and rack awareness.
  59. Etu’s position and value in the Hadoop ecosystem Etu Professional Services – Etu Support, Etu Consulting, and Etu Training – together with the software appliance shield customers from the complexity of deploying and operating the Hadoop platform. Applications and data are the enterprise’s core value: allocate resources to that core value to build competitive advantage, and launch services quickly to capture the market. Etu covers talent recruiting, team building, application development, and data architecture and mining design, and makes deployment, tuning, operations, and management easy.
  60. Question and Discussion Thank you 318, Rueiguang Rd., Taipei 114, Taiwan T: +886 2 7720 1888 F: +886 2 8798 6069 www.etusolution.com
