Transformation Processing Smackdown; Spark vs Hive vs Pig

Compare and contrast using Spark, Hive and Pig for transformation processing requirements. Video of my "talk" at https://www.youtube.com/watch?v=36_MayK5eU4.

Conference page for the talk is at https://devnexus.com/s/devnexus2017/presentations/17533.

Transformation Processing Smackdown; Spark vs Hive vs Pig

  1. 1. Page1 Transformation Processing Smackdown Spark vs Hive vs Pig Lester Martin DevNexus 2017
  2. 2. Page2 Connection before Content Lester Martin – Hadoop/Spark Trainer & Consultant lester.martin@gmail.com http://lester.website (links to blog, twitter, github, LI, FB, etc)
  3. 3. Page3 Agenda • Present Frameworks • File Formats • Source to Target Mappings • Data Quality • Data Profiling • Core Processing Functionality • Custom Business Logic • Mutable Data Concerns • Performance Lots of Code !
  4. 4. Page4 Standard Disclaimers Apply Wide Topic – Multiple Frameworks – Limited Time, so… • Simple use cases • Glad to enhance https://github.com/lestermartin/oss-transform-processing-comparison • ALWAYS 2+ ways to skin a cat • Especially with Spark ;-) • CLI, not GUI, tools • Others in that space such as Talend, Informatica & Syncsort • ALL code compiles in PPT ;-) • Won’t explain all examples!!!
  5. 5. Page5 Apache Pig – http://pig.apache.org • A high-level data-flow scripting language (Pig Latin) • Run as standalone scripts or use the interactive shell • Executes on Hadoop • Uses lazy execution Grunt shell
  6. 6. Page6 Simple and Novel Commands Pig Command Description LOAD Read data from file system STORE Write data to file system FOREACH Apply expression to each record and output 1+ records FILTER Apply predicate and remove records that do not return true GROUP/COGROUP Collect records with the same key from one or more inputs JOIN Join 2+ inputs based on a key; various join algorithms exist ORDER Sort records based on a key DISTINCT Remove duplicate records UNION Merge two data sets SPLIT Split data into 2+ sets based on filter conditions STREAM Send all records through a user provided executable SAMPLE Read a random sample of the data LIMIT Limit the number of records
  7. 7. Page7 Executing Scripts in Ambari Pig View
  8. 8. Page8 Apache Hive – http://hive.apache.org • Data warehouse system for Hadoop • Create schema/table definitions that point to data in HDFS • Treat your data in Hadoop as tables • SQL 92 • Interactive queries at scale
  9. 9. Page9 Hive’s Alignment with SQL SQL Datatypes SQL Semantics INT SELECT, LOAD, INSERT from query TINYINT/SMALLINT/BIGINT Expressions in WHERE and HAVING BOOLEAN GROUP BY, ORDER BY, SORT BY FLOAT CLUSTER BY, DISTRIBUTE BY DOUBLE Sub-queries in FROM clause STRING GROUP BY, ORDER BY BINARY ROLLUP and CUBE TIMESTAMP UNION ARRAY, MAP, STRUCT, UNION LEFT, RIGHT and FULL INNER/OUTER JOIN DECIMAL CROSS JOIN, LEFT SEMI JOIN CHAR Windowing functions (OVER, RANK, etc.) VARCHAR Sub-queries for IN/NOT IN, HAVING DATE EXISTS / NOT EXISTS INTERSECT, EXCEPT
  10. 10. Page10 Hive Query Process (1) User issues SQL query via Web UI, JDBC/ODBC, or CLI (2) HiveServer2 (compiler, optimizer, executor) parses and plans the query, backed by the Hive MetaStore (MySQL, PostgreSQL, Oracle) (3) Query converted to a MapReduce, Tez, or Spark job and executed on Hadoop with data-local processing
  11. 11. Page11 Submitting Hive Queries – CLI and GUI Tools
  12. 12. Page12 Submitting Hive Queries – Ambari Hive View
  13. 13. Page13 Apache Spark – http://spark.apache.org  A data access engine for fast, large-scale data processing  Designed for iterative in-memory computations and interactive data mining  Provides expressive multi-language APIs for Scala, Java, R and Python  Data workers can use built-in libraries to rapidly iterate over data for: – ETL – Machine learning – SQL workloads – Stream processing – Graph computations
  14. 14. Page14 Spark Executors & Cluster Deployment Options  Responsible for all application workload processing – The "workers" of a Spark application – Includes the SparkContext serving as the "master" • Schedules tasks • Pre-created in shells & notebooks  Exist for the life of the application  Standalone mode and cluster options – YARN – Mesos (diagram: multiple executors spread across an HDP cluster)
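A quick aside on tuning those executors: below is a minimal Scala sketch, not from the deck, showing how executor count and sizing could be set programmatically (the application name and values are illustrative; in spark-shell or Zeppelin the SparkContext is pre-created, so these settings would normally be passed to spark-submit instead).

    import org.apache.spark.{SparkConf, SparkContext}

    // illustrative values only; tune per workload and cluster capacity
    val conf = new SparkConf()
      .setAppName("otpc-demo")               // hypothetical application name
      .set("spark.executor.instances", "4")  // executors requested (YARN mode)
      .set("spark.executor.memory", "2g")    // heap per executor
      .set("spark.executor.cores", "2")      // concurrent task slots per executor
    val sc = new SparkContext(conf)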
  15. 15. Page15 Spark SQL Overview  A module built on top of Spark Core  Provides a programming abstraction for distributed processing of large-scale structured data in Spark  Data is described as a DataFrame with rows, columns and a schema  Data manipulation and access is available with two mechanisms – SQL Queries – DataFrames API
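To make the two access mechanisms concrete, here is a minimal sketch (Scala, Spark 1.x style matching the rest of the deck) that answers the same question once with a SQL query and once with the DataFrames API; it assumes the flight Hive table used in later slides and a pre-created hiveContext.

    // mechanism 1: SQL query against the Hive table
    val viaSql = hiveContext.sql(
      "SELECT origin, COUNT(*) AS cnt FROM flight GROUP BY origin")

    // mechanism 2: DataFrames API
    val viaApi = hiveContext.table("flight").groupBy("origin").count()

Both produce the same DataFrame; which style to use is largely a matter of taste.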
  16. 16. Page16 The DataFrame Visually (diagram: pre-existing Hive data files on HDFS are read by Spark SQL and converted to RDD partitions, represented logically as a DataFrame with columns Col(1) … Col(n))
  17. 17. Page17 Apache Zeppelin – http://zeppelin.apache.org
  18. 18. Page18 Still Based on MapReduce Principles sc.textFile("/some-hdfs-data") .flatMap(lambda line: line.split(" ")) .map(lambda word: (word, 1)) .reduceByKey(lambda a, b: a + b, numPartitions=3) .collect() (diagram traces the pipeline textFile → flatMap → map → reduceByKey → collect and the RDD type at each stage, ending in an Array[(String, Int)])
  19. 19. Page19 ETL Requirements • Error handling • Alerts / notifications • Logging • Lineage & job statistics • Administration • Reusability • Performance & scalability • Source code management • Read/write multiple file formats and persistent stores • Source-to-target mappings • Data profiling • Data quality • Common processing functionality • Custom business rules injection • Merging changed records
  20. 20. Page20 ETL vs ELT Source: http://www.softwareadvice.com/resources/etl-vs-elt-for-your-data-warehouse/
  21. 21. Page21 File Formats The ability to read & write many different file formats is critical • Delimited values (comma, tab, etc) • XML • JSON • Avro • Parquet • ORC • Esoteric formats such as EBCDIC and compact RYO solutions
  22. 22. Page22 File Formats: Delimited Values Delimited datasets are very commonplace in Big Data clusters Simple example file: catalog.del Programming Pig|Alan Gates|23.17|2016 Apache Hive Essentials|Dayong Du|39.99|2015 Spark in Action|Petar Zecevic|41.24|2016
  23. 23. Page23 Pig Code for Delimited File book_catalog = LOAD '/otpc/ff/del/data/catalog.del' USING PigStorage('|') AS (title:chararray, author:chararray, price:float, year:int); DESCRIBE book_catalog; DUMP book_catalog;
  24. 24. Page24 Pig Output for Delimited File book_catalog: {title: chararray,author: chararray,price: float,year: int} (Programming Pig,Alan Gates,23.17,2016) (Apache Hive Essentials,Dayong Du,39.99,2015) (Spark in Action,Petar Zecevic,41.24,2016)
  25. 25. Page25 Hive Code for Delimited File CREATE EXTERNAL TABLE book_catalog_pipe( title string, author string, price float, year int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/otpc/ff/del/data';
  26. 26. Page26 Hive Output for Delimited File Schema desc book_catalog_pipe; | col_name | data_type | comment | +-----------+------------+----------+--+ | title | string | | | author | string | | | price | float | | | year | int | |
  27. 27. Page27 Hive Output for Delimited File Contents SELECT * FROM book_catalog_pipe; | title | author | price | year | +-----------+---------+-------+------+--+ | Progra... | Alan... | 23.17 | 2016 | | Apache... | Dayo... | 39.99 | 2015 | | Spark ... | Peta... | 41.24 | 2016 |
  28. 28. Page28 Spark Code for Delimited File val catalogRDD = sc.textFile( "hdfs:///otpc/ff/del/data/catalog.del") case class Book(title: String, author: String, price: Float, year: Int) val catalogDF = catalogRDD .map(b => b.split('|')) .map(b => Book(b(0), b(1), b(2).toFloat, b(3).toInt)) .toDF()
  29. 29. Page29 Spark Output for Delimited File Schema catalogDF.printSchema() root |-- title: string (nullable = true) |-- author: string (nullable = true) |-- price: float (nullable = false) |-- year: integer (nullable = false)
  30. 30. Page30 Spark Output for Delimited File Contents catalogDF.show() | title| author|price|year| +--------------+-----------+-----+----+ |Programming...| Alan Gates|23.17|2016| |Apache Hive...| Dayong Du|39.99|2015| |Spark in Ac...|Petar Ze...|41.24|2016|
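For completeness, the delimited file can also be read declaratively instead of via an RDD plus a case class; a minimal sketch, assuming the third-party spark-csv package (com.databricks:spark-csv) is on the classpath.

    // column names supplied explicitly since catalog.del has no header row
    val catalogDF = sqlContext
      .read
      .format("com.databricks.spark.csv")
      .option("delimiter", "|")        // pipe-delimited input
      .option("inferSchema", "true")   // let the reader infer float/int types
      .load("hdfs:///otpc/ff/del/data/catalog.del")
      .toDF("title", "author", "price", "year")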
  31. 31. Page31 File Formats: XML Simple example file: catalog.xml <CATALOG> <BOOK> <TITLE>Programming Pig</TITLE> <AUTHOR>Alan Gates</AUTHOR> <PRICE>23.17</PRICE> <YEAR>2016</YEAR> </BOOK> <!-- other 2 BOOKs not shown --> </CATALOG>
  32. 32. Page32 Pig Code for XML File raw = LOAD '/otpc/ff/xml/catalog.xml' USING XMLLoader('BOOK') AS (x:chararray); formatted = FOREACH raw GENERATE XPath(x, 'BOOK/TITLE') AS title, XPath(x, 'BOOK/AUTHOR') AS author, (float) XPath(x, 'BOOK/PRICE') AS price, (int) XPath(x, 'BOOK/YEAR') AS year;
  33. 33. Page33 Hive Code for XML File CREATE EXTERNAL TABLE book_catalog_xml(str string) LOCATION '/otpc/ff/xml/flat'; CREATE TABLE book_catalog STORED AS ORC AS SELECT xpath_string(str,'BOOK/TITLE') AS title, xpath_string(str,'BOOK/AUTHOR') AS author, xpath_float( str,'BOOK/PRICE') AS price, xpath_int( str,'BOOK/YEAR') AS year FROM book_catalog_xml;
  34. 34. Page34 Spark Code for XML File val df = sqlContext .read .format("com.databricks.spark.xml") .option("rowTag", "BOOK") .load("/otpc/ff/xml/catalog.xml")
  35. 35. Page35 File Formats: JSON Simple example file: catalog.json {"title":"Programming Pig", "author":"Alan Gates", "price":23.17, "year":2016} {"title":"Apache Hive Essentials", "author":"Dayong Du", "price":39.99, "year":2015} {"title":"Spark in Action", "author":"Petar Zecevic", "price":41.24, "year":2016}
  36. 36. Page36 Pig Code for JSON File book_catalog = LOAD '/otpc/ff/json/data/catalog.json' USING JsonLoader('title:chararray, author:chararray, price:float, year:int');
  37. 37. Page37 Hive Code for JSON File CREATE EXTERNAL TABLE book_catalog_json( title string, author string, price float, year int) ROW FORMAT SERDE 'o.a.h.h.d.JsonSerDe' STORED AS TEXTFILE LOCATION '/otpc/ff/json/data';
  38. 38. Page38 Spark Code for JSON File val df = sqlContext .read .format("json") .load("/otpc/ff/json/catalog.json")
  39. 39. Page39 WINNER: File Formats
  40. 40. Page40 Data Set for Examples Reliance on Hive Metastore for Pig and Spark SQL
  41. 41. Page41 Source to Target Mappings The classic ETL need to map one dataset to another includes these scenarios Column Presence Action Source and Target Move data from source column to target column (could be renamed, cleaned, transformed, etc) Source, not in Target Ignore this column Target, not in Source Implies a hard-coded or calculated value will be inserted or updated
  42. 42. Page42 Source to Target Mappings Use Case Create new dataset from airport_raw • Change column names • airport_code to airport_cd • airport to name • Carry over as named • city, state, country • Exclude • latitude and longitude • Hard-code new field • gov_agency as ‘FAA’
  43. 43. Page43 Pig Code for Data Mapping src_airport = LOAD 'airport_raw' USING o.a.h.h.p.HCatLoader(); tgt_airport = FOREACH src_airport GENERATE airport_code AS airport_cd, airport AS name, city, state, country, 'FAA' AS gov_agency:chararray; DESCRIBE tgt_airport; DUMP tgt_airport;
  44. 44. Page44 Pig Output for Data Mapping tgt_airport: {airport_cd: chararray, name: chararray, city: chararray, state: chararray, country: chararray, gov_agency: chararray} (00M,Thigpen,BaySprings,MS,USA,FAA) (00R,LivingstonMunicipal,Livingston,TX,USA,FAA) (00V,MeadowLake,ColoradoSprings,CO,USA,FAA)
  45. 45. Page45 Hive Code for Data Mapping CREATE TABLE tgt_airport STORED AS ORC AS SELECT airport_code AS airport_cd, airport AS name, city, state, country, 'FAA' AS gov_agency FROM airport_raw;
  46. 46. Page46 Hive Output for Mapped Schema | col_name | data_type | comment | +------------+------------+----------+--+ | airport_cd | string | | | name | string | | | city | string | | | state | string | | | country | string | | | gov_agency | string | |
  47. 47. Page47 Hive Output for Mapped Contents SELECT * FROM tgt_airport; | _cd | name | city | st | cny | g_a | +-----+---------+---------+----+-----+-----+ | 00M | Thig... | BayS... | MS | USA | FAA | | 00R | Livi... | Livi... | TX | USA | FAA | | 00V | Mead... | Colo... | CO | USA | FAA |
  48. 48. Page48 Spark Code for Data Mapping val airport_target = hiveContext .table("airport_raw") .drop("latitude").drop("longitude") .withColumnRenamed("airport_code", "airport_cd") .withColumnRenamed("airport", "name") .withColumn("gov_agency", lit("FAA"))
  49. 49. Page49 Spark Output for Mapped Schema root |-- airport_cd: string (nullable = true) |-- name: string (nullable = true) |-- city: string (nullable = true) |-- state: string (nullable = true) |-- country: string (nullable = true) |-- gov_agency: string (nullable = false)
  50. 50. Page50 Spark Output for Mapped Contents airport_target.show() |_cd| name| city|st|cny|g_a| +---+-------------+-------------+--+---+---+ |00M| Thigpen| BaySprings|MS|USA|FAA| |00R|Livingston...|Livingston...|TX|USA|FAA| |00V| MeadowLake|ColoradoSp...|CO|USA|FAA|
  51. 51. Page51 WINNER: Source to Target Mapping
  52. 52. Page52 Data Quality DQ is focused on detecting/correcting/enhancing input data • Data type conversions / casting • Numeric ranges • Currency validation • String validation • Leading / trailing spaces • Length • Formatting (ex: SSN and phone #s) • Address validation / standardization • Enrichment
  53. 53. Page53 Numeric Validation Use Case Validate latitude and longitude values from airport_raw • Convert them from string to float • Verify these values are within normal ranges Attribute Min Max latitude -90 +90 longitude -180 +180
  54. 54. Page54 Pig Code for Numeric Validation src_airport = LOAD 'airport_raw' USING HCatLoader(); aprt_cnvrtd = FOREACH src_airport GENERATE airport_code, (float) latitude, (float) longitude; ll_not_null = FILTER aprt_cnvrtd BY ( NOT ( (latitude IS NULL) OR (longitude IS NULL) ) ); valid_airports = FILTER ll_not_null BY (latitude <= 70) AND (longitude >= -170);
  55. 55. Page55 Hive Code for Numeric Validation CREATE TABLE airport_stage STORED AS ORC AS SELECT airport_code, CAST(latitude AS float) AS latitude, CAST(longitude AS float) AS longitude FROM airport_raw; CREATE TABLE airport_final STORED AS ORC AS SELECT * FROM airport_stage WHERE latitude BETWEEN -80 AND 70 AND longitude BETWEEN -170 AND 180;
  56. 56. Page56 Spark Code for Numeric Validation val airport_validated = hiveContext .table("airport_raw") .selectExpr("airport_code", "cast(latitude as float) latitude", "cast(longitude as float) longitude") .filter("latitude is not null") .filter("longitude is not null") .filter("latitude <= 70") .filter("longitude >= -170")
  57. 57. Page57 String Validation Use Case Validate city values from airport_raw • Trim any leading / trailing spaces • Truncate any characters beyond the first 30
  58. 58. Page58 Pig Code for String Validation src_airport = LOAD 'airport_raw' USING HCatLoader(); valid_airports = FOREACH src_airport GENERATE airport_code, airport, SUBSTRING(TRIM(city),0,30) AS city, state, country;
  59. 59. Page59 Hive Code for String Validation CREATE TABLE airport_final STORED AS ORC AS SELECT airport_code, airport, SUBSTR(TRIM(city),1,30) AS city, state, country FROM airport_raw;
  60. 60. Page60 Spark Code for String Validation val airport_validated = hiveContext .table("airport_raw") .withColumnRenamed("city", "city_orig") .withColumn("city", substring( trim($"city_orig"),1,30)) .drop("city_orig")
  61. 61. Page61 WINNER: Data Quality
  62. 62. Page62 Data Profiling Technique used to examine data for different purposes such as determining accuracy and completeness – drives DQ improvements • Numbers of records – including null counts • Avg / max lengths • Distinct values • Min / max values • Mean • Variance • Standard deviation
  63. 63. Page63 Data Profiling with Pig Coupled with Apache DataFu generates statistics such as the following Column Name: sales_price Row Count: 163794 Null Count: 0 Distinct Count: 1446 Highest Value: 70589 Lowest Value: 1 Total Value: 21781793 Mean Value: 132.98285040966093 Variance: 183789.18332067598 Standard Deviation: 428.7064069041609
  64. 64. Page64 Data Profiling with Hive Column-level statistics for all data types |col_name|min|max|nulls|dist_ct| +--------+---+---+-----+-------+ |air_time| 0|757| 0| 316| |col_name|dist_ct|avgColLn|mxColLn| +--------+-------+--------+-------+ | city | 2535| 8.407| 32|
  65. 65. Page65 Data Profiling with Spark Inherent statistics for numeric data types |summary| air_time| +-------+-----------------+ | count| 2056494| | mean|103.9721783773743| | stddev|67.42112792270458| | min| 0| | max| 757|
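The statistics shown above match the output of the DataFrame describe() method; a minimal sketch, assuming the flight Hive table referenced elsewhere in the deck.

    // count, mean, stddev, min and max for the named numeric column(s)
    val airTimeStats = hiveContext
      .table("flight")
      .describe("air_time")
    airTimeStats.show()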
  66. 66. Page66 WINNER: Data Profiling
  67. 67. Page67 Core Processing Functionality Expected features to enable data transformation, cleansing & enrichment • Filtering / splitting • Sorting • Lookups / joining • Union / distinct • Aggregations / pivoting • SQL support • Analytical functions
  68. 68. Page68 Filtering Examples; Pig, Hive & Spark tx_arprt = FILTER arprt BY state == 'TX'; SELECT * FROM arprt WHERE state = 'TX'; val txArprt = hiveContext .table("arprt") .filter("state = 'TX'")
  69. 69. Page69 Sorting Examples; Pig, Hive & Spark srt_flight = ORDER flight BY dep_delay DESC, unique_carrier, flight_num; SELECT * FROM flight ORDER BY dep_delay DESC, unique_carrier, flight_num; val longestDepartureDelays = hiveContext .table("flight").sort($"dep_delay".desc, $"unique_carrier", $"flight_num")
  70. 70. Page70 Joining with Pig jnRslt = JOIN flights BY tail_num, planes BY tail_number; prettier = FOREACH jnRslt GENERATE flights::flight_date AS flight_date, flights::tail_num AS tail_num, -- plus other 17 "flights" attribs planes::year AS plane_built; -- ignore other 8 "planes" attribs
  71. 71. Page71 Joining with Hive SELECT F.*, P.year AS plane_built FROM flight F, plane P WHERE F.tail_num = P.tail_number;
  72. 72. Page72 Joining with Spark val flights = hiveContext.table("flight") val planes = hiveContext.table("plane") .select("tail_number", "year") .withColumnRenamed("year", "plane_built") val augmented_flights = flights .join(planes) .where($"tail_num" === $"tail_number") .drop("tail_number")
  73. 73. Page73 Pig Code for Distinct planes = LOAD 'plane' USING HCatLoader(); rotos = FILTER planes BY aircraft_type == 'Rotorcraft'; makers = FOREACH rotos GENERATE manufacturer; distinct_makers = DISTINCT makers; DUMP distinct_makers;
  74. 74. Page74 Pig Output for Distinct (BELL) (SIKORSKY) (AGUSTA SPA) (AEROSPATIALE) (COBB INTL/DBA ROTORWAY INTL IN)
  75. 75. Page75 Hive Code for Distinct SELECT DISTINCT(manufacturer) FROM plane WHERE aircraft_type = 'Rotorcraft';
  76. 76. Page76 Hive Output for Distinct | manufacturer | +---------------------------------+ | AEROSPATIALE | | AGUSTA SPA | | BELL | | COBB INTL/DBA ROTORWAY INTL IN | | SIKORSKY |
  77. 77. Page77 Spark Code for Distinct val rotor_makers = hiveContext .table("plane") .filter("aircraft_type = 'Rotorcraft'") .select("manufacturer") .distinct()
  78. 78. Page78 Spark Output for Distinct | manufacturer| +--------------------+ | BELL| | SIKORSKY| | AGUSTA SPA| | AEROSPATIALE| |COBB INTL/DBA ROT...|
  79. 79. Page79 Pig Code for Aggregation flights = LOAD 'flight' USING HCatLoader(); reqd_cols = FOREACH flights GENERATE origin, dep_delay; by_orig = GROUP reqd_cols BY origin; avg_delay = FOREACH by_orig GENERATE group AS origin, AVG(reqd_cols.dep_delay) AS avg_dep_delay; srtd_delay = ORDER avg_delay BY avg_dep_delay DESC; top5_delay = LIMIT srtd_delay 5; DUMP top5_delay;
  80. 80. Page80 Pig Output for Aggregation (PIR,49.5) (ACY,35.916666666666664) (ACK,25.558333333333334) (CEC,23.40764331210191) (LMT,23.40268456375839)
  81. 81. Page81 Hive Code for Aggregation SELECT origin, AVG(dep_delay) AS avg_dep_delay FROM flight GROUP BY origin ORDER BY avg_dep_delay DESC LIMIT 5;
  82. 82. Page82 Hive Output for Aggregation | origin | avg_dep_delay | +---------+---------------------+ | PIR | 49.5 | | ACY | 35.916666666666664 | | ACK | 25.558333333333334 | | CEC | 23.40764331210191 | | LMT | 23.40268456375839 |
  83. 83. Page83 Spark Code for Aggregation val sorted_orig_timings = hiveContext .table("flight") .select("origin", "dep_delay") .groupBy("origin").avg() .withColumnRenamed("avg(dep_delay)", "avg_dep_delay") .sort($"avg_dep_delay".desc) sorted_orig_timings.show(5)
  84. 84. Page84 Spark Output for Aggregation |origin| avg_dep_delay| +------+------------------+ | PIR| 49.5| | ACY|35.916666666666664| | ACK|25.558333333333334| | CEC| 23.40764331210191| | LMT| 23.40268456375839|
  85. 85. Page85 WINNER: Core Processing Functionality
  86. 86. Page86 Custom Business Logic Implemented via User Defined Functions (UDF) • Pig and Hive • Write Java and compile to a JAR • Register JAR • Hive can administratively pre-register UDFs at the database level • Spark can wrap functions at runtime from pyspark.sql.functions import udf from pyspark.sql.types import IntegerType get_year = udf(lambda x: int(x[:4]), IntegerType()) df1.select(get_year(df1["date"]).alias("year"), df1["product"]) .collect() input df1: +----------+-------+ | date|product| +----------+-------+ |2015-03-12|toaster| |2015-04-12| iron| |2014-12-31| fridge| |2015-02-03| cup| +----------+-------+ result: +----+-------+ |year|product| +----+-------+ |2015|toaster| |2015| iron| |2014| fridge| |2015| cup| +----+-------+
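The snippet above is Python (see the editor's note for that slide); an equivalent minimal sketch in Scala, assuming a DataFrame df1 with the same date and product columns already exists.

    import org.apache.spark.sql.functions.udf

    // wrap an ordinary Scala function as a UDF at runtime
    val getYear = udf((date: String) => date.substring(0, 4).toInt)
    df1.select(getYear(df1("date")).alias("year"), df1("product")).show()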
  87. 87. Page87 WINNER: Custom Business Logic
  88. 88. Page88 Mutable Data – Merge & Replace See my preso and video on this topic • http://www.slideshare.net/lestermartin/mutable-data-in-hives-immutable-world • https://www.youtube.com/watch?v=EUz6Pu1lBHQ Ingest – bring over the incremental data Reconcile – perform the merge Compact – replace the existing data with the newly merged content Purge – cleanup & prepare to repeat
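The linked preso implements this pattern in Hive; as an illustration only, here is a minimal Spark SQL sketch of the Reconcile step under assumed column names (a business key id and a last_modified timestamp; the table names are likewise illustrative).

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    val base        = hiveContext.table("base_table")         // existing data
    val incremental = hiveContext.table("incremental_table")  // newly ingested changes

    // keep only the newest version of each record across both inputs
    val w = Window.partitionBy("id").orderBy(col("last_modified").desc)
    val reconciled = base.unionAll(incremental)
      .withColumn("rn", row_number().over(w))
      .filter(col("rn") === 1)
      .drop("rn")
    // the Compact step would then overwrite the existing data with "reconciled"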
  89. 89. Page89 WINNER: Mutable Data
  90. 90. Page90 Performance Scalability is based on the size of the cluster • Tez and Spark have MR optimizations • Can link multiple maps and reduces together without having to write intermediate data to HDFS • Every reducer does not require a map phase • Hive and Spark SQL have query optimizers • Spark has the edge • Caching data to memory can avoid extra reads from disk • Resources dedicated for entire life of the application • Scheduling of tasks from 15-20s to 15-20ms
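As a small illustration of the caching point, a minimal sketch assuming the flight table from earlier slides.

    // mark the DataFrame for in-memory caching; it is materialized on the first action
    val flights = hiveContext.table("flight").cache()
    flights.count()                           // first action reads from disk and fills the cache
    flights.groupBy("origin").count().show()  // later actions reuse the cached partitions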
  91. 91. Page91 WINNER: Performance
  92. 92. Page92 Recommendations Review ALL THREE frameworks back at “your desk” Decision Criteria… •Existing investments •Forward-looking beliefs •Adaptability & current skills •It’s a “matter of style” •Polyglot programming is NOT a bad thing!! Share your findings via blogs and local user groups
  93. 93. Page93 Questions? Lester Martin – Hadoop/Spark Trainer & Consultant lester.martin@gmail.com http://lester.website (links to blog, twitter, github, LI, FB, etc) THANKS FOR YOUR TIME!!

Editor's Notes

  • Despite the “Data Science and Machine Learning” track, this is NOT a talk on DS or ML. Sorry!! ;-)

    Target audience is someone new to at least one of these processing frameworks who wants some initial info to compare/contrast with.
  • Originally created by Yahoo!

    Pigs eat anything
    Pig can process any data, structured or unstructured
    Pigs live anywhere
    Pig can run on any parallel data processing framework, so Pig scripts do not have to run just on Hadoop
    Pigs are domestic animals
    Pig is designed to be easily controlled and modified by its users
    Pigs fly
    Pig is designed to process data quickly
  • Callouts are that connections are maintained by HiveServer2, but all real processing happens on the worker nodes in the grid
  • Use familiar command-line and SQL GUI tools just as with “normal” RDBMS technologies

    Beeline

    HiveServer2 (introduced in Hive 0.11) has its own CLI called Beeline. HiveCLI is now deprecated in favor of Beeline, as HiveCLI lacks the multi-user, security, and other capabilities of HiveServer2. Beeline is started with the JDBC URL of the HiveServer2, which depends on the address and port where HiveServer2 was started. By default, it will be (localhost:10000), so the address will look like jdbc:hive2://localhost:10000.

    GUI Tools

    Open source SQL tools used to query Hive include:
    Ambari Hive View (http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.0.0/bk_ambari_views_guide/content/)
    Zeppelin (https://zeppelin.incubator.apache.org/)
    DBVisualizer (https://www.dbvis.com/download/)
    Dbeaver (http://dbeaver.jkiss.org/)
    SquirrelSQL (http://squirrel-sql.sourceforge.net/)
    SQLWorkBench (http://www.sql-workbench.net/)
  • This is Hortonworks' preferred tool over Hue

    Ambari includes a built-in set of Views that are pre-deployed for you to use with your cluster. The Hive View is designed to help you author, execute, understand, and debug Hive queries.
    You can:
    Browse databases
    Write and execute queries
    Manage query execution jobs and history
  • Spark allows you to do data processing, ETL, machine learning, stream processing, SQL querying from one framework

  • The Spark executor is the component that performs the map and reduce tasks of a Spark application, and is sometimes referred to as a Spark "worker." Once created, executors exist for the life of the application.

    Note: In the context of Spark, the SparkContext is the "master" and executors are the "workers." However, in the context of HDP in general, you also have "master" nodes and "worker" nodes. Both uses of the term worker are correct - in terms of HDP, the worker (node) can run one or more Spark workers (executors). When in doubt, make sure to verify whether the worker being described is an HDP node or a Spark executor running on an HDP node.

    Spark executors function as interchangeable work spaces for Spark application processing. If an executor is lost while an application is running, all tasks assigned to it will be reassigned to another executor. In addition, any data lost will be recomputed on another executor.

    Executor behavior can be controlled programmatically. Configuring the number of executors and their resources available can greatly increase performance of an application when done correctly.
  • Spark SQL is a module that is built on top of Spark Core. Spark SQL provides another level of abstraction for declarative programming on top of Spark.

    In Spark SQL, data is described as a DataFrame with rows, columns and a schema. Data manipulation in Spark SQL is available via SQL queries, and DataFrames API. Spark SQL is being used more and more at the enterprise level. Spark SQL allows the developer to focus even more on the business logic and even less on the plumbing, and even some of the optimizations.
  • The image above shows what a data frame looks like visually. Much like Hive, a DataFrame is a set of metadata that sits on top of an RDD. The RDD can be created from many file types. A DataFrame is conceptually equivalent to a table in traditional data warehousing.
  • Zeppelin has four major functions: data ingestion, discovery, analytics, and visualization. It comes with built-in examples that demonstrate these capabilities. These examples can be reused and modified for real-world scenarios.
  • Pointing out that even the Spark RDD API has "map" and "reduce" method names. Pig and Hive execute as MapReduce (even if on Tez (or Spark)).

    Also points out that the examples in this preso use DF instead of RDD APIs as they focus on “what you want to do” instead of on “how exactly things should be done”.
  • Covering the list on the left, but mostly NOT covering the one on the right (will discuss perf/scale). Here are some thoughts on these additional requirements
    Error Handling: generally done via creating datasets with issues (ex: reject file) and restartability is handled by frameworks – have to RYO for something that doesn't rerun from the beginning.
    Alerts, Logging, Job Stats: system-oriented solutions, but have to RYO for a custom soln
    Lineage: rely on other tools such as CDH’s Navigator or HDP’s Atlas
    Admin: common workflow & scheduling options for all jobs and frameworks
  • Land the raw data first – Bake it as needed (aka Schema on Read).

    Tools such as Sqoop, Flume, NiFi, Storm, Spark Streaming and custom integrations take care of the Extract and Load – this preso focuses on Transformation processing.
  • Just showing examples of del, xml and json in the slides
  • Yep… a bit gnarly…
  • NOT showing output slides as the output is (basically) the SAME as the delimited output
  • Have to FLATTEN the XML first and then do a CTAS against it to get rid of XPATH stuff.
  • NOT showing output slides as the output is (basically) the SAME as the delimited output
  • TIE! Spark shines in the file formats that have included schema (Pig & Hive have to regurgitate the schema def), but it doesn’t work all that well with simple delimited files. All in all, they all can read & write a variety of file formats.
  • No clear winner: all address this req in a straightforward manner.
  • Just showing examples of numeric and string validations in the slides
  • See github project notes – had to fudge the numbers since all were already valid
  • See github project notes – had to fudge the numbers since all were already valid
  • See github project notes – had to fudge the numbers since all were already valid
  • IMHO, Hive really is not the tool for a series of data testing and conforming logic due to its need to continually build tables for the output of each step along the way.

    Pig and Spark tackle this more appropriately (again, my opinion). Both offer crisp and elegant solutions with the difference really being a matter of style.

    See orc-ddl.hql for example of a full, yet very simple, DQ implementation used for rest of examples.
  • CALL OUT THE orc-ddl.hql SCRIPT FOR THE CLEANSED DATA MODEL
  • DataFu came from LinkedIn
  • With DataFu and a bit of coding, Pig can satisfy baseline statistical functions.

    Hive and Spark present different data natively and coupling their results can satisfy base statistics.
  • This gets a bit "chatty" in Pig Latin
  • Determine the top 5 longest average dep_delay values by aggregating the origin airport for all flight records.
  • PIR = Pierre,SD
    ACY = Atlantic City, NJ
    ACK = Nantucket, MA
    CEC = Crescent City, CA
    LMT = Klamath Falls, OR
  • Determine the top 5 longest average dep_delay values by aggregating the origin airport for all flight records.
  • PIR = Pierre,SD
    ACY = Atlantic City, NJ
    ACK = Nantucket, MA
    CEC = Crescent City, CA
    LMT = Klamath Falls, OR
  • Determine the top 5 longest average dep_delay values by aggregating the origin airport for all flight records.
  • PIR = Pierre,SD
    ACY = Atlantic City, NJ
    ACK = Nantucket, MA
    CEC = Crescent City, CA
    LMT = Klamath Falls, OR
  • Hive is the slight winner as everyone knows the "language of SQL" and these basic operations are very well known.

    Pig and Spark are both equally capable in these spaces, but I fear the masses will think they are a bit "chatty" to get the job done.

    As is often the case, it is a matter of style.
  • For grins… this code snippet is with Python instead of Scala
  • Spark is 1st for how easy it is to surface a UDF. Hive is 2nd due to being able to publish UDFs to a database.
  • "Mutable Data in an Immutable World" is hard for ALL, but Hive edges out with its growing "transactions" features; https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
