Azure HDInsight

Azure Meetup, Azure HDInsight, Hadoop, Hive, Pig

  1. Big Data World With Azure HDInsight
  2. About ▪ Koray Kocabaş ▪ Data Platform (SQL Server) MVP ▪ Yemeksepeti Business Intelligence ▪ Bahcesehir University Instructor ▪ @koraykocabas ▪ https://tr.linkedin.com/in/koraykocabas ▪ Blog: http://www.misjournal.com ▪ E-Mail: koraykocabas@outlook.com
  3. Evolution of Data: Internet of Things, Web 2.0, ERP/CRM • Clickstream • Sensors / RFID / Devices • Log Files • Spatial & GPS Coordinates • Social Media • Mobile • Advertising • eCommerce • Digital Marketing • Search Marketing • Recommendations • Payables • Payroll • Inventory • Contacts • Deal Tracking • Sales Pipeline • Gigabytes, Terabytes, Petabytes, Exabytes
  4. Big Data Utility Gap • 70% of data generated by customers • 80% of data being stored • 3% being prepared for analysis • 0.5% being analyzed • < 0.5% being operationalized
  5. How Does This Work in Practice • Store Everything: obsessively collect data; keep it forever; put the data in one place • Analyze Anything: cleanse, organize and manage your data; make the right tools available; use the resources wisely to compute, analyze and understand data • Build the Right Thing: use insights to iteratively improve your product
  6. Big Data isn't meaningful • Big Data is not just data • 65+ million members • 50 countries • 1,000+ devices supported • ~25 PB data warehouse on cloud (read 10%) • ~550 billion events daily
  7. 20 million songs • 24 million active users • 8 million daily active users • 1 TB of compressed data generated from users per day • 700-node Hadoop cluster
  8. Big Data is not just data • Cannes Lions 2014 - Grand Prix - Titanium: Honda 'Sound of Honda Ayrton Senna 1989' • 2,000+ sensors, 200 GB of data per race
  9. Big Data is not just data • Boeing generates 20 TB of data per hour
  10. ~10 billion rows processed (daily) • ~750 million rows in the result set (daily)
  11. New E-Commerce Big Data Flow: Purchase, User, Product, Data Warehouse, Store It All
  12. Overview (ETL to ELT): Demand Architecture Data Loading Data Preparation Analytics Validation
  13. Problem: How can we track?
  14. Web Analytics Companies
  15. Google Analytics & Adobe Omniture • Problem 1: How can we collect data? • Problem 2: How can we store data? • Problem 3: How can we visualize data? • Problem 4: How can we predict data?
  16. (Placeholder: put the Google Analytics / Adobe example here)
  17. Azure vs Amazon: Collect, Process, Analyze, Visualize, Prediction, Store
  18. Ready to Use
  19. MOOC • Big Data Analytics, Implementing Big Data Analysis, Big Data Analytics with HDInsight, Big Data and Business Analytics Immersion, Getting Started with Microsoft Azure Machine Learning • Real World Big Data in Azure, Big Data on Amazon Web Services, Reporting with MongoDB, Cloud Business Intelligence, HDInsight Deep Dive: Storm HBase and Hive, Data Science & Hadoop Workflows at Scale With Scalding, SQL on Hadoop - Analyzing Big Data with Hive • Introduction to Big Data Analytics, Machine Learning with Big Data, Big Data Analytics for Healthcare, Data Science at Scale, The Data Scientist's Toolbox, R Programming • Master Big Data and Hadoop Step by Step, Hadoop Essentials, Hadoop Starter Kit, Data Analytics using Hadoop eco system, Big Data: How Data Analytics Is Transforming the World, Applied Data Science with R, Hadoop Enterprise Integration • Data Science and Analytics in Context, Introduction to Big Data with Spark, Data Science and Machine Learning Essentials, Machine Learning for Data Science and Analytics, Statistical Thinking for Data Science and Analytics
  20. OLTP vs Hadoop
  21. Hadoop Ecosystem
  22. One more cup of coffee • https://azure.microsoft.com/en-us/pricing/details/hdinsight/ • https://azure.microsoft.com/en-us/pricing/calculator/#
  23. Hive • Developed by Facebook; later adopted by Apache as an open source project • A data warehouse infrastructure built on top of Hadoop for data summarization, query and analysis • Integration between Hadoop and BI and visualization tools • Provides an SQL-like language called HiveQL to query data • Create Index; includes partitioning • Update not supported (isn't correct) • Hive provides users, groups and roles, but it is not designed for high security • Interfaces: console (hive>), script, ODBC/JDBC, SQuirreL, HUE, web interface, etc. • Most popular business intelligence tools support Hive
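Partitioning, mentioned on the Hive slide, keeps the rows for each partition value in their own directory, so a query that filters on the partition column only has to read the matching partitions. The following is a minimal plain-Python sketch of that idea, not Hive code; the column names (`order_date`, `city`, `amount`) and the sample rows are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical order rows: (order_date, city, amount). Not taken from the slides.
ROWS = [
    ("2015-11-01", "istanbul", 40),
    ("2015-11-01", "ankara", 25),
    ("2015-11-02", "istanbul", 30),
    ("2015-11-03", "izmir", 55),
]


def build_partitions(rows):
    """Bucket rows by order_date, similar to one directory per partition value."""
    partitions = defaultdict(list)
    for order_date, city, amount in rows:
        partitions[order_date].append((city, amount))
    return dict(partitions)


def total_for_date(partitions, order_date):
    """Sum amounts for one date while touching only that partition's rows."""
    rows = partitions.get(order_date, [])
    return sum(amount for _city, amount in rows), len(rows)


PARTITIONS = build_partitions(ROWS)
total, rows_scanned = total_for_date(PARTITIONS, "2015-11-01")
print(total, rows_scanned)
```

The second return value shows how many rows were read: only the rows of the requested partition, not the whole table.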
  24. Data Types • Primitive data types: int, bigint, float, double, boolean, decimal, string, timestamp, date, etc. • Complex data types: arrays, maps, structs • ARRAY<string>: workplace: istanbul, ankara • STRUCT<sex:string,age:int>: Female, 25 • MAP<string,int>: SOLR:92
      Hive vs. RDBMS:
      Hive                                   | RDBMS
      SQL interface                          | SQL interface
      Focus on analytics                     | May focus on online or analytics
      No transactions                        | Transactions usually supported
      Partition adds, no random inserts      | Random insert and update supported
      Distributed processing via map/reduce  | Distributed processing varies by vendor (if available)
      Scales to hundreds of nodes            | Seldom scales beyond 20 nodes
      Built for commodity hardware           | Often built on proprietary hardware (especially when scaling out)
      Low cost per petabyte                  | What's a petabyte? :) (note: Are you sure?)
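The three complex types on the Data Types slide (ARRAY, STRUCT, MAP) map closely onto a list, a record and a dictionary. The sketch below models one row in plain Python using the example values from the slide. The column names `person` and `skills` are assumptions made for illustration; only `workplace` appears in the slides.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Person:
    """Stands in for STRUCT<sex:string,age:int>."""
    sex: str
    age: int


@dataclass
class EmployeeRow:
    workplace: List[str]    # ARRAY<string>
    person: Person          # STRUCT<sex:string,age:int> (column name is hypothetical)
    skills: Dict[str, int]  # MAP<string,int> (column name is hypothetical)


row = EmployeeRow(
    workplace=["istanbul", "ankara"],
    person=Person(sex="Female", age=25),
    skills={"SOLR": 92},
)

first_workplace = row.workplace[0]  # array element by index
age = row.person.age                # struct field by name
solr_score = row.skills["SOLR"]     # map value by key
print(first_workplace, age, solr_score)
```

In HiveQL the same three lookups would be written as `workplace[0]`, `person.age` and `skills['SOLR']`, given a table with those column names.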
  25. Hive Architecture • SQL on Hadoop frameworks: Apache Hive, Impala, Presto (Facebook), EMC/Pivotal HAWQ, BigSQL by IBM
  26. OLTP vs Hive • http://hortonworks.com/wp-content/uploads/downloads/2013/08/Hortonworks.CheatSheet.SQLtoHive.pdf
  27. Pig • Originally developed at Yahoo! (huge contributions from Hortonworks, Twitter) • A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs • Processes large semi-structured data sets using Hadoop MapReduce • Write complex MapReduce jobs using a simple script language (Pig Latin) • Pig provides a number of aggregation functions (AVG, COUNT, SUM, MAX, MIN, etc.) • Developers can develop UDFs • Interfaces: console (grunt), script, Java, HUE (Hadoop User Experience by Cloudera) • Easy to use and efficient
  28. Data Types • Simple data types: int, float, double, chararray (UTF-8), bytearray • Complex data types: map (key, value), tuple, bag (list of tuples) • Commands • Loading: LOAD, STORE, DUMP • Filtering: FILTER, FOREACH, DISTINCT • Grouping: JOIN, GROUP, COGROUP, CROSS • Ordering: ORDER, LIMIT • Merging & splitting: UNION, SPLIT
  29. SQL script vs. Pig script
      SQL: SELECT * FROM TABLE
      Pig: A = LOAD 'DATA' USING PigStorage('\t') AS (col1:int, col2:int, col3:int);
      SQL: SELECT col1+col2, col3 FROM TABLE
      Pig: B = FOREACH A GENERATE col1+col2, col3;
      SQL: SELECT col1+col2, col3 FROM TABLE WHERE col3>10
      Pig: C = FILTER B BY col3>10;
      SQL: SELECT col1, col2, sum(col3) FROM X GROUP BY col1, col2
      Pig: D = GROUP A BY (col1,col2); E = FOREACH D GENERATE FLATTEN(group), SUM(A.col3);
      SQL: ... HAVING sum(col3) > 5
      Pig: F = FILTER E BY $2>5;
      SQL: ... ORDER BY col1
      Pig: G = ORDER F BY $0;
      SQL: SELECT DISTINCT col1 FROM TABLE
      Pig: I = FOREACH A GENERATE col1; J = DISTINCT I;
      SQL: SELECT col1, COUNT(DISTINCT col2) FROM TABLE GROUP BY col1
      Pig: K = GROUP A BY col1; L = FOREACH K {M = DISTINCT A.col2; GENERATE FLATTEN(group), COUNT(M);}
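To make the data flow of the Pig script on the SQL vs. Pig slide concrete, here is a plain-Python sketch of what the GROUP / FOREACH / FILTER / ORDER chain (relations D to G) and the distinct-count block (K, L) compute. It only mimics the semantics on a small in-memory list. The sample rows are made up, and nothing here runs on Hadoop or uses Pig itself.

```python
from collections import defaultdict

# Made-up stand-in for relation A: tuples of (col1, col2, col3).
A = [(1, 1, 4), (1, 1, 3), (2, 1, 2), (2, 2, 9), (1, 2, 1)]


def group_sum(rows):
    """D and E: group by (col1, col2) and sum col3 for each group."""
    totals = defaultdict(int)
    for col1, col2, col3 in rows:
        totals[(col1, col2)] += col3
    return [(col1, col2, total) for (col1, col2), total in totals.items()]


def count_distinct_col2(rows):
    """K and L: for each col1, count the distinct col2 values."""
    seen = defaultdict(set)
    for col1, col2, _col3 in rows:
        seen[col1].add(col2)
    return sorted((col1, len(values)) for col1, values in seen.items())


E = group_sum(A)
F = [row for row in E if row[2] > 5]   # HAVING sum(col3) > 5
G = sorted(F, key=lambda row: row[0])  # ORDER BY col1
L = count_distinct_col2(A)
print(G)
print(L)
```

As in Pig, each step takes the previous relation as input and produces a new one, so the whole query reads as a top-to-bottom pipeline rather than one nested SQL statement.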
  30. Methods of Creating Azure HDInsight (Azure Portal)
  31. Methods of Creating Azure HDInsight (PowerShell)
  32. Methods of Creating Azure HDInsight (.NET SDK)
  33. Methods of Creating Azure HDInsight (SSIS)
  34. Demo • Create Hadoop Cluster (HDInsight) • Create Database and Table (Hive) • Data Load (Hive) • Querying (Hive) • Analyzing Breaking Bad Subtitles (Pig)
  35. Case Study: Klout • Collects and normalizes more than 12 billion signals a day • Hive data warehouse of more than 1 trillion rows • Klout was acquired for $200 million by Lithium Technologies
  36. Is it necessary to use HDInsight or Hadoop? • Find the major problem
  37. Thank you
