Large Scale Data Analysis Tools

boorad

My talk on Hadoop, Storm, and other big data tools - DevNexus - 3/21/2012

1. Large Scale Data Analysis Tools Brad Anderson brad@scalingdata.com @boorad

2. shameless borrowing http://codahale.com/codeconf-2011-04-09-metrics-metrics-everywhere.pdf

3. I crunch data.

4. data

5. data → business value

6. What the hell is business value?

7. Business value is anything which makes people more likely to give us money.

8. shopping cart analysis

9. mobile device tracking

10. Business value is anything which saves us money.

11. smart grid substations

12. healthcare

13. We want to generate more business value.

14. ever-growing sources of big data

15. web logs

16. mobile devices

17. sensors rfid tags smart meters ocean buoys

18. parsing terabytes of noise to get a megabyte of signal http://www.kaushik.net/avinash/big-data-imperative-driving-big-action/

19.

20. How did we get here?

21. your data doesn’t fit in local memory

22. your data doesn’t fit on local disk

23. your data doesn’t fit on one machine

24. scale up

25. $

26. SAN

27. $$

28. big db iron

29. $$$

30. business value.

31. scale out

32. move the data to the processors

33. [diagram: a single function with ten data blocks moved to it]

34. [diagram: a single function with ten data blocks moved to it] ship code not data

35. [diagram: copies of the function placed alongside the data blocks] ship code not data

36. add more machines

37. shit gets interesting

38. clusters

39. load balancers

40. distributed systems problems opportunities

41. configuration management

42. What systems do I use?

43.

44. data shape

45. query patterns

46. latency and throughput requirements

47. cassandra riak bigcouch

48. batch vs. realtime

49. Hadoop

50. hdfs mapreduce

51. ecosystem

52. Cloudera IBM Amazon EMR MapR Hortonworks EMC

53. data ingest storage querying / processing output

54. processes RDBMS batch Hadoop Cache Raw NoSQL Apps Data processes realtime Storm NoSQL

55. data ingest scribe

56. data ingest chukwa

57. data ingest flume

58. data ingest homegrown?

63. querying / processing mapreduce

64. querying / processing pig

65. querying / processing

66. querying / processing Example Pig Script

67. Equivalent MR Java code

68. querying / processing hive

69. querying / processing Example Hive Query
    FROM pv_users
    INSERT OVERWRITE TABLE pv_gender_sum
      SELECT pv_users.gender, count(DISTINCT pv_users.userid)
      GROUP BY pv_users.gender
    INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum'
      SELECT pv_users.age, count(DISTINCT pv_users.userid)
      GROUP BY pv_users.age;

70. querying / processing cascading

71. querying / processing cascalog

72. querying / processing Datameer

73. querying / processing MRv2

74. querying / processing MRv2 allows: MRv1 (of course), Spark, Bulk Synchronous Parallel, Graphs, MPI

75. querying / processing machine learning algorithms

76. querying / processing mahout

77. output flat files

78. output rdbms

79. output cache

80. output hdfs

81. realtime

82. Storm

83. streams: unbounded sequence of tuples [diagram: a row of tuples]

84. spouts: source of streams

85. spout examples: read from Kestrel queue; read from Twitter streaming API

86. bolts: process input streams and produce new streams

87. bolts: functions, filters, aggregation, joins, talk to databases

88. topologies Network of spouts and bolts

89. data → business value

90. The Unreasonable Effectiveness of Data http://bit.ly/x407Ln

91. Start small

92. But definitely start!

93. Please start!

94. Thank you.

Editor's notes

90 slides - coffee
Big Data guy - Data Scientist?
Scaling Data helps our customers tackle this new Big Data space - their whole stack

if you write applications that are JVM-based and you’re not using Metrics, you are doing it wrong
instrument your running production code to get real intelligence on what’s going on AS your running production code creates business value

At Scaling Data, people give us money for crunching data.

the reason they pay us so much money is that we crunch data that generates business value.

I thought this was going to be about big data

topline

recommendations for other complementary products, driving overall spend higher
customer classification and scoring - offer good customers deals for repeat business
transactional retargeting - abandoned shopping carts are mined, and personalized ads are returned to that specific user

cell tower data used to track where people go for lunch - identify a new restaurant site
what roads are used so we can target billboards - demand higher prices
municipal planning

cost cutting

pattern recognition in the power signature can point to imminent failure for expensive equipment

imagine a diagnosis that was cured with 17 procedures at immense cost
same diagnosis was cured with 5 procedures elsewhere
analyzing patient histories across the country / world can get us here

because we like more money...

We have even more types of data, becoming ever more complex, distributed across multiple existences, and we are left with the task of parsing out terabytes of noise to get to a megabyte of signal.

Ever more data to try to find the business value
Current tools are straining under the load (banks) - my talk last year
There is significant pain while using these big data tools - Why are they so hot now?
getting better

put it on disk in a database

SAN

even with the SAN... so you get a bigger machine

Oracle loves you for this!
37signals approach - basecamp 1 server

EMC loves you for this!

IBM, HP, Sun love (or loved) you for this!
more processors, more memory, more disk

mounting costs are not good for...

the new approach, starting about 5 years ago
NoSQL?
NewSQL?
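The "scale out" slides contrast moving the data to the processors with shipping code to the data. A minimal sketch of that difference, assuming a hypothetical three-node cluster modeled as a dictionary of in-memory partitions (the node names and numbers are invented for illustration, not taken from the talk):

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical cluster: each "node" holds one partition of the data locally.
PARTITIONS: Dict[str, List[int]] = {
    "node-a": [3, 8, 1],
    "node-b": [7, 2],
    "node-c": [9, 4, 6, 5],
}


def move_data_to_code(partitions: Dict[str, List[int]]) -> Tuple[int, int]:
    """Copy every record to one place, then compute there.

    Returns (result, number of values that had to cross the network).
    """
    shipped = [record for records in partitions.values() for record in records]
    return sum(shipped), len(shipped)


def ship_code_to_data(
    partitions: Dict[str, List[int]],
    local_fn: Callable[[List[int]], int],
) -> Tuple[int, int]:
    """Run the function where each partition lives and return only the partial results.

    Returns (result, number of values that had to cross the network).
    """
    partials = [local_fn(records) for records in partitions.values()]
    return sum(partials), len(partials)


total_moved_data = move_data_to_code(PARTITIONS)
total_shipped_code = ship_code_to_data(PARTITIONS, sum)
```

Both calls produce the same total; the second moves one partial result per node instead of every record, which is the point of "ship code not data". The sketch only works this simply because a sum can be combined from partial sums.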
so you’re sold on ‘scale out’

if you want your ops co-workers to be outside of their happy space, this is the ticket

lots of commodity hardware boxes ... racks

haproxy is a good one

things will break - fault tolerance
distribution of data - rebalancing
task coordination - leader election / masterless

reduce ops headache - Chef, Puppet

I still have the pain... I want to go forward with this

Cambrian explosion 530 million years ago
appearance of most major animal phyla
diversification of organisms as earth warms, forms different climates

small records/files
fixed schema, semi-structured, totally unstructured
column store, graph store

how will you ask for the data?
key lookup
table scan otherwise
secondary indices for oft-queried fields? mostly roll-your-own
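The three query patterns in this note (key lookup, table scan, roll-your-own secondary index) can be sketched against a plain dictionary standing in for a key-value store. The records, field names, and helper functions below are hypothetical and do not correspond to any particular database's API:

```python
from collections import defaultdict

# Hypothetical records in a key-value store, keyed by user id.
store = {
    "u1": {"name": "ann", "city": "atlanta"},
    "u2": {"name": "bob", "city": "raleigh"},
    "u3": {"name": "cat", "city": "atlanta"},
}


def key_lookup(key):
    """Direct access by primary key: the cheap path."""
    return store.get(key)


def table_scan(field, value):
    """No index on the field, so every record has to be inspected."""
    return sorted(k for k, rec in store.items() if rec.get(field) == value)


def build_secondary_index(field):
    """Hand-rolled secondary index: field value -> set of primary keys."""
    index = defaultdict(set)
    for key, rec in store.items():
        index[rec[field]].add(key)
    return index


def indexed_lookup(index, value):
    """Answer the same question as the scan, without touching every record."""
    return sorted(index.get(value, set()))


city_index = build_secondary_index("city")
```

In a real store the application would also have to keep such an index up to date on every write, which is the cost hidden behind "roll-your-own".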
per-request speed - fast = column db
number of requests - availability of reads/writes under load becomes important

cassandra - read/write speed impressive
dynamo-based clusters
very capable data stores

hadoop rules the batch world for massive data sets

probably 40-50 satellite projects that are non-core hadoop

distributions - should be matched to your use-case

data --> business value

logging only, from Facebook
kind of old and busted
but still on every Facebook server (or was at one time), so battle-tested

near-realtime: minutes
reliability: getting better with recent releases
mgmt: complicated
support: apache project

a more general data ingest tool, although it started with log files
near-realtime: seconds
reliability: best effort, store+retry on failure, and end-to-end mode that uses acks and a write ahead log.
mgmt: master or masters, then smooth from there
support: cloudera

if you have a realtime component, use Storm
it’s already distributed, reliable, easily manageable.

big files
recent performance improvements
ships with hadoop

unique for small files
performance over hdfs
snapshotting

low-latency column store
fast key-based access
also have MR to do in batch/background

time series schema for hbase
StumbleUpon

a framework for processing in parallel on large clusters
map - nodes process local data
reduce - reduces the ‘map output’ in some way (sum, count, etc)
(shuffle & sort are in between M & R)
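The map, shuffle & sort, and reduce steps described in this note can be illustrated with a single-process word count. This is a sketch of the programming model only, not Hadoop's Java API; the input splits and function names are invented for the example:

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: turn one input line into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle & sort: group all values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse the values for one key (here, a sum)."""
    return key, sum(values)


def run_job(splits):
    """Run map over every line of every split, then shuffle and reduce."""
    mapped = chain.from_iterable(
        map_phase(line) for split in splits for line in split
    )
    return dict(reduce_phase(key, values) for key, values in shuffle_and_sort(mapped))


# Hypothetical input: each inner list stands for the lines stored on one node.
splits = [["ship code not data"], ["data data data", "ship it"]]
counts = run_job(splits)
```

On a real cluster each split would be mapped on the node that stores it, and the shuffle would move the grouped pairs across the network to the reducers.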
high-level language built on top of MR
often favored for data movement, but can be used for querying / processing too

high-level language built on top of MR
striving for SQL-like language
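The example Hive query on slide 69 reads pv_users once and writes two outputs: distinct user counts per gender and distinct user counts per age. A rough Python equivalent of what it computes, assuming hypothetical rows of the form (userid, gender, age); the sample values are invented:

```python
from collections import defaultdict

# Hypothetical contents of pv_users: (userid, gender, age), one row per page view.
pv_users = [
    (1, "F", 25),
    (1, "F", 25),
    (2, "M", 31),
    (3, "F", 31),
    (2, "M", 31),
]


def summarize(rows):
    """One pass over the rows feeds both outputs, mirroring the single FROM clause."""
    users_by_gender = defaultdict(set)
    users_by_age = defaultdict(set)
    for userid, gender, age in rows:
        users_by_gender[gender].add(userid)
        users_by_age[age].add(userid)
    # count(DISTINCT userid) ... GROUP BY gender / age
    gender_sum = {gender: len(ids) for gender, ids in users_by_gender.items()}
    age_sum = {age: len(ids) for age, ids in users_by_age.items()}
    return gender_sum, age_sum


pv_gender_sum, pv_age_sum = summarize(pv_users)
```

The sets do the work of DISTINCT: a user with several page views is counted once per group.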
high-level language built on top of MR
multiple MR jobs linked together
complex query workflows

querying DSL written in Clojure

Excel-like frontend tool on top of Hadoop
spreadsheet-like interface targets business users
joins, data ingest too

released with Hadoop 0.23
split JobTracker into:
 - ResourceManager (RM)
 - ApplicationMaster (AM), which does job scheduling/monitoring
you can run different applications now (next slide)

one of the highest levels of ‘gaining insight’
Recommendation
Classification
Clustering / Segmentation
Predictive Analytics
Similarity

loose federation of machine learning algorithms that run on hadoop
Hadoop not best system for some of these, although MRv2 is now here
some algos are better than others - you have been warned

output targets of Hadoop jobs

I’m not a hater!
Great tool for 40 years

mongo, redis

back into the cluster for use in another MR job
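The Storm slides define streams as unbounded sequences of tuples, spouts as sources of streams, bolts as processors that turn input streams into new streams, and topologies as networks of spouts and bolts. A minimal single-process sketch of those ideas using Python generators; it does not use Storm's actual API, and the spout contents and bolt names are made up for illustration:

```python
from collections import Counter
from itertools import islice


def sentence_spout():
    """Spout: source of a stream. Stands in for a queue or streaming-API reader."""
    sentences = ["storm processes streams", "streams of tuples", "tuples never end"]
    while True:  # the stream is unbounded
        for sentence in sentences:
            yield (sentence,)


def split_bolt(stream):
    """Function bolt: one input tuple produces several output tuples."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def filter_bolt(stream, min_len=3):
    """Filter bolt: drop tuples whose word is shorter than min_len."""
    for (word,) in stream:
        if len(word) >= min_len:
            yield (word,)


def count_bolt(stream):
    """Aggregation bolt: emit a running count for each word seen."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


# Topology: a network of spouts and bolts, here a simple chain.
topology = count_bolt(filter_bolt(split_bolt(sentence_spout())))

# The stream never ends, so only a bounded slice is taken for inspection.
first_batch = list(islice(topology, 8))
```

Chaining generators keeps everything on one machine; the real system distributes spouts and bolts across a cluster and handles delivery guarantees, which this sketch leaves out.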
sexy, complicated algorithms are very insightful
BUT more data and a shittier / more basic algorithm wins
data can overcome “known truths” and organizational inertia

for your organization, start small
don’t bet the farm... maybe 10-15% of your analytics budget
skunkworks projects, hackers, etc.

we need more Big Data people!