The document discusses big data and cloud computing. It defines big data as large and complex data sets that are difficult to process using traditional database tools. It notes that the volume of data is growing rapidly, expected to increase over 40 times from 2010 to 2020. The document presents examples of how companies like Walmart and Target are using big data analytics in the cloud to gain business insights from their customer data.
1. Big Data & Cloud
Infinite Monkey Theorem
CloudCon Expo & Conference
October, 2012
2. First
What is Big Data?
“data sets so large and complex that it becomes
difficult to process using on-hand database
management tools.”
10/19/2012 Infochimps Confidential 2
3. Data Volume
Growing 44x
2010 = 1.2 2020 = 35.2
Zettabytes/yr Zettabytes/yr
Source: 2011 IDC Digital Universe Study
10/19/2012 Infochimps Confidential 3
5. Big Data Warehouse
Search Recommend
Rank
Analytic
Request Master: Answer
Score Next-Best-Action Name Node
Job Tracker
Ethernet Interconnect
Slave: Slave: Slave:
Task Trckr Task Trckr Task Trckr
Data Node Data Node Data Node
Semi-
.... Structured
Data
PARC | 5
6. Real
Time
Traditional Operational
Application Ecosystem
Deployment in
Analytic Public/Private Cloud
Appliances
Toolset Integration
Traditional
Decision Support Hardened
Batch
Large Small
Enterprise Enterprise
10/19/2012 Infochimps Confidential 6
7. Next
Infinite Monkey Theorem (2):
an infinite number of monkeys hitting
keys on a typewriter for a period of time
will almost surely type a given text, such
as Shakespeare”s Hamlet.
10/19/2012 Infochimps Confidential 7
9. ““
Infinite Monkey Theorem (2):
an infinite number of monkeys hitting keys
on a typewriterfor a period of time will
atypewriter for a period of time will
almost surely type a given text, such as
Shakespeare”s Hamlet.
10/19/2012 Infochimps Confidential 9
10. infinite number keys on a almost Shakespeare”s
of monkeys typewriter surely Hamlet
unlimited processing statistically insights
computational data significant
power
10/19/2012 Infochimps Confidential 10
18. Public
unlimited
computational
power
Private
Virtual
Private
10/19/2012 Infochimps Confidential 18
19. analysts use these images to
count shipping containers
coming off ships in California
and are able to get a sense of
overall US import activity
10/19/2012 Infochimps Confidential 19
20. Public
data
processing
Private
Virtual
Private
10/19/2012 Infochimps Confidential 20
26. Cars
In Lot
News
Text
Web
Pricing Quarterly
Revenue
Prediction
Social
Sentiment
Weather
Sensors
Local
Employment
10/19/2012 Infochimps Confidential 26
27. Public
insights
Private
Virtual
Private
10/19/2012 Infochimps Confidential 27
28. New Media
Data Scientist App Developer
Gnip
Powertrack
Business Users
Gnip
EDC
Sources Sentiment
Moreover
Metabase
In-Motion
Data Delivery APIs Listening
Service Application
TV
Transcription
NoSQL
Radio
Transcription
Print
Transcription
IT Staff
Traditional Media
10/19/2012 Infochimps Confidential 28
29. unlimited processing statistically insights
computational data significant
power
10/19/2012 Infochimps Confidential 29
AvinashKaushik gave a talk at Strata 2012 in Santa Clara in March.If you listen to all the hype of Big Data, it solves for the first problem.If you listen to all the vendors, there is a lot of emphasis on the first part (perhaps Infochimps included), and very little on the second.I think that’s because we don’t exactly know how to truly empower the organization to interact directly with any/all data available.It’s too expensive, risky, complex.
40%+ YoY growth with 2012 generating 2.4Zettabytes alone.http://jameskaskade.com/?p=2040http://www.emc.com/collateral/demos/microsites/emc-digital-universe-2011/index.htm
AMP:access module processorsPE: Parsing EngineBYNET: Banyan Cross-bar Switch YNET (Y Network)Store:The Parsing Engine dispatches a request to retrieve one or more rows.The BYNET ensures that appropriate AMP(s) are activated.The Parsing Engine dispatches a request to insert a row.The BYNET ensures that the row gets to the appropriate AMP (Access Module Processor) via the hashing algorithm.The AMP stores the row on its associated disk.Each AMP can have multiple physical disks associated with it.Retrieve:The AMPs (access module processors) locate and retrieve desired rows in parallel access and will sort, aggregate or format if needed.The BYNET returns retrieved rows to Parsing Engine.The Parsing Engine returns row(s) to requesting client application.Teradata’s shared-nothing architecture allows for highly scalable data volumes.
3 node Hadoop system:$8K/node$10K switch$4K/node HadoopDistro$24K + $10K x 25%x3 maintenance = $43K$4K x 3 x 3 = $36KTotal = There are three essential elements of an analytic platform: Strong support for analytic database query. A variety of query styles — at a minimum, SQL, MDX or graph.Strong support for analytic processes other than queries. Typically these would be in the areas of mathematics (statistics, predictive analytics, data mining, linear algebra, optimization, graph theory, etc.) and/or data transformation (e.g. sessionization, entity extraction).Strong integration between the first two.The point is — an analytic platform is something on which you can build a range of powerful analytic applications. Some specifics of what to look for in analytic platform may be found in the link above.http://www.dbms2.com/2011/02/24/analytic-platforms/http://www.dbms2.com/2011/01/18/architectural-options-for-analytic-database-management-systems/Enterprise data warehouse (Full or partial)Kinds of data likely to be included: All, but especially operationalLikely use styles: AllCanonical example: Central EDW for a big enterpriseStresses: Concurrency, reliability, workload managementClassical EDWs are Teradata, DB2, Exadata, and maybe Microsoft SQL ServerTraditional data martKinds of data likely to be included: AllLikely use styles: Business intelligence, budgeting/consolidation, investigativeExamples: Reporting servers, planning/consolidation servers, anything MOLAP, etc.Stresses: Performance, concurrency, TCOColumnar DBMS might have more attractive performance and TCO (Total Cost of Ownership); the same goes for Netezza. Some of them — e.g. Sybase IQ and Vertica — have excellent track records in concurrent usage as well.Investigative data mart — agileKinds of data likely to be included: All, especially customer-centricLikely use styles: InvestigativeCanonical example: A few analysts getting a few TB to examineStresses: Ease of setup/load, ease of admin, price/performanceInfobright is often cost-effective among columnar analytic DBMS. Investigative data mart — bigKinds of data likely to be included: All, especially customer-centric, logs, financial trade, scientificLikely use styles: InvestigativeCanonical example: Single-subject 20 TB – 20 PB relational databaseStresses: Performance, scale-out, analytic functionalityPerformance and scalability are major challenges, usually best addressed by MPP (Massively Parallel Processing) systems, such as Netezza, Vertica, Aster Data, ParAccel, Teradata, or Greenplum.Bit bucket - HadoopKinds of data likely to be included: Logs, other technical/externalLikely use styles: Staging/ETL, investigativeCanonical example: Log files in a Hadoop clusterStresses: TCO, scale-out, transform/big-query performance, ETL functionalityArchival data storeKinds of data likely to be included: Operational, CDR (call detail record), security logLikely use styles: Archival, reporting (for compliance), possibly also investigativeExamples: Any long-term detailed historical storeStresses: TCO, compression, scale-out, performance (if multi-use)Perhaps only Rainstor truly embraces the archival positioningOutsourced data martKinds of data likely to be included: AllLikely use styles: Traditional BI, investigative analytics, staging/ETLExamples: Advertising tracking, SaaS CRMStresses: Performance, TCO, reliability, concurrencyOracle shops = Vertica gets the nod in a number of these casesOperational analytic(s) serverKinds of data likely to be included: Customer-centric, log, financial tradeLikely use styles: Advanced operational analyticsExamples:Lower latency: Web or call-center personalization, anti-fraudHigher latency: Customer profiling, Basel 3 risk analysisStresses: Performance, reliability, analytic functionality, perhaps concurrencyhttp://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/
Being the CEO of Infochimps, I felt compelled to share a little “chimpy” research with you…The “Infinite Monkey Theorem”….is a METAPHOR that directly relates to Big Data, that I think you’ll appreciate.So what is the “Infinite Monkey Theorem”????The following definition is a variant of the original theorem….let me read it to you.This theorem has been traced back to Aristotle's “On Generation and Corruption”, where he makes deductions about the unexperienced and unobservable based on real experiences and real observations.
This theorem has been traced back to Aristotle's “On Generation and Corruption”, where he makes deductions about the unexperienced and unobservable based on real experiences and real observations.Think about this a little….we’re talking about analyzing real world experiences and observations to predict what will happen…what will happen with our business in the future….the unexperienced and unobserved.This is fundamentally what Big Data proposes to help…
So as a metaphor…the "monkey" is not an actual monkey, but a metaphor for an abstract device a device that produces a sequence of letters and symbols.And "almost surely" is a mathematical term with a precise meaningShakespeare’s Hamlet also represents a broader meaning….it represents any text, any work, any insight.
So lets look at this in more depth….Infinite number of monkeys -> represents today’s seemingly unlimited computational power of either public or private Clouds…as an elastic delivery method.Keys on a typewriter -> capture discrete transactions which only analyzed together can derive meaning. Again we amass the computational power to process dataAlmost surely -> is translated into a mathematical term, namely the concept of significanceAnd finally, Shakespeare’s Hamlet is what we strive to create and it is the source of our happiness, our translation of this raw resource into insight.
Now this may seem “chimpy”….but this is beautiful. I love this metaphor.But we have a LARGE problem….
We have a problem today WITH our data infrastructure….our ability to gleam insights.I think all of you know what I’m referring to…..It’s the fact that we’re operating on less than 15% of the corporate data available to us…..even with the ENTERPRISE DATA WAREHOUSE, the EDW which is supposedly storing a COMPLETE, SINGLE VIEW OF THE TRUTH….We’re still giving our business users…..a tiny bit…a little bit of data.
The Business User
The Business User
The Business User
So why is an elastic, unlimited computational resource important?Op-Ex vs. Cap-ExCost Reduction due to better utilization / productivityTime-to-Market
Hedge funds and Wall Street firms, are using Cold War-style satellite surveillance to gather market-moving information. The Port of Long Beach is the second-busiest container port in the United States and acts as a major gateway for trade between the US and Asia. With the activity from this port estimated at over $100 billion per year, this specific port is a location it will pay to keep track of. Satellite analysts use these images to count shipping containers coming off ships in California and are able to get a sense of overall US import activity, comparing activity month by month.This analysis is being performed in Amazon”s EC2
Now lets talk about processing your enterprise data assets….your Big Data…..again, we can leverage the cloud infrastructure to scale to the level of any processing needs you may have.
The current image shows a Walmart in Wichita, Kansas.Analysts count cars in Wal-Mart parking lots to measure overall customer traffic to understand growth versus its competition.For example, Wal-Mart's growthwas determined to come mostly from areas of high unemployment.This type of analysis is being performed in Amazon”s EC2…
The current image shows the a Target in the Moraine Point Plaza located in Gardiner, NorthAnalysts comparing satellite parking lot data with regional unemployment trends found Target's growth tended to come in areas of lower-than-average unemployment. Again, these processes are being performed in Amazon EC2.…this is interesting….but how do we process the data further to help derive more relevant insights?http://www.cnbc.com/id/38738810/Spying_For_Profits_The_Satellite_Image_Indicator
The way this is performed is by taking data sources like images and storing them into Hadoop. Then using Big Data tools like MapReduce to perform sophisticated analysis on those aggregated data sets.Why is this concept so disruptive?Things like a fraction of the price….no structured data model – aka no star schema…yet the ability to run sophisticated queries and algorithms against all your detailed data.
The Business User
The previous examples of Walmart and Target involved using a regression algorithm which was executed against the satellite data + other data to produce a quarterly revenue prediction which BEAT all previous models.
Which brings us to the discussion around insights.
Quote that sets theme….the definition of “Infinite Monkey Theorem”.