6. As long as you’re gonna be thinking anyway,
why not think big. (Donald Trump)
Because we can imagine, we are free (Jean-
Paul Sartre)
What kind of modern world would we have if
Edison, Greene and Dickson had not developed
cinematic technology before Hitchcock grew
up? (Kevin Kelly, futurist)
7. The Unknown Unknowns
• That is to say, there are things that we know
we don't know. But there are also unknown
unknowns. There are things we don't know
we don't know. (Donald Rumsfeld)
21. Increases ad revenue by processing 3.5
billion events per day
Massive Volumes
Processes 464 billion rows per quarter,
with average query time under 10 secs.
Measures and ranks online user
influence by processing 3 billion signals
per day
Cloud Connectivity
Connects across 15 social networks via
the cloud for data and API access
Uses sentiment analysis and web
analytics for its internal cloud
Real-Time Insight
Improves operational decision making
for IT managers and users
24. What is Hadoop?
“Flexible and Available
Architecture for Large Scale
computation and data processing
on a network of highly available
commodity hardware.”
37. BIG DATA REQUIRES AN END-TO-END APPROACH
INSIGHT: Self-Service, Collaboration, Corporate Apps, Devices
DATA ENRICHMENT: Discover, Combine, Refine
DATA MANAGEMENT: Relational, Non-relational, Streaming, Analytical
39. Microsoft Hadoop Vision
Runs on Windows and Azure
• Active Directory
• System Center
• .Net Programmability
Microsoft Data Connectivity
• SQL Server / SQL Parallel Data Warehouse
• Azure Storage / Azure Data Market
40. Microsoft Hadoop Vision
Microsoft Business Intelligence
• Hive ODBC Connectivity
• BI Tools for Big Data
Collaborate with and Contribute to OSS
• Collaborate with HortonWorks
• Provide improvements and Windows support back to OSS
41. On Premise
• Comes with:
• Hadoop command line (shell)
• Hadoop Status for name node and map-reduce cluster
• HDInsight Dashboard
42. On Premise
• On prem:
http://www.microsoft.com/bigdata/
• Single node cluster (onebox) install
• C:\hadoop
• Starts local services
43. On Azure
• On Windows Azure:
http://HadoopOnAzure.com/
• 3 node cluster running as a service in Azure
• Can be used for 5 days
• Provides samples and HDInsight Dashboard
• TAP Program
44. Agenda
• Big Data – What is it?
• Big Data or Big Hype?
• Big Data, Big Insights with
Hadoop
45. Because we can imagine,
we are free
Jean-Paul Sartre
We have the tools. All we’ve got to
do is imagine what could be. We can
reinvent the present; we can
transform the world around us.
Jason Silva
Relational databases are pushed to the limit. Data management techniques haven't scaled. Traditional systems haven't scaled. Big data is about complexity as well as scalability. NoSQL as a paradigm shift. Hadoop can run and parallelise large-scale batch computations on large amounts of data. However, there is a high latency in returning the results, so it is not suitable for low-latency workloads.
What are the features of a Big Data system? Robust; fault tolerant; human fault tolerant; data when you need it; scalable; general; extensible; reduced implementation complexity; error handling; auditing. These are no different from a little data solution. Think inserts.
Some things in life are so complicated and abstract that they’re awesome. Eternity, cosmic significance, and the infinite universe are just a few of these awesome, convoluted concepts that have kept us fascinated and confused since the beginning of human consciousness.
Awe: perceptual expansion, such perceptual vastness that you literally have to configure your mental schemata just to accommodate, just to take in the scale of, the experience.
Ontological awakening: realization of the connectedness of all things, and also the continuum from inanimate to animate matter; all of it is nature, all of it is inevitable, all of it is emerging as part of the same evolutionary process.
Physicist Freeman Dyson speaks of a future where a new generation of artists will write genomes the way that Shakespeare used to write verses.
Courtesy of WIPRO
Teradata and Lynn Langit slide.
We’ve got 7 billion people and 6 billion devices. 90% of the world’s data was created in the last two years alone. This is not the data that’s kept behind corporate walls. It is unstructured content, most of which didn’t even exist years ago: documents, tweets, images, videos posted to YouTube, data gathered from surveillance cameras. We post, we blog, we share, we tweet, we like or don’t like. We have a voice and we leave a digital trail. And every tweet we send is being followed, monitored, analyzed, acted on. Companies are analyzing social media to find out what you’re thinking, to know what new products and services you want even before you do. A new initiative by the U.N. is actually using sentiment analysis to help predict civil unrest, job losses, spending reductions, and disease outbreaks.
Digital marketing optimisation: golden path analysis, click-throughs.
Digital exploration: discovery, new markets.
Machine-generated analytics: logs, real time, telemetry, location, remote sensors.
Data retention: archiving.
Traditionally: physics experiments, sensor data, satellite data, …
Now: operational logs, customer behavior, social interactions online, …
From terabytes in the 1990s, through petabytes today, to zettabytes in the future.
What do we have now? It is like a vacuum tube: slow and expensive. Why did Big Data get big?
Volume – data comes in one size: large.
Variety – structured and unstructured data.
Veracity – good and bad data.
Velocity – fast moving.
Value – business value.
Unlike real crude oil, data can be re-used. It can be mined for profit. It needs to be re-shaped in order to be used. If you don’t have your data, you don’t have anything! You lose your business.
Thanks to @SiSense and Bruno Aziza
Big Data
This is a picture down the center aisle of a shipping container from one of Microsoft’s datacenters. We put ~1800 computers inside one of these containers. Some of us had the privilege of working on the data storage and computational platform that powers Bing. We used 22 of these containers, spanning 40,000 machines where we stored over 100PB of data. This was three years ago, and now these servers are almost obsolete.
Big Data is in constant motion and growing at an incredible rate, with 90% of the world’s data generated in just the past two years. That's remarkable growth. Technology history has taught us that the one with the most data wins. The empires of data like Twitter, Facebook and Yahoo are all able to capitalize on the notion that data equates to power. More and more companies are utilizing Hadoop to power Big Data analytics and drive revenue and profit.
It’s all about your data.
Some examples of organizations that are delivering new value in the form of revenue growth, cost savings, or entirely new business models.
Yahoo - AS with Hive, Klout - AS with Hive (white paper), GE - Hive Analytics
Yahoo! (Gartner BI Excellence Award Winner) is driving growth for existing revenue streams:
Yahoo! manages a powerful, scalable advertising exchange that includes publishers and advertisers.
Advertisers want to get the most out of their investment by reaching their targeted audiences effectively and efficiently.
Yahoo! needs visibility into how consumers are responding to ads along many dimensions (websites, creative, time of day, gender, age, location) to make the exchange work as efficiently and effectively as possible.
Yahoo! doubled its revenue by allowing campaign managers to “tune” campaign targeting and creative.
Yahoo! drove an increase in spending from advertisers since they got better performance by advertising through Yahoo!.
Yahoo! TAO exposed customer segment performance to campaign managers and advertisers for the first time.
Klout is creating new businesses and revenue streams:
Klout’s mission is to help everyone understand and leverage their influence. Klout uses Big Data to unify the social web (consumers, brands, and partners) with social networking and activity, along with data to generate a Klout score and enable analysis, targeting, and social graphs.
Helps consumers manage their “social brand.”
Helps brands reach influencers at scale.
Helps data partners enhance their services (customer loyalty, CRM, media and identity, and marketing). For example, the Palms uses Klout scores in addition to its normal customer rewards program to determine whether or not to upgrade customers to a better room during their stay. The Huffington Post uses Klout to help serve the best curated Twitter content.
Klout Case Study: http://www.microsoft.com/casestudies/Microsoft-SQL-Server-2012-Enterprise/Klout/Data-Services-Firm-Uses-Microsoft-BI-and-Hadoop-to-Boost-Insight-into-Big-Data/710000000129
Case Study on Thailand’s Department of Special Investigations: http://www.microsoft.com/casestudies/Microsoft-SQL-Server-2012-Enterprise/Department-of-Special-Investigation/Thai-Law-Enforcement-Agency-Optimizes-Investigations-with-Big-Data-Solution/710000001175
GE is driving operational efficiencies:
GE is running several use cases on its Hadoop cluster while incorporating several disparate sources to produce results. Along with sentiment analysis, GE is running web analytics on its internal cloud structure and looking at load usage, user analytics, and failure mode analytics. GE built a recommendation engine for its intranet that suggests press releases users might be interested in based on their function, user profiles, and prior visits to its site. GE is working with several types of remote monitoring and diagnostic data from energy and wind businesses.
Business Users need data. There is a paradigm shift towards it, despite what the cartoon says.
Processing platform for Big Data processing, using the “Map-Reduce” processing paradigm.
When people talk about Hadoop they are often talking about specific computational patterns including map reduce, which emerged as a method to process lots of unstructured data on top of a distributed storage system in a highly fault tolerant and embarrassingly scalable way. Hadoop allows us to store and process large amounts of data on commodity hardware. In the past you would spend large amounts of money on very specialized hardware. Today you can do this with off-the-shelf hardware running Hadoop. Now, Hadoop doesn’t have a monopoly on “big”, “real time” or “unstructured” but does provide some unique capabilities.
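To make the map-reduce pattern described above concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop's API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only for illustration. On a real cluster the framework would run the map and reduce steps in parallel across machines and handle the grouping step itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (done by the framework on a cluster)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over an iterable of text lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big insights"]))
```

Because each map call sees only one line and each reduce call sees only one key, both steps can be spread over many commodity machines, which is where the scalability comes from.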
Assuming that the volumes of data are larger than those conventional relational database infrastructures can cope with, processing options break down broadly into a choice between massively parallel processing architectures (data warehouses or databases such as Greenplum) and Apache Hadoop-based solutions. This choice is often informed by the degree to which one of the other "Vs", variety, comes into play. Typically, data warehousing approaches involve predetermined schemas, suiting a regular and slowly evolving dataset. Apache Hadoop, on the other hand, places no conditions on the structure of the data it can process.
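The contrast between a predetermined schema and "no conditions on the structure" can be sketched as schema-on-read: raw records are stored as they arrive, and a schema is applied only when a question is asked. The snippet below is an illustrative assumption in plain Python, not Hadoop or Hive code; RAW_RECORDS and read_with_schema are hypothetical names and the sample data is invented.

```python
import json
from typing import Any, Dict, Iterable, List

# Hypothetical raw records: each line has a different shape, stored as-is.
RAW_RECORDS = [
    '{"user": "alice", "clicks": 3}',
    '{"user": "bob", "clicks": 5, "country": "TH"}',
    '{"user": "carol", "likes": 2}',
]


def read_with_schema(raw_lines: Iterable[str], fields: List[str]) -> List[Dict[str, Any]]:
    """Apply a schema at read time: keep the requested fields, fill gaps with None."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({name: record.get(name) for name in fields})
    return rows


if __name__ == "__main__":
    # The same raw data can be read with different schemas for different questions.
    print(read_with_schema(RAW_RECORDS, ["user", "clicks"]))
    print(read_with_schema(RAW_RECORDS, ["user", "country"]))
```

A warehouse with a predetermined schema would instead reject or reshape the second and third records at load time, before any query is run.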
I see the real breakthrough insights coming through when you take what is the traditional "Business Intelligence" and add more capabilities like machine learning, predictive analysis, statistical analysis, large scale graph processing, pattern mining, trend analysis, economic modeling. All of which today are a reality in Hadoop. The implications of this are quite astounding when you think about it. This is huge.
Big Data, in terms of data volume, variability and velocity at scale, is the first problem. But Big Data solutions and technology by themselves don't lead to solving business objectives. Businesses don't have a Hadoop problem; they have analytics, pattern mining, trend analysis, statistical inference, economic modeling and market regression problems. Data science starts where utility-class services like Big Data Hadoop end. The real opportunity is to expose data science to everyone.
As powerful as Hadoop is, today it’s still more of a computer scientist’s or academically-trained analyst’s tool than it is an enterprise analytics product. Hadoop itself is controlled through programming code rather than anything that looks like it was designed for business unit personnel. Hadoop data is often more “raw” and “wild” than data typically fed to data warehouse and OLAP (Online Analytical Processing) systems. This is where I and Microsoft see opportunity. Essentially: wouldn't it be cool if mere mortals could use this stuff and consume insights that are coming directly from Hadoop?
Microsoft HDInsight enables you to gain insight from virtually any data, connect with the world of data, improve decision making, and enhance the development of the next generation of products and services.
Nearly everyone in your organization can analyze and make more informed decisions with the right tools.
PowerPivot for Microsoft Excel and Power View for SharePoint give nearly all users a view into structured and unstructured data.
With the Hive Add-in for Excel and Hive ODBC Driver, almost anyone in your organization can directly access Hadoop data from end-user tools.
Hadoop simplifies programming for developers with JavaScript for MapReduce jobs. The JavaScript implementation can also reduce your code by up to 10 times compared to Java.
The second thing I want to talk about is Hadoop and how Hadoop is set up to deliver breakthrough insights from your data.
How many of you are familiar with Hadoop? How many of you are using Hadoop for projects today? How many are planning on using Hadoop in the next 12 months? How about in the cloud?
When people talk about Hadoop they are often talking about specific computational patterns including map reduce, which emerged as a method to process lots of unstructured data on top of a distributed storage system in a highly fault tolerant and embarrassingly scalable way. Hadoop allows us to store and process large amounts of data on commodity hardware. In the past you would spend large amounts of money on very specialized hardware. Today you can do this with off-the-shelf hardware running Hadoop. Now, Hadoop doesn’t have a monopoly on “big”, “real time” or “unstructured” but does provide some unique capabilities.
There are other talks that will go into Big Data and Hadoop so we’ll only do a quick overview of that right now. We’ll spend most of our time on Hive.