Basic Big Data and Hadoop terminology
What projects fit well with Hadoop
Why Hadoop in the cloud is so powerful
Sample end-to-end architecture
See: Data, Hadoop, Hive, Analytics, BI
Do: Data, Hadoop, Hive, Analytics, BI
How this tech solves your business problems
http://smallbitesofbigdata.com | http://bit.ly/BDApr2015
Key Takeaways
Basic Big Data and Hadoop terminology
What projects fit well with Hadoop
Why Hadoop in the cloud is so powerful
Sample end-to-end architecture
See: Data, Hadoop, Hive, Streaming, Analytics, BI
Do: Data, Hadoop, Hive, Streaming, Analytics, BI
How this tech solves your business problems
What is Big Data?
It Is
Scale out, distributed processing
Enables elasticity
Encourages exploration
Faster data ingestion
Lower TCO
Empowers self-service BI and analytics
Rapid time to insight
It Is NOT
A well-defined thing
About volume, size
A replacement for everything
The answer to every problem
What is Hadoop? Conceptual View
It Is
A type of Big Data
Just another data source
A loose collection of open source code
Distributed by many
Handles loosely structured data
Write once, read many
It Is Not
Actually a thing!
The only way to do Big Data
Only about data
Architecture – Use Cloud Building Blocks
Landing zone (Blob Storage or in memory), optimized for write throughput:
- Many small blobs
- Raw/binary format
- Data kept until curated
- Azure Blob Storage if persisted
- Azure Queues & Workers for in memory
Curator, for use-case-specific and general processing:
- Data governance requirements (PII scrub)
- Aggregate for efficient storage
- Publish to real-time consumers and long-term storage (Hadoop)
Persistent storage (Blob Storage), optimized for query efficiency:
- Optimized size (combine blobs)
- Cleansed/masked
- Partitioned
- Well-defined, semi-structured data
HDInsight clusters (Hive, Pig, etc.) read the persisted data via REST; Sqoop feeds reporting/DW, and self-service analytics are available from any device.
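The landing-zone-to-curator flow above can be sketched in a few lines. This is a hypothetical illustration (made-up field layout and values, not the actual curator code): it combines many small raw blobs into one cleansed blob per arrival day, masking IP addresses along the way.

```python
# Hypothetical raw session records: one CSV line per session
# (timestamp, user id, IP address). Real landing-zone blobs would be raw/binary.
small_blobs = [
    "2015-04-01T10:00:00,user1,10.0.0.17\n2015-04-01T10:05:00,user2,10.0.0.42\n",
    "2015-04-02T09:30:00,user3,10.0.1.99\n",
]

def mask_ip(ip):
    """PII scrub: drop the last octet so analysis is only region-granular."""
    return ".".join(ip.split(".")[:3]) + ".x"

def curate(blobs):
    """Combine many small blobs into one blob per day (Hive-style partitions)."""
    partitions = {}
    for blob in blobs:
        for line in blob.splitlines():
            ts, user, ip = line.split(",")
            day = ts.split("T")[0]
            partitions.setdefault(day, []).append(f"{ts},{user},{mask_ip(ip)}")
    return {day: "\n".join(rows) for day, rows in partitions.items()}

curated = curate(small_blobs)
print(sorted(curated))  # one curated blob key per arrival day
```

In the real pipeline this step runs as Pig/MapReduce jobs triggered by worker roles, but the shape of the work is the same: consolidate, scrub, partition.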
Typical Big Data Use Cases
- Smart meter monitoring
- Equipment monitoring
- Advertising analysis
- Life sciences research
- Fraud detection
- Healthcare outcomes
- Weather forecasting
- Natural resource exploration
- Social network analysis
- Churn analysis
- Traffic flow optimization
- Legal discovery
- Telemetry
- IT infrastructure optimization
Hadoop Shines When….
Data exploration, analytics and reporting, new data-driven actionable insights
Rapid iteration
Unknown unknowns
Flexible scaling
Data-driven actions for early competitive advantage or first to market
Low number of direct, concurrent users
Low cost data archival
Relational Database vs. Hadoop Platform: SCALE (storage & processing)
(left: relational database | right: Hadoop platform)
- Schema: required on write | required on read
- Speed: reads are fast | writes are fast
- Governance: standards and structured | loosely structured
- Processing: limited, no data processing | processing coupled with data
- Data types: structured | multi- and unstructured
- Best-fit use: interactive OLAP analytics, complex ACID transactions, operational data store | data discovery, processing unstructured data, massive storage/processing
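The schema row is the key contrast. A small engine-agnostic sketch (illustrative names and data only): a relational store enforces the schema when data is written, while a Hadoop-style store keeps raw lines and applies the schema only when the data is read.

```python
# Schema-on-write (relational style): bad rows are rejected at load time.
def load_relational(rows, table):
    for name, age in rows:
        if not age.isdigit():
            raise ValueError(f"rejected at write time: {name},{age}")
        table.append((name, int(age)))

# Schema-on-read (Hadoop style): store anything, interpret at query time,
# silently skipping rows that do not fit the schema being applied.
def query_schema_on_read(raw_lines):
    for line in raw_lines:
        parts = line.split(",")
        if len(parts) == 2 and parts[1].isdigit():  # schema applied here
            yield parts[0], int(parts[1])

raw = ["alice,30", "totally malformed line", "bob,25"]
print(list(query_schema_on_read(raw)))  # [('alice', 30), ('bob', 25)]
```

This is why loosely structured data lands so easily in Hadoop: nothing is rejected on arrival, and each query decides how to interpret the bytes.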
Microsoft Hadoop Options
Cloud
HDInsight Service
Windows Azure Storage Blob (WASB)
HDP or Cloudera on VMs (Windows or Linux)
Any distro on VMs (Windows or Linux)
Hybrid / On-Premises
Parallel Data Warehouse (PDW) with Polybase
APS/PDW Hadoop Regions
OneBox for Developers
Hortonworks Data Platform (HDP for Windows)
Why Hadoop in the Cloud?
Hadoop
It’s easier
You can concentrate on the analytics
WASB: separation of storage and compute
Shared data, globally accessible
Lowers the cost of discovery & innovation
No commitment as you learn
Cloud in General
Today’s disruptor, tomorrow’s reality
Elasticity, capacity
Less infrastructure and implementation work
Lower TCO
Business Continuity
Operational Agility
WASB: Separation of Storage & Compute
Windows Azure Storage Blob (WASB) = separation of storage and compute
Open source code available to any distro
Simplified data access
Reduced data movement
Faster access to new data
Enables ETL even when a cluster isn’t up = lower TCO
Share data concurrently
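WASB addresses blobs with a URI of the form wasb://&lt;container&gt;@&lt;account&gt;.blob.core.windows.net/&lt;path&gt;. Because that address names the storage account rather than any cluster, the same data can be shared by multiple clusters, or curated with no cluster running at all. A small illustrative parse (the account, container, and path below are made up):

```python
from urllib.parse import urlparse

# The data's address names the storage account, not a cluster:
# any cluster (or none) can reference the same blobs.
uri = "wasb://mycontainer@myaccount.blob.core.windows.net/curated/2015-04-01/part-0000"

parts = urlparse(uri)
container, host = parts.netloc.split("@")
account = host.split(".")[0]
print(account, container, parts.path)
```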
So Far….
Basic Big Data and Hadoop terminology
What projects fit well with Hadoop
Why Hadoop in the cloud is so powerful
Sample end-to-end architecture
Hands-On: Storage, data load, SQL database, Service Bus Event Hub, HDInsight, Hive, AzureML,
Power Query, Power View
Key Takeaways
Basic Big Data and Hadoop terminology
What projects fit well with Hadoop
Why Hadoop in the cloud is so powerful
Sample end-to-end architecture
See: Data, Hadoop, Hive, Streaming, Analytics, BI
Do: Data, Hadoop, Hive, Streaming, Analytics, BI
How this tech solves your business problems
Big Data References
Get started / overview with a free Ebook “Introducing Microsoft Azure HDInsight”
http://blogs.msdn.com/b/microsoft_press/archive/2014/05/27/free-ebook-introducing-microsoft-azure-hdinsight.aspx
Architect a solution with the Patterns and Practices guide “Developing big data solutions on
Microsoft Azure HDInsight“
http://blogs.msdn.com/b/masashi_narumoto/archive/2014/06/30/new-release-developing-big-data-solutions-on-microsoft-hdinsight.aspx
The Data Science Laboratory Series is Complete
http://blogs.msdn.com/b/buckwoody/archive/2014/03/24/the-data-science-laboratory-series-is-complete.aspx
Big Data References
Microsoft Big Data http://microsoft.com/bigdata
HDP for Windows http://hortonworks.com/products/hdp-windows/
Hadoop: The Definitive Guide by Tom White
Programming Hive Book by Capriolo, Wampler, Rutherglen
Big Data Learning Resources http://sqlblog.com/blogs/lara_rubbelke/archive/2012/09/10/big-data-learning-resources.aspx
Hurricane Sandy Mash-Up: Hive, SQL Server, PowerPivot & Power View
http://blogs.msdn.com/b/cindygross/archive/2013/01/31/mash-up-hive-sql-server-data-in-powerpivot-amp-power-view-hurricane-sandy-2012.aspx
Twitter Search https://twitter.com/#!/search/%23bigdata
Hive Reference http://hive.apache.org
HDInsight Tutorials http://www.windowsazure.com/en-us/documentation/services/hdinsight/?fb=en-us
Denny Lee http://dennyglee.com/category/bigdata/
Carl Nolan http://blogs.msdn.com/b/carlnol/archive/tags/hadoop+streaming/
Cindy Gross http://tinyurl.com/SmallBitesBigData
Editor's notes
Azure Subscription: http://youtu.be/lSxMtmRE114
Create HDInsight Cluster in Azure Portal http://smallbitesofbigdata.com/archive/2015/02/26/create-hdinsight-cluster-in-azure-portal.aspx
Atomic: Everything in a transaction succeeds or the entire transaction is rolled back.
Consistent: A transaction cannot leave the database in an inconsistent state.
Isolated: Transactions cannot interfere with each other.
Durable: Completed transactions persist, even when servers restart etc.
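A minimal demonstration of the atomic property using Python's built-in sqlite3 (the table and values are illustrative): when a failure occurs mid-transaction, the earlier update in the same transaction is rolled back too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

# Atomic: both updates inside the 'with' block succeed together, or neither
# is applied -- the connection context manager rolls back on any exception.
try:
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass  # the rollback has already happened

print(conn.execute("SELECT balance FROM accounts WHERE name = 'alice'").fetchone())
```

Hadoop-style stores generally do not make this guarantee across arbitrary multi-row updates, which is why complex ACID transactions sit on the relational side of the comparison slide.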
Presenter guidance:
Share how we think about the data platform in the cloud. Today, we’ll specifically talk about SQL in a VM (briefly), SQL DB, DocumentDB, HBase on HDInsight, and Tables/Blobs. There are lots of other adjacent services such as Redis Cache, Event Hubs, HDInsight, Azure ML, Data Factory, Stream Analytics that will not be addressed in this deck.
Slide talk track:
The top row is Power BI – you’re making decisions based on data
The middle row is ML, Stream Analytics, HDInsight, and Data Factory – processing and making sense of the data
The bottom row is where you ingest and store data.
With Azure, organizations have access to a whole range of services that allow them to use the right tool for the right job when developing applications.
In the cloud, organizations can collect and manage data in the form in which it’s born and store it in the form that best suits an application’s needs.
They have a very simple architecture.
Xbox consoles send raw data to a landing zone (it may spill to disk/blob storage). They process each small file as it lands, keep it until curation finishes.
They curate the data – scrub out personally identifiable info, aggregate, split as needed (to send subsets of data such as 10 minutes of sliding data or the new users in the last month), combine many small files into a few large files, put into AVRO format (common, well-known SerDes), persist “permanently” to azure blob store.
The data in the permanent store (WASB) is in a few large files, cleansed/masked, partitioned by day, semi-structured.
HDInsight processes the data – analytics, sending to other systems (SQL, RS, PowerPivot, etc.)
Demo (fake/cleansed data)
Show RawStats (view in notepad, Cloud Explorer) = raw binary data in a proprietary xbox format – shown here (cleansed) with comma separators for readability. Each line is a session with a start time, gamerid, IP address, who they interacted with (gamerids separated by hyphens). This is what is in the landing zone – the raw data.
Show RawCurator.pig (view in notepad). Compute/worker roles are watching for the raw data files. They pick them up and use Pig (and other MapReduce) to remove PII, aggregate, split, consolidate, remove the last octet of the IP for per state data…. Data is stored per arrival data – this sets us up for Hive partitions. This is a very simple workflow written by people who didn’t know Hadoop.
Show gamerstats.xlsx. This is the curated data.
Show PowerMap on top of sheet 3 (optionally also sheet 2 for marketing campaign data). This is using Hive/Hive ODBC driver to view new users.
(optional) Show pssnippets: PowerShell to submit jobs
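Per the notes above, each raw record is a session: start time, gamerid, IP address, and the gamerids interacted with (hyphen-separated). A sketch of parsing one such record, with made-up values; the real format is proprietary and binary, shown comma-separated here only for readability, as in the demo.

```python
def parse_session(line):
    """Parse one comma-separated session record (illustrative layout)."""
    start, gamerid, ip, partners = line.split(",")
    return {
        "start": start,
        "gamerid": gamerid,
        # keep only the first three octets: per-state stats without full-IP PII
        "ip_masked": ".".join(ip.split(".")[:3]) + ".0",
        "partners": partners.split("-"),
    }

session = parse_session("2015-04-01T10:00:00,g123,192.168.5.77,g456-g789")
print(session["ip_masked"], session["partners"])
```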
Businesses using Big Data are “making it big”. They are taking advantage of all this ambient data and they’re moving ahead, gaining a foothold in new markets and gaining marketshare in existing markets. Think about how Netflix makes movie recommendations or how Google can predict a flu outbreak before the CDC does.
HDInsight is very focused on the volume and variety problems. We have our RX/Stream Insight and BI stack added in to help address the solution velocity issues.
Why big data in the cloud?
collect data globally
much is already in the cloud
share globally
cross data center HA/DR
cost of hiring, training, retaining hardware personnel
highly flexible, scalable
easily pull in ambient data
It's partly a question of where to spend your resources and how much control you want.
Why Hadoop in the cloud?
You can deploy Hadoop in a traditional on-site datacenter. Some companies, including Microsoft, also offer Hadoop as a cloud-based service. One obvious question is: why use Hadoop in the cloud? Here's why a growing number of organizations are choosing this option.
The cloud saves time and money
Open source doesn't mean free. Deploying Hadoop on-premises still requires servers and skilled Hadoop experts to set up, tune, and maintain them. A cloud service lets you spin up a Hadoop cluster in minutes without up-front costs.
See how Virginia Tech is using Microsoft's cloud instead of spending millions of dollars to establish their own supercomputing center.
The cloud is flexible and scales fast
In the Microsoft Azure cloud, you pay only for the compute and storage you use, when you use it. Spin up a Hadoop cluster, analyze your data, then shut it down to stop the meter.
We quickly spun up the Azure HDInsight cluster and processed six years' worth of data in just a few hours, and then we shut it down… processing the data in the cloud made it very affordable.
–Paul Henderson, National Health Service (U.K.)
The cloud makes you nimble
Create a Hadoop cluster in minutes, and add nodes on demand. The cloud offers organizations immediate time to value.
It was simply so much faster to do this in the cloud with Windows Azure. We were able to implement the solution and start working with data in less than a week.
–Morten Meldgaard, Chr. Hansen