Analyzing StackExchange data with Azure
Data Lake
Tom Kerkhove
http://www.twitter.com/TomKerkhove
https://be.linkedin.com/in/tomkerkhove
Nice to meet you
Tom KERKHOVE
➔ Integration Professional
➔ IoT Competency Lead
➔ Windows Development & Microsoft Azure MVP
tom.kerkhove@codit.eu
+32 473 701 074
@TomKerkhove
be.linkedin.com/in/tomkerkhove
github.com/tomkerkhove
Agenda
• Why should we care about Big Data?
• Big Data in Azure
• Azure Data Lake
• Demo
• Q & A
Integration of Things / Internet of Things
➔ Connect and scale with efficiency
➔ Analyze and act on new data
➔ Integrate and transform business processes
Event producers & gateways
Ingestion & transformation
Report, Act, Predict
Microsoft Patterns & Practices – IoT Journey
Cluster Management
Languages
Azure services overview (diagram)
Groups: Platform Services, Infrastructure Services, Hybrid Operations, Security & Management
Services shown: Web Apps, Mobile Apps, API Management, API Apps, Logic Apps, Notification Hubs, Content Delivery Network (CDN), Media Services, BizTalk Services, Hybrid Connections, Service Bus, Storage Queues, Backup, StorSimple, Azure Site Recovery, Import/Export, SQL Database, DocumentDB, Redis Cache, Azure Search, Storage Tables, Data Warehouse, Azure AD Health Monitoring, AD Privileged Identity Management, Operational Analytics, Cloud Services, Batch, RemoteApp, Service Fabric, Visual Studio, App Insights, Azure SDK, VS Online, Domain Services, HDInsight, Machine Learning, Stream Analytics, Data Factory, Event Hubs, Mobile Engagement, Data Lake, IoT Hub, Data Catalog, Azure Active Directory, Multi-Factor Authentication, Automation, Portal, Key Vault, Store/Marketplace, VM Image Gallery & VM Depot, Azure AD B2C, Scheduler
Overview in Azure
Categories: Data Ingestion, Data Storage, Data Pipelines, Data Analytics, Machine Learning
Services shown: IoT Hub, Event Hubs, Storage, SQL Database, SQL Data Warehouse, DocumentDB, Data Lake (Store & Analytics), Data Factory, Stream Analytics, HDInsight, Virtual Machine
Cortana Analytics Suite
Analysing Big Data in Azure
Azure Data Lake Family
HDInsight
• Managed cluster service
• Open-source technology
• Runs on Windows or Linux
Data Lake Store
• Unlimited storage
• WebHDFS Store
Data Lake Analytics
• Managed job service
• U-SQL batch-processing
Azure Data Lake Store
➔ WebHDFS compatible
➔ Any size
➔ Any format as-is
➔ Write-once-read-many
➔ Enterprise-grade security
➔ The big data store in Azure
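Because the store is WebHDFS compatible, listing a folder is in principle an ordinary REST call. The following is a minimal Python sketch of how such a URL could be put together. It assumes the `https://<account>.azuredatalakestore.net/webhdfs/v1/` endpoint pattern; the account name `contoso` and the file path are hypothetical, and authentication (an Azure Active Directory bearer token) is left out entirely.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(account: str, path: str, op: str = "LISTSTATUS") -> str:
    """Build a WebHDFS-style URL for a Data Lake Store path (no request is sent)."""
    # Assumption: the store exposes WebHDFS under <account>.azuredatalakestore.net.
    base = f"https://{account}.azuredatalakestore.net/webhdfs/v1"
    # Keep "/" as the path separator and percent-encode everything else.
    encoded_path = quote(path.strip("/"), safe="/")
    return f"{base}/{encoded_path}?{urlencode({'op': op})}"


# Hypothetical account and path, for illustration only.
print(webhdfs_url("contoso", "/stackexchange/raw/Posts.xml"))
```

The `op` parameter carries the WebHDFS operation name, for example `LISTSTATUS` to list a folder or `OPEN` to read a file.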
Data Lake vs Data Warehousing
Characteristics
➔ Data Warehousing
  ➔ Structured data
  ➔ Defined set of schemas
  ➔ Requires Extract-Transform-Load (ETL) before storing
  ➔ Familiar to some of us
  ➔ Exploratory analysis is hard because the data must be transformed first
➔ Data Lake
  ➔ Raw data (unstructured/semi-structured/structured)
  ➔ “Dump” all your data in the lake
  ➔ Data scientists will interpret data from the lake
  ➔ Without metadata, it turns into a data swamp pretty fast
See Martin Fowler on Data Lake & Data Warehouses (link)
Azure Data Lake Analytics
➔ Run analytics jobs on managed clusters
➔ Don’t worry about scale
➔ Written in U-SQL
➔ SQL Syntax
➔ Extensibility in C#
➔ Easily scaled with Analytics Units
➔ Pay for processing time only
Writing U-SQL scripts
➔ Extract from a data source by using built-in or custom extractors
➔ Transform / analyse the data using SQL syntax, in-line C# or C# method calls
➔ Output the result to a data source by using built-in or custom outputters
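The extract, transform and output phases can be illustrated outside U-SQL as well. The following is a Python analogy, not U-SQL: it extracts rows from CSV text, transforms them with a grouped count, and outputs CSV again. The column names `Site` and `PostId` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def extract(csv_text: str) -> list[dict]:
    """Extract: parse raw CSV text into rows (the role of an extractor)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows: list[dict]) -> list[tuple[str, int]]:
    """Transform: count posts per site, similar to a GROUP BY with COUNT(*)."""
    counts = Counter(row["Site"] for row in rows)
    return sorted(counts.items())


def output(result: list[tuple[str, int]]) -> str:
    """Output: serialize the result back to CSV (the role of an outputter)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Site", "Posts"])
    writer.writerows(result)
    return buffer.getvalue()


# Made-up sample input: one row per post.
raw = "Site,PostId\nstackoverflow,1\nserverfault,2\nstackoverflow,3\n"
print(output(transform(extract(raw))))
```

In a U-SQL script the same three roles are filled by an extractor, a SQL-style expression with optional C#, and an outputter, while the service handles distribution and scale.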
Data Lake Analytics - Data Sources
U-SQL in Azure Data Lake Analytics can query:
➔ Azure Storage Blobs
➔ Azure Data Lake Store
➔ Azure SQL Database
➔ Azure SQL Data Warehouse
➔ Azure SQL in VMs
Meet StackExchange
➔ Over 280 subwebsites
➔ 150+ GB of open-source data
➔ Different kinds of data
➔ Posts
➔ Users
➔ Votes
➔ ...
➔ A big data sample data set
What Are We Going To Do?
Acquiring The Data
• Downloading the original data set
Moving The Data
• Upload data set to Azure
• Determine what service to use
Aggregating The Data
• Merging data from each site into one file
• Conversion from XML to CSV
Analyzing The Data
• Run business logic on it
• Attempt to gain knowledge from it
Visualizing The Data
• Visualize what we’ve learned
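The aggregation step (merging per-site files and converting XML to CSV) could look roughly like the Python sketch below. It is not the code used in the demo. It assumes that each record in a dump is a `<row>` element carrying its fields as attributes; the site names, the sample records and the function name `merge_sites_to_csv` are made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET


def merge_sites_to_csv(site_xml: dict[str, str], columns: list[str]) -> str:
    """Merge per-site XML dumps into one CSV, adding a Site column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Site", *columns])
    for site, xml_text in site_xml.items():
        root = ET.fromstring(xml_text)
        # Assumption: each record is a <row> element with its fields as attributes.
        for row in root.iter("row"):
            # Missing attributes become empty CSV cells.
            writer.writerow([site, *(row.get(col, "") for col in columns)])
    return buffer.getvalue()


# Made-up sample dumps for two sites.
dumps = {
    "cooking": '<users><row Id="1" DisplayName="Ann" /></users>',
    "travel": '<users><row Id="7" DisplayName="Bob, Jr." Reputation="42" /></users>',
}
print(merge_sites_to_csv(dumps, ["Id", "DisplayName", "Reputation"]))
```

For files of many gigabytes, parsing a whole document in memory is not realistic; a streaming parser such as `xml.etree.ElementTree.iterparse` would be the more plausible choice.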
Azure Data Lake tools for Visual Studio
➔ Projects / Solutions / Source control
➔ Store Explorer
➔ Browse store
➔ Download complete / subset of file
➔ Preview
➔ Job Visualizer
➔ Determine bottlenecks by using heatmaps
➔ Playback jobs based on telemetry
➔ Query optimization
➔ Job Profiler
➔ Offline execution
Integration with Azure Services
➔ Integrate in your data pipelines in Azure Data Factory
➔ Move data from Azure Data Lake Store to other store
➔ Move data to Azure Data Lake Store
➔ Run U-SQL query within pipeline
➔ Integration with Azure Data Catalog
➔ Register your Azure Data Lake Store assets
Pricing
➔ Data Lake Store
  ➔ $0.08 per GB stored per month
  ➔ $0.14 per 1M transactions
    • 1 transaction is a block of up to 128 kB
  ➔ Egress will be billed, but the price is not yet known
➔ Data Lake Analytics
  ➔ $0.05 per job
  ➔ $0.05 per minute per Analytics Unit for processing time
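To get a feel for these numbers, the prices can be turned into a small calculator. The Python sketch below uses the prices quoted on the slide, which may have changed since. It assumes 1 GB = 1024 × 1024 kB and that every transaction moves a full 128 kB block; the workload (150 GB stored and read once, one job on 10 Analytics Units for 30 minutes) is a made-up example.

```python
import math

# Prices quoted on the slide (USD); they may have changed since.
STORE_PER_GB_MONTH = 0.08
STORE_PER_MILLION_TX = 0.14
TX_BLOCK_KB = 128
JOB_FEE = 0.05
AU_PER_MINUTE = 0.05


def store_cost(stored_gb: float, transferred_gb: float) -> float:
    """Monthly Data Lake Store cost: storage plus transactions of up to 128 kB."""
    # Assumption: 1 GB = 1024 * 1024 kB and every transaction moves a full block.
    transactions = math.ceil(transferred_gb * 1024 * 1024 / TX_BLOCK_KB)
    storage = stored_gb * STORE_PER_GB_MONTH
    return storage + transactions / 1_000_000 * STORE_PER_MILLION_TX


def job_cost(analytics_units: int, minutes: float) -> float:
    """Data Lake Analytics cost for one job: flat fee plus AU-minutes."""
    return JOB_FEE + AU_PER_MINUTE * analytics_units * minutes


# Hypothetical workload: 150 GB stored and read once, one 30-minute job on 10 AUs.
print(f"store: ${store_cost(150, 150):.2f}")
print(f"job:   ${job_cost(10, 30):.2f}")
```

Under these assumptions the processing time dominates: scaling a job to more Analytics Units raises the per-minute cost, so it only pays off if the job finishes proportionally faster.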
Azure Data Lake Store vs Blob Storage
➔ No Limitations: Store whatever you want in any format
➔ Security: Built-in Azure Active Directory support
➔ Pricing: More expensive than Storage RA-GRS
➔ Redundancy: It’s there, but you have no control over it
➔ Built for Scale: Optimized for high-scale reads
➔ Integration: With Data Factory, Data Catalog & HDInsight
Summary
➔ Big Data is not just hype, so get ready
➔ Azure Data Lake Store
➔ Analyse today & explore tomorrow
➔ Data Swamps
➔ Data Lake Analytics
➔ No cluster management
➔ Re-use existing skills
➔ Pay for what we use
➔ Big Data in Azure? Azure Data Lake family and it’s easy!
