J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
1. Data Solution Architect, Microsoft
AZURE DATA LAKE
Store and Analytics
Big Data for Microsoft Developers
Kenneth M. Nielsen
@doktorkermit
2. Kenneth M. Nielsen
• Worked with SQL Server since 1999
• Co-organizer of SQL Saturday DK
• Co-organizer of SQLNexus Nordic
• Community is Everything
• Data Solution Architect at Microsoft
• kmn@funkylab.com
• @doktorkermit
• www.funkylab.com
3. Agenda
• Azure Data Lake overview
• Azure Data Lake Store
• Azure Data Lake Analytics
• Azure Data Lake Analytics – Using Visual Studio
• Azure Data Lake Analytics – Using PowerShell
• Azure Data Lake Analytics – Cognitive Analysis
• Q & A
5. History
Bing needed to…
– Understand user behavior
And do it…
– At massive scale
– With agility and speed
– At low cost
So they built …
– Cosmos
Cosmos
• Batch Jobs
• Interactive
• Machine Learning
• Streaming
Thousands of Developers
6. AZURE DATA LAKE
Store and analyze data of any kind and size
Develop faster, debug and optimize smarter
Interactively explore patterns in your data
No learning curve
Managed and supported
Dynamically scales to match your business priorities
Enterprise-grade security
Built on YARN, designed for the cloud
8. Azure Data Lake Store
A hyperscale repository for big data analytics workloads
No limits to SCALE
Store ANY DATA in its native format
HADOOP FILE SYSTEM (HDFS) for the cloud
ENTERPRISE READY access control, encryption at rest
PERFORMANCE optimized for analytic workloads
9. Azure Data Lake Store
Any Data
• Unstructured
• Semi-structured
• Structured
11. Azure Data Lake Store
HDFS for the cloud
New filesystem built from the
ground up, based on
HADOOP file system
• Integrates with
HDInsight, Hortonworks
and Cloudera
• Supports Files and
Folder objects and
operations
12. Azure Data Lake Store
Unlimited storage
• File sizes can range
from gigabytes to
petabytes
• No limits to scale
13. Azure Data Lake Store
Security
• Always encrypted; in motion
using SSL, and at rest using
keys in Azure Key Vault
• Single sign-on, multi-factor
authentication and seamless
integration of on-premises
identities with Active Directory
• Fine-grained POSIX-based
ACLs for role-based access
controls
• Auditing every access /
configuration change
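The POSIX-style ACL idea in the security bullets can be illustrated with a minimal Python sketch. This is not the Data Lake Store API: the `Acl` class, the `is_allowed` helper, and the user and group names are hypothetical, and the sketch deliberately ignores the POSIX mask entry and default ACLs.

```python
from dataclasses import dataclass, field


@dataclass
class Acl:
    """Simplified POSIX-style ACL; permissions are strings such as "rwx" or "r-x"."""
    owner: str
    owner_perms: str
    group_perms: dict = field(default_factory=dict)  # group name -> perms
    user_perms: dict = field(default_factory=dict)   # named user -> perms
    other_perms: str = "---"


def is_allowed(acl, user, groups, wanted):
    """Return True if `user` (a member of `groups`) holds `wanted` ('r', 'w' or 'x')."""
    if user == acl.owner:
        return wanted in acl.owner_perms
    if user in acl.user_perms:
        return wanted in acl.user_perms[user]
    matching = [acl.group_perms[g] for g in groups if g in acl.group_perms]
    if matching:
        # Any matching group entry that grants the permission is enough.
        return any(wanted in perms for perms in matching)
    return wanted in acl.other_perms


# Hypothetical folder ACL: alice owns it, analysts may read and list, bob may read and write.
acl = Acl(owner="alice", owner_perms="rwx",
          group_perms={"analysts": "r-x"},
          user_perms={"bob": "rw-"})

print(is_allowed(acl, "carol", ["analysts"], "r"))
print(is_allowed(acl, "carol", ["analysts"], "w"))
```

The evaluation order (owner, named user, groups, other) follows the usual POSIX ACL convention; a real store would also apply the mask and audit each check.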
15. Azure Data Lake Analytics
An elastic analytics service
built on Apache YARN that processes all
data, at any size
• No limits to SCALE
• Includes U-SQL, a language that unifies the
benefits of SQL with the expressive power of C#
• Optimized to work with ADL STORE
• FEDERATED QUERY across Azure data sources
• ENTERPRISE READY Role based access control
& Auditing
• Pay PER JOB & Scale PER JOB
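A rough way to reason about "pay per job" is sketched below. It assumes, which the slide does not state, that a job is charged for the analytics units (AUs) it reserves multiplied by its run time. The function name and the price per AU-hour are hypothetical placeholders, not real pricing.

```python
def job_cost(analytics_units, runtime_seconds, price_per_au_hour):
    """Estimate a job's cost as reserved AUs x run time x unit price (assumed model)."""
    au_hours = analytics_units * runtime_seconds / 3600
    return au_hours * price_per_au_hour


# Hypothetical job: 10 AUs for 6 minutes at a made-up price of 2.0 per AU-hour.
print(job_cost(10, 360, 2.0))
```

Under this model, scaling a job up costs more per second but may finish sooner, which is why the AU count is chosen per job.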
16. U-SQL
A new language for
Big Data
• Familiar syntax to millions of SQL & .NET
developers
• Unifies declarative nature of SQL with the
imperative power of C#
• Unifies structured, semi-structured and
unstructured data
• Distributed query support over all data
17. Language Overview
U-SQL Fundamentals
• All the familiar SQL clauses
SELECT | FROM | WHERE
GROUP BY | JOIN | OVER
• Operate on unstructured and
structured data
• Relational metadata objects
.NET integration and extensibility
• U-SQL expressions are full C# expressions
• Reuse .NET code in your own assemblies
• Use C# to define your own:
Types | Functions | Joins | Aggregators | I/O (Extractors, Outputters)
19. U-SQL Distributed Query
• Azure Storage Blobs: READ, WRITE
• Azure Data Lake Store: READ, WRITE
• Azure SQL Database: READ, WRITE
• Azure SQL Data Warehouse: READ, WRITE
• Azure SQL DB in Azure VM: READ, WRITE
20. Develop massively parallel
programs with simplicity
• U-SQL: a simple
and powerful language that’s
familiar and easily extensible
• Unifies the declarative
nature of SQL with expressive
power of C#
• Leverage existing libraries in .NET
languages, R and Python
• Massively parallelize code on
diverse workloads (ETL, ML, image
tagging, facial detection)
21. Read the input, write it directly to output (just a simple copy)
@orders =
    EXTRACT
        OrderId int,
        Customer string,
        Date DateTime,
        Amount float
    FROM "/input/orders.txt"
    USING Extractors.Tsv();
OUTPUT @orders
    TO "/output/orders_copy.txt"
    USING Outputters.Tsv();
• Rowset: @orders
• Apply schema on read: EXTRACT
• From a file in a Data Lake: FROM "/input/orders.txt"
• Easy delimited text handling: Extractors.Tsv(), Outputters.Tsv()
• Write out: OUTPUT
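For readers who do not know U-SQL, the schema-on-read idea can be sketched in Python: the file is plain tab-separated text, and the column names and types are applied only when it is read. The column list mirrors the EXTRACT clause; the `extract_tsv` helper and the sample rows are made up for illustration.

```python
import csv
import io
from datetime import datetime

# Column names and converters mirror the EXTRACT clause in the U-SQL script.
SCHEMA = [
    ("OrderId", int),
    ("Customer", str),
    ("Date", datetime.fromisoformat),
    ("Amount", float),
]


def extract_tsv(text, schema):
    """Yield one dict per line, converting each field with the schema's converter."""
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        yield {name: convert(value)
               for (name, convert), value in zip(schema, fields)}


# Made-up sample data standing in for /input/orders.txt.
raw = "1\tContoso\t2017-02-11\t19.5\n2\tFabrikam\t2017-02-12\t7.25\n"
orders = list(extract_tsv(raw, SCHEMA))

print(orders[0]["Customer"], orders[0]["Amount"] + orders[1]["Amount"])
```

The stored file carries no type information; a different schema could be applied to the same bytes by another reader.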
22. U-SQL Compilation Process
C#
C++
Algebra
Other files
(system files, deployed resources)
managed dll
Unmanaged dll
Compilation output (in job folder)
Compiler & Optimizer
U-SQL Metadata Service
Deployed to Vertices
23. Logical -> Physical Plan
Each square is “a vertex” and represents
a fraction of the total work
Vertices in each SuperVertex (aka
“Stage”) perform the same operation
on different parts of the same data.
Vertices in a later stage may
depend on a vertex in an earlier stage
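The stage and vertex idea can be illustrated with a small Python sketch. It is only an analogy, not how the U-SQL runtime is implemented: the function names and the sample data are hypothetical. Stage 1 runs the same counting operation on four parts of the data, and the single stage 2 vertex depends on all of them.

```python
from collections import Counter


def split_into_parts(rows, n_parts):
    """Partition the input so each vertex gets a fraction of the data."""
    return [rows[i::n_parts] for i in range(n_parts)]


def count_vertex(part):
    """Stage 1 vertex: the same operation (a partial count) on one part."""
    return Counter(part)


def merge_vertex(partials):
    """Stage 2 vertex: needs the output of every stage 1 vertex."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Made-up input: one region code per row.
regions = ["dk", "se", "dk", "no", "se", "dk", "dk", "no"]
stage1 = [count_vertex(p) for p in split_into_parts(regions, 4)]
result = merge_vertex(stage1)

print(result["dk"], result["se"], result["no"])
```

The stage 1 vertices are independent of each other, so they could all run at the same time; the merge cannot start until they finish.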
24. Execution with Requested Parallelism
Requested Parallelism = 1
(reserve enough to do 1 vertex at
a time)
Requested Parallelism = 4
(reserve enough to do 4 vertices
at a time)
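The effect of requested parallelism can be estimated with a short Python sketch. It assumes, as a simplification not taken from the slides, that every vertex takes the same time and that a stage starts only after the previous stage has finished. The vertex counts are made up.

```python
import math


def stage_rounds(vertex_count, parallelism):
    """Rounds needed when at most `parallelism` vertices run at once."""
    return math.ceil(vertex_count / parallelism)


def job_rounds(stage_vertex_counts, parallelism):
    """Stages run one after another; each needs its own rounds."""
    return sum(stage_rounds(n, parallelism) for n in stage_vertex_counts)


stages = [8, 4, 1]  # hypothetical vertex counts per stage

print(job_rounds(stages, 1))  # one vertex at a time
print(job_rounds(stages, 4))  # four vertices at a time
```

Beyond the widest stage (8 vertices here), extra parallelism is reserved but unused, so the job does not get any faster.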
29. Why does a Job get Queued?
Local Cause
Conditions:
• Queue already at Max
Concurrency
Global Cause
Conditions:
• System-wide shortage of ADLAUs
• System-wide shortage of
Bandwidth
* If these conditions are met, a job will be queued even if the
queue is not at its Max Concurrency
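The queueing rules above can be written as a small decision function. This is a sketch of the slide's logic only, not the service's scheduler; the function and parameter names are hypothetical.

```python
def queue_reason(running_jobs, max_concurrency, aus_available, bandwidth_available):
    """Return why a newly submitted job would be queued, or None if it can start."""
    # Global causes apply even when the queue is below its Max Concurrency.
    if not aus_available or not bandwidth_available:
        return "global"
    # Local cause: the account's queue is already at Max Concurrency.
    if running_jobs >= max_concurrency:
        return "local"
    return None


print(queue_reason(3, 3, True, True))
print(queue_reason(1, 3, False, True))
print(queue_reason(1, 3, True, True))
```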
37. Debug and Optimize your
Big Data programs with ease
• Deep integration with
Visual Studio, Visual Studio Code,
Eclipse, & IntelliJ
• Easy for novices to write
simple queries
• Integrated with U-SQL,
Hive, Storm, and Spark
• Actively offers recommendations
to improve performance and
reduce cost
• Playback visually displays job run
45. ADLA: List and submit jobs
• $adla = "mscloudsummitanalytics"
• Get-AzureRmDataLakeAnalyticsJob
-Account $adla
• Submit-AzureRmDataLakeAnalyticsJob
-Account $adla
-Script "…" # U-SQL text
-Name myjob
• Submit-AzureRmDataLakeAnalyticsJob
-Account $adla
-ScriptPath D:\test.script
-Name myjob
46. ADL Store (ADLS) feature set
Account Management
Create new account
List accounts
Update account properties
Delete account
Transferring Data
Upload into store from local disk
Download from store to local disk
Files and Folders
List contents of folder
Create
Move
Delete
Does file exist
Security
Get ACLs
Update ACLs
Get Owner
Set Owner
File Content
Set file content
Append file content
Get file content
Merge files
47. ADL Analytics (ADLA) feature set
Account Management
Create new account
List accounts
Update account properties
Delete account
Data Sources
Add a data source
List data sources
Update data source
Delete data source
Compute
List jobs
Submit job
Cancel job
Catalog Items
List items in U-SQL catalog
Update item
Catalog Secrets
Create catalog secret
List catalog secrets
Delete catalog secrets