Microsoft Azure Batch
High Performance Computing with an Application of Scalable File Processing
Khalid M. Salama, Ph.D.
Business Insights & Analytics
Hitachi Consulting UK
We Make it Happen. Better.
© Copyright 2016 Hitachi Consulting
Outline
• What is Azure Batch and High Performance Computing?
• When to Use Azure Batch?
• Azure Batch Constructs
• Scalable Data Loading Solution with Azure Batch
• .NET Code Walk-through & Demo
• Useful Resources
What is Azure Batch?
Yet another Azure service…
A High Performance Computing (HPC) environment on Azure, used to scale/parallelize compute-intensive workloads on a managed cluster of VMs. The computation on the cluster is managed using the Azure Batch APIs.
• On-demand – pay as you use
• Elastic – scale up/down or shut down
• PaaS – no infrastructure configuration needed
Computing Example – Sequential Processing
A Job is decomposed into six tasks (Task 1 … Task 6) that run one after another on a single compute unit. If each task takes time X:
Start   T = 0
Task 1  T = 1X
Task 2  T = 2X
Task 3  T = 3X
Task 4  T = 4X
Task 5  T = 5X
Task 6  T = 6X
End     T = 6X+
High Performance Computing
Refers to the use of parallel processing to run compute-intensive job programs efficiently by aggregating compute power:
• Divide – a job is decomposed into multiple independent tasks
• Distribute – tasks are processed on separate compute nodes, simultaneously
• Scale out – using multiple compute units
Computing Example – Parallel Processing
The same six tasks are distributed across a compute cluster and executed simultaneously, one per node:
Start   T = 0
Task 1  T = 1X
Task 2  T = 1X
Task 3  T = 1X
Task 4  T = 1X
Task 5  T = 1X
Task 6  T = 1X
End     T = 1X+
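The speedup is easy to demonstrate. The deck's demo code is .NET and did not survive extraction; as a stand-in, this Python sketch runs six equal tasks sequentially and then on six workers, showing wall time drop from roughly 6X to roughly 1X:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def run_task(task_id: int) -> int:
    """Stand-in for one unit of work taking roughly X seconds (here X = 0.1)."""
    time.sleep(0.1)
    return task_id

tasks = range(1, 7)  # Task 1 .. Task 6

# Sequential: one compute unit, total wall time ~ 6X.
start = time.perf_counter()
sequential = [run_task(t) for t in tasks]
seq_time = time.perf_counter() - start

# Parallel: six workers standing in for six nodes, total wall time ~ 1X.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=6) as pool:
    parallel = list(pool.map(run_task, tasks))
par_time = time.perf_counter() - start

print(sequential == parallel)  # True: same results either way
print(par_time < seq_time)     # True: parallel finishes in ~1X instead of ~6X
```

The tasks here sleep rather than compute, so threads suffice; for genuinely CPU-bound work the same shape applies with processes or separate machines.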
Big Data vs. Big Compute
The big brothers…
Big Data
• Data centric
• Increase of data Volume + Velocity + Variety
  = technologies to store and process the data efficiently
• Azure HDInsight
Big Compute
• CPU & memory intensive
• Increase of computation and algorithm complexity
  = technologies to parallelize/distribute the workload
• Azure Batch
Big Data processing is a subset of Big Compute; the latter covers a wider spectrum of computing problems.
When to use Azure Batch
Use cases for Big Compute: intrinsically parallel (also known as "embarrassingly parallel") applications
• Image rendering and graphics processing
• Search and optimization problems
• Various experimental/simulation computing applications
• Massively parallel data file processing & loading
• Executing thousands of DB stored procedures simultaneously? NO! Remember where the computation occurs: that work runs on the database server, not on the Batch nodes.
For applications that need task-to-task interaction, Message Passing Interface (MPI) is supported in Azure Batch – Distributed Processing. In some cases, communication between tasks can instead be managed via a shared data store – Parallel Processing.
Azure Batch Constructs
Putting together the pieces of the picture
Azure Batch Account
• Pool
  – Number of VMs
  – VM size
  – VM OS family
• Job
  – Set of tasks
  – Priority
  – Max. execution time
• Task
  – Parent job
  – Resources (.config, .dlls)
  – Cmd executable (.exe)
  – Cmd parameters
Azure Storage Account
• Hosts all the task resources (.dlls & .exe)
A Batch account contains one or more pools (each defined by its number of nodes, osFamily, and node size). Jobs (each with a priority and a max execution time) are submitted against a pool, and each job contains tasks (each referencing its parent job, an executable, and its resources). For example, Pool 1 might run Job 1 (Tasks 1–3) and Job 2 (Tasks A–B), while Pool 2 runs Job 3 (Tasks X–Y).
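The hierarchy above can be sketched as plain data structures (illustrative Python only; the real types live in the Azure Batch SDK, and the field names here are paraphrases of the slide's bullets, not SDK names):

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    id: str
    command_line: str                                    # cmd executable (.exe) + parameters
    resource_files: list = field(default_factory=list)   # .config/.dlls hosted in the Storage account

@dataclass
class Job:
    id: str
    priority: int
    max_wall_clock_minutes: int                          # max. execution time
    tasks: list = field(default_factory=list)

@dataclass
class Pool:
    id: str
    target_vm_count: int                                 # number of VMs
    vm_size: str                                         # VM size
    os_family: str                                       # VM OS family
    jobs: list = field(default_factory=list)

# Wire up one pool -> one job -> one task, mirroring the diagram.
pool = Pool("pool1", target_vm_count=10, vm_size="small", os_family="4")
job = Job("job1", priority=100, max_wall_clock_minutes=60)
job.tasks.append(Task("task1", "DataLoader.exe feed1", ["DataLoader.exe", "app.config"]))
pool.jobs.append(job)
print(len(pool.jobs[0].tasks))  # 1
```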
Compute Size
Resource                  Default   Maximum Limit
Azure Batch Accounts      1         50
Pools per Batch Account   20        5000
Cores per Batch Account   20        N/A
Tasks per Compute Node    1         4 x node cores
Number of nodes vs. node size:
• Many small nodes – many tasks that are not compute/memory intensive
• Few big nodes – few compute/memory-intensive tasks (potential multi-threading per task)
• Task queueing is automatically managed by Azure Batch
Compute Size – Scheduling Example
What if:
• Pool size = 10 nodes
• Node size = Small (1 core), so total cores = 10
And you have:
• 2 jobs, each with 7 tasks, so total tasks = 14
By default, 1 core can process only 1 task. Then:
• The 7 tasks of the higher-priority job will be executed (status = "Running")
• The first 3 tasks added to the lower-priority job will be executed (status = "Running")
• The remaining 4 tasks of the lower-priority job will be queued (status = "Active")
• As soon as a "Running" task finishes (status = "Completed"), an "Active" task is assigned to the freed compute node
• If a job is executing (status = "Running") and a higher-priority job is submitted to the same pool, Azure Batch will "pause" tasks of the low-priority job (status = "Suspended") to free cores for the higher-priority job, then resume them when resources become available
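The arithmetic of this example can be simulated in a few lines (a toy Python model of the default one-task-per-core scheduling described above, not the real Batch scheduler):

```python
from collections import deque

CORES = 10  # 10 small nodes x 1 core

high = deque(f"high-{i}" for i in range(1, 8))   # higher-priority job, 7 tasks
low = deque(f"low-{i}" for i in range(1, 8))     # lower-priority job, 7 tasks

running, active = [], []
for queue in (high, low):                        # higher priority is scheduled first
    while queue:
        task = queue.popleft()
        (running if len(running) < CORES else active).append(task)

print(len(running))  # 10: all 7 high-priority tasks + the first 3 low-priority tasks
print(len(active))   # 4: the remaining low-priority tasks are queued as "Active"

# When a "Running" task completes, the freed core picks up an "Active" task.
running.remove("high-1")
running.append(active.pop(0))
print(len(running), len(active))  # 10 3
```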
Use Case: Parallel Data File Loading
Parallel Data Loading with Azure Batch
Problem Context
• Source data is a set of files in different formats (fixed width, delimited, XML, JSON, mainframe, other) in Azure Blob Storage
• Blob storage structure: "<DataDomain>/<DataFeed>/<DataFeed>_<Timestamp>.<ext>"
• 200+ data feeds, each producing 1-3 files daily
• Data feed formats (columns, data types, file format) are described in a MetadataDB (Azure SQL DB)
• The objective is to build a data loading solution that:
  – Parses the files and loads them into a database (Azure SQL DW)
  – Is scalable – used for ongoing data loading and historical data migration
  – Is metadata driven – new data feeds can be handled by adding metadata
  – Logs execution history and errors
Parallel Data Loading with Azure Batch
Parallelism Level – the task (the unit of parallelization, or granule) can be:
• Processing a feed
  – balanced number of files/file sizes in each feed
  – loads its files in sequence
  – files can be processed simultaneously on the same node using multithreading (CPU/memory implications)
• Processing a file
  – no file sequence is needed
  – fine grained: more control, better utilization of resources
  – less manageable (many tasks per job)
• Processing a file line
  – multithreading on the same node
Parallel Data Loading with Azure Batch
Solution Architecture
Components: an Azure Batch Runner <Host>, Metadata <Azure SQL DB>, Source <Azure Blob Storage> holding Feed 1 … Feed N, a Compute Cluster <Azure Batch Pool>, and the Destination <Azure SQL DW>. The runner:
1 – Gets the list of feeds to process from the Metadata DB
2 – Creates a job
3 – Creates a task for each feed
4 – Adds the tasks to the job
5 – Submits the job
Each task (Task 1 … Task N) then runs on the compute cluster, reads its feed's files (File 1, File 2, …) from Blob Storage, and loads the resulting data sets (DS 1, DS 2, …) into the destination (Azure SQL DW).
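The five runner steps can be sketched as follows. This is Python with hypothetical stand-in classes (`MetadataDb` and `BatchService` are placeholders for the orchestration logic, not real SDK types):

```python
class MetadataDb:
    """Stand-in for the Azure SQL DB metadata store."""
    def get_feeds_to_process(self):
        return ["Feed1", "Feed2", "FeedN"]           # step 1: list of feeds

class BatchService:
    """Stand-in for the Azure Batch job/task API surface."""
    def __init__(self):
        self.submitted = []
    def create_job(self, job_id):                    # step 2: create a job
        return {"id": job_id, "tasks": []}
    def add_task(self, job, feed):                   # steps 3-4: one task per feed
        job["tasks"].append(f"DataLoader.exe {feed}")
    def submit(self, job):                           # step 5: submit the job
        self.submitted.append(job)

metadata, batch = MetadataDb(), BatchService()
job = batch.create_job("daily-load")
for feed in metadata.get_feeds_to_process():
    batch.add_task(job, feed)
batch.submit(job)
print(len(batch.submitted[0]["tasks"]))  # 3: one task per feed
```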
Parallel Data Loading with Azure Batch
Task Processing Steps
1. Get the feed format info from Metadata
2. Create the destination tables
3. Get the list of files to process
4. Load the parser class to use
5. For each file to process:
   a. Load the file content from Blob Storage
   b. Parse the file content into a DataTable
   c. Dump the DataTable content to the destination (DW)
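A minimal Python sketch of one task's loop, with hypothetical in-memory stand-ins for the metadata DB, blob store, parser registry, and warehouse (none of these are real APIs; the point is the sequence of steps):

```python
def process_feed(feed, metadata, blob_store, warehouse):
    fmt = metadata[feed]                                     # 1. feed format info from metadata
    warehouse.setdefault(fmt["table"], [])                   # 2. create the destination table
    files = [n for n in blob_store if n.startswith(feed)]    # 3. list of files to process
    parse = PARSERS[fmt["format"]]                           # 4. pick the parser to use
    for name in files:                                       # 5. for each file:
        content = blob_store[name]                           #    a. load content from blob
        rows = parse(content)                                #    b. parse into rows ("DataTable")
        warehouse[fmt["table"]].extend(rows)                 #    c. dump the rows to the DW

# One toy parser; the real solution selects among fixed-width/delimited/XML/... parsers.
PARSERS = {"delimited": lambda text: [line.split(",") for line in text.splitlines()]}

metadata = {"Sales": {"table": "dbo.Sales", "format": "delimited"}}
blob_store = {"Sales/Sales_20160101.csv": "1,a\n2,b",
              "Sales/Sales_20160102.csv": "3,c"}
warehouse = {}
process_feed("Sales", metadata, blob_store, warehouse)
print(len(warehouse["dbo.Sales"]))  # 3 rows loaded
```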
.NET Solution Structure
• Processing Logic (Class Library)
  – Model
  – Database services
  – Blob Storage services
  – Parsers
• Task (Console App) – deployed, together with the processing logic, to Azure Blob Storage
  – Receives command-line parameters
  – Performs the operation according to the supplied parameters
• Runner (Console App) – runs on a host
  – Azure Batch services
  – Creates pools/jobs/tasks
Hosting the Azure Batch Runner
• None! – one-off execution
• SQL Agent Job (VM + SQL Server)
• SQL Server Integration Services (VM + SQL Server)
• Azure WebJob + Azure Scheduler (or on-demand)
• Azure Data Factory
• Azure Orchestration???
Code Walk-through
This is how we do it…
• Solution Structure
• Azure Batch Bits
• Azure Blob Storage Bits
• Text File Processing
• XML & JSON – (Quick and Dirty)
• SQL Bulk Copy with Retry Pattern
Code Walk-through – Solution Structure
Code Walk-through – Azure Batch Bits
Very useful if you want to sync with subsequent processing steps, i.e., start a subsequent step only when the job finishes.
Code Walk-through – Azure Blob Storage
Streaming is very efficient for processing large files, instead of downloading the whole file before processing it.
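The idea, sketched in Python (an illustration of streaming parsing, not the deck's .NET blob code; `blob` here is a stand-in for a blob download stream):

```python
import io

def parse_streaming(stream):
    """Yield one parsed record per line; memory use stays at one line at a time,
    instead of materializing the whole file before parsing."""
    for line in io.TextIOWrapper(stream, encoding="utf-8"):
        yield line.rstrip("\n").split("|")

blob = io.BytesIO(b"1|alpha\n2|beta\n3|gamma\n")  # pretend this is a large blob stream
records = list(parse_streaming(blob))
print(records[0])    # ['1', 'alpha']
print(len(records))  # 3
```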
Code Walk-through – Text File Parsing (FileHelpers Library)
Parallel processing at the file level (a separate thread per line to parse).
Code Walk-through – XML & JSON File Parsing (Quick & Dirty)
• The content of the whole file is loaded into a dataset
• Data cannot be flushed in batches
• Unlike streaming, this is a more memory-intensive approach
Code Walk-through – SQL Bulk Copy (Loading in Batches)
Choose batch size < (available memory / record size).
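Batched loading, sketched generically in Python (the deck's solution uses SqlBulkCopy, which has its own BatchSize setting; this only shows the flush-every-N-rows pattern):

```python
def load_in_batches(rows, flush, batch_size):
    """Accumulate rows and flush every `batch_size`, so memory holds at most one
    batch instead of the whole file. Pick batch_size < available_memory / record_size."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            flush(batch)          # e.g. one bulk-copy write per batch
            batch = []
    if batch:
        flush(batch)              # final partial batch

flushed = []
load_in_batches(range(10), flushed.append, batch_size=4)
print([len(b) for b in flushed])  # [4, 4, 2]
```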
Code Walk-through – SQL Bulk Copy (Asynchronous)
Code Walk-through – SQL Bulk Copy (Retry Pattern)
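A generic retry-with-exponential-backoff sketch in Python, modeled on the Retry Pattern article linked under Useful Resources (not the deck's actual .NET implementation):

```python
import time

def with_retry(operation, max_attempts=3, base_delay=0.01):
    """Run `operation`; on a transient failure, wait and retry with exponential
    backoff; re-raise once the attempts are exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise                                        # out of retries
            time.sleep(base_delay * 2 ** (attempt - 1))      # back off, then retry

calls = {"n": 0}
def flaky_bulk_copy():
    """Stand-in for a bulk load that hits transient SQL timeouts twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient SQL timeout")
    return "loaded"

print(with_retry(flaky_bulk_copy))  # loaded (after 2 transient failures)
print(calls["n"])                   # 3
```

In production the except clause should catch only errors known to be transient (timeouts, deadlocks), not every exception.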
Some Important Notes – Polybase
• Since the destination database is an Azure SQL DW, Polybase – a Big Data technology – is the best option to load data from Blob Storage into it, by creating external tables that define the format of the data files.
• However, to use Polybase, the Blob Storage account needs to be locally redundant, and each folder should contain only one data file type.
• A pre-processing step is therefore to move the data files from the original Blob Storage (which might be geo-redundant) to a temporary, locally redundant Blob Storage, in a proper folder structure.
• Parsing data files with complex formats (e.g., parent-child, mainframe, JSON, XML) is not possible in Polybase (yet), but Polybase can load each line of a file into a one-column table, where T-SQL is used to parse it.
• If the source is not Blob Storage (e.g., a file system), or your destination is not Azure SQL DW (e.g., Azure SQL DB, DocumentDB, or another Azure Blob Storage/Data Lake), or your file processing does not only involve loading data into a database (e.g., processing requests to initiate a workflow), Azure Batch is the right tool.
Useful Resources
Check these out…
• Azure Batch Documentation
  https://azure.microsoft.com/en-us/documentation/articles/batch-technical-overview
• Azure Batch Explorer
  https://github.com/Azure/azure-batch-samples/tree/master/CSharp/BatchExplorer
• HPC and data orchestration using Azure Batch and Data Factory
  https://azure.microsoft.com/en-us/documentation/articles/data-factory-data-processing-using-batch
• FileHelpers Library
  http://www.filehelpers.net
• Retry Pattern
  https://msdn.microsoft.com/en-us/library/dn589788.aspx
• Spinning up 16,000 A1 Virtual Machines on Azure Batch
  https://blogs.endjin.com/2015/07/spinning-up-16000-a1-virtual-machines-on-azure-batch
• Parallel Computing
  https://en.wikipedia.org/wiki/Parallel_computing
Acknowledgement
These guys are awesome…
Thanks to James Fox and Alessandro Aeberli for their efforts in building the awesome Data Landing Solution for Argos. Nirav is currently the master of the landing solution :)
My Background
Applying Computational Intelligence in Data Mining
• Honorary Research Fellow, School of Computing, University of Kent.
• Ph.D. Computer Science, University of Kent, Canterbury, UK.
• M.Sc. Computer Science, The American University in Cairo, Egypt.
• 25+ published journal and conference papers, focusing on:
  – classification rule induction,
  – decision tree construction,
  – Bayesian classification modelling,
  – data reduction,
  – instance-based learning,
  – evolving neural networks, and
  – data clustering
• Journals: Swarm Intelligence, Swarm & Evolutionary Computation, Applied Soft Computing, and Memetic Computing.
• Conferences: ANTS, IEEE CEC, IEEE SIS, EvoBio, ECTA, IEEE WCCI and INNS-BigData.
ResearchGate.org