This is a step-by-step tour of the entire ecosystem of features driven by Azure Data Explorer (ADX). You will find many examples written in the Kusto dialect, showing how to acquire data, process it, and build complete web interfaces using only one service: ADX.
3. Everything Our User Group Has To Offer
• Get involved in our Meetup
• Join the conversation on our Facebook group
• Follow our page on Facebook
• Follow our videos on YouTube
• Explore
https://bit.ly/2P9sqLy
https://bit.ly/2QqAWX4
https://bit.ly/3auvRnD
https://bit.ly/3n8l5bP
5. Azure Data Explorer in a sentence
The platform:
• Any append-only stream of records
• Relational query model: filter, aggregate, join, calculated columns, …
• Fully managed
• Rapid iterations to explore the data
• High volume, high velocity, high variance (structured, semi-structured, free-text)
• PaaS, vanilla, database
• Purposely built
11. Multi-temperature data processing paths

Hot path (e.g. in-mem cube, stream analytics, …):
• seconds freshness, days retention
• in-mem aggregated data
• pre-defined standing queries
• split-second query performance
• data viewing

Warm path (e.g. column store, indexing, …):
• minutes freshness, months retention
• raw data
• ad-hoc queries
• seconds-to-minutes query performance
• data exploration

Cold path (e.g. distributed file system, map reduce, …):
• hours freshness, years retention
• raw data
• programmatic batch processing
• minutes-to-hours query performance
• data manipulation
12. The role of ADX
[Diagram: raw data flows into ADX and into the DWH; ADX serves refined data, real-time derived data, data comparison, and fast KPIs]

THREE KEY USERS IN ONE TOOL:
• IoT developer (data checks, rule engine for insights)
• Data engineer (data comparison)
• Data scientist (data exploration)
13. How ADX is Organized
[Diagram: INSTANCE → DATABASE → SOURCES; DB users/apps connect through a querying URL, data arrives through an ingestion URL; storage is split between cache storage and blob storage]
• External sources and destinations: IoT Hub, Event Hub, Storage, ADLS, SQL Server, and many more
15. FIRST PHASE: Ingestion
ADX offers many connectors and plugins, many SDKs, many managed pipelines, and many tools for rapid ingestion.

Managed pipelines:
• Ingest blobs using Event Grid
• Ingest an Event Hub stream
• Ingest an IoT Hub stream
• Ingest data from ADF

Connectors & plugins:
• Logstash plugin
• Kafka connector
• Apache Spark connector

SDKs:
• Python SDK
• .NET SDK
• Java SDK
• Node SDK
• REST API
• Go API

Tools:
• One-click ingestion
• LightIngest
16. Ingestion Types
• Streaming ingestion: optimized for a low volume of data per table, across thousands of tables
• Operation completes in under 10 seconds
• Data is available for query after completion
• Batching ingestion: optimized for high ingestion throughput
• Default batch params: 5 minutes, 500 items, or 1000 MB
A sketch of the related policy commands follows below.
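As a minimal, hedged sketch of the control commands behind these two modes (the table and database names are hypothetical; the batching values simply mirror the defaults quoted above):

// Enable streaming ingestion on a table
.alter table MyEventTable policy streamingingestion enable

// Tune the batching policy on a database
.alter database MyDb policy ingestionbatching
'{"MaximumBatchingTimeSpan": "00:05:00", "MaximumNumberOfItems": 500, "MaximumRawDataSizeMB": 1000}'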
17. Ingestion Techniques

Batch ingestion (provided by SDKs): for high-volume, reliable, and cheap data ingestion. The client uploads the data to Azure Blob Storage (designated by the Azure Data Explorer data management service) and posts a notification to an Azure Queue. Batch ingestion is the recommended technique.

Inline ingestion (provided by query tools): most appropriate for exploration and prototyping.
• Inline ingestion: a control command (.ingest inline) containing in-band data, intended for ad hoc testing purposes.
• Ingest from query: control commands (.set, .set-or-append, .set-or-replace) that point to query results, used for generating reports or small temporary tables.
• Ingest from storage: a control command (.ingest into) with data stored externally (for example, Azure Blob Storage), which allows efficient bulk ingestion of data.
Minimal examples of these three commands follow below.
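Below is a minimal, hedged sketch of the three query-tool ingestion commands; all table names and the storage URL/SAS token are hypothetical placeholders.

// 1) Inline ingestion: in-band data for ad hoc testing
.ingest inline into table TestTable <|
1,foo,2020-01-01T00:00:00Z
2,bar,2020-01-02T00:00:00Z

// 2) Ingest from query: persist query results into a (possibly new) table
.set-or-append DailySummary <| SourceTable | summarize Count=count() by bin(Timestamp, 1d)

// 3) Ingest from storage: bulk ingestion from an external blob
.ingest into table TestTable (h@'https://ACCOUNT_NAME.blob.core.windows.net/CONTAINER_NAME/file.csv;SAS_TOKEN')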
18. Ingestion: Formats & Use Cases
For all ingestion methods other than ingest-from-query, format the data so that Azure Data Explorer can parse it. The supported data formats are:
• CSV, TSV, TSVE, PSV, SCSV, SOH
• JSON (line-separated, multi-line), Avro, MultiJSON (jsonLine), ORC, Parquet
• Files/blobs can be compressed: ZIP, GZIP
• It is better to use declarative names: MyData.csv.zip, MyData.json.gz
19. Supported data formats
Schema mapping helps bind source data fields to destination table columns.
• CSV mapping (optional) works with all ordinal-based formats. It can be passed as an ingest command parameter or pre-created on the table and referenced from the ingest command parameter.
• JSON mapping (mandatory) and Avro mapping (mandatory) can be passed as an ingest command parameter. They can also be pre-created on the table and referenced from the ingest command parameter.
A sketch of a pre-created mapping follows below.
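As a minimal, hedged sketch, this is how a JSON mapping can be pre-created on a table and then referenced at ingestion time; the table name, mapping name, columns, and blob URL are hypothetical.

// Pre-create a JSON mapping on the table
.create table MyTable ingestion json mapping "MyJsonMapping"
'[{"column": "Timestamp", "Properties": {"Path": "$.ts"}}, {"column": "Message", "Properties": {"Path": "$.msg"}}]'

// Reference the mapping from the ingest command
.ingest into table MyTable (h@'https://ACCOUNT_NAME.blob.core.windows.net/CONTAINER_NAME/file.json;SAS_TOKEN')
  with (format='json', ingestionMappingReference='MyJsonMapping')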
23. What is LightIngest
• A command-line utility for ad-hoc data ingestion into Kusto
• Pulls source data from a local folder
• Pulls source data from an Azure Blob Storage container
• Useful for ingesting quickly and playing with ADX
• Most useful when you want to ingest a large amount of data (time constraint on ingestion duration)

[Ingest JSON data from blobs]
LightIngest "https://adxclu001.kusto.windows.net;Federated=true"
  -database:db001
  -table:LAB
  -sourcePath:"https://ACCOUNT_NAME.blob.core.windows.net/CONTAINER_NAME?SAS_TOKEN"
  -prefix:MyDir1/MySubDir2
  -format:json
  -mappingRef:DefaultJsonMapping
  -pattern:*.json
  -limit:100

[Ingest CSV data with headers from local files]
LightIngest "https://adxclu001.kusto.windows.net;Federated=true"
  -database:MyDb
  -table:MyTable
  -sourcePath:"D:\MyFolder\Data"
  -format:csv
  -ignoreFirstRecord:true
  -mappingPath:"D:\MyFolder\CsvMapping.txt"
  -pattern:*.csv.gz
  -limit:100

REFERENCE: https://docs.microsoft.com/en-us/azure/kusto/tools/lightingest
24. LightIngest: pay attention to IngestionTime!
IMPORTANT: all the data is indexed, but how is it partitioned? By ingestion TIME!
The -creationTimePattern argument allows users to partition the data by creation time instead of ingestion time.

[Ingest CSV data with headers from local files]
LightIngest "https://adxclu001.kusto.windows.net;Federated=true"
  -database:MyDb
  -table:MyTable
  -sourcePath:"D:\MyFolder\Data"
  -format:csv
  -ignoreFirstRecord:true
  -mappingPath:"D:\MyFolder\CsvMapping.txt"
  -pattern:*.csv.gz
  -limit:100

[Ingest JSON data from blobs]
LightIngest "https://adxclu001.kusto.windows.net;Federated=true"
  -database:db001
  -table:LAB
  -sourcePath:"https://ACCOUNT_NAME.blob.core.windows.net/CONTAINER_NAME?SAS_TOKEN"
  -prefix:MyDir1/MySubDir2
  -format:json
  -mappingRef:DefaultJsonMapping
  -pattern:*.json
  -limit:100
25. One Click Ingestion (GA)
• One Click makes ingestion intuitive (guided UX)
• Start ingesting data, creating tables, and mapping structures
• Supports different data formats

STEPS:
1. Check your data
2. Find the best format and compression
3. Create and destroy tons of test tables
4. Derive the mapping
5. SCRIPT IT ALL and version it
26. My best ingestion experience
Open points:
• Why an Event Hub after the IoT Hub?
• Why the second Event Hub?
27. Update Policy
Automatically appends data to a target table whenever new data is inserted into the source table, based on a transformation query that runs on the data inserted into the source table.

USE IT IF:
• The source table is a «free-text column» based table
• The target table accepts only a specific morphology

Cascading updates are allowed (TableA → TableB → TableC → ...).
[Diagram: raw table → refined table]
28. How to use Update Policy
// Create a function that will be used for the update
.create function MyUpdateFunction() {
    MyTableX
    | where ColumnA == 'some-string'
    | summarize MyCount=count() by ColumnB, Key=ColumnC
    | join (OtherTable | project OtherColumnZ, Key=OtherColumnC) on Key
    | project ColumnB, ColumnZ=OtherColumnZ, Key, MyCount
}

// Create the target table (if it doesn't already exist)
.set-or-append DerivedTableX <| MyUpdateFunction() | limit 0

// Enable the update policy on table DerivedTableX
.alter table DerivedTableX policy update
@'[{"IsEnabled": true, "Source": "MyTableX", "Query": "MyUpdateFunction()", "IsTransactional": false, "PropagateIngestionProperties": false}]'

// Remove the policy when no longer needed
.delete table DerivedTableX policy update
29. Pay attention to failures!
Evaluate resource usage:
.show table MySourceTable extents;
// The following line provides the extent ID for the not-yet-merged extent in the source table which has the most records
let extentId = $command_results | where MaxCreatedOn > ago(1hr) and MinCreatedOn == MaxCreatedOn | top 1 by RowCount desc | project ExtentId;
let MySourceTable = MySourceTable | where extent_id() == toscalar(extentId);
MyFunction()

Failures:
.show ingestion failures
| where FailedOn > ago(1hr) and OriginatesFromUpdatePolicy == true

• Non-transactional policy: failures are ignored
• Transactional policy: if the ingestion method is pull, the entire ingestion operation is automatically retried (up to a maximum period)

SO: you should check failures to trigger «BROKEN FILES» handling… but HOW?
30. Use this pattern
The first table is NEVER wide!! … but YES for the second!
• First table schema: Key, Value, Timestamp, Metadata (telemetry oriented)
• Second table schema: WT, a Wide Table (ML oriented)
A minimal sketch of the pattern follows below.
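As a hedged sketch of this pattern (all table, column, and key names plus the 1-minute bin are hypothetical), an update policy can pivot the narrow telemetry table into the wide, ML-oriented one:

// Narrow, telemetry-oriented landing table
.create table RawTelemetry (Key:string, Value:real, Ts:datetime, Metadata:dynamic)

// Wide, ML-oriented table fed by an update policy
.create table WideTelemetry (Ts:datetime, Temperature:real, Pressure:real)

// Pivot function: one row per time bin, one column per key
.create function ToWide() {
    RawTelemetry
    | summarize Temperature = maxif(Value, Key == 'temperature'),
                Pressure    = maxif(Value, Key == 'pressure') by bin(Ts, 1m)
    | project Ts, Temperature, Pressure
}

.alter table WideTelemetry policy update
@'[{"IsEnabled": true, "Source": "RawTelemetry", "Query": "ToWide()", "IsTransactional": false}]'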
33. Kusto for SQL Users
• Perform SQL SELECT (no DDL, only SELECT)
• Use KQL (Kusto Query Language)
• Supports translating T-SQL queries to Kusto Query Language with the explain keyword:

explain
select top(10) * from StormEvents
order by DamageProperty desc

// ...returns the equivalent KQL:
StormEvents
| sort by DamageProperty desc nulls first
| take 10
36. Time Series Analysis – bin operator
Rounds values down to an integer multiple of a given bin size. If you have a scattered set of values, they will be grouped into a smaller set of specific values.

[Rule]
bin(value, roundTo)

[Example]
T | summarize Hits=count() by bin(Duration, 1s)
37. Time Series Analysis – make-series operator

[Rule]
T | make-series [MakeSeriesParameters] [Column =] Aggregation [default = DefaultValue] [, ...] on AxisColumn from start to end step step [by [Column =] GroupExpression [, ...]]

[Example]
T | make-series sum(amount) default=0, avg(price) default=0 on timestamp from datetime(2016-01-01) to datetime(2016-01-10) step 1d by supplier
38. Time Series Analysis – basket operator
Basket finds all frequent patterns of discrete attributes (dimensions) in the data and returns all frequent patterns that passed the frequency threshold in the original query.

[Rule]
T | evaluate basket([Threshold, WeightColumn, MaxDimensions, CustomWildcard, CustomWildcard, ...])

[Example]
StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0, "YES", "NO")
| project State, EventType, Damage, DamageCrops
| evaluate basket(0.2)
39. Time Series Analysis – autocluster operator
AutoCluster finds common patterns of discrete attributes (dimensions) in the data and reduces the results of the original query (whether it's 100 or 100k rows) to a small number of patterns.

[Rule]
T | evaluate autocluster([SizeWeight, WeightColumn, NumSeeds, CustomWildcard, CustomWildcard, ...])

[Example]
StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0, "YES", "NO")
| project State, EventType, Damage
| evaluate autocluster(0.6)

[Example with custom wildcards]
StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0, "YES", "NO")
| project State, EventType, Damage
| evaluate autocluster(0.2, '~', '~', '*')
41. ADX Functions
Functions are reusable queries or query parts. Kusto supports several kinds of functions:
• Stored functions: user-defined functions that are stored and managed as one kind of database schema entity. See Stored functions.
• Query-defined functions: user-defined functions that are defined and used within the scope of a single query. Such functions are defined through a let statement. See User-defined functions.
• Built-in functions: hard-coded functions, defined by Kusto and not modifiable by users.
Minimal examples of the first two kinds follow below.
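As a hedged illustration (the function, table, and column names are hypothetical):

// A stored function, persisted as a database schema entity
.create function with (docstring = 'Errors from the last N hours', folder = 'Demo')
LastErrors(hours:long) {
    Logs
    | where Level == 'Error' and Timestamp > ago(1h * hours)
}

// A query-defined function, scoped to a single query via let
let lastErrors = (hours:long) {
    Logs | where Level == 'Error' and Timestamp > ago(1h * hours)
};
lastErrors(24) | count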
42. Materialized views
A materialized view exposes an always up-to-date view of the defined aggregation.

Advantages:
• Performance improvement
• Freshness
• Cost reduction

Behind the scenes:
• The source table is periodically materialized into the view table
• At query time, the view combines the materialized part with the delta in the raw table since the last materialization, to return complete results
A minimal creation example follows below.
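As a hedged sketch (the view, table, and column names are hypothetical), a typical "last reading per device" view looks like this:

.create materialized-view DeviceLastReading on table Telemetry
{
    Telemetry
    | summarize arg_max(Timestamp, *) by DeviceId
}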
45. Export
• To Storage:
.export async compressed to csv (
    h@"https://storage1.blob.core.windows.net/containerName;secretKey",
    h@"https://storage1.blob.core.windows.net/containerName2;secretKey"
) with (
    sizeLimit=100000, namePrefix=export, includeHeaders=all, encoding=UTF8NoBOM
) <| myLogs | where id == "moshe" | limit 10000

• To SQL:
.export async to sql ['dbo.MySqlTable']
h@"Server=tcp:myserver.database.windows.net,1433;Database=MyDatabase;Authentication=Active Directory Integrated;Connection Timeout=30;"
with (createifnotexists="true", primarykey="Id")
<| print Message = "Hello World!", Timestamp = now(), Id=12345678

1. DEFINE THE COMMAND: define the ADX command and try your recurrent export strategy
2. TRY IT IN AN EDITOR: use an editor to try the command, verifying connection strings and parameterizing them
3. BUILD A JOB: build a notebook or a C# job that uses the command as a SQL query in your code
46. External tables & Continuous Export
• An external table is an external endpoint:
• Azure Storage
• Azure Data Lake Store
• SQL Server
• You need to define:
• The destination
• A continuous-export strategy

EXTERNAL TABLE CREATION:
.create external table ExternalAdlsGen2 (Timestamp:datetime, x:long, s:string)
kind=adl
partition by bin(Timestamp, 1d)
dataformat=csv
( h@'abfss://filesystem@storageaccount.dfs.core.windows.net/path;secretKey' )
with ( docstring = "Docs", folder = "ExternalTables", namePrefix="Prefix" )

EXPORT TO THE EXTERNAL TABLE:
.create-or-alter continuous-export MyExport over (T) to table ExternalAdlsGen2
with (intervalBetweenRuns=1h, forcedLatency=10m, sizeLimit=104857600) <| T
47. My best experience
Open points:
• How to extract insights using a dynamic, codeless approach?
• How to integrate ADX with low-cost DB solutions?
50. ADX Dashboards
• Integrated into Kusto Web Explorer
• Optimized for big data
• Uses powerful KQL to retrieve visual data
• Builds dynamic views and widgets
53. Grafana query builder
• Create Grafana panels with no KQL knowledge
• Select values/filters/grouping using simple UI dropdowns
• Switch to raw mode to enhance queries with KQL
54. How to use Grafana easily
Go to the All Plugins section, search for the ADX datasource, and install the plugin.

55. How to use Grafana easily
Go to https://grafana.com/, sign up, and get an account.

56. How to use Grafana easily
Go to your Grafana instance at https://<workbenchname>.grafana.net/datasources and configure the ADX datasource. Then start building dashboards!
58. How about orchestration?
Three use cases in which Flow + Kusto are the solution:
• Push data to a Power BI dataset: periodically run queries and push the results to a Power BI dataset
• Conditional queries: run data checks and send notifications with no code
• Email multiple ADX Flow charts: send rich emails with an HTML5 chart as the query result
59. Orchestration?
• Manage costs: start and stop the cluster by evaluating a condition
• Query sets to check data: schedule a set of queries in order to say «IT'S OK, even today!»
• Manage data retention: based on a dynamic condition
60. An example:
1. Set the trigger. 2. Connect and test the ADX block. 3. Configure the Email block with dynamic params.
63. Data encryption in ADX
• Encryption at rest (using Azure Storage encryption)
• A Microsoft-managed key is used by default
• Customer-managed keys can be enabled
• Key rotation, temporary disabling, and access-revocation controls can be implemented
• Soft Delete and Purge Protection will be enabled on the Key Vault and cannot be disabled
64. Extents, policies and partitioning
• What data shards (extents) are
• Columns, segments, and blocks
• Merge policy and sharding policy
• Data partitioning policy (post-ingestion)
A sketch of the related commands follows below.
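As a hedged sketch (the table and partition-key names are hypothetical):

// Inspect the table's data shards (extents)
.show table MyTable extents

// Apply a post-ingestion data partitioning policy
.alter table MyTable policy partitioning
'{"PartitionKeys": [{"ColumnName": "TenantId", "Kind": "Hash", "Properties": {"Function": "XxHash64", "MaxPartitionCount": 128}}]}'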
65. FACTS:
A) Kusto stores its ingested data in reliable storage (most commonly Azure Blob Storage).
B) To speed up queries on that data, Kusto caches this data (or parts of it) on its processing nodes.

The Kusto cache provides a granular cache policy that customers can use to differentiate between two data cache policies: hot data cache and cold data cache.

You can specify which location must be used:
set query_datascope="hotcache";
T | union U | join (T datascope=all | where Timestamp < ago(365d)) on X

The cache policy is independent from the retention policy!
A minimal cache-policy example follows below.
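As a hedged one-liner (the table name and the 7-day window are hypothetical), the hot cache is set with the caching policy:

.alter table MyTable policy caching hot = 7d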
66. Retention policy
Two parameters, applicable to a DB or a table:
• Soft Delete Period (number)
• Data is available for query (ts is the ADX ingestion date)
• Default is set to 100 YEARS
• Recoverability (enabled/disabled)
• Default is set to ENABLED
• Recoverable for 14 days after deletion

.alter database DatabaseName policy retention "{}"
.alter table TableName policy retention "{}"

EXAMPLE:
{ "SoftDeletePeriod": "36500.00:00:00", "Recoverability": "Enabled" }

.delete database DatabaseName policy retention
.delete table TableName policy retention
.alter-merge table MyTable1 policy retention softdelete = 7d
67. Data Purge
PURGE PROCESS:
1. It requires database admin permissions
2. Prior to purging, you have to be ENABLED by opening a SUPPORT TICKET
3. Run the purge QUERY to identify the SIZE and EXECUTION TIME; it returns a VerificationToken
4. Run the actual purge QUERY, passing the verification token

2-STEP PROCESS:
.purge table MyTable records in database MyDatabase <| where CustomerId in ('X', 'Y')

// Returns, for example:
// NumRecordsToPurge: 1,596
// EstimatedPurgeExecutionTime: 00:00:02
// VerificationToken: e43c7184ed22f4f23c7a9d7b124d196be2e570096987e5baadf65057fa65736b

.purge table MyTable records in database MyDatabase with (verificationtoken='e43c7184ed22f4f23c7a9d7b124d196be2e570096987e5baadf65057fa65736b') <| where CustomerId in ('X', 'Y')

1-STEP PROCESS (with no regrets!):
.purge table MyTable records in database MyDatabase with (noregrets='true')
68. Virtual Network
BENEFITS:
• Use NSG rules to limit traffic
• Connect your on-premises network to the Azure Data Explorer cluster's subnet
• Secure your data connection sources (Event Hub and Event Grid) with service endpoints

A VNet gives you TWO independent IPs:
• Private IP: access the cluster inside the VNet
• Public IP: access the cluster from outside the VNet (management and monitoring) and as a source address for outbound connections initiated from the cluster
69. Row level security
• Provides fine control of access to table data by different users
• Allows specifying user access to specific rows in tables
• Provides mechanics to mask PII data in tables

.create-or-alter function with () TrimCreditCardNumbers() {
    let UserCanSeeFullNumbers = current_principal_is_member_of('aadgroup=super_group@domain.com');
    let AllData = Customers | where UserCanSeeFullNumbers;
    let PartialData = Customers | where not(UserCanSeeFullNumbers) | extend CreditCardNumber = "****";
    union AllData, PartialData
}

.alter table Customers policy row_level_security enable "TrimCreditCardNumbers"
70. Leader and Follower
• Azure Data Share creates a symbolic link between two ADX clusters
• Sharing occurs in near-real-time (no data pipeline)
• ADX decouples storage and compute
• Allows customers to run multiple compute (read-only) instances on the same underlying storage
• You can attach a database as a follower database, which is a read-only database on a remote cluster
• You can share the data at the database level or at the cluster level

The cluster sharing the database is the leader cluster and the cluster receiving the share is the follower cluster. A follower cluster can follow one or more leader-cluster databases. The follower cluster periodically synchronizes to check for changes. The queries running on the follower cluster use the local cache and don't use the resources of the leader cluster.
73. What is ADX for me, today
• A telemetry-data search engine => an ELK replacement
• A TSDB involved in lambda-architecture replacements (as the WARM path) => an OSS lambda (MinIO + Kafka) replacement
• A tool to materialize data into ADLS & SQL
• A tool for monitoring, summarizing information, and sending notifications
74. Which are the OSS alternatives that we should compare with?
From db-engines.com:
• Azure Data Explorer: fully managed big data interactive analytics platform
• Elasticsearch: a distributed, RESTful modern search and analytics engine
• Splunk: a real-time insights engine to boost productivity & security
• InfluxDB: a DBMS for storing time series, events and metrics

ADX can be a replacement for search and log analytics engines such as Elasticsearch, Splunk, and InfluxDB.
75. Comparison chart

Attribute | Elasticsearch (Elastic) | InfluxDB (InfluxData Inc.) | Azure Data Explorer (Microsoft) | Splunk (Splunk Inc.)
Description | A distributed, RESTful modern search and analytics engine based on Apache Lucene | DBMS for storing time series, events and metrics | Fully managed big data interactive analytics platform | Analytics platform for big data
Database models | Search engine, document store | Time series DBMS | Time series DBMS, search engine, document store, event store, relational DBMS | Search engine
Initial release | 2010 | 2013 | 2019 | 2003
License | Open source | Open source | Commercial | Commercial
Cloud-based only | no | no | yes | no
Implementation language | Java | Go | n/a | n/a
Server operating systems | All OS with a Java VM | Linux, OS X | hosted | Linux, OS X, Solaris, Windows
Data scheme | schema-free | schema-free | Fixed schema with schema-less datatypes (dynamic) | yes
Typing | yes | Numeric data and strings | yes | yes
XML support | no | no | yes | yes
Secondary indexes | yes | no | all fields are automatically indexed | yes
SQL | SQL-like query language | SQL-like query language | Kusto Query Language (KQL), SQL subset | no
APIs and other access methods | RESTful HTTP/JSON API, Java API | HTTP API, JSON over UDP | RESTful HTTP API, Microsoft SQL Server communication protocol (MS-TDS) | HTTP REST
Supported programming languages | .Net, Java, JavaScript, Python, Ruby, PHP, Perl, Groovy, community-contributed clients | .Net, Java, JavaScript, Python, R, Ruby, PHP, Perl, Haskell, Clojure, Erlang, Go, Lisp, Rust, Scala | .Net, Java, JavaScript, Python, R, PowerShell | .Net, Java, JavaScript, Python, Ruby, PHP
Server-side scripts | yes | no | yes (KQL, Python, R) | yes
Triggers | yes | no | yes | yes
Partitioning methods | Sharding | Sharding | Sharding | Sharding
Replication methods | yes | selectable replication factor | yes | Master-master replication
MapReduce | ES-Hadoop connector | no | no | yes
Consistency concepts | Eventual consistency | Eventual consistency | Eventual consistency | Immediate consistency
Foreign keys | no | no | no | no
Transaction concepts | no | no | no | no
Concurrency | yes | yes | yes | yes
Durability | yes | yes | yes | yes
In-memory capabilities | Memcached and Redis integration | yes | no | no
77. Why ADX is Unique
Simplified costs:
• VM costs
• ADX service add-on cost

Many prebuilt inputs:
• ADF
• IoT Hub
• Event Hub
• Storage
• Logstash
• Kafka
• Fluent Bit

Many prebuilt outputs:
• Power BI
• ODBC connector
• Jupyter
• Grafana
78. Azure Data Explorer
[Architecture diagram. Ingestion: Blob, IoT Hub, Event Hub, and ADF feed Azure Data Explorer via queued, streaming, bulk, and direct ingestion (Python SDK, .NET SDK, Java SDK, REST API). APIs and UX: Web UI, desktop app, Jupyter magic, Monaco IDE, Azure Notebooks, .NET/Python/Java/JavaScript SDKs, MS-TDS protocol. Connectors: Power BI Direct Query, Microsoft Flow, Azure Logic Apps, Grafana, ADF.]
79.
Comprehensive strength:
• Metrics and time-series data
• Text search and text analytics
• Multi-dimensional/relational analysis

Analytics query language:
• Simple and powerful
• Publicly available
• Data exploration
• Rich relational query language
• Full-text search
• ML extensibility

Control:
• Scale out in hardware
• Scale out across geos
• Granular resource utilization

High performance over large data sets:
• Cross-geo queries

Data ingestion and management:
• Low-latency ingestion
• Schema management
• Compression and indexing
• Retention
• Hot/cold resource allocation
80. Everything Our User Group Has To Offer
• Get involved in our Meetup
• Join the conversation on our Facebook group
• Follow our page on Facebook
• Follow our videos on YouTube
• Explore
https://bit.ly/2P9sqLy
https://bit.ly/2QqAWX4
https://bit.ly/3auvRnD
https://bit.ly/3n8l5bP
82. Summary
Use ADX to:
• Understand data
• Visualize KPIs and refine them
• Manage a long-term storage strategy
• Feed a facts table in the DWH
• Trigger daily auto checks

Riccardo.zamana@gmail.com