USQL Trivadis Azure Data Lake Event

This is an introduction to Azure Data Lake and U-SQL, presented by Michael Rys at the Trivadis Azure Data Lake Event.

  1. 4 Data sources; Non-relational data; DESIGNED FOR THE QUESTIONS YOU KNOW!
  2. The Data Lake Approach
     • Ingest all data regardless of requirements
     • Store all data in native format without schema definition
     • Do analysis: Hadoop, Spark, R, Azure Data Lake Analytics (ADLA)
     Diagram labels: Interactive queries, Batch queries, Machine Learning, Data warehouse, Real-time analytics, Devices
  3. Microsoft’s Big Data Journey
     We needed to better leverage data and analytics to do more experimentation.
     So, we built a Data Lake for Microsoft:
     • A data lake for everyone to put their data
     • Tools approachable by any developer
     • Batch, Interactive, Streaming, ML
     By the numbers:
     • Exabytes of data under management
     • 100Ks of Physical Servers
     • 100Ks of Batch Jobs, Millions of Interactive Queries
     • Huge Streaming Pipelines
     • 10K+ Developers running diverse workloads and scenarios
     Chart: Data Stored, 2010 / 2013 / 2017, by group: Windows, SMSG, Live, Bing, CRM/Dynamics, Xbox Live, Office365, Malware Protection, Microsoft Stores, Commerce Risk, Skype, LCA, Exchange, Yammer
  4. Culture Changes
     • Engineering: How is the system performing? What is the experience my customers are having? How does that correlate to other actions? Is my feature successful?
     • Marketing: What can we observe from our customers to increase revenues?
     • Management: How do I drive my business based on the data?
     • Field: Where are there new opportunities? How can I connect with my customers more deeply?
     • Support: How does this customer’s experience compare with others?
  5. Diagram labels: HDFS Compatible REST API; ADL Store; .NET, SQL, Python, R scaled out by U-SQL; ADL Analytics; Open Source Apache Hadoop ADL Client; Azure Databricks; HDInsight; Hive
     • Performance at scale
     • Optimized for analytics
     • Multiple analytics engines
     • Single repository sharing
  6. Storage (diagram labels: HDFS Compatible REST API; ADL Store)
     • Architected and built for very high throughput at scale for Big Data workloads
     • No limits to file size, account size or number of files
     • Single repository for sharing
     • Cloud-scale distributed filesystem with file/folder ACLs and RBAC
     • Encryption-at-rest by default with Azure Key Vault
     • Authenticated access with Azure Active Directory integration
     • Formal certifications incl. ISO, SOC, PCI, HIPAA
  7. Diagram labels: HDFS Compatible REST API; ADL Store; Analytics; Storage; Cloudera CDH; Hortonworks HDP; Qubole QDS
     • Open Source Apache® ADL client for commercial and custom Hadoop
     • Cloud IaaS and Hybrid
  8. AZURE DATABRICKS: A FAST, EASY, AND COLLABORATIVE APACHE SPARK BASED ANALYTICS PLATFORM
     Best of Databricks, Best of Microsoft:
     • Designed in collaboration with the founders of Apache Spark
     • One-click set up; streamlined workflows
     • Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts
     • Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)
     • Enterprise-grade Azure security (Active Directory integration, compliance, enterprise-grade SLAs)
  9. HDInsight (diagram labels: HDFS Compatible REST API; ADL Store; Hive; Analytics; Storage)
     • 63% lower TCO than on-premises*
     • SLA-managed, monitored and supported by Microsoft
     • Fully managed Hadoop, Spark and R
     • Clusters deployed in minutes
     *IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
  10. ADL Analytics (diagram labels: HDFS Compatible REST API; ADL Store; .NET, SQL, Python, R scaled out by U-SQL)
     • Serverless. Pay per job. Starts in seconds. Scales instantly.
     • Develop massively parallel programs with simplicity
     • Federated query from multiple data sources
  11. U-SQL: A framework for Big Data
     • Scales out your custom code in .NET, Python, R over your Data Lake
     • Familiar syntax to millions of SQL & .NET developers
     • Unifies:
       • Declarative nature of SQL with the imperative power of your language of choice (e.g., C#, Python)
       • Processing of structured, semi-structured and unstructured data
       • Querying multiple Azure Data Sources (Federated Query)
  12. • SQL forms the declarative basis of the language:
       • GROUP BY/Aggs
       • Windowing Expressions
       • PIVOT/UNPIVOT
       • CROSS APPLY
       • JOINs
       • Etc.
     • Uses .NET types and the C# expression language
     • Rich extensibility model that lets you scale out your custom extension code written in .NET/C#, Python, R
     • Operates on unstructured data (CSV, images, etc.)
     • Operates on semi-structured data (XML, JSON, Avro)
     • Operates on structured files (Parquet)
     • Provides Metadata Catalog (DB, Schema):
       • U-SQL Tables (for improved performance)
       • U-SQL code objects (Views, TVFs, Procs)
       • Extension code objects (U-SQL Assemblies)
       • Etc.
     • Provides Federated Queries against “SQL in Azure”
  13. Develop massively parallel programs with simplicity
     A simple U-SQL script can scale from gigabytes to petabytes without learning complex big data programming techniques. U-SQL automatically generates a scaled-out and optimized execution plan to handle any amount of data. Execution nodes are rapidly allocated to run the program. Error handling, network issues, and runtime optimization are handled automatically.

         @searchlog =
             EXTRACT UserId int,
                     Start DateTime,
                     Region string,
                     Query string,
                     Duration int,
                     Urls string,
                     ClickedUrls string
             FROM @"/Samples/Data/SearchLog.tsv"
             USING Extractors.Tsv();

         OUTPUT @searchlog
         TO @"/Samples/Output/SearchLog_output.tsv"
         USING Outputters.Tsv();
  14. • Admin and Dev Tooling in:
       • Azure Portal
       • Visual Studio 2013 to 2017 (with local execution mode!)
       • VS Code (cross platform)
     • Azure Data Factory:
       • Data movement
       • Job submission and orchestration
     • PowerShell and Cross-Platform CLI support
     • SDKs for common languages:
       • .NET
       • Java
       • Python
       • Node.js
  15. • Automatic "in-lining" optimized out-of-the-box
     • Per job parallelization visibility into execution
     • Heatmap to identify bottlenecks
  16. https://github.com/Azure/usql/tree/master/Examples/ImageApp
     https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-cognitive
     Image tags shown: Car, Green, Parked, Outdoor, Racing
  17. High-level Roadmap
     • Worldwide Region Availability (currently US and EU)
     • Interactive Access with T-SQL query
     • Scale out your custom code in the language of choice (.NET, Java, Python, etc.)
     • Process the data formats of your choice (incl. Parquet, ORC; larger string values)
     • Continued ADF, AAS, ADC, SQL DW, EventHub, SSIS integration
     • Administrative policies to control usage/cost for storage & compute
     • Secure data sharing between common AAD and public read-only sharing, fine grained ACLing
     • Intense focus on developer productivity for authoring, debugging, and optimization
     • General customer feedback: http://aka.ms/adlfeedback
  18. Resources
     • http://usql.io
     • http://blogs.msdn.microsoft.com/azuredatalake/
     • http://blogs.msdn.microsoft.com/mrys/
     • https://channel9.msdn.com/Search?term=U-SQL#ch9Search
     • http://aka.ms/usql_reference
     • https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-programmability-guide
     • https://docs.microsoft.com/en-us/azure/data-lake-analytics/
     • https://msdn.microsoft.com/en-us/magazine/mt614251
     • https://msdn.microsoft.com/magazine/mt790200
     • http://www.slideshare.net/MichaelRys
     • Getting Started with R in U-SQL
     • https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-python-extensions
     • https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
     • http://stackoverflow.com/questions/tagged/u-sql
     • http://aka.ms/adlfeedback
     Continue your education at Microsoft Virtual Academy online.
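
The SearchLog script on slide 13 only copies its input to an output file. As a minimal sketch of the "GROUP BY/Aggs" feature listed on slide 12, the same extraction could feed an aggregation. The rowset name @durationByRegion, the column alias TotalDuration, and the output path are assumptions made for illustration; they do not appear in the deck.

```usql
// Read the sample search log with the same schema as the script on slide 13.
@searchlog =
    EXTRACT UserId int,
            Start DateTime,
            Region string,
            Query string,
            Duration int,
            Urls string,
            ClickedUrls string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

// Declarative SQL part: total search duration per region.
// @durationByRegion and TotalDuration are illustrative names.
@durationByRegion =
    SELECT Region,
           SUM(Duration) AS TotalDuration
    FROM @searchlog
    GROUP BY Region;

// Write the result, largest total first. The output path is an assumption.
OUTPUT @durationByRegion
TO @"/Samples/Output/SearchLog_by_region.tsv"
ORDER BY TotalDuration DESC
USING Outputters.Tsv();
```

The script stays a single declarative unit: nothing in it says how many nodes to use or how to partition the file, which is the point slide 13 makes about the system generating the scaled-out plan.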