The document discusses U-SQL's built-in extractors and outputters for reading and writing files. It describes how the EXTRACT and OUTPUT expressions work with various file formats like CSV, TSV, JSON and XML. It also covers file paths, parallel processing, limits, column options and virtual columns for partitioning data.
U-SQL Reading & Writing Files
1. Michael Rys
Principal Program Manager, Big Data @ Microsoft
@MikeDoesBigData, {mrys, usql}@microsoft.com
2.
EXTRACT Expression
@s = EXTRACT a string, b int
FROM "filepath/file.csv"
USING Extractors.Csv(encoding: Encoding.Unicode);
• Built-in Extractors: Csv, Tsv, Text, each with many options
• Custom Extractors: e.g., JSON, XML
OUTPUT Expression
OUTPUT @s
TO "filepath/file.csv"
USING Outputters.Csv();
• Built-in Outputters: Csv, Tsv, Text
• Custom Outputters: e.g., JSON, XML
Filepath URIs
• Relative URI to default ADL Storage account: "filepath/file.csv"
• Absolute URIs:
• ADLS: "adl://account.azuredatalakestore.net/filepath/file.csv"
• WASB: "wasb://container@account/filepath/file.csv"
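For illustration, a minimal end-to-end sketch combining the EXTRACT, OUTPUT, and filepath pieces above; the WASB container and account names are placeholders:

@s = EXTRACT a string, b int
FROM "wasb://mycontainer@myaccount/input/file.csv" // absolute WASB URI (placeholder names)
USING Extractors.Csv(encoding: Encoding.Unicode);

OUTPUT @s
TO "output/file.tsv" // relative URI into the default ADLS account
USING Outputters.Tsv();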
4.
Built-In Extractors and Outputters
• Extractors.Csv(), Extractors.Tsv(), Extractors.Text()
• Outputters.Csv(), Outputters.Tsv(), Outputters.Text()
Parallel Execution Extractors
• Every file is stored in Extents of about 250MB
• One extract vertex gets four extract processes, each working on one extent
• Today:
• Upload Data as row-oriented files
• Use CR/LF as row-delimiters
• This aligns row boundaries to extent boundaries
• Otherwise: you can get data corruption or errors
Parallel Outputters
• Writes parallel extents
• Supports ORDER BY
• Stitching of extents to files
• Metadata operation for adl:// files
• Expensive copy operation for wasb:// files!!!
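As a sketch, an ordered output using the ORDER BY support mentioned above; @rs, the column name, and the account are placeholders, and the adl:// target keeps the final stitching a cheap metadata operation:

OUTPUT @rs
TO "adl://myaccount.azuredatalakestore.net/output/ordered.csv" // placeholder account
ORDER BY name ASC // rows in the output file are globally ordered
USING Outputters.Csv();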
Limits
• Row size: 4MB
• String column: 128kB; byte[]: up to 4MB
• SQL.MAP, SQL.ARRAY not supported (transform needed)
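Because the built-in extractors cannot produce SQL.MAP or SQL.ARRAY directly, one sketch of the needed transform is to extract the column as a string and split it afterwards; the schema and file name are illustrative:

@s = EXTRACT id int,
             tags string // e.g. "a;b;c"
FROM "/input/data.csv"
USING Extractors.Csv();

@t = SELECT id,
            new SQL.ARRAY<string>(tags.Split(';')) AS tags // string -> SQL.ARRAY
FROM @s;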
5. • delimiter: column delimiter (char; Text() only)
• encoding: file encoding (System.Text.Encoding)
• Encoding.[ASCII] (7-bit)
• Encoding.BigEndianUnicode
• Encoding.Unicode
• Encoding.UTF7
• Encoding.UTF8 (This is the default)
• Encoding.UTF32
• CAVEAT: No ANSI support yet!
• escapeCharacter: escaping of delimiters (including CR/LF)
• nullEscape: allows surrogate for null value
• quoting: allows quoting column values with "
• Default is true
• Does NOT guard the row delimiter!!! (use escapeCharacter)
• rowDelimiter: row delimiter
• Default: CR LF
• silent: allows skipping rows with an invalid number of columns
and turns data type conversion errors into nulls (Extractors only)
• CAVEAT: Does not skip encoding errors
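A sketch combining several of these options on Extractors.Text(); the file, schema, and option values are illustrative:

@rows = EXTRACT id int, name string, comment string
FROM "/input/data.txt"
USING Extractors.Text(
    delimiter: '|',          // column delimiter (Text() only)
    encoding: Encoding.UTF8, // the default
    escapeCharacter: '\\',   // escapes delimiters, including CR/LF
    nullEscape: "\\N",       // this token is read as null
    silent: true);           // skip bad rows; null out conversion errors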
6. E_RUNTIME_USER_EXTRACT_INVALID_CHARACTER: Invalid character for UTF8 encoding in input stream.
Message: Invalid character for UTF8 encoding in input record at around line 0
Resolution: Correct the invalid character in the input file or correct the encoding in the extractor and try again.
Details: 0xFF 0xFE 0x31 0x0 0x9 0x0 0x4D 0x0
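Since 0xFF 0xFE is the UTF-16 little-endian byte-order mark, a sketch of the fix is to re-read the file with the UTF-16 encoding:

@s = EXTRACT a string, b int
FROM "filepath/file.csv"
USING Extractors.Csv(encoding: Encoding.Unicode); // Encoding.Unicode = UTF-16 LE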
9.
Simple pattern language on filename and path
DECLARE @pattern string =
    "/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}";
• Binds two columns date and suffix
• Wildcards the filename
• Today: Limits on number of files (between 800 and 3000)
Virtual columns
@s = EXTRACT name string
           , suffix string // virtual column
           , date DateTime // virtual column
FROM @pattern
USING Extractors.Csv();
• Refer to virtual columns in query to get partition elimination
• Virtual columns must be referenced in the query for DateTime columns and
when no wildcard has been given
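For instance, a sketch of a downstream query that gets partition elimination by constraining the date virtual column; the date range is illustrative:

@result = SELECT name, suffix
FROM @s
WHERE date >= DateTime.Parse("2016-01-01")
  AND date < DateTime.Parse("2016-02-01"); // only files under /input/2016/01/ are read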
OUTPUT
OUTPUT @rs TO "/output/file_{*}.csv" USING Outputters.Csv();
• One file per outputter invocation; {*} is replaced with a unique GUID