The document discusses U-SQL's built-in extractors and outputters for reading and writing files. It describes how the EXTRACT and OUTPUT expressions work with various file formats like CSV, TSV, JSON and XML. It also covers file paths, parallel processing, limits, column options and virtual columns for partitioning data.
U-SQL Reading & Writing Files
1. Michael Rys
Principal Program Manager, Big Data @ Microsoft
@MikeDoesBigData, {mrys, usql}@microsoft.com
2.
EXTRACT Expression
@s = EXTRACT a string, b int
FROM "filepath/file.csv"
USING Extractors.Csv(encoding: Encoding.Unicode);
• Built-in Extractors: Csv, Tsv, Text, each with many options
• Custom Extractors: e.g., JSON, XML
OUTPUT Expression
OUTPUT @s
TO "filepath/file.csv"
USING Outputters.Csv();
• Built-in Outputters: Csv, Tsv, Text
• Custom Outputters: e.g., JSON, XML
Filepath URIs
• Relative URI to default ADL Storage account: "filepath/file.csv"
• Absolute URIs:
• ADLS: "adl://account.azuredatalakestore.net/filepath/file.csv"
• WASB: "wasb://container@account/filepath/file.csv"
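For illustration, a minimal end-to-end sketch combining the EXTRACT, OUTPUT, and filepath pieces above; the WASB container and account names are placeholders:

@s = EXTRACT a string, b int
FROM "wasb://mycontainer@myaccount/input/file.csv" // absolute WASB URI (placeholder names)
USING Extractors.Csv(encoding: Encoding.Unicode);

OUTPUT @s
TO "output/file.tsv" // relative URI into the default ADLS account
USING Outputters.Tsv();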
4.
Built-In Extractors and Outputters
• Extractors.Csv(), Extractors.Tsv(), Extractors.Text()
• Outputters.Csv(), Outputters.Tsv(), Outputters.Text()
Parallel Execution Extractors
• Every file is stored in Extents of about 250MB
• One extract vertex gets four extract processes, each working on one extent
• Today:
• Upload Data as row-oriented files
• Use CR/LF as row-delimiters
• This aligns row boundaries to extent boundaries
• Otherwise: you can get data corruption or errors
Parallel Outputters
• Writes parallel extents
• Supports ORDER BY
• Stitching of extents to files
• Metadata operation for adl:// files
• Expensive copy operation for wasb:// files!!!
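As a sketch, an ordered output using the ORDER BY support mentioned above; @rs, the column name, and the account are placeholders, and the adl:// target keeps the final stitching a cheap metadata operation:

OUTPUT @rs
TO "adl://myaccount.azuredatalakestore.net/output/ordered.csv" // placeholder account
ORDER BY name ASC // rows in the output file are globally ordered
USING Outputters.Csv();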
Limits
• Row size: 4MB
• String column: 128kB; byte[]: up to 4MB
• SQL.MAP, SQL.ARRAY not supported (transform needed)
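Because the built-in extractors cannot produce SQL.MAP or SQL.ARRAY directly, one sketch of the needed transform is to extract the column as a string and split it afterwards; the schema and file name are illustrative:

@s = EXTRACT id int,
             tags string // e.g. "a;b;c"
FROM "/input/data.csv"
USING Extractors.Csv();

@t = SELECT id,
            new SQL.ARRAY<string>(tags.Split(';')) AS tags // string -> SQL.ARRAY
FROM @s;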
5. • delimiter: column delimiter (char; Text() only)
• encoding: file encoding (System.Text.Encoding)
• Encoding.[ASCII] (7-bit)
• Encoding.BigEndianUnicode
• Encoding.Unicode
• Encoding.UTF7
• Encoding.UTF8 (This is the default)
• Encoding.UTF32
• CAVEAT: No ANSI support yet!
• escapeCharacter: escaping of delimiters (including CR/LF)
• nullEscape: allows surrogate for null value
• quoting: allows quoting column values with "
• Default is true
• Does NOT guard the row delimiter!!! (use escapeCharacter)
• rowDelimiter: row delimiter
• Default: CR LF
• silent: allows skipping rows with an invalid number of columns
and turns data type conversion errors into nulls (Extractors only)
• CAVEAT: Does not skip encoding errors
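A sketch combining several of these options on Extractors.Text(); the file, schema, and option values are illustrative:

@rows = EXTRACT id int, name string, comment string
FROM "/input/data.txt"
USING Extractors.Text(
    delimiter: '|',          // column delimiter (Text() only)
    encoding: Encoding.UTF8, // the default
    escapeCharacter: '\\',   // escapes delimiters, including CR/LF
    nullEscape: "\\N",       // this token is read as null
    silent: true);           // skip bad rows; null out conversion errors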
6. E_RUNTIME_USER_EXTRACT_INVALID_CHARACTER: Invalid character for UTF8 encoding in input stream.
Message: Invalid character for UTF8 encoding in input record at around line 0
Resolution: Correct the invalid character in the input file or correct the encoding in the extractor and try again.
Details: 0xFF 0xFE 0x31 0x0 0x9 0x0 0x4D 0x0
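Since 0xFF 0xFE is the UTF-16 little-endian byte-order mark, a sketch of the fix is to re-read the file with the UTF-16 encoding:

@s = EXTRACT a string, b int
FROM "filepath/file.csv"
USING Extractors.Csv(encoding: Encoding.Unicode); // Encoding.Unicode = UTF-16 LE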
9.
Simple pattern language on filename and path
DECLARE @pattern string =
    "/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}";
• Binds two columns date and suffix
• Wildcards the filename
• Today: Limits on number of files (between 800 and 3000)
Virtual columns
@s = EXTRACT name string
           , suffix string // virtual column
           , date DateTime // virtual column
FROM @pattern
USING Extractors.Csv();
• Refer to virtual columns in query to get partition elimination
• Virtual columns must be referenced in the query for DateTime columns and
when no wildcard has been given
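For instance, a sketch of a downstream query that gets partition elimination by constraining the date virtual column; the date range is illustrative:

@result = SELECT name, suffix
FROM @s
WHERE date >= DateTime.Parse("2016-01-01")
  AND date < DateTime.Parse("2016-02-01"); // only files under /input/2016/01/ are read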
OUTPUT
OUTPUT @rs TO "/output/file_{*}.csv" USING Outputters.Csv();
• One file per outputter invocation; {*} is replaced with a unique GUID