1. Ingestion 101
Presenter: Oleg Krook
September 29-30, 2012
Boston, MA
Contains Company Confidential Material - Do Not Disclose
2. Ingestion Pipeline Overview
Landing Zone provides an entry point for data
Input data is defined in Ed-Fi format
Found at http://www.ed-fi.org/technical-documentation/
Two input methods supported:
• XML files followed by a control file
• compressed ZIP file containing above files
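The second input method can be sketched in a few lines of Python. This is a minimal illustration, not part of the ingestion tooling: the function name, the flat archive layout (all files at the root of the ZIP), and the control file name used in the usage note are assumptions.

```python
import zipfile
from pathlib import Path


def build_landing_zone_zip(control_file: Path, data_files: list[Path], zip_path: Path) -> Path:
    """Bundle Ed-Fi XML data files and their control file into one ZIP archive.

    Assumption: files sit at the root of the archive under their own names.
    """
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Data files are written first and the control file last, mirroring the
        # "XML files followed by a control file" order used for loose files.
        for data_file in data_files:
            archive.write(data_file, arcname=data_file.name)
        archive.write(control_file, arcname=control_file.name)
    return zip_path
```

For example, build_landing_zone_zip(Path("job.ctl"), [Path("data.xml")], Path("job.zip")) would produce job.zip holding both files; "job.ctl" is a hypothetical control file name.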
3. Anatomy of an ingestion job: Control files, Ed-Fi
Control File Format
The control file is used solely to define the set of inbound data files and to perform
basic integrity checking on these files. It contains a row of comma-separated values for each
data file. Leading/trailing spaces are considered part of the values and will not be trimmed. The
last value in any row must not be followed by a comma.
The row format is:
<file format>,<file type>,<file name>,<file checksum>
where
<file format> Specifies the file format.
At this time, edfi-xml is the only supported file format
<file type> Represents the type of object(s) found in the file.
In the case of Ed-Fi XML, the file type maps to the name of the appropriate interchange
schema.
4. Anatomy of an ingestion job: Control files, Ed-Fi (cont.)
<file name> Specifies the file's name.
File names are case sensitive. This field may optionally be enclosed in double quotes.
File names containing double quotes and/or commas should be enclosed in double quotes.
A double quote appearing inside a field must be escaped by preceding it with another
double quote.
<file checksum> Is the file's MD5 checksum.
The MD5 checksum is expressed as 32 hexadecimal digits with alphabetic characters
always in lowercase.
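The row rules above (quoting, quote escaping, lowercase MD5) can be sketched in Python as follows. The helper names are hypothetical and only illustrate the format; they are not part of the ingestion tooling.

```python
import hashlib
from pathlib import Path


def md5_hex(path: Path) -> str:
    """Return the file's MD5 checksum as 32 lowercase hexadecimal digits."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read in chunks so large data files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()  # hexdigest() always returns lowercase hex


def quote_file_name(name: str) -> str:
    """Quote the name only if it contains a comma or a double quote."""
    if "," in name or '"' in name:
        # An embedded double quote is escaped by preceding it with another one.
        return '"' + name.replace('"', '""') + '"'
    return name


def control_row(file_type: str, path: Path, file_format: str = "edfi-xml") -> str:
    """Build one row: <file format>,<file type>,<file name>,<file checksum>."""
    # No spaces are added around values and no comma follows the last value.
    return ",".join([file_format, file_type, quote_file_name(path.name), md5_hex(path)])
```

Calling control_row("StudentEnrollment", Path("data.xml")) yields a row of the form edfi-xml,StudentEnrollment,data.xml,<checksum>.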
5. Anatomy of an ingestion job: Control files, Ed-Fi (cont.)
The control file format allows for specification of job-level parameters. These are
specified in the control file as line entries preceded with the @ symbol.
The following table describes the parameters that are currently supported in the control
file:
@dry-run
Indicates that the results of ingestion processing should not be written to the core data store.
@purge
Deletes all previously ingested data from this tenant. All other content of the control file is
ignored.
A job control file may look as follows:
@dry-run
edfi-xml,StudentEnrollment,data.xml,756a5e96e330082424b83902908b070a
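To make the format concrete, here is a minimal sketch of how such a control file could be read. It is not the ingestion service's parser: the class and function names are hypothetical, and skipping blank lines is an assumption. Python's csv module is used because its default dialect matches the rules described earlier (double quotes around a field, a doubled double quote as the escape, no trimming of spaces).

```python
import csv
from dataclasses import dataclass, field


@dataclass
class ControlEntry:
    file_format: str
    file_type: str
    file_name: str
    checksum: str


@dataclass
class ControlFile:
    parameters: list[str] = field(default_factory=list)
    entries: list[ControlEntry] = field(default_factory=list)


def parse_control_file(text: str) -> ControlFile:
    """Split a control file into @ parameters and data file rows."""
    result = ControlFile()
    for line in text.splitlines():
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith("@"):
            # Job-level parameter such as @dry-run or @purge.
            result.parameters.append(line[1:])
            continue
        row = next(csv.reader([line]))
        if len(row) != 4:
            # Also rejects a trailing comma, which produces an empty fifth value.
            raise ValueError(f"expected 4 values, got {len(row)}: {line!r}")
        result.entries.append(ControlEntry(*row))
    return result
```

Parsing the example above gives one parameter, dry-run, and one entry whose file name is data.xml.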
6. Error/Status Logs
In the course of ingestion, several log files are created and placed in the landing zone.
These files capture warnings and errors at the job level (per control file) or at the
resource level (per XML file within a job).
job-<jobId>.log: created once for every job. Entries:
INFO <jobId information>
INFO [file] <resourceId> (<internalschema>)
INFO [file] <resourceId> records considered: <#>
INFO [file] <resourceId> records ingested successfully: <#>
INFO [file] <resourceId> records failed: <#>
INFO [configProperty] <list of config parameters>
INFO <All|#> records process successfully
INFO Processed <#> records
job_warn-<jobId>.log: created when job-level (non-resource-specific) warnings are present. Entries:
WARN <warning detail>
job_error-<jobId>.log: created when job-level (non-resource-specific) errors are present. Entries:
ERROR <error detail>
warn.<resourceId>-<jobId>.log: created when resource-level warnings are present. Entries:
WARN <warning detail>
error.<resourceId>-<jobId>.log: created when resource-level errors are present. Entries:
ERROR <error detail>
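A client polling the landing zone could sort these files by name. The sketch below is illustrative only: the function and type names are hypothetical, and it assumes the caller already knows the job ID, because a resource ID or job ID containing a hyphen would make the name ambiguous to split otherwise.

```python
from typing import NamedTuple, Optional


class LogFileInfo(NamedTuple):
    scope: str                  # "job" or "resource"
    level: str                  # "info", "warn", or "error"
    resource_id: Optional[str]  # set only for resource-level logs


def classify_log_file(name: str, job_id: str) -> Optional[LogFileInfo]:
    """Map a landing zone log file name to its scope and level for a known job."""
    suffix = f"-{job_id}.log"
    if not name.endswith(suffix):
        return None  # not a log file of this job
    stem = name[: -len(suffix)]
    if stem == "job":
        return LogFileInfo("job", "info", None)
    if stem in ("job_warn", "job_error"):
        return LogFileInfo("job", stem[len("job_"):], None)
    for level in ("warn", "error"):
        prefix = level + "."
        if stem.startswith(prefix) and len(stem) > len(prefix):
            return LogFileInfo("resource", level, stem[len(prefix):])
    return None
```

With this in place, the presence of any file classified as "error" is a quick signal that a job needs attention before its records are trusted.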
7. Offline Validation Tool
The Offline Validation Tool is an open source tool that checks
ingestion files for Ed-Fi format compliance before they are
transmitted for ingestion.
This provides an opportunity to check the file format on the
spot instead of waiting to transmit and process the file on
the SLI side.
The tool checks only structure and XML compliance; it does
not check the referential integrity of the data.
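As a rough idea of what a local pre-check looks like, the sketch below tests a file for XML well-formedness only. It is not the Offline Validation Tool and does not validate against the Ed-Fi schemas; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def well_formedness_error(path: str) -> Optional[str]:
    """Return None if the file is well-formed XML, else the parser's error message."""
    try:
        # iterparse streams the document, so large files need not fit in memory.
        for _event, element in ET.iterparse(path):
            element.clear()  # discard element content once it has been parsed
    except ET.ParseError as exc:
        return str(exc)
    return None
```

A file that passes this check can still fail schema validation or referential integrity checks during ingestion.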