6. SSIS Considered Harmful
   - Visual programming is a terrible interface for ETL work
     - History of UI visual designers is instructive here
     - No way to copy/share “code samples”
     - Low information density
   - Even worse: it’s not even a good designer
     - Poor defaults, poor layout
     - Nested menus of property grids are hard to use
   - No real VCS ability
     - Sure it’s XML, but…
     - No diff
     - No merge
   - Demo ?
7. But what about all the features?
   - Some are trivial
     - Foreach Loop Container
     - Execute SQL Task
     - Send Mail Task
     - OMG!
   - Many are useful, but not hard to recreate
     - Demos soon
   - Others may be worth consideration
     - Fuzzy matching
8. But what about performance?
   - The advantages for data flow are real, but not that large
   - A 5M-row test on my home machine gave a 12% performance advantage to SSIS over PowerShell + SqlBulkCopy
9. The Answer
   - PowerShell
     - A general-purpose scripting and automation language
     - Wide adoption
       - MS: Windows Server, Windows 7, SQL Server 2008, VM Manager, IIS, Deployment Toolkit
       - Others: VMware, Quest, IBM WebSphere
10. PowerShell Features
    - Takes the best of Unix, VMS, Perl
    - Adds in integration with .NET, COM, WMI
    - Adds in REPL environment
    - Full scripting/programming
      - Flexible typing model
      - Pipes
      - Regular expressions
      - XML navigation
      - Providers
    - Built-in command parser
      - Standardized syntax for commands
12. PSIS demo
    - Just the ETL
      - No logging, error reporting, etc.
    - Some typical ETL tasks
      - Clear staging tables
      - Bulk transfer data from one DB to another
      - Populate a star schema
        - Fact table and dimension tables
      - Extract data to a .csv file
14. Bulk transfer
    - At core, we’ll use .NET’s SqlBulkCopy
      - Under the covers it uses TDS’s BCP protocol
    - For concurrency we’ll use the Task Parallel Library
    - We’ll simplify things by making the destination tables map directly to the source tables
      - Select statements could provide needed mapping
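The shape of that transfer can be sketched outside the demo environment. The following is a minimal Python stand-in, not the talk's PowerShell code: `sqlite3` plays the role of the two databases, batched `executemany` stands in for SqlBulkCopy, and a thread pool stands in for the Task Parallel Library. The names `copy_table` and `transfer_tables` are hypothetical, and the sketch assumes the destination already has identically named tables with matching columns, as the slide describes.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def copy_table(src_path, dst_path, table, batch_size=1000):
    """Copy all rows of `table` into the same-named destination table.

    Assumption: table names come from trusted configuration, since they are
    interpolated into the SQL text.
    """
    src = sqlite3.connect(src_path)
    # isolation_level=None lets us issue BEGIN/COMMIT explicitly.
    dst = sqlite3.connect(dst_path, timeout=30, isolation_level=None)
    try:
        cursor = src.execute(f'SELECT * FROM "{table}"')
        placeholders = ", ".join("?" for _ in cursor.description)
        insert_sql = f'INSERT INTO "{table}" VALUES ({placeholders})'
        copied = 0
        dst.execute("BEGIN IMMEDIATE")  # take the write lock up front
        try:
            while True:
                rows = cursor.fetchmany(batch_size)  # stream in batches
                if not rows:
                    break
                dst.executemany(insert_sql, rows)
                copied += len(rows)
            dst.execute("COMMIT")
        except Exception:
            dst.execute("ROLLBACK")
            raise
        return table, copied
    finally:
        src.close()
        dst.close()


def transfer_tables(src_path, dst_path, tables, max_workers=4):
    """Run one copy job per table on a thread pool; return {table: row count}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(copy_table, src_path, dst_path, t) for t in tables]
        return dict(f.result() for f in futures)
```

SQLite allows only one writer at a time, so the jobs here queue on the destination's write lock; the sketch shows the one-job-per-table structure rather than any real speedup. Replacing `SELECT *` with a per-table select statement is where a column mapping would go.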
16. Populate a Star-Schema
    - Source a view on the staging tables, creating a flat de-normalized row of all dimension and fact values
    - Use convention to indicate which columns belong to the fact table and which to dimension tables
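The slide does not say which naming convention the demo uses, so the following Python sketch invents one purely for illustration: a column named `Dim<Table>_<Attribute>` belongs to a dimension table, and a column named `Fact_<Measure>` is a fact value. Both prefixes and the function name `classify_columns` are assumptions.

```python
from collections import defaultdict

# Hypothetical convention, not taken from the talk:
#   "Fact_<Measure>"           -> a measure on the fact table
#   "Dim<Table>_<Attribute>"   -> an attribute of dimension table Dim<Table>
FACT_PREFIX = "Fact_"
DIM_PREFIX = "Dim"


def classify_columns(column_names):
    """Split the flat view's column names into dimension attributes and facts."""
    dimensions = defaultdict(list)  # dimension table -> attribute names
    facts = []                      # measure names
    for name in column_names:
        if name.startswith(FACT_PREFIX):
            facts.append(name[len(FACT_PREFIX):])
        elif name.startswith(DIM_PREFIX) and "_" in name:
            table, attribute = name.split("_", 1)
            dimensions[table].append(attribute)
        else:
            raise ValueError(f"column {name!r} does not follow the convention")
    return dict(dimensions), facts
```

For example, `classify_columns(["DimCustomer_Name", "DimDate_Day", "Fact_Amount"])` groups `Name` under `DimCustomer`, `Day` under `DimDate`, and reports `Amount` as the only fact. Rejecting unrecognised columns outright is a design choice here; silently ignoring them would hide mistakes in the view.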
17. Populate a Star-Schema
    - Code generate a populator
      - Create functions which, given a source row, return/create the corresponding dimension key
    - Run it with the TPL for performance
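The "return/create the dimension key" idea can be sketched with closures. This Python stand-in is not the generated PowerShell populator from the demo: it keeps every dimension in an in-memory dictionary, assigns surrogate keys by counting, runs on a single thread, and never touches a database. The names `make_dimension_keyer` and `populate`, and the `<Dimension>Key` column naming, are hypothetical.

```python
def make_dimension_keyer(attributes):
    """Build a get-or-create key function for one dimension.

    Returns (get_key, members), where `members` maps each distinct tuple of
    attribute values to its surrogate key.
    """
    members = {}

    def get_key(row):
        value = tuple(row[a] for a in attributes)
        key = members.get(value)
        if key is None:
            key = len(members) + 1  # next surrogate key (assumed scheme)
            members[value] = key
        return key

    return get_key, members


def populate(rows, dimensions, facts):
    """Turn flat source rows into fact rows plus dimension members.

    dimensions: {dimension name: [source columns forming that dimension]}
    facts:      [source columns carried over to the fact row as measures]
    """
    keyers = {name: make_dimension_keyer(attrs) for name, attrs in dimensions.items()}
    fact_rows = []
    for row in rows:
        fact_row = {f"{name}Key": get_key(row) for name, (get_key, _) in keyers.items()}
        fact_row.update({f: row[f] for f in facts})
        fact_rows.append(fact_row)
    dimension_tables = {name: members for name, (_, members) in keyers.items()}
    return fact_rows, dimension_tables
```

Given two source rows for the same customer, both fact rows receive the same `CustomerKey` and the dimension gains a single member. Running this concurrently, as the slide proposes with the TPL, would require the lookup dictionary to be safe for concurrent get-or-create, which this sketch does not attempt.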
18. Populate a Star-Schema
    - Areas of improvement
      - Add support for slowly changing dimensions
      - Add support for lower memory requirements
        - Query DB for dimension values on demand
      - Measure and tune performance at the micro-benchmark level
        - Concurrent dictionaries
      - Partial loads