2. Zach Briggs
New Developer
7 Years in Data Analytics
Mike Fidler
Systems Security Specialist
Ex Geologist
His Unix experience is old enough to drink
Amateur Inventor
19. Thank You
Zach - briggszj@gmail.com
@theotherzach
Title of Record
Mike - rockmastermike@gmail.com
@rockmastermike
Unix Neck Beard
Available for hire
Editor's Notes
I wanted to call this talk “dirty inputs.”
Gatekeeper
Unexpected inputs fail; push them back to the user.
Fault-tolerant systems. Model validations are the most obvious form of cleansing. They are the gatekeepers.
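The gatekeeper idea can be sketched in plain Ruby. This is a minimal stand-in for a model validation, not the speakers' code; the `Record` class and its email rule are invented for illustration.

```ruby
# Minimal "gatekeeper" sketch: unexpected input fails validation and is
# pushed back to the user instead of entering the system.
class Record
  attr_reader :email, :errors

  def initialize(email)
    @email = email
    @errors = []
  end

  # Mirrors a model validation: collect errors, report pass/fail.
  def valid?
    @errors << "email is malformed" unless email =~ /\A[^@\s]+@[^@\s]+\z/
    @errors.empty?
  end
end

Record.new("briggszj@gmail.com").valid? # => true
Record.new("not-an-email").valid?       # => false
```

In a Rails app this would be an ActiveModel validation; the point is the same either way: bad rows stop at the door.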
How about bulk records?
Using CSV uploads as my example, but it could be any source: consuming external JSON, sharing databases. Anything outside of the black rectangle.
What's the downside to relying on validation when we get garbage?
Best case: it fails, and we ask the user to fix their stuff and try again.
Allow half-fails? “Fix lines X, Y, and Z?”
Super basic example. Unravels a CSV file, turning a potentially wide table into a long one.
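A wide-to-long unravel like this can be done with Ruby's standard `csv` library. The sample data and column names here are made up for illustration:

```ruby
require "csv"

# A "wide" table: one row per id, one column per month.
wide = <<~CSV
  id,jan,feb,mar
  A,10,12,9
  B,7,8,11
CSV

# Unravel into a "long" table: one (id, column, value) triple per cell.
long = []
CSV.parse(wide, headers: true).each do |row|
  id = row["id"]
  row.to_h.reject { |k, _| k == "id" }.each do |month, value|
    long << [id, month, value]
  end
end

long.first  # => ["A", "jan", "10"]
long.length # => 6
```

However many columns the upload arrives with, the output shape stays the same, which makes the downstream filters simpler.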
Typical data grid, once again from any source.
And now we have a stream of data. This allows for more graceful failures. Since the entire input is in the system, we can prompt the user to fix the errors or devise filters to do it automatically.

Is it possible we would get better filters in the future? Better methods of cleaning the data? I'm sure none of you have ever seen a database where the columns were shifted by 1 because of a boneheaded mistake that happened 2 months ago. Me either.
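Because the raw rows stay in the system, a fix can be written later as a filter and re-run over the whole stream. A minimal sketch of such a filter pipeline, with cleaning rules invented for illustration:

```ruby
# Filters applied in order over each stored row. Since the raw input is
# kept, better cleaners can be added later and replayed from the start.
CLEANERS = [
  ->(row) { row.map { |v| v.is_a?(String) ? v.strip : v } }, # trim whitespace
  ->(row) { row.map { |v| v == "" ? nil : v } }              # blanks become nil
].freeze

def clean(row)
  CLEANERS.reduce(row) { |r, f| f.call(r) }
end

clean([" A ", "", "10"]) # => ["A", nil, "10"]
```

A column-shift repair would just be one more lambda in `CLEANERS`, applied only to rows ingested during the bad window.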
The schemaless store is just the landing area for the data to be moved into our database in batches. The store could be MongoDB, SQLite, cave drawings with a webcam where your OCR software processes them into something usable.

It doesn't matter.
What if it looked more like this? How many of you do fake deletes? Why? How is an update different from a delete?
If we automate the input/filter process, why do it only once?
Why throw out anything at all? How would that system be different? Here is as far as I am. Ish. That “All data” is a few hundred gigs in MySQL tables, and I have scripts that run when something updates. Add a ZIP and 56 minutes later it shows up in my Rails app.
Nathan Marz had this idea first.
How's about this?

Query is a function of all data. Capture is done in the rawest, most granular way possible, so speed wouldn't be a consideration. Events rather than “stuff,” so it can be rewound to the beginning of time.
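A toy sketch of "query is a function of all data": store immutable events, never update or delete, and derive current state by replaying from the beginning. The event shapes and the `current_emails` query are invented for illustration:

```ruby
# Append-only event log: captures are raw and granular, never mutated.
EVENTS = []

def record(event)
  EVENTS << event # nothing is ever updated or deleted
end

# A query is a pure function of the whole log; rewind by replaying
# a shorter prefix of EVENTS.
def current_emails
  EVENTS.each_with_object({}) do |e, state|
    case e[:type]
    when :email_added   then state[e[:user]] = e[:email]
    when :email_removed then state.delete(e[:user])
    end
  end
end

record(type: :email_added, user: "zach", email: "briggszj@gmail.com")
record(type: :email_added, user: "mike", email: "rockmastermike@gmail.com")
record(type: :email_removed, user: "zach")
current_emails # => {"mike"=>"rockmastermike@gmail.com"}
```

Notice that the "delete" is itself just another event, which is the answer to the fake-deletes question above: updates and deletes become facts you keep, not rows you lose.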
What is coffee? It's filthy-ass water, that's what it is. Coffeeologists (board-certified ones) measure the quality of coffee using the same dimensions as clean drinking water: pH, dissolved solids, rat feces. The usual.
Pre-ground grocery store beans have been sitting there for months and have lost their volatile flavor molecules.
The drip machine sprays unfiltered water that is too hot into the center of the filter, over-extracting some grounds and leaving others under-extracted.
The coffee hits the bottom of the hot glass carafe and is instantly burned.
What about the coffee nerds here?
Pour-over fixes the water temp and the center over-extraction.
The press pot goes further and allows for fine-tuning the extraction.
The issue is variables out of your control:
Bean age
Water quality

Press pots can come close, but you're brewing blind.