90% of industrial machine learning is feature extraction. What I really do is ETL.
Drive revenue and new business. Data is too important to be left to business people. The data warehouse is where data goes to die: it ends up used only for operational reporting or diagnosing problems. When we’re talking about data products, we’re talking about creating new revenue streams, optimizing existing ones, and solving problems for customers and for the business.
1) Means I collect everything. I don’t want to waste time getting data from operational systems every time I need something new. 2) Means I keep all of the phases of data available to me, from the raw stuff, to cleansed stuff, to joined stuff. 3) Has implications for denormalization: we go beyond dimensional modeling to full-on denormalization, usually along the lines of one of our conformed dimensions (product, customer, etc.).
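A minimal sketch of points 2 and 3, assuming two tiny in-memory tables. The sample data, field names, and function names are all hypothetical and only illustrate the idea: every phase (raw, cleansed, joined) stays available, and the joined phase is fully denormalized along the customer dimension.

```python
# Hypothetical sample data; every name and field here is illustrative only.
RAW_CUSTOMERS = [
    {"customer_id": "1", "name": " Ada ", "region": "emea"},
    {"customer_id": "2", "name": "Grace", "region": "AMER"},
]
RAW_ORDERS = [
    {"order_id": "10", "customer_id": "1", "amount": "19.50"},
    {"order_id": "11", "customer_id": "2", "amount": "5.00"},
    {"order_id": "12", "customer_id": "1", "amount": "bad"},
]


def cleanse_customers(rows):
    """Cast ids to int, trim names, and normalize region codes."""
    return [
        {
            "customer_id": int(r["customer_id"]),
            "name": r["name"].strip(),
            "region": r["region"].upper(),
        }
        for r in rows
    ]


def cleanse_orders(rows):
    """Cast fields to proper types; skip rows whose amount cannot be parsed."""
    cleaned = []
    for r in rows:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # absent from the cleansed phase, still present in raw
        cleaned.append(
            {
                "order_id": int(r["order_id"]),
                "customer_id": int(r["customer_id"]),
                "amount": amount,
            }
        )
    return cleaned


def denormalize_by_customer(customers, orders):
    """Copy customer attributes onto every order row (one wide table)."""
    by_id = {c["customer_id"]: c for c in customers}
    return [
        {
            **o,
            "customer_name": by_id[o["customer_id"]]["name"],
            "customer_region": by_id[o["customer_id"]]["region"],
        }
        for o in orders
        if o["customer_id"] in by_id
    ]


def build_stages(raw_customers, raw_orders):
    """Return all phases together so none of them has to be re-acquired."""
    customers = cleanse_customers(raw_customers)
    orders = cleanse_orders(raw_orders)
    return {
        "raw": {"customers": raw_customers, "orders": raw_orders},
        "cleansed": {"customers": customers, "orders": orders},
        "joined": denormalize_by_customer(customers, orders),
    }


stages = build_stages(RAW_CUSTOMERS, RAW_ORDERS)
```

The unparseable order is skipped in the cleansed phase but remains in the raw phase, which is the reason for keeping every phase around: a later fix to the cleansing rules can be replayed without going back to the operational system.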
Similar to the EDW team. For a small data warehouse or data mart, the DW architect is the ETL developer, the DBA, the dashboard builder, and the business analyst all rolled into one. When we are talking about data products (classifiers, recommenders, interactive or real-time data tools), we need to bring in the ability to take things to production.
Most important decision: the metrics you’re going to use to measure performance. It is an anti-pattern to solve a problem exactly once. You should either solve a problem 0 times or N times.
Time is money. Your time costs a lot more than data storage does. Data acquisition, data processing, code reuse: these are all things you do to save money over the long term.