Presentation by Raffael Dzikowski and Sean Gustafson of Scout24 AG at the 2019 AWS Summit Berlin.
Abstract:
The Scout24 Data Platform powers all reporting, ad hoc analytics and machine learning products at AutoScout24 and ImmobilienScout24. In this talk, we will take a technical deep dive into our modern, cloud-based big data platform. We will discuss our evolution of approaches to ingestion, ETL, access control, reporting and machine learning with a focus on in-the-trenches learnings gained from our many failures and successes as we migrated from a traditional Oracle Data Warehouse to an AWS-based data lake.
- To signal a new way of thinking here, we had the idea to formulate a Data Landscape Manifesto which we as a company would agree on.
- This is about roles, responsibilities and common values.
- It consists of 7 principles, each based on an assumption or a belief from which we derived that principle.
We believe that collecting & analyzing data is crucial to understanding our business, our customers, and the market in order to provide the right services & products.
Although this is nothing surprising these days, we wanted to start with this in order to ensure a common understanding of why all of this is important in the first place.
--> Loosely coupled (Microservices), strongly ALIGNED (Jez Humble, Adrian Cockcroft)
We therefore believe that everyone in the company must have easy access to the data available and it must be easy to publish data which can be used by others. This requires a solid Data Platform: easy-to-use tools, reliable infrastructure, and simple guidelines for publishing & consuming data.
…
This is our core responsibility (and we wanted to start with this side).
The data landscape is the playground on which data producers and data consumers interact. We provide the platform and the clear guidelines but we do not own that space.
The reason for this is that we believe..
We believe that exhaustive centralized data management does not allow us to scale to the level of data creation and consumption we aspire to as a company, because it creates a bottleneck and introduces accidental, indirect dependencies. Instead, we believe that data autonomy is the only way for data usage to scale across the company. However, for data autonomy to not become data anarchy, there has to be a clear set of basic rules and responsibilities.
Data autonomy puts…
We believe that extensive data availability, data discoverability, and data usability are crucial and that, at scale, no one can ensure this other than the one controlling the source where the data is originally generated.
We believe that the stakeholder of a metric has to be the single owner of that metric and its definition, and has to drive its implementation.
Without a single source of truth about what a metric means, we risk that multiple diverging and possibly contradicting understandings and implementations develop over time.
We believe that a minimum level of company-wide comparability & reliability of core KPIs is crucial for leading the company in the right direction.
The management is the owner of these core KPIs and the data group represents the management here in terms of metric ownership.
We believe that transparency is crucial for understanding what the meaning of a metric is.
If month-to-month comparability must never break, there is no way to continuously improve metrics and their transparency based on new insights.
To stay with the example: if we learn that a certain number of orders are actually fraudulent, then we want to report the real revenue.
A federated landscape of data producers and consumers with just enough rules to ensure seamless co-operation without severely impeding autonomy.