
Next Gen Data Modeling in the Open Data Platform With Doron Porat and Liran Yogev | Current 2022


At Yotpo, we have a rich and busy data lake consisting of thousands of data sets ingested and digested by different engines, the main one being Spark.
We built our data infrastructure to enable our users to produce and consume data via self-service tooling, giving them the utmost freedom.

This freedom came with a cost.

We struggled with poor standardization, little data reusability, a lack of data lineage, and flaky data sets.
We also watched the landscape under which we built our platform change dramatically, as have our analytics needs and expectations.

We came to understand that the modeling layer should be decoupled from the execution layer in order to shed the limitations we were bound by:
- Batch and stream should be no more than attributes of a wider abstraction
- A Kafka topic and a data lake table are no different and should be treated the same way
- Observability of our data pipelines should have the same quality and depth across all execution engines, storage methods, and formats
- Governance should be an implicit part of our ecosystem, serving as a basis for both exploration and automation/anomaly detection

That's when we started building YODA (soon to be open sourced), which gives us a killer dev experience with the level of abstraction we always dreamed of.
Combining DBT, Databricks, lakeFS, and a multitude of streaming engines, we started seeing our vision come to life.
In this talk, we'll share our journey of redesigning the data lake and show how best to address organizational needs without giving up on high-end tooling and technology. We are taking this to the next level.



  1. Next gen data modeling in the open data platform (Doron Porat, Liran Yogev, Current 2022)
  2. Let’s talk about mistakes
  3. Doron Porat, data infra group leader; Liran Yogev, director of engineering. Ex-coworkers who LOVE data and still share a successful Israeli podcast about data engineering.
  4. Who sent us here
  5. The open data platform
  6. The open data platform, main principles:
     - Adaptable: flexible just enough to cope with technological and procedural changes
     - Interoperable and interchangeable: parts of the platform can be replaced over time by different/similar solutions
     - Scalable: built for big data, for many consumers, for many producers
     - Clear purpose: solves a specific problem in the data platform
  7. Yotpo’s open data platform
  8. Data generation in the open data platform
  9. “Data Transform V1”:
     - Spark-based
     - SQL-oriented
     - Many supported inputs
     - Many supported writers
     - Data unit-testing
     - DQ checks
     Hundreds of data pipelines were built with this tool, by generalist developers.
  10. The reality of V1
  11. We need to be better governors. Enablement is not enough. Assume nothing. Lost metadata cannot be recovered. Coupling is dangerous.
  12. V2 is all about “Governance Driven Development”.
  13. “Data Transform V2”, our key objectives:
     - Developer experience: simple, reusable, and testable
     - Abstraction: orchestration, quality, consistency, and ownership
     - Data as a product: consumer awareness, documentation, and observability
  14. Our quest for an open-source solution was a short one!
  15. DBT terminology:
     - Sources
     - Models
     - Macros
     - Exposures
     - Metrics
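To make these terms concrete, here is a minimal, hypothetical dbt model; the table and column names are illustrative, not from the talk. A model is just a SELECT statement, and dbt resolves source() and ref() to concrete tables:

```sql
-- Hypothetical dbt model file, e.g. models/orders_enriched.sql.
-- {{ source(...) }} points at a raw table declared in a sources YAML;
-- {{ ref(...) }} points at another dbt model, which is how dbt builds
-- its lineage graph.
SELECT
    o.order_id,
    o.amount,
    c.country
FROM {{ source('app_db', 'orders') }} AS o
JOIN {{ ref('customers_cleaned') }} AS c
    ON o.customer_id = c.customer_id
```

Macros are reusable SQL snippets, exposures declare downstream consumers of a model, and metrics define aggregations over models.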
  16. DBT: what works for us?
     - Runs on multiple environments and technologies using different adapters: modeling/compute decoupling
     - Embraced by many organizations and data tooling providers: the community is the best
     - Encourages metadata collection during the development process (GDD): improves the survivability factor
     - Open source, highly extensible, and already contains many of our requirements: can adapt beautifully to our needs
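The adapter-based decoupling can be sketched with a hypothetical dbt profiles.yml; the profile name, hosts, and schemas are made up, and the exact fields depend on the adapter version:

```yaml
# Hypothetical profiles.yml: the same dbt project can compile and run
# against different engines just by switching targets. All values here
# are illustrative.
yoda:
  target: dev
  outputs:
    dev:                      # local Spark Thrift server for development
      type: spark
      method: thrift
      host: localhost
      port: 10000
      schema: analytics
    prod:                     # Databricks SQL warehouse for production
      type: databricks
      host: example.cloud.databricks.com
      http_path: /sql/1.0/warehouses/example
      schema: analytics
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
```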
  17. DBT is not a perfect fit. Looking back at V2’s key objectives:
     - Dev experience: manual CLI development, mono-repo, dev testing
     - Abstraction: single adapter, not Spark-optimized, built for batch, no orchestration
  18. So how can we fix it?
  19. Demo time
  20. (untitled slide)
  21. Data CI/CD
  22. Orchestration
  23. But what about real-time workloads?
  24. Streaming engines vs. real-time analytics in DBT:
     - Real-time analytics databases: work well with the DBT architecture; ClickHouse implementation (link); Rockset implementation (link)
     - Streaming engines: require an extensive SQL interface; most testing cannot run in-process; difficult to convert batch to stream
  25. Streaming engines:
     - Materialize is the first streaming engine in DBT (here)
     - Materialize is a streaming SQL database
     - Based on incrementally-updated materialized views
     - Extensive ANSI SQL support
     - DBT support includes modeling, documenting, running, and testing
  26. This is super cool! But we use Flink…
  27. So, Flink and DBT?
     - Flink has an SQL interface
     - Flink can connect to our metastore
     - Flink can store Kafka topic references in our metastore
     - SQL jobs can be deployed remotely via simple Python code
     - Supports both batch and stream
     - Can write directly to the data lake
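The points above can be sketched in raw Flink SQL; the topic, fields, and broker address are illustrative (mirroring the concept slides that follow), not the actual Yotpo setup:

```sql
-- Hypothetical Flink SQL sketch: declare a Kafka topic as a table,
-- then run a continuous query over it. All names are illustrative.
CREATE TABLE transactions (
    id STRING,
    total DECIMAL(10, 2),
    currency_code STRING,
    transaction_time TIMESTAMP(3),
    -- tolerate up to 30 seconds of out-of-order events
    WATERMARK FOR transaction_time AS transaction_time - INTERVAL '30' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'transactions',
    'properties.bootstrap.servers' = 'localhost:9092',
    'format' = 'json'
);

-- A streaming job is just SQL: Flink keeps this query running,
-- continuously updating the per-currency totals.
SELECT currency_code, SUM(total) AS total_per_currency
FROM transactions
GROUP BY currency_code;
```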
  28. Flink in DBT (concept): sources, Kafka topic #1

      - name: transactions
        description:
        external:
          location: transactions
          connector: 'kafka'
          properties.bootstrap.servers: 'localhost:9092'
          key.format: 'raw'
          value.format: 'json'
          key.fields: 'id'
          value.fields-include: 'ALL'
          watermark:
            rowtime_column_name: transaction_time
            watermark_strategy_expression: transaction_time - INTERVAL '30' SECOND
        columns:
          - name: id
            data_type: STRING
            description: "ID"
          - name: currency_code
            data_type: STRING
            description: "Currency Code"
          - name: total
            data_type: DECIMAL(10,2)
            description: "Total amount spent"
          - name: transaction_time
            data_type: TIMESTAMP(3)
            description: "Time of transaction"
  29. Flink in DBT (concept): model

      transactions_with_rates.yml:
      version: 2
      models:
        - name: transactions_with_rate
          description: transactions joined with rate
          config:
            meta:
              external:
                location: 'transactions_with_rates'
                connector: 'kafka'
                properties.bootstrap.servers: 'localhost:9092'
                format: 'json'
          columns:
            - name: id
              description: 'Transaction id'
              data_type: STRING
              tests:
                - unique
                - not_null
            - name: total_eur
              data_type: DECIMAL(10,2)
              description: 'Total in euro'
            - name: total
              description: 'Total'
              data_type: DECIMAL(10,2)
            - name: currency_code
              description: 'Currency code'
              data_type: STRING
            - name: transaction_time
              description: 'Transaction time'
              data_type: TIMESTAMP(3)

      transactions_with_rates.sql:
      SELECT t.id,
             t.total * c.eur_rate AS total_eur,
             t.total,
             c.currency_code,
             t.transaction_time
      FROM {{ source('kafka_tables', 'transactions') }} t
      JOIN {{ source('kafka_tables', 'currency_rates') }}
          FOR SYSTEM_TIME AS OF t.transaction_time AS c
          ON t.currency_code = c.currency_code;
  30. Flink in DBT (concept): tests. Materialize’s implementation is great!

      transactions_with_rates.yml:
      - name: transactions_with_rates
        description: transactions joined with rates
        columns:
          - name: id
            data_type: STRING
            description: "ID"
            tests:
              - unique:
                  config:
                    store_failures: true

      Compiled test (failing rows are duplicated ids):
      INSERT INTO test_unique_transactions_with_rates_id
      SELECT id, COUNT(1) AS count_id
      FROM transactions_with_rates
      GROUP BY id
      HAVING count_id > 1
  31. So, we have streaming figured out! (theoretically)
  32. Challenges:
     - Generalization has a price
     - Not everything can be abstracted this way
     - We lose expertise across the organization
     - Heavy dependency on DataOps
     - Requires LOTS of building
  33. Data modeling is the last survivor (image: “The early demise of Apache Flink”, DALL-E, 2022)
     - Not all tech is here to stay
     - Our users can’t keep up
     - Puts governance first
     - Business logic is the true organizational asset
  34. Questions?
