2. Scaling out to 10 Clusters, 1000 Users,
and 10,000 Flows:
The Dali Experience at LinkedIn
Carl Steinbach
Senior Staff Software Engineer
Data Analytics Infrastructure Group
LinkedIn
3. Hadoop @ LinkedIn: Circa 2008
1 cluster
20 nodes
10 users
10 production workflows
MapReduce, Pig
4. Hadoop @ LinkedIn: NOW
> 10 clusters
> 10,000 nodes
> 1,000 users
Hundreds of production workflows, thousands of
development flows and ad-hoc queries
MapReduce, Pig, Hive, Gobblin, Cubert, Scalding, Spark,
Presto, …
5. What did we learn along the way?
Scaling Hardware Infrastructure is Hard
6. What did we learn along the way?
Scaling Human Infrastructure is Harder
7. Dali Motivations: Data Consumers
Data consumers have to manage too many details
What data is available, and who produces it?
Where is the data located (cluster, path)?
How is the data partitioned (logical, physical)?
How do I read the data (format)?
8. Dali Motivations: Data Producers
Managing contracts with consumers is hard
Change Management
Who depends on this dataset?
Public versus Private APIs
Can’t unpublish new fields
Schemas are too weak
Physical types are nice, but we want semantic types
9. Dali Motivations: Infra Providers
This mess makes things really hard for infrastructure providers!
Lots of optimizations are impossible because producer/consumer logic locks us
into what should be backend decisions
Storage format (Avro)
Physical partitioning scheme (Date)
Data location (Specific directory, cluster, FS)
Lots of redundant code paths to support
11. Dali Vision and Mission
Vision: Make analytics infrastructure invisible
Mission: Make data on HDFS easier to manage
Filesystem: multi-version, multi-cluster
Datasets: tables not files
Views: contract management for producers and consumers
Lineage and Discovery: map datasets to producers, consumers, and track
changes over time
12. Dali Central Dogma
Separate logical and physical concerns!
GUIDs for logical entities are good!
Versions are good!
13.
“All problems in computer
science can be solved by another
level of indirection”
David Wheeler (?)
15. Is a Dataset API Enough?
Some use cases at LinkedIn
Structural transformations (flattening and nesting)
Muxing and de-muxing data (unions)
Patching bad data
Backward incompatible changes
Code reuse
What we need
Ability to decouple the API from the dataset
Producer control over public and private APIs
Tooling and process to support safe evolution of these APIs
17. A sample view
CREATE VIEW profile_flattened
TBLPROPERTIES(
'functions' =
'get_profile_section:isb.GetProfileSections',
'dependencies' =
'com.linkedin.dali-udfs:get-profile-sections:0.0.5')
AS SELECT
get_profile_section(...)
FROM
prod_identity.profile;
18. Reading a Dali View from Pig
register ivy://com.linkedin.dali:dali-all:2.3.52;
a = load "dali:///prod_identity/profile_flattened"
using dali.data.pig.DaliStorage();
19. Versioning for Dali Views
All views and UDFs must be versioned
Declare and version view dependencies
Deploy multiple versions of the same view
Incremental pull upgrades replace monolithic push upgrades
20. Managing views as Gradle artifacts
Git is the source of truth for view definitions
1:1 mapping between views and published artifacts
Mapping view artifacts to view names
db.view
${groupId}.${artifactId}_x_y_z
${groupId}.${artifactId} is an alias for the most recent view definition
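The naming convention above can be sketched as a small function. This is a hypothetical illustration of the mapping described on the slide, not Dali code; the function name and signature are assumptions.

```python
# Illustrative sketch of the view-name mapping described above.
# The ${groupId}.${artifactId}_x_y_z scheme comes from the slide;
# this helper itself is hypothetical, not part of Dali.

def view_name(group_id, artifact_id, version=None):
    """Map a Gradle artifact coordinate to a Dali view name.

    With a version, db.view becomes ${groupId}.${artifactId}_x_y_z;
    without one, ${groupId}.${artifactId} aliases the most recent
    view definition.
    """
    base = f"{group_id}.{artifact_id}"
    if version is None:
        return base
    return f"{base}_{version.replace('.', '_')}"

print(view_name("prod_identity", "profile_flattened", "1.2.0"))
# → prod_identity.profile_flattened_1_2_0
print(view_name("prod_identity", "profile_flattened"))
# → prod_identity.profile_flattened
```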
21. Leveraging existing LI tools INFRA
Query view/UDF version dependency graph
who-depends-on-me?
Deprecate, EOL, and purge a specific view/UDF version
Plug into existing global namespace management provided by LI developer tools
Enforce referential integrity for views at deployment time
22. Contract Law for Datasets
Vague, poorly defined contracts bind data producers to consumers
Physical types don’t tell us much
STRING or URI?
STRING or ENUM?
Semantic types help, but what about other types of relationships?
X IS NOT NULL
A_time is in seconds, b_time is in millis
Attributes of a good contract
Easy to find
Easy to understand
Easy to change
23. Hijacking an existing process
Express contracts as logical constraints against the fields of a view
Make the contract easy to find by storing it in the view’s Git repo
Contract negotiation follows an existing process
Data producer (view owner) controls the ACL on the view repo
Data consumer requests a contract change via a ReviewBoard request
View owner either accepts or rejects the request
If accepted, view version is bumped to notify downstream consumers
If rejected, consumer still has the option of committing the constraint to their own repo
Contract: constraint-based testing for views
Contract: data quality tests
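A minimal sketch of what constraint-based testing against a view could look like. The constraints mirror the examples from the previous slide (NOT NULL, seconds-vs-millis); the `check` helper, field names, and rows are all hypothetical, not Dali APIs.

```python
# Hypothetical sketch of constraint-based contract testing for a view.
# Constraint predicates and the check() helper are illustrative only.

def check(rows, constraints):
    """Return the names of constraints violated by any row."""
    return [name for name, pred in constraints.items()
            if not all(pred(row) for row in rows)]

# Logical constraints against the fields of a view, named so a
# violation report is self-explanatory.
constraints = {
    "member_id IS NOT NULL": lambda r: r["member_id"] is not None,
    "a_time in seconds, b_time in millis": lambda r: r["b_time"] == r["a_time"] * 1000,
}

rows = [
    {"member_id": 42, "a_time": 1, "b_time": 1000},
    {"member_id": None, "a_time": 2, "b_time": 2000},
]
print(check(rows, constraints))  # → ['member_id IS NOT NULL']
```

Running such checks against view output at deployment time is one way the contract doubles as a data quality test.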
24. Why Dali?
Consumers
Make data stable, predictable, discoverable
Producers
Explicit, manageable contracts with consumers
Frictionless, familiar process for modifying existing contracts
Infra Providers
Freedom to optimize
Flow portability: DR, multi-DC scheduling
-Since "People You May Know" is long, we call it PYMK at LinkedIn.
-The original version ran on Oracle
-And the way it worked was to attempt to find overlaps between any pairs of people. Did they share the same school? Did they work at the same company?
-One big indicator was common connections, and we used something called triangle closing.