Tech Talk at the 1st Flipkart Tech Conference: Slash N
Creating a Catalog Management System is non-trivial at scale, and scale operates at every level: volume of data, size of the catalog, and the flux of changes. Besides, finding an ideal model to represent the relationships and the elasticity of the data is a non-trivial science. In this talk, let's try to figure out which part of it is science, and where we cross the boundary into art.
Cataloging: The Art and Science of it
1. Cataloging
The Art & Science of it...
Utkarsh
Principal Architect @ Flipkart.com
Sunday 3 March 13
2. Art vs Science
Art: Imaginative, Free Form, Creative
Science: Measurable, Formulative, Methodical, Set Patterns
3. What is Cataloging?
• Catalog
A list or itemized display usually including descriptive information
or illustrations.
• Cataloging
a. To list or include in a catalog
b. To classify according to a categorical system
We define it as:
Cataloging is the process of managing the inventory of products through the entire lifecycle of creating, updating, de-provisioning/re-provisioning and deletion.
4. Why is the problem interesting?
• Ever growing - “size”
• Dynamic nature of the Metadata - “elasticity”
• Association(s) between data elements - “flexibility”
• Flux of changes - “variability”
• De-coupled systems & Data Ownership - “data duplication”
5. How do we solve it?
• Be Comprehensive & Imaginative
• Be Methodical & Flexible
• Work with Patterns & Create new Patterns
• Be a Composer, be an artist (blend where required)
6. What do we solve?
• Identify Data Elements
• Identify Relationships between Data Elements
• Identify Data Usage patterns (Query patterns)
• Create an ideal representation: Logical Model
• Characterize the Data Store(s)
• Architect the Catalog Data Cluster
• Define Views/Interface(s)
7. Identify Data Elements
• Product
• Stock
• Sellers
• Biblio
• Product Category
• Product Variants
• SLAs
• Supplier
• Product Images
• Taxation
• Pricing
• Contributors
• ?
Be Comprehensive; Be Imaginative !!
8. Identify Relationships
[Diagram: relationships between data elements]
• Book is A Physical Product
• Compilation 1, Compilation 2: a Compilation has A Book
• Book belongs to Year
• Book belongs to Author
• Book belongs to Genre
• ?
Be Comprehensive; Be Imaginative !!
9. Identify Data Query Patterns
• Is the querying real-time or offline (from the customer's perspective)?
• Is the query “Id”-based, or does it use filters (ad hoc or pre-defined)?
• Does the query link multiple data elements?
• Understand: Query SLAs at ever increasing scale
• Question: why is the client writing such a query?
Eg:
a. Book with a specific title: Secret of the Nagas
b. Books by Chetan Bhagat published in 2012
c. Books which are Thrillers, published post 2005, written in Hindi and published by Rupa Publications
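The three example queries can be sketched over a tiny in-memory catalog to show how differently they behave. This is only an illustration, not the system described in the talk: the `Book` fields, the sample records (all values other than the names on the slide are placeholders) and the helper names are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Book:
    book_id: str
    title: str
    author: str
    year: int
    genre: str
    language: str
    publisher: str


# Placeholder records: authors, years and publishers are illustrative only.
CATALOG: List[Book] = [
    Book("B1", "Secret of the Nagas", "Author A", 2011, "Fiction", "English", "Publisher A"),
    Book("B2", "Sample Title 2", "Chetan Bhagat", 2012, "Fiction", "English", "Publisher B"),
    Book("B3", "Sample Title 3", "Author C", 2008, "Thriller", "Hindi", "Rupa Publications"),
    Book("B4", "Sample Title 4", "Author C", 2003, "Thriller", "Hindi", "Rupa Publications"),
]

# (a) Exact-match lookup: served by a key -> record index, no scan needed.
# Assumes titles are unique, which a real catalog cannot rely on.
BY_TITLE: Dict[str, Book] = {b.title: b for b in CATALOG}


def by_title(title: str) -> Optional[Book]:
    return BY_TITLE.get(title)


# (b) Pre-defined equality filters on one or more attributes.
def filter_books(**criteria: object) -> List[Book]:
    return [b for b in CATALOG
            if all(getattr(b, name) == value for name, value in criteria.items())]


# (c) Ad hoc query mixing a range condition with several equality conditions.
def thrillers_after(year: int, language: str, publisher: str) -> List[Book]:
    return [b for b in CATALOG
            if b.genre == "Thriller" and b.year > year
            and b.language == language and b.publisher == publisher]
```

Query (a) stays cheap as the catalog grows; (b) and (c) scan every record here, which is the point at which indexes or a search store would have to take over.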
14. Identification is Non Trivial
Example “Book”
Identification -->
“Title”
“Title” + “Publisher”
“Title” + “Publisher” + “Edition”
“Title” + “Publisher” + “Edition” + “Variant”
“Title” + “Publisher” + “Edition” + “Variant” + ??
Be Imaginative - an Artist’s brush stroke !!
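One way to see why each extra attribute is needed is to count how many records remain ambiguous at every level of the identification key. A minimal sketch, assuming hypothetical records in which "variant" means the binding (the slide does not define it):

```python
from typing import Dict, List, Tuple

# Placeholder records; "variant" is assumed to mean binding.
RECORDS: List[Dict[str, str]] = [
    {"title": "Sample Title", "publisher": "Publisher A", "edition": "1st", "variant": "Paperback"},
    {"title": "Sample Title", "publisher": "Publisher A", "edition": "1st", "variant": "Hardcover"},
    {"title": "Sample Title", "publisher": "Publisher A", "edition": "2nd", "variant": "Paperback"},
    {"title": "Sample Title", "publisher": "Publisher B", "edition": "1st", "variant": "Paperback"},
]

# The successive identification keys from the slide.
LEVELS: List[Tuple[str, ...]] = [
    ("title",),
    ("title", "publisher"),
    ("title", "publisher", "edition"),
    ("title", "publisher", "edition", "variant"),
]


def ambiguous_count(records: List[Dict[str, str]], fields: Tuple[str, ...]) -> int:
    """Number of records that share their key with at least one other record."""
    groups: Dict[Tuple[str, ...], int] = {}
    for rec in records:
        key = tuple(rec[f] for f in fields)
        groups[key] = groups.get(key, 0) + 1
    return sum(n for n in groups.values() if n > 1)


def ambiguity_by_level() -> Dict[Tuple[str, ...], int]:
    return {fields: ambiguous_count(RECORDS, fields) for fields in LEVELS}
```

With these sample records the ambiguous count drops from 4 to 3 to 2 to 0 as attributes are added. The trailing "??" on the slide is the reminder that real data may still need a further attribute.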
15. Logical Model
Schema
Entities as Tables
Relationships as Constraints
Queries supported through indexes and joins
+ Rich Query Support
+ Built-in support for Relationships
+ Indexes
- Elasticity
  * Frequent addition/deletion of columns
  * Growing secondary indexes
- Not optimized for some use-cases
  * Key-Values
  * Data Blobs/Graphs
Relational Databases:
  * MySQL, Oracle, Postgres et al
16. Logical Model
Semi-Schema
Blobs (Documents) of Data
Linkages between Documents
Queries supported through document identifiers and document references
+ Flexibility: “Documents” are less rigid
+ Query Language to retrieve based on content of “Document”
- Complex Relationships are non-trivial
- “Linked” Document Queries may not be optimized
Document Stores:
  * MongoDB, Couchbase et al
17. Logical Model
No Schema
Data Blobs
Rules/Relationship definitions
Queries supported through data “views”, indexes, search based on reverse indexing etc ...
+ Elasticity
  * Variability of data format
  * Secondary Indices
+ Tunable performance
- Relational data is a force-fit (sub-optimal)
+/- Querying models are specific to Stores
Other NoSQL Stores:
  * HBase, Riak, Cassandra, et al
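The three logical models can be compared by storing the same book three ways. The sketch below uses only the Python standard library: `sqlite3` stands in for a relational database, and plain dictionaries stand in for a document store and a key-value store. Table names, keys and field names are hypothetical, and the sample values are placeholders.

```python
import json
import sqlite3

# 1. Schema: entities as tables, relationship as a foreign-key constraint,
#    query served by an index and a join.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE book (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        author_id INTEGER NOT NULL REFERENCES author(id)
    );
    CREATE INDEX idx_book_title ON book(title);
""")
conn.execute("INSERT INTO author (id, name) VALUES (1, 'Author A')")
conn.execute("INSERT INTO book (id, title, author_id) VALUES (10, 'Sample Title', 1)")
relational_row = conn.execute(
    "SELECT b.title, a.name FROM book b JOIN author a ON a.id = b.author_id "
    "WHERE b.title = ?",
    ("Sample Title",),
).fetchone()

# 2. Semi-schema: documents addressed by identifier; a linkage is a reference
#    that the reader follows with a second lookup. Extra fields need no migration.
documents = {
    "book:10": {"title": "Sample Title", "author_ref": "author:1", "binding": "Paperback"},
    "author:1": {"name": "Author A"},
}
book_doc = documents["book:10"]
linked_author = documents[book_doc["author_ref"]]

# 3. No schema: opaque blobs behind keys; the application maintains its own
#    reverse index to search by anything other than the key.
kv_store = {
    b"book:10": json.dumps({"title": "Sample Title", "author": "Author A"}).encode("utf-8"),
}
title_index = {"Sample Title": [b"book:10"]}
blob = kv_store[title_index["Sample Title"][0]]
kv_record = json.loads(blob.decode("utf-8"))
```

The join in (1) is written once and enforced by the store; in (2) and (3) the equivalent linkage and lookup logic moves into application code, which is the trade-off the +/- lists on these slides describe.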
18. Catalog Data Cluster
• Catalog Data
• Biblio Data
• Product Data
• UGC on Products
• Compliance Data
• Pricing/Accounting
• ?
- “View”/“Data” Partitions
- Blend multiple data stores
- Interfaces provide view to the underlying data
- Scale uniformly for data elements
19. Data Store Characterization
• Data characteristics:
  - Reliability (availability and redundancy)
  - Consistency
• Querying capability
  - Support for indexes
  - Filters; secondary indexes
  - linkages/relationships
• Elasticity
  - increase in scale
  - evolving catalog definitions
• SLAs
  - Volumes
  - Throughput
  - Latencies
Be Comprehensive; be Methodical, but be unbounded by choices - a Scientist who has a palette of colors in hand !!
20. Data Store Characterization
• CAP (Consistency, Availability, Partition tolerance): which 2 do we pick? Can the data store help configure any 2?
• Operational ease (monitoring, reporting, config mgmt ..)
• Pluggability with Distributed Computing platforms
21. Define Views & Interfaces
• Cataloging has multiple use-cases which are business centric
• Use-cases evolve; and so do the “views” of the data
• “Views” as multiple interpretations of the data
• Decoupled from the underlying data
• Underlying data form has to be elastic
• Overlaid views have to be adaptive
[Diagram: layers, top to bottom]
View Layer: Precomputed View(s), Dynamic View(s)
Data Access Interface
Data 1, Data 2, Data 3, Data 4
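The layering on this slide can be sketched as views that reach the data only through a data access interface. The class and function names below are hypothetical and the stores are plain dictionaries; the sketch only shows the difference between a dynamic view (built on every read) and a precomputed view (built on refresh), not the design used at Flipkart.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, object]


class DataAccess:
    """Single interface over several underlying stores (Data 1, Data 2, ...)."""

    def __init__(self, stores: Dict[str, Dict[str, Record]]) -> None:
        self._stores = stores

    def get(self, store: str, key: str) -> Record:
        return self._stores.get(store, {}).get(key, {})

    def keys(self, store: str) -> List[str]:
        return list(self._stores.get(store, {}))


Builder = Callable[[DataAccess, str], Record]


class DynamicView:
    """Builds the view on every read, so it always reflects current data."""

    def __init__(self, dao: DataAccess, build: Builder) -> None:
        self._dao, self._build = dao, build

    def read(self, key: str) -> Record:
        return self._build(self._dao, key)


class PrecomputedView:
    """Builds the view for all keys on refresh(); reads are cache lookups."""

    def __init__(self, dao: DataAccess, build: Builder, source_store: str) -> None:
        self._dao, self._build, self._source = dao, build, source_store
        self._cache: Dict[str, Record] = {}

    def refresh(self) -> None:
        self._cache = {k: self._build(self._dao, k) for k in self._dao.keys(self._source)}

    def read(self, key: str) -> Optional[Record]:
        return self._cache.get(key)


def listing_view(dao: DataAccess, key: str) -> Record:
    """One interpretation of the data: product title blended with its price."""
    product = dao.get("product", key)
    pricing = dao.get("pricing", key)
    return {"title": product.get("title"), "price": pricing.get("amount")}
```

Because the builder sees only `DataAccess`, the stores underneath can change shape or be swapped without touching the views, and a new use-case is a new builder rather than a new copy of the data.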
22. Architect for Scale & Performance
• Identify Usage Patterns
• Right Tools for Job
• Right Abstractions
• Pluggable Solution Stacks
• Decoupled Data
• Offline Processing
23. Measure, Monitor & Evolve
• SLAs change; the system has to be adaptive
• Start off with established goals; benchmark and meet the initial set of goals
• Changes are gradual; plan at the first symptom
• Watch for system(s) that are not coping
• Always work towards incremental changes; a complete overhaul of the systems will be counterproductive
Be Curious, have doubts, deeply introspect - be the ultimate Scientist !!
24. Change is constant ... adapt
• Requirements evolve
• Business introduces flux
• Data interpretations grow
• Be flexible, adaptive, imaginative ... work as a Scientist who appreciates Art !!
25. Thank you !
My Co-ordinates:
utkarsh@flipkart.com