The Aadhaar application stores and searches 200M residents' records containing personal and biometric information. A user can search for records by various criteria, such as a resident's personal or system information. This session discusses the approach and challenges of building a data store that handles 2M inserts/updates and 10M reads per day. You will learn how 16TB of data is stored and handled, spread over 8 shards for high availability, and how the system scales to hold information on a total of 1.2 billion residents in a form that can also be processed for analytics.
Search data store for the world's largest biometric identity system
1. Search data store for the world's largest biometric identity system
Regunath Balasubramanian Shashikant Soni
regunathb@gmail.com soni.shashikant@gmail.com
twitter @regunathb
CONFIDENTIAL: For limited circulation only Slide 1
2. India
● 1.2 billion residents
● 640,000 villages; ~60% live on under $2/day
● ~75% literacy; <3% pay income tax; <20% have banking access
● ~800 million mobile connections; ~200-300 million migrant workers
● Govt. spends about $25-40B on direct subsidies
● Residents have no standard identity document
● Most programs are plagued by ghost and duplicate identities, causing
leakage of 30-40%
3. Aadhaar
● Create a common ‘national identity’ for every ‘resident’
● Biometric-backed identity to eliminate duplicates
● ‘Verifiable online identity’ for portability
● Applications ecosystem using open APIs
● Aadhaar-enabled bank account and payment platform
● Aadhaar-enabled electronic, paperless KYC (Know Your Customer)
4. Search Requirements
● Multi-attribute queries like:
name contains ‘regunath’ AND city = ‘bangalore’ AND
address contains ‘J P Nagar’ AND YearOfBirth = ……
● Search 1.2B resident records with photo and history
● Average record size: 35 KB
● Response times in milliseconds
● Open scale-out
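The multi-attribute query above maps naturally onto a single MongoDB filter document. A minimal sketch in plain JavaScript (field names and the year value are illustrative, not the actual Aadhaar schema):

```javascript
// Filter for: name contains 'regunath' AND city = 'bangalore'
// AND address contains 'J P Nagar' AND YearOfBirth = <value>.
// Field names and the year are hypothetical, not the real schema.
const filter = {
  name:        { $regex: "regunath" },   // "contains" => regex match
  city:        "bangalore",              // exact equality
  address:     { $regex: "J P Nagar" },
  yearOfBirth: 1975                      // hypothetical value
};

// In the mongo shell this would be passed as-is to find():
//   db.residents.find(filter)
```

Each "contains" clause becomes a `$regex` condition and each equality a plain field/value pair, which is why a compound or multi-key index over these fields matters for the millisecond response targets.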
5. Why MongoDB
● Auto-sharding
● Replication
● Failover
… Essentially an AP (slaveOk) data store in CAP parlance
● Evolving schema
● Map-Reduce for analysis
● Full text search
● Compound and multi-key indexes
7. Implementation and Deployment
● Start: 4M records in 2 shards
Current: 250M records in 8 shards (8 x ~2 TB x 3 replicas)
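The slide's storage figures can be sanity-checked with back-of-envelope arithmetic. This is raw data only; on-disk files are larger because of indexes, padding, and preallocation, which is roughly where the ~2 TB per shard comes from:

```javascript
// Back-of-envelope capacity check for the numbers on this slide.
const records       = 250e6;      // 250M resident records
const avgRecordBytes = 35 * 1024; // ~35 KB average record (earlier slide)
const shards        = 8;
const replicas      = 3;

const rawTB        = records * avgRecordBytes / 1e12; // raw data in TB
const perShardTB   = rawTB / shards;                  // raw data per shard
const allCopiesTB  = rawTB * replicas;                // across 3 replicas

console.log(rawTB.toFixed(1)); // prints "9.0" — ~9 TB raw; ~2 TB/shard on disk with overhead
```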
● Performance, Reliability & Durability
● SlaveOk
● getLastError, Write Concern: availability vs durability
j = journaling
w = nodes-to-write
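The w / j trade-off above can be sketched as a toy acknowledgement model. This is illustrative only, not driver or server code; real clients set these as write-concern options such as `{ w: 2, j: true }`:

```javascript
// Toy model of write-concern acknowledgement: a write is acknowledged
// once `w` nodes have it, and (if j=true) it has been journaled.
// Illustrates the availability-vs-durability trade-off only.
function isAcknowledged(nodesWritten, journaled, concern) {
  const wOk = nodesWritten >= concern.w;
  const jOk = !concern.j || journaled;
  return wOk && jOk;
}

// w:1, j:false -> acknowledged fast, least durable
const fast    = isAcknowledged(1, false, { w: 1, j: false }); // true
// w:2, j:true  -> must reach 2 nodes AND the journal before ack
const durable = isAcknowledged(1, false, { w: 2, j: true });  // false
```

Raising `w` and enabling `j` buys durability at the cost of write latency, which is the tuning knob the slide refers to.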
● Replica-sets / Shards – how?
[Deployment diagram: each shard (RS 1, RS 2, …) is a replica set with a Primary, a Secondary, and an Arbiter; three config servers (Config 1–3) and mongos Routers front the cluster]
8. Monitoring and Troubleshooting
● Monitoring tools evaluated
● MMS
● munin
● Manual approach - daily ritual
● RS, DB, config, router - health and stats
● Problem analysis stats
● mongostat, iostat, currentOps, logs
● Client connections
● Stats for storage, shard addition
● Data file size
● Shard data distribution
● Replication
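The shard-data-distribution check in the list above reduces to comparing per-shard document counts against the mean. A minimal sketch over hypothetical counts (in practice the counts come from the cluster's shard stats):

```javascript
// Sketch of a shard-balance check: flag shards whose document count
// deviates from the mean by more than `tolerance` (a fraction).
// The counts below are hypothetical, in millions of documents.
function imbalancedShards(counts, tolerance) {
  const mean = counts.reduce((a, b) => a + b, 0) / counts.length;
  return counts
    .map((c, i) => ({ shard: i, ratio: c / mean }))
    .filter(s => Math.abs(s.ratio - 1) > tolerance)
    .map(s => s.shard);
}

// 8 shards, shard 6 running hot:
const counts = [31, 30, 32, 29, 31, 30, 48, 31];
const hot = imbalancedShards(counts, 0.25); // -> [6]
```

A shard flagged this way is a candidate for chunk migration or a sign of a poorly distributed shard key.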
9. Key Learnings on MongoDB
● Indexing 32 fields
●Compound indexes
●Multi-keys indexes
{ … "indexes" : [ { "email" : "john.doe@email.com", "phone" : "123456789" } ] }
db.coll.find({ "indexes.email" : "john.doe@email.com" })
● Indexes use B-trees
● Many fields to index
● Performs well up to 1-2M documents
● Best if the index fits in memory
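The multi-key behaviour shown in the example above — one document with an array of sub-documents producing one index entry per array element — can be sketched in plain JavaScript. This models the B-tree as a `Map` and is illustrative only, not MongoDB internals:

```javascript
// Toy model of a multikey index on "indexes.email": each element of a
// document's `indexes` array contributes one index entry pointing back
// at the document. Not MongoDB internals; a Map stands in for the B-tree.
function buildMultikeyIndex(docs, arrayField, subField) {
  const index = new Map();
  for (const doc of docs) {
    for (const entry of doc[arrayField] || []) {
      const key = entry[subField];
      if (!index.has(key)) index.set(key, []);
      index.get(key).push(doc._id);
    }
  }
  return index;
}

const docs = [
  { _id: 1, indexes: [{ email: "john.doe@email.com", phone: "123456789" }] },
  { _id: 2, indexes: [{ email: "a@b.com" }, { email: "john.doe@email.com" }] },
];
const idx = buildMultikeyIndex(docs, "indexes", "email");

// Equivalent of db.coll.find({ "indexes.email": "john.doe@email.com" }):
const matches = idx.get("john.doe@email.com"); // [1, 2]
```

The fan-out — every array element becomes an index entry — is why indexing 32 fields across array sub-documents grows the index quickly and why the "index fits in memory" bullet matters.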
● Data replication, RS failover
● Rollback when an RS goes out of sync
Manual restore (physical data copy)
Restarting a very stale node
10. Questions?
Regunath Balasubramanian Shashikant Soni
regunathb@gmail.com soni.shashikant@gmail.com
twitter @regunathb