Hadoop at Aadhaar
(Data Store, OLTP & OLAP)
github.com/regunathb
RegunathB
Bangalore Hadoop Meetup
Enrolment Data
• 600 to 800 million UIDs in 4 years
– 1 million a day with transaction, durability guarantees
– 350+ trillion matches every day
• ~5 MB per resident
– Maps to about 10-15 PB of raw data (2048-bit PKI encrypted)
– About 30 TB I/O every day
– Replication and backup across DCs of about 5+ TB of incremental data every day
– Lifecycle updates and new enrolments will continue forever
• Enrolment data moves from very hot to cold, needing multi-layered storage architecture
• Additional process data
– Several million events on average moving through async channels (some persistent and some transient)
– Needing insert and update guarantees across data stores
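The enrolment figures above can be cross-checked with a little arithmetic. The sketch below is only a back-of-envelope check, assuming decimal units and, for the match count, an average gallery of 350 million residents (a value picked here because it reproduces the quoted figure, not one stated on the slide).

```python
# Back-of-envelope check of the enrolment figures quoted on the slide.
# Decimal units are assumed throughout (1 MB = 10**6 bytes).
MB, TB, PB = 10**6, 10**12, 10**15

enrolments_per_day = 1_000_000             # "1 million a day"
bytes_per_resident = 5 * MB                # "~5MB per resident"
target_uids = (600_000_000, 800_000_000)   # "600 to 800 million UIDs"

# New data arriving each day; lines up with "5+ TB of incremental data".
daily_ingest_tb = enrolments_per_day * bytes_per_resident / TB

# One copy of every resident's data at the low and high end of the target.
single_copy_pb = tuple(n * bytes_per_resident / PB for n in target_uids)

# De-duplication compares each new enrolment against the existing gallery.
# ASSUMPTION: an average gallery size of 350 million residents.
assumed_gallery = 350_000_000
matches_per_day = enrolments_per_day * assumed_gallery

print(f"daily ingest : {daily_ingest_tb:.1f} TB")
print(f"single copy  : {single_copy_pb[0]:.1f} to {single_copy_pb[1]:.1f} PB")
print(f"matches/day  : {matches_per_day:.2e}")
```

A single copy works out to 3 to 4 PB, below the 10-15 PB on the slide. The difference would be consistent with keeping several copies of each packet, but the slide does not say how the larger figure was derived.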
Authentication Data
• 100+ million authentications per day (10 hrs)
– Possible high variance on peak and average
– Sub-second response
– Guaranteed audits
• Multi-DC architecture
– All changes need to be propagated from enrolment data stores to all authentication sites
• Authentication request is about 4 KB
– 100 million authentications a day
– 1 billion audit records in 10 days (30+ billion a year)
– 4 TB encrypted audit logs in 10 days
– Audit write must be guaranteed
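The authentication numbers above hang together arithmetically. This is a rough check, assuming "4 K" means 4 KB, one audit record per authentication, and an audit record about the size of the request.

```python
# Rough sizing of the authentication workload from the slide's figures.
KB, TB = 10**3, 10**12

auths_per_day = 100_000_000   # "100 million authentications a day"
service_hours = 10            # "(10 hrs)"
request_bytes = 4 * KB        # ASSUMPTION: "4 K" read as 4 KB

# Average rate if the load were spread evenly over the 10-hour window;
# the slide warns that the peak may be well above this.
avg_per_second = auths_per_day / (service_hours * 3600)

# ASSUMPTION: one audit record per authentication, same size as the request.
audit_records_10_days = auths_per_day * 10
audit_records_per_year = auths_per_day * 365
audit_tb_10_days = audit_records_10_days * request_bytes / TB

print(f"average rate      : {avg_per_second:,.0f} requests/sec")
print(f"audit rows/10 days: {audit_records_10_days:,}")
print(f"audit rows/year   : {audit_records_per_year:,}")
print(f"audit log/10 days : {audit_tb_10_days:.1f} TB")
```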
Aadhaar Data Stores
• Mongo cluster (all enrolment records/documents – demographics + photo)
– Shards 1 to 5
– Low latency indexed read (documents per sec)
– High latency random search (seconds per read)
• MySQL (all UID generated records – demographics only, track & trace, enrolment status)
– UID master (sharded), Enrolment DB
– Low latency indexed read (milliseconds per read)
– High latency random search (seconds per read)
• Solr cluster (all enrolment records/documents – selected demographics only)
– Shards 0, 2, 6, 9, a, d, f
– Low latency indexed read (documents per sec)
– Low latency random search (documents per sec)
• HDFS (all raw packets)
– Data Nodes 1, 10, .., 20
– High read throughput (MB per sec)
– High latency read (seconds per read)
• HBase (all enrolment biometric templates)
– Region Servers 1, 10, .., 20
– High read throughput (MB per sec)
– Low-to-medium latency read (milliseconds per read)
• NFS (all archived raw packets)
– LUNs 1 to 4
– Moderate read throughput
– High latency read (seconds per read)
Systems Architecture
• Work distribution using SEDA & Messaging
• Ability to scale within a JVM and across JVMs
• Recovery through check-pointing
• Sync HTTP-based Auth gateway
• Protocol Buffers & XML payloads
• Sharded clusters
• Near real-time data delivery to warehouse
• Nightly data-sets used to build dashboards, data marts and reports
• Real-time monitoring using Events
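To make the SEDA and check-pointing bullets concrete, here is a minimal sketch of a two-stage pipeline in which each stage owns a queue and a worker thread, and the last stage records the sequence number it has finished. The names (`Stage`, `run_pipeline`, the JSON checkpoint file) are illustrative assumptions; the deck does not describe the actual implementation, which runs on the JVM.

```python
import json
import queue
import threading
from pathlib import Path


def load_checkpoint(path):
    """Return the last sequence number recorded as done, or -1 if none."""
    p = Path(path)
    return json.loads(p.read_text())["last_seq"] if p.exists() else -1


def save_checkpoint(path, seq):
    Path(path).write_text(json.dumps({"last_seq": seq}))


class Stage:
    """One stage: an inbound queue drained by its own worker thread."""

    def __init__(self, handler, downstream=None):
        self.handler = handler
        self.downstream = downstream
        self.inbox = queue.Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while True:
            item = self.inbox.get()
            if item is None:  # sentinel: pass it on and stop
                if self.downstream is not None:
                    self.downstream.inbox.put(None)
                return
            result = self.handler(item)
            if self.downstream is not None:
                self.downstream.inbox.put(result)


def run_pipeline(events, checkpoint_path):
    """Process (seq, payload) events, skipping those already checkpointed."""
    stored = []

    def validate(event):
        seq, payload = event
        return seq, payload.strip()

    def store(event):
        seq, payload = event
        stored.append(payload)
        save_checkpoint(checkpoint_path, seq)  # progress survives a restart

    sink = Stage(store)
    head = Stage(validate, downstream=sink)
    sink.worker.start()
    head.worker.start()

    last_done = load_checkpoint(checkpoint_path)
    for seq, payload in events:
        if seq > last_done:
            head.inbox.put((seq, payload))
    head.inbox.put(None)
    sink.worker.join()
    return stored
```

Running the pipeline a second time with the same checkpoint file replays only events whose sequence number is higher than the recorded one. A single worker per stage keeps ordering simple here; a real deployment would size each stage's thread pool separately.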
Enrolment Biometric Middleware
• Distribute and reconcile biometric data extraction and de-dup requests across multiple vendors (ABISs)
• Biometric data de-referencing/read service (HTTP) over sharded HDFS and NFS
– Serves bulk of the HDFS read requests (25 TB per day)
– Locates data from multiple HDFS clusters
● Sharded by read/write patterns: New, Archive, Purge
• Calculates and maintains volume allocation, SLA breach thresholds of ABISs
– Thresholds stored in ZK and pushed to middleware nodes
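One way to picture volume allocation with SLA breach thresholds is sketched below: each request goes to the healthy vendor that is furthest below its target share. The `Abis` fields, the vendor names and the selection rule are assumptions for illustration; in the system described, the thresholds would come from ZooKeeper rather than being set in code.

```python
from dataclasses import dataclass


@dataclass
class Abis:
    """A biometric vendor with a target share and an SLA threshold."""
    name: str
    share: float              # target fraction of de-dup volume
    sla_seconds: float        # breach threshold for response time
    sent: int = 0             # requests routed so far
    last_latency: float = 0.0

    def breached(self):
        return self.last_latency > self.sla_seconds


def pick_vendor(vendors):
    """Choose the healthy vendor furthest below its target share."""
    healthy = [v for v in vendors if not v.breached()]
    if not healthy:
        raise RuntimeError("all ABIS vendors are in SLA breach")
    total = sum(v.sent for v in vendors) + 1  # count the request being routed
    chosen = max(healthy, key=lambda v: v.share - v.sent / total)
    chosen.sent += 1
    return chosen


if __name__ == "__main__":
    vendors = [Abis("A", 0.5, 2.0), Abis("B", 0.3, 2.0), Abis("C", 0.2, 2.0)]
    for _ in range(100):
        pick_vendor(vendors)
    print({v.name: v.sent for v in vendors})
```

A vendor whose latest latency exceeds its threshold simply drops out of the candidate list until the value recovers, so its share is redistributed among the others.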
Event Streams & Sinks
• Event framework supporting different interaction/data durability patterns
– P2P, Pub-Sub
– Intra-JVM and Queue destinations: Durable / Non-Durable
– Fire & Forget, Ack. after processing
• Event Sinks
– Ephemeral data consumed by counters, metrics (dashboard)
– Rolling file appenders that push data to HDFS
● Primary mechanism for delivering raw fact data from transactional systems to the warehouse staging area
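The interaction patterns listed above differ mainly in who receives an event and whether delivery is acknowledged. The in-process sketch below contrasts pub-sub (every subscriber gets a copy, fire and forget) with point-to-point (one consumer per event, acknowledged after processing). `EventBus` and its methods are hypothetical names, not the framework's API.

```python
from collections import defaultdict


class EventBus:
    """Toy in-process bus showing pub-sub and point-to-point delivery."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> handlers
        self._consumers = defaultdict(list)    # queue name -> handlers
        self._next = defaultdict(int)          # round-robin cursor per queue
        self.unacked = defaultdict(list)       # events awaiting redelivery

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        """Pub-sub, fire and forget: every subscriber receives the event."""
        for handler in self._subscribers[topic]:
            handler(event)

    def add_consumer(self, queue_name, handler):
        self._consumers[queue_name].append(handler)

    def send(self, queue_name, event):
        """P2P with ack after processing: exactly one consumer is tried.

        The handler returns a truthy value to acknowledge; otherwise the
        event is kept in `unacked` so it can be redelivered later.
        """
        consumers = self._consumers[queue_name]
        if not consumers:
            self.unacked[queue_name].append(event)
            return False
        index = self._next[queue_name] % len(consumers)
        self._next[queue_name] += 1
        acked = bool(consumers[index](event))
        if not acked:
            self.unacked[queue_name].append(event)
        return acked
```

A durable destination would persist `unacked` outside the process; the in-memory list here corresponds to the non-durable case.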
Data Analysis
• Statistical analysis from millions of events
– View into quality of enrolments – e.g. Enrolment Agencies, Operators
– Feature introduction – e.g. Based on avg. time taken for biometric capture, demographic data input
– Enrolment volumes – e.g. By Registrar, Agency, Operator etc
● Useful in fraud detection
• Goal to share anonymized data sets for use by industry and academia – information transparency
• Various reports – Self-serve, Canned, Operational and/or Aggregates
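As an illustration of the per-operator statistics mentioned above, the sketch below computes enrolment volume and average biometric capture time per operator, then lists operators whose average is far below the overall mean. The event fields (`operator`, `capture_seconds`) and the 0.5 cut-off are assumptions made for the example; the actual analysis runs on Pig and Hive over the warehouse.

```python
from collections import defaultdict
from statistics import mean


def summarise_by_operator(events):
    """Return {operator: (enrolment_count, average_capture_seconds)}."""
    durations = defaultdict(list)
    for event in events:
        durations[event["operator"]].append(event["capture_seconds"])
    return {op: (len(vals), mean(vals)) for op, vals in durations.items()}


def unusually_fast(stats, fraction=0.5):
    """Operators whose average capture time is below fraction * overall mean.

    ASSUMPTION: a very short capture time is treated as a signal worth
    reviewing; the threshold is arbitrary.
    """
    total_count = sum(count for count, _ in stats.values())
    overall = sum(count * avg for count, avg in stats.values()) / total_count
    return sorted(op for op, (_, avg) in stats.items() if avg < fraction * overall)


if __name__ == "__main__":
    sample = [
        {"operator": "op-1", "capture_seconds": 60.0},
        {"operator": "op-1", "capture_seconds": 80.0},
        {"operator": "op-2", "capture_seconds": 10.0},
    ]
    stats = summarise_by_operator(sample)
    print(stats)
    print(unusually_fast(stats))
```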
UID BI Platform
Data Analysis architecture
[Architecture diagram; labelled components:]
• UIDAI Systems
• Events (RabbitMQ), Server DB (MySQL)
• Hadoop HDFS: Event CSV, Raw Data
• Pig, Pentaho Kettle, Hive
• Data Warehouse (HDFS/Hive): Fact Data, Dimension Data
• Dimension Data (MySQL)
• Datasets, On-Demand Datasets
• Datamarts (MySQL)
• Data Access Framework
• Canned Reports, Dashboard, Self-service Analytics
• Pentaho BI, FusionCharts
• E-mail/Portal/Others
Hadoop stack summary
• CDH2 (Enrolment, Analysis), CDH3 (Authentication)
• Data Store
– HDFS: Enrolment, Events, Audit Logs, Warehouse
– HBase: Biometric templates used in Authentication
• Coordination/Config
– ZK: Biometric middleware thresholds
• Analysis
– Pig: ETL for loading analysis data from staging to atomic warehouse
– Hive: Dataset generation framework
Learnings
• Watch out for "too many small files". HDFS is better suited to fewer, larger files
• Data loss from HDFS in spite of having 3 replica copies – maybe fixed in releases post CDH2?
• Give careful consideration to HBase table design – the row key primarily, to avoid region-server hot-spotting
• Hive data (HDFS files) does not handle duplicate records – can be an issue if data ingestion is replayed for data sets
– Hive over HBase is a viable alternative
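On the row key point, a common way to avoid hot-spotting is to prefix a monotonically increasing ID with a salt derived from its hash, so that consecutive writes land in different key ranges. The sketch below shows that idea only; the key layout, the bucket count of 16 and the function name are assumptions, not the table design used at Aadhaar.

```python
import hashlib


def salted_row_key(enrolment_id: str, buckets: int = 16) -> bytes:
    """Prefix the ID with a stable, hash-derived bucket number.

    The same ID always maps to the same bucket, so point reads can
    rebuild the key; sequential IDs are scattered across buckets.
    """
    digest = hashlib.md5(enrolment_id.encode("utf-8")).digest()
    salt = digest[0] % buckets
    return f"{salt:02d}|{enrolment_id}".encode("utf-8")


if __name__ == "__main__":
    for n in range(5):
        print(salted_row_key(f"2012010100{n:04d}"))
```

The trade-off is that a range scan over IDs now needs one scan per bucket, so salting suits tables that are read mostly by exact key.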
References
• Aadhaar Portal: https://portal.uidai.gov.in/uidwebportal/dashboard.do
• Data Portal: https://data.uidai.gov.in/uiddatacatalog/dataCatalogHome.do
• Analytics whitepaper: http://uidai.gov.in/images/FrontPageUpdates/uid_doc_30012012.pdf
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Hadoop at Aadhaar

  • 1. Hadoop at Aadhaar (Data Store, OLTP & OLAP)
      RegunathB (github.com/regunathb), Bangalore Hadoop Meetup
  • 2. Enrolment Data
      • 600 to 800 million UIDs in 4 years
          – 1 million a day with transaction, durability guarantees
          – 350+ trillion matches every day
      • ~5 MB per resident
          – Maps to about 10-15 PB of raw data (2048-bit PKI encrypted)
          – About 30 TB I/O every day
          – Replication and backup across DCs of about 5+ TB of incremental data every day
          – Lifecycle updates and new enrolments will continue forever
      • Enrolment data moves from very hot to cold, needing a multi-layered storage architecture
      • Additional process data
          – Several million events on average moving through async channels (some persistent and some transient)
          – Needing insert and update guarantees across data stores
  • 3. Authentication Data
      • 100+ million authentications per day (10 hrs)
          – Possible high variance between peak and average
          – Sub-second response
          – Guaranteed audits
      • Multi-DC architecture
          – All changes need to be propagated from enrolment data stores to all authentication sites
      • Authentication request is about 4 KB
          – 100 million authentications a day
          – 1 billion audit records in 10 days (30+ billion a year)
          – 4 TB encrypted audit logs in 10 days
          – Audit write must be guaranteed
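The audit figures on the Authentication Data slide can be cross-checked with a little arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a 365-day year; the variable names are illustrative and do not come from the talk.

```python
# Back-of-envelope check of the audit figures quoted on the slide.
AUTHS_PER_DAY = 100_000_000        # "100+ million authentications per day"
WINDOW_HOURS = 10                  # the slide's 10-hour daily window
AUDIT_LOG_BYTES_10_DAYS = 4e12     # "4 TB ... in 10 days" (decimal TB assumed)

records_10_days = AUTHS_PER_DAY * 10            # one audit record per request
records_per_year = AUTHS_PER_DAY * 365
avg_rate_per_sec = AUTHS_PER_DAY / (WINDOW_HOURS * 3600)
bytes_per_record = AUDIT_LOG_BYTES_10_DAYS / records_10_days

print(f"audit records in 10 days : {records_10_days:,}")
print(f"audit records per year   : {records_per_year:,}")
print(f"average requests per sec : {avg_rate_per_sec:,.0f}")
print(f"bytes per audit record   : {bytes_per_record:,.0f}")
```

Under these assumptions the numbers are mutually consistent: 1 billion records in 10 days, about 36.5 billion a year (the slide's "30+ billion"), and roughly 4,000 bytes per audit record, which matches the stated request size of about 4 KB. The average rate over a 10-hour window is close to 2,800 requests per second; the slide notes that peaks may be much higher.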
  • 4. Aadhaar Data Stores
      • Mongo cluster (all enrolment records/documents: demographics + photo)
          – Shards 1 to 5
          – Low latency indexed read (documents per sec), high latency random search (seconds per read)
      • MySQL (all UID generated records: demographics only, track & trace, enrolment status)
          – UID master (sharded), Enrolment DB
          – Low latency indexed read (milliseconds per read), high latency random search (seconds per read)
      • Solr cluster (all enrolment records/documents: selected demographics only)
          – Shards 0, 2, 6, 9, a, d, f
          – Low latency indexed read (documents per sec), low latency random search (documents per sec)
      • HDFS (all raw packets)
          – Data Nodes 1 .. 20
          – High read throughput (MB per sec), high latency read (seconds per read)
      • HBase (all enrolment biometric templates)
          – Region Servers 1 .. 20
          – High read throughput (MB per sec), low-to-medium latency read (milliseconds per read)
      • NFS (all archived raw packets)
          – LUNs 1 to 4
          – Moderate read throughput, high latency read (seconds per read)
  • 5. Systems Architecture
      • Work distribution using SEDA & messaging
      • Ability to scale within a JVM and across JVMs
      • Recovery through check-pointing
      • Synchronous HTTP-based Auth gateway
      • Protocol Buffers & XML payloads
      • Sharded clusters
      • Near real-time data delivery to the warehouse
      • Nightly data sets used to build dashboards, data marts and reports
      • Real-time monitoring using events
  • 6. Enrolment Biometric Middleware
      • Distributes and reconciles biometric data extraction and de-dup requests across multiple vendors (ABISs)
      • Biometric data de-referencing/read service (HTTP) over sharded HDFS and NFS
          – Serves the bulk of the HDFS read requests (25 TB per day)
          – Locates data from multiple HDFS clusters
              ● Sharded by read/write patterns: New, Archive, Purge
      • Calculates and maintains volume allocation and SLA breach thresholds of ABISs
          – Thresholds stored in ZK and pushed to middleware nodes
  • 7. Event Streams & Sinks
      • Event framework supporting different interaction/data durability patterns
          – P2P, Pub-Sub
          – Intra-JVM and Queue destinations: Durable / Non-Durable
          – Fire & Forget, Ack. after processing
      • Event Sinks
          – Ephemeral data consumed by counters, metrics (dashboard)
          – Rolling file appenders that push data to HDFS
              ● Primary mechanism for delivering raw fact data from transactional systems to the warehouse staging area
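The rolling file appender mentioned on the Event Streams & Sinks slide can be sketched as follows. This is a minimal illustration, not the UIDAI implementation: the class name, the size-based roll policy and the file naming are all assumptions, and the step that ships closed files to HDFS is left out.

```python
import os
import tempfile


class RollingFileAppender:
    """Append event lines to a file and roll to a new one at max_bytes.

    Hypothetical sketch: the talk does not describe the roll policy.
    """

    def __init__(self, directory, max_bytes):
        self.directory = directory
        self.max_bytes = max_bytes
        self.closed_files = []  # rolled files, ready to be shipped elsewhere
        self._index = 0
        self._open_new()

    def _open_new(self):
        name = f"events-{self._index:05d}.csv"
        self._path = os.path.join(self.directory, name)
        # newline="" keeps "\n" as a single byte on every platform
        self._file = open(self._path, "w", encoding="utf-8", newline="")
        self._size = 0
        self._index += 1

    def _finish_current(self):
        self._file.close()
        if self._size > 0:
            self.closed_files.append(self._path)
        else:
            os.remove(self._path)  # do not keep empty files

    def append(self, line):
        data = line.rstrip("\n") + "\n"
        self._file.write(data)
        self._size += len(data.encode("utf-8"))
        if self._size >= self.max_bytes:
            self.roll()

    def roll(self):
        self._finish_current()
        self._open_new()

    def close(self):
        self._finish_current()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        appender = RollingFileAppender(tmp, max_bytes=64)
        for i in range(10):
            appender.append(f"{i},enrolment,ok")
        appender.close()
        print(len(appender.closed_files), "files ready to ship")
```

A larger `max_bytes` yields fewer, bigger files, which is the direction HDFS prefers.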
  • 8. Data Analysis
      • Statistical analysis from millions of events
          – View into quality of enrolments, e.g. Enrolment Agencies, Operators
          – Feature introduction, e.g. based on avg. time taken for biometric capture, demographic data input
          – Enrolment volumes, e.g. by Registrar, Agency, Operator etc.
              ● Useful in fraud detection
      • Goal to share anonymized data sets for use by industry and academia (information transparency)
      • Various reports
          – Self-serve, Canned, Operational and/or Aggregates
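  • 9. UID BI Platform: Data Analysis architecture
      (Architecture diagram; the labels below are recovered from it, the connections between them are not.)
      • Data sources: UIDAI Systems; Events (RabbitMQ); Server DB (MySQL)
      • Stores: Hadoop HDFS; Data Warehouse (HDFS/Hive); Datamarts (MySQL); Dimension Data (MySQL)
      • Data labels: Raw Data, Event CSV, Fact Data, Dimension Data, Datasets, On-Demand Datasets
      • Processing tools: Pig, Hive, Pentaho Kettle
      • Access and presentation: Data Access Framework; Pentaho BI; FusionCharts; Canned Reports; Dashboard; Self-service Analytics
      • Delivery: E-mail/Portal/Others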
  • 10. Hadoop stack summary
      • CDH2 (Enrolment, Analysis), CDH3 (Authentication)
      • Data Store
          – HDFS: Enrolment, Events, Audit Logs, Warehouse
          – HBase: Biometric templates used in Authentication
      • Coordination/Config
          – ZK: Biometric middleware thresholds
      • Analysis
          – Pig: ETL for loading analysis data from staging to atomic warehouse
          – Hive: Dataset generation framework
  • 11. Learnings
      • Watch out for "too many small files". HDFS is better suited to fewer but larger files
      • Data loss from HDFS in spite of having 3 replica copies; maybe fixed in releases post CDH2?
      • Give careful consideration to HBase table design, the row key in particular, primarily to avoid region-server hot-spotting
      • Hive data (HDFS files) does not handle duplicate records
          – Can be an issue if data ingestion is replayed for data sets
          – Hive over HBase is a viable alternative
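One common way to act on the row-key advice above is to "salt" the key: prefix it with a short value derived from a hash, so that keys that would otherwise sort next to each other are spread across the key space. The talk does not say which key design was used, so the following is only a sketch of the general technique; the function name, bucket count and key format are hypothetical.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical; would normally relate to the number of regions


def salted_row_key(natural_key: str, buckets: int = NUM_BUCKETS) -> str:
    """Prefix natural_key with a stable, hash-derived bucket number."""
    digest = hashlib.md5(natural_key.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return f"{bucket:02d}|{natural_key}"


if __name__ == "__main__":
    # Sequential identifiers no longer share a common prefix after salting.
    for n in range(5):
        print(salted_row_key(f"enrolment-{n:08d}"))
```

The trade-off is on the read side: a point lookup can recompute the salt from the natural key, but a range scan over natural keys has to fan out across all buckets.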
  • 12. References
      • Aadhaar Portal: https://portal.uidai.gov.in/uidwebportal/dashboard.do
      • Data Portal: https://data.uidai.gov.in/uiddatacatalog/dataCatalogHome.do
      • Analytics whitepaper: http://uidai.gov.in/images/FrontPageUpdates/uid_doc_30012012.pdf