2. Data Pipeline
Pipeline stages: Raw Data, Ingestion and File System, Batch Processing, Data Store, Web Framework
Raw Data: Common Crawl (snapshot of the entire Internet, 160 TB per month) and Wikipedia 2015
Ingestion and File System: AWS S3
3. Data Pipeline
Pipeline stages: Raw Data, Ingestion and File System, Batch Processing, Data Store, Web Framework
Raw Data: Common Crawl (snapshot of the entire Internet, 160 TB per month) and Wikipedia 2015
Ingestion and File System: AWS S3 (S3 is the source of truth); 12 x s3 Medium, $0.80 per hour, ~$160 per month
4. Data Pipeline
Pipeline stages: Raw Data, Ingestion and File System, Batch Processing, Data Store, Web Framework
Raw Data: Common Crawl (snapshot of the entire Internet, 160 TB per month) and Wikipedia 2015
Ingestion and File System: AWS S3 (S3 is the source of truth); 12 x s3 Medium, $0.80 per hour, ~$160 per month
Batch Processing: 8 x T4 Large, $1.01 per hour
5. Data Pipeline
Pipeline stages: Raw Data, Ingestion and File System, Batch Processing, Data Store, Web Framework
Raw Data: Common Crawl (snapshot of the entire Internet, 160 TB per month) and Wikipedia 2015
Ingestion and File System: AWS S3 (S3 is the source of truth); 12 x s3 Medium, $0.80 per hour, ~$160 per month
Batch Processing: 8 x T4 Large, $1.01 per hour
Data Store: Elasticsearch and Cassandra; 8 x T4 Large, $1.01 per hour
6. Data Pipeline
Pipeline stages: Raw Data, Ingestion and File System, Batch Processing, Data Store, Web Framework
Raw Data: Common Crawl (snapshot of the entire Internet, 160 TB per month) and Wikipedia 2015
Ingestion and File System: AWS S3 (S3 is the source of truth); 12 x s3 Medium, $0.80 per hour, ~$160 per month
Batch Processing: 8 x T4 Large, $1.01 per hour
Data Store: Elasticsearch and Cassandra; 8 x T4 Large, $1.01 per hour
Web Framework: Flask; 1 x T2 Micro, free
System Costs: ~$2800 per month
If spot instances used: ~$300 per month
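The cost figures above can be sanity-checked with a little arithmetic. The sketch below is only an illustration: it assumes a 30-day month of continuous uptime (an assumption, not stated on the slides) and uses the hourly rates exactly as quoted. The cluster labels in the dictionary are descriptive names chosen here. The three hourly rates alone come to roughly $2,030 per month, so the quoted ~$2800 total presumably includes items the slides do not itemize.

```python
# Assumption: 30 days of continuous uptime; the slides do not state the basis.
HOURS_PER_MONTH = 24 * 30

# Hourly rates as quoted on the slide. The labels are descriptive only.
hourly_rates = {
    "12 x s3 Medium": 0.80,
    "8 x T4 Large (first cluster)": 1.01,
    "8 x T4 Large (second cluster)": 1.01,
}


def monthly_cost(rate_per_hour: float, hours: int = HOURS_PER_MONTH) -> float:
    """Convert an hourly rate into a monthly cost for always-on instances."""
    return rate_per_hour * hours


# Monthly cost implied by the three quoted hourly rates.
compute_per_month = sum(monthly_cost(rate) for rate in hourly_rates.values())

# Totals as quoted on the slide.
on_demand_total = 2800.0
spot_total = 300.0

# Fraction of the quoted monthly total saved by switching to spot instances.
spot_saving = 1 - spot_total / on_demand_total

print(f"Hourly rates imply ${compute_per_month:.2f} per month")  # 2030.40
print(f"Spot instances save {spot_saving:.0%}")                  # 89%
```

On the quoted totals, spot instances cut the monthly bill by about 89%.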
11. Challenge:
How to optimize database for low latency querying?
URL (keys)
Documents (values)
[Figure: Network Map – Wikipedia Contributor. QUERY: Telecommunications. Node labels: Tyler, Ben Casey, Barb, Dana.]
12. Hybrid Database - Schema
Databases: Cassandra, Elasticsearch

Elasticsearch: Index by Text
Property    Key: Text               Value: Nodes
            Texta (Physics)         a,b,d
            Textb (Engineering)     a
            Textc (Science)         a,c
13. Hybrid Database - Schema
Databases: Cassandra, Elasticsearch

Cassandra: Key = URL, Order by Rank
Property    Key: URL            Clustering: Rank    Value: Links
            /Data_Science       189                 a-c, a-d, …
            /Insight_Data       186                 c-a, c-h, …
            /Spark_Streaming    185                 a-b, b-c

Elasticsearch: Index by Text
Property    Key: Text               Value: Nodes
            Texta (Physics)         a,b,d
            Textb (Engineering)     a
            Textc (Science)         a,c
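One way to read the hybrid schema is as a two-step lookup: the text index resolves a query to a set of nodes, and the URL table, ordered by rank, supplies the links that involve those nodes. The sketch below models that reading with plain Python dictionaries and lists standing in for Elasticsearch and Cassandra. The query flow and the function name `query_network` are assumptions made for illustration, not the deck's implementation. Only the links actually listed on the slide are included; the elided ones are left out.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]

# Stand-in for the Elasticsearch index: text key -> nodes.
text_index: Dict[str, List[str]] = {
    "Physics": ["a", "b", "d"],
    "Engineering": ["a"],
    "Science": ["a", "c"],
}

# Stand-in for the Cassandra table: (URL key, rank, links).
pages: List[Tuple[str, int, List[Edge]]] = [
    ("/Data_Science", 189, [("a", "c"), ("a", "d")]),
    ("/Insight_Data", 186, [("c", "a"), ("c", "h")]),
    ("/Spark_Streaming", 185, [("a", "b"), ("b", "c")]),
]


def query_network(text: str, limit: int = 10) -> List[Tuple[str, int, List[Edge]]]:
    """Hypothetical query flow: text -> nodes, then rank-ordered pages
    whose links touch at least one of those nodes."""
    nodes = set(text_index.get(text, []))
    ranked = sorted(pages, key=lambda row: row[1], reverse=True)
    result = []
    for url, rank, links in ranked:
        # Keep only the edges that involve a node matched by the text query.
        matching = [edge for edge in links if edge[0] in nodes or edge[1] in nodes]
        if matching:
            result.append((url, rank, matching))
        if len(result) == limit:
            break
    return result


for row in query_network("Engineering"):
    print(row)
```

Because the rows are visited in rank order, a small `limit` returns the highest-ranked pages first. That is the property the "Order by Rank" clustering is meant to provide for low-latency reads.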
14. Engineering Challenges:
Approximation of page rank with low latency
[Figure: Network Map – Wikipedia Contributor. QUERY: ALL. Node labels: Mary, Tyler, Ben Casey, Barb, Dana.]
[Figure: Network Map – Wikipedia Contributor. QUERY: Data Engineering. Node labels: Tyler, Ben Casey, Barb, Dana.]
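The slides do not say how page rank is approximated. One common option, shown here purely as an assumption, is to run the standard power iteration for a small, fixed number of rounds rather than to convergence, trading accuracy for bounded latency. The function name, the damping factor of 0.85, the iteration count and the example graph are all hypothetical.

```python
from typing import Dict, List


def approximate_pagerank(
    links: Dict[str, List[str]],
    damping: float = 0.85,
    iterations: int = 5,
) -> Dict[str, float]:
    """Power iteration stopped after a fixed number of rounds.

    `links` maps each node to the nodes it links to. Fewer iterations
    mean lower latency and a coarser approximation.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the teleportation share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-node graph, not taken from the slides.
example_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = approximate_pagerank(example_links, iterations=5)
for node, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(node, round(score, 3))
```

The ranks always sum to 1, so results from different iteration counts can be compared directly when choosing how many rounds the latency budget allows.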