We present a case study of moving from a SQL Server data warehouse to Hadoop and Vertica. The case study shows an implementation of the Lambda Architecture with Hadoop, Vertica and MongoDB for real-time statistics. We start by showing the legacy system and describing the problems we encountered, then cover the decision making behind the technology choices for the current solution, and finish by presenting the next steps for our Data Platform.
12. First Stage
[Architecture diagram: the Legacy DWH, with Data Collection, Data Distribution, DWH API, Event Collector, Analytics, Reporting and Monitoring server groups, fed by External Data Providers over FTP and sFTP.]
15. Second Stage
[Architecture diagram: the Real Time DWH, with Data Collection, Data Distribution, DWH API, Event Collector, Scheduling, Analytics, Reporting and Monitoring server groups, fed by External Data Providers over FTP, sFTP, S3 and Azure.]
18. Batch Event Processing
[Architecture diagram: the Hadoop Processing Cluster. Raw data files are pushed from the Data Collection Cluster to Hadoop over WebHDFS; processed and aggregated data flow out to the external/internal Vertica DWH clusters, and archived data goes to S3.]
Raw data processing:
1. Cleaning, transformation, enrichment and validation of data from the main data sources with Map-Reduce.
2. Month-history aggregator process.
Processed data aggregator:
1. A DSL for defining new kinds of aggregation.
Data exporter:
1. Export aggregated data.
2. Export processed data.
Error processing:
1. Automatic error re-processing within a time window.
Logging framework (Elasticsearch): logs are exposed through Kibana to monitor data flow.
Monitoring: monitoring of data flow inside and outside of the Event Processing Cluster; Hadoop monitoring data is collected as well.
Data Archivator: archives raw data to S3.
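The cleaning/validation step on the batch-processing slide can be sketched in Hadoop Streaming style (one tab-separated key/value line per output record). This is a minimal illustration only: the event fields (`event_id`, `timestamp`, `user_id`) and the month key are assumptions, not the actual schema.

```python
import json
from datetime import datetime

# Hypothetical required fields; the real sources have their own schemas.
REQUIRED = ("event_id", "timestamp", "user_id")

def clean_event(line):
    """Validate one raw JSON event and enrich it with the month key the
    month-history aggregator groups by; return None on bad input."""
    try:
        event = json.loads(line)
    except ValueError:
        return None
    if not isinstance(event, dict) or any(not event.get(f) for f in REQUIRED):
        return None
    try:
        ts = datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%S")
    except ValueError:
        return None
    event["month"] = ts.strftime("%Y-%m")  # enrichment: derived month key
    return event

def run_mapper(lines):
    """Map phase: drop invalid events and key the valid ones by month,
    so a reducer can build the monthly aggregates."""
    out = []
    for line in lines:
        event = clean_event(line.strip())
        if event is not None:
            out.append("%s\t%s" % (event["month"], json.dumps(event)))
    return out
```

A reducer would then sum or collect per month key; new aggregations would be declared through the DSL rather than hand-written this way.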
23. Data Collection
[Architecture diagram: the Data Collection Cluster, fed by Video Tracking, Ad Tracking, User Tracking and 3rd-party Ad Tracking servers plus a SQL Server source. Raw data files are pushed to the Hadoop Processing Cluster over WebHDFS; dimension tables go to MongoDB.]
Inputs:
1. CSV data received every hour via FTP: raw events and dimensions.
2. Text files received every five minutes from the public and private cloud: raw events.
Acquisition:
Stage 1: a .NET application pulls FTP and the SQL DWH server for logger data, and uses SQL Replication for dimension data.
Stage 2: consider moving to a more appropriate technology, such as Akka.
Logging framework (Elasticsearch): data about received files and events is reported with the logging framework; logs are exposed through Kibana to monitor data flow.
Monitoring: monitoring of data flow inside and outside of the Data Collection Cluster.
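The "raw data files pushed to Hadoop (WebHDFS)" arrow can be sketched with the standard WebHDFS REST protocol, where a file create is a two-step PUT: the NameNode answers with a 307 redirect to a DataNode, and the bytes go there. Host names, ports and HDFS paths below are placeholder assumptions.

```python
import http.client
from urllib.parse import urlsplit

def webhdfs_create_url(namenode, path, user, overwrite=False):
    """Build the WebHDFS CREATE URL (step 1 of the two-step create)."""
    return ("http://%s/webhdfs/v1%s?op=CREATE&user.name=%s&overwrite=%s"
            % (namenode, path, user, str(overwrite).lower()))

def push_file(namenode, hdfs_path, local_path, user="etl"):
    """Upload one raw data file over WebHDFS."""
    url = urlsplit(webhdfs_create_url(namenode, hdfs_path, user, overwrite=True))
    # Step 1: ask the NameNode where to write; it replies 307 with the
    # target DataNode in the Location header (send no body here).
    conn = http.client.HTTPConnection(url.netloc)
    conn.request("PUT", url.path + "?" + url.query)
    location = conn.getresponse().getheader("Location")
    conn.close()
    # Step 2: PUT the actual file bytes to that DataNode.
    dn = urlsplit(location)
    conn = http.client.HTTPConnection(dn.netloc)
    with open(local_path, "rb") as f:
        conn.request("PUT", dn.path + "?" + dn.query, body=f.read())
    status = conn.getresponse().status  # 201 Created on success
    conn.close()
    return status
```

In production the .NET collector (or, per Stage 2, an Akka-based one) would do the equivalent HTTP calls; the point is that no Hadoop client library is needed on the collection side.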
24. Data Distribution
[Architecture diagram: the Data Distribution Cluster, reading from Hive, Vertica and MongoDB; a Report Distributor feeds the Reporting Platform and Report S3 Storage.]
Logging framework (Elasticsearch): logs are exposed through Kibana to monitor data flow.
Monitoring: monitoring of data flow inside and outside of the Data Distribution Cluster.
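Reading from both Vertica/Hive and MongoDB is the serving-layer merge of the Lambda Architecture: batch views cover fully processed days, and the real-time view fills in what the batch layer has not caught up to yet. A minimal sketch, assuming per-day counts keyed by ISO date (the actual view schema is not shown in the deck):

```python
def merge_views(batch_view, realtime_view, batch_horizon):
    """Combine per-day counts: trust the batch view up to batch_horizon
    (the last day the batch layer has fully processed) and take the
    remaining days from the real-time view."""
    merged = {day: n for day, n in batch_view.items() if day <= batch_horizon}
    for day, n in realtime_view.items():
        if day > batch_horizon:
            merged[day] = merged.get(day, 0) + n
    return merged

# Example: the batch layer has processed through 2015-06-02, so the
# real-time count for that day is ignored and only 2015-06-03 is added.
batch = {"2015-06-01": 100, "2015-06-02": 80}      # e.g. from Vertica
realtime = {"2015-06-02": 85, "2015-06-03": 40}    # e.g. from MongoDB
report = merge_views(batch, realtime, "2015-06-02")
```

Discarding the real-time numbers once a batch view covers a day is what makes the real-time path's approximations self-correcting.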