Big Telco, Bigger real-time demands: Real-time processing in Telco
- Presented by Jung-ryong Lee, engineer manager at SK Telecom at Gruter TECHDAY 2014 Oct.29 Seoul, Korea
2. Jung Ryong Lee
• IT Manager of SK Telecom, South Korea’s
largest wireless communications provider
• Work on commercial products (~ ’12)
• Has worked with email archived solutions
• Has worked with IDS using In-stream processing engine
• Open source activity (’14 ~)
• Committer of Apache REEF
4. Background
• Telco data characteristics
• Huge amount of data daily
• 40 TB/day
• 15 PB (estimated by the end of 2014)
• Active user of Hadoop
• Involved with 10 + Hadoop clusters
10 + Clusters
• The largest one has 500 + nodes and a total of 900 + nodes
altogether.
• Uses various commercial MPP databases for analytics
5. Real-time analytics in Telco
Introduce 3 projects using Spark and Tajo
1. The first being high speed data processing
• Replacement of MPP database
2. The second being In-Stream data processing
• Replacement of Hive batch job
3. The third being manufacturing optimization
• Implementing of variable column table
6. Previous approach
Working, But…
Operational
Systems
Integration Layer Data Warehouse
Marketing
Sales
Hadoop MPP DBMS
ERP
SCM
ODS
Staging Area
Data Vault
Data Marts
Strategic
Marts
Hadoop MPP DBMS
7. Previous approach
have to load too much data into MPP DBMS
Operational
Systems
Integration Layer Data Warehouse
Marketing
Sales
Hadoop MPP DBMS
ERP
SCM
ODS
Staging Area
Data Valut
Data Marts
Strategic
Marts
Hadoop MPP DBMS
8. New approach
Support High speed ETL data processing
Real-time query processing for web client users
Operational
Systems
Integration Layer Data Warehouse
Marketing
Sales
Hadoop MPP DBMS
ERP
SCM
ODS
Staging Area
Data Valut
Data Marts
Strategic
Marts
Hadoop
9. System Requirements
• Low latency ad-hoc query(< 2secs)
• ANSI SQL support
(no need to Insert/Update/Delete)
• JDBC support
• Support concurrent users(10 users per sec)
• High availability
10. Shark on Spark
• It can replace RDBMS
• Low latency for ad-hoc query with caching tables
• HiveQL support
• JDBC support
• Support concurrent users(10users per sec)
• High availability
11. Shark on Spark
Shark Worker
Server
Worker
Worker
Worker
YARN
Worker
Worker
Worker
Worker
Load
balancer
Shark
Server
Tachyon
JDBC
HDFS
12. Function & Performance
• All queries work good with some modifications of
queries. But..
13. Improvement
• Improvement and Bug fix
• Bug fix of Hive 0.11 / Support Non ASCII table name
• Implement UDF(Oracle like functions)
• Change Correlation Sub Query to Left Outer Join
• Performance Improvement
• Control reducer count, block size
• Material view can reduce query time
• Caching query result
14. Improvement - Example
• Mainly used for querying small dataset that doesn’t change over time
• Can reduce query time with a simple understanding of Shark and Hive
SQL Operation HDFS
Cache
Request location Manager
of result
Session
CLIService Manager
Manager
Fetch
Operation
read result
SQL
Operation
Close
Operation
write result
delete result
Spark Cluster
Spark Task
16. In-Stream data processing
• Replace Hive batch job with Spark
• One hour batch job -> 5 sec batch job in 1 minute of window time
• Calculate top 100 keywords and applications
• Processing data with 530MB/s
-> 1 mil records / sec
• Must be implemented within one month
17. The answer is Spark
Streaming!!
• Very similar to Spark
• DStream is very similar to RDD
• Can use most functions that RDD provides(groupByKey,
sortByKey)
• Easily change batch application to in-stream processing
application
• Support local environment
• Can evaluate the application with laptop
18. Implementation process
• It only takes one week to complete the process
Implement
batch
application
Batch job application
Evaluate
with
local mode
Evaluate
with
Cluster
Change the code to in-stream
processing
Evaluate
with
local mode
Evaluate
with
cluster
In-stream application
19. Data processing architecture
• Only “84 lines of code” are needed!
HDFS
SocketInputDStr
Spark Streaming
Receiver
Spark Streaming
Receiver eam
Spark
Streaming
Elastic
Search
SocketInputDStr
Spark Streaming
Receiver
Spark Streaming
Receiver eam
SocketInputDStr
Spark Streaming
Receiver
Spark Streaming
Receiver eam
HDFS
Collector
HDFS
Collector
HDFS
Collector
20. Performance
• About 3 times faster than other In-streaming
processing engines.
21. Implement Variable Column
Table
Key A B C
A1 0.1 0.3 0.2
A2 0.2 0.1 0.2
A3 0.5 0.1 0.1
…
Key A E F
B1 0.1 0.3 0.2
B2 0.2 0.1 0.2
B3 0.5 0.1 0.1
…
Key A D F
B1 0.1 0.3 0.2
B2 0.2 0.1 0.2
B3 0.5 0.1 0.1
…
• Dataset is too various
• Each data size is about
700M ~ 1G
• Query is very simple
• Procedure support
• Non-Ascii Column name
• Support BI solution
22. RDBMS Approach
• Too many records are created
• Performance and disk space problem
• If we have 2,000 records with 2,000 columns,
total count of retrieval records is 4 million
records.
Key1 Key2 Value
A1 A 0.1
A1 B 0.3
A1 C 0.2
…
23. Tajo Approach
• Using LDW feature
• Extremely high speed data return
• Customizable JDBC Interface
24. Lessons learned
• We can implement OLTP style systems and
In-stream data processing with Spark and Tajo.
• Win-win between community and company
• Test in real working cluster
• Finding some bugs and function requirements, as well.
• Mainly focusing on low latency query and in-streaming data
processing.