Gruter TECHDAY 2014 Realtime Processing in Telco

Big Telco, Bigger
real-time demands:
Real-time processing
in Telco

Jung Ryong Lee
• IT Manager of SK Telecom, South Korea’s
largest wireless communications provider
• Work on commercial products (~ ’12)
• Has worked with email archived solutions
• Has worked with IDS using In-stream processing engine
• Open source activity (’14 ~)
• Committer of Apache REEF

Overview
• Background
• Real-time analytics in Telco
• Project 1 - High speed data processing
• Issues & solutions
• Performance
• Project 2 - In-stream processing
• Project 3 - Manufacturing Optimization
• Lessons Learned

Background
• Telco data characteristics
• Huge amount of data daily
• 40 TB/day
• 15 PB (estimated by the end of 2014)
• Active user of Hadoop
• Involved with 10 + Hadoop clusters
10 + Clusters
• The largest one has 500 + nodes and a total of 900 + nodes
altogether.
• Uses various commercial MPP databases for analytics

Real-time analytics in Telco
Introduce 3 projects using Spark and Tajo
1. The first being high speed data processing
• Replacement of MPP database
2. The second being In-Stream data processing
• Replacement of Hive batch job
3. The third being manufacturing optimization
• Implementing of variable column table

Previous approach
Working, But…
Operational
Systems
Integration Layer Data Warehouse
Marketing
Sales
Hadoop MPP DBMS
ERP
SCM
ODS
Staging Area
Data Vault
Data Marts
Strategic
Marts
Hadoop MPP DBMS

Previous approach
have to load too much data into MPP DBMS
Operational
Systems
Marketing
Sales
Hadoop MPP DBMS
ERP
SCM
ODS
Staging Area
Data Valut
Data Marts
Strategic
Marts
Hadoop MPP DBMS

New approach
Support High speed ETL data processing
Real-time query processing for web client users
Operational
Systems
Marketing
Sales
Hadoop MPP DBMS
ERP
SCM
ODS
Staging Area
Data Valut
Data Marts
Strategic
Marts
Hadoop

System Requirements
• Low latency ad-hoc query(< 2secs)
• ANSI SQL support
(no need to Insert/Update/Delete)
• JDBC support
• Support concurrent users(10 users per sec)
• High availability

Shark on Spark
• It can replace RDBMS
• Low latency for ad-hoc query with caching tables
• HiveQL support
• JDBC support
• Support concurrent users(10users per sec)
• High availability

Shark on Spark
Shark Worker
Server
Worker
Worker
Worker
YARN
Worker
Worker
Worker
Worker
Load
balancer
Shark
Server
Tachyon
JDBC
HDFS

Function & Performance
• All queries work good with some modifications of
queries. But..

Improvement
• Improvement and Bug fix
• Bug fix of Hive 0.11 / Support Non ASCII table name
• Implement UDF(Oracle like functions)
• Change Correlation Sub Query to Left Outer Join
• Performance Improvement
• Control reducer count, block size
• Material view can reduce query time
• Caching query result

Improvement - Example
• Mainly used for querying small dataset that doesn’t change over time
• Can reduce query time with a simple understanding of Shark and Hive
SQL Operation HDFS
Cache
Request location Manager
of result
Session
CLIService Manager
Manager
Fetch
Operation
read result
SQL
Operation
Close
Operation
write result
delete result
Spark Cluster
Spark Task

Result of Caching query
• Most queries complete in 0.3 seconds.

In-Stream data processing
• Replace Hive batch job with Spark
• One hour batch job -> 5 sec batch job in 1 minute of window time
• Calculate top 100 keywords and applications
• Processing data with 530MB/s
-> 1 mil records / sec
• Must be implemented within one month

The answer is Spark
Streaming!!
• Very similar to Spark
• DStream is very similar to RDD
• Can use most functions that RDD provides(groupByKey,
sortByKey)
• Easily change batch application to in-stream processing
application
• Support local environment
• Can evaluate the application with laptop

Implementation process
• It only takes one week to complete the process
Implement
batch
application
Batch job application
Evaluate
with
local mode
Evaluate
with
Cluster
Change the code to in-stream
processing
Evaluate
with
local mode
Evaluate
with
cluster
In-stream application

Data processing architecture
• Only “84 lines of code” are needed!
HDFS
SocketInputDStr
Spark Streaming
Receiver
Spark Streaming
Receiver eam
Spark
Streaming
Elastic
Search
SocketInputDStr
Spark Streaming
Receiver
Spark Streaming
Receiver eam
SocketInputDStr
Spark Streaming
Receiver
Spark Streaming
Receiver eam
HDFS
Collector
HDFS
Collector
HDFS
Collector

Performance
• About 3 times faster than other In-streaming
processing engines.

Implement Variable Column
Table
Key A B C
A1 0.1 0.3 0.2
A2 0.2 0.1 0.2
A3 0.5 0.1 0.1
…
Key A E F
B1 0.1 0.3 0.2
B2 0.2 0.1 0.2
B3 0.5 0.1 0.1
…
Key A D F
B1 0.1 0.3 0.2
B2 0.2 0.1 0.2
B3 0.5 0.1 0.1
…
• Dataset is too various
• Each data size is about
700M ~ 1G
• Query is very simple
• Procedure support
• Non-Ascii Column name
• Support BI solution

RDBMS Approach
• Too many records are created
• Performance and disk space problem
• If we have 2,000 records with 2,000 columns,
total count of retrieval records is 4 million
records.
Key1 Key2 Value
A1 A 0.1
A1 B 0.3
A1 C 0.2
…

Tajo Approach
• Using LDW feature
• Extremely high speed data return
• Customizable JDBC Interface

Lessons learned
• We can implement OLTP style systems and
In-stream data processing with Spark and Tajo.
• Win-win between community and company
• Test in real working cluster
• Finding some bugs and function requirements, as well.
• Mainly focusing on low latency query and in-streaming data
processing.

Gruter TECHDAY 2014 Realtime Processing in Telco

Gruter TECHDAY 2014 Realtime Processing in Telco

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Andere mochten auch

Andere mochten auch (20)

Ähnlich wie Gruter TECHDAY 2014 Realtime Processing in Telco

Ähnlich wie Gruter TECHDAY 2014 Realtime Processing in Telco (20)

Mehr von Gruter

Mehr von Gruter (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Gruter TECHDAY 2014 Realtime Processing in Telco