Data Integration on Alibaba Cloud's Big Data Platform: MaxCompute, DataWorks Demo

Presented by Derek Meng
Data Integration on the Alibaba Cloud Big Data Platform
From OSS, RDS to MaxCompute

Overview
2 /25
01 DATA INTEGRATION: General Process of Data Integration (Slide No. 3-9)
02 MAXCOMPUTE: MaxCompute Basics (Slide No. 10-20)
03 DATAWORKS: DataWorks Basics (Slide No. 21-22)
04 DEMO: Getting Started with Alibaba Cloud (Slide No. 23-24)

01
3 /25

Data Source and Type
Data Source and Type Introduction
1
2
4 /25

Data Integration
5 /25
• Data Acquisition
• Data Transformation
• Data Governance
Unstructured Data
• TXT
• Picture
• Video
• Audio
• ….
Semi-Structured Data
• Log
• XML
• JSON
• ….
Structured Data
• Oracle
• MySQL
• SQL Server
• PostgreSQL
• …
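To make the difference between semi-structured and structured data concrete, the sketch below flattens one JSON log line into a fixed-column row of the kind a relational table expects. The field names (`ts`, `user`, `action`) and the `flatten_log` helper are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical fixed schema for the structured target table.
COLUMNS = ("ts", "user", "action")

def flatten_log(line: str) -> tuple:
    """Parse one JSON log line and project it onto COLUMNS.

    Keys missing from the log become None, which plays the role of NULL
    in a relational table. Keys outside the schema are ignored.
    """
    record = json.loads(line)
    return tuple(record.get(col) for col in COLUMNS)

row = flatten_log('{"ts": "2018-01-01 00:00:00", "user": "u1", "extra": 42}')
print(row)
```

The structured side has a fixed set of columns, while the semi-structured side may omit or add keys from record to record.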

Alibaba Cloud Big Data Architecture
1
2
General Process of Data Integration
7 /25

General Data Processing Workflow
8 /25
01 Offline
Schedule / Maintain
Data Source
Acquisition
• Database
• Local File
• OSS
Data Scrubbing
• SQL
• Custom Code
Data EDA
• Statistics
• Modeling
Data Storage
• Database
Report BI
02 Streaming
Real-Time Streaming Process
Agent
• Console App
• Servers
• Sensors
Transfer and Buffer
• Streaming Transfer Tools
Streaming Process
• Streaming Process Tools
Data Storage
• Database
03 Get Insight
Decision Support
Data Warehouse
Unified Data Storage
• Database
Ad-Hoc
• Ad-Hoc Query
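The offline stages named on this slide (acquisition, scrubbing, EDA) can be sketched as a chain of small functions. This is a minimal illustration using only the Python standard library. The sample data, the column names, and the function names `acquire`, `scrub`, and `explore` are all hypothetical.

```python
import csv
import io
import statistics

# Hypothetical raw input standing in for a local file or an OSS object.
RAW = "order_id,amount\n1,10.5\n2,\n3,4.5\n"

def acquire(text):
    """Acquisition: read rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def scrub(rows):
    """Data scrubbing: drop rows with a missing amount and cast types."""
    return [
        {"order_id": int(r["order_id"]), "amount": float(r["amount"])}
        for r in rows
        if r["amount"]
    ]

def explore(rows):
    """Data EDA: compute simple summary statistics."""
    amounts = [r["amount"] for r in rows]
    return {"count": len(amounts), "mean": statistics.mean(amounts)}

clean = scrub(acquire(RAW))
summary = explore(clean)
print(summary)
```

In a real deployment each stage would be a scheduled job writing to a database or data warehouse rather than an in-memory function call.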

Offline Data Process
9 /25
Data Source
• RDBMS: MySQL, SQL Server, Oracle, DB2……
• Hadoop Data: Hive, HBase
• Other Data Source: Txt File, Web logs, Video / Audio
RDS Database
OSS Data Store
Server Load Balancer
ECS Cluster
Table Store
Auto Scaling
MaxCompute

MaxCompute Basics
1 Basic Concepts of MaxCompute
2 MaxCompute Architecture
3 MaxCompute Data Channel and SQL
11 /25

12 /25
Basic Concepts
• A project is the basic unit of resource isolation
• Multiple projects can share the resources of the same cluster
• A project is similar to a database in Oracle
• Tables, users, and jobs all belong to a project
• Once authorized, projects can access each other's data
[Figure: PROJECT 1, PROJECT 2, PROJECT 3, PROJECT 4; project contents: Table, User, Security Policy, Job, Resource]

13 /25
Storage
• Most data processed by MaxCompute is stored in structured two-dimensional tables
• Tables belong to a project
• Tables can be partitioned
• Data types in a table include Bigint, Boolean, Double, Date/Time, String, and Decimal
• Data is managed by the Pangu storage system. Its automatic multi-replica storage policy improves data availability and shields users from underlying hardware faults
• Column-store structure, compressed storage
• Built-in data lifecycle management policy
• Storage quota-based multi-tenant management mechanism
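The partitioning and lifecycle points on the Storage slide can be illustrated with a small helper that assembles a table definition. This is a sketch that assumes the MaxCompute SQL clauses `PARTITIONED BY` and `LIFECYCLE` (a retention period in days); the table name, column names, and the `create_table_ddl` helper are hypothetical, and the generated statement should be checked against the current MaxCompute SQL reference before use.

```python
def create_table_ddl(table, columns, partitions=(), lifecycle_days=None):
    """Build a CREATE TABLE statement with optional partitions and lifecycle."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    ddl = f"CREATE TABLE IF NOT EXISTS {table} ({cols})"
    if partitions:
        # Partition columns are declared separately from regular columns.
        parts = ", ".join(f"{name} {dtype}" for name, dtype in partitions)
        ddl += f" PARTITIONED BY ({parts})"
    if lifecycle_days is not None:
        # Assumed meaning: data older than this many days may be reclaimed.
        ddl += f" LIFECYCLE {lifecycle_days}"
    return ddl + ";"

ddl = create_table_ddl(
    "sale_detail",
    [("shop_name", "STRING"), ("total_price", "DOUBLE")],
    partitions=[("sale_date", "STRING")],
    lifecycle_days=30,
)
print(ddl)
```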

MaxCompute Basics
15 /25
[Figure: MaxCompute architecture. Computing models: SQL, MapReduce, Graph, Machine Learning. MaxCompute Engine on the Apsara Distributed System, spanning Cluster 1, Cluster 2, and Cluster 3, each labeled 10000.]

17 /25
Tunnel
• The channel for data to go in and out of MaxCompute
• High-concurrency upload/download
• Horizontal expansion of service capabilities
• Supports 1 PB of throughput in a single day
• Batch and Real-time modes
• The real-time mode supports pub/sub models
• ODPS Tunnel-based tools include TT, CDP, Flume, and Fluentd
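The daily throughput figure on this slide is easier to judge as a sustained rate. The arithmetic below assumes that "1P" means one petabyte and takes a petabyte as 10**15 bytes; both are assumptions, not statements from the deck.

```python
# Assumption: "1P" means one petabyte, taken here as 10**15 bytes.
PETABYTE = 10**15
SECONDS_PER_DAY = 24 * 60 * 60

# Average rate needed to move one petabyte in 24 hours.
bytes_per_second = PETABYTE / SECONDS_PER_DAY
gigabytes_per_second = bytes_per_second / 10**9
print(f"{gigabytes_per_second:.1f} GB/s sustained")
```

Under these assumptions the service would need to sustain roughly 11.6 GB/s on average, which helps explain why the slide stresses high concurrency and horizontal expansion.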

18 /25
Tunnel
• Reads from and writes to tables are supported; views are not supported
• Writes to tables use Append mode
• Concurrency is supported to improve overall throughput
• Avoid frequent commits
• The target partition for data uploads must already exist
• Real-time upload mode
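The constraints on this slide (append-only writes, concurrent uploads, few commits, a pre-existing partition) can be sketched with an in-memory stand-in. `FakeUploadSession` and `upload` are hypothetical names and do not call the real Tunnel SDK; the sketch only shows the pattern of writing blocks in parallel and committing once.

```python
from concurrent.futures import ThreadPoolExecutor

class FakeUploadSession:
    """In-memory stand-in for a Tunnel upload session (hypothetical)."""

    def __init__(self, existing_partitions, partition):
        # The target partition must exist before the upload starts.
        if partition not in existing_partitions:
            raise ValueError(f"partition {partition!r} does not exist")
        self.blocks = {}
        self.commits = 0
        self.table = []

    def write_block(self, block_id, records):
        # Each worker writes its own block, so no two threads share a key.
        self.blocks[block_id] = list(records)

    def commit(self):
        # Append mode: committed records are added to the table, never replaced.
        for block_id in sorted(self.blocks):
            self.table.extend(self.blocks[block_id])
        self.blocks.clear()
        self.commits += 1

def upload(session, records, block_size=2, workers=4):
    """Split records into blocks, write them concurrently, commit once."""
    blocks = [records[i:i + block_size] for i in range(0, len(records), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda item: session.write_block(*item), enumerate(blocks)))
    session.commit()

session = FakeUploadSession({"p1=b1,p2=b2"}, "p1=b1,p2=b2")
upload(session, ["r1", "r2", "r3", "r4", "r5"])
print(session.table, session.commits)
```

Batching all blocks into a single commit keeps the number of commits low, in line with the slide's guidance.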

19 /25
Tunnel Command
Data Upload/Download in Tunnel
• odps@ > tunnel upload log.txt test_project.test_table/p1="b1",p2="b2";
• odps@ > tunnel download test_project.test_table/p1="b1",p2="b2" log.txt;
• The tunnel command is a command-line tool built on the Tunnel SDK. It uploads local text files to ODPS or downloads table data to a local file
• The target table partitions must already exist
• DataX, CDP, and TT provide higher-level tools built on Tunnel that support data exchange between ODPS and relational databases
• Log data can be imported using the Flume and Fluentd tools
• Users with special scenarios can develop custom tools based on Tunnel
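When many partitions need to be loaded, the two command forms shown on this slide can be generated rather than typed by hand. The helper names below are hypothetical; the output format simply mirrors the `tunnel upload` and `tunnel download` examples from the slide.

```python
def partition_spec(partitions):
    """Render {"p1": "b1"} as p1="b1", with entries joined by commas."""
    return ",".join(f'{key}="{value}"' for key, value in partitions.items())

def tunnel_upload(path, project, table, partitions):
    """Build a `tunnel upload` command line for the odps console."""
    return f"tunnel upload {path} {project}.{table}/{partition_spec(partitions)};"

def tunnel_download(path, project, table, partitions):
    """Build the matching `tunnel download` command line."""
    return f"tunnel download {project}.{table}/{partition_spec(partitions)} {path};"

parts = {"p1": "b1", "p2": "b2"}
print(tunnel_upload("log.txt", "test_project", "test_table", parts))
print(tunnel_download("log.txt", "test_project", "test_table", parts))
```

Note the argument order: the local file comes first for an upload and last for a download.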

20 /25
SQL
• Suited to processing large volumes of data (terabytes to petabytes)
• High latency: each SQL statement takes from tens of seconds to several hours to run
• The syntax is similar to Hive's HQL, with some extensions to standard SQL
• Transactions and primary keys are not supported
• UPDATE and DELETE commands are not supported
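Without UPDATE and DELETE, a common workaround in Hive-style engines is to rewrite a whole partition, keeping only the rows that should survive. The sketch below generates such a statement. It assumes that `INSERT OVERWRITE ... PARTITION` is available and replaces the partition contents with the query result; the table, column names, and the `delete_via_overwrite` helper are hypothetical.

```python
def delete_via_overwrite(table, partition, columns, delete_condition):
    """Emulate DELETE by rewriting one partition without the matching rows.

    Assumption: the engine supports INSERT OVERWRITE on a partition, which
    replaces the partition's contents with the SELECT result.
    `columns` lists the non-partition columns of the table.
    """
    part_key, part_value = partition
    col_list = ", ".join(columns)
    return (
        f"INSERT OVERWRITE TABLE {table} PARTITION ({part_key}='{part_value}') "
        f"SELECT {col_list} FROM {table} "
        f"WHERE {part_key}='{part_value}' AND NOT ({delete_condition});"
    )

sql = delete_via_overwrite("test_table", ("p1", "b1"), ["id", "msg"], "id = 42")
print(sql)
```

One caveat of this pattern: rows for which the condition evaluates to NULL are also filtered out by `NOT (...)`, so nullable columns need explicit handling. Because each rewrite is a full batch job, the approach fits the high-latency, large-volume profile described on the slide rather than row-at-a-time changes.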

04
Getting Started with Alibaba Cloud
23 /25
