This document summarizes Derek Meng's presentation on data integration using Alibaba Cloud's MaxCompute big data platform. It discusses the general process of data integration including data acquisition, transformation, and governance. It provides an overview of MaxCompute basics, including its architecture, basic concepts such as projects and tables, and how to use MaxCompute's data channel and SQL. The document concludes with a brief introduction to DataWorks for data integration and a demo.
4. Data Source and Type
1 Data Source and Type Introduction
2 General Process of Data Integration
5. Data Integration
• Data Acquisition
• Data Transformation
• Data Governance
Unstructured Data
• TXT
• Picture
• Video
• Audio
• …
Semi-Structured Data
• Log
• XML
• JSON
• …
Structured Data
• Oracle
• MySQL
• SQL Server
• PostgreSQL
• …
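As a minimal sketch of how semi-structured inputs such as JSON and XML might be flattened into structured rows during transformation, the snippet below uses only the Python standard library. The sample records and the field names `user` and `action` are hypothetical and do not come from the presentation.

```python
import json
import xml.etree.ElementTree as ET


def row_from_json(line: str) -> dict:
    """Parse one JSON log line into a flat dict of column -> value."""
    return dict(json.loads(line))


def row_from_xml(text: str) -> dict:
    """Treat each child element of the root as one column."""
    root = ET.fromstring(text)
    return {child.tag: child.text for child in root}


# Hypothetical sample records carrying the same information in two formats.
json_row = row_from_json('{"user": "u1", "action": "login"}')
xml_row = row_from_xml("<event><user>u1</user><action>login</action></event>")
print(json_row == xml_row)
```

Both inputs end up as the same column-to-value mapping, which is the shape a structured table expects.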
7. Alibaba Cloud Big Data Architecture
General Process of Data Integration
8. General Data Processing Workflow
01 Offline
Data Source
Acquisition
• Database
• Local File
• OSS
Data Scrubbing
• SQL
• Custom Code
Data EDA
• Statistics
• Modeling
Data Storage
• Database
Report BI
02 Streaming / Real-Time
Agent
• Console App
• Servers
• Sensors
Transfer and Buffer
• Streaming Transfer Tools
Streaming Process
• Streaming Process Tools
Data Storage
• Database
03 Get Insight / Decision Support
Unified Data Storage
• Database
Ad-Hoc
• Ad-Hoc Query
Data Warehouse
Schedule / Maintain
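The offline path of the workflow (acquisition, scrubbing, EDA, storage) can be sketched as a chain of small functions. This is a toy illustration under stated assumptions: the records, the `amount` field, and the in-memory `warehouse` list are hypothetical stand-ins for real sources and storage.

```python
from statistics import mean


def acquire():
    # Stand-in for reading from a database, a local file, or OSS.
    return [
        {"user": "u1", "amount": "10.5"},
        {"user": "u2", "amount": ""},  # dirty record: missing amount
        {"user": "u3", "amount": "4.5"},
    ]


def scrub(rows):
    # Drop rows with a missing amount and cast the rest to float.
    return [
        {"user": r["user"], "amount": float(r["amount"])}
        for r in rows
        if r["amount"]
    ]


def explore(rows):
    # Simple summary statistics, standing in for the EDA stage.
    amounts = [r["amount"] for r in rows]
    return {"count": len(amounts), "mean": mean(amounts)}


def store(rows, warehouse):
    # Append the cleaned rows to the (hypothetical) storage layer.
    warehouse.extend(rows)


warehouse = []
clean = scrub(acquire())
summary = explore(clean)
store(clean, warehouse)
print(summary)
```

Keeping each stage as a separate function mirrors the slide's boxes and makes it easy to swap one stage (for example, SQL-based scrubbing) without touching the others.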
9. Offline Data Process
RDS Database
OSS Data Store
Server Load Balancer
ECS Cluster
Table Store
Auto Scaling
MaxCompute
Data Source
• RDBMS: MySQL, SQL Server, Oracle, DB2……
• Hadoop Data: Hive, HBase
• Other Data Source: Txt File, Web logs, Video / Audio
12. Basic Concepts
• Project is the most basic unit for resource isolation
• Multiple projects can share the resources of the same cluster
• A Project is similar to Oracle’s Database
• Tables, users and jobs are all subordinate to a project
• After authorization, various projects can achieve data interoperability
[Diagram: PROJECT 1, PROJECT 2, PROJECT 3, PROJECT 4; labels: Table, User, Security Policy, Job, Resource]
13. Storage
• Most data processed by MaxCompute is stored in structured two-dimensional tables
• Tables are subordinate to the project
• Tables can be partitioned
• Data types in a table include Bigint, Boolean, Double, Date/Time, String, and Decimal
• Data is managed by the Pangu storage system. The automatic multi-replica storage policy improves data availability and shields users from underlying hardware faults
• Column-store structure, compressed storage
• Built-in data lifecycle management policy
• Storage quota-based multi-tenant management mechanism
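To make the partitioning and lifecycle points concrete, here is a minimal sketch that assembles a table definition as a string. The helper `create_table_ddl`, the column names, and the 30-day lifecycle are hypothetical; the statement shape (`PARTITIONED BY`, `LIFECYCLE`) follows the Hive-like DDL that MaxCompute SQL is understood to use, so check it against the SQL reference before relying on it.

```python
def create_table_ddl(table, columns, partitions, lifecycle_days):
    """Build a CREATE TABLE statement for a partitioned table (illustrative only)."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    parts = ", ".join(f"{name} {dtype}" for name, dtype in partitions)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} ({cols}) "
        f"PARTITIONED BY ({parts}) "
        f"LIFECYCLE {lifecycle_days};"
    )


# Hypothetical table with two partition columns, as in the Tunnel examples.
ddl = create_table_ddl(
    "test_table",
    columns=[("user_id", "BIGINT"), ("amount", "DOUBLE"), ("note", "STRING")],
    partitions=[("p1", "STRING"), ("p2", "STRING")],
    lifecycle_days=30,
)
print(ddl)
```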
17. Tunnel
• The channel for data to go in and out of MaxCompute
• High-concurrency upload/download
• Horizontal expansion of service capabilities
• 1 PB of throughput supported in a single day
• Batch and Real-time modes
• The real-time mode supports pub/sub models
• ODPS Tunnel-based tools include TT, CDP, Flume, and Fluentd
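As a rough sense of scale, and assuming the daily throughput figure above means one petabyte taken as 10^15 bytes (the slide does not say whether decimal or binary units are meant), the sustained rate works out as follows.

```python
SECONDS_PER_DAY = 24 * 60 * 60
PETABYTE = 10**15  # assumption: decimal petabyte

# Average rate needed to move one petabyte in 24 hours.
bytes_per_second = PETABYTE / SECONDS_PER_DAY
gb_per_second = bytes_per_second / 10**9
print(f"{gb_per_second:.2f} GB/s sustained")
```

That is an average over the whole service, which is consistent with the slide's emphasis on high concurrency and horizontal expansion rather than on any single connection.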
18. Tunnel
• Reads and writes to tables are supported, but views are not supported
• Writes to tables adopt the Append mode
• Concurrency is supported to improve overall throughput
• Frequent commits should be avoided
• The target partition for data uploads must exist
• Real-time upload mode
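The constraints above (append-only writes, an existing target partition, few commits) can be illustrated with a toy in-memory model. `ToyTable` and `upload_in_batches` are hypothetical names invented for this sketch; this is not the Tunnel SDK and does not reflect its API.

```python
class ToyTable:
    """In-memory stand-in for a partitioned table; not the Tunnel SDK."""

    def __init__(self):
        self.partitions = {}  # partition spec -> list of rows
        self.commits = 0

    def add_partition(self, spec):
        self.partitions.setdefault(spec, [])

    def upload(self, spec, rows):
        # The target partition must already exist.
        if spec not in self.partitions:
            raise KeyError(f"partition {spec!r} does not exist")
        # Writes append; existing rows are never modified.
        self.partitions[spec].extend(rows)
        self.commits += 1


def upload_in_batches(table, spec, rows, batch_size):
    # Commit once per batch instead of once per row to avoid frequent commits.
    for start in range(0, len(rows), batch_size):
        table.upload(spec, rows[start:start + batch_size])


table = ToyTable()
table.add_partition("p1=b1,p2=b2")
upload_in_batches(table, "p1=b1,p2=b2", list(range(10)), batch_size=4)
print(table.commits, len(table.partitions["p1=b1,p2=b2"]))
```

Ten rows with a batch size of four result in three commits rather than ten, and uploading to a partition that was never created raises an error instead of silently creating it.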
19. Tunnel Command
Data Upload/Download in Tunnel
• odps@ > tunnel upload log.txt test_project.test_table/p1="b1",p2="b2";
• odps@ > tunnel download test_project.test_table/p1="b1",p2="b2" log.txt;
• It is a Tunnel SDK-based command-line tool for uploading local text files to ODPS or downloading table data to a local location
• The table partitions must already exist
• DataX, CDP, and TT provide higher-level tools built on Tunnel that support data exchange between ODPS and relational databases
• Log data can be imported using Flume and Fluentd
• Users with special scenarios can develop custom tools based on Tunnel
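When many files or partitions are involved, the two command forms shown on this slide can be generated rather than typed by hand. The sketch below only formats strings in the same shape as the slide's examples; the helper names are hypothetical, and it does not invoke the client or add any options beyond those shown.

```python
def partition_spec(partitions):
    """Format partition columns as p1="b1",p2="b2" (the form used on the slide)."""
    return ",".join(f'{key}="{value}"' for key, value in partitions.items())


def tunnel_upload(local_file, project, table, partitions):
    return f"tunnel upload {local_file} {project}.{table}/{partition_spec(partitions)};"


def tunnel_download(project, table, partitions, local_file):
    return f"tunnel download {project}.{table}/{partition_spec(partitions)} {local_file};"


parts = {"p1": "b1", "p2": "b2"}
upload_cmd = tunnel_upload("log.txt", "test_project", "test_table", parts)
download_cmd = tunnel_download("test_project", "test_table", parts, "log.txt")
print(upload_cmd)
print(download_cmd)
```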
20. SQL
• Suited to processing large amounts of data (terabytes to petabytes)
• High latency: the running time of each SQL statement ranges from dozens of seconds to several hours.
• The syntax is similar to Hive's HQL, with some extensions to standard SQL.
• There are no transactions and no primary keys.
• UPDATE and DELETE commands are not supported.
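Without UPDATE or DELETE, a common workaround in Hive-like engines is to rewrite a whole partition with the changed values. The sketch below builds such a statement as a string; it assumes `INSERT OVERWRITE TABLE ... PARTITION (...) SELECT ...` is available in the same form as in Hive, and the helper name, table, columns, and condition are hypothetical. The presentation itself does not describe this pattern.

```python
def rewrite_partition_sql(table, partition, columns, target, new_value, condition):
    """Emulate an UPDATE by overwriting one partition with recomputed rows."""
    spec = ", ".join(f"{k}='{v}'" for k, v in partition.items())
    where = " AND ".join(f"{k}='{v}'" for k, v in partition.items())
    # Only the target column is wrapped in CASE; other columns pass through.
    select_list = ", ".join(
        f"CASE WHEN {condition} THEN {new_value} ELSE {col} END AS {col}"
        if col == target
        else col
        for col in columns
    )
    return (
        f"INSERT OVERWRITE TABLE {table} PARTITION ({spec}) "
        f"SELECT {select_list} FROM {table} WHERE {where};"
    )


sql = rewrite_partition_sql(
    "test_table",
    partition={"p1": "b1", "p2": "b2"},
    columns=["user_id", "amount"],
    target="amount",
    new_value="0.0",
    condition="user_id = 42",
)
print(sql)
```

Because every statement has high latency, a rewrite like this is usually applied per partition and in bulk rather than row by row.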