2. About me
• Education
– NCU (MIS)
– NCCU (CS)
• Experience
– Raritan, TWM, FET, CHT
• Teaching
– III
• Community
– TW Spark User Group
– TW Hadoop User Group
• Research
– III MIC - special columnist
– A.I. robot
– Big Data & Machine learning
• Team Group
– 聯瞻資訊顧問、慶騰資訊顧問
66. HDP Hive Exercise – DML
• Perform a table join
SELECT
D.*
,T.HOURS_LOGGED
,T.MILES_LOGGED
FROM DRIVERS D
JOIN TIMESHEET T
ON D.DRIVERID = T.DRIVERID;
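To see what this join returns, here is a minimal sketch that runs the same query shape against Python's built-in sqlite3 as a stand-in for Hive. The sample rows and the week column are assumptions made up for illustration; only the table names and the other column names come from the slide.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE drivers   (driverid INTEGER, name TEXT);
    CREATE TABLE timesheet (driverid INTEGER, week INTEGER,
                            hours_logged INTEGER, miles_logged INTEGER);
    -- Hypothetical sample rows, not the HDP tutorial data.
    INSERT INTO drivers   VALUES (10, 'Alice'), (11, 'Bob'), (12, 'Carol');
    INSERT INTO timesheet VALUES (10, 1, 40, 2000), (10, 2, 45, 2300),
                                 (11, 1, 50, 2600);
""")

# Same shape as the slide's query: every driver column plus two timesheet columns.
join_rows = conn.execute("""
    SELECT d.*, t.hours_logged, t.miles_logged
    FROM drivers d
    JOIN timesheet t
      ON d.driverid = t.driverid
    ORDER BY d.driverid, t.week
""").fetchall()

for row in join_rows:
    print(row)
```

The inner join yields one row per matching timesheet entry, so a driver with two timesheet rows appears twice and a driver with none (Carol here) does not appear at all.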
67. HDP Hive Exercise – DML
• Perform a table join
SELECT
D.DRIVERID
,D.NAME
,T.TOTAL_HOURS
,T.TOTAL_MILES
FROM DEFAULT.DRIVERS D
JOIN (
SELECT
DRIVERID
,SUM(HOURS_LOGGED) AS TOTAL_HOURS
,SUM(MILES_LOGGED) AS TOTAL_MILES
FROM DEFAULT.TIMESHEET
GROUP BY DRIVERID
) T
ON (D.DRIVERID = T.DRIVERID);
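The effect of aggregating before joining can be checked with a small sketch, again using Python's built-in sqlite3 as a stand-in for Hive. The sample rows and the week column are hypothetical; the query mirrors the one on the slide.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE drivers   (driverid INTEGER, name TEXT);
    CREATE TABLE timesheet (driverid INTEGER, week INTEGER,
                            hours_logged INTEGER, miles_logged INTEGER);
    -- Hypothetical sample rows, not the HDP tutorial data.
    INSERT INTO drivers   VALUES (10, 'Alice'), (11, 'Bob');
    INSERT INTO timesheet VALUES (10, 1, 40, 2000), (10, 2, 45, 2300),
                                 (11, 1, 50, 2600);
""")

# The subquery collapses timesheet to one row per driver before the join.
totals = conn.execute("""
    SELECT d.driverid, d.name, t.total_hours, t.total_miles
    FROM drivers d
    JOIN (
        SELECT driverid,
               SUM(hours_logged) AS total_hours,
               SUM(miles_logged) AS total_miles
        FROM timesheet
        GROUP BY driverid
    ) t
      ON d.driverid = t.driverid
    ORDER BY d.driverid
""").fetchall()

for row in totals:
    print(row)
```

Because the subquery groups by DRIVERID first, the outer join returns exactly one row per driver instead of one row per timesheet entry.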
73. • Compression format
– high-level compression (one of NONE, ZLIB, SNAPPY)
• Create a table
– create table Addresses ( name string, street string, city string, state string, zip int ) stored as orc tblproperties ("orc.compress"="NONE");
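When the DDL is generated from a script, the codec value can be validated before the statement reaches Hive. The following is a minimal sketch; the helper name orc_table_ddl is hypothetical, and the allowed codec list is limited to the three values named on the slide.

```python
# Codecs listed on the slide for the "orc.compress" table property.
ORC_CODECS = ("NONE", "ZLIB", "SNAPPY")


def orc_table_ddl(table, columns, codec):
    """Return a CREATE TABLE statement for an ORC table (hypothetical helper)."""
    codec = codec.upper()
    if codec not in ORC_CODECS:
        raise ValueError(
            "orc.compress must be one of %s, got %r" % (", ".join(ORC_CODECS), codec)
        )
    cols = ", ".join("%s %s" % (name, hive_type) for name, hive_type in columns)
    return 'CREATE TABLE %s (%s) STORED AS ORC TBLPROPERTIES ("orc.compress"="%s")' % (
        table, cols, codec)


ddl = orc_table_ddl(
    "Addresses",
    [("name", "string"), ("street", "string"), ("city", "string"),
     ("state", "string"), ("zip", "int")],
    codec="snappy",
)
print(ddl)
```

Normalizing to upper case keeps the property value in the same form as the slide's example.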
• Knowledge level
Further reading (2)
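89. Data Analysis
• Customer selection
– Find users who like perfume and fragrance (香水香氛)
– Find users who follow both nutritional supplements (營養補給) and business and finance (商業理財)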
SELECT NAME FROM PMART.M_WEBLOG
WHERE CAT3 LIKE '%香水香氛%';
SELECT NAME FROM PMART.M_WEBLOG
WHERE CAT2 LIKE '%營養補給%'
AND CAT3 LIKE '%商業理財%';
Inspect your own data first, then decide on the query conditions.
Or query with beeline (for non-Chinese conditions):
sudo -u hive beeline -u "jdbc:hive2://sandbox.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"
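If the beeline call needs to be scripted, the command line can be assembled in Python. This is a sketch only: the helper name beeline_command is hypothetical, the example query is an assumption, and it relies on beeline's -e option to pass a single statement.

```python
import shlex


def beeline_command(zk_quorum, query, namespace="hiveserver2", run_as="hive"):
    """Build the argument list for a beeline call via ZooKeeper discovery."""
    url = ("jdbc:hive2://%s/;serviceDiscoveryMode=zooKeeper;"
           "zooKeeperNamespace=%s" % (zk_quorum, namespace))
    return ["sudo", "-u", run_as, "beeline", "-u", url, "-e", query]


cmd = beeline_command(
    "sandbox.hortonworks.com:2181",          # host and port taken from the slide
    "SELECT COUNT(*) FROM PMART.M_WEBLOG;",  # example query, ASCII only
)

# Print a shell-safe version of the command for copy and paste.
print(" ".join(shlex.quote(part) for part in cmd))

# To actually run it on the sandbox: subprocess.call(cmd)
```

Keeping the command as a list avoids shell-quoting problems with the semicolons in the JDBC URL.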
92. • Go to the inner layer and access Hive data with Python
Further reading (4)
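import pyhs2
import woothee

# Connect to HiveServer2 (pyhs2 targets Python 2).
conn = pyhs2.connect(host='localhost', port=10000, authMechanism='PLAIN',
                     user='hive', password='', database='pmart')

# Print the sixth column of the first ten rows, decoded as UTF-8.
with conn.cursor() as cur:
    cur.execute("select * from m_weblog limit 10")
    for i in cur.fetch():
        print(i[5].decode('utf-8'))

# Parse the user-agent string; the query returns one column, so use index 0.
with conn.cursor() as cur:
    cur.execute("select ua from m_weblog limit 10")
    for i in cur.fetch():
        print(woothee.parse(i[0]))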
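Once the user-agent strings are parsed, the results can be summarized. The sketch below assumes each parsed result is a dict with keys such as "name", "category" and "os", which is the shape woothee.parse() returns; the sample dicts are hand-written stand-ins, so the code runs without a Hive connection or the woothee package.

```python
from collections import Counter

# Hand-written stand-ins shaped like woothee.parse() results (hypothetical data).
parsed = [
    {"name": "Chrome", "category": "pc", "os": "Windows 10"},
    {"name": "Safari", "category": "smartphone", "os": "iPhone"},
    {"name": "Chrome", "category": "smartphone", "os": "Android"},
    {"name": "Googlebot", "category": "crawler", "os": "UNKNOWN"},
]


def count_by(parsed_rows, key):
    """Count parsed user agents by one field, e.g. 'category' or 'name'."""
    return Counter(row.get(key, "UNKNOWN") for row in parsed_rows)


by_category = count_by(parsed, "category")
by_browser = count_by(parsed, "name")

print(by_category.most_common())
print(by_browser.most_common())
```

In the real pipeline, the parsed list would be built from the rows returned by the cursor instead of the literals above.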
95. Data Visualization (2)
• Total browsing records per user
• Popular product categories
%jdbc(hive)
SELECT NAME, COUNT(*) AS CNT FROM PMART.M_WEBLOG
GROUP BY NAME;
%jdbc(hive)
SELECT CAT2, COUNT(*) AS CNT FROM PMART.M_WEBLOG
GROUP BY CAT2;
96. Data Visualization (3)
• Browsing count per user
• Counts for specific product types
%jdbc(hive)
SELECT DATE_SUB(DT,0) AS DT, NAME, COUNT(*) AS CNT
FROM PMART.M_WEBLOG
GROUP BY DATE_SUB(DT,0), NAME
ORDER BY CNT DESC;
%jdbc(hive)
SELECT DT, TYPE, COUNT(*) FROM (
SELECT CASE
WHEN (CAT3 LIKE '%男%') THEN '男'
WHEN (CAT3 LIKE '%女%') THEN '女'
ELSE '無' END
AS TYPE, DATE_SUB(DT,0) AS DT
FROM PMART.M_WEBLOG
)A GROUP BY TYPE, DT;
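The CASE expression evaluates its branches in order, so a category containing both 男 and 女 is counted as 男. The sketch below mirrors that logic in Python on hypothetical (DT, CAT3) rows; the function name gender_type and the sample values are assumptions for illustration.

```python
from collections import Counter


def gender_type(cat3):
    """Mirror the CASE expression: the first matching branch wins."""
    if cat3 is None:        # in SQL, NULL LIKE ... is not true, so ELSE applies
        return "無"
    if "男" in cat3:
        return "男"
    if "女" in cat3:
        return "女"
    return "無"


# Hypothetical (dt, cat3) rows, not taken from PMART.M_WEBLOG.
rows = [
    ("2017-06-01", "男鞋"),
    ("2017-06-01", "女裝"),
    ("2017-06-01", "男女對錶"),
    ("2017-06-02", "香水香氛"),
    ("2017-06-02", None),
]

# Equivalent of GROUP BY TYPE, DT with COUNT(*).
counts = Counter((dt, gender_type(cat3)) for dt, cat3 in rows)

for (dt, type_), cnt in sorted(counts.items()):
    print(dt, type_, cnt)
```

If mixed categories should be reported separately, a dedicated branch for them would have to come before the 男 branch in the CASE expression.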
99. • Register a UDF
– create function test as 'com.example.hive.udf.LowerCase' using jar 'hdfs:///user/admin/jars/hiveUDF_fat.jar';
• Drop a registered UDF (takes effect only after you log out)
– drop function test;
– !quit
• List the registered JARs
– list jar;
Further reading (2)
100. • Run the uppercase conversion of English names
– SELECT NAME FROM PDATA.PROFILE LIMIT 10;
• Result after applying the UDF
– SELECT TEST(NAME) AS NAME FROM PDATA.PROFILE LIMIT 10;
Further reading (3)
102. Common tools for hot and cold data
SparkSQL       Good for iterative processing and for accessing existing Hive tables, provided the results fit in memory
HAWQ           Good for traditional BI-style queries, star schemas, and OLAP cubes
Hive (LLAP)    Good for petabyte-scale data mixed with smaller tables that require sub-second queries
Phoenix        A good way to interact with HBase tables; good with time series; good indexing
Drill, Presto  Query-federation-like capabilities but limited SQL syntax; performance varies quite a bit
118. HDP Installation and Hands-on (12)
• Run the Ambari Server setup
– ambari-server setup
[root@master tmp]# ambari-server setup
Using python /usr/bin/python2
Setup ambari-server
Checking SELinux...
SELinux status is 'enabled'
SELinux mode is 'permissive'
WARNING: SELinux is set to 'permissive' mode and temporarily disabled.
OK to continue [y/n] (y)? y
Customize user account for ambari-server daemon [y/n] (n)? n
Adjusting ambari-server permissions and ownership...
Checking firewall status...
Checking JDK...
[1] Oracle JDK 1.8 + Java Cryptography Extension (JCE) Policy Files 8
[2] Oracle JDK 1.7 + Java Cryptography Extension (JCE) Policy Files 7
[3] Custom JDK
====================================================================
Enter choice (1): 3
WARNING: JDK must be installed on all hosts and JAVA_HOME must be valid on all hosts.
WARNING: JCE Policy files are required for configuring Kerberos security. If you plan to use Kerberos,please make sure JCE Unlimited Strength
Jurisdiction Policy Files are valid on all hosts.
Path to JAVA_HOME: /usr/java/java
Validating JDK on Ambari Server...done.
Completing setup...
Configuring database...
Enter advanced database configuration [y/n] (n)? Y
Configuring database...
==============================================================================
Choose one of the following options:
[1] - PostgreSQL (Embedded)
[2] - Oracle
[3] - MySQL / MariaDB
[4] - PostgreSQL
[5] - Microsoft SQL Server (Tech Preview)
[6] - SQL Anywhere
[7] - BDB
==============================================================================
Enter choice (1): 1
Database admin user (postgres):
Database name (ambari):
Postgres schema (ambari):
Username (ambari):
Enter Database Password (bigdata):
Default properties detected. Using built-in database.
Configuring ambari database...
Checking PostgreSQL...
Running initdb: This may take up to a minute.
About to start PostgreSQL
Configuring local database...
Configuring PostgreSQL...
Backup for pg_hba found, reconfiguration not required
Creating schema and user...
done.
Creating tables...
done.
Extracting system views...
..........ambari-admin-2.5.0.3.7.jar
Adjusting ambari-server permissions and ownership...
Ambari Server 'setup' completed successfully.
[root@master tmp]#
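After setup completes and the server has been started with ambari-server start, a quick reachability check can confirm that the web UI is listening. The sketch below is an assumption-laden helper, not part of Ambari: it only tests whether a TCP port accepts connections, and it assumes the web UI uses port 8080 on the local host.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Assumption: Ambari web UI on port 8080 of the local host.
if port_open("localhost", 8080):
    print("Ambari web UI port is reachable")
else:
    print("Nothing is listening on port 8080 yet")
```

A successful connection only shows that something is listening on the port; logging in through the browser is still the real test.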
168. Elasticsearch Features
• Near real time (NRT)
• Document-based NoSQL
• RESTful API
• Fast installation
• Easy to build a cluster
[Figure: Elasticsearch traits: speed, scalability, ease of use, NRT, document-based, RESTful, cluster]
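To make the RESTful API point concrete, here is a minimal sketch that builds a search request with Python's standard library without sending it. The node address (localhost on port 9200), the index name weblog and the field name are assumptions for illustration.

```python
import json
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumption: a local node on the default HTTP port
INDEX = "weblog"                  # hypothetical index name


def search_request(index, field, text, size=10):
    """Build (but do not send) a match query against the _search endpoint."""
    body = {"query": {"match": {field: text}}, "size": size}
    return Request(
        url="%s/%s/_search" % (ES_URL, index),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = search_request(INDEX, "name", "alice")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To send it against a running cluster: urllib.request.urlopen(req)
```

Every operation follows the same pattern: an HTTP verb, a URL that names the index and endpoint, and a JSON body.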
211. Kibana
• A very good tool for getting to know your data
• Debugging tool for the ES RESTful API
• Dashboards
[Figure: Kibana features: data discovery, dashboard, dev tools, data visualization, index management, time series, embedded graph]
250. Installing Metricbeat
⚫ Metricbeat helps you monitor your servers and the services they host by collecting metrics from the operating system and services.
⚫ It gathers metrics from servers and services, for example CPU usage.
275. Conclusion
Item           Description
Elasticsearch  Near-real-time search engine; RESTful API
Logstash       Collects, parses, and transforms data
Kibana         Data discovery; data visualization; dashboards; time series (Timelion)