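Big Data and Libraries

Embracing Big Data
Hao-Ren Ke (柯皓仁)
Graduate Institute of Library and Information Studies, National Taiwan Normal University
海量資料 = 巨量資料 = 大數據 = Big Data (three Chinese renderings of the same term)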

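Course Outline
How big is the data, really?
What exactly is big data?
Applications and anecdotes (or true stories) of big data
Application examples one, two, and three
A little bit of technology
Conclusion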

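How Much Data Do You Generate Every Day?
Play some Candy Crush Saga
Browse Facebook and check in
Send a message on Line
Read an e-newsletter
Shop on 17Life or IherGo
Call friends and family on your mobile phone
Ride the MRT or the high-speed rail, or drive on the freeway
Buy something at the supermarket
Happen to look up… ah, there is a surveillance camera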

Exploding Data Volume
http://aadamov.wordpress.com/2012/03/17/the-explosive-growth-in-the-volume-of-digital-data-will-demand-more-it-professsional/

How Many B(yte) Units Do You Know?
http://en.wikipedia.org/wiki/Zetta-

Exploding Data Volume (cont.)
https://www-304.ibm.com/connections/blogs/government/resource/BLOGS_UPLOADED_IMAGES/NNECDataGrowth.jpg

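Exploding Data Volume (cont.)
According to the IDC Digital Universe Study, by 2020:
Digital data will reach 35 ZB (up from 1.8 ZB in 2011)
The amount of information handled by organizations will grow 50-fold
The number of containers holding that information will grow 75-fold
The number of (physical and virtual) servers will grow 10-fold
http://aadamov.wordpress.com/2012/03/17/the-explosive-growth-in-the-volume-of-digital-data-will-demand-more-it-professsional/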
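To get a feel for the two IDC figures quoted above (1.8 ZB in 2011, 35 ZB in 2020), the short sketch below derives the annual growth rate they imply. It assumes steady compound growth between the two years, which is a simplification of ours and not a claim made by the study; the variable names are illustrative.

```python
# Implied compound annual growth rate from the two figures quoted above.
start_zb = 1.8          # digital data volume in 2011 (ZB)
end_zb = 35.0           # projected volume in 2020 (ZB)
years = 2020 - 2011     # number of yearly compounding steps (assumption)

growth_factor = end_zb / start_zb
cagr = growth_factor ** (1 / years) - 1

print(f"total growth: {growth_factor:.1f}x over {years} years")
print(f"implied annual growth: {cagr:.1%}")
```

Under that assumption the data volume grows by a bit under 40% per year, that is, it roughly doubles every two years.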

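Why Is Data Volume Surging?
Instrumented
Everything can be sensed
Interconnected
Sensing generates large volumes of data that must be sent to the back end for processing
Intelligent
Useful information is extracted from huge, messy data to help people make decisions
(胡世忠，2013)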

What Exactly Is Big Data?

Definition of Big Data
Dumbill (2013)
Big data is data that exceeds the processing
capacity of conventional database systems. The
data is too big, moves too fast, or doesn’t fit the
strictures of your database architectures. To gain
value from this data, you must choose an
alternative way to process it.
The key to big data analytics is how to detect patterns, discern the truth, and predict the answers to complex questions from big data

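Challenges Posed by Big Data
Big data confronts organizations with large, fast-growing sources of data and information, and with a complex range of analysis and usage problems:
A computing infrastructure is needed that can ingest, validate, and analyze large volumes of data
Mixed structured and unstructured data from multiple sources must be assessed
Unpredictable content with no apparent schema or structure must be handled
Data must be collected, analyzed, and acted on in real time (or near real time)
Villars, Olofson & Eastwood (2011)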

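We Are Now Dealing with Different Types of Data…
Structured data: relational databases (transactions, student records)
Semi-structured data: e-mail, blog posts
Unstructured data: text, images, audio, video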

Drivers of Big Data Analytics
http://shhrota.com/2012/01/02/the-big-in-big-data/

The Four Characteristics of Big Data
http://www.datasciencecentral.com/profiles/blogs/data-veracity

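Causes of Uncertain Data Reliability
Deliberate deception
Unintentional misreporting
Timing errors
Aging sensors
Imprecise manufacturing processes
…
(胡世忠，2013)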

Anecdotes and Applications of Big Data

Google and the Flu
http://www.google.org/flutrends/

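Anecdotes and Applications of Big Data
Walmart
Diapers, beer, and young fathers
Flashlights, batteries, and pastries
Professional sports
Moneyball
Analyzing player data from game video and audio to support training, injury prevention, and treatment management
UN Global Pulse: detecting mood from blog, forum, and social network posts and combining it with public transit ridership to forecast unemployment trends (http://www.unglobalpulse.org/)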

Mood Detection, Public Transit Ridership, and the Unemployment Rate
http://www.slideshare.net/unglobalpulse/globalpulsesasmethodspaper2011?ref=http://www.unglobalpulse.org/projects/can-social-media-mining-add-depth-unemployment-statistics

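More Anecdotes and Applications of Big Data (cont.)
Plots that show up in the movies
The story of Maury, the "Pathfinder of the Seas"
Charting the seas from ships' logs (standardized forms recording wind, currents, and weather at a specific time and place), messages in bottles…
Datafication: rendering a phenomenon in a quantified format so that it can be tabulated and analyzed (measuring, recording) (quantifying, standardizing, collecting)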

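UPS fits every vehicle with sensors, a radio, and GPS
It predicts engine failure (by measuring and monitoring parts), tracks vehicle location, detects delays and employee whereabouts, and optimizes delivery routes (for example, reducing the number of intersections a route must cross, which saves energy and cuts carbon emissions)
Mobile phones (location data) and real-time traffic reports
Sentiment analysis (sentiment detection)
What if the floors and walls of your home were covered with a touch-sensitive material…
B&N analyzed data from its Nook e-readers: readers tend to give up partway through long nonfiction books, which led to short-form Nook titles
Coursera: what if students keep rewatching a particular lecture…
Counter-terrorism, disease trend detection, ending poverty, saving the planet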

Google Books Ngram Viewer
The spirit of the times and the tone of books
(https://www.facebook.com/photo.php?fbid=679479272077635&set=a.261088210583412.84760.261083417250558&type=1&theater)
http://books.google.com/ngrams

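Fires in Illegally Converted Housing in New York City
Housing records, building type and age
Foreclosure status, property tax arrears, abnormal utility bills
Ambulance dispatch records, crime rates, rodent complaints
Newly laid brickwork
Before big data was used, only 13% of complaint-driven inspections led to a vacate order; afterward the figure rose to 70%
(麥爾筍伯格、庫基耶，2013)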

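What the Anecdotes and Applications of Big Data Tell Us…
Have a goal in mind
Sample = population
More data (even if messy) matters more than better data
The importance of datafication
Causation and correlation
(麥爾筍伯格、庫基耶，2013)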

Applications of Big Data Analytics
http://www.emc.com/leadership/digital-universe/iview/big-data-2020.htm

Potential Use Cases for Big Data Analytics
http://practicalanalytics.wordpress.com/2011/12/12/big-data-analytics-use-cases/

Application Example 1:
Time Series Clustering of Social Tags
Chen, Tseng, Ke & Sun (2011)

Process and Steps
Data collection
Preprocessing
Time series representation
Time series clustering
Cluster recommendation

Data Collection
Data source:
Collection method:

Preprocessing
Word segmentation and part-of-speech tagging
Stop word removal
Feature selection
Weight calculation

Stop Word Removal
Original text → text after CKIP processing
Keep nouns (N) and verbs (V)

Feature Selection
• Remove terms that occur too often or too rarely in the corpus
• Compute the log-likelihood ratio (LLR) of each term in the corpus and take the top 50 terms as the features of the article.
Example article: the touching story behind an Olympic weightlifting gold medalist from Austria.
Extracted features (original Chinese terms):
金牌德國背叛妻子機器人奧運會舉重奪得動人相片來到
觀眾獻給出場雅典心願亞軍冠軍奧地利選手背後禮物舉
起裏服務站訓練家族維也納緣起電視機車禍參觀拿到奧運
吻力量感動淚水光明書籤頒獎離開感人成績窩口袋大全運動
員這時賽

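Weight Calculation
Term weights are expressed as TF-IDF
   | W1 | W2 | W3 | W4 | W5 | W6 | W7 | W8 | W9 | W10
D1 | 5 | 6 | 7 | 0 | 0 | 10 | 8 | 0 | 0 | 0
D2 | 0 | 12 | 0 | 13 | 7 | 0 | 0 | 6 | 10 | 0
D3 | 0 | 0 | 12 | 14 | 0 | 0 | 15 | 8 | 10 | 11
On 1/1, tag a was applied to documents D1 and D2
tag a@1/1 = [5,18,7,13,7,10,8,6,10,0]
On 1/2, tag a was applied to documents D2 and D3
tag a@1/2 = [0,12,12,27,7,0,15,14,20,11]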
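The weight-calculation slide builds a tag's daily vector by adding up the term-weight vectors of the documents the tag was applied to that day. The sketch below reproduces the two tag vectors from the slide. The column positions of the document weights are an assumption: the slide's table lost its empty cells in extraction, so the weights are placed where the sums match the two published tag vectors. The function name `tag_vector` is ours.

```python
# Document term-weight vectors over W1..W10.
# Assumption: empty cells of the slide's table are zeros, positioned so that
# the sums below match the tag vectors shown on the slide.
docs = {
    "D1": [5, 6, 7, 0, 0, 10, 8, 0, 0, 0],
    "D2": [0, 12, 0, 13, 7, 0, 0, 6, 10, 0],
    "D3": [0, 0, 12, 14, 0, 0, 15, 8, 10, 11],
}

def tag_vector(doc_ids):
    """Sum the weight vectors of all documents a tag labeled on one day."""
    return [sum(column) for column in zip(*(docs[d] for d in doc_ids))]

tag_a_day1 = tag_vector(["D1", "D2"])   # tag a on 1/1
tag_a_day2 = tag_vector(["D2", "D3"])   # tag a on 1/2
print(tag_a_day1)
print(tag_a_day2)
```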

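Time Series Representation
Each time interval covers two weeks (15 days), so each interval contains 14 time vectors
[Figure: the daily tag vectors tag_{1,1}, tag_{1,2}, …, tag_{1,14} laid out along the time axis]
$$
V_{1,1} =
\begin{bmatrix}
v_{1,1} = (t_1,\ \mathit{tag}_{1,2} - \mathit{tag}_{1,1}) \\
v_{1,2} = (t_2,\ \mathit{tag}_{1,3} - \mathit{tag}_{1,2}) \\
\vdots \\
v_{1,14} = (t_{14},\ \mathit{tag}_{1,15} - \mathit{tag}_{1,14})
\end{bmatrix}
$$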
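The time vectors of an interval are the day-to-day changes of a tag's vector, each paired with its time index. A minimal sketch of that differencing step follows; the function name `difference_vectors` and the toy numbers are ours, not the authors'.

```python
def difference_vectors(daily_tag_vectors):
    """Return [(t, tag[t+1] - tag[t]), ...] for consecutive days in one interval."""
    result = []
    for t in range(len(daily_tag_vectors) - 1):
        today = daily_tag_vectors[t]
        tomorrow = daily_tag_vectors[t + 1]
        change = [b - a for a, b in zip(today, tomorrow)]
        result.append((t + 1, change))  # time index is 1-based, as in the formula
    return result

# Hypothetical 3-day interval over a 4-term vocabulary (toy numbers).
interval = [
    [5, 0, 2, 1],
    [7, 1, 2, 0],
    [7, 4, 0, 0],
]
V = difference_vectors(interval)
print(V)
```

With 15 daily vectors in an interval, the same function yields the 14 time vectors the slide mentions.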

Computing Time Series Similarity
[Figure: the time series of tag_i, with points (v_{i,1}, t_1), …, (v_{i,6}, t_6), set against the time series of tag_j, with points (v_{j,1}, t_1), …, (v_{j,6}, t_6)]
$$
\mathrm{sim}(\mathit{tag}_i, \mathit{tag}_j) = \frac{\mathrm{similarity}(v_{i,1}, v_{j,1}) + \cdots + \mathrm{similarity}(v_{i,N}, v_{j,N})}{N}
$$
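The similarity of two tags is the average of the per-time-step similarities of their time vectors. The slide does not say which vector similarity is used at each step, so the sketch below plugs in cosine similarity as an assumption; `cosine` and `series_similarity` are illustrative names.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; defined as 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def series_similarity(series_i, series_j):
    """Average the per-time-step similarity over N aligned time steps."""
    if len(series_i) != len(series_j) or not series_i:
        raise ValueError("series must be non-empty and the same length")
    total = sum(cosine(vi, vj) for vi, vj in zip(series_i, series_j))
    return total / len(series_i)

# Hypothetical change vectors for two tags over N = 2 time steps.
tag_i = [[1, 0], [0, 2]]
tag_j = [[2, 0], [3, 0]]
print(series_similarity(tag_i, tag_j))
```

In this toy case the first step matches perfectly and the second is orthogonal, so the average is one half.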

Computing Time Series Similarity
Well… sometimes it is not that easy

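Time Series Clustering
Agglomerative hierarchical clustering, with average linkage used to compute the distance between clusters
[Figure: Cluster A and Cluster B]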
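Average-linkage agglomerative clustering can be sketched in a few lines. The version below is a plain, unoptimized illustration: it repeatedly merges the two clusters with the smallest mean pairwise distance and stops at a distance threshold. The threshold-based stopping rule and the toy one-dimensional data are assumptions, since the slide does not say how the dendrogram was cut.

```python
def average_linkage(cluster_a, cluster_b, dist):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(dist(a, b) for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def agglomerate(items, dist, threshold):
    """Merge the two closest clusters until none are closer than threshold."""
    clusters = [[item] for item in items]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], dist)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:       # assumed stopping rule
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy one-dimensional points standing in for tags (hypothetical data).
points = [1.0, 1.2, 5.0, 5.1, 9.0]
result = agglomerate(points, lambda a, b: abs(a - b), threshold=1.0)
print(result)
```

For tags, `dist` would be derived from the time series similarity, for example one minus the similarity.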

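Cluster Recommendation
Once clustering is done, tags in the same cluster share:
A similar time series trend → the same vocabulary → the same concept
This study distinguishes
Recommendations within the same time interval
Recommendations across different time intervals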

Things Really Are Complicated…
Page title | Tags used
I'm Vlog-涼麵不是簡簡單單就可以吃的 | portnoy 涼麵鄭龜
I'm Vlog-Manny Ramirez 耍寶集 | manny ramirez 紅襪
I'm Vlog-【失敗的教育】遼寧少女痛罵地震災民〈繁體字幕〉 | 四川地震
(涼麵 = cold noodles; 紅襪 = Red Sox; 四川地震 = Sichuan earthquake)
Tag | Tag | Similarity
四川地震 | manny | 0.0366 > 0.00223
portnoy | 涼麵 | 0.0357 > 0.00223
manny | portnoy | 0.0251 > 0.00223
紅襪 | 涼麵 | 0.0251 > 0.00223

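Things Really Are Complicated… (cont.)
I'm Vlog is a video-sharing site.
When pages from this site are collected, the content inside the videos cannot be captured. What gets collected instead:
Video titles
Users' descriptions
Text from the page structure
Because these shared words end up in the tag vectors, tags become related even when users chose different tags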

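Summary
Data sources
Web pages, social networking sites (social bookmarking sites)
Data analysis
Web content analysis, social tag clustering, time series analysis
Issues
Noisiness of web content
Sample? Population?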

Application Example 2:
Analyzing User Behavior for Electronic Resources
Ke, Kwakkelaar, Tai & Chen (2002)

Data Source: Web Server Log
Transaction log
NCSA-defined CLF (Common Log Format) logged by WWW servers
IP address, date and time, requests, and bytes returned
Proprietary logs
Example of NCSA-defined CLF Requests
"GET /cgi-bin/search.pl?collection=journals&search_field=xml&GetSearchResults=Search&fields=Any HTTP/1.1"
"GET /cgi-bin/sciserv.pl?collection=journals&journal=01429418&issue=v18i0003&article=181_tpocfc&form=pdf&file=file.pdf HTTP/1.0"
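A transaction-log study starts by turning each CLF line into fields. The sketch below parses one line with a regular expression and splits the request's query string into parameters. The host, timestamp, status, and byte count in the sample line are hypothetical; only the request string follows the example on the slide. `parse_clf` is an illustrative name, not part of any library.

```python
import re
from urllib.parse import urlsplit, parse_qs

# CLF layout: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf(line):
    """Parse one CLF line into a dict, or return None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    parts = entry["request"].split()  # expected: method, URL, protocol
    if len(parts) != 3:
        return None
    method, url, protocol = parts
    split_url = urlsplit(url)
    entry["method"] = method
    entry["path"] = split_url.path
    entry["params"] = {k: v[0] for k, v in parse_qs(split_url.query).items()}
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

# Hypothetical log line built around the slide's second request example.
line = ('192.0.2.10 - - [12/Mar/2001:14:05:31 +0800] '
        '"GET /cgi-bin/sciserv.pl?collection=journals&journal=01429418'
        '&issue=v18i0003&article=181_tpocfc&form=pdf&file=file.pdf HTTP/1.0" '
        '200 48213')
entry = parse_clf(line)
print(entry["host"], entry["path"], entry["params"]["journal"], entry["size"])
```

Once lines are parsed this way, a request with `form=pdf` can be counted as a full-text download and the `journal` parameter identifies the title.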

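Number of Repeat Visits
[Pie chart: distribution of the number of visits; categories 1, 2, 3, 4, 5, 6, 7, 8, 9, >= 10; slice values 32%, 16%, 9%, 6%, 5%, 3%, 3%, 3%, 2%, 21%]
Be Careful – proxy/cache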

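Session Length
[Pie chart: distribution of session length; categories 0, 1, 2, 3, 4, 10, 20, 30, 40, 50, 60, 90, > 90; slice values 31%, 9%, 6%, 4%, 4%, 13%, 15%, 7%, 4%, 2%, 1%, 2%, 2%]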

Reading the Meaning Behind the Numbers
Relatively few repeated users?
Users mistakenly visit an inappropriate electronic resource
Newcomers
Access E-journal systems in a very focused way, only accessing the system when they know exactly which article they are interested in
Visit an electronic resource via inter-linking
Proxy servers/cache servers/shared PC
Short session length
Need further investigation into information seeking behavior of users
Browsing? Query?

Server Load
[Bar chart: number of log entries by hour of day (0 to 23 o'clock); y-axis from 0 to 450,000]
[Bar chart: number of log entries by day of week]
Sun | Mon | Tue | Wed | Thu | Fri | Sat
294,096 | 761,068 | 799,275 | 830,933 | 830,657 | 800,239 | 423,694
Best schedule for system maintenance
Better performance during light-loading period
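Load profiles like these come from bucketing log timestamps by hour and by weekday. A minimal sketch follows, assuming timestamps in the CLF form `dd/Mon/yyyy:HH:MM:SS +zzzz` and an English locale for month names; the sample timestamps are made up and `load_profile` is an illustrative name.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def load_profile(timestamps):
    """Count log entries by hour of day and by weekday name."""
    by_hour = Counter()
    by_weekday = Counter()
    for raw in timestamps:
        moment = datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")
        by_hour[moment.hour] += 1
        by_weekday[WEEKDAYS[moment.weekday()]] += 1
    return by_hour, by_weekday

# Hypothetical CLF timestamps (toy data, not from the study).
stamps = [
    "11/Mar/2001:03:15:00 +0800",  # a Sunday
    "12/Mar/2001:14:05:31 +0800",  # a Monday
    "12/Mar/2001:14:47:02 +0800",
    "13/Mar/2001:09:30:10 +0800",  # a Tuesday
]
by_hour, by_weekday = load_profile(stamps)

# The least-loaded hour is a candidate maintenance window.
quietest_hour = min(range(24), key=lambda h: by_hour[h])
print(by_hour[14], by_weekday["Mon"], quietest_hour)
```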

User Actions in the System
[Bar chart: number of log entries by type of request; y-axis from 0 to 1,600,000]
Type of request | Number of log entries
CR | 1,750
HP | 1,977
PS | 2,146
ES | 11,768
QQ | 12,391
TF | 81,492
SS | 113,036
HM | 372,581
AR | 416,573
JL | 485,974
AL | 545,916
IL | 578,504
SR | 616,737
PF | 1,499,117
Chart annotations: Online help, Copyright Disclaimer, Submit Query, Full text download, Browsing

Small fraction of accesses is for online help
Take actions to …
Increase use of online help
Improve on-line help’s quality and accessibility
Help users to know specific features of systems
Proactive and context sensitive mechanisms –
Today’s Tip
Small fraction of accesses is for copyright
disclaimer
Combined with the assumption that usage terms
and conditions may not always be strictly observed
Reinforces the notion that libraries have to stress
the significance of fair and legal use of electronic
resources

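Number of Queries per User (Across Sessions)
[Pie chart: distribution of the number of queries per user; categories 0, 1, 2, 3, 4, 5, 10, 20, 30, > 30; slice values 47%, 8%, 6%, 4%, 4%, 3%, 9%, 8%, 4%, 7%]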

About half of the valid IP addresses do not
issue queries
Reasonable for E-journal systems
About 10% of users query more than 20 times
Do they make use of E-journal systems from an
A&I database point of view?
Librarians have to clarify the different roles of A&I
databases and E-journal systems
Reflects the significance of linking A&I databases and E-
journal systems

Query Term Rank-Frequency Plot
[Plot: frequency percentage (0.0000% to 0.0160%) against rank]

The Bright and Dark Sides of Data Analysis
$$\text{Ratio} = \frac{\text{Download Frequency of Non-Subscribed Journal Articles}}{\text{Download Frequency of All Journal Articles}}$$
(Left Table)
$$\text{Ratio} = \frac{\text{Non-Subscribed Journal Titles Downloaded}}{\text{All Journal Titles Downloaded}}$$
(Right Table)
Subscriber Based

The Bright and Dark Sides of Data Analysis (cont.)
$$\text{Ratio} = \frac{\text{Articles Downloaded by Non-Subscribing Customers}}{\text{Articles Downloaded by All Customers}}$$
Journal Based

Data Mining, User Behavior, and Recommendation

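Summary
Data sources
Web Server Log
Basic journal data
Data analysis
Repeat visits, server load, session length, number of queries, query keywords, downloaded articles
Reading the meaning hidden behind the numbers
The bright and dark sides of data analysis
Real-time and batch analysis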

Application Example 3: Question Answering
(Dumais, Banko, Brill, Lin & Ng, 2002)

Question Answering
Retrieve small snippets of text that contain the actual answer to a question rather than the document lists traditionally returned by text retrieval systems
Find the answer to "What is the highest mountain in Taiwan?" (台灣最高的山峰是什麼？)
Search Engine:
Taiwan, highest, mountain → documents related to the keywords
Question Answering System:
What is the highest mountain in Taiwan? → Yushan (玉山)
A QA example that showcases big data: Web Question Answering: Is More Always Better? (Dumais, Banko, Brill, Lin & Ng, 2002)

Introduction
Focus on factoid questions
Motivated by observations in NLP – significant
improvements in accuracy can be attained
simply by increasing the amount of data used
for learning
Ah… Web has tremendous amount of data
Instead of focusing on linguistic resources,
such as part-of-speech tagging, syntactic
parsing, semantic relations, named entity
extraction, dictionaries, WordNet, this paper
focuses on DATA (Web Data)

Exploiting Redundancy for QA
Redundancy: multiple, differently phrased,
answer occurrences
Enable Simple Query Rewrites
It is difficult to extract the correct answer from a
small corpus for a question, if the corpus contains
few documents for that question
The greater the number of information sources we
can draw from, the easier the task of rewriting the
question becomes, since the answer is more likely
to be expressed in different manners
“Who killed Abraham Lincoln?”
“John Wilkes Booth altered history with a bullet. He will
forever be known as the man who ended Abraham
Lincoln’s life”

Exploiting Redundancy for QA
(Cont.)
Facilitates Answer Mining
Even when no obvious answer strings can be
found in the text, redundancy can improve the
efficacy of question answering
“How many times did Bjorn Borg win Wimbledon?”

Rewrite Example
For each query, also generate a final rewrite
which is a backoff to a simple ANDing of non-
stop words in the query
Rewrite example: “Who created the character
of Scrooge?”

Mine N-Grams
From the page summaries returned by the
search engine, n-grams are mined.
The returned summaries contain the query terms,
usually with a few words of surrounding context.
In some cases, this surrounding context has
truncated the answer string, which may negatively
impact results. (Hope not harmful)
The summary text is then processed to
retrieve only strings to the left or right of the
query string, as specified in the rewrite triple.

Mine N-Grams (Cont.)
1-, 2-, and 3-grams are extracted from the
summaries.
The final score for an n-gram is based on the
rewrite rules that generated it and the number
of unique summaries in which it occurred
When searching for candidate answers, we
enforce the constraint that at most one
stopword is permitted to appear in any
potential n-gram answers
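The mining step can be sketched as follows: extract 1-, 2-, and 3-grams from each summary, discard candidates containing more than one stopword, and score each n-gram by the number of distinct summaries it appears in. This is a simplified illustration, not the paper's implementation: it uses one fixed rewrite weight for all summaries, a toy stopword list, and whitespace tokenization, all of which are assumptions.

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "of", "in", "was", "by", "is", "to"}  # toy list

def ngrams(tokens, n):
    """All contiguous n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def mine_ngrams(summaries, rewrite_weight=1):
    """Score each 1-, 2-, 3-gram by weight times number of distinct summaries."""
    seen_in = defaultdict(set)
    for index, summary in enumerate(summaries):
        tokens = summary.lower().split()
        for n in (1, 2, 3):
            for gram in ngrams(tokens, n):
                # Allow at most one stopword in a candidate answer.
                if sum(token in STOPWORDS for token in gram) <= 1:
                    seen_in[gram].add(index)
    return {gram: rewrite_weight * len(ids) for gram, ids in seen_in.items()}

# Hypothetical snippets for "Who created the character of Scrooge?"
snippets = [
    "created by charles dickens in 1843",
    "charles dickens wrote a christmas carol",
    "the miser was invented by dickens",
]
scores = mine_ngrams(snippets)
print(scores[("dickens",)], scores[("charles", "dickens")])
```

Redundancy is what makes this work: the right answer tends to recur across many snippets, so it accumulates a higher score than incidental phrases.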

Filter/Reweight N-Grams
The n-grams are filtered and reweighted
according to how well each candidate
matches the expected answer-type, as
specified by a handful of handwritten filters.
Analyze and assign the query one of seven
question types
who-question, what-question, or how-many-question
Based on the query type that has been assigned,
the system determines what collection of filters to
apply to the set of potential answers found during
n-gram harvesting.
The answers are analyzed for features relevant to the
filters, and then rescored according to the presence of
such information

Tile N-Grams
Merges similar answers and assembles longer answers out of
answer fragments.
Tiling constructs longer n-grams from sequences of overlapping
shorter n-grams: "A B C" + "B C D" → "A B C D".
The algorithm proceeds greedily from the top-scoring candidate -
all subsequent candidates (up to a certain cutoff) are checked to
see if they can be tiled with the current candidate answer.
If so, the higher scoring candidate is replaced with the longer tiled n-
gram, and the lower scoring candidate is removed.
The algorithm stops only when no n-grams can be further tiled.
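The greedy tiling procedure can be illustrated with a short sketch. Candidates are token lists with scores; two candidates tile when a suffix of one equals a prefix of the other. This version keeps the higher candidate's score unchanged after a merge and omits the cutoff on how many candidates are examined; both are simplifying assumptions, and the function names are ours.

```python
def tile(first, second):
    """Join two token lists if a suffix of first equals a prefix of second."""
    for size in range(min(len(first), len(second)), 0, -1):
        if first[-size:] == second[:size]:
            return first + second[size:]
    return None

def tile_candidates(scored):
    """Greedy tiling over (tokens, score) pairs, highest score first."""
    candidates = sorted(scored, key=lambda pair: pair[1], reverse=True)
    changed = True
    while changed:                      # stop when nothing can be tiled further
        changed = False
        for i in range(len(candidates)):
            for j in range(i + 1, len(candidates)):
                a, b = candidates[i][0], candidates[j][0]
                merged = tile(a, b) or tile(b, a)
                if merged is not None:
                    # The higher-scoring slot takes the longer tiled n-gram;
                    # the lower-scoring fragment is removed.
                    candidates[i] = (merged, candidates[i][1])
                    del candidates[j]
                    changed = True
                    break
            if changed:
                break
    return candidates

scored = [(["A", "B", "C"], 5), (["B", "C", "D"], 3), (["X", "Y"], 2)]
tiled = tile_candidates(scored)
print(tiled)
```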

Experiments
500 TREC-9 queries
Generate a ranked list of 5 candidate
answers, a maximum of 50 bytes long
MRR, Number of questions correctly
answered (NumCorrect), proportion of
questions correctly answered (PropCorrect)
Performance under default setting: MRR
(0.507), PropCorrect (61%), average answer
length (12 bytes)
70% of the correct answers occur in the first
position, and 90% in the first or second
positions
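MRR (mean reciprocal rank) averages, over all questions, the reciprocal of the rank at which the first correct answer appears; a question with no correct answer in the list contributes zero. A minimal sketch with made-up questions and answers:

```python
def mean_reciprocal_rank(ranked_answers, is_correct):
    """MRR over questions; a question with no correct answer contributes 0."""
    total = 0.0
    for question, answers in ranked_answers.items():
        for rank, answer in enumerate(answers, start=1):
            if is_correct(question, answer):
                total += 1.0 / rank
                break
    return total / len(ranked_answers)

# Hypothetical system output and answer key (toy data).
ranked = {
    "q1": ["booth", "lincoln", "ford"],      # correct at rank 1
    "q2": ["four", "five", "six"],           # correct at rank 2
    "q3": ["paris", "rome", "berlin"],       # no correct answer
}
gold = {"q1": "booth", "q2": "five", "q3": "madrid"}
mrr = mean_reciprocal_rank(ranked, lambda q, a: gold[q] == a)
print(round(mrr, 3))
```

Here the three questions contribute 1, 1/2, and 0, so the MRR is 0.5.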

Experiments – Number of Snippets
Vary the number of summaries (snippets) from the search engine and use as input to the n-gram mining process
Default setting: 100
Peaking 0.514 MRR with 200 snippets
When 1000 snippets are used, the weaker AND rewrites dominate the matches
Importance of redundancy in answer extraction

TREC vs. Web Databases
The lack of redundancy in TREC accounts for
a large part of this drop off in performance

Personalized Search Engine
楊雅雯、柯皓仁、楊維邦 (2000)；楊雅雯(2001)

Combining Data Mining with Personalized Services
柯皓仁、楊雅雯、吳安琪、戴玉旻、楊維邦(2002)

Personalized Recommendation
余明哲(2003)

Interlibrary Loan (ILL) Cost Analysis – Top 10 Journals by Cost of NCTU's Outgoing Requests
Rank | Journal title | ILL cost (US$) | Number of ILL requests | 2001 subscription price (US$)
1 | SPIE (Journals and Proceedings) | 159 | 12 | NA
2 | Journal of Luminescence | 103 | 22 | 2113
3 | Journal of the Electrochemical Society | 82 | 78 | 560
4 | Statistics in Medicine | 34 | 25 | 2495
5 | Journal of Microcolumn Separations | 27 | 21 | 1002
6 | Journal / American Water Works Association | 26 | 18 | 85
7 | The Journal of Chemical Physics | 21 | 21 | 4455
8 | Journal of the Patent and Trademark Office Society | 22 | 12 | 50
9 | Journal of Solid State Chemistry | 20 | 19 | 3499
10 | Journal of Applied Physics | 19 | 20 | 3100
黃明居、柯皓仁(2003)

Interlibrary Loan (ILL) Cost Analysis – Journals Requested by NCTU from Other Libraries
 | Number of Serial Titles | Percentage
Total Serials Title Accessed | 1604 | 100%
Title with One Request Only | 1096 | 68%
Title with One to Four Requests | 1512 | 94%
Title with Five or More Requests | 92 | 6%
Title with Ten or More Requests | 39 | 2%
黃明居、柯皓仁(2003)

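Summary
Data sources
Search logs from the library automation system
Circulation history from the library automation system
NDDS interlibrary loan transaction records
Purposes of data analysis
Personalized recommendation
Collection development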

Overview of a Big Data System
http://blogs.vmware.com/vfabric/2012/08/4-key-architecture-considerations-for-big-data-analytics.html

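Six Key Components of a Big Data Analytics Platform
[Diagram: a big data analytics platform surrounded by its six components]
Hadoop system
Stream computing
Data warehouse
Text analytics
Information integration and governance
Visualization and discovery
(胡世忠，2013)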

Hadoop
Completes data processing quickly by means of a distributed processing model
Distributed file system (Hadoop Distributed File System, HDFS)
Distributed processing framework (MapReduce)
(胡世忠，2013)
http://hadoop.apache.org/docs/stable/hdfs_design.html
https://developers.google.com/appengine/docs/python/dataprocessing/
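The MapReduce programming model can be illustrated without a cluster. The sketch below runs the classic word count in a single process, with plain functions standing in for the map, shuffle, and reduce phases. It does not use the Hadoop API; on a real cluster the framework would run the map and reduce functions in parallel on different nodes and perform the shuffle itself.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in document.lower().split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

# Toy input splits; on a real cluster each would live on a different node.
splits = ["big data big libraries", "data about libraries"]
emitted = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())
print(counts)
```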

Stream Computing
An endless flow of data enters the stream computing engine, and analysis is completed before the data is stored
(胡世忠，2013)
http://www.rosebt.com/1/category/ibm big data platform/1.html
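The idea of analyzing data before it is stored can be sketched with a generator that updates a running statistic and raises an alert per reading, without keeping the readings themselves. This is a toy illustration of the processing style only, not of any particular stream computing product; the sensor values and threshold are made up.

```python
def running_alerts(readings, threshold):
    """Yield (count, running_mean, is_alert) per reading without storing them."""
    count = 0
    mean = 0.0
    for value in readings:
        count += 1
        mean += (value - mean) / count  # incremental mean update
        yield count, mean, value > threshold

# Hypothetical sensor stream; a generator so values are consumed one at a time.
stream = (v for v in [10.0, 12.0, 50.0, 11.0])
results = list(running_alerts(stream, threshold=40.0))
for row in results:
    print(row)
```

Only the count and the running mean are held in memory, so the same loop works for a stream of any length.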

Data Governance
A set of policies and procedures that ensure data quality (accuracy, completeness, confidentiality, privacy, and so on)
(胡世忠，2013)
http://www.dataversity.net/assessing-big-data-governance/

The Big Data Analytics Process
http://practicalanalytics.wordpress.com/2011/12/12/big-data-analytics-use-cases/

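From Big Data to Small Data
Never mind big data: are libraries really using the data they already have at hand?
Data-based decision making (some kind of evidence-based)
Data librarianship
Helping users discover and use data (GIS or SS data librarians)
Getting involved in research early and working with researchers to manage, share, and preserve research data
Most researchers generate less than 100 GB of research data (https://www.sciencemag.org/content/331/6018/692)
Driven by requirements that research data be open and shared
Data curation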

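Does Having Big Data Mean Having Everything?
With enough data, the numbers speak for themselves
Impossible (well, some argue that more data is not always better, and that correlation is not necessarily better than causation)
Big data makes our cities smarter and more efficient
To a degree
Big data treats all social groups alike
Hardly ever
Big data is anonymous, so it cannot invade privacy
Flat-out wrong
Big data is the future of science
Partly true, but it has its limits
(Crawford, 2013)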

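Conclusion
Data-based decision making
Never mind big data: are libraries really using the data they already have at hand?
Big Data – Linked Data – Linked Open Data
The dark side of big data
The 4 Vs of library data
Who holds libraries' big data?
Links in the data value chain: data holders, data specialists and analytics technology, and people with a big-data mindset (麥爾筍伯格、庫基耶，2013)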

Libraries' "Big" Data
(Nicholson, 2003, 2006)

References
Chen, S.Y., Tseng, T. T., Ke, H. R. & Sun, C. T. (2011). Social Trend
Tracking by Time Series Based Social Tagging Clustering. Expert
Systems with Applications, 38(10): 12807-12817.
Crawford, K. (2013). Think Again: Big Data. Retrieved from
http://www.foreignpolicy.com/articles/2013/05/09/think_again_big_data.
Dumais, S., Banko, M., Brill, E., Lin, J. and Ng, A. (2002). Web question
answering: is more always better? SIGIR '02 Proceedings of the 25th
annual international ACM SIGIR conference on Research and
development in information retrieval, 291-298.
Dumbill, E. (2013, March). Making Sense of Big Data. Big Data, 1(1).
Retrieved from
http://online.liebertpub.com/doi/pdfplus/10.1089/big.2012.1503.
Ke, H. R., Kwakkelaar R., Tai, Y. M., and Chen, L. C. (2002). Exploring
Behavior of E-Journal Users in Taiwan – Transaction Log Analysis of
Elsevier ScienceDirect OnSite. Library & Information Science Research,
24 (3), 265-291.

References (cont.)
Nicholson, S (2006). The basis for bibliomining: Frameworks for
bringing together usage-based data mining and bibliometrics through
data warehousing in digital library services. Information Processing and
Management, 42, 785-804.
Nicholson, S (2003). The Bibliomining Process: Data Warehousing and
Data Mining for Library Decision Making. Information Technology and
Libraries, 22 (4), 146-151.
Villars, R. L., Olofson, C. W., & Eastwood, M. (2011, June). Big data:
What it is and why you should care. White Paper, IDC. Retrieved from
http://sites.amd.com/es/Documents/Big-Data-WP-06-2011.pdf.
Wilson, J. (2013) Big Data – What’s the big deal?. Retrieved from
http://academic.sla.org/?p=1030.

References (cont.)
胡世忠(2013)。雲端時代的殺手級應用：海量資料分析。天下雜誌。
麥爾筍伯格、庫基耶(著)，林俊宏(譯)(2013)。大數據。天下遠見出版。
黃明居、柯皓仁 (2003)。全國館際合作系統績效衡量與使用者分析之研
究。大學圖書館 7卷1期 (民國92年3月)，頁56-74。
柯皓仁、楊雅雯、吳安琪、戴玉旻、楊維邦 (2002)。個人化及群體化圖
書館資訊服務初探。國家圖書館館刊 91年第1期 (民國91年6月)，頁161-
195。
楊雅雯、柯皓仁、楊維邦 (2000/10)。個人化數位圖書資訊環境 – 以
PIE@NCTU為例。2000年台灣區網際網路研討會 (TANET 2000), 頁
467-474.
余明哲 (2003/06)。圖書館個人化館藏推薦系統。交通大學資訊科學研究
所碩士論文。
楊雅雯(2001/06)。個人化數位圖書資訊環境 - 以PIE@NCTU為例。交通
大學資訊科學研究所碩士論文。