11. Definition of Big Data
Dumbill (2013)
Big data is data that exceeds the processing
capacity of conventional database systems. The
data is too big, moves too fast, or doesn’t fit the
strictures of your database architectures. To gain
value from this data, you must choose an
alternative way to process it.
Detecting patterns in big data, uncovering the truth, and
predicting the answers to complex questions is the key to big data analysis
36. Computing Time-Series Similarity
$tag_i$: $(v_{i,1}, t_1), (v_{i,2}, t_2), (v_{i,3}, t_3), (v_{i,4}, t_4), (v_{i,5}, t_5), (v_{i,6}, t_6)$
$tag_j$: $(v_{j,1}, t_1), (v_{j,2}, t_2), (v_{j,3}, t_3), (v_{j,4}, t_4), (v_{j,5}, t_5), (v_{j,6}, t_6)$
$$sim(tag_i, tag_j) = \frac{similarity(v_{i,1}, v_{j,1}) + \cdots + similarity(v_{i,N}, v_{j,N})}{N}$$
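The averaging step can be sketched in Python. The slide does not say what each value v is or how two values are compared, so this sketch assumes each v is a numeric feature vector and uses cosine similarity as a stand-in; `cosine_similarity` and `tag_similarity` are illustrative names, not taken from the source.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (assumed per-period measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def tag_similarity(
    series_i: Sequence[Vector],
    series_j: Sequence[Vector],
    similarity: Callable[[Vector, Vector], float] = cosine_similarity,
) -> float:
    """Average the per-period similarities over the N aligned time periods."""
    if not series_i or len(series_i) != len(series_j):
        raise ValueError("both tags need the same non-zero number of periods")
    total = sum(similarity(v_i, v_j) for v_i, v_j in zip(series_i, series_j))
    return total / len(series_i)


# Two tags observed over N = 2 periods: identical in t1, orthogonal in t2.
example = tag_similarity([[1, 0], [0, 1]], [[1, 0], [1, 0]])
```

Passing a different `similarity` function changes the per-period measure without touching the averaging.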
48. Data Source: Web Server Log
Transaction log
NCSA-defined CLF (Common Log Format) logged
by WWW servers
IP address, date and time, requests, and bytes returned
Proprietary logs
Example of NCSA-defined CLF Requests
"GET /cgi-bin/search.pl?collection=journals&search_field=xml&GetSearchResults=Search&fields=Any HTTP/1.1"
"GET /cgi-bin/sciserv.pl?collection=journals&journal=01429418&issue=v18i0003&article=181_tpocfc&form=pdf&file=file.pdf HTTP/1.0"
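A CLF line can be split into fields with a regular expression. The following is a minimal Python sketch, assuming the standard seven-field layout (host, identity, user, timestamp, request, status, bytes). The request string comes from the slide; the IP address, timestamp, status code, and byte count in `SAMPLE` are made up for illustration.

```python
import re
from typing import Optional

# host ident authuser [timestamp] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf(line: str) -> Optional[dict]:
    """Return the fields of one CLF line as a dict, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # CLF writes "-" when no bytes were returned.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


# Hypothetical log line: only the quoted request is taken from the slide.
SAMPLE = (
    '192.0.2.1 - - [10/Oct/2000:13:55:36 +0800] '
    '"GET /cgi-bin/search.pl?collection=journals&search_field=xml&'
    'GetSearchResults=Search&fields=Any HTTP/1.1" 200 2326'
)
entry = parse_clf(SAMPLE)
```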
52. Interpreting the Meaning Behind the Numbers
Relatively few repeated users?
Users mistakenly visit an inappropriate electronic
resource
Newcomers
Access E-journal systems in a very focused way,
only accessing the system when they know exactly
which article they are interested in
Visit an electronic resource via inter-linking
Proxy servers/cache servers/shared PC
Short session length
Need further investigation into information seeking
behavior of users
Browsing? Query?
53. Server Load
[Figure: number of log entries by hour of day (0 to 23 o'clock)]
[Figure: number of log entries by day of week: Sun 294,096; Mon 761,068; Tue 799,275; Wed 830,933; Thu 830,657; Fri 800,239; Sat 423,694]
Best schedule for system maintenance
Better performance during light-loading period
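Load profiles like these can be produced by counting log entries per hour and per weekday. Here is a minimal Python sketch, assuming the timestamps use the CLF layout `dd/Mon/yyyy:HH:MM:SS +zzzz`; the function names and sample timestamps are illustrative, not from the source.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def load_profile(timestamps):
    """Count log entries per hour of day and per day of week."""
    by_hour = Counter()
    by_weekday = Counter()
    for raw in timestamps:
        when = datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")
        by_hour[when.hour] += 1
        by_weekday[WEEKDAYS[when.weekday()]] += 1
    return by_hour, by_weekday


def quietest_hour(by_hour):
    """Hour (0-23) with the fewest entries: a candidate maintenance window."""
    return min(range(24), key=lambda hour: by_hour.get(hour, 0))


# Hypothetical timestamps for illustration.
stamps = [
    "10/Oct/2000:13:55:36 +0800",
    "10/Oct/2000:13:59:01 +0800",
    "11/Oct/2000:03:10:00 +0800",
]
by_hour, by_weekday = load_profile(stamps)
```

Ties in `quietest_hour` resolve to the earliest hour, so a real schedule should also look at neighbouring hours.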
55. Interpreting the Meaning Behind the Numbers
Small fraction of accesses is for online help
Take actions to …
Increase use of online help
Improve on-line help’s quality and accessibility
Help users to know specific features of systems
Proactive and context sensitive mechanisms –
Today’s Tip
Small fraction of accesses is for copyright
disclaimer
Combined with the assumption that usage terms
and conditions may not always be strictly observed
Reinforces the notion that libraries have to stress
the significance of fair and legal use of electronic
resources
57. Interpreting the Meaning Behind the Numbers
About half of the valid IP addresses do not
issue queries
Reasonable for E-journal systems
About 10% of users query more than 20 times
Do they make use of E-journal systems from an
A&I database point of view?
Librarians have to clarify the different roles of A&I
databases and E-journal systems
Reflects the significance of linking A&I databases and E-
journal systems
65. Question Answering
Retrieve small snippets of text that contain the
actual answer to a question rather than the
document lists traditionally returned by text
retrieval systems
Find the answer to "What is the highest mountain in Taiwan?"
Search Engine:
Taiwan, highest, mountain → documents related to the keywords
Question Answering System:
What is the highest mountain in Taiwan? → Yushan (玉山)
A QA example that showcases big data: Web Question
Answering: Is More Always Better? (Dumais, Banko,
Brill, Lin & Ng, 2002)
66. Introduction
Focus on factoid questions
Motivated by observations in NLP: significant
improvements in accuracy can be attained
simply by increasing the amount of data used
for learning
The Web has a tremendous amount of data
Instead of focusing on linguistic resources,
such as part-of-speech tagging, syntactic
parsing, semantic relations, named entity
extraction, dictionaries, WordNet, this paper
focuses on DATA (Web Data)
67. Exploiting Redundancy for QA
Redundancy: multiple, differently phrased,
answer occurrences
Enable Simple Query Rewrites
It is difficult to extract the correct answer from a
small corpus for a question, if the corpus contains
few documents for that question
The greater the number of information sources we
can draw from, the easier the task of rewriting the
question becomes, since the answer is more likely
to be expressed in different manners
“Who killed Abraham Lincoln?”
“John Wilkes Booth altered history with a bullet. He will
forever be known as the man who ended Abraham
Lincoln’s life”
68. Exploiting Redundancy for QA (Cont.)
Facilitates Answer Mining
Even when no obvious answer strings can be
found in the text, redundancy can improve the
efficacy of question answering
“How many times did Bjorn Borg win Wimbledon?”
70. Rewrite Example
For each query, also generate a final rewrite
which is a backoff to a simple ANDing of non-
stop words in the query
Rewrite example: “Who created the character
of Scrooge?”
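The backoff rewrite can be sketched in a few lines of Python. The stop-word list below is a small assumed sample (the slide does not give the list the system used), and `backoff_and_query` is an illustrative name.

```python
import re

# Assumed stop-word list for illustration only.
STOP_WORDS = frozenset({
    "a", "an", "the", "of", "in", "on", "to", "is", "was", "did", "do",
    "does", "who", "what", "when", "where", "which", "how", "many",
})


def backoff_and_query(question: str) -> str:
    """AND together the non-stop words of the question, in their original order."""
    words = re.findall(r"[A-Za-z0-9']+", question)
    content_words = [w for w in words if w.lower() not in STOP_WORDS]
    return " AND ".join(content_words)


query = backoff_and_query("Who created the character of Scrooge?")
# query == "created AND character AND Scrooge"
```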
71. Mine N-Grams
From the page summaries returned by the
search engine, n-grams are mined.
The returned summaries contain the query terms,
usually with a few words of surrounding context.
In some cases, this surrounding context has
truncated the answer string, which may negatively
impact results. (Hope not harmful)
The summary text is then processed to
retrieve only strings to the left or right of the
query string, as specified in the rewrite triple.
72. Mine N-Grams (Cont.)
1-, 2-, and 3-grams are extracted from the
summaries.
The final score for an n-gram is based on the
rewrite rules that generated it and the number
of unique summaries in which it occurred
When searching for candidate answers, we
enforce the constraint that at most one
stopword is permitted to appear in any
potential n-gram answers
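The mining step might look like the Python sketch below. It makes several assumptions: each summary has already been trimmed to the text on the relevant side of the query string, each summary carries the weight of the rewrite rule that retrieved it, and an n-gram's score is the sum of those weights over the unique summaries containing it. The slide only says the score depends on both factors, so the exact combination and the stop-word list here are illustrative.

```python
from collections import defaultdict

# Assumed stop-word list for illustration only.
STOP_WORDS = frozenset({"a", "an", "the", "of", "in", "on", "to", "is", "was", "by"})


def mine_ngrams(summaries, max_n=3, max_stopwords=1):
    """Score 1- to max_n-grams from (text, rewrite_weight) pairs.

    Each unique summary contributes its rewrite weight once to every
    n-gram it contains; n-grams with more than max_stopwords stop words
    are discarded.
    """
    scores = defaultdict(float)
    seen_texts = set()
    for text, weight in summaries:
        if text in seen_texts:  # count each unique summary only once
            continue
        seen_texts.add(text)
        tokens = text.lower().split()
        grams_here = set()
        for n in range(1, max_n + 1):
            for start in range(len(tokens) - n + 1):
                gram = tuple(tokens[start:start + n])
                if sum(token in STOP_WORDS for token in gram) <= max_stopwords:
                    grams_here.add(gram)
        for gram in grams_here:
            scores[gram] += weight
    return dict(scores)


scores = mine_ngrams([
    ("John Wilkes Booth", 5.0),
    ("Booth shot Lincoln", 2.0),
    ("John Wilkes Booth", 5.0),  # duplicate summary, ignored
])
```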
73. Filter/Reweight N-Grams
The n-grams are filtered and reweighted
according to how well each candidate
matches the expected answer-type, as
specified by a handful of handwritten filters.
Analyze the query and assign it one of seven
question types
e.g., who-question, what-question, or how-many-question
Based on the query type that has been assigned,
the system determines what collection of filters to
apply to the set of potential answers found during
n-gram harvesting.
The answers are analyzed for features relevant to the
filters, and then rescored according to the presence of
such information
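A type-based reweighting step could be sketched as below. The slide does not give the handwritten filters, so the two shown here (boost answers containing digits for how-many questions, boost answers containing a capitalized word for who questions) and the boost factor are hypothetical stand-ins chosen only to illustrate the mechanism.

```python
def question_type(question: str) -> str:
    """Assign a coarse question type from the leading words (hypothetical rules)."""
    q = question.lower().strip()
    if q.startswith("how many"):
        return "how-many"
    if q.startswith("who"):
        return "who"
    return "other"


def reweight(question: str, candidates: dict, boost: float = 2.0) -> dict:
    """Rescore candidate answers that show the feature expected for the question type."""
    qtype = question_type(question)
    rescored = {}
    for answer, score in candidates.items():
        has_digit = any(ch.isdigit() for ch in answer)
        has_capital = any(word[:1].isupper() for word in answer.split())
        if qtype == "how-many" and has_digit:
            score *= boost  # numeric answers fit how-many questions
        elif qtype == "who" and has_capital:
            score *= boost  # capitalized words suggest a person's name
        rescored[answer] = score
    return rescored


result = reweight(
    "How many times did Bjorn Borg win Wimbledon?",
    {"5": 1.0, "Wimbledon": 1.5},
)
```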
74. Tile N-Grams
Merges similar answers and assembles longer answers out of
answer fragments.
Tiling constructs longer n-grams from sequences of overlapping
shorter n-grams: "A B C" + "B C D" → "A B C D".
The algorithm proceeds greedily from the top-scoring candidate -
all subsequent candidates (up to a certain cutoff) are checked to
see if they can be tiled with the current candidate answer.
If so, the higher scoring candidate is replaced with the longer tiled n-
gram, and the lower scoring candidate is removed.
The algorithm stops only when no n-grams can be further tiled.
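The greedy tiling loop can be sketched in Python as follows. Two details are assumptions, since the slide does not state them: the tiled n-gram keeps the score of the higher-scoring candidate, and a candidate fully contained in another is treated as tileable. The `cutoff` default is arbitrary.

```python
def _contains(longer, shorter):
    """True if shorter occurs as a contiguous run inside longer."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def tile(a, b):
    """Return the tiled n-gram if a and b overlap or one contains the other, else None."""
    if _contains(a, b):
        return a
    if _contains(b, a):
        return b
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:  # end of a overlaps start of b
            return a + b[k:]
        if b[-k:] == a[:k]:  # end of b overlaps start of a
            return b + a[k:]
    return None


def tile_candidates(scored, cutoff=50):
    """Greedily tile the top-scoring candidates until nothing more can be tiled."""
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)[:cutoff]
    changed = True
    while changed:
        changed = False
        for i in range(len(ranked)):
            for j in range(i + 1, len(ranked)):
                merged = tile(ranked[i][0], ranked[j][0])
                if merged is not None:
                    # Assumption: the tiled n-gram keeps the higher score.
                    ranked[i] = (merged, ranked[i][1])
                    del ranked[j]
                    changed = True
                    break
            if changed:
                break
    return ranked


tiled = tile_candidates({
    ("a", "b", "c"): 3.0,
    ("b", "c", "d"): 2.0,
    ("x", "y"): 1.0,
})
```

Each merge removes one candidate, so the loop always terminates.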
75. Experiments
500 TREC-9 queries
Generate a ranked list of 5 candidate
answers, a maximum of 50 bytes long
MRR, Number of questions correctly
answered (NumCorrect), proportion of
questions correctly answered (PropCorrect)
Performance under default setting: MRR
(0.507), PropCorrect (61%), average answer
length (12 bytes)
70% of the correct answers occur in the first
position, and 90% in the first or second
positions
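The three measures can be computed from the rank of the first correct answer for each question. This Python sketch uses the usual definition of MRR (reciprocal rank of the first correct answer, zero if none appears in the top k); the input format and the function name are assumptions for illustration.

```python
def evaluate(first_correct_ranks, top_k=5):
    """Compute MRR, NumCorrect, and PropCorrect.

    first_correct_ranks holds, per question, the 1-based rank of the first
    correct answer, or None if no correct answer was returned.
    """
    reciprocal_ranks = [
        1.0 / rank if rank is not None and rank <= top_k else 0.0
        for rank in first_correct_ranks
    ]
    num_questions = len(first_correct_ranks)
    num_correct = sum(1 for rr in reciprocal_ranks if rr > 0)
    return {
        "MRR": sum(reciprocal_ranks) / num_questions,
        "NumCorrect": num_correct,
        "PropCorrect": num_correct / num_questions,
    }


# Four hypothetical questions: correct at rank 1, rank 2, never, rank 1.
metrics = evaluate([1, 2, None, 1])
```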
76. Experiments – Number of Snippets
Vary the number of summaries (snippets)
from the search engine and use as input to
the n-gram mining process
Default setting: 100
MRR peaks at 0.514 with 200 snippets
When 1000 snippets are used, the weaker AND rewrites dominate the matches
Importance of redundancy in answer extraction
77. TREC vs. Web Databases
The lack of redundancy in TREC accounts for
a large part of this drop off in performance
82. Interlibrary Loan Cost Analysis: Top Ten Journals by Cost of NCTU Outgoing Requests
Rank  Journal Title                                         ILL Cost (US$)  ILL Requests  2001 Subscription Price (US$)
1     SPIE (Journals and Proceedings)                       159             12            NA
2     Journal of Luminescence                               103             22            2113
3     Journal of the Electrochemical Society                82              78            560
4     Statistics in Medicine                                34              25            2495
5     Journal of Microcolumn Separations                    27              21            1002
6     Journal / American Water Works Association            26              18            85
7     The Journal of Chemical Physics                       21              21            4455
8     Journal of the Patent and Trademark Office Society    22              12            50
9     Journal of Solid State Chemistry                      20              19            3499
10    Journal of Applied Physics                            19              20            3100
黃明居、柯皓仁(2003)
83. Interlibrary Loan Cost Analysis: Journals in NCTU Outgoing Requests
                                    Number of Serial Titles  Percentage
Total Serial Titles Accessed        1604                     100%
Titles with One Request Only        1096                     68%
Titles with One to Four Requests    1512                     94%
Titles with Five or More Requests   92                       6%
Titles with Ten or More Requests    39                       2%
黃明居、柯皓仁(2003)
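The percentages in this table follow from the title counts. A short Python check, using only the numbers on the slide (the dictionary keys are shortened labels):

```python
TOTAL_TITLES = 1604

title_counts = {
    "one request only": 1096,
    "one to four requests": 1512,
    "five or more requests": 92,
    "ten or more requests": 39,
}

# Share of all accessed titles, rounded to whole percent as on the slide.
shares = {
    label: round(100 * count / TOTAL_TITLES)
    for label, count in title_counts.items()
}

# Titles with one to four requests and titles with five or more cover all titles.
covers_all = (
    title_counts["one to four requests"] + title_counts["five or more requests"]
    == TOTAL_TITLES
)
```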
97. References
Chen, S. Y., Tseng, T. T., Ke, H. R., & Sun, C. T. (2011). Social Trend
Tracking by Time Series Based Social Tagging Clustering. Expert
Systems with Applications, 38(10), 12807-12817.
Crawford, K. (2013). Think Again: Big Data. Retrieved from
http://www.foreignpolicy.com/articles/2013/05/09/think_again_big_data.
Dumais, S., Banko, M., Brill, E., Lin, J., & Ng, A. (2002). Web question
answering: Is more always better? SIGIR '02: Proceedings of the 25th
Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, 291-298.
Dumbill, E. (2013, March). Making Sense of Big Data. Big Data, 1(1).
Retrieved from
http://online.liebertpub.com/doi/pdfplus/10.1089/big.2012.1503.
Ke, H. R., Kwakkelaar, R., Tai, Y. M., & Chen, L. C. (2002). Exploring
Behavior of E-Journal Users in Taiwan – Transaction Log Analysis of
Elsevier ScienceDirect OnSite. Library & Information Science Research,
24(3), 265-291.
98. References (Cont.)
Nicholson, S. (2006). The basis for bibliomining: Frameworks for
bringing together usage-based data mining and bibliometrics through
data warehousing in digital library services. Information Processing and
Management, 42, 785-804.
Nicholson, S. (2003). The Bibliomining Process: Data Warehousing and
Data Mining for Library Decision Making. Information Technology and
Libraries, 22(4), 146-151.
Villars, R. L., Olofson, C. W., & Eastwood, M. (2011, June). Big data:
What it is and why you should care. White Paper, IDC. Retrieved from
http://sites.amd.com/es/Documents/Big-Data-WP-06-2011.pdf.
Wilson, J. (2013). Big Data – What’s the big deal? Retrieved from
http://academic.sla.org/?p=1030.