SlideShare ist ein Scribd-Unternehmen logo
1 von 13
Downloaden Sie, um offline zu lesen
Osaka.Rもくもく会
2020年6月6日(土)14-18時
@usagisan2020 太田博三
本日はRで文書自動要約!
▶背景:
1) blogなどの非構造データが指数関数的に増加
しており、ニュースでも3行要約などが出てきました。
2) Livedoor newsのざっくりいうと・・・が有名
文書自動要約技術の流れ
1) グラフ・ベース型: Graph-Based-Method
→ Google検索のPageRank(Rでは
lexRankr)を基にして、作られたもの
(TextRankアルゴリズム)がある。
2)特徴量ベース型: Feature-Based-Method
→ラーン(Luhn)が自動要約の第1人者といわれている。
こちらのLuhnの論文を参照のこと
※参照元: 久保さんのqiita, “Collective Inteligence Oreilly publishing”
文書自動要約技術の流れ
3) トピックベースモデル: Topic-Based-Method
→LSA(Latent Semantic Analysis)に代表されるように、特異値
分解(SVD)し2次元座標上にプロットするやり方
4) エンコーダー・デコーダー型: Encoder-Decoder Based
Method
→機械翻訳で用いられている手法のひとつで、エンコーダー
で圧縮(ある指標に基いて文を抽出し)て、デコーダー(復元)で
再度、文を生成するといったやり方です。
※参照元: 久保さんのqiita, “Collective Inteligence Oreilly publishing”
本日はtextmineRで文書自動要約を!
▶用いたpackage:
1) textmineR
2) LSAfun: Applied Latent Semantic Analysis (LSA)
Functions
1)従来のもの VS 2)LSAfun(GPT-2ベース)最近
のものとの構図です。
※1 LSAfun
https://sites.google.com/site/fritzgntr/software-resources
https://cran.r-project.org/web/packages/LSAfun/index.html
1) textmineR
1) textmineRとは、
- Google検索のアルゴリズムのPageRankに類似した、
TextRankアルゴリズムを用います 。
アルゴリズムの処理手順:
①ドキュメントを文に分割、
②関連性について各文にスコアを付与
③スコアの高い文が要約された「文」とみなす
詳しくは、 以下のURLsを参考ください。
https://hazm.at/mox/machine-learning/natural-language-
processing/summarization/textrank/index.html
https://www.anlp.jp/proceedings/annual_meeting/2010/pdf_dir/B1
-4.pdf
https://qiita.com/icoxfog417/items/d06651db10e27220c819
1) textmineR
1) textmineRとは、
今回は下記URLを参考にしました。
5. Document summarization written by Thomas W.
Jones on 2019-04-17
https://cran.r-
project.org/web/packages/textmineR/vignettes/e_doc
_summarization.html
1) textmineR script
library(textmineR)
# load the data
data(movie_review, package =
"text2vec")
# let's take a sample so the
demo will run quickly
# note: textmineR is generally
quite scaleable, depending on
your system
set.seed(123)
s <-
sample(1:nrow(movie_review),
200)
movie_review <-
movie_review[ s , ]
# use LDA to get embeddings into
probability space
# This will take considerably longer as
the TCM matrix has many more rows
# than a DTM
embeddings <- FitLdaModel(dtm =
tcm,
k = 50,
iterations = 200,
burnin = 180,
alpha = 0.1,
beta = 0.05,
optimize_alpha = TRUE,
calc_likelihood = FALSE,
calc_coherence = FALSE,
calc_r2 = FALSE,
cpus = 2)
1) textmineR script
# parse it into sentences
sent <- stringi::stri_split_boundaries(doc,
type = "sentence")[[ 1 ]]
names(sent) <- seq_along(sent) # so we
know index and order
# embed the sentences in the model
e <- CreateDtm(sent, ngram_window =
c(1,1), verbose = FALSE, cpus = 2)
# remove any documents with 2 or fewer
words
e <- e[ rowSums(e) > 2 , ]
vocab <- intersect(colnames(e),
colnames(gamma))
e <- e / rowSums(e)
e <- e[ , vocab ] %*% t(gamma[ , vocab ])
e <- as.matrix(e)
# get the pairwise distances between each embedded sentence
e_dist <- CalcHellingerDist(e)
# turn into a similarity matrix
g <- (1 - e_dist) * 100
# we don't need sentences connected to themselves
diag(g) <- 0
# turn into a nearest-neighbor graph
g <- apply(g, 1, function(x){
x[ x < sort(x, decreasing = TRUE)[ 3 ] ] <- 0
x
})
# by taking pointwise max, we'll make the matrix symmetric
again
g <- pmax(g, t(g))
g <- graph.adjacency(g, mode = "undirected", weighted = TRUE)
# calculate eigenvector centrality
ev <- evcent(g)
# format the result
result <- sent[ names(ev$vector)[ order(ev$vector, decreasing =
TRUE)[ 1:3 ] ] ]
result <- result[ order(as.numeric(names(result))) ]
paste(result, collapse = " ")
1) textmineR script
library(igraph)
summarizer <- function(doc, gamma) {
# recursive fanciness to handle multiple docs at once
if (length(doc) > 1 )
# use a try statement to catch any weirdness that may
arise
return(sapply(doc, function(d) try(summarizer(d,
gamma))))
# parse it into sentences
sent <- stringi::stri_split_boundaries(doc, type =
"sentence")[[ 1 ]]
names(sent) <- seq_along(sent) # so we know index and
order
# embed the sentences in the model
e <- CreateDtm(sent, ngram_window = c(1,1), verbose =
FALSE, cpus = 2)
# remove any documents with 2 or fewer words
e <- e[ rowSums(e) > 2 , ]
vocab <- intersect(colnames(e), colnames(gamma))
e <- e / rowSums(e)
e <- e[ , vocab ] %*% t(gamma[ , vocab ])
e <- as.matrix(e)
# get the pairwise distances between each embedded sentence
e_dist <- CalcHellingerDist(e)
# turn into a similarity matrix
g <- (1 - e_dist) * 100
# we don't need sentences connected to themselves
diag(g) <- 0
# turn into a nearest-neighbor graph
g <- apply(g, 1, function(x){
x[ x < sort(x, decreasing = TRUE)[ 3 ] ] <- 0
x
})
# by taking pointwise max, we'll make the matrix symmetric
again
g <- pmax(g, t(g))
g <- graph.adjacency(g, mode = "undirected", weighted = TRUE)
# calculate eigenvector centrality
ev <- evcent(g)
# format the result
result <- sent[ names(ev$vector)[ order(ev$vector, decreasing =
TRUE)[ 1:3 ] ] ]
result <- result[ order(as.numeric(names(result))) ]
paste(result, collapse = " ")
}
1) textmineR script
docs <- movie_review$review[ 1:3 ]
names(docs) <-
movie_review$id[ 1:3 ]
sums <- summarizer(docs, gamma =
embeddings$gamma)
sums
#>
2595_9
#> "Brilliant adaptation of the largely interior
monologues of Leopold Bloom, Stephen
Dedalus, and Molly Bloom by Joseph Strick in
recreating the endearing portrait of Dublin on
June 16, 1904 - Bloomsday - a day to be
celebrated - double entendre intended! Bravo
director Strick, screenwriter Haines, as well as
casting director and cinematographer in
creating this masterpiece. Gunter Grass' novel,
The Tin Drum filmed by Volker Schlndorff
(1979)is another fine film adaptation of interior
monologue which I favorably compare with
Strick's film.While there are clearly recognized
Dublin landmarks in the original novel and in
the film, there are also recognizable characters,
although with different names in the novel. "
2) LSAfun
• 2) LSAfun: Applied Latent Semantic Analysis
(LSA) Functions
※ LSAfunは、性能の良いマシーンで、次回に。
• https://sites.google.com/site/fritzgntr/software-resources
• https://cran.r-project.org/web/packages/LSAfun/index.html
まとめ
• 自動文書要約は、やはり主観評価だと思い
ます。
• 評価指標はあるものの、文書の長さと各人の
目的により、満足度となる効用は異なると考
えられるのではないかと。。。
• Rで簡易的に、いくつかの異なる手法で、比
較考察できるのはメリットだと思われます。
• なので、あまり深入りしても、あのー、そのー
的な??(すみません)

Weitere ähnliche Inhalte

Mehr von 博三 太田

EC_attribute_exstraction_20221122.pdf
EC_attribute_exstraction_20221122.pdfEC_attribute_exstraction_20221122.pdf
EC_attribute_exstraction_20221122.pdf博三 太田
 
Python nlp handson_20220225_v5
Python nlp handson_20220225_v5Python nlp handson_20220225_v5
Python nlp handson_20220225_v5博三 太田
 
LT_hannari python45th_20220121_2355
LT_hannari python45th_20220121_2355LT_hannari python45th_20220121_2355
LT_hannari python45th_20220121_2355博三 太田
 
Logics 18th ota_20211201
Logics 18th ota_20211201Logics 18th ota_20211201
Logics 18th ota_20211201博三 太田
 
Jsai2021 winter ppt_ota_20211127
Jsai2021 winter ppt_ota_20211127Jsai2021 winter ppt_ota_20211127
Jsai2021 winter ppt_ota_20211127博三 太田
 
2021年度 人工知能学会全国大会 第35回
2021年度 人工知能学会全国大会 第35回2021年度 人工知能学会全国大会 第35回
2021年度 人工知能学会全国大会 第35回博三 太田
 
Lt conehito 20210225_ota
Lt conehito 20210225_otaLt conehito 20210225_ota
Lt conehito 20210225_ota博三 太田
 
Seattle consultion 20200822
Seattle consultion 20200822Seattle consultion 20200822
Seattle consultion 20200822博三 太田
 
Syumai lt_mokumoku__20200807_ota
Syumai  lt_mokumoku__20200807_otaSyumai  lt_mokumoku__20200807_ota
Syumai lt_mokumoku__20200807_ota博三 太田
 
Lt syumai moku_mokukai_20200613
Lt syumai moku_mokukai_20200613Lt syumai moku_mokukai_20200613
Lt syumai moku_mokukai_20200613博三 太田
 
Online python data_analysis19th_20200516
Online python data_analysis19th_20200516Online python data_analysis19th_20200516
Online python data_analysis19th_20200516博三 太田
 
本当に言いたい事をくみ取って応答する対話システムの構築に向けて - 昨年(2019年度)の取り組み -
本当に言いたい事をくみ取って応答する対話システムの構築に向けて- 昨年(2019年度)の取り組み -本当に言いたい事をくみ取って応答する対話システムの構築に向けて- 昨年(2019年度)の取り組み -
本当に言いたい事をくみ取って応答する対話システムの構築に向けて - 昨年(2019年度)の取り組み -博三 太田
 
Thesis sigconf2019 1123_hiromitsu.ota
Thesis sigconf2019 1123_hiromitsu.otaThesis sigconf2019 1123_hiromitsu.ota
Thesis sigconf2019 1123_hiromitsu.ota博三 太田
 
Sigconf 2019 slide_ota_20191123
Sigconf 2019 slide_ota_20191123Sigconf 2019 slide_ota_20191123
Sigconf 2019 slide_ota_20191123博三 太田
 
Japan.r ver0.1 2018 slide_ota_20181201
Japan.r ver0.1 2018 slide_ota_20181201Japan.r ver0.1 2018 slide_ota_20181201
Japan.r ver0.1 2018 slide_ota_20181201博三 太田
 
Japan.r 2018 slide ota_20181201
Japan.r 2018 slide ota_20181201Japan.r 2018 slide ota_20181201
Japan.r 2018 slide ota_20181201博三 太田
 
Sig kbs slide-20181123_ota
Sig kbs slide-20181123_otaSig kbs slide-20181123_ota
Sig kbs slide-20181123_ota博三 太田
 
日本語学習者属性別の言語行為の対話自動生成への適用に関する 一考察 太田博三
日本語学習者属性別の言語行為の対話自動生成への適用に関する 一考察 太田博三日本語学習者属性別の言語行為の対話自動生成への適用に関する 一考察 太田博三
日本語学習者属性別の言語行為の対話自動生成への適用に関する 一考察 太田博三博三 太田
 
言語資源活用ワークショップ2018 ポスター発表 P-4-03-E 日本語学習者属性別の言語行為の対話自動生成への適用に関する 一考察
言語資源活用ワークショップ2018 ポスター発表 P-4-03-E 日本語学習者属性別の言語行為の対話自動生成への適用に関する 一考察言語資源活用ワークショップ2018 ポスター発表 P-4-03-E 日本語学習者属性別の言語行為の対話自動生成への適用に関する 一考察
言語資源活用ワークショップ2018 ポスター発表 P-4-03-E 日本語学習者属性別の言語行為の対話自動生成への適用に関する 一考察博三 太田
 
Thesis 34th sig-kst_ver0.2_ota
Thesis 34th sig-kst_ver0.2_otaThesis 34th sig-kst_ver0.2_ota
Thesis 34th sig-kst_ver0.2_ota博三 太田
 

Mehr von 博三 太田 (20)

EC_attribute_exstraction_20221122.pdf
EC_attribute_exstraction_20221122.pdfEC_attribute_exstraction_20221122.pdf
EC_attribute_exstraction_20221122.pdf
 
Python nlp handson_20220225_v5
Python nlp handson_20220225_v5Python nlp handson_20220225_v5
Python nlp handson_20220225_v5
 
LT_hannari python45th_20220121_2355
LT_hannari python45th_20220121_2355LT_hannari python45th_20220121_2355
LT_hannari python45th_20220121_2355
 
Logics 18th ota_20211201
Logics 18th ota_20211201Logics 18th ota_20211201
Logics 18th ota_20211201
 
Jsai2021 winter ppt_ota_20211127
Jsai2021 winter ppt_ota_20211127Jsai2021 winter ppt_ota_20211127
Jsai2021 winter ppt_ota_20211127
 
2021年度 人工知能学会全国大会 第35回
2021年度 人工知能学会全国大会 第35回2021年度 人工知能学会全国大会 第35回
2021年度 人工知能学会全国大会 第35回
 
Lt conehito 20210225_ota
Lt conehito 20210225_otaLt conehito 20210225_ota
Lt conehito 20210225_ota
 
Seattle consultion 20200822
Seattle consultion 20200822Seattle consultion 20200822
Seattle consultion 20200822
 
Syumai lt_mokumoku__20200807_ota
Syumai  lt_mokumoku__20200807_otaSyumai  lt_mokumoku__20200807_ota
Syumai lt_mokumoku__20200807_ota
 
Lt syumai moku_mokukai_20200613
Lt syumai moku_mokukai_20200613Lt syumai moku_mokukai_20200613
Lt syumai moku_mokukai_20200613
 
Online python data_analysis19th_20200516
Online python data_analysis19th_20200516Online python data_analysis19th_20200516
Online python data_analysis19th_20200516
 
本当に言いたい事をくみ取って応答する対話システムの構築に向けて - 昨年(2019年度)の取り組み -
本当に言いたい事をくみ取って応答する対話システムの構築に向けて- 昨年(2019年度)の取り組み -本当に言いたい事をくみ取って応答する対話システムの構築に向けて- 昨年(2019年度)の取り組み -
本当に言いたい事をくみ取って応答する対話システムの構築に向けて - 昨年(2019年度)の取り組み -
 
Thesis sigconf2019 1123_hiromitsu.ota
Thesis sigconf2019 1123_hiromitsu.otaThesis sigconf2019 1123_hiromitsu.ota
Thesis sigconf2019 1123_hiromitsu.ota
 
Sigconf 2019 slide_ota_20191123
Sigconf 2019 slide_ota_20191123Sigconf 2019 slide_ota_20191123
Sigconf 2019 slide_ota_20191123
 
Japan.r ver0.1 2018 slide_ota_20181201
Japan.r ver0.1 2018 slide_ota_20181201Japan.r ver0.1 2018 slide_ota_20181201
Japan.r ver0.1 2018 slide_ota_20181201
 
Japan.r 2018 slide ota_20181201
Japan.r 2018 slide ota_20181201Japan.r 2018 slide ota_20181201
Japan.r 2018 slide ota_20181201
 
Sig kbs slide-20181123_ota
Sig kbs slide-20181123_otaSig kbs slide-20181123_ota
Sig kbs slide-20181123_ota
 
日本語学習者属性別の言語行為の対話自動生成への適用に関する 一考察 太田博三
日本語学習者属性別の言語行為の対話自動生成への適用に関する 一考察 太田博三日本語学習者属性別の言語行為の対話自動生成への適用に関する 一考察 太田博三
日本語学習者属性別の言語行為の対話自動生成への適用に関する 一考察 太田博三
 
言語資源活用ワークショップ2018 ポスター発表 P-4-03-E 日本語学習者属性別の言語行為の対話自動生成への適用に関する 一考察
言語資源活用ワークショップ2018 ポスター発表 P-4-03-E 日本語学習者属性別の言語行為の対話自動生成への適用に関する 一考察言語資源活用ワークショップ2018 ポスター発表 P-4-03-E 日本語学習者属性別の言語行為の対話自動生成への適用に関する 一考察
言語資源活用ワークショップ2018 ポスター発表 P-4-03-E 日本語学習者属性別の言語行為の対話自動生成への適用に関する 一考察
 
Thesis 34th sig-kst_ver0.2_ota
Thesis 34th sig-kst_ver0.2_otaThesis 34th sig-kst_ver0.2_ota
Thesis 34th sig-kst_ver0.2_ota
 

Kürzlich hochgeladen

College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college projectTonystark477637
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 

Kürzlich hochgeladen (20)

College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 

Osaka.R もくもく会 20200606 14-18時 太田博三

  • 3. 文書自動要約技術の流れ 1) グラフ・ベース型: Graph-Based-Method → Google検索のPageRank(Rでは lexRankr)を基にして、作られたもの (TextRankアルゴリズム)がある。 2)特徴量ベース型: Feature-Based-Method →ラーン(Luhn)が自動要約の第1人者といわれている。 こちらのLuhnの論文を参照のこと ※参照元: 久保さんのqiita, “Collective Inteligence Oreilly publishing”
  • 4. 文書自動要約技術の流れ 3) トピックベースモデル: Topic-Based-Method →LSA(Latent Semantic Analysis)に代表されるように、特異値 分解(SVD)し2次元座標上にプロットするやり方 4) エンコーダー・デコーダー型: Encoder-Decoder Based Method →機械翻訳で用いられている手法のひとつで、エンコーダー で圧縮(ある指標に基いて文を抽出し)て、デコーダー(復元)で 再度、文を生成するといったやり方です。 ※参照元: 久保さんのqiita, “Collective Inteligence Oreilly publishing”
  • 5. 本日はtextmineRで文書自動要約を! ▶用いたpackage: 1) textmineR 2) LSAfun: Applied Latent Semantic Analysis (LSA) Functions 1)従来のもの VS 2)LSAfun(GPT-2ベース)最近 のものとの構図です。 ※1 LSAfun https://sites.google.com/site/fritzgntr/software-resources https://cran.r-project.org/web/packages/LSAfun/index.html
  • 6. 1) textmineR 1) textmineRとは、 - Google検索のアルゴリズムのPageRankに類似した、 TextRankアルゴリズムを用います 。 アルゴリズムの処理手順: ①ドキュメントを文に分割、 ②関連性について各文にスコアを付与 ③スコアの高い文が要約された「文」とみなす 詳しくは、 以下のURLsを参考ください。 https://hazm.at/mox/machine-learning/natural-language- processing/summarization/textrank/index.html https://www.anlp.jp/proceedings/annual_meeting/2010/pdf_dir/B1 -4.pdf https://qiita.com/icoxfog417/items/d06651db10e27220c819
  • 7. 1) textmineR 1) textmineRとは、 今回は下記URLを参考にしました。 5. Document summarization written by Thomas W. Jones on 2019-04-17 https://cran.r- project.org/web/packages/textmineR/vignettes/e_doc _summarization.html
  • 8. 1) textmineR script library(textmineR) # load the data data(movie_review, package = "text2vec") # let's take a sample so the demo will run quickly # note: textmineR is generally quite scaleable, depending on your system set.seed(123) s <- sample(1:nrow(movie_review), 200) movie_review <- movie_review[ s , ] # use LDA to get embeddings into probability space # This will take considerably longer as the TCM matrix has many more rows # than a DTM embeddings <- FitLdaModel(dtm = tcm, k = 50, iterations = 200, burnin = 180, alpha = 0.1, beta = 0.05, optimize_alpha = TRUE, calc_likelihood = FALSE, calc_coherence = FALSE, calc_r2 = FALSE, cpus = 2)
  • 9. 1) textmineR script # parse it into sentences sent <- stringi::stri_split_boundaries(doc, type = "sentence")[[ 1 ]] names(sent) <- seq_along(sent) # so we know index and order # embed the sentences in the model e <- CreateDtm(sent, ngram_window = c(1,1), verbose = FALSE, cpus = 2) # remove any documents with 2 or fewer words e <- e[ rowSums(e) > 2 , ] vocab <- intersect(colnames(e), colnames(gamma)) e <- e / rowSums(e) e <- e[ , vocab ] %*% t(gamma[ , vocab ]) e <- as.matrix(e) # get the pairwise distances between each embedded sentence e_dist <- CalcHellingerDist(e) # turn into a similarity matrix g <- (1 - e_dist) * 100 # we don't need sentences connected to themselves diag(g) <- 0 # turn into a nearest-neighbor graph g <- apply(g, 1, function(x){ x[ x < sort(x, decreasing = TRUE)[ 3 ] ] <- 0 x }) # by taking pointwise max, we'll make the matrix symmetric again g <- pmax(g, t(g)) g <- graph.adjacency(g, mode = "undirected", weighted = TRUE) # calculate eigenvector centrality ev <- evcent(g) # format the result result <- sent[ names(ev$vector)[ order(ev$vector, decreasing = TRUE)[ 1:3 ] ] ] result <- result[ order(as.numeric(names(result))) ] paste(result, collapse = " ")
  • 10. 1) textmineR script library(igraph) summarizer <- function(doc, gamma) { # recursive fanciness to handle multiple docs at once if (length(doc) > 1 ) # use a try statement to catch any weirdness that may arise return(sapply(doc, function(d) try(summarizer(d, gamma)))) # parse it into sentences sent <- stringi::stri_split_boundaries(doc, type = "sentence")[[ 1 ]] names(sent) <- seq_along(sent) # so we know index and order # embed the sentences in the model e <- CreateDtm(sent, ngram_window = c(1,1), verbose = FALSE, cpus = 2) # remove any documents with 2 or fewer words e <- e[ rowSums(e) > 2 , ] vocab <- intersect(colnames(e), colnames(gamma)) e <- e / rowSums(e) e <- e[ , vocab ] %*% t(gamma[ , vocab ]) e <- as.matrix(e) # get the pairwise distances between each embedded sentence e_dist <- CalcHellingerDist(e) # turn into a similarity matrix g <- (1 - e_dist) * 100 # we don't need sentences connected to themselves diag(g) <- 0 # turn into a nearest-neighbor graph g <- apply(g, 1, function(x){ x[ x < sort(x, decreasing = TRUE)[ 3 ] ] <- 0 x }) # by taking pointwise max, we'll make the matrix symmetric again g <- pmax(g, t(g)) g <- graph.adjacency(g, mode = "undirected", weighted = TRUE) # calculate eigenvector centrality ev <- evcent(g) # format the result result <- sent[ names(ev$vector)[ order(ev$vector, decreasing = TRUE)[ 1:3 ] ] ] result <- result[ order(as.numeric(names(result))) ] paste(result, collapse = " ") }
  • 11. 1) textmineR script docs <- movie_review$review[ 1:3 ] names(docs) <- movie_review$id[ 1:3 ] sums <- summarizer(docs, gamma = embeddings$gamma) sums #> 2595_9 #> "Brilliant adaptation of the largely interior monologues of Leopold Bloom, Stephen Dedalus, and Molly Bloom by Joseph Strick in recreating the endearing portrait of Dublin on June 16, 1904 - Bloomsday - a day to be celebrated - double entendre intended! Bravo director Strick, screenwriter Haines, as well as casting director and cinematographer in creating this masterpiece. Gunter Grass' novel, The Tin Drum filmed by Volker Schlndorff (1979)is another fine film adaptation of interior monologue which I favorably compare with Strick's film.While there are clearly recognized Dublin landmarks in the original novel and in the film, there are also recognizable characters, although with different names in the novel. "
  • 12. 2) LSAfun • 2) LSAfun: Applied Latent Semantic Analysis (LSA) Functions ※ LSAfunは、性能の良いマシーンで、次回に。 • https://sites.google.com/site/fritzgntr/software-resources • https://cran.r-project.org/web/packages/LSAfun/index.html
  • 13. まとめ • 自動文書要約は、やはり主観評価だと思い ます。 • 評価指標はあるものの、文書の長さと各人の 目的により、満足度となる効用は異なると考 えられるのではないかと。。。 • Rで簡易的に、いくつかの異なる手法で、比 較考察できるのはメリットだと思われます。 • なので、あまり深入りしても、あのー、そのー 的な??(すみません)