Osaka.R もくもく会 20200606 14-18時　太田博三

Osaka.Rもくもく会
2020年6月6日（土）14-18時
@usagisan2020 太田博三

本日はRで文書自動要約！
▶背景:
1) blogなどの非構造データが指数関数的に増加
しており、ニュースでも3行要約などが出てきました。
2) Livedoor newsのざっくりいうと・・・が有名

文書自動要約技術の流れ
1) グラフ・ベース型: Graph-Based-Method
→ Google検索のPageRank（Rでは
lexRankr）を基にして、作られたもの
(TextRankアルゴリズム)がある。
2)特徴量ベース型: Feature-Based-Method
→ラーン(Luhn)が自動要約の第1人者といわれている。
こちらのLuhnの論文を参照のこと
※参照元: 久保さんのqiita, “Collective Inteligence Oreilly publishing”

文書自動要約技術の流れ
3) トピックベースモデル: Topic-Based-Method
→LSA(Latent Semantic Analysis)に代表されるように、特異値
分解(SVD)し2次元座標上にプロットするやり方
4) エンコーダー・デコーダー型: Encoder-Decoder Based
Method
→機械翻訳で用いられている手法のひとつで、エンコーダー
で圧縮(ある指標に基いて文を抽出し)て、デコーダー(復元)で
再度、文を生成するといったやり方です。
※参照元: 久保さんのqiita, “Collective Inteligence Oreilly publishing”

本日はtextmineRで文書自動要約を！
▶用いたpackage:
1) textmineR
2) LSAfun: Applied Latent Semantic Analysis (LSA)
Functions
1)従来のもの VS 2)LSAfun（GPT-2ベース）最近
のものとの構図です。
※1 LSAfun
https://sites.google.com/site/fritzgntr/software-resources
https://cran.r-project.org/web/packages/LSAfun/index.html

1) textmineR
1) textmineRとは、
- Google検索のアルゴリズムのPageRankに類似した、
TextRankアルゴリズムを用います。
アルゴリズムの処理手順:
①ドキュメントを文に分割、
②関連性について各文にスコアを付与
③スコアの高い文が要約された「文」とみなす
詳しくは、以下のURLｓを参考ください。
https://hazm.at/mox/machine-learning/natural-language-
processing/summarization/textrank/index.html
https://www.anlp.jp/proceedings/annual_meeting/2010/pdf_dir/B1
-4.pdf
https://qiita.com/icoxfog417/items/d06651db10e27220c819

1) textmineR
1) textmineRとは、
今回は下記URLを参考にしました。
5. Document summarization written by Thomas W.
Jones on 2019-04-17
https://cran.r-
project.org/web/packages/textmineR/vignettes/e_doc
_summarization.html

1) textmineR script
library(textmineR)
# load the data
data(movie_review, package =
"text2vec")
# let's take a sample so the
demo will run quickly
# note: textmineR is generally
quite scaleable, depending on
your system
set.seed(123)
s <-
sample(1:nrow(movie_review),
200)
movie_review <-
movie_review[ s , ]
# use LDA to get embeddings into
probability space
# This will take considerably longer as
the TCM matrix has many more rows
# than a DTM
embeddings <- FitLdaModel(dtm =
tcm,
k = 50,
iterations = 200,
burnin = 180,
alpha = 0.1,
beta = 0.05,
optimize_alpha = TRUE,
calc_likelihood = FALSE,
calc_coherence = FALSE,
calc_r2 = FALSE,
cpus = 2)

1) textmineR script
# parse it into sentences
sent <- stringi::stri_split_boundaries(doc,
type = "sentence")[[ 1 ]]
names(sent) <- seq_along(sent) # so we
know index and order
# embed the sentences in the model
e <- CreateDtm(sent, ngram_window =
c(1,1), verbose = FALSE, cpus = 2)
# remove any documents with 2 or fewer
words
e <- e[ rowSums(e) > 2 , ]
vocab <- intersect(colnames(e),
colnames(gamma))
e <- e / rowSums(e)
e <- e[ , vocab ] %*% t(gamma[ , vocab ])
e <- as.matrix(e)
# get the pairwise distances between each embedded sentence
e_dist <- CalcHellingerDist(e)
# turn into a similarity matrix
g <- (1 - e_dist) * 100
# we don't need sentences connected to themselves
diag(g) <- 0
# turn into a nearest-neighbor graph
g <- apply(g, 1, function(x){
x[ x < sort(x, decreasing = TRUE)[ 3 ] ] <- 0
x
})
# by taking pointwise max, we'll make the matrix symmetric
again
g <- pmax(g, t(g))
g <- graph.adjacency(g, mode = "undirected", weighted = TRUE)
# calculate eigenvector centrality
ev <- evcent(g)
# format the result
result <- sent[ names(ev$vector)[ order(ev$vector, decreasing =
TRUE)[ 1:3 ] ] ]
result <- result[ order(as.numeric(names(result))) ]
paste(result, collapse = " ")

1) textmineR script
library(igraph)
summarizer <- function(doc, gamma) {
# recursive fanciness to handle multiple docs at once
if (length(doc) > 1 )
# use a try statement to catch any weirdness that may
arise
return(sapply(doc, function(d) try(summarizer(d,
gamma))))
# parse it into sentences
sent <- stringi::stri_split_boundaries(doc, type =
"sentence")[[ 1 ]]
names(sent) <- seq_along(sent) # so we know index and
order
# embed the sentences in the model
e <- CreateDtm(sent, ngram_window = c(1,1), verbose =
FALSE, cpus = 2)
# remove any documents with 2 or fewer words
e <- e[ rowSums(e) > 2 , ]
vocab <- intersect(colnames(e), colnames(gamma))
e <- e / rowSums(e)
e <- e[ , vocab ] %*% t(gamma[ , vocab ])
e <- as.matrix(e)
# get the pairwise distances between each embedded sentence
e_dist <- CalcHellingerDist(e)
# turn into a similarity matrix
g <- (1 - e_dist) * 100
# we don't need sentences connected to themselves
diag(g) <- 0
# turn into a nearest-neighbor graph
g <- apply(g, 1, function(x){
x[ x < sort(x, decreasing = TRUE)[ 3 ] ] <- 0
x
})
# by taking pointwise max, we'll make the matrix symmetric
again
g <- pmax(g, t(g))
g <- graph.adjacency(g, mode = "undirected", weighted = TRUE)
# calculate eigenvector centrality
ev <- evcent(g)
# format the result
result <- sent[ names(ev$vector)[ order(ev$vector, decreasing =
TRUE)[ 1:3 ] ] ]
result <- result[ order(as.numeric(names(result))) ]
paste(result, collapse = " ")
}

1) textmineR script
docs <- movie_review$review[ 1:3 ]
names(docs) <-
movie_review$id[ 1:3 ]
sums <- summarizer(docs, gamma =
embeddings$gamma)
sums
#>
2595_9
#> "Brilliant adaptation of the largely interior
monologues of Leopold Bloom, Stephen
Dedalus, and Molly Bloom by Joseph Strick in
recreating the endearing portrait of Dublin on
June 16, 1904 - Bloomsday - a day to be
celebrated - double entendre intended! Bravo
director Strick, screenwriter Haines, as well as
casting director and cinematographer in
creating this masterpiece. Gunter Grass' novel,
The Tin Drum filmed by Volker Schlndorff
(1979)is another fine film adaptation of interior
monologue which I favorably compare with
Strick's film.While there are clearly recognized
Dublin landmarks in the original novel and in
the film, there are also recognizable characters,
although with different names in the novel. "

2) LSAfun
• 2) LSAfun: Applied Latent Semantic Analysis
(LSA) Functions
※ LSAfunは、性能の良いマシーンで、次回に。
• https://sites.google.com/site/fritzgntr/software-resources
• https://cran.r-project.org/web/packages/LSAfun/index.html

まとめ
• 自動文書要約は、やはり主観評価だと思い
ます。
• 評価指標はあるものの、文書の長さと各人の
目的により、満足度となる効用は異なると考
えられるのではないかと。。。
• Rで簡易的に、いくつかの異なる手法で、比
較考察できるのはメリットだと思われます。
• なので、あまり深入りしても、あのー、そのー
的な??(すみません)

Osaka.R もくもく会 20200606 14-18時　太田博三

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Mehr von 博三太田

Mehr von 博三太田 (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)