The Effects of Data Size and Frequency
Range on Distributional Semantic Models
Magnus Sahlgren and Alessandro Lenci, Proceedings of the 2016
Conference on EMNLP, pp.975-980, 2016
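Figures and tables are taken from the paper
Paper review
2017.02.03
Natural Language Processing Laboratory, second-year master's student, Kanji Takahashi (髙橋寛治)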
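Overview
A study of Distributional Semantic Models (DSMs) with respect to
Corpora of different data sizes
The frequency of the target words
Findings
Neural network models are weak when the amount of data is small
When the amount of data is small, singular value decomposition (SVD) is the better choice
Future work: consider, for example, combining models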
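Introduction
DSMs are widely used in natural language processing
• Dimensionality reduction
• Similarity computation
The choice of model is rarely given much weight
This paper investigates the following
Performance as a function of data size
Performance on low-frequency words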
Distributional Semantic Models (DSMs)
Models used in the experiments
• Simple co-occurrence model (PMI)
• Matrix model (SVD)
• Random indexing
• Neural network model (word2vec)
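As an illustration of the simplest model in this list, here is a minimal sketch of co-occurrence counting followed by positive PMI weighting (the PPMI variant named later in the frequency comparison). The window of 2 words on each side follows the experimental setup in this deck. The toy corpus, the function names, and the base-2 logarithm are assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count (target, context) pairs within a symmetric window of `window` words."""
    pair_counts = Counter()
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(target, tokens[j])] += 1
    return pair_counts

def ppmi(pair_counts):
    """Convert raw pair counts into positive PMI weights (non-positive values are dropped)."""
    total = sum(pair_counts.values())
    target_totals = Counter()
    context_totals = Counter()
    for (target, context), n in pair_counts.items():
        target_totals[target] += n
        context_totals[context] += n
    weights = {}
    for (target, context), n in pair_counts.items():
        # PMI = log2( P(target, context) / (P(target) * P(context)) )
        pmi = math.log2(n * total / (target_totals[target] * context_totals[context]))
        if pmi > 0:
            weights[(target, context)] = pmi
    return weights

# Toy corpus (hypothetical, for illustration only).
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
counts = cooccurrence_counts(corpus, window=2)
weights = ppmi(counts)
```

The result is a sparse mapping from word pairs to weights. At the scale described in this deck it would be stored as a sparse matrix rather than a Python dictionary.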
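Experiments
ukWaC corpus: 1.6 billion words
Co-occurrence: parameters are kept the same across models (window of ±2 words)
The ukWaC co-occurrence matrix is 4 million × 4 million; it is first reduced to 50,000 dimensions
TSVD: 200 dimensions; ISVD: 2,800 dimensions (200-3000)
RI: 2,000 dimensions; CBOW and SGNS: 200 dimensions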
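The difference between the two SVD settings can be sketched as a choice of which latent factors to keep. The sketch below reads "(200-3000)" as dropping the 200 strongest factors and keeping the next 2,800. That reading, the scaling of vectors by the singular values, the function names, and the random toy matrix are all assumptions for illustration. It uses NumPy's `numpy.linalg.svd` and is not the authors' implementation.

```python
import numpy as np

def svd_embeddings(matrix):
    """Return row embeddings U * S, with factors ordered by decreasing singular value."""
    u, s, _vt = np.linalg.svd(matrix, full_matrices=False)
    return u * s

def tsvd(matrix, keep):
    """Truncated SVD: keep only the `keep` strongest factors."""
    return svd_embeddings(matrix)[:, :keep]

def isvd(matrix, skip, upper):
    """Inverted truncation: drop the `skip` strongest factors, keep the rest up to `upper`."""
    return svd_embeddings(matrix)[:, skip:upper]

# Toy stand-in for a weighted co-occurrence matrix (hypothetical values).
rng = np.random.default_rng(0)
toy = rng.random((8, 6))

dense_top = tsvd(toy, keep=2)            # analogous to TSVD with 200 dimensions
dense_tail = isvd(toy, skip=2, upper=5)  # analogous to ISVD with skip=200, upper=3000
print(dense_top.shape, dense_tail.shape)
```

On a matrix of the size used in the experiments, a full dense decomposition like this would not be practical. A sparse or randomized solver would be needed, and that part is left out of the sketch.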
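Benchmarks
Two synonym selection tasks (evaluated by accuracy)
Three similarity/relatedness tasks (evaluated by Spearman's rank correlation coefficient)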
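The two evaluation measures can be sketched as follows. It is assumed here that a synonym question is answered by picking the candidate with the highest cosine similarity to the target, and that similarity tasks compare model cosines against human ratings. The toy vectors, test items, ratings, and function names are hypothetical and are not the benchmark data used in the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def synonym_accuracy(vectors, questions):
    """questions: (target, candidates, gold). Choose the candidate closest to the target."""
    correct = 0
    for target, candidates, gold in questions:
        guess = max(candidates, key=lambda c: cosine(vectors[target], vectors[c]))
        correct += guess == gold
    return correct / len(questions)

def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy vectors and test items, for illustration only.
vectors = {
    "car": [1.0, 0.1, 0.0],
    "automobile": [0.9, 0.2, 0.0],
    "banana": [0.0, 1.0, 0.1],
    "river": [0.0, 0.1, 1.0],
    "stone": [0.1, 0.0, 0.9],
}
questions = [("car", ["automobile", "banana", "river", "stone"], "automobile")]
accuracy = synonym_accuracy(vectors, questions)

pairs = [("car", "automobile"), ("car", "banana"), ("river", "stone"), ("banana", "river")]
gold_scores = [9.5, 1.0, 6.0, 0.5]  # hypothetical human ratings
model_scores = [cosine(vectors[a], vectors[b]) for a, b in pairs]
rho = spearman(gold_scores, model_scores)
```

With four candidates per question, random guessing would already reach about 25% accuracy, so small differences between models on the synonym tasks should be read with that baseline in mind.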
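Comparison by data size
Small corpora
Neural networks are weak
Singular value decomposition is strong
Note: comparisons on small corpora may be hard to interpret
With four choices, even random guessing gives 25%
ISVD is good overall
Apart from the simple co-occurrence model, the differences are not dramatic
Neural networks
Performance improves as the data size grows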
Comparison by data size
Average of the scores
Comparison by frequency
Comparison across frequency ranges
High frequency (1,387), medium frequency (656), low frequency (350), mixed (3,458)
Average of the scores. All models trained on 1 billion words.
Comparison by frequency
ISVD
Performs well on MEDIUM and MIXED
Comparison by frequency
Neural network-based models
The higher the frequency, the better they perform
Comparison by frequency
Strong in the medium-frequency range
CO, PPMI, TSVD, ISVD
Comparison by frequency
Even a neural network model can be strong on low-frequency words
CBOW is the strongest in the low-frequency range
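Summary
Investigated the effects of data size and frequency on DSMs
Neural networks are weak on small data
The other DSMs are better suited to small data
ISVD was robust
Future work on how the models are used
Use a different model for each frequency range
Tune parameters according to frequency and data size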