PR12 Paper Review: Distributed Representations of Sentences and Documents

keunbong kwak

This is the presentation material for TensorFlow KR's PR12. The paper is commonly known as doc2vec. As a follow-up to word2vec, it proposes a method for computing embeddings of sentences or documents.

Distributed Representations of
Sentences and Documents
(2014) Quoc Le, Tomas Mikolov
Presenter: Keunbong Kwak

© NBT All Rights Reserved.
Why I chose this paper
The cold start problem in recommendation engines
Content tagging
Content similarity
Content embedding

References
Lucy Park's PyCon presentation
https://www.lucypark.kr/docs/2015-pyconkr/#1
Ratsgo's blog
https://ratsgo.github.io/natural%20language%20processing/2017/03/08/word2vec/
PyData 2017 presentation
https://www.youtube.com/watch?v=zFScws0mb7M

Overview
Doc2Vec, the successor to Word2Vec
• Proposes a new embedding method for analyzing sentences
• Builds on the ideas of Word2Vec
• Two models: PV-DM & PV-DBOW

Problem definition: what this paper set out to solve
• An embedding that can capture the characteristics of sentences, paragraphs, and documents
• Better classification performance through embeddings

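Word2Vec
Let's learn words based on the positions in which they appear (skip-gram)
Example sentence: “저는 딥러닝을 활용한 자연어처리에 관심이 많습니다” ("I am very interested in natural language processing that uses deep learning")
[Figure: the center word at position t predicts the output context words in an m-word window on each side, with probabilities P(w_{t-2} | w_t), P(w_{t-1} | w_t), P(w_{t+1} | w_t), P(w_{t+2} | w_t)]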
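The skip-gram setup on this slide can be sketched in a few lines of Python. This is only an illustration: the whitespace tokenization, the function name `skipgram_pairs`, and the window size of 2 are assumptions made for the example, not details taken from the talk.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every position t in tokens."""
    pairs = []
    for t, center in enumerate(tokens):
        # Context positions t-window .. t+window, excluding t itself
        # and clipped at the sentence boundaries.
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        for j in range(lo, hi):
            if j != t:
                pairs.append((center, tokens[j]))
    return pairs


# Whitespace tokenization is an assumption made for this sketch.
tokens = "저는 딥러닝을 활용한 자연어처리에 관심이 많습니다".split()
pairs = skipgram_pairs(tokens, window=2)

# Show the prediction targets when the third token is the center word.
for center, context in pairs:
    if center == tokens[2]:
        print(f"P({context} | {center})")
```

Each pair corresponds to one term P(context | center) that the skip-gram model is trained to make large.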

Word2Vec
Training results
Semantic and syntactic information is captured!

Model description: PV-DM (example)
“나는 배가 고파서 밥을 먹었다” ("I was hungry, so I ate")

Paragraph Dictionary
ID  Paragraph
1   나는 배가 고파서 밥을 먹었다

Word Dictionary
ID  Word
1   나는
2   배가
3   고파서
4   밥을
5   먹었다
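As a rough sketch, the two dictionaries on this slide could be built as follows. Splitting on whitespace and numbering from 1 in order of first appearance are assumptions chosen to match the slide, and the variable names are made up for the example.

```python
paragraphs = ["나는 배가 고파서 밥을 먹었다"]

# Paragraph dictionary: ID -> paragraph text (IDs start at 1, as on the slide).
paragraph_dict = {i: p for i, p in enumerate(paragraphs, start=1)}

# Word dictionary: ID -> word, numbered in order of first appearance.
word_to_id = {}
for p in paragraphs:
    for w in p.split():
        if w not in word_to_id:
            word_to_id[w] = len(word_to_id) + 1
word_dict = {i: w for w, i in word_to_id.items()}

print(paragraph_dict)
print(word_dict)
```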

Model description: PV-DM (example)
“나는 배가 고파서 밥을 먹었다”

Paragraph Embedding
ID  Paragraph Embedding
1   [0.5, 0.41, 0.55 …]

Word Embedding
ID  Word Embedding
1   [0.2, 0.11, 0.55 …]
2   [0.9, 0.41, 0.75 …]
3   [0.4, 0.15, 0.53 …]
4   [0.3, 0.78, 0.48 …]
5   [0.6, 0.23, 0.12 …]

Model description: PV-DM (example)
“나는 배가 고파서 밥을 먹었다”
Step Input Label
1 [나는 배가 고파서 밥을 먹었다, 나는, 배가, 고파서] 밥을
2 [나는 배가 고파서 밥을 먹었다, 배가, 고파서, 밥을] 먹었다

Model description: PV-DM (example)
“나는 배가 고파서 밥을 먹었다”
Step Input Label
1 [d1, w1, w2, w3] w4
2 [d1, w2, w3, w4] w5
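The table above can be reproduced with a small sliding-window helper. This is a sketch: the function name `pvdm_samples` and the fixed context of three preceding words are assumptions read off the slide's two rows.

```python
def pvdm_samples(paragraph_id, word_ids, context_size=3):
    """Pair the paragraph ID plus `context_size` consecutive words
    with the word that immediately follows them."""
    samples = []
    for i in range(len(word_ids) - context_size):
        context = word_ids[i:i + context_size]
        label = word_ids[i + context_size]
        inputs = [f"d{paragraph_id}"] + [f"w{w}" for w in context]
        samples.append((inputs, f"w{label}"))
    return samples


samples = pvdm_samples(1, [1, 2, 3, 4, 5])
for step, (inputs, label) in enumerate(samples, start=1):
    print(step, inputs, label)
```

The same paragraph ID appears in every sample drawn from that paragraph, which is what lets its vector act as a memory shared across windows.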

Model description: PV-DM (example)
“나는 배가 고파서 밥을 먹었다”
Step  Input  Label
1  [[0.5, 0.41, 0.55 …], [0.2, 0.11, 0.55 …], [0.9, 0.41, 0.75 …], [0.4, 0.15, 0.53 …]]  [0,0,0,0,1,0]
2  [[0.5, 0.41, 0.55 …], [0.9, 0.41, 0.75 …], [0.4, 0.15, 0.53 …], [0.3, 0.78, 0.48 …]]  [0,0,0,0,0,1]
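One way to read step 1 of this table as a computation is sketched below. Several things here are assumptions for illustration: the vectors are cut to the three components visible on the slide (the elided "…" parts are not reconstructed), the label is read as a one-hot vector indexed by word ID with slot 0 unused, the inputs are combined by concatenation, and the softmax weights `W` are random placeholders rather than trained values.

```python
import math
import random

# First three components shown on the slide; the elided parts are left out.
paragraph_emb = {1: [0.5, 0.41, 0.55]}
word_emb = {
    1: [0.2, 0.11, 0.55],
    2: [0.9, 0.41, 0.75],
    3: [0.4, 0.15, 0.53],
    4: [0.3, 0.78, 0.48],
    5: [0.6, 0.23, 0.12],
}
NUM_CLASSES = 6  # the slide's labels have six entries; slot 0 is unused here


def one_hot(word_id, size=NUM_CLASSES):
    return [1 if i == word_id else 0 for i in range(size)]


def concat_inputs(paragraph_id, context_ids):
    """Concatenate the paragraph vector with the context word vectors."""
    vec = list(paragraph_emb[paragraph_id])
    for w in context_ids:
        vec.extend(word_emb[w])
    return vec


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical, randomly initialised softmax weights (NUM_CLASSES x input size).
rng = random.Random(0)
x = concat_inputs(1, [1, 2, 3])          # step 1 input: [d1, w1, w2, w3]
W = [[rng.uniform(-0.1, 0.1) for _ in x] for _ in range(NUM_CLASSES)]

probs = softmax([sum(wi * xi for wi, xi in zip(row, x)) for row in W])
target = one_hot(4)                       # step 1 label: w4
loss = -math.log(probs[target.index(1)])  # cross-entropy for this one sample
print(target, round(loss, 3))
```

During training, the gradient of this loss would update the paragraph vector, the word vectors, and the softmax weights together.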

PV-DM Inference Stage
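In the paper, the vector for an unseen paragraph is obtained by gradient descent on that paragraph's vector alone, while the word vectors and softmax weights stay fixed. The toy sketch below illustrates that idea only: the sizes, the random stand-in parameters, the missing bias term, and the names `infer_paragraph_vector` and `paragraph_loss_and_grad` are all assumptions, not the paper's or the talk's implementation.

```python
import math
import random

rng = random.Random(0)
DIM, VOCAB, CONTEXT = 3, 6, 3   # toy sizes; vocabulary slot 0 is unused

# Stand-ins for parameters learned during training; they stay frozen below.
word_emb = [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(VOCAB)]
W = [[rng.uniform(-0.5, 0.5) for _ in range(DIM * (CONTEXT + 1))]
     for _ in range(VOCAB)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def paragraph_loss_and_grad(d, samples):
    """Summed cross-entropy over the samples and its gradient w.r.t. d only."""
    loss, grad = 0.0, [0.0] * DIM
    for context, label in samples:
        x = d + [v for w in context for v in word_emb[w]]
        probs = softmax([sum(a * b for a, b in zip(row, x)) for row in W])
        loss -= math.log(probs[label])
        for k in range(VOCAB):
            err = probs[k] - (1.0 if k == label else 0.0)
            for j in range(DIM):   # the first DIM inputs belong to d
                grad[j] += err * W[k][j]
    return loss, grad


def infer_paragraph_vector(word_ids, steps=200, lr=0.1):
    samples = [(word_ids[i:i + CONTEXT], word_ids[i + CONTEXT])
               for i in range(len(word_ids) - CONTEXT)]
    d = [rng.uniform(-0.5, 0.5) for _ in range(DIM)]  # new, random start
    first_loss, _ = paragraph_loss_and_grad(d, samples)
    for _ in range(steps):
        _, grad = paragraph_loss_and_grad(d, samples)
        d = [dj - lr * gj for dj, gj in zip(d, grad)]  # only d is updated
    last_loss, _ = paragraph_loss_and_grad(d, samples)
    return d, first_loss, last_loss


d_new, before, after = infer_paragraph_vector([2, 3, 4, 5, 1])
print(f"loss before: {before:.3f}, after: {after:.3f}")
```

Because `word_emb` and `W` are never modified, inferring a vector for one new paragraph does not disturb the vectors already learned for others.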

Model description: PV-DBOW (example)
“나는 배가 고파서 밥을 먹었다”
Step  Input  Label
1  [나는 배가 고파서 밥을 먹었다]  나는
2  [나는 배가 고파서 밥을 먹었다]  배가
3  [나는 배가 고파서 밥을 먹었다]  고파서
4  [나는 배가 고파서 밥을 먹었다]  밥을
5  [나는 배가 고파서 밥을 먹었다]  먹었다
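The PV-DBOW table can be sketched the same way: the only input is the paragraph ID, and each word in the paragraph serves as a label. The second helper follows the paper's description of one stochastic step (sample a text window, then a random word from it). Both function names and the window size of 3 are assumptions made for the example.

```python
import random


def pvdbow_samples(paragraph_id, word_ids):
    """Pair the paragraph ID with each of its words, as in the slide's table."""
    return [(f"d{paragraph_id}", f"w{w}") for w in word_ids]


def sample_pvdbow_pair(paragraph_id, word_ids, window=3, rng=random):
    """One stochastic step: pick a text window, then one target word from it."""
    start = rng.randrange(max(1, len(word_ids) - window + 1))
    target = rng.choice(word_ids[start:start + window])
    return paragraph_id, target


for step, (inp, label) in enumerate(pvdbow_samples(1, [1, 2, 3, 4, 5]), start=1):
    print(step, inp, label)

print(sample_pvdbow_pair(1, [1, 2, 3, 4, 5], rng=random.Random(0)))
```

Since no word vectors appear on the input side, PV-DBOW only needs to store the paragraph vectors and the softmax weights.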

Other findings
• PV-DM generally performs better than PV-DBOW
• In PV-DM, concatenating the paragraph and word vectors is often better than summing them
• A window size between 5 and 12 generally works well

Try it out
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb

Conclusion
• Documents can also be embedded using an idea similar to Word2Vec
• Performance is quite good
• Easy to put to use
