Taehyun Lim (임태현), Implementing a Sentiment Analysis Model with Text-CNN

Implementing a Sentiment Analysis Model
with Text-CNN
ATR Team
Taehyun Lim (임태현)
Deep Learning Study

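1. Overview
Inferring the sentiment (favorability) of a sentence with
machine learning is an already well-known technique.
Here we use movie reviews to train a model on the
sentiment of short sentences.
For the model we try a CNN, which is reported to be more
efficient than paragraph vector or RNN models.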

2. Theory
The model used in this experiment is based on the model in Yoon Kim's
"Convolutional Neural Networks for Sentence Classification".
Parts of the code were written with help from WildML.

3. Source code
• Python
• TensorFlow
• KoNLPy
• GitHub
– https://github.com/ioatr/textcnn

4. Data preparation
• Uses the Naver movie rating dataset
• Introduced at PyCon Korea 2015
• https://github.com/e9t/nsmc/

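4.1 Tokenization
KoNLPy is used to split each sentence into POS (part-of-speech: stem, prefix,
word class, and so on) units.
Because Korean attaches particles to words, separating the parts of speech first makes it easier to build word vectors.
from konlpy.tag import Twitter

pos_tagger = Twitter()

def tokenize(doc):
    # Normalize and stem, then join each (word, tag) pair as 'word/Tag'
    return ['/'.join(t) for t in pos_tagger.pos(doc, norm=True, stem=True)]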

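4.2 Sentence conversion
• Limit the maximum sentence length to 60 tokens
– to produce a fixed-size image
• Sentences shorter than 60 tokens are filled with PAD
def build_vocab(tokens):
    print('building vocabulary')
    vocab = dict()
    vocab['#UNKOWN'] = 0  # id for tokens that are not in the vocabulary
    vocab['#PAD'] = 1     # id used to pad short sentences
    for t in tokens:
        if t not in vocab:
            vocab[t] = len(vocab)
    return vocab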
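The padding rule in 4.2 can be sketched as follows. This is a minimal illustration, not the repository's code: `encode` is a hypothetical helper, and `max_len=6` is chosen only to keep the output short (the slides use 60).

```python
def build_vocab(tokens):
    """Map every distinct token to an integer id, as in section 4.2."""
    vocab = {'#UNKOWN': 0, '#PAD': 1}  # reserved ids, spelled as on the slides
    for t in tokens:
        if t not in vocab:
            vocab[t] = len(vocab)
    return vocab


def encode(tokens, vocab, max_len=60):
    """Hypothetical helper: turn one tokenized sentence into exactly max_len ids."""
    # Truncate long sentences and map unseen tokens to the #UNKOWN id
    ids = [vocab.get(t, vocab['#UNKOWN']) for t in tokens[:max_len]]
    # Fill the remainder with the #PAD id so every sentence has the same length
    ids += [vocab['#PAD']] * (max_len - len(ids))
    return ids


sentence = ['정말/Noun', '마음/Noun', '에/Josa']
vocab = build_vocab(sentence)
ids = encode(sentence + ['처음/Noun'], vocab, max_len=6)
print(ids)  # [2, 3, 4, 0, 1, 1]
```

The fourth token was never seen by `build_vocab`, so it maps to id 0, and the last two positions are padding.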

5. Building the model
• Convert the sentence into an n x k image.
• Build feature maps through convolution.
• Merge the feature maps into one.
• Produce the result through an FC layer.

TensorFlow model
[Figure: TensorFlow graph with nodes labeled embedding, convolution, batch input, dropout, hidden layer, and argmax]

5.1 Model parameters
sequence_length : 60, number of words in a sentence
num_classes : 2, [recommended, not recommended]
vocab_size : 48000, vocabulary size for word2vec
embedding_size : 128
filter_sizes : [3,4,5], convolution filter sizes
num_filters : 128, number of convolution channels
import numpy as np
import tensorflow as tf

class TextCNN(object):
    def __init__(self, sequence_length, num_classes, vocab_size,
                 embedding_size, filter_sizes, num_filters):
        # The body is shown piece by piece in sections 5.2 to 5.7.

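5.2 Input parameters
input : batch size x sentence length
label : batch size x 2 (recommended / not recommended)
dropout_keep_prob : probability of keeping a node under dropout
* Use 0.5 during training and 1 during actual evaluation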
input = tf.placeholder(tf.int32, [None, sequence_length], name='input')
label = tf.placeholder(tf.float32, [None, num_classes], name='label')
dropout_keep_prob = tf.placeholder(tf.float32, name='dropout_keep_prob')

5.3 Word embedding
W = [vocab_size, embedding_size]
embedded_chars_base = [None, sequence_length, embedding_size]
embedded_chars = [None, sequence_length, embedding_size, 1]
with tf.name_scope('embedding'):
    W = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0), name='W')
    embedded_chars_base = tf.nn.embedding_lookup(W, input)
    # Add a trailing channel axis so that conv2d can treat the sentence as an image
    embedded_chars = tf.expand_dims(embedded_chars_base, -1)

[Figure: tf.nn.embedding_lookup. The input, shape [None, sequence_length], holds one word id per token (이것 = 0, 정말 = 3, 좋은 = 1, 것 = 4, 같다 = 5). Each id selects the matching row of W, shape [vocab_size, embedding_size] (vector 0, vector 3, vector 1, vector 4, vector 5), which gives embedded_chars, shape [None, sequence_length, embedding_size].]
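The shapes in 5.3 can be checked without TensorFlow. The sketch below imitates the lookup and the added channel axis with plain Python lists; the tiny sizes and the `shape` helper are assumptions made for illustration, and only the word ids 0, 3, 1, 4, 5 come from the figure.

```python
import random

vocab_size, embedding_size = 6, 4
rng = random.Random(0)

# W: one row of embedding_size values per word id, uniform in [-1, 1]
W = [[rng.uniform(-1.0, 1.0) for _ in range(embedding_size)]
     for _ in range(vocab_size)]

# A batch holding one sentence, written as the word ids from the figure
batch = [[0, 3, 1, 4, 5]]

# Lookup: replace every id with the matching row of W
embedded_chars_base = [[W[i] for i in sent] for sent in batch]

# Equivalent of expand_dims(..., -1): wrap each scalar in a length-1 list
embedded_chars = [[[[x] for x in vec] for vec in sent]
                  for sent in embedded_chars_base]


def shape(a):
    """Illustrative helper: nested-list dimensions, outermost first."""
    dims = []
    while isinstance(a, list):
        dims.append(len(a))
        a = a[0]
    return dims


print(shape(embedded_chars_base))  # [1, 5, 4]
print(shape(embedded_chars))       # [1, 5, 4, 1]
```

The second token has id 3, so its embedding is row 3 of W; the trailing axis of size 1 plays the role of the image channel expected by the convolution.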

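5.4 Convolution
filter_size : number of words each convolution covers, one of [3,4,5]
k : pooling window, the set of word positions in the sentence from which a feature is extracted
* k changes together with the filter size, since several feature maps are built
with tf.name_scope('conv-maxpool-%s' % filter_size):
    filter_shape = [filter_size, embedding_size, 1, num_filters]
    W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1))
    b = tf.Variable(tf.constant(0.1, shape=[num_filters]))
    conv = tf.nn.conv2d(embedded_chars, W, strides=[1, 1, 1, 1], padding='VALID')
    h = tf.nn.relu(tf.nn.bias_add(conv, b))
    # Pool over every position of the feature map, leaving one value per filter
    k = [1, sequence_length - filter_size + 1, 1, 1]
    pooled = tf.nn.max_pool(h, ksize=k, strides=[1, 1, 1, 1], padding='VALID')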
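What the convolution and pooling compute for one sentence can be sketched in plain Python. This is an illustration under assumptions, not the repository's code: `conv_maxpool` is a hypothetical function, the sizes are tiny, and the weights come from an ordinary normal distribution rather than a truncated one. The slides do not show how the pooled outputs of the three filter sizes are merged; concatenating them into a vector named `h_pool_flat` of length `total` is an assumption based on the names used in 5.5 and 5.6.

```python
import random


def conv_maxpool(sentence, filters, bias):
    """VALID convolution over word positions, ReLU, then max over all positions.

    sentence: sequence_length rows of embedding_size floats
    filters:  num_filters filters, each filter_size x embedding_size
    returns:  one pooled value per filter
    """
    filter_size = len(filters[0])
    # Number of window positions; same length as the pooling window k in 5.4
    positions = len(sentence) - filter_size + 1
    pooled = []
    for f, b in zip(filters, bias):
        feature_map = []
        for start in range(positions):
            window = sentence[start:start + filter_size]
            s = sum(w * x
                    for f_row, x_row in zip(f, window)
                    for w, x in zip(f_row, x_row))
            feature_map.append(max(0.0, s + b))  # ReLU
        pooled.append(max(feature_map))  # max over the whole feature map
    return pooled


rng = random.Random(0)
sequence_length, embedding_size, num_filters = 8, 4, 2
filter_sizes = [3, 4, 5]
sentence = [[rng.uniform(-1, 1) for _ in range(embedding_size)]
            for _ in range(sequence_length)]

h_pool_flat = []
for filter_size in filter_sizes:
    filters = [[[rng.gauss(0, 0.1) for _ in range(embedding_size)]
                for _ in range(filter_size)]
               for _ in range(num_filters)]
    h_pool_flat += conv_maxpool(sentence, filters, [0.1] * num_filters)

total = num_filters * len(filter_sizes)
print(len(h_pool_flat) == total)  # True
```

Each filter size contributes `num_filters` values regardless of the sentence length, which is why the merged vector has a fixed size.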

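5.5 Dropout
dropout_keep_prob : at 1 no node is dropped; the closer it gets to 0, the higher the drop rate.
* TensorFlow actually scales the kept output nodes by 1/keep_prob, so the output stays normalized
Dropout trains only a subset of the nodes at each step in order to prevent overfitting.
* Overfitting : the model fits the training data so closely that it gives poor results on new data
with tf.name_scope('dropout'):
    # h_pool_flat is not defined on the slides; presumably the pooled outputs
    # of all filter sizes, concatenated and flattened
    h_drop = tf.nn.dropout(h_pool_flat, dropout_keep_prob)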
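The 1/keep_prob scaling described in 5.5 can be illustrated with a few lines of plain Python. The `dropout` function below is a hypothetical stand-in written for this note, not TensorFlow's implementation.

```python
import random


def dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob and scale survivors by 1/keep_prob."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)
h = [1.0] * 10000

h_drop = dropout(h, 0.5, rng)
print(sorted(set(h_drop)))            # [0.0, 2.0]
print(sum(h_drop) / len(h_drop))      # close to 1.0, the mean before dropout

# With keep_prob = 1 (evaluation) nothing is dropped or rescaled
print(dropout([0.3, 0.7], 1.0, rng))  # [0.3, 0.7]
```

Because the surviving values are doubled when half of them are dropped, the average activation stays roughly the same between training and evaluation.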

5.6 Output
argmax : a function that returns the index of the largest element in a vector
v = [0, 0.1, 0.3, 0.99, 0.4]
argmax(v) = 3 (v[3] = 0.99 is the largest)
output : argmax(W x h + b)
with tf.name_scope('output'):
    # total is not defined on the slides; presumably the length of h_drop
    W = tf.get_variable(shape=[total, num_classes], name='W')
    b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name='b')
    scores = tf.nn.xw_plus_b(h_drop, W, b, name='scores')
    predictions = tf.argmax(scores, 1, name='predictions')
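The output step in 5.6 amounts to a matrix product followed by argmax. A minimal sketch in plain Python, where the helper functions and all weight values are made up for illustration:

```python
def xw_plus_b(x, W, b):
    """One row of x @ W + b; W has len(x) rows and len(b) columns."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]


def argmax(v):
    """Index of the largest element of v."""
    return max(range(len(v)), key=lambda i: v[i])


print(argmax([0, 0.1, 0.3, 0.99, 0.4]))  # 3

h_drop = [1.0, 2.0, 0.5]                    # illustrative features, total = 3
W = [[0.2, -0.1], [0.4, 0.3], [-1.0, 0.6]]  # shape [total, num_classes]
b = [0.1, 0.1]

scores = xw_plus_b(h_drop, W, b)  # roughly [0.6, 0.9]
prediction = argmax(scores)
print(prediction)  # 1
```

The prediction is only the index of the larger score; which index stands for which label depends on how the labels were encoded.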

5.7 Loss
Cross entropy : a measure of the difference between two probability distributions
$\mathrm{CrossEntropy}(p, q) = -\sum_{x} p(x) \log q(x)$
with tf.name_scope('loss'):
    losses = tf.nn.softmax_cross_entropy_with_logits(logits=scores, labels=label)
    loss = tf.reduce_mean(losses)
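The loss in 5.7 can be reproduced by hand: apply softmax to the scores, take the cross entropy against the one-hot label, then average over the batch. The sketch below uses plain Python; the score and label values are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p, q):
    """-sum_x p(x) log q(x); terms with p(x) == 0 contribute nothing."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)


batch_scores = [[0.6, 0.9], [2.0, -1.0]]  # illustrative logits
batch_labels = [[0.0, 1.0], [1.0, 0.0]]   # one-hot labels

losses = [cross_entropy(label, softmax(scores))
          for scores, label in zip(batch_scores, batch_labels)]
loss = sum(losses) / len(losses)  # same role as tf.reduce_mean
print(loss)
```

With equal scores the softmax is uniform, so for two classes the cross entropy against a one-hot label is log 2; a confident correct prediction drives the loss toward 0.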

6. Results
[Figure: mean cross-entropy (vertical axis, 0 to 0.8; horizontal axis, 100 to 9700)]

[Figure: accuracy on the validation data (vertical axis, 0 to 1.2; horizontal axis, 100 to 9700)]

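6.1 Test samples
Enter a user review as a sentence: 정말 마음에 드네요 ("I really like it")
Input sentence split into the following tokens:
['정말/Noun', '마음/Noun', '에/Josa', '드네/Noun', '요/Josa']
Recommended
Enter a user review as a sentence: 에이 이건 정말 아니다... 이게 뭐야 ("Ugh, this really isn't it... what is this")
['에이/Noun', '이/Determiner', '것/Noun', '은/Josa', '정말/Noun',
'아니다/Adjective', '.../Punctuation', '이/Noun', '게/Josa', '뭐/Noun', '야/Josa']
Not recommended
Enter a user review as a sentence: 와 말도안돼 짱이다 ("Wow, unbelievable, it's the best")
['와/Noun', '말/Noun', '도안/Noun', '돼다/Verb', '짱/Noun', '이다/Josa']
Recommended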
