[Keras] English Sentence Sentiment (Positive/Negative) Analysis Example Code - Racist?
I followed an example from a book, and it works reasonably well.
The problem is that this thing is a racist!
...
Epoch 19/20
4960/4960 - 1s 150us/step - loss: 8.6177e-04 - acc: 0.9998 - val_loss: 0.0182 - val_acc: 0.9934
Epoch 20/20
4960/4960 - 1s 144us/step - loss: 5.9968e-04 - acc: 0.9998 - val_loss: 0.0186 - val_acc: 0.9953
Input a test sentence : I love it!
Positive : 99.99902248382568%, Negative : 0.0009764923561306205%
Input a test sentence : I hate it...
Positive : 0.004776037167175673%, Negative : 99.99521970748901%
Input a test sentence : white
Positive : 51.50456428527832%, Negative : 48.49543273448944%
Input a test sentence : black
Positive : 48.19612205028534%, Negative : 51.80387496948242%
Input a test sentence : white person
Positive : 53.274279832839966%, Negative : 46.725720167160034%
Input a test sentence : black person
Positive : 38.23862969875336%, Negative : 61.761367321014404%
Input a test sentence : You are a racist!
Positive : 80.90205788612366%, Negative : 19.09794509410858%
In the positive scores, "white" comes out about 3% higher than "black",
and "white person" a full 15% higher than "black person".
The negative scores mirror the same gaps, since the two outputs sum to 100%.
The dataset is UMICH SI650 from Kaggle.
Since the data comes from things actual people said, it's bound to carry some skewed subjectivity.
Of course, that's just my guess.
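One quick, mundane thing to check along those lines is whether the labels themselves are imbalanced. Here is a minimal sketch of mine (not from the book) that only assumes the same label<TAB>sentence format as the training.txt used below:

import codecs
import collections

label_counts = collections.Counter()
with codecs.open("training.txt", "r", encoding="utf-8") as f:
    for line in f:
        label, _ = line.strip().split("\t")
        label_counts[label] += 1

# Shows how many "1" (positive) vs "0" (negative) examples there are.
print(label_counts)

A heavy skew toward one label would bias every prediction before word choice even enters the picture.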
Below is the practice code.
from keras.layers.core import Dense, SpatialDropout1D
from keras.layers.convolutional import Conv1D
from keras.layers.embeddings import Embedding
from keras.layers.pooling import GlobalMaxPooling1D
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.utils import np_utils
from sklearn.model_selection import train_test_split
import collections
import matplotlib.pyplot as plt
import nltk
import numpy as np
import codecs
np.random.seed(42)

INPUT_FILE = "training.txt"
VOCAB_SIZE = 5000
EMBED_SIZE = 100
NUM_FILTERS = 256
NUM_WORDS = 3  # Conv1D kernel size (n-gram width)
BATCH_SIZE = 64
NUM_EPOCHS = 20
# First pass: count word frequencies and find the longest sentence.
counter = collections.Counter()
fin = codecs.open(INPUT_FILE, "r", encoding='utf-8')
maxlen = 0
for line in fin:
    _, sent = line.strip().split("\t")
    words = [x.lower() for x in nltk.word_tokenize(sent)]
    if len(words) > maxlen:
        maxlen = len(words)
    for word in words:
        counter[word] += 1
fin.close()
# Map the most common words to indices; 0 is reserved for padding/unknown words.
word2index = collections.defaultdict(int)
for wid, word in enumerate(counter.most_common(VOCAB_SIZE)):
    word2index[word[0]] = wid + 1
vocab_sz = len(word2index) + 1
index2word = {v: k for k, v in word2index.items()}
# Second pass: convert each sentence to a sequence of word indices.
xs, ys = [], []
fin = codecs.open(INPUT_FILE, "r", encoding='utf-8')
for line in fin:
    label, sent = line.strip().split("\t")
    ys.append(int(label))
    words = [x.lower() for x in nltk.word_tokenize(sent)]
    wids = [word2index[word] for word in words]
    xs.append(wids)
fin.close()

# Pad every sequence to the same length and one-hot encode the labels.
X = pad_sequences(xs, maxlen=maxlen)
Y = np_utils.to_categorical(ys)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3,
                                                    random_state=42)
# 1D CNN over word embeddings: embed -> dropout -> conv -> global max pool -> softmax.
model = Sequential()
model.add(Embedding(vocab_sz, EMBED_SIZE, input_length=maxlen))
model.add(SpatialDropout1D(0.2))
model.add(Conv1D(filters=NUM_FILTERS, kernel_size=NUM_WORDS,
                 activation="relu"))
model.add(GlobalMaxPooling1D())
model.add(Dense(2, activation="softmax"))
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(x_train, y_train, batch_size=BATCH_SIZE,
                    epochs=NUM_EPOCHS,
                    validation_data=(x_test, y_test))
def predict(text):
    # Tokenize, map to indices (unknown words become 0), pad, and classify.
    words = [x.lower() for x in nltk.word_tokenize(text)]
    wids = [word2index[word] for word in words]
    x_predict = pad_sequences([wids], maxlen=maxlen)
    y_predict = model.predict(x_predict)
    print(f"Positive : {y_predict[0][1] * 100}%,",
          f"Negative : {y_predict[0][0] * 100}%")

# Interactive loop: type a sentence, or "q" to quit.
while True:
    text = input("Input a test sentence : ")
    if text == "q":
        break
    predict(text)
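To poke at where the gap lives, here is another sketch of mine (not part of the book example; it assumes the trained model and word2index from the code above are still in scope). It compares the learned embedding vectors of "white" and "black" against clearly sentiment-laden words:

from numpy.linalg import norm

# The first layer of the Sequential model is the Embedding layer;
# its weight matrix has shape (vocab_sz, EMBED_SIZE).
emb = model.layers[0].get_weights()[0]

def cosine(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

for probe in ["white", "black"]:
    for anchor in ["love", "hate"]:
        # word2index maps unseen words to index 0 (the padding row),
        # so this only makes sense for words actually in the vocabulary.
        sim = cosine(emb[word2index[probe]], emb[word2index[anchor]])
        print(f"{probe} ~ {anchor}: {sim:.3f}")

If "black" sits measurably closer to "hate" than "white" does, the skew is already baked into the embedding itself rather than only the classifier layer.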