NeuroWhAI's Misc Blog

[Keras] English Sentence Sentiment (Positive/Negative) Analysis Example Code - Racism?

NeuroWhAI 2018. 12. 7. 18:59


I followed along with an example from a book, and it works reasonably well.

The problem is that this thing is a racist!

...
Epoch 19/20
4960/4960 - 1s 150us/step - loss: 8.6177e-04 - acc: 0.9998 - val_loss: 0.0182 - val_acc: 0.9934
Epoch 20/20
4960/4960 - 1s 144us/step - loss: 5.9968e-04 - acc: 0.9998 - val_loss: 0.0186 - val_acc: 0.9953
Input a test sentence : I love it!
Positive : 99.99902248382568%, Negative : 0.0009764923561306205%
Input a test sentence : I hate it...
Positive : 0.004776037167175673%, Negative : 99.99521970748901%
Input a test sentence : white
Positive : 51.50456428527832%, Negative : 48.49543273448944%
Input a test sentence : black
Positive : 48.19612205028534%, Negative : 51.80387496948242%
Input a test sentence : white person
Positive : 53.274279832839966%, Negative : 46.725720167160034%
Input a test sentence : black person
Positive : 38.23862969875336%, Negative : 61.761367321014404%
Input a test sentence : You are a racist!
Positive : 80.90205788612366%, Negative : 19.09794509410858%

In the positive scores, "white" comes out about 3% higher than "black",

and "white person" a full 15% higher than "black person".

The negative scores show the same gaps, just flipped.


The dataset is UMICH SI650 from Kaggle.

Since the data comes from things real people said, it is bound to carry some biased, subjective views.

Of course, that is just my guess.
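
For reference, the code below reads training.txt as one example per line in "label<TAB>sentence" form, and it treats label 1 as positive and 0 as negative. The two lines here are made-up illustrations of that format, not actual rows from the dataset:

1	I loved this movie so much!
0	I hate waiting in line...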


Below is the practice code.

from keras.layers.core import Dense, SpatialDropout1D
from keras.layers.convolutional import Conv1D
from keras.layers.embeddings import Embedding
from keras.layers.pooling import GlobalMaxPooling1D
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.utils import np_utils
from sklearn.model_selection import train_test_split
import collections
import matplotlib.pyplot as plt
import nltk
import numpy as np
import codecs

np.random.seed(42)

INPUT_FILE = "training.txt"
VOCAB_SIZE = 5000
EMBED_SIZE = 100
NUM_FILTERS = 256
NUM_WORDS = 3
BATCH_SIZE = 64
NUM_EPOCHS = 20

# First pass over the data: count word frequencies and track the longest sentence (for padding).
counter = collections.Counter()
fin = codecs.open(INPUT_FILE, "r", encoding='utf-8')
maxlen = 0
for line in fin:
  _, sent = line.strip().split("\t")
  words = [x.lower() for x in nltk.word_tokenize(sent)]
  if len(words) > maxlen:
    maxlen = len(words)
  for word in words:
    counter[word] += 1
fin.close()

# Keep the VOCAB_SIZE most frequent words; index 0 is left for padding and unknown words.
word2index = collections.defaultdict(int)
for wid, word in enumerate(counter.most_common(VOCAB_SIZE)):
  word2index[word[0]] = wid + 1
vocab_sz = len(word2index) + 1
index2word = {v:k for k, v in word2index.items()}

# Second pass: convert each sentence into a sequence of word indices.
xs, ys = [], []
fin = codecs.open(INPUT_FILE, "r", encoding='utf-8')
for line in fin:
  label, sent = line.strip().split("\t")
  ys.append(int(label))
  words = [x.lower() for x in nltk.word_tokenize(sent)]
  wids = [word2index[word] for word in words]
  xs.append(wids)
fin.close()
# Pad every sequence to the same length and one-hot encode the labels.
X = pad_sequences(xs, maxlen=maxlen)
Y = np_utils.to_categorical(ys)

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3,
                                                   random_state=42)

# 1D CNN text classifier: embedding -> dropout -> convolution -> global max pooling -> softmax.
model = Sequential()
model.add(Embedding(vocab_sz, EMBED_SIZE, input_length=maxlen))
model.add(SpatialDropout1D(0.2))
model.add(Conv1D(filters=NUM_FILTERS, kernel_size=NUM_WORDS,
                activation="relu"))
model.add(GlobalMaxPooling1D())
model.add(Dense(2, activation="softmax"))

model.compile(optimizer="adam", loss="categorical_crossentropy",
             metrics=["accuracy"])

history = model.fit(x_train, y_train, batch_size=BATCH_SIZE,
                   epochs=NUM_EPOCHS,
                   validation_data=(x_test, y_test))

# Tokenize a sentence, map it to padded word indices, and print the model's softmax scores.
def predict(text):
  words = [x.lower() for x in nltk.word_tokenize(text)]
  wids = [word2index[word] for word in words]
  x_predict = pad_sequences([wids], maxlen=maxlen)
  y_predict = model.predict(x_predict)
  print(f"Positive : {y_predict[0][1] * 100}%,", f"Negative : {y_predict[0][0] * 100}%")
  
# Interactive test loop; type "q" to quit.
while True:
  text = input("Input a test sentence : ")
  if (text == "q"):
    break
  predict(text)
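
To check whether the skew also shows up inside the learned word embeddings, and not only in the final predictions, here is a small extra sketch I tried on top of the code above. It is my own probe, not part of the book's example, and it assumes that "love", "hate", "white", and "black" all made it into the vocabulary built above.

# Extra probe (not from the book): compare learned word vectors against a rough
# positive-negative direction built from "love" minus "hate".
def embedding_of(word):
  emb_matrix = model.layers[0].get_weights()[0]  # Embedding weights: (vocab_sz, EMBED_SIZE)
  return emb_matrix[word2index[word]]  # unknown words fall back to the padding row 0

def cosine(a, b):
  return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sentiment_axis = embedding_of("love") - embedding_of("hate")

for w in ["white", "black"]:
  print(w, cosine(embedding_of(w), sentiment_axis))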



