Guiding questions:
1. What does natural language processing involve?
2. How do we work with a bag-of-words embedding in TensorFlow?
3. How do we look at a histogram of the text lengths in the dataset?
4. How do we declare the loss function used to train the model?
Previous article: TensorFlow ML Cookbook, Chapter 6, Sections 6-7: Improving the Predictions of Linear Models and Learning to Play Tic Tac Toe
http://www.aboutyun.com/forum.php?mod=viewthread&tid=26654
Natural Language Processing
Here we cover an introduction to working with text in TensorFlow. We start by introducing how word embeddings work and using the bag-of-words method, and then we move on to implementing more advanced embeddings such as Word2vec and Doc2vec:
- Working with bag of words
- Implementing TF-IDF
- Working with Skip-gram embeddings
- Working with CBOW embeddings
- Making predictions with Word2vec
- Using Doc2vec for sentiment analysis
As a note, the reader may find all the code for this chapter online at https://github.com/nfmcclure/tensorflow_cookbook.
Introduction
Up to this point, we have only considered machine learning algorithms that mostly operate on numerical inputs. If we want to use text, we must find a way to convert the text into numbers. There are many ways to do this, and we will explore a few common approaches.
If we consider the sentence "TensorFlow makes machine learning easy", we could convert the words to numbers in the order that we observe them. The sentence would then become 1 2 3 4 5. Then, when we see a new sentence, "machine learning is easy", we can translate it as 3 4 0 5, denoting words we have not seen before with an index of zero. With these two examples, we have limited our vocabulary to six numbers. For large texts we can choose how many words we want to keep, and we usually keep the most frequent words, labeling everything else with the index zero.
If the word "learning" has a numerical value of 4 and the word "makes" has a numerical value of 2, it would be natural to assume that "learning" is twice "makes". Since we do not want this type of numerical relationship between words, we treat these numbers as categories rather than relational values.
Another problem is that the two sentences are of different sizes. Each observation we make (a sentence, in this case) needs to have the same input size as the model we wish to create. To get around this, we turn each sentence into a sparse vector that has a value of one at a specific index if the corresponding word occurs in the sentence, as sketched below:
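As a minimal sketch (our own toy illustration using the six-word vocabulary above, not code from the book), the two example sentences become fixed-length count vectors like this:
[mw_shl_code=python,true]# Toy bag-of-words illustration; index 0 is reserved for unknown words
vocab = {'tensorflow': 1, 'makes': 2, 'machine': 3, 'learning': 4, 'easy': 5}

def bag_of_words(sentence, vocab_size=6):
    vec = [0] * vocab_size
    for word in sentence.lower().split():
        vec[vocab.get(word, 0)] += 1   # unseen words land in slot 0
    return vec

print(bag_of_words('TensorFlow makes machine learning easy'))  # [0, 1, 1, 1, 1, 1]
print(bag_of_words('machine learning is easy'))                # [1, 0, 0, 1, 1, 1][/mw_shl_code]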
A disadvantage of this method is that we lose any indication of word order. The two sentences "TensorFlow makes machine learning easy" and "machine learning makes TensorFlow easy" result in the same sentence vector.
It is also worth noting that the length of these vectors equals the size of the vocabulary we pick. It is common to pick a very large vocabulary, so these sentence vectors can be very sparse. This type of embedding, which we have covered in this introduction, is called bag of words. We will implement it in the next section.
Another drawback is that the words "is" and "TensorFlow" both have the same numerical index value of one. We can imagine that the word "is" is probably much less important than the occurrence of the word "TensorFlow".
We will explore different types of embeddings in this chapter that attempt to address these issues, but first we start with an implementation of bag of words.
Working with bag of words
We start by showing how to work with a bag-of-words embedding in TensorFlow. This is the mapping we covered in the introduction. Here we show how to use this type of embedding to do spam prediction.
Getting ready
To illustrate how to use bag of words with a text dataset, we will use the spam-ham phone text database from the UCI machine learning data repository (https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). This is a collection of phone text messages that are either spam or not spam (ham). We will download this data, store it for future use, and then proceed with the bag-of-words method to predict whether a text is spam or not. The model that operates on the bag of words will be a logistic model with no hidden layers. We will use stochastic training with a batch size of one and compute the accuracy on a held-out test set at the end.
How to do it…
For this example, we will start by getting the data, normalizing and splitting the text, running it through an embedding function, and training a logistic function to predict spam:
1. The first task is to import the necessary libraries. Besides the usual libraries, we need a zip-file library to unzip the data we retrieve from the UCI machine learning website:
[mw_shl_code=python,true]import tensorflow as tf
import matplotlib.pyplot as plt
import os
import numpy as np
import csv
import string
import requests
import io
from zipfile import ZipFile
from tensorflow.contrib import learn
sess = tf.Session() [/mw_shl_code]
2. Instead of downloading the text data every time the script is run, we save it and check whether the file has already been saved. This prevents us from repeatedly downloading the data when we want to change the script parameters. After downloading, we extract the input and target data and change the target to be 1 for spam and 0 for ham:
[mw_shl_code=python,true]save_file_name = os.path.join('temp','temp_spam_data.csv')
if os.path.isfile(save_file_name):
    text_data = []
    with open(save_file_name, 'r') as temp_output_file:
        reader = csv.reader(temp_output_file)
        for row in reader:
            text_data.append(row)
else:
    zip_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
    r = requests.get(zip_url)
    z = ZipFile(io.BytesIO(r.content))
    file = z.read('SMSSpamCollection')
    # Format Data
    text_data = file.decode()
    text_data = text_data.encode('ascii',errors='ignore')
    text_data = text_data.decode().split('\n')
    text_data = [x.split('\t') for x in text_data if len(x)>=1]
    # And write to csv
    with open(save_file_name, 'w') as temp_output_file:
        writer = csv.writer(temp_output_file)
        writer.writerows(text_data)
texts = [x[1] for x in text_data]
target = [x[0] for x in text_data]
# Relabel 'spam' as 1, 'ham' as 0
target = [1 if x=='spam' else 0 for x in target][/mw_shl_code]
3. To reduce the potential vocabulary size, we normalize the text. To do this, we remove the influence of capitalization and numbers in the text. Use the following code:
[mw_shl_code=python,true]# Convert to lower case
texts = [x.lower() for x in texts]
# Remove punctuation
texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]
# Remove numbers
texts = [''.join(c for c in x if c not in '0123456789') for x in texts]
# Trim extra whitespace
texts = [' '.join(x.split()) for x in texts] [/mw_shl_code]
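To see what these four cleaning steps do, here is a small illustration on a made-up message (the text is our own example, not a row from the dataset); it can be run in the same script, since string is already imported:
[mw_shl_code=python,true]# Hypothetical raw message, cleaned with the same steps as above
raw = 'WINNER!! You have won 1000 pounds, call 09061701461 NOW.'
clean = raw.lower()
clean = ''.join(c for c in clean if c not in string.punctuation)
clean = ''.join(c for c in clean if c not in '0123456789')
clean = ' '.join(clean.split())
print(clean)   # 'winner you have won pounds call now'[/mw_shl_code]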
4. We must also determine the maximum sentence size. To do this, we look at a histogram of text lengths in the dataset. We see that a good cut-off might be around 25 words. Use the following code:
[mw_shl_code=python,true]# Plot histogram of text lengths
text_lengths = [len(x.split()) for x in texts]
text_lengths = [x for x in text_lengths if x < 50]
plt.hist(text_lengths, bins=25)
plt.title('Histogram of # of Words in Texts')
sentence_size = 25
min_word_freq = 3[/mw_shl_code]
Figure 1: A histogram of the number of words in each text in our data. We use this to establish the maximum number of words to consider in each text. We set this to 25 words, but it could just as easily be set to 30 or 40.
5. TensorFlow has a built-in processing tool for determining vocabulary embeddings, called VocabularyProcessor(), under the learn.preprocessing library:
[mw_shl_code=python,true]vocab_processor = learn.preprocessing.VocabularyProcessor(sentence_size, min_frequency=min_word_freq)
vocab_processor.fit_transform(texts)
embedding_size = len(vocab_processor.vocabulary_)[/mw_shl_code]
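To get a feel for what the processor produces, a short check like the following can be added (the sample sentence is made up, and the printed index values depend entirely on the fitted vocabulary):
[mw_shl_code=python,true]# Transform one hypothetical sentence into a fixed-length vector of word indices
sample = ['free entry to win a prize call now']
sample_ids = list(vocab_processor.transform(sample))[0]
print(sample_ids.shape)  # (25,) -- padded/truncated to sentence_size
print(sample_ids)        # vocabulary indices; 0 marks unknown or below-min_word_freq words[/mw_shl_code]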
6. Now we split the data into a training set and a test set:
[mw_shl_code=python,true]train_indices = np.random.choice(len(texts), round(len(texts)*0.8), replace=False)
test_indices = np.array(list(set(range(len(texts))) - set(train_indices)))
texts_train = [x for ix, x in enumerate(texts) if ix in train_indices]
texts_test = [x for ix, x in enumerate(texts) if ix in test_indices]
target_train = [x for ix, x in enumerate(target) if ix in train_indices]
target_test = [x for ix, x in enumerate(target) if ix in test_indices][/mw_shl_code]
7. Next we declare the embedding matrix for the words. Sentence words will be translated into indices, and these indices will be translated into one-hot-encoded vectors that we create with an identity matrix, which will be the size of our word embeddings. We will use this matrix to look up the sparse vector for each word and add them together to form the sparse sentence vector. Use the following code:
[mw_shl_code=python,true]identity_mat = tf.diag(tf.ones(shape=[embedding_size])) [/mw_shl_code]
8. Since we will end up doing logistic regression to predict the probability of spam, we need to declare our logistic regression variables. Then we declare our data placeholders as well. It is important to note that the x_data input placeholder should be of integer type, because it will be used to look up row indices of our identity matrix, and TensorFlow requires that lookup to be an integer:
[mw_shl_code=python,true]A = tf.Variable(tf.random_normal(shape=[embedding_size,1]))
b = tf.Variable(tf.random_normal(shape=[1,1]))
# Initialize placeholders
x_data = tf.placeholder(shape=[sentence_size], dtype=tf.int32)
y_target = tf.placeholder(shape=[1, 1], dtype=tf.float32) [/mw_shl_code]
9. Now we use TensorFlow's embedding lookup function, which maps the indices of the words in the sentence to the one-hot-encoded rows of our identity matrix. Once we have that matrix, we create the sentence vector by summing up those word vectors. Use the following code:
[mw_shl_code=python,true]x_embed = tf.nn.embedding_lookup(identity_mat, x_data)
x_col_sums = tf.reduce_sum(x_embed, 0) [/mw_shl_code]
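The same lookup-and-sum can be sketched in plain NumPy to make the mechanics concrete (a toy equivalent of ours, not part of the recipe's graph):
[mw_shl_code=python,true]# NumPy sketch of embedding_lookup on an identity matrix followed by a column sum
vocab_size = 6                        # assumed toy vocabulary size
word_ids = np.array([1, 2, 3, 4, 5])  # word indices for one sentence
one_hot_rows = np.eye(vocab_size)[word_ids]   # plays the role of tf.nn.embedding_lookup(identity_mat, x_data)
sentence_vector = one_hot_rows.sum(axis=0)    # plays the role of tf.reduce_sum(x_embed, 0)
print(sentence_vector)                # [0. 1. 1. 1. 1. 1.][/mw_shl_code]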
10. Now that we have fixed-length sentence vectors for each sentence, we want to perform logistic regression. To do this, we need to declare the actual model operations. Since we are doing this one data point at a time (stochastic training), we expand the dimensions of our input and perform the linear regression operations on it. Remember that TensorFlow has a loss function that already includes the sigmoid function, so we do not need to include it in our output here:
[mw_shl_code=python,true]x_col_sums_2D = tf.expand_dims(x_col_sums, 0)
model_output = tf.add(tf.matmul(x_col_sums_2D, A), b) [/mw_shl_code]
11. We now declare the loss function, the prediction operation, and the optimization function for training the model. Use the following code:
[mw_shl_code=python,true]loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(model_output, y_target))
# Prediction operation
prediction = tf.sigmoid(model_output)
# Declare optimizer
my_opt = tf.train.GradientDescentOptimizer(0.001)
train_step = my_opt.minimize(loss)[/mw_shl_code]
12. Next we initialize our graph variables before we start training:
[mw_shl_code=python,true]init = tf.initialize_all_variables()
sess.run(init) [/mw_shl_code]
13. Now we start iterating over the sentences. TensorFlow's vocab_processor.fit_transform() function is a generator that operates on one sentence at a time. We will use this to our advantage to do stochastic training of our logistic model. To get a better idea of the accuracy trend, we keep a trailing average of the past 50 training steps. If we just plotted the current value, we would only see a 1 or a 0, depending on whether we predicted that training point correctly or not. Use the following code:
[mw_shl_code=python,true]loss_vec = []
train_acc_all = []
train_acc_avg = []
for ix, t in enumerate(vocab_processor.fit_transform(texts_train)):
    y_data = [[target_train[ix]]]
    sess.run(train_step, feed_dict={x_data: t, y_target: y_data})
    temp_loss = sess.run(loss, feed_dict={x_data: t, y_target: y_data})
    loss_vec.append(temp_loss)
    if (ix+1)%10==0:
        print('Training Observation #' + str(ix+1) + ': Loss = ' + str(temp_loss))
    # Keep trailing average of past 50 observations accuracy
    # Get prediction of single observation
    [[temp_pred]] = sess.run(prediction, feed_dict={x_data: t, y_target: y_data})
    # Get True/False if prediction is accurate
    train_acc_temp = target_train[ix]==np.round(temp_pred)
    train_acc_all.append(train_acc_temp)
    if len(train_acc_all) >= 50:
        train_acc_avg.append(np.mean(train_acc_all[-50:]))[/mw_shl_code]
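If you want to see the trend rather than the raw printout, a short plotting add-on like this (our own addition, using the loss_vec and train_acc_avg lists filled in the loop above) can follow the training loop:
[mw_shl_code=python,true]# Plot per-observation loss and the trailing 50-observation training accuracy
plt.figure()
plt.plot(loss_vec, 'k-')
plt.title('Training Loss per Observation')
plt.xlabel('Observation')
plt.ylabel('Loss')

plt.figure()
plt.plot(train_acc_avg, 'b-')
plt.title('Trailing Average Training Accuracy (last 50 observations)')
plt.xlabel('Observation')
plt.ylabel('Accuracy')
plt.show()[/mw_shl_code]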
14. This results in the following output:
[mw_shl_code=python,true]Starting Training Over 4459 Sentences.
Training Observation #10: Loss = 5.45322
Training Observation #20: Loss = 3.58226
Training Observation #30: Loss = 0.0
Training Observation #4430: Loss = 1.84636
Training Observation #4440: Loss = 1.46626e-05
Training Observation #4450: Loss = 0.045941[/mw_shl_code]
15. To get the test set accuracy, we repeat the preceding process on the test texts, but run only the prediction operation, not the training operation:
[mw_shl_code=python,true]print('Getting Test Set Accuracy')
test_acc_all = []
for ix, t in enumerate(vocab_processor.fit_transform(texts_test)):
    y_data = [[target_test[ix]]]
    if (ix+1)%50==0:
        print('Test Observation #' + str(ix+1))
    # Get prediction of single observation
    [[temp_pred]] = sess.run(prediction, feed_dict={x_data: t, y_target: y_data})
    # Get True/False if prediction is accurate
    test_acc_temp = target_test[ix]==np.round(temp_pred)
    test_acc_all.append(test_acc_temp)
print('\nOverall Test Accuracy: {}'.format(np.mean(test_acc_all)))
Getting Test Set Accuracy For 1115 Sentences.
Test Observation #10
Test Observation #20
Test Observation #30
Test Observation #1000
Test Observation #1050
Test Observation #1100
Overall Test Accuracy: 0.8035874439461883[/mw_shl_code]
How it works…
For this example, we worked with the spam-ham text data from the UCI machine learning repository. We used TensorFlow's vocabulary processing functions to create a standardized vocabulary, and created sentence vectors that were the sum of each text's word vectors. We used these sentence vectors in a logistic regression and obtained a model with roughly 80% accuracy at predicting whether a text is spam.
There's more…
It is worth mentioning the motivation for limiting the sentence (or text) size. In this example, we limited the text size to 25 words. This is a common practice with bag of words because it limits the effect of text length on the prediction. You can imagine that if we find a word, "meeting" for example, that is predictive of a text being ham (not spam), then a spam message might get through by stuffing many occurrences of that word at the end.
In fact, this is a common problem with imbalanced target data. Imbalanced data can occur in this situation, since spam may be hard to find and ham may be easy to find. Because of this, the vocabulary we create may be heavily skewed toward words represented in the ham part of our data (more ham means more words are represented in ham than in spam). If we allowed unlimited text lengths, spammers could take advantage of this and create very long texts that have a higher probability of triggering non-spam word factors in our logistic model. A quick way to confirm this imbalance is sketched below.
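To check how skewed the labels actually are, a quick count (a small add-on of ours that can be run right after step 2) makes the imbalance explicit:
[mw_shl_code=python,true]from collections import Counter

label_counts = Counter(target)   # target holds 1 for spam, 0 for ham
print('ham: {}, spam: {}'.format(label_counts[0], label_counts[1]))
# The SMS Spam Collection contains far more ham than spam, so the fitted
# vocabulary is dominated by words that appear in ham messages.[/mw_shl_code]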
In the next section, we attempt to tackle this problem in a better way, by using the frequency of word occurrences to determine the values of the word embeddings.