Testing Mahout 0.9 (patched with 1329-3.patch) on Hadoop 2.2.0 with Bayesian text classification

Last edited by 坎蒂丝_Swan on 2015-1-17 20:44

Guiding questions

1. How do you create sequence files from the data?
2. How do you convert the sequence files into vectors?

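Introduction

This post follows the previous article, "Mahout0.9 打patch使其支持 Hadoop2.2.0" (Patching Mahout 0.9 to Support Hadoop 2.2.0). With Mahout 0.9 patched and built successfully, Bayesian text classification is used here to test its compatibility with Hadoop 2.2.0.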

Step 1: Upload all of the 20news files to HDFS

  yarn@singletest:~/Mahout/mahout-distribution-0.7$ hadoop fs -ls /workspace/mahout/week4/data/20news
  Found 2 items
  drwxr-xr-x   - yarn supergroup          0 2014-09-04 21:52 /workspace/mahout/week4/data/20news/20news-bydate-test
  drwxr-xr-x   - yarn supergroup          0 2014-09-04 21:57 /workspace/mahout/week4/data/20news/20news-bydate-train

Step 2: Create sequence files from the data

  yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ ./mahout seqdirectory -i /workspace/mahout/week4/data/20news -o /workspace/mahout/week4/data/20news_seq
  yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ hadoop fs -ls /workspace/mahout/week4/data/20news_seq
  Found 1 items
  -rw-r--r--   1 yarn supergroup   37064977 2014-09-04 22:12 /workspace/mahout/week4/data/20news_seq/chunk-0

Step 3: Convert the sequence files into vectors

  yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ ./mahout seq2sparse -i /workspace/mahout/week4/data/20news_seq/ -o /workspace/mahout/week4/data/20news_vectors -lnorm -nv -wt tfidf
  yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ hadoop fs -ls /workspace/mahout/week4/data/20news_vectors
  Found 7 items
  drwxr-xr-x   - yarn supergroup          0 2014-09-04 22:20 /workspace/mahout/week4/data/20news_vectors/df-count
  -rw-r--r--   1 yarn supergroup    1937084 2014-09-04 22:18 /workspace/mahout/week4/data/20news_vectors/dictionary.file-0
  -rw-r--r--   1 yarn supergroup    1890053 2014-09-04 22:20 /workspace/mahout/week4/data/20news_vectors/frequency.file-0
  drwxr-xr-x   - yarn supergroup          0 2014-09-04 22:19 /workspace/mahout/week4/data/20news_vectors/tf-vectors
  drwxr-xr-x   - yarn supergroup          0 2014-09-04 22:21 /workspace/mahout/week4/data/20news_vectors/tfidf-vectors
  drwxr-xr-x   - yarn supergroup          0 2014-09-04 22:18 /workspace/mahout/week4/data/20news_vectors/tokenized-documents
  drwxr-xr-x   - yarn supergroup          0 2014-09-04 22:18 /workspace/mahout/week4/data/20news_vectors/wordcount
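The directory names in the listing above (df-count, tf-vectors, tfidf-vectors) hint at what the `-wt tfidf` weighting involves: count terms per document, count how many documents contain each term, then combine the two. The following is a minimal sketch of that idea using the textbook formula tf * ln(N / df). The function name `tfidf_vectors` and the sample documents are made up for illustration, and the formula is an assumption; Mahout's actual weighting and the effect of `-lnorm` may differ in detail.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses the textbook weighting tf(t, d) * ln(N / df(t)). This is an
    illustrative formula, not necessarily the exact one Mahout applies.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


# Hypothetical tokenized documents, standing in for tokenized-documents.
docs = [
    ["space", "shuttle", "launch"],
    ["hockey", "game", "tonight"],
    ["space", "game"],
]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

Terms that appear in fewer documents ("shuttle") receive a larger weight than terms shared across documents ("space"), which is the property the classifier relies on.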

Step 4: Split the vectors into a training set and a test set

Parameters:
  • -tr: output path for the training set
  • -te: output path for the test set
  • -rp: the percentage of the whole data set to use as test data; the command below sets it to 20%.

  yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ ./mahout split -i /workspace/mahout/week4/data/20news_vectors/tfidf-vectors -tr /workspace/mahout/week4/data/train-vectors -te /workspace/mahout/week4/data/test-vectors -rp 20 -ow -seq -xm sequential
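To make the effect of `-rp 20` concrete, here is a minimal sketch of a random percentage split. The helper `split_by_percentage`, the seed, and the document keys are hypothetical; this shuffles and cuts at an exact count, whereas Mahout's own random selection may work differently and need not produce exactly 20%.

```python
import random


def split_by_percentage(items, test_pct, seed=42):
    """Shuffle items and hold out test_pct percent of them as the test set.

    Returns (train, test). The fixed seed only makes the sketch repeatable.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_pct / 100)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical document keys standing in for the tfidf-vectors entries.
keys = [f"doc-{i}" for i in range(100)]
train, test = split_by_percentage(keys, test_pct=20)
print(len(train), len(test))
```

The two sets are disjoint and together cover the input, so no document used for training is later used for testing.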

Step 5: Train the model

  yarn@singletest:~/Mahout/mahout-distribution-0.9/bin$ ./mahout trainnb -i /workspace/mahout/week4/data/train-vectors -el -o /workspace/mahout/week4/nbmodel -li /workspace/mahout/week4/labindex -ow -c

View the generated label index:

  yarn@singletest:~$ hadoop fs -text /workspace/mahout/week4/labindex
  20news-bydate-test      0
  20news-bydate-train     1
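The index above contains the two top-level directory names rather than individual newsgroups. One plausible reading, stated here as an assumption and not confirmed by the post, is that the label is taken from the first path component of each document key, and that the keys start at the directory passed to seqdirectory in step 2. The sketch below illustrates that reading; `label_from_key`, `build_label_index`, and the sample keys are hypothetical.

```python
def label_from_key(key):
    """Return the first path component of a document key (assumed rule)."""
    return key.strip("/").split("/")[0]


def build_label_index(keys):
    """Map each distinct label to an integer id, in sorted order."""
    labels = sorted({label_from_key(k) for k in keys})
    return {label: i for i, label in enumerate(labels)}


# Hypothetical keys of the form /<top-level dir>/<newsgroup>/<file>.
keys = [
    "/20news-bydate-test/sci.space/0001",
    "/20news-bydate-train/sci.space/0002",
    "/20news-bydate-train/rec.sport.hockey/0003",
]
index = build_label_index(keys)
print(index)
```

Under this assumption, running seqdirectory on a single split directory, so that the newsgroup name becomes the first path component, would yield one label per newsgroup instead.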

View the trained model:

  yarn@singletest:~$ hadoop fs -ls /workspace/mahout/week4/nbmodel
  Found 1 items
  -rw-r--r--   1 yarn supergroup    2437874 2014-09-05 23:09 /workspace/mahout/week4/nbmodel/naiveBayesModel.bin

Step 6: Test

  yarn@singletest:~/Mahout/mahout-distribution-0.9/bin$ ./mahout testnb -i /workspace/mahout/week4/data/test-vectors -m /workspace/mahout/week4/nbmodel -l /workspace/mahout/week4/labindex -ow -o /workspace/mahout/week4/20news-test-result -c

Note: the input path given to -i for testing is the test set produced by the split in step 4.

Test results:

14/09/05 23:18:09 INFO test.TestNaiveBayesDriver: Complementary Results:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       2887       74.9675%
Incorrectly Classified Instances        :        964       25.0325%
Total Classified Instances              :       3851

=======================================================
Confusion Matrix
-------------------------------------------------------
a       b       <--Classified as
1131    413      |  1544        a     = 20news-bydate-test
551     1756     |  2307        b     = 20news-bydate-train

=======================================================
Statistics
-------------------------------------------------------
Kappa                                        0.486
Accuracy                                   74.9675%
Reliability                                49.7892%
Reliability (standard deviation)            0.4314

14/09/05 23:18:09 INFO driver.MahoutDriver: Program took 17504 ms (Minutes: 0.29173333333333334)
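The Accuracy and Kappa figures in the log can be cross-checked from the confusion matrix alone. The sketch below assumes that the reported Kappa is Cohen's kappa (observed agreement corrected for chance agreement); the function name `accuracy_and_kappa` is made up for illustration.

```python
def accuracy_and_kappa(matrix):
    """Compute accuracy and Cohen's kappa from a square confusion matrix.

    Rows are actual classes, columns are predicted classes.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    observed = correct / total  # observed agreement = accuracy

    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2

    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Confusion matrix copied from the test log above.
matrix = [[1131, 413], [551, 1756]]
accuracy, kappa = accuracy_and_kappa(matrix)
print(f"accuracy = {accuracy:.4%}")
print(f"kappa    = {kappa:.3f}")
```

The diagonal sums to 2887 of 3851 instances, and the rounded accuracy and kappa agree with the 74.9675% and 0.486 shown in the log.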

Reposted from http://blog.csdn.net/u010967382/article/details/39088285
