Mahout Recommender Algorithm API Explained

Last edited by 52Pig on 2014-11-1 15:56

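Reading guide:
1. What problems do Mahout's single-machine in-memory implementations and its distributed implementations each face?
2. What criteria are used to evaluate the algorithms?
3. What affects an algorithm's score?
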
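1. Introduction to Mahout recommender algorithms
In terms of data-processing capacity, Mahout's recommender algorithms fall into two groups:
  • Single-machine, in-memory implementations
  • Distributed implementations based on Hadoop

1). Single-machine, in-memory implementations

  These algorithms run on a single machine and are implemented by the cf.taste project. The familiar UserCF and ItemCF both support single-machine in-memory execution, and their parameters can be configured flexibly. For a basic single-machine example, see the article: 用Maven构建Mahout项目 (Building a Mahout project with Maven)

  The limitation of the in-memory approach is that it is bounded by the resources of one machine. Medium-sized data sets, on the order of 1 GB to 10 GB, can still be processed, but data sets beyond 100 GB are out of reach for a single machine.

2). Distributed implementations based on Hadoop

  Here a single-machine in-memory algorithm is parallelized and the work is spread across several machines. Mahout provides a Hadoop-based parallel implementation of ItemCF. For details, see the article: Mahout分步式程序开发 基于物品的协同过滤ItemCF (Distributed Mahout development: item-based collaborative filtering, ItemCF)

  The difficulty with distributed algorithms is how to parallelize the single-machine algorithm. On one machine we only need to think about the algorithm, data structures, memory and CPU. A distributed algorithm must also deal with merging data across nodes, sorting, network communication efficiency, recomputation after a node failure, distributed data storage, and more.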

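2. Evaluation criteria: recall and precision

  Mahout provides two metrics for evaluating a recommender: precision and recall. Both are classic measures from search-engine evaluation.
[Figure: precision_recall.png]
                   Relevant   Not relevant
  Retrieved           A            C
  Not retrieved       B            D
  • A: retrieved and relevant (found, and wanted)
  • B: not retrieved, but relevant (not found, although actually wanted)
  • C: retrieved, but not relevant (found, but useless)
  • D: not retrieved, and not relevant (not found, and not wanted either)
Retrieving as many of the relevant items as possible is the goal of recall, A/(A+B); higher is better.
Having as many of the retrieved items as possible be relevant, and as few as possible irrelevant, is the goal of precision, A/(A+C); higher is better.

On large data sets the two metrics constrain each other. Retrieving more results tends to lower precision, while insisting on more accurate results means retrieving fewer of them.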
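
The two formulas can be sketched in a few lines of plain Java. This is only an illustration of A/(A+C) and A/(A+B) on sets of item IDs, not Mahout's evaluator; the class name PrecisionRecall and the sample IDs are made up for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, not part of Mahout.
class PrecisionRecall {

    // Cell A of the table: items that are both retrieved and relevant.
    static int hits(Set<Long> retrieved, Set<Long> relevant) {
        Set<Long> both = new HashSet<>(retrieved);
        both.retainAll(relevant);
        return both.size();
    }

    // Precision = A / (A + C); A + C is everything that was retrieved.
    static double precision(Set<Long> retrieved, Set<Long> relevant) {
        if (retrieved.isEmpty()) {
            return Double.NaN; // 0/0: undefined when nothing was retrieved
        }
        return (double) hits(retrieved, relevant) / retrieved.size();
    }

    // Recall = A / (A + B); A + B is everything that is relevant.
    static double recall(Set<Long> retrieved, Set<Long> relevant) {
        if (relevant.isEmpty()) {
            return Double.NaN; // 0/0: undefined when nothing is relevant
        }
        return (double) hits(retrieved, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        Set<Long> retrieved = new HashSet<>(Arrays.asList(101L, 102L, 103L, 104L));
        Set<Long> relevant = new HashSet<>(Arrays.asList(101L, 103L, 105L));
        // 2 of the 4 retrieved items are relevant; 2 of the 3 relevant items were retrieved.
        System.out.println("precision = " + precision(retrieved, relevant));
        System.out.println("recall    = " + recall(retrieved, relevant));
    }
}
```

With an empty retrieved set this sketch returns NaN for precision and 0.0 for recall, which is one way a pair such as "Precision:NaN, Recall:0.0" can arise.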

3. The Recommender API

1). System environment:
  • Win7 64bit
  • Java 1.6.0_45
  • Maven 3
  • Eclipse Juno Service Release 2
  • Mahout 0.8
  • Hadoop 1.1.2
2). Recommender interface file:
  org.apache.mahout.cf.taste.recommender.Recommender.java
[Figure: mahout-Recommender-class.png]
Methods of the interface:
  • recommend(long userID, int howMany): returns recommendations, howMany items for userID
  • recommend(long userID, int howMany, IDRescorer rescorer): returns howMany items for userID, with the results re-ranked by rescorer
  • estimatePreference(long userID, long itemID): estimates the user's rating of an item when no rating exists
  • setPreference(long userID, long itemID, float value): sets the user's rating of an item
  • removePreference(long userID, long itemID): removes the user's rating of an item
  • getDataModel(): returns the underlying data model
From the Recommender interface we can guess that the core of each algorithm is implemented in the estimatePreference() method of the subclasses.

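3). Subclasses of the Recommender interface, found through the inheritance hierarchy:
[Figure: mahout-Recommender-hierarchy.png]
Recommender implementation classes:
  • GenericUserBasedRecommender: user-based recommender
  • GenericItemBasedRecommender: item-based recommender
  • KnnItemBasedRecommender: item-based KNN recommender
  • SlopeOneRecommender: Slope One recommender
  • SVDRecommender: SVD recommender
  • TreeClusteringRecommender: tree-clustering recommender
Each implementation is described in turn below.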
4. Test program: RecommenderTest.java

Test data set: item.csv (judging by the program output below, each line is userID,itemID,rating)
  1,101,5.0
  1,102,3.0
  1,103,2.5
  2,101,2.0
  2,102,2.5
  2,103,5.0
  2,104,2.0
  3,101,2.5
  3,104,4.0
  3,105,4.5
  3,107,5.0
  4,101,5.0
  4,103,3.0
  4,104,4.5
  4,106,4.0
  5,101,4.0
  5,102,3.0
  5,103,2.0
  5,104,4.0
  5,105,3.5
  5,106,4.0
Test program: org.conan.mymahout.recommendation.job.RecommenderTest.java
  package org.conan.mymahout.recommendation.job;

  import java.io.IOException;
  import java.util.List;

  import org.apache.mahout.cf.taste.common.TasteException;
  import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
  import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
  import org.apache.mahout.cf.taste.model.DataModel;
  // The next three imports are used by the test methods shown in later sections.
  import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
  import org.apache.mahout.cf.taste.recommender.RecommendedItem;
  import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
  import org.apache.mahout.cf.taste.similarity.UserSimilarity;
  import org.apache.mahout.common.RandomUtils;

  public class RecommenderTest {
      final static int NEIGHBORHOOD_NUM = 2;
      final static int RECOMMENDER_NUM = 3;

      public static void main(String[] args) throws TasteException, IOException {
          RandomUtils.useTestSeed(); // fixed seed so that runs are repeatable
          String file = "datafile/item.csv";
          DataModel dataModel = RecommendFactory.buildDataModel(file);
          slopeOne(dataModel);
      }

      public static void userCF(DataModel dataModel) throws TasteException{}
      public static void itemCF(DataModel dataModel) throws TasteException{}
      public static void slopeOne(DataModel dataModel) throws TasteException{}
      ...
Each algorithm is tested in its own method, such as userCF(), itemCF(), slopeOne(), and so on.
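5. User-based collaborative filtering: UserCF

  User-based collaborative filtering measures the similarity between users from their ratings of items, and makes recommendations on the basis of that similarity. Put simply: recommend to a user the items liked by other users with similar interests.

For example:

[Figure: image015.gif]
  The basic idea of user-based CF is quite simple: find a user's nearest neighbors from their item preferences, then recommend what those neighbors like to the current user. Computationally, each user's preferences over all items are treated as a vector, and these vectors are used to compute the similarity between users. Once the K nearest neighbors are found, their similarity weights and their item preferences are used to predict the current user's preference for items the user has not yet rated, which yields a ranked list of items as the recommendation. Figure 2 gives an example: for user A, the preference history yields only one neighbor, user C, so item D, which user C likes, is recommended to user A.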
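
The prediction step described above (a similarity-weighted average over the neighbors' ratings) can be sketched in plain Java. This is a simplified illustration under stated assumptions, not Mahout's implementation: the class UserCfSketch, the similarity formula 1 / (1 + Euclidean distance) and the sample ratings are all chosen for the example, and the neighbors are passed in rather than searched for.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch class, not part of Mahout.
class UserCfSketch {

    // Similarity derived from the Euclidean distance over co-rated items: 1 / (1 + distance).
    // Returns NaN when the two users share no rated item. (Assumed formula for this sketch.)
    static double similarity(Map<Long, Double> a, Map<Long, Double> b) {
        double sumSq = 0.0;
        int shared = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                double d = e.getValue() - other;
                sumSq += d * d;
                shared++;
            }
        }
        return shared == 0 ? Double.NaN : 1.0 / (1.0 + Math.sqrt(sumSq));
    }

    // Similarity-weighted average of the neighbors' ratings of itemId.
    static double estimate(Map<Long, Map<Long, Double>> prefs, long userId, long[] neighbors, long itemId) {
        double weightedSum = 0.0;
        double totalSimilarity = 0.0;
        for (long neighbor : neighbors) {
            if (neighbor == userId) {
                continue;
            }
            Double pref = prefs.get(neighbor).get(itemId);
            if (pref == null) {
                continue; // this neighbor has not rated the item
            }
            double sim = similarity(prefs.get(userId), prefs.get(neighbor));
            if (Double.isNaN(sim)) {
                continue;
            }
            weightedSum += sim * pref;
            totalSimilarity += sim;
        }
        return totalSimilarity == 0.0 ? Double.NaN : weightedSum / totalSimilarity;
    }

    // Builds a rating map from (itemId, rating) pairs.
    static Map<Long, Double> ratings(double... itemThenRating) {
        Map<Long, Double> m = new HashMap<>();
        for (int i = 0; i < itemThenRating.length; i += 2) {
            m.put((long) itemThenRating[i], itemThenRating[i + 1]);
        }
        return m;
    }

    // Made-up sample data: user 1 has not rated item 104.
    static Map<Long, Map<Long, Double>> sampleData() {
        Map<Long, Map<Long, Double>> prefs = new HashMap<>();
        prefs.put(1L, ratings(101, 5.0, 102, 3.0));
        prefs.put(2L, ratings(101, 5.0, 102, 3.0, 104, 4.0));
        prefs.put(3L, ratings(101, 4.0, 102, 3.0, 104, 2.0));
        return prefs;
    }

    public static void main(String[] args) {
        // sim(1,2) = 1.0 and sim(1,3) = 0.5, so the estimate is (1.0*4 + 0.5*2) / 1.5.
        double estimate = estimate(sampleData(), 1L, new long[] {2L, 3L}, 104L);
        System.out.printf("estimated rating of item 104 by user 1: %.4f%n", estimate);
    }
}
```

User 2 agrees with user 1 on every shared item, so user 2's rating of 4.0 pulls the estimate more strongly than user 3's rating of 2.0.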

Algorithm API: org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender
  @Override
  public float estimatePreference(long userID, long itemID) throws TasteException {
    DataModel model = getDataModel();
    Float actualPref = model.getPreferenceValue(userID, itemID);
    if (actualPref != null) {
      return actualPref;
    }
    long[] theNeighborhood = neighborhood.getUserNeighborhood(userID);
    return doEstimatePreference(userID, theNeighborhood, itemID);
  }

  protected float doEstimatePreference(long theUserID, long[] theNeighborhood, long itemID) throws TasteException {
    if (theNeighborhood.length == 0) {
      return Float.NaN;
    }
    DataModel dataModel = getDataModel();
    double preference = 0.0;
    double totalSimilarity = 0.0;
    int count = 0;
    for (long userID : theNeighborhood) {
      if (userID != theUserID) {
        // See GenericItemBasedRecommender.doEstimatePreference() too
        Float pref = dataModel.getPreferenceValue(userID, itemID);
        if (pref != null) {
          double theSimilarity = similarity.userSimilarity(theUserID, userID);
          if (!Double.isNaN(theSimilarity)) {
            preference += theSimilarity * pref;
            totalSimilarity += theSimilarity;
            count++;
          }
        }
      }
    }
    // Throw out the estimate if it was based on no data points, of course, but also if based on
    // just one. This is a bit of a band-aid on the 'stock' item-based algorithm for the moment.
    // The reason is that in this case the estimate is, simply, the user's rating for one item
    // that happened to have a defined similarity. The similarity score doesn't matter, and that
    // seems like a bad situation.
    if (count <= 1) {
      return Float.NaN;
    }
    float estimate = (float) (preference / totalSimilarity);
    if (capper != null) {
      estimate = capper.capEstimate(estimate);
    }
    return estimate;
  }
Test program:
    public static void userCF(DataModel dataModel) throws TasteException {
        UserSimilarity userSimilarity = RecommendFactory.userSimilarity(RecommendFactory.SIMILARITY.EUCLIDEAN, dataModel);
        UserNeighborhood userNeighborhood = RecommendFactory.userNeighborhood(RecommendFactory.NEIGHBORHOOD.NEAREST, userSimilarity, dataModel, NEIGHBORHOOD_NUM);
        RecommenderBuilder recommenderBuilder = RecommendFactory.userRecommender(userSimilarity, userNeighborhood, true);
        // Score the recommender, then print up to RECOMMENDER_NUM recommendations per user.
        RecommendFactory.evaluate(RecommendFactory.EVALUATOR.AVERAGE_ABSOLUTE_DIFFERENCE, recommenderBuilder, null, dataModel, 0.7);
        RecommendFactory.statsEvaluator(recommenderBuilder, null, dataModel, 2);
        LongPrimitiveIterator iter = dataModel.getUserIDs();
        while (iter.hasNext()) {
            long uid = iter.nextLong();
            List<RecommendedItem> list = recommenderBuilder.buildRecommender(dataModel).recommend(uid, RECOMMENDER_NUM);
            RecommendFactory.showItems(uid, list, true);
        }
    }
Program output:
  AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:1.0
  Recommender IR Evaluator: [Precision:0.5,Recall:0.5]
  uid:1,(104,4.333333)(106,4.000000)
  uid:2,(105,4.049678)
  uid:3,(103,3.512787)(102,2.747869)
  uid:4,(102,3.000000)
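
The AVERAGE_ABSOLUTE_DIFFERENCE score in the output is, as the name suggests, an average of the absolute differences between estimated ratings and the actual ratings that were held out, so lower is better. The sketch below illustrates that calculation in plain Java. The class name MeanAbsoluteDifference and the sample numbers are made up, and skipping NaN estimates is an assumption of this sketch rather than a statement about the author's RecommendFactory helper.

```java
// Hypothetical sketch class, not part of Mahout.
class MeanAbsoluteDifference {

    // Average of |actual - estimated| over the pairs that have a usable estimate.
    static double score(double[] actual, double[] estimated) {
        if (actual.length != estimated.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double sum = 0.0;
        int counted = 0;
        for (int i = 0; i < actual.length; i++) {
            if (Double.isNaN(estimated[i])) {
                continue; // assumption: a missing estimate is skipped, not penalized
            }
            sum += Math.abs(actual[i] - estimated[i]);
            counted++;
        }
        return counted == 0 ? Double.NaN : sum / counted;
    }

    public static void main(String[] args) {
        double[] actual = {5.0, 3.0, 2.5};
        double[] estimated = {4.0, 3.5, Double.NaN};
        // Two usable pairs with errors 1.0 and 0.5.
        System.out.println("score = " + score(actual, estimated));
    }
}
```

Under this assumption, a recommender that cannot estimate any held-out rating ends up with a score of NaN.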
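For a reimplementation of UserCF in R, see the article: 用R解析Mahout用户推荐协同过滤算法(UserCF) (Analyzing Mahout's user-based collaborative filtering algorithm with R)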

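6. Item-based collaborative filtering: ItemCF

Item-based collaborative filtering measures the similarity between items from users' ratings of them, and makes recommendations on the basis of that similarity. Put simply: recommend to a user items similar to the ones the user liked before.

For example:
[Figure: image017.gif]
  The principle of item-based CF is similar to that of user-based CF, except that neighbors are computed between items rather than between users: similar items are found from users' preferences, and items similar to those in a user's history are then recommended. Computationally, all users' preferences for one item are treated as a vector, and these vectors are used to compute the similarity between items. Once an item's similar items are known, the user's preference history is used to predict the user's preference for items not yet rated, which yields a ranked list of items as the recommendation. Figure 3 gives an example: according to all users' preference histories, the users who like item A also like item C, so items A and C are considered similar. User C likes item A, so we can infer that user C may also like item C.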
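
The same idea can be sketched with item columns instead of user rows. In this simplified Java illustration, ratings are held in a users-by-items matrix with NaN for "not rated". The class ItemCfSketch, the similarity formula 1 / (1 + Euclidean distance) and the sample matrix are assumptions made for the example and do not reproduce Mahout's classes.

```java
// Hypothetical sketch class, not part of Mahout.
class ItemCfSketch {

    static final double NA = Double.NaN; // marks "not rated"

    // Similarity of two item columns over the users who rated both: 1 / (1 + Euclidean distance).
    static double itemSimilarity(double[][] ratings, int itemA, int itemB) {
        double sumSq = 0.0;
        int shared = 0;
        for (double[] row : ratings) {
            if (!Double.isNaN(row[itemA]) && !Double.isNaN(row[itemB])) {
                double d = row[itemA] - row[itemB];
                sumSq += d * d;
                shared++;
            }
        }
        return shared == 0 ? Double.NaN : 1.0 / (1.0 + Math.sqrt(sumSq));
    }

    // Weighted average of the user's own ratings, weighted by each rated item's
    // similarity to the target item.
    static double estimate(double[][] ratings, int user, int target) {
        double weightedSum = 0.0;
        double totalSimilarity = 0.0;
        for (int j = 0; j < ratings[user].length; j++) {
            if (j == target || Double.isNaN(ratings[user][j])) {
                continue;
            }
            double sim = itemSimilarity(ratings, target, j);
            if (Double.isNaN(sim)) {
                continue;
            }
            weightedSum += sim * ratings[user][j];
            totalSimilarity += sim;
        }
        return totalSimilarity == 0.0 ? Double.NaN : weightedSum / totalSimilarity;
    }

    // Made-up data: rows are users 0..2, columns are items 0..2; user 0 has not rated item 2.
    static double[][] sampleData() {
        return new double[][] {
            {5.0, 3.0, NA},
            {4.0, 2.0, 4.0},
            {2.0, 5.0, 2.0},
        };
    }

    public static void main(String[] args) {
        double[][] r = sampleData();
        // Item 2 is rated identically to item 0 by users 1 and 2, so item 0 dominates the estimate.
        System.out.printf("sim(2,0) = %.4f, sim(2,1) = %.4f%n",
                itemSimilarity(r, 2, 0), itemSimilarity(r, 2, 1));
        System.out.printf("estimated rating of item 2 by user 0: %.4f%n", estimate(r, 0, 2));
    }
}
```

Because item 2 behaves like item 0 in this data, the estimate lands much closer to user 0's rating of item 0 (5.0) than to the rating of item 1 (3.0).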

Algorithm API: org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender
  @Override
  public float estimatePreference(long userID, long itemID) throws TasteException {
    PreferenceArray preferencesFromUser = getDataModel().getPreferencesFromUser(userID);
    Float actualPref = getPreferenceForItem(preferencesFromUser, itemID);
    if (actualPref != null) {
      return actualPref;
    }
    return doEstimatePreference(userID, preferencesFromUser, itemID);
  }

  protected float doEstimatePreference(long userID, PreferenceArray preferencesFromUser, long itemID)
    throws TasteException {
    double preference = 0.0;
    double totalSimilarity = 0.0;
    int count = 0;
    double[] similarities = similarity.itemSimilarities(itemID, preferencesFromUser.getIDs());
    for (int i = 0; i < similarities.length; i++) {
      double theSimilarity = similarities[i];
      if (!Double.isNaN(theSimilarity)) {
        // Weights can be negative!
        preference += theSimilarity * preferencesFromUser.getValue(i);
        totalSimilarity += theSimilarity;
        count++;
      }
    }
    // Throw out the estimate if it was based on no data points, of course, but also if based on
    // just one. This is a bit of a band-aid on the 'stock' item-based algorithm for the moment.
    // The reason is that in this case the estimate is, simply, the user's rating for one item
    // that happened to have a defined similarity. The similarity score doesn't matter, and that
    // seems like a bad situation.
    if (count <= 1) {
      return Float.NaN;
    }
    float estimate = (float) (preference / totalSimilarity);
    if (capper != null) {
      estimate = capper.capEstimate(estimate);
    }
    return estimate;
  }
Test program:
    public static void itemCF(DataModel dataModel) throws TasteException {
        ItemSimilarity itemSimilarity = RecommendFactory.itemSimilarity(RecommendFactory.SIMILARITY.EUCLIDEAN, dataModel);
        RecommenderBuilder recommenderBuilder = RecommendFactory.itemRecommender(itemSimilarity, true);
        RecommendFactory.evaluate(RecommendFactory.EVALUATOR.AVERAGE_ABSOLUTE_DIFFERENCE, recommenderBuilder, null, dataModel, 0.7);
        RecommendFactory.statsEvaluator(recommenderBuilder, null, dataModel, 2);
        LongPrimitiveIterator iter = dataModel.getUserIDs();
        while (iter.hasNext()) {
            long uid = iter.nextLong();
            List<RecommendedItem> list = recommenderBuilder.buildRecommender(dataModel).recommend(uid, RECOMMENDER_NUM);
            RecommendFactory.showItems(uid, list, true);
        }
    }
Program output:
  AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:0.8676552772521973
  Recommender IR Evaluator: [Precision:0.5,Recall:1.0]
  uid:1,(105,3.823529)(104,3.722222)(106,3.478261)
  uid:2,(106,2.984848)(105,2.537037)(107,2.000000)
  uid:3,(106,3.648649)(102,3.380000)(103,3.312500)
  uid:4,(107,4.722222)(105,4.313953)(102,4.025000)
  uid:5,(107,3.736842)
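7. The SlopeOne algorithm

  This algorithm is marked @Deprecated in mahout-0.8. SlopeOne is a simple and efficient collaborative filtering algorithm that predicts ratings from average rating differences.

1). Example:
Users X, Y and Z rate items A and B as shown in the table below. What is Z's rating of B?
[Figure: slopeone.png]
Slope One assumes that the average difference between two items can stand in for the unknown difference for an individual user. The average difference of A over B is ((5 - 4) + (4 - 2)) / 2 = 1.5, so Z's rating of B is 3 - 1.5 = 1.5.

Slope One treats the relationship between ratings as a simple linear one:
  Y = mX + b
with the slope m fixed at 1, so that only the offset b (the average difference) has to be estimated.
2). Weighted average:
Users X, Y and Z rate items A, B and C as shown in the table below. What is Z's rating of A?
[Figure: slopeone2.png]
  • 1. Average difference between A and B: ((5-3)+(3-4))/2 = 0.5
  • 2. Average difference between A and C: (5-2)/1 = 3
  • 3. Z's rating of A, derived via B: 2+0.5 = 2.5
  • 4. Z's rating of A, derived via C: 5+3 = 8
  • 5. Z's rating of A as a weighted average: two users rated both A and B and one user rated both A and C, so the weights are 2 and 1 respectively: (2*2.5+1*8)/(2+1) = 13/3 = 4.33
  • In this simple way a missing rating can be computed quickly, which completes the recommendation.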
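
Both worked examples can be checked with a short plain-Java sketch of weighted Slope One. The rating tables themselves are images, so the matrices below are reconstructed from the arithmetic in the text (for instance, Y is taken not to have rated C because only one user rated both A and C). The class SlopeOneSketch is hypothetical and is not Mahout's SlopeOneRecommender.

```java
// Hypothetical sketch class, not part of Mahout.
class SlopeOneSketch {

    static final double NA = Double.NaN; // marks "not rated"

    // Weighted Slope One: for every item j the user has rated, take the average
    // difference (target - j) over users who rated both, add it to the user's rating
    // of j, and weight that candidate by the number of co-rating users.
    static double predict(double[][] ratings, int user, int target) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (int j = 0; j < ratings[user].length; j++) {
            if (j == target || Double.isNaN(ratings[user][j])) {
                continue;
            }
            double diffSum = 0.0;
            int coRaters = 0;
            for (double[] row : ratings) {
                if (!Double.isNaN(row[target]) && !Double.isNaN(row[j])) {
                    diffSum += row[target] - row[j];
                    coRaters++;
                }
            }
            if (coRaters == 0) {
                continue; // no user rated both items
            }
            weightedSum += coRaters * (ratings[user][j] + diffSum / coRaters);
            totalWeight += coRaters;
        }
        return totalWeight == 0.0 ? Double.NaN : weightedSum / totalWeight;
    }

    // Example 1, inferred from the text: rows X, Y, Z; columns A, B.
    static double[][] simpleExample() {
        return new double[][] {{5, 4}, {4, 2}, {3, NA}};
    }

    // Example 2, inferred from the text: rows X, Y, Z; columns A, B, C.
    static double[][] weightedExample() {
        return new double[][] {{5, 3, 2}, {3, 4, NA}, {NA, 2, 5}};
    }

    public static void main(String[] args) {
        // Z's rating of B in example 1: 3 - 1.5
        System.out.printf("example 1: %.2f%n", predict(simpleExample(), 2, 1));
        // Z's rating of A in example 2: (2*2.5 + 1*8) / 3
        System.out.printf("example 2: %.2f%n", predict(weightedExample(), 2, 0));
    }
}
```

With a single co-rated item, as in example 1, the weighting has no effect and the result reduces to the plain average difference.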

Algorithm API: org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender
  @Override
  public float estimatePreference(long userID, long itemID) throws TasteException {
    DataModel model = getDataModel();
    Float actualPref = model.getPreferenceValue(userID, itemID);
    if (actualPref != null) {
      return actualPref;
    }
    return doEstimatePreference(userID, itemID);
  }

  private float doEstimatePreference(long userID, long itemID) throws TasteException {
    double count = 0.0;
    double totalPreference = 0.0;
    PreferenceArray prefs = getDataModel().getPreferencesFromUser(userID);
    RunningAverage[] averages = diffStorage.getDiffs(userID, itemID, prefs);
    int size = prefs.length();
    for (int i = 0; i < size; i++) {
      RunningAverage averageDiff = averages[i];
      if (averageDiff != null) {
        double averageDiffValue = averageDiff.getAverage();
        if (weighted) {
          double weight = averageDiff.getCount();
          if (stdDevWeighted) {
            double stdev = ((RunningAverageAndStdDev) averageDiff).getStandardDeviation();
            if (!Double.isNaN(stdev)) {
              weight /= 1.0 + stdev;
            }
            // If stdev is NaN, then it is because count is 1. Because we're weighting by count,
            // the weight is already relatively low. We effectively assume stdev is 0.0 here and
            // that is reasonable enough. Otherwise, dividing by NaN would yield a weight of NaN
            // and disqualify this pref entirely
            // (Thanks Daemmon)
          }
          totalPreference += weight * (prefs.getValue(i) + averageDiffValue);
          count += weight;
        } else {
          totalPreference += prefs.getValue(i) + averageDiffValue;
          count += 1.0;
        }
      }
    }
    if (count <= 0.0) {
      RunningAverage itemAverage = diffStorage.getAverageItemPref(itemID);
      return itemAverage == null ? Float.NaN : (float) itemAverage.getAverage();
    } else {
      return (float) (totalPreference / count);
    }
  }
Test program:
    public static void slopeOne(DataModel dataModel) throws TasteException {
        RecommenderBuilder recommenderBuilder = RecommendFactory.slopeOneRecommender();
        RecommendFactory.evaluate(RecommendFactory.EVALUATOR.AVERAGE_ABSOLUTE_DIFFERENCE, recommenderBuilder, null, dataModel, 0.7);
        RecommendFactory.statsEvaluator(recommenderBuilder, null, dataModel, 2);
        LongPrimitiveIterator iter = dataModel.getUserIDs();
        while (iter.hasNext()) {
            long uid = iter.nextLong();
            List<RecommendedItem> list = recommenderBuilder.buildRecommender(dataModel).recommend(uid, RECOMMENDER_NUM);
            RecommendFactory.showItems(uid, list, true);
        }
    }
Program output:
  AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:1.3333333333333333
  Recommender IR Evaluator: [Precision:0.25,Recall:0.5]
  uid:1,(105,5.750000)(104,5.250000)(106,4.500000)
  uid:2,(105,2.286115)(106,1.500000)
  uid:3,(106,2.000000)(102,1.666667)(103,1.625000)
  uid:4,(105,4.976859)(102,3.509071)
8. KNN linear-interpolation item-based recommender
  This algorithm is marked @Deprecated in mahout-0.8.

Algorithm API: org.apache.mahout.cf.taste.impl.recommender.knn.KnnItemBasedRecommender
  @Override
  protected float doEstimatePreference(long theUserID, PreferenceArray preferencesFromUser, long itemID)
    throws TasteException {

    DataModel dataModel = getDataModel();
    int size = preferencesFromUser.length();
    FastIDSet possibleItemIDs = new FastIDSet(size);
    for (int i = 0; i < size; i++) {
      possibleItemIDs.add(preferencesFromUser.getItemID(i));
    }
    possibleItemIDs.remove(itemID);

    List<RecommendedItem> mostSimilar = mostSimilarItems(itemID, possibleItemIDs.iterator(),
      neighborhoodSize, null);
    long[] theNeighborhood = new long[mostSimilar.size() + 1];
    theNeighborhood[0] = -1;

    List<Long> usersRatedNeighborhood = Lists.newArrayList();
    int nOffset = 0;
    for (RecommendedItem rec : mostSimilar) {
      theNeighborhood[nOffset++] = rec.getItemID();
    }

    if (!mostSimilar.isEmpty()) {
      theNeighborhood[mostSimilar.size()] = itemID;
      for (int i = 0; i < theNeighborhood.length; i++) {
        PreferenceArray usersNeighborhood = dataModel.getPreferencesForItem(theNeighborhood[i]);
        int size1 = usersRatedNeighborhood.isEmpty() ? usersNeighborhood.length() : usersRatedNeighborhood.size();
        for (int j = 0; j < size1; j++) {
          if (i == 0) {
            usersRatedNeighborhood.add(usersNeighborhood.getUserID(j));
          } else {
            if (j >= usersRatedNeighborhood.size()) {
              break;
            }
            long index = usersRatedNeighborhood.get(j);
            if (!usersNeighborhood.hasPrefWithUserID(index) || index == theUserID) {
              usersRatedNeighborhood.remove(index);
              j--;
            }
          }
        }
      }
    }
    double[] weights = null;
    if (!mostSimilar.isEmpty()) {
      weights = getInterpolations(itemID, theNeighborhood, usersRatedNeighborhood);
    }

    int i = 0;
    double preference = 0.0;
    double totalSimilarity = 0.0;
    for (long jitem : theNeighborhood) {

      Float pref = dataModel.getPreferenceValue(theUserID, jitem);

      if (pref != null) {
        double weight = weights[i];
        preference += pref * weight;
        totalSimilarity += weight;
      }
      i++;
    }
    return totalSimilarity == 0.0 ? Float.NaN : (float) (preference / totalSimilarity);
  }
Test program:
    public static void itemKNN(DataModel dataModel) throws TasteException {
        ItemSimilarity itemSimilarity = RecommendFactory.itemSimilarity(RecommendFactory.SIMILARITY.EUCLIDEAN, dataModel);
        RecommenderBuilder recommenderBuilder = RecommendFactory.itemKNNRecommender(itemSimilarity, new NonNegativeQuadraticOptimizer(), 10);
        RecommendFactory.evaluate(RecommendFactory.EVALUATOR.AVERAGE_ABSOLUTE_DIFFERENCE, recommenderBuilder, null, dataModel, 0.7);
        RecommendFactory.statsEvaluator(recommenderBuilder, null, dataModel, 2);
        LongPrimitiveIterator iter = dataModel.getUserIDs();
        while (iter.hasNext()) {
            long uid = iter.nextLong();
            List<RecommendedItem> list = recommenderBuilder.buildRecommender(dataModel).recommend(uid, RECOMMENDER_NUM);
            RecommendFactory.showItems(uid, list, true);
        }
    }
Program output:
  AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:1.5
  Recommender IR Evaluator: [Precision:0.5,Recall:1.0]
  uid:1,(107,5.000000)(104,3.501168)(106,3.498198)
  uid:2,(105,2.878995)(106,2.878086)(107,2.000000)
  uid:3,(103,3.667444)(102,3.667161)(106,3.667019)
  uid:4,(107,4.750247)(102,4.122755)(105,4.122709)
  uid:5,(107,3.833621)
9. SVD recommender

Algorithm API: org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender
  @Override
  public float estimatePreference(long userID, long itemID) throws TasteException {
    double[] userFeatures = factorization.getUserFeatures(userID);
    double[] itemFeatures = factorization.getItemFeatures(itemID);
    double estimate = 0;
    for (int feature = 0; feature < userFeatures.length; feature++) {
      estimate += userFeatures[feature] * itemFeatures[feature];
    }
    return (float) estimate;
  }
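
The estimate above is simply the dot product of a user's feature vector and an item's feature vector. A minimal stand-alone sketch of that step follows; the class LatentFactorSketch and the feature values are invented for illustration and do not come from a real factorization.

```java
// Hypothetical sketch class, not part of Mahout.
class LatentFactorSketch {

    // Estimated rating = sum over features of userFeatures[f] * itemFeatures[f].
    static double estimate(double[] userFeatures, double[] itemFeatures) {
        if (userFeatures.length != itemFeatures.length) {
            throw new IllegalArgumentException("feature vectors must have the same length");
        }
        double estimate = 0.0;
        for (int f = 0; f < userFeatures.length; f++) {
            estimate += userFeatures[f] * itemFeatures[f];
        }
        return estimate;
    }

    public static void main(String[] args) {
        // Made-up two-feature vectors.
        System.out.println(estimate(new double[] {1.0, 0.5}, new double[] {3.0, 2.0}));
        // Nothing bounds a dot product, so an estimate can fall outside the rating scale.
        System.out.println(estimate(new double[] {1.0, -1.0}, new double[] {0.5, 1.0}));
    }
}
```

Because the dot product is unbounded, estimates can leave the rating scale or even go negative.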
Test program:
    public static void svd(DataModel dataModel) throws TasteException {
        RecommenderBuilder recommenderBuilder = RecommendFactory.svdRecommender(new ALSWRFactorizer(dataModel, 10, 0.05, 10));
        RecommendFactory.evaluate(RecommendFactory.EVALUATOR.AVERAGE_ABSOLUTE_DIFFERENCE, recommenderBuilder, null, dataModel, 0.7);
        RecommendFactory.statsEvaluator(recommenderBuilder, null, dataModel, 2);
        LongPrimitiveIterator iter = dataModel.getUserIDs();
        while (iter.hasNext()) {
            long uid = iter.nextLong();
            List<RecommendedItem> list = recommenderBuilder.buildRecommender(dataModel).recommend(uid, RECOMMENDER_NUM);
            RecommendFactory.showItems(uid, list, true);
        }
    }
Program output:
  AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:0.09990564982096355
  Recommender IR Evaluator: [Precision:0.5,Recall:1.0]
  uid:1,(104,4.032909)(105,3.390885)(107,1.858541)
  uid:2,(105,3.761718)(106,2.951908)(107,1.561116)
  uid:3,(103,5.593422)(102,2.458930)(106,-0.091259)
  uid:4,(105,4.068329)(102,3.534025)(107,0.206257)
  uid:5,(107,0.105169)
10. Tree cluster-based recommender
  This algorithm is marked @Deprecated in mahout-0.8.
Algorithm API: org.apache.mahout.cf.taste.impl.recommender.TreeClusteringRecommender
  @Override
  public float estimatePreference(long userID, long itemID) throws TasteException {
    DataModel model = getDataModel();
    Float actualPref = model.getPreferenceValue(userID, itemID);
    if (actualPref != null) {
      return actualPref;
    }
    buildClusters();
    List<RecommendedItem> topRecsForUser = topRecsByUserID.get(userID);
    if (topRecsForUser != null) {
      for (RecommendedItem item : topRecsForUser) {
        if (itemID == item.getItemID()) {
          return item.getValue();
        }
      }
    }
    // Hmm, we have no idea. The item is not in the user's cluster
    return Float.NaN;
  }
Test program:
    public static void treeCluster(DataModel dataModel) throws TasteException {
        UserSimilarity userSimilarity = RecommendFactory.userSimilarity(RecommendFactory.SIMILARITY.LOGLIKELIHOOD, dataModel);
        ClusterSimilarity clusterSimilarity = RecommendFactory.clusterSimilarity(RecommendFactory.SIMILARITY.FARTHEST_NEIGHBOR_CLUSTER, userSimilarity);
        RecommenderBuilder recommenderBuilder = RecommendFactory.treeClusterRecommender(clusterSimilarity, 10);
Program output:
  AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:NaN
  Recommender IR Evaluator: [Precision:NaN,Recall:0.0]
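11. Summary of Mahout recommender algorithms
Algorithms and the scenarios they suit:
[Figure: recommender-intro.png]
Evaluation scores:
[Figure: recommender-score.png]
Comparing the scores of the algorithms above: itemCF, itemKNN and SVD have the best Precision and Recall values, and itemCF and SVD have the lowest AVERAGE_ABSOLUTE_DIFFERENCE. From the algorithm's point of view, this tells us which algorithms are more accurate and which retrieve more of the relevant items.

Other factors to keep in mind:

  • 1. These three metrics alone do not prove that itemCF and SVD are necessarily better.
  • 2. None of the algorithms' parameters were tuned.
  • 3. Data volume and data distribution affect an algorithm's scores.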


buildhappy, posted 2014-11-3 10:10:46
Studying this. Bump.