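Mahout has announced that it will no longer develop on MapReduce and is migrating to Spark. In practice, though, our company cluster does not have enough memory to feed Spark, a beast that eats memory for breakfast. Add project deadline pressure and the team's current skill set, and we have no choice but to keep using MapReduce-based Mahout for a while longer.
Today's note records how to run ItemCF on Hadoop from the command line.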
Background
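I had previously read several articles by more experienced developers about programming Mahout ItemCF on Hadoop. They all describe how to implement ItemCF on Hadoop by writing code against Mahout. Since I had no time to study it myself, I simply followed their programmatic approach, for example this snippet, which shows up on blog after blog:
The code above is runnable: as long as the correct arguments are passed on the command line, the ItemCF on Hadoop job completes without trouble.
However, with this logic the Java code is really just doing the command line's job, so why not run the job directly from the command line?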
Official documentation
Those earlier authors pointed the way: the ItemCF on Hadoop job is implemented by the org.apache.mahout.cf.taste.hadoop.item.RecommenderJob class.
The official Javadoc (https://builds.apache.org/job/Mahout-Quality/javadoc/) describes the org.apache.mahout.cf.taste.hadoop.item.RecommenderJob class as follows:
Runs a completely distributed recommender job as a series of mapreduces.
Preferences in the input file should look like userID, itemID[, preferencevalue]
Preference value is optional to accommodate applications that have no notion of a preference value (that is, the user simply expresses a preference for an item, but no degree of preference).
The preference value is assumed to be parseable as a double. The user IDs and item IDs are parsed as longs.
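To make the input format described above concrete, here is a minimal Python sketch that parses one preference line. It is only an illustration for checking input files before submitting the job, not Mahout's own parser; the names `Preference` and `parse_preference` are hypothetical, and it assumes comma-separated fields exactly as the Javadoc shows.

```python
from typing import NamedTuple, Optional


class Preference(NamedTuple):
    user_id: int
    item_id: int
    value: Optional[float]  # None when the line carries no preference value


def parse_preference(line: str) -> Preference:
    """Parse one 'userID,itemID[,preferencevalue]' line (hypothetical helper)."""
    fields = [field.strip() for field in line.strip().split(",")]
    if len(fields) not in (2, 3):
        raise ValueError(
            f"expected 2 or 3 comma-separated fields, got {len(fields)}: {line!r}"
        )
    # The job parses IDs as longs; Python ints cover that range.
    user_id = int(fields[0])
    item_id = int(fields[1])
    # The preference value is optional and must be parseable as a double.
    value = float(fields[2]) if len(fields) == 3 else None
    return Preference(user_id, item_id, value)
```

A line such as `1,101,5.0` yields a record with a preference value, while `2,102` yields one whose value is `None`, which corresponds to the "no notion of a preference value" case in the description.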
Command line arguments specific to this class are:
--input (path): Directory containing one or more text files with the preference data
--output (path): output path where recommender output should go
--tempDir (path): Specifies a directory where the job may place temp files (default "temp")
--similarityClassname (classname): Name of vector similarity class to instantiate or a predefined similarity from VectorSimilarityMeasure
--usersFile (path): only compute recommendations for user IDs contained in this file (optional)
--itemsFile (path): only include item IDs from this file in the recommendations (optional)
--filterFile (path): file containing comma-separated userID,itemID pairs. Used to exclude the item from the recommendations for that user (optional)
--numRecommendations (integer): Number of recommendations to compute per user (10)
--booleanData (boolean): Treat input data as having no pref values (false)
--maxPrefsPerUser (integer): Maximum number of preferences considered per user in final recommendation phase (10)
--maxSimilaritiesPerItem (integer): Maximum number of similarities considered per item (100)
--minPrefsPerUser (integer): ignore users with less preferences than this in the similarity computation (1)
--maxPrefsPerUserInItemSimilarity (integer): max number of preferences to consider per user in the item similarity computation phase, users with more preferences will be sampled down (1000)
--threshold (double): discard item pairs with a similarity value below this
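The option list above maps directly onto a command line. The following Python sketch assembles such an invocation as an argument list; it is a sketch under several assumptions, not a verified recipe. The `hadoop jar <jar> <class>` form, the jar name `mahout-core-job.jar`, the HDFS paths, and passing `--booleanData` with an explicit `true`/`false` value are all assumptions that should be checked against your Mahout version. The function name `build_recommender_command` is hypothetical. `SIMILARITY_COOCCURRENCE` is, to my understanding, one of the predefined names in VectorSimilarityMeasure.

```python
import shlex
from typing import Dict, List, Optional, Union

RECOMMENDER_JOB = "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob"

# Option names copied from the Javadoc list above (--input/--output are handled separately).
KNOWN_OPTIONS = {
    "tempDir", "similarityClassname", "usersFile", "itemsFile", "filterFile",
    "numRecommendations", "booleanData", "maxPrefsPerUser",
    "maxSimilaritiesPerItem", "minPrefsPerUser",
    "maxPrefsPerUserInItemSimilarity", "threshold",
}

OptionValue = Union[str, int, float, bool]


def build_recommender_command(
    job_jar: str,
    input_path: str,
    output_path: str,
    options: Optional[Dict[str, OptionValue]] = None,
) -> List[str]:
    """Build the argv for a RecommenderJob run (hypothetical helper)."""
    command = [
        "hadoop", "jar", job_jar, RECOMMENDER_JOB,
        "--input", input_path,
        "--output", output_path,
    ]
    for name, value in (options or {}).items():
        if name not in KNOWN_OPTIONS:
            raise ValueError(f"unknown RecommenderJob option: {name}")
        if isinstance(value, bool):
            # Assumption: boolean options are passed as the strings true/false.
            value = "true" if value else "false"
        command += [f"--{name}", str(value)]
    return command


command = build_recommender_command(
    "mahout-core-job.jar",         # hypothetical jar name
    "/user/demo/prefs",            # hypothetical HDFS input directory
    "/user/demo/recommendations",  # hypothetical HDFS output directory
    {
        "similarityClassname": "SIMILARITY_COOCCURRENCE",
        "numRecommendations": 10,
        "booleanData": True,
    },
)
print(shlex.join(command))  # shell-quoted, ready to paste into a terminal
```

Checking option names against the documented list catches typos before a cluster job is submitted, which is cheaper than discovering them from a failed run.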