Reply: yes, this can be implemented with MapReduce.
Problems like this are fairly typical: for example, computing the similarity between one user and all other users in a recommender system, or comparing the features of two virus samples, etc.
First, let's see how to implement it without MapReduce.
1. The user → goods relation can be inverted into a goods → users relation. With both relations you can quickly find the relation between one user and all other users (the relation is expressed through the goods they share), as shown in the figure above.
2. After the inversion, just iterate over each user's record, find the goods that user bought, then use the goods → users information to find the relation between this user and the other users.
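As a tiny illustration of step 1, with made-up user and goods IDs (not the real data), the inversion itself takes only a few lines:

```python
# Hypothetical data: user -> list of goods that user bought.
users = {"u1": ["g1", "g2"], "u2": ["g1"], "u3": ["g2", "g3"]}

# Invert it into goods -> list of users who bought that item.
goods = {}
for user, bought in users.items():
    for g in bought:
        goods.setdefault(g, []).append(user)

print(goods)  # {'g1': ['u1', 'u2'], 'g2': ['u1', 'u3'], 'g3': ['u3']}
```

Now `goods["g1"]` directly answers "who else bought what u1 bought".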
The full runnable code:
```python
import time

def analyze_data():
    with open("users.txt") as users_file, open("goods.txt") as goods_file:
        users_dict = makeDict(users_file)
        goods_dict = makeDict(goods_file)
    analyze_user(users_dict)
    analyze_goods(goods_dict)
    analyze_total(users_dict, goods_dict)

def find_same_key(user1, user2, d):
    ''' return whichever orientation of the pair key already exists in d '''
    if user1 + "_" + user2 in d:
        return user1 + "_" + user2
    if user2 + "_" + user1 in d:
        return user2 + "_" + user1
    return None

def analyze_total(dict_user, dict_goods):
    total_dict = {}
    for user, goods_list in dict_user.items():
        for goods in goods_list:
            user_list = dict_goods[goods]
            # skip goods bought by only a single user
            if len(user_list) == 1:
                continue
            for other in user_list:
                if other != user:
                    k = find_same_key(user, other, total_dict)
                    if k:
                        total_dict[k] += 1
                    else:
                        # first time this pair is seen
                        total_dict[user + "_" + other] = 1
    print("total record of the result ", len(total_dict))
    li = sorted([(y, x) for x, y in total_dict.items()], reverse=True)
    get_top_n(li, 10)

def get_top_n(li, top_n):
    ''' print the top n user pairs '''
    for line in li[:top_n]:
        print(line)

def analyze_user(d):
    ''' analyze the users' data '''
    print("....................... user .........................")
    li = sorted(len(v) for v in d.values())
    total = sum(li)
    print("total users is : ", len(d))
    print("min goods count of user : ", li[0])
    print("max goods count of user : ", li[-1])
    print("total goods count of user : ", total)
    print("average goods count of user : ", total // len(li))
    print("median goods number is : ", li[len(li) // 2])

def analyze_goods(d):
    ''' analyze the goods' data '''
    print(".................. goods ........................")
    li = sorted(len(v) for v in d.values())
    total = sum(li)
    print("total goods is : ", len(d))
    print("min users count of goods : ", li[0])
    print("max users count of goods : ", li[-1])
    print("total users count of goods : ", total)
    print("average users count of goods : ", total // len(li))
    print("median users number is : ", li[len(li) // 2])

def makeDict(f):
    ''' build a dict from a tab-separated file:
        the first field is the key, the remaining fields form the value list '''
    d = {}
    for line in f:
        arr = line.rstrip("\n").split("\t")
        d[arr[0]] = arr[1:]
    return d

if __name__ == "__main__":
    start_time = time.time()
    analyze_data()
    print("total execute time is : ", time.time() - start_time)
```
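The pair-counting core above can also be written more compactly with `collections.Counter`, using a sorted pair key so that both orientations of a pair map to the same key; this removes the need for `find_same_key` entirely. A sketch with made-up data:

```python
from collections import Counter
from itertools import combinations

# Hypothetical inverted index: goods -> users who bought that item.
goods = {"g1": ["u1", "u2", "u3"], "g2": ["u1", "u3"], "g3": ["u2"]}

pair_count = Counter()
for buyers in goods.values():
    # every unordered pair of buyers of the same item co-occurs once;
    # sorting makes the pair key order-independent
    for a, b in combinations(sorted(buyers), 2):
        pair_count[a + "_" + b] += 1

print(pair_count.most_common(1))  # [('u1_u3', 2)]
```

`most_common(n)` then replaces the manual sort plus `get_top_n` step.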
Test output:
....................... user .........................
total users is : 1846
min goods count of user : 18
max goods count of user : 3244
total goods count of user : 510959
average goods count of user : 276
median goods number is : 173
.................. goods ........................
total goods is : 26555
min users count of goods : 1
max users count of goods : 1235
total users count of goods : 510959
average users count of goods : 19
median users number is : 3
total record of the result 1980983
(2294, '2482807_tjz230')
(2246, '46921865_3459184') [count, user_user]
(2210, '46921865_2482807')
(2173, 'GOAL_goal')
(2120, '46921865_tjz230')
(2108, '36855984_tjz230')
(2036, '46921865_nofish')
(2026, '2482807_36855984')
(1956, '2482807_3459184')
(1934, 'nofish_tjz230')
total execute time is : 328.155999899
Note: processing 1,846 users and 26,555 goods took 328 seconds and produced 1,980,983 records; the input file is about 6 MB.
As the test output above shows, this kind of computation is fairly expensive. The two ways to speed it up are a more powerful machine (faster CPU, more memory) or a distributed system.
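To make the MapReduce claim at the top concrete, here is a minimal single-process sketch of the two phases (the function names and data are made up for illustration, not a real framework's API): the map phase emits a `(pair_key, 1)` record for every two users who bought the same item, the shuffle groups equal keys, and the reduce phase sums the counts per pair. In a real cluster the map and reduce calls run in parallel on different machines.

```python
from itertools import combinations
from collections import defaultdict

def map_phase(goods_record):
    ''' input: (goods_id, [buyers]); emit (pair_key, 1) for each buyer pair '''
    _, buyers = goods_record
    for a, b in combinations(sorted(buyers), 2):
        yield a + "_" + b, 1

def shuffle(emitted):
    ''' group values by key, as the framework does between map and reduce '''
    groups = defaultdict(list)
    for key, value in emitted:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    ''' sum the co-occurrence counts for one user pair '''
    return key, sum(values)

# Hypothetical inverted index, same shape as goods.txt after splitting.
goods = [("g1", ["u1", "u2", "u3"]), ("g2", ["u1", "u3"])]
emitted = [kv for record in goods for kv in map_phase(record)]
result = dict(reduce_phase(k, vs) for k, vs in shuffle(emitted).items())
print(result)  # {'u1_u2': 1, 'u1_u3': 2, 'u2_u3': 1}
```

Because each goods record is mapped independently and each pair key is reduced independently, this structure scales out across machines in a way the single-process loop above cannot.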