Count the number of rating records provided per year
My mapper is:
import sys

for line in sys.stdin:
    # split line into the four fields
    fields = line.strip().split("\t")
    value = fields[2]  # rating
    key = fields[3]  # timestamp in unix seconds
    print(key, value, sep="\t")
My reducer is:
import sys

(last_key, count) = (None, 0)
for line in sys.stdin:
    (key, value) = line.strip().split("\t")
    # emit the finished group whenever the key changes
    if last_key and last_key != key:
        print(last_key, count, sep="\t")
        count = 0
    last_key = key
    count += int(value)
print(last_key, count, sep="\t")
How do I get the number of ratings? The mapper works fine. And when should I convert the timestamp (last_key in this case)?
Output should be (year-month \t number of rating records)
1 Answer
If you want to reduce (group by) the year-month string, that needs to be the key of the mapper, rather than a key with second precision. After that, you just need to count the number of values for that key in the reducer. (The mapper's value can just be 1 rather than the actual rating: if you only need to count the ratings, the reducer can simply sum the values.)
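As a sketch of that change (not the only way to do it), the timestamp conversion belongs in the mapper, so the year-month string is already the key by the time records are grouped. This assumes the standard datetime module and UTC timestamps:

Mapper:

import sys
from datetime import datetime, timezone

for line in sys.stdin:
    # split line into the four fields
    fields = line.strip().split("\t")
    ts = int(fields[3])  # timestamp in unix seconds
    # convert to a "YYYY-MM" string so records group by year-month
    key = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")
    print(key, 1, sep="\t")  # emit 1, since we only need a count

Reducer:

import sys

(last_key, count) = (None, 0)
for line in sys.stdin:
    (key, value) = line.strip().split("\t")
    if last_key and last_key != key:
        print(last_key, count, sep="\t")
        count = 0
    last_key = key
    count += int(value)  # summing the 1s counts the records
if last_key is not None:
    print(last_key, count, sep="\t")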
I'd recommend using the mrjob library rather than manually reading from sys.stdin.
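A minimal mrjob version might look like this (the class name MRRatingCounts is made up; the MRJob base class and its mapper/reducer hooks are the library's real API):

from datetime import datetime, timezone
from mrjob.job import MRJob

class MRRatingCounts(MRJob):
    def mapper(self, _, line):
        fields = line.strip().split("\t")
        ts = int(fields[3])  # timestamp in unix seconds
        yield datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m"), 1

    def reducer(self, key, values):
        yield key, sum(values)  # sum of 1s = number of records

if __name__ == "__main__":
    MRRatingCounts.run()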
Alternatively, you can rewrite the code in PySpark and do the same operation in fewer lines.
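A PySpark sketch under the same assumptions (the input path "ratings.tsv" and the column names are placeholders):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rating-counts").getOrCreate()

# four tab-separated fields; "ratings.tsv" is a placeholder path
df = (spark.read.option("sep", "\t")
      .csv("ratings.tsv")
      .toDF("user_id", "movie_id", "rating", "timestamp"))

counts = (df.withColumn("year_month",
                        F.from_unixtime(F.col("timestamp").cast("long"), "yyyy-MM"))
            .groupBy("year_month")
            .count()
            .orderBy("year_month"))

counts.show()  # one row per year-month with its record count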