Extracting similar users from logs with Hadoop/Pig
As part of our start-up product, we need to compute a "similar users" feature, and we've decided to use Pig for it.
I've been learning Pig for a few days now and understand how it works.
To start, here is what the log file looks like:
user url time
user1 http://someurl.com 1235416
user1 http://anotherlik.com 1255330
user2 http://someurl.com 1705012
user3 http://something.com 1705042
user3 http://someurl.com 1705042
Since the number of users and URLs can be huge, we can't use a brute-force approach here, so first we need to find the users that have accessed at least one common URL.
The algorithm could be split up as below:
- Find all users that have accessed some common URLs.
- Generate the pair-wise combinations of all users for each resource accessed.
- For each pair and URL, compute the similarity of those users: the similarity depends on the time interval between the accesses (so we need to keep track of the time).
- Sum up the per-URL similarities for each pair.
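The four steps above can be sketched in plain Python on a toy copy of the log. The similarity function used here, 1/(1 + |t1 - t2|), is just an illustrative assumption to make the time-interval idea concrete; the real scoring function is up to you:

```python
from collections import defaultdict
from itertools import combinations

# Toy log: (uid, url, time) rows, matching the sample above
logs = [
    ("user1", "http://someurl.com", 1235416),
    ("user1", "http://anotherlik.com", 1255330),
    ("user2", "http://someurl.com", 1705012),
    ("user3", "http://something.com", 1705042),
    ("user3", "http://someurl.com", 1705042),
]

# Step 1: group accesses by URL (what GROUP A BY url does in Pig)
by_url = defaultdict(list)
for uid, url, time in logs:
    by_url[url].append((uid, time))

# Steps 2-4: pair up each URL's visitors, score each pair by how close
# their access times are, and sum the scores per user pair.
similarity = defaultdict(float)
for url, visits in by_url.items():
    for (u1, t1), (u2, t2) in combinations(sorted(visits), 2):
        if u1 != u2:
            # Illustrative similarity: closer access times -> higher score
            similarity[(u1, u2)] += 1.0 / (1 + abs(t1 - t2))
```

On this sample, only `someurl.com` is shared, so exactly three pairs get a score, with (user2, user3) scoring highest because their visits are 30 time units apart.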
Here is what I've written so far:
A = LOAD 'logs.txt' USING PigStorage('\t') AS (uid:chararray, url:chararray, time:long);
-- group accesses by URL; each bag then holds every (uid, time) visit to that URL
grouped_pos = GROUP A BY url;
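One standard Pig idiom for the pairing step is a self-join: load the log a second time under another alias, JOIN the two relations BY url, then FILTER BY A::uid < B::uid so each unordered user pair appears exactly once. What that join produces can be sketched in plain Python on a toy log (the output column layout here is just for illustration):

```python
# Toy version of the log: (uid, url, time)
logs = [
    ("user1", "http://someurl.com", 1235416),
    ("user1", "http://anotherlik.com", 1255330),
    ("user2", "http://someurl.com", 1705012),
    ("user3", "http://something.com", 1705042),
    ("user3", "http://someurl.com", 1705042),
]

# Self-join on url: pair every row with every other row for the same URL,
# keeping each unordered user pair once via the uid_a < uid_b filter.
pairs = [
    (a[0], b[0], a[1], a[2], b[2])  # (uid_a, uid_b, url, time_a, time_b)
    for a in logs
    for b in logs
    if a[1] == b[1] and a[0] < b[0]
]
```

Each resulting tuple carries both access times, so the per-pair similarity can be computed directly from it in the next step.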
I know it is not much yet, but now I don't know how to generate the pairs or move further, so any help would be appreciated.
Thanks.
2 Answers
There's a nice, detailed paper from IBM on doing co-clustering with MapReduce that may be useful for you.
The Google News Personalization paper describes a fairly straightforward implementation of Locality Sensitive Hashing for solving the same problem.
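For a feel of how LSH applies here: represent each user by the set of URLs they visited, compute a MinHash signature per user, and treat users whose signatures agree in enough positions as candidate similar pairs, so only those candidates need exact scoring. A minimal sketch, with illustrative hash choices and parameters (not the paper's exact scheme, and ignoring the time-interval weighting from the question):

```python
import hashlib

def minhash_signature(urls, num_hashes=16):
    """MinHash signature of a URL set: for each of num_hashes seeded hash
    functions, keep the minimum hash value over the set. The fraction of
    matching positions between two signatures estimates Jaccard similarity."""
    return tuple(
        min(int(hashlib.md5(f"{seed}:{u}".encode()).hexdigest(), 16) for u in urls)
        for seed in range(num_hashes)
    )

a = minhash_signature({"http://someurl.com", "http://anotherlik.com"})
b = minhash_signature({"http://someurl.com", "http://something.com"})
estimated_jaccard = sum(x == y for x, y in zip(a, b)) / len(a)
```

Signatures are tiny compared to the raw URL sets, which is what makes the candidate-generation step cheap at scale.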
For algorithms, look at papers on query/URL bipartite graphs. Here are a couple of links:
Query suggestion using hitting time
by Qiaozhu Mei, Dengyong Zhou, Kenneth Church
http://www-personal.umich.edu/~qmei/pub/cikm08-sugg.ppt
Random walks on the click graph
Nick Craswell and Martin Szummer
July 2007
http://research.microsoft.com/apps/pubs/default.aspx?id=65235