Efficiently computing co-occurrences in a large dataset
Came across this interview programming test recently:
- You're given a list of top 50 favorite artists for 1000 users (from last.fm)
- Generate a list of all artist pairs that appear together at least 50 times.
- The solution can't store everything in memory, or evaluate all possible pairs.
- The solution should be scalable to larger datasets.
- The solution doesn't have to be exact, i.e. you can report pairs that have a high probability of meeting the cutoff.
I feel I have a pretty workable solution, but I'm wondering if they were looking for something specific that I missed.
(In case it makes a difference - this isn't from my own interviewing, so I'm not trying to cheat any prospective employers)
Here are my assumptions:
- There's a finite maximum number of artists (622K according to MusicBrainz), while there is no limit on the number of users (well, not more than ~7 billion, I guess).
- Artists follow a "long tail" distribution: a few are popular, but most are favorited by a very small number of users.
- The cutoff is chosen to select a certain percentage of artists (around 1% with a cutoff of 50 and the given data), so it will increase as the number of users increases.
The third requirement is a little vague - technically, if you have any exact solution you've "evaluated all possible pairs".
Practical Solution
First pass: convert artist names to numeric IDs; store the converted favorite data in a temp file; keep a count of how many users favorited each artist.
Requires a string->int map to keep track of the assigned IDs; a Patricia tree can be used if space is more important than speed (it needed 1/5th the space and twice the time in my, admittedly not very rigorous, tests).
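A minimal sketch of that first pass in Python (not the asker's actual code): the one-user-per-line, tab-separated input format and the plain dict in place of a Patricia tree are assumptions made just for illustration.

```python
def first_pass(input_path, temp_path):
    """Convert artist names to numeric ids, store the converted favorites
    in a temp file, and count how many users favorited each artist."""
    artist_ids = {}      # string -> int map (a Patricia/radix tree could replace this)
    artist_counts = []   # artist_counts[i] = number of users who favorited artist i

    with open(input_path, encoding='utf-8') as fin, \
         open(temp_path, 'w', encoding='utf-8') as fout:
        for line in fin:                       # assumed: one user per line, tab-separated artists
            ids = []
            for name in line.rstrip('\n').split('\t'):
                aid = artist_ids.setdefault(name, len(artist_ids))
                if aid == len(artist_counts):  # first time we've seen this artist
                    artist_counts.append(0)
                artist_counts[aid] += 1
                ids.append(aid)
            fout.write(' '.join(map(str, ids)) + '\n')

    return artist_ids, artist_counts
```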
Second pass: iterate over the temp file; throw out artists which didn't, individually, meet the cutoff; keep counts of pairs in a 2D matrix.
Will require n(n-1)/2 bytes (or shorts, or ints, depending on the data size) plus the array reference overhead. Shouldn't be a problem, since n is at most 0.01-0.05 of 622K.
This seems like it can process any sized real-world dataset using less than 100MB of memory.
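To put a number on that: keeping roughly 1% of 622K artists gives n ≈ 6,220, so n(n-1)/2 ≈ 19.3 million pair counts, about 19 MB at one byte each (or ~39 MB as 16-bit shorts). Here is a sketch of the second pass, continuing from the hypothetical first_pass above; the flat triangular array and the 16-bit counters are illustrative choices, not the only reasonable ones.

```python
from array import array

def second_pass(temp_path, artist_counts, cutoff=50):
    """Count co-occurrences among artists that individually meet the cutoff,
    using a flat triangular array of pair counts."""
    # Re-number the surviving artists densely: original id -> compact index 0..n-1.
    keep = {aid: idx for idx, aid in enumerate(
        a for a, c in enumerate(artist_counts) if c >= cutoff)}
    n = len(keep)

    # Entry for pair (i, j) with i < j lives at i*(2n - i - 1)//2 + (j - i - 1).
    # 'H' is an unsigned 16-bit short; use 'L' if a pair could co-occur > 65,535 times.
    pairs = array('H', [0]) * (n * (n - 1) // 2)

    with open(temp_path, encoding='utf-8') as fin:
        for line in fin:
            ids = sorted({keep[a] for a in map(int, line.split()) if a in keep})
            for x, i in enumerate(ids):
                base = i * (2 * n - i - 1) // 2 - i - 1
                for j in ids[x + 1:]:
                    pairs[base + j] += 1

    # Report every pair at or above the cutoff, back in the original ids.
    original_id = {idx: aid for aid, idx in keep.items()}
    result = []
    for i in range(n):
        base = i * (2 * n - i - 1) // 2 - i - 1
        for j in range(i + 1, n):
            if pairs[base + j] >= cutoff:
                result.append((original_id[i], original_id[j], pairs[base + j]))
    return result
```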
Alternate Solution
If you can't do multiple passes (for whatever contrived reason), use an array of Bloom filters to keep the pair counts: for each pair you encounter, find the highest filter it's (probably) in, and add it to the next one up. So the first time a pair is seen it's added to bf[0], the second time to bf[1], and so on up to bf[49]. Or you can revert to keeping actual counts after a certain point.
I haven't run the numbers, but the lowest few filters will be quite sizable - it's not my favorite solution, but it could work.
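A rough sketch of that Bloom-filter cascade; the filter size, the number of hash probes, and the blake2b-based hashing are arbitrary illustrative choices, and pair_stream is assumed to be the pairs generated per user (e.g. itertools.combinations of each user's favorites).

```python
import hashlib

class BloomFilter:
    """A deliberately simple Bloom filter: a few salted hash probes into a bit array."""
    def __init__(self, num_bits, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(key, salt=seed.to_bytes(8, 'little')).digest()
            yield int.from_bytes(digest[:8], 'little') % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

def bloom_cascade_pairs(pair_stream, cutoff=50, bits_per_filter=1 << 24):
    """bf[i] holds pairs seen (probably) at least i+1 times; a pair that reaches
    bf[cutoff-1] is reported as (probably) meeting the cutoff."""
    cascade = [BloomFilter(bits_per_filter) for _ in range(cutoff)]
    survivors = set()

    for a, b in pair_stream:
        lo, hi = (a, b) if a <= b else (b, a)
        key = f'{lo}:{hi}'.encode()
        # Highest filter the pair is (probably) already in (-1 if none)...
        level = next((i for i in range(cutoff - 1, -1, -1) if key in cascade[i]), -1)
        # ...then promote it one level up.
        if level < cutoff - 1:
            cascade[level + 1].add(key)
            if level + 1 == cutoff - 1:   # reached the top filter: ~cutoff occurrences
                survivors.add((lo, hi))

    return survivors
```

As noted above, the lowest few filters would really need to be sized much larger than the upper ones; the uniform size here (50 filters at 2 MB each, about 100 MB total) is just to keep the sketch short.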
Any other ideas?
Comments (2)
You should consider one of the existing approaches for mining association rules. This is a well-researched problem, and it is unlikely that a home-grown solution would be much faster. Some references:
Wikipedia has a non-terrible list of implementations: http://en.wikipedia.org/wiki/Association_rule_learning
Citing a previous answer of mine: What is the most efficient way to access particular elements in a SortedSet?
There is a repository of existing implementations here: http://fimi.ua.ac.be/src/ . These are tools that participated in a performance competition a few years back; many of them come with indicative papers to explain how/when/why they are faster than other algorithms.
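As a concrete illustration of leaning on an existing implementation (my example; it is not one the links above specifically name), here is roughly how the pair mining could look with the mlxtend package's Apriori implementation. This is fine for the 1000-user case in the question, though a one-hot DataFrame won't scale the way the FIMI tools do.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

def frequent_pairs(user_favorites, cutoff=50):
    """user_favorites: one list of artist names per user.
    Returns the artist pairs favorited together at least `cutoff` times."""
    te = TransactionEncoder()
    onehot = te.fit(user_favorites).transform(user_favorites)
    df = pd.DataFrame(onehot, columns=te.columns_)

    # Apriori takes relative support, so translate the absolute cutoff.
    itemsets = apriori(df, min_support=cutoff / len(user_favorites),
                       use_colnames=True, max_len=2)
    # Keep only the 2-item sets (pairs), not the individually frequent artists.
    return itemsets[itemsets['itemsets'].apply(len) == 2]
```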
With two points of the requirement being about inexact solution, I'm guessing they're looking for a fast shortcut approximation instead of an exhaustive search. So here's my idea:
Suppose that there is absolutely no correlation between a fan's choices for favorite artists. This is, of course, surely false. Someone who likes Rembrandt is far more likely to also like Rubens than he is to also like Pollock. (You did say we were picking favorite artists, right?) I'll get back to this in a moment.
Then make a pass through the data, counting the number of distinct artists, the number of fans, and how often each artist shows up as a favorite. When we're done making this pass: (1) Eliminate any artists who don't individually show up the required number of times. If an artist only shows up 40 times, no pair containing him can show up more than 40 times. (2) For the remaining artists, convert each "like count" to a percentage, i.e. this artist was liked by, say, 10% of the fans. Then for each pair of artists, multiply their like percentages together and then multiply by the total number of fans. This is the estimated number of times they'd show up as a pair.
For example, suppose that out of 1000 fans, 200 say they like Rembrandt and 100 say they like Michelangelo. That means 20% for Rembrandt and 10% for Michelangelo. So if there's no correlation, we'd estimate that 20% * 10% * 1000 = 20 like both. This is below the threshold, so we wouldn't include this pair.
The catch to this is that there almost surely is a correlation between "likes". My first thought would be to study some real data and see how much of a correlation there is, that is, how the real pair counts differ from the estimates. If we find that, say, the real count is rarely more than twice the estimated count, then we could declare any pair whose estimate is over 1/2 of the threshold a "candidate". Then we do an exhaustive count on the candidates to see how many really meet the condition. This would allow us to eliminate all the pairs that fall well below the threshold as "unlikely" and thus not worth the cost of investigating.
This could miss pairs when the artists almost always occur together. If, say, 100 fans like Picasso, 60 like Van Gogh, and 50 of the 60 who like Van Gogh also like Picasso, then the pair's estimate will be MUCH lower than its actual count. If this happens rarely enough, it may fall into the acceptable "exact answer not required" category. If it happens all the time, this approach won't work.
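A sketch of this estimate-then-verify idea, assuming the per-user favorites are available as lists; the factor-of-2 correlation bound is the hypothetical number from the paragraph above, something you would actually have to measure on real data.

```python
from collections import Counter
from itertools import combinations

def candidate_pairs(user_favorites, cutoff=50, correlation_factor=2.0):
    """Estimate pair counts under an independence assumption, keep only
    'candidates' whose estimate is within correlation_factor of the cutoff,
    then count those candidates exactly."""
    num_fans = len(user_favorites)
    likes = Counter(a for favs in user_favorites for a in set(favs))

    # (1) Drop artists that individually can't reach the cutoff.
    eligible = {a for a, c in likes.items() if c >= cutoff}

    # (2) Independence estimate: est(a, b) = p(a) * p(b) * num_fans.
    #     Keep pairs whose estimate is at least cutoff / correlation_factor.
    threshold = cutoff / correlation_factor
    candidates = {
        frozenset((a, b))
        for a, b in combinations(sorted(eligible), 2)
        if likes[a] * likes[b] / num_fans >= threshold
    }

    # (3) Exact counts, but only for the candidate pairs.
    exact = Counter()
    for favs in user_favorites:
        kept = sorted(set(favs) & eligible)
        for a, b in combinations(kept, 2):
            pair = frozenset((a, b))
            if pair in candidates:
                exact[pair] += 1

    return {pair: n for pair, n in exact.items() if n >= cutoff}
```

The exhaustive recount in step (3) only touches candidate pairs, which is where the savings over a full pair count would come from.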