Finding the most common three-item sequence in a very large file

I have many log files of webpage visits, where each visit is associated with a user ID and a timestamp. I need to identify the most popular (i.e. most often visited) three-page sequence of all. The log files are too large to be held in main memory at once.

Sample log file:

User ID  Page ID
A          1
A          2
A          3
B          2
B          3
C          1
B          4
A          4

Corresponding results:

A: 1-2-3, 2-3-4
B: 2-3-4
2-3-4 is the most popular three-page sequence

My idea is to use two hash tables. The first is keyed on user ID and stores that user's page sequence; the second is keyed on three-page sequences and stores the number of times each one appears. This takes O(n) space and O(n) time.
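
For concreteness, here is a minimal single-pass sketch of that two-hash-table idea (my assumptions, not part of the question: the log is already in timestamp order, each line is a whitespace-separated user/page pair, and the file name visits.log is made up):

#include <array>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    // Table 1: user ID -> the last three pages seen for that user.
    std::unordered_map<std::string, std::array<std::string, 3>> history;
    // Table 2: "p1-p2-p3" -> number of times that three-page sequence appeared.
    std::unordered_map<std::string, std::uint64_t> counts;

    std::ifstream log("visits.log");          // hypothetical file name
    std::string user, page;
    while (log >> user >> page) {
        auto& h = history[user];
        h = {h[1], h[2], page};               // slide the three-page window
        if (!h[0].empty())                    // only count complete triples
            ++counts[h[0] + "-" + h[1] + "-" + h[2]];
    }

    std::string best;
    std::uint64_t best_count = 0;
    for (const auto& [seq, c] : counts)
        if (c > best_count) { best_count = c; best = seq; }
    std::cout << best << " seen " << best_count << " times\n";
}

This is the O(n)-time/O(n)-space version; as noted below, the catch is that both tables have to fit in memory.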

However, with two hash tables, memory cannot hold everything at once, so I have to spill to disk, and accessing the disk that often is not efficient.

How can I do this better?

Comments (5)

一梦浮鱼 2024-12-30 15:47:24

If you want to quickly get an approximate result, use hash tables as you intended, but add a limited-size queue to each hash table to drop the least-recently-used entries.
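
A minimal sketch of such a size-bounded counter (the class name, the string keys, and the exact eviction policy are my assumptions, not the answerer's code; capacity is assumed to be at least 1):

#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>

// Counting map with a fixed capacity: when full, the least recently
// touched key is evicted, so counts are approximate lower bounds.
class LruCounter {
public:
    explicit LruCounter(std::size_t capacity) : capacity_(capacity) {}

    std::uint64_t add(const std::string& key) {
        auto it = entries_.find(key);
        if (it == entries_.end()) {
            if (entries_.size() == capacity_) {      // evict the LRU key
                entries_.erase(order_.back());
                order_.pop_back();
            }
            order_.push_front(key);
            it = entries_.emplace(key, Entry{0, order_.begin()}).first;
        } else {                                     // refresh: move to the front
            order_.splice(order_.begin(), order_, it->second.pos);
        }
        return ++it->second.count;
    }

private:
    struct Entry {
        std::uint64_t count;
        std::list<std::string>::iterator pos;
    };
    std::size_t capacity_;
    std::list<std::string> order_;                   // MRU at the front
    std::unordered_map<std::string, Entry> entries_;
};

One instance can hold the per-user histories and another the sequence counts, each sized to fit the available memory; frequently seen sequences keep being refreshed and survive, while rare ones are evicted, which is where the approximation comes from.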

If you want an exact result, use an external sort to order the log by user ID, then combine every 3 consecutive entries into a triple and sort again, this time by the page-ID triples.

Update (sort by timestamp)

Some preprocessing may be needed to make proper use of the log files' timestamps:

  • If the log files are already sorted by timestamp, no preprocessing is needed.
  • If there are several log files (possibly coming from independent processes), and each file is already sorted by timestamp, open all of them and merge-sort their streams as you read (see the sketch after this list).
  • If the files are almost sorted by timestamp (as when several independent processes write logs to a single file), use a binary heap as a reordering buffer to get the data into the correct order.
  • If the files are not sorted by timestamp (which is unlikely in practice), use an external sort by timestamp.
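
For the merge-sort case in the second bullet, here is a sketch using a min-heap over the current head line of each file (the "<timestamp> <userid> <pageid>" layout with a leading integer timestamp is an assumption about the log format):

#include <fstream>
#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <vector>

// K-way merge of several log files, each already sorted by timestamp.
struct Head {
    long long ts;
    std::string line;
    std::size_t file;                 // which stream this line came from
    bool operator>(const Head& o) const { return ts > o.ts; }
};

int main(int argc, char** argv) {
    std::vector<std::ifstream> files;
    for (int i = 1; i < argc; ++i) files.emplace_back(argv[i]);

    std::priority_queue<Head, std::vector<Head>, std::greater<Head>> heap;
    auto push_next = [&](std::size_t i) {
        std::string line;
        if (std::getline(files[i], line) && !line.empty()) {
            long long ts = std::stoll(line.substr(0, line.find(' ')));
            heap.push(Head{ts, line, i});
        }
    };
    for (std::size_t i = 0; i < files.size(); ++i) push_next(i);

    while (!heap.empty()) {           // emit lines in global timestamp order
        Head h = heap.top();
        heap.pop();
        std::cout << h.line << '\n';
        push_next(h.file);            // refill from the stream we just consumed
    }
}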

Update 2 (improving the approximate method)

The approximate method with an LRU queue should produce quite good results for randomly distributed data. But webpage visits may follow different patterns at different times of day, or may differ on weekends, and for such data the original approach may give poor results. To improve this, a hierarchical LRU queue may be used.

Partition the LRU queue into log(N) smaller queues with sizes N/2, N/4, ... The largest one may contain any element, the next one only elements seen at least 2 times, the next one only elements seen at least 4 times, and so on. When an element is evicted from one sub-queue it is demoted to the next one down, so it passes through every sub-queue lower in the hierarchy before it is removed completely. Such a structure still costs O(1) per update, but gives a much better approximation for the most popular sequences.

再浓的妆也掩不了殇 2024-12-30 15:47:24

This should take only a limited amount of RAM for a virtually unlimited-length log file.

#include <array>
#include <climits>
#include <fstream>
#include <iostream>
#include <unordered_map>

typedef int pageid;
typedef int userid;
typedef std::array<pageid, 3> sequence; // raw arrays can't be hash-map keys, so use std::array
typedef int sequence_count;

struct sequence_hash { // std::unordered_map needs a hash for the sequence key
    std::size_t operator()(const sequence& s) const {
        return (std::size_t(s[0]) * 1000003u + s[1]) * 1000003u + s[2];
    }
};

const int num_pages = 1000; //where 1-1000 inclusive are valid pageids
const int num_passes = 4;

int main() {
    std::unordered_map<userid, sequence> userhistory;
    std::unordered_map<sequence, sequence_count, sequence_hash> visits;
    sequence_count max_count = 0;
    sequence max_sequence = {};
    userid curuser;
    pageid curpage;
    for (int pass = 0; pass < num_passes; ++pass) { //have to go in four passes
        userhistory.clear();
        visits.clear(); //only keep counters for this pass's page range in RAM
        std::ifstream logfile("log.log");
        pageid minpage = num_pages / num_passes * pass; //range the oldest page of a triple must fall in
        pageid maxpage = num_pages / num_passes * (pass + 1) + 1;
        if (pass == num_passes - 1) //if it's the last pass, fix rounding errors
            maxpage = INT_MAX;
        while (logfile >> curuser >> curpage) { //read in a line
            sequence& curhistory = userhistory[curuser]; //find that user's history
            curhistory[2] = curhistory[1];
            curhistory[1] = curhistory[0];
            curhistory[0] = curpage; //push back the new page for that user
            //if they visited three pages in a row, and the oldest page is in this pass's range
            if (curhistory[2] > minpage && curhistory[2] < maxpage) {
                sequence_count& count = visits[curhistory]; //get times this sequence was hit
                ++count; //and increase it
                if (count > max_count) { //if that's a new max
                    max_count = count; //update the max
                    max_sequence = curhistory; //std::array, so this copies the three pages
                }
            }
        }
    }
    std::cout << "The sequence visited the most is:\n";
    std::cout << max_sequence[2] << '\n';
    std::cout << max_sequence[1] << '\n';
    std::cout << max_sequence[0] << '\n';
    std::cout << "with " << max_count << " visits.\n";
}

Note that if your pageid or userid are strings instead of ints, you'll pay a significant speed/size/caching penalty.

[EDIT2] It now works in 4 (customizable) passes, which means it uses less memory, making it realistic to do this work in RAM. It just runs proportionately slower.

固执像三岁 2024-12-30 15:47:24

If you have 1000 web pages then you have 1 billion possible 3-page sequences. If you have a simple array of 32-bit counters then you'd use 4GB of memory. There might be ways to prune this down by discarding data as you go, but if you want to guarantee to get the correct answer then this is always going to be your worst case - there's no avoiding it, and inventing ways to save memory in the average case will make the worst case even more memory hungry.

On top of that, you have to track the users. For each user you need to store the last two pages they visited. Assuming the users are referred to by name in the logs, you'd need to store the users' names in a hash table, plus the two page numbers, so let's say 24 bytes per user on average (probably conservative - I'm assuming short user names). With 1000 users that would be 24KB; with 1000000 users 24MB.

Clearly the sequence counters dominate the memory problem.

If you do only have 1000 pages then 4GB of memory is not unreasonable on a modern 64-bit machine, especially with a good amount of disk-backed virtual memory. If you don't have enough swap space, you could just create an mmapped file (on Linux - I presume Windows has something similar), and rely on the OS to keep the most-used counters cached in memory.
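
A sketch of that mmap idea on Linux (the file name triple_counts.bin and the 1-based page IDs are illustrative): the 4 GB counter array lives in a file and the kernel decides which parts of it stay resident.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#include <cstdint>
#include <cstdio>

int main() {
    const std::size_t num_pages = 1000;
    const std::size_t num_counters = num_pages * num_pages * num_pages; // 10^9 triples
    const std::size_t bytes = num_counters * sizeof(std::uint32_t);     // ~4 GB

    int fd = open("triple_counts.bin", O_RDWR | O_CREAT, 0644);         // illustrative path
    if (fd < 0 || ftruncate(fd, bytes) != 0) { perror("open/ftruncate"); return 1; }

    // The OS pages hot counters into RAM and writes cold ones back to disk.
    auto* counts = static_cast<std::uint32_t*>(
        mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    if (counts == MAP_FAILED) { perror("mmap"); return 1; }

    auto bump = [&](std::uint32_t p1, std::uint32_t p2, std::uint32_t p3) {
        // page IDs are assumed 1-based, so shift to 0-based before flattening
        std::size_t idx = ((p1 - 1) * num_pages + (p2 - 1)) * num_pages + (p3 - 1);
        ++counts[idx];
    };
    bump(2, 3, 4);   // e.g. one occurrence of the sequence 2-3-4

    munmap(counts, bytes);
    close(fd);
}

On most Linux filesystems ftruncate creates a sparse file, so disk space is only consumed for counter pages that are actually written.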

So, basically, the maths dictates that if you have a large number of pages to track, and you want to be able to cope with the worst case, then you're going to have to accept that you'll have to use disk files.

I think that a limited-capacity hash table is probably the right answer. You could probably optimize it for a specific machine by sizing it according to the memory available. Having got that, you need to handle the case where the table reaches capacity. It may not need to be terribly efficient if you rarely get there. Here are some ideas:

  • Evict the least commonly used sequences to file, keeping the most common in memory. You'd need two passes over the table: one to determine which counts are below average, and one to do the eviction. You'd also need some way to know where you put each entry whenever you get a hash miss, which might prove tricky.

  • Just dump the whole table to file, and build a new one from scratch. Repeat. Finally, recombine the matching entries from all the tables. The last part might also prove tricky.

  • Use an mmapped file to extend the table. Ensure that the file is used primarily for the least-commonly used sequences, as in my first suggestion. Basically, you'd simply use it as virtual memory - the file would be meaningless later, after the addresses have been forgotten, but you wouldn't need to keep it that long. I'm assuming there isn't enough regular virtual memory here, and/or you don't want to use it. Obviously, this is for 64-bit systems only.

红玫瑰 2024-12-30 15:47:24

I think you only have to store the most recently seen triple for each userid, right?
So you have two hash tables. The first is keyed on userid with the most recently seen triple as its value, so its size equals the number of userids.

EDIT: this assumes the file is already sorted by timestamp.

The second hash table has a key of userid:page-triple, and a value of count of times seen.

I know you said C++, but here's some awk which does this in a single pass (it should be pretty straightforward to convert to C++; a rough translation follows the awk):

#  $1 is userid, $2 is pageid

{
    old = ids[$1];          # map with id, most-recently-seen triple
    split(old,oldarr,"-"); 
    oldarr[1]=oldarr[2]; 
    oldarr[2]=oldarr[3]; 
    oldarr[3] = $2; 
    ids[$1]=oldarr[1]"-"oldarr[2]"-"oldarr[3]; # save new most-recently-seen
    tripleid = $1":"ids[$1];  # build a triple-id of userid:triple
    if (oldarr[1] != "") { # don't accumulate incomplete triples
        triples[tripleid]++; }   # count this triple-id
}
END {
    MAX = 0;
    for (tid in triples) {
        print tid" "triples[tid];
        if (triples[tid] > MAX) { MAX = triples[tid]; MAXTID = tid; }  # remember both the count and its triple-id
    }
    print "MAX is->" MAXTID" seen "MAX" times";
}
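
A rough C++ translation of the awk above (my assumptions: input comes from standard input as whitespace-separated "userid pageid" pairs, and page IDs never contain the '-' character):

#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    // userid -> "p1-p2-p3", the most recently seen triple for that user
    std::unordered_map<std::string, std::string> ids;
    // "userid:p1-p2-p3" -> number of times that user produced that triple
    std::unordered_map<std::string, std::uint64_t> triples;

    std::string user, page;
    while (std::cin >> user >> page) {
        std::string& old = ids[user];
        // split the old triple, drop its oldest page, append the new one
        std::string p2, p3;
        auto first = old.find('-'), second = old.rfind('-');
        if (first != std::string::npos) {
            p2 = old.substr(first + 1, second - first - 1);
            p3 = old.substr(second + 1);
        }
        old = p2 + "-" + p3 + "-" + page;
        if (!p2.empty())                        // only count complete triples
            ++triples[user + ":" + old];
    }

    std::uint64_t max = 0;
    std::string maxtid;
    for (const auto& [tid, count] : triples) {
        std::cout << tid << " " << count << "\n";
        if (count > max) { max = count; maxtid = tid; }
    }
    std::cout << "MAX is->" << maxtid << " seen " << max << " times\n";
}
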
谜兔 2024-12-30 15:47:24

If you are using Unix, the sort command can cope with arbitrarily large files. So you could do something like this:

  1. sort -k1,1 -s logfile > sorted (note that this is a stable sort (-s) on the first column)
  2. Perform some custom processing of sorted that outputs each triplet as a new line to another file, say triplets, using either C++ or a shell script (a C++ sketch of this step follows the list). So in the example given you get a file with three lines: 1-2-3, 2-3-4, 2-3-4. This processing is quick because Step 1 means that you are only dealing with one user's visits at a time, so you can work through the sorted file a line at a time.
  3. sort triplets | uniq -c | sort -r -n | head -1 should give the most common triplet and its count (it sorts the triplets, counts the occurrences of each, sorts them in descending order of count and takes the top one).
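
A sketch of the custom processing in step 2 (the file names sorted and triplets come from step 1; the two-column "userid pageid" layout of the sorted file is assumed):

#include <fstream>
#include <string>

// Reads the userid-sorted file ("sorted") and writes one line per
// three-pages-in-a-row triple to "triplets".
int main() {
    std::ifstream in("sorted");
    std::ofstream out("triplets");

    std::string user, page, prevuser;
    std::string p1, p2;                       // the previous two pages of prevuser
    while (in >> user >> page) {
        if (user != prevuser) {               // new user: reset the window
            prevuser = user;
            p1.clear();
            p2.clear();
        }
        if (!p1.empty() && !p2.empty())       // we have three pages in a row
            out << p1 << "-" << p2 << "-" << page << "\n";
        p1 = p2;                              // slide the window
        p2 = page;
    }
}

On the sample data this writes exactly the three lines 1-2-3, 2-3-4, 2-3-4 described above.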

This approach might not have optimal performance, but it shouldn't run out of memory.
