Searching hundreds of patterns in huge log files

Published 2024-11-04 15:51:42


I have to get lots of filenames from inside a webserver's htdocs directory and then use this list of filenames to search a huge number of archived logfiles for the last access to these files.

I plan to do this in C++ with Boost. I would take the newest log first and read it backwards, checking every single line for all of the filenames I have.

If a filename matches, I read the time from the log string and save its last access. After that I don't need to look for this file any more, since I only want to know the last access.

The vector of filenames to search for should shrink rapidly.

I wonder how I can handle this kind of problem with multiple threads most effectively.

Do I partition the logfiles and let every thread search a part of the logs from memory, removing a filename from the vector when that thread finds a match, or is there a more effective way to do this?


Comments (3)

回眸一遍 2024-11-11 15:51:42


Try using mmap, it will save you considerable hair loss. I was feeling expeditious and in some odd mood to recall my mmap knowledge, so I wrote a simple thing to get you started. Hope this helps!

The beauty of mmap is that it can be easily parallelized with OpenMP. It's also a really good way to prevent an I/O bottleneck. Let me first define the Logfile class and then I'll go over implementation.

Here's the header file (logfile.h)

#ifndef _LOGFILE_H_
#define _LOGFILE_H_

#include <iostream>
#include <fcntl.h>
#include <stdio.h>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

using std::string;

class Logfile {

public:

    Logfile(string name);

    char* open();
    unsigned int get_size() const;
    string get_name() const;
    bool close();

private:

    string name;
    char* start;
    unsigned int size;
    int file_descriptor;

};

#endif

And here's the .cpp file.

#include <iostream>
#include "logfile.h"

using namespace std;

Logfile::Logfile(string name){
    this->name = name;
    start = NULL;
    size = 0;
    file_descriptor = -1;

}

char* Logfile::open(){

    // get file size
    struct stat st;
    if(stat(name.c_str(), &st) < 0){
        cerr << "Error stat-ing file: " << name << endl;
        return NULL;
    }
    size = st.st_size;

    // get file descriptor (::open is the open(2) syscall, not this method)
    file_descriptor = ::open(name.c_str(), O_RDONLY);
    if(file_descriptor < 0){
        cerr << "Error obtaining file descriptor for: " << name << endl;
        return NULL;
    }

    // memory-map the whole file read-only
    start = (char*) mmap(NULL, size, PROT_READ, MAP_SHARED, file_descriptor, 0);
    if(start == MAP_FAILED){
        cerr << "Error memory-mapping the file\n";
        ::close(file_descriptor);
        start = NULL;
        return NULL;
    }

    return start;
}

unsigned int Logfile::get_size() const {
    return size;
}

string Logfile::get_name() const {
    return name;
}

bool Logfile::close(){

    if( start == NULL){
        cerr << "Error closing file. Was close() called without a matching open()?\n";
        return false;
    }

    // unmap memory and close file
    bool ret = munmap(start, size) != -1 && ::close(file_descriptor) != -1;
    start = NULL;
    return ret;

}

Now, using this code, you can use OpenMP to work-share the parsing of these logfiles, i.e.

Logfile lf ("yourfile");
char * log = lf.open();
int size = (int) lf.get_size();

#pragma omp parallel shared(log, size)
{
  #pragma omp for
  for (int i = 0 ; i < size ; i++) {
     // do your routine (the loop variable is private automatically)
  }
  #pragma omp critical
  {
     // combine the per-thread results here
  }
}
べ映画 2024-11-11 15:51:42


Parse the logfile into a database table (SQLite ftw). One of the fields will be the path.

In another table, add the files you are looking for.

Now it is a simple join on a derived table. Something like this.

SELECT l.file, l.last_access FROM toFind f
LEFT JOIN ( 
    SELECT file, max(last_access) as last_access from logs group by file
) as l ON f.file = l.file

All the files in toFind will be there, and will have last_access NULL for those not found in the logs.
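For completeness, a sketch of the minimal schema the join above assumes (table and column names taken from the query; the types are illustrative):

```sql
-- log lines parsed into rows; one field is the path, one the timestamp
CREATE TABLE logs (
    file        TEXT NOT NULL,
    last_access TEXT NOT NULL
);

-- the filenames collected from htdocs
CREATE TABLE toFind (
    file TEXT PRIMARY KEY
);
```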

顾铮苏瑾 2024-11-11 15:51:42


Ok, this is some days ago already, but I spent some time writing code and working with SQLite in other projects.

I still wanted to compare the DB approach with the MMAP solution, just for the performance aspect.

Of course it saves you a lot of work if you can use SQL queries to handle all the data you parsed. But I didn't really care about the amount of work, because I'm still learning a lot, and what I learned from this is:

This MMAP approach - if you implement it correctly - is absolutely superior in performance. It's unbelievably fast, which you will notice if you implement the "word-count" example, which can be seen as the "hello world" of MapReduce algorithms.

Now if you further want to benefit from the SQL language, the correct approach would be implementing your own SQL wrapper that also uses a kind of Map-Reduce, by sharing queries amongst threads.

You could perhaps share objects by ID amongst threads, where every thread handles its own DB connection. It then queries objects in its own part of the dataset.

This would be much faster than just writing things to an SQLite DB the usual way.

After all you can say:

- MMAP is the fastest way to handle string processing
- SQL provides great functionality for parser applications, but it slows things down if you don't implement a wrapper for processing the SQL queries
