Searching for hundreds of patterns in huge log files

Posted on 2024-11-04 15:51:42

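I have to collect a large number of filenames from a web server's htdocs directory and then use that list to search a huge number of archived log files for the last access to each of those files.

I plan to do this in C++ with Boost. I would take the newest log first and read it backwards, checking every line against all of the filenames I collected.

If a filename matches, I read the time from the log line and save it as that file's last access. After that I no longer need to look for this file, since I only want to know its last access.

The vector of filenames still to be searched for should shrink rapidly.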
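A minimal single-threaded sketch of that plan, under some assumptions: the log lines are already in memory (oldest first), the timestamp sits between the first `[` and `]` of a line as in Apache's Common Log Format, and the helper names `extract_timestamp` and `find_last_access` are made up for illustration.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical helper: returns the text between the first '[' and the next ']'.
// Assumes a bracketed timestamp as in Apache's Common Log Format.
std::string extract_timestamp(const std::string& line) {
    const std::size_t open = line.find('[');
    if (open == std::string::npos) return "";
    const std::size_t close = line.find(']', open + 1);
    if (close == std::string::npos) return "";
    return line.substr(open + 1, close - open - 1);
}

// Scans the lines from newest (back) to oldest (front). The first hit for a
// filename is its last access, so the name is dropped from the pending set
// and never checked again.
std::unordered_map<std::string, std::string> find_last_access(
        const std::vector<std::string>& lines,
        std::unordered_set<std::string> pending) {  // by value: we erase from it
    assert(pending.count("") == 0);  // an empty name would match every line
    std::unordered_map<std::string, std::string> last_access;
    for (auto line = lines.rbegin();
         line != lines.rend() && !pending.empty(); ++line) {
        for (auto name = pending.begin(); name != pending.end(); ) {
            if (line->find(*name) != std::string::npos) {
                last_access[*name] = extract_timestamp(*line);
                name = pending.erase(name);  // erase returns the next iterator
            } else {
                ++name;
            }
        }
    }
    return last_access;
}
```

The scan stops as soon as the pending set is empty, so the cost depends on how far back the least recently accessed file was last requested.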

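I wonder how to handle this kind of problem most efficiently with multiple threads.

Do I partition the log files and let each thread search one part of the logs in memory, with a thread removing a filename from the vector whenever it finds a match, or is there a more efficient way to do this?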

回眸一遍 2024-11-11 15:51:42

Try using mmap; it will save you considerable hair loss. I was feeling expeditious and in an odd mood to recall my mmap knowledge, so I wrote a simple thing to get you started. Hope this helps!

The beauty of mmap is that the mapped file looks like one big array in memory, so the work can be easily parallelized with OpenMP. It can also help avoid an I/O bottleneck, because the kernel pages the file in on demand instead of your code copying it through read buffers. Let me first define the Logfile class and then I'll go over the implementation.

Here's the header file (logfile.h)

#ifndef LOGFILE_H
#define LOGFILE_H

#include <iostream>
#include <fcntl.h>
#include <stdio.h>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

using std::string;

class Logfile {

public:

    Logfile(string name);

    char* open();             // maps the file read-only; returns NULL on failure
    size_t get_size() const;  // size_t, so files larger than 4 GB are not truncated
    string get_name() const;
    bool close();             // unmaps the file and closes the descriptor

private:

    string name;
    char* start;
    size_t size;
    int file_descriptor;

};

#endif

And here's the .cpp file.

#include <iostream>
#include "logfile.h"

using namespace std;

Logfile::Logfile(string name){
    this->name = name;
    start = NULL;
    size = 0;
    file_descriptor = -1;
}

char* Logfile::open(){

    // get file descriptor (::open is the POSIX call, not this member function)
    file_descriptor = ::open(name.c_str(), O_RDONLY);
    if(file_descriptor < 0){
        cerr << "Error obtaining file descriptor for: " << name << endl;
        return NULL;
    }

    // get file size from the descriptor we just opened
    struct stat st;
    if(fstat(file_descriptor, &st) != 0){
        cerr << "Error reading file size for: " << name << endl;
        ::close(file_descriptor);
        file_descriptor = -1;
        return NULL;
    }
    size = st.st_size;

    // memory map part (mmap reports failure with MAP_FAILED, not NULL)
    void* mapped = mmap(NULL, size, PROT_READ, MAP_SHARED, file_descriptor, 0);
    if(mapped == MAP_FAILED){
        cerr << "Error memory-mapping the file\n";
        ::close(file_descriptor);
        file_descriptor = -1;
        return NULL;
    }

    start = (char*) mapped;
    return start;
}

size_t Logfile::get_size() const {
    return size;
}

string Logfile::get_name() const {
    return name;
}

bool Logfile::close(){

    if(start == NULL){
        cerr << "Error closing file. Was close() called without a matching open()?\n";
        return false;
    }

    // unmap memory and close file; do both even if the first one fails
    bool unmapped = munmap(start, size) != -1;
    bool closed = ::close(file_descriptor) != -1;
    start = NULL;
    file_descriptor = -1;
    return unmapped && closed;

}

Now, using this code, you can use OpenMP to work-share the parsing of these logfiles, e.g.

Logfile lf("yourfile");
const char* log = lf.open();   // check for NULL before using it
long size = (long) lf.get_size();

#pragma omp parallel shared(log, size)
{
  // declare per-thread results here

  #pragma omp for
  for (long i = 0; i < size; i++) {
     // do your routine on log[i]
  }

  #pragma omp critical
  {
     // some methods that combine the thread results
  }
}

lf.close();
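To make the skeleton above more concrete, here is one possible way to fill it in. It is only a sketch: the function names `line_starts` and `last_line_offsets` are invented for illustration, the buffer is assumed to be the mapped log, and "last access" is approximated by the byte offset of the last line that mentions a name. Each thread collects hits in a private map and merges it under `critical`, so the shared result is never written concurrently. Build with your compiler's OpenMP flag (for example `-fopenmp` with GCC); without it the pragmas are ignored and the code runs serially with the same result.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <map>
#include <string>
#include <vector>

// Returns the offset of the first byte of every line in buf[0, size).
std::vector<std::size_t> line_starts(const char* buf, std::size_t size) {
    std::vector<std::size_t> starts;
    std::size_t pos = 0;
    while (pos < size) {
        starts.push_back(pos);
        const void* nl = std::memchr(buf + pos, '\n', size - pos);
        if (nl == nullptr) break;  // last line has no trailing newline
        pos = static_cast<std::size_t>(static_cast<const char*>(nl) - buf) + 1;
    }
    return starts;
}

// For each name, finds the offset of the last line that contains it.
// Names that never occur are absent from the result.
std::map<std::string, std::size_t> last_line_offsets(
        const char* buf, std::size_t size,
        const std::vector<std::string>& names) {
    assert(buf != nullptr || size == 0);
    const std::vector<std::size_t> starts = line_starts(buf, size);
    const long line_count = static_cast<long>(starts.size());
    std::map<std::string, std::size_t> result;

    #pragma omp parallel
    {
        std::map<std::string, std::size_t> local;  // private to each thread

        #pragma omp for nowait
        for (long i = 0; i < line_count; ++i) {
            const std::size_t idx = static_cast<std::size_t>(i);
            const char* begin = buf + starts[idx];
            const char* end =
                buf + (i + 1 < line_count ? starts[idx + 1] : size);
            for (const std::string& name : names) {
                if (std::search(begin, end, name.begin(), name.end()) != end) {
                    std::size_t& best = local[name];  // created as 0 on first hit
                    best = std::max(best, starts[idx]);
                }
            }
        }

        #pragma omp critical
        {
            // Merge this thread's hits, keeping the largest offset per name.
            for (const auto& entry : local) {
                std::size_t& best = result[entry.first];
                best = std::max(best, entry.second);
            }
        }
    }
    return result;
}
```

Splitting on line boundaries rather than raw byte ranges keeps a filename from being cut in half between two threads.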

べ映画 2024-11-11 15:51:42

Parse the logfile into a database table (SQLite ftw). One of the fields will be the path.
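As a rough illustration of the parsing step, the sketch below pulls the two fields such a table would need out of one log line. It assumes Apache's Common Log Format (`host ident user [timestamp] "METHOD path PROTOCOL" status bytes`); the struct `LogEntry` and the function `parse_line` are hypothetical names, and a real log may need a different format.

```cpp
#include <string>

// The two columns the logs table needs (names are illustrative).
struct LogEntry {
    std::string path;
    std::string timestamp;
};

// Splits one access-log line into path and timestamp.
// Assumes Common Log Format:
//   host ident user [timestamp] "METHOD path PROTOCOL" status bytes
// Returns false and leaves `out` untouched if the line does not fit.
bool parse_line(const std::string& line, LogEntry& out) {
    const std::string::size_type npos = std::string::npos;

    const auto ts_open = line.find('[');
    if (ts_open == npos) return false;
    const auto ts_close = line.find(']', ts_open + 1);
    if (ts_close == npos) return false;

    const auto req_open = line.find('"', ts_close + 1);
    if (req_open == npos) return false;
    const auto req_close = line.find('"', req_open + 1);
    if (req_close == npos) return false;

    // The request looks like: METHOD path PROTOCOL
    const std::string request =
        line.substr(req_open + 1, req_close - req_open - 1);
    const auto method_end = request.find(' ');
    if (method_end == npos) return false;  // e.g. a malformed "-" request
    auto path_end = request.find(' ', method_end + 1);
    if (path_end == npos) path_end = request.size();  // no protocol part
    if (path_end == method_end + 1) return false;     // empty path

    out.path = request.substr(method_end + 1, path_end - method_end - 1);
    out.timestamp = line.substr(ts_open + 1, ts_close - ts_open - 1);
    return true;
}
```

Each successfully parsed line would become one row of the logs table, with `path` stored in the file column.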

In another table, add the files you are looking for.

Now it is a simple join against a derived table. Something like this:

SELECT f.file, l.last_access FROM toFind f
LEFT JOIN (
    SELECT file, max(last_access) AS last_access FROM logs GROUP BY file
) AS l ON f.file = l.file;

All the files in toFind will be there, and last_access will be NULL for those not found in the logs. (The file column is taken from toFind, so it is filled in even for the files that have no match.)
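For readers who want to see what the query computes, here is an in-memory C++ stand-in for it. To be clear, this is not SQLite and not how the database executes the join. It only mirrors the two steps, the grouped `max` and the left join. The type and function names are made up, and `last_access` is assumed to be stored as an integer timestamp.

```cpp
#include <map>
#include <string>
#include <vector>

// One row of the logs table; last_access as an integer timestamp (assumption).
struct LogRow {
    std::string file;
    long last_access;
};

// One row of the query result; has_access == false stands in for SQL NULL.
struct JoinedRow {
    std::string file;
    bool has_access;
    long last_access;
};

std::vector<JoinedRow> last_access_report(
        const std::vector<std::string>& to_find,
        const std::vector<LogRow>& logs) {
    // Derived table: SELECT file, max(last_access) FROM logs GROUP BY file
    std::map<std::string, long> latest;
    for (const LogRow& row : logs) {
        auto it = latest.find(row.file);
        if (it == latest.end()) {
            latest.emplace(row.file, row.last_access);
        } else if (row.last_access > it->second) {
            it->second = row.last_access;
        }
    }

    // LEFT JOIN: every wanted file yields a row, matched or not.
    std::vector<JoinedRow> report;
    for (const std::string& file : to_find) {
        const auto it = latest.find(file);
        if (it != latest.end()) {
            report.push_back({file, true, it->second});
        } else {
            report.push_back({file, false, 0});
        }
    }
    return report;
}
```

The result has exactly one row per entry in `to_find`, which is the property the left join is there to guarantee.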
