Fastest way to search for multiple words in a very large text file (10 GB) using C++

Posted on 2024-12-08 02:03:53

I have this program where I have to search for specific values and their line numbers in a very large text file, and there might be multiple occurrences of the same value.

I've tried a simple C++ program which reads the text file line by line and searches for the value using strstr, but it's taking a very long time.

I also tried a system command using grep, but it still takes a lot of time; not as long as before, but still too much.

I was searching for a library I can use to speed up the search.
Any help and suggestions? Thank you :)

Comments (6)

夜司空 2024-12-15 02:03:53

There are two issues concerning the speed: the time it takes to actually
read the data, and the time it takes to search.

Generally speaking, the fastest way to read a file is to mmap it (or
the equivalent under Windows). This can get complicated if the entire
file won't fit into the address space, but you mention 10GB in the
header; if searching is all you do in the program, this shouldn't create
any problems.
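
A minimal sketch of that approach (POSIX only; under Windows you would use CreateFileMapping/MapViewOfFile instead), with the file name and search string as placeholders:

#include <algorithm>
#include <cstdio>
#include <cstring>
#include <iostream>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    const char* path = "big.txt";    // placeholder file name
    const char* needle = "value";    // placeholder search string
    size_t nlen = std::strlen(needle);

    int fd = open(path, O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) < 0) { std::perror("fstat"); return 1; }

    // Map the whole file read-only; the kernel pages it in on demand.
    char* data = static_cast<char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (data == MAP_FAILED) { std::perror("mmap"); return 1; }

    const char* p = data;
    const char* end = data + st.st_size;
    long line = 1;
    while (p < end) {
        const char* nl = static_cast<const char*>(
            std::memchr(p, '\n', end - p));
        const char* eol = nl ? nl : end;
        if (std::search(p, eol, needle, needle + nlen) != eol)
            std::cout << "line " << line << '\n';
        if (!nl) break;
        p = nl + 1;
        ++line;
    }

    munmap(data, st.st_size);
    close(fd);
}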

More generally, if speed is a problem, avoid using getline on a
string. Reading large blocks, and picking the lines up (as char[])
out of them, without copying, is significantly faster. (As a simple
compromise, you may want to copy when a line crosses a block boundary.
If you're dealing with blocks of a MB or more, this shouldn't be too
often; I've used this technique on older, 16 bit machines, with blocks
of 32KB, and still gotten a significant performance improvement.)
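
A sketch of that block-reading scheme, with the same placeholder names: it reads 4 MB at a time, walks the lines in place with memchr, and copies only the partial line left at the end of each block (it assumes no single line exceeds the block size):

#include <algorithm>
#include <cstdio>
#include <cstring>
#include <iostream>
#include <vector>

int main() {
    const char* needle = "value";               // placeholder search string
    size_t nlen = std::strlen(needle);

    std::FILE* f = std::fopen("big.txt", "rb"); // placeholder file name
    if (!f) { std::perror("fopen"); return 1; }

    const size_t BLOCK = 4 * 1024 * 1024;       // 4 MB blocks
    std::vector<char> buf(BLOCK);
    size_t carry = 0;                           // partial line kept from last block
    long line = 1;

    for (;;) {
        size_t n = std::fread(buf.data() + carry, 1, buf.size() - carry, f);
        size_t avail = carry + n;
        if (avail == 0) break;

        char* p = buf.data();
        char* end = p + avail;
        char* nl;
        while ((nl = static_cast<char*>(std::memchr(p, '\n', end - p)))) {
            if (std::search(p, nl, needle, needle + nlen) != nl)
                std::cout << "line " << line << '\n';
            p = nl + 1;
            ++line;
        }
        if (n == 0) {   // EOF: whatever is left is the unterminated last line
            if (std::search(p, end, needle, needle + nlen) != end)
                std::cout << "line " << line << '\n';
            break;
        }
        carry = end - p;                        // move the partial line to the front
        std::memmove(buf.data(), p, carry);
    }
    std::fclose(f);
}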

With regards to searching, if you're searching for a single, fixed
string (not a regular expression or other pattern matching), you might
want to try a BM search. If the string you're searching for is
reasonably long, this can make a significant difference over other
search algorithms. (I think that some implementations of grep will
use this if the search pattern is in fact a fixed string, and is
sufficiently long for it to make a difference.)
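
If C++17 is available, you don't even have to hand-roll BM: the standard library ships std::boyer_moore_searcher, which builds the skip tables once and can then be reused across the whole buffer. A minimal sketch, with the needle as a placeholder:

#include <algorithm>
#include <functional>   // std::boyer_moore_searcher (C++17)
#include <iostream>
#include <string>

int main() {
    std::string haystack = "...the file contents, read or mapped as above...";
    std::string needle = "value";   // placeholder search string

    // The skip tables are built once, in the searcher's constructor.
    std::boyer_moore_searcher searcher(needle.begin(), needle.end());

    auto it = haystack.cbegin();
    while (true) {
        it = std::search(it, haystack.cend(), searcher);
        if (it == haystack.cend()) break;
        std::cout << "match at offset " << (it - haystack.cbegin()) << '\n';
        ++it;   // resume just past the start of this match
    }
}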

请远离我 2024-12-15 02:03:53

Use multiple threads. Each thread can be responsible for searching through a portion of the file. For example, on a 4-core machine spawn 12 threads. The first thread looks through the first 8% of the file, the second thread the second 8% of the file, and so on. You will want to tune the number of threads per core to keep the CPU maximally utilized. Since this is an I/O-bound operation you may never reach 100% CPU utilization.

Feeding data to the threads will be the bottleneck in this design. Memory mapping the file might help somewhat, but at the end of the day the disk can only read one sector at a time. This will be a bottleneck that you will be hard-pressed to resolve. You might consider starting one thread that does nothing but read all the data into memory, and kicking off search threads as the data loads up.
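
A rough sketch of that design (file name and pattern are placeholders): the file is read into memory once, slice boundaries are snapped forward to the next newline so no line is split between threads, and each thread reports byte offsets (counting '\n' characters per slice would recover line numbers):

#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

int main() {
    const std::string needle = "value";           // placeholder search string
    std::ifstream f("big.txt", std::ios::binary); // placeholder file name
    std::string data((std::istreambuf_iterator<char>(f)),
                     std::istreambuf_iterator<char>());

    unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::mutex out_mutex;
    std::vector<std::thread> pool;

    // Slice start offsets, snapped forward to the character after a '\n',
    // so no line is split between two threads.
    std::vector<size_t> starts(nthreads + 1, data.size());
    starts[0] = 0;
    for (unsigned i = 1; i < nthreads; ++i) {
        size_t nl = data.find('\n', data.size() / nthreads * i);
        starts[i] = (nl == std::string::npos) ? data.size() : nl + 1;
    }

    for (unsigned i = 0; i < nthreads; ++i) {
        pool.emplace_back([&, i] {
            // As long as the needle contains no '\n', a match that starts
            // inside this slice also ends inside it.
            size_t pos = starts[i];
            while (pos < starts[i + 1]) {
                size_t hit = data.find(needle, pos);
                if (hit == std::string::npos || hit >= starts[i + 1]) break;
                std::lock_guard<std::mutex> lk(out_mutex);
                std::cout << "match at offset " << hit << '\n';
                pos = hit + 1;
            }
        });
    }
    for (auto& t : pool) t.join();
}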

决绝 2024-12-15 02:03:53

Since files are sequential beasts, searching from start to end is something you may not get around; however, there are a couple of things you could do.

If the data is static you could generate a smaller lookup file (alternatively, one with offsets into the main file); this works well if the same string is repeated multiple times, since that makes the index file much smaller. If the file is dynamic you may need to regenerate the index file occasionally (offline); a sketch of such a builder follows below.
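
A minimal sketch of that offline pass, with all file names illustrative: it records the line numbers on which each word occurs, so later lookups never have to touch the 10 GB file.

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    std::ifstream in("big.txt");    // placeholder: the big text file
    std::unordered_map<std::string, std::vector<long>> index;

    std::string line, word;
    long line_no = 0;
    while (std::getline(in, line)) {
        ++line_no;
        std::istringstream words(line);
        while (words >> word)
            index[word].push_back(line_no);
    }

    // Write the index as "word n1 n2 n3 ..." lines; regenerate this file
    // whenever the big file changes.
    std::ofstream out("big.idx");   // placeholder: the lookup file
    for (const auto& [w, lines] : index) {
        out << w;
        for (long n : lines) out << ' ' << n;
        out << '\n';
    }
}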

Instead of reading line by line, read larger chunks from the file (several MB, say) to speed up I/O.

两人的回忆 2024-12-15 02:03:53

If you'd like to use a library, you could use Xapian.

You may also want to try tokenizing your text before doing the search. I'd suggest trying regular expressions too, but that will take a long time if you don't have an index on the text, so I'd definitely suggest trying Xapian or some other search engine.
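
A rough indexing sketch along the lines of Xapian's quickstart examples; the one-document-per-line layout and the paths are assumptions for illustration, not the only way to structure it:

// Build with: g++ xapian_index.cpp $(xapian-config --libs --cxxflags)
#include <xapian.h>
#include <fstream>
#include <iostream>
#include <string>

int main() {
    // One Xapian document per line; the document's data payload stores
    // the line number so a search can report it.
    Xapian::WritableDatabase db("lineindex", Xapian::DB_CREATE_OR_OPEN);
    Xapian::TermGenerator indexer;

    std::ifstream in("big.txt");    // placeholder file name
    std::string line;
    long line_no = 0;
    while (std::getline(in, line)) {
        ++line_no;
        Xapian::Document doc;
        doc.set_data(std::to_string(line_no));
        indexer.set_document(doc);
        indexer.index_text(line);
        db.add_document(doc);
    }

    // Query: find the line numbers containing "foo".
    Xapian::Enquire enquire(db);
    Xapian::QueryParser qp;
    qp.set_database(db);
    enquire.set_query(qp.parse_query("foo"));
    Xapian::MSet matches = enquire.get_mset(0, 100);
    for (auto m = matches.begin(); m != matches.end(); ++m)
        std::cout << "line " << m.get_document().get_data() << '\n';
}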

赏烟花じ飞满天 2024-12-15 02:03:53

If your big text file does not change often then create a database (for example SQLite) with a table:

create table word_line_numbers
  (word varchar(100), line_number integer);

Read your file and insert a record in database for every word with something like this:

insert into word_line_numbers(word, line_number) values ('foo', 13452);
insert into word_line_numbers(word, line_number) values ('foo', 13421);
insert into word_line_numbers(word, line_number) values ('bar', 1421);

Create an index of words:

create index word_line_numbers_idx on word_line_numbers(word);

And then you can find line numbers for words fast using this index:

select line_number from word_line_numbers where word='foo';

For added speed (because of smaller database size) and complexity you can use 2 tables: words(word_id integer primary key, word not null) and word_lines(word_id integer not null references words, line_number integer not null).
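
Queried from C++ through the SQLite C API, the lookup might look like this (the database file and the word are placeholders):

// Build with: g++ lookup.cpp -lsqlite3
#include <sqlite3.h>
#include <iostream>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("words.db", &db) != SQLITE_OK) {   // placeholder file
        std::cerr << sqlite3_errmsg(db) << '\n';
        return 1;
    }

    const char* sql =
        "select line_number from word_line_numbers where word = ?;";
    sqlite3_stmt* stmt = nullptr;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK) {
        std::cerr << sqlite3_errmsg(db) << '\n';
        return 1;
    }
    sqlite3_bind_text(stmt, 1, "foo", -1, SQLITE_STATIC);

    // Each row is one line number on which the word occurs.
    while (sqlite3_step(stmt) == SQLITE_ROW)
        std::cout << "line " << sqlite3_column_int(stmt, 0) << '\n';

    sqlite3_finalize(stmt);
    sqlite3_close(db);
}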

远昼 2024-12-15 02:03:53

I'd try first loading as much of the file into the RAM as possible (memory mapping of the file is a good option) and then search concurrently in parts of it on multiple processors. You'll need to take special care near the buffer boundaries to make sure you aren't missing any words. Also, you may want to try something more efficient than the typical strstr(), see these:
Boyer–Moore string search algorithm
Knuth–Morris–Pratt algorithm
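
For the boundary problem, one simple approach when the file is read in chunks rather than mapped whole: carry the last needle-length-minus-one bytes of each chunk over to the next, so a match straddling the boundary is still found exactly once. A sketch with placeholder names (it assumes the needle is non-empty and much smaller than the chunk):

#include <algorithm>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
    const std::string needle = "value";           // placeholder search string
    std::ifstream f("big.txt", std::ios::binary); // placeholder file name

    const size_t CHUNK = 1 << 20;                 // 1 MB reads
    std::vector<char> buf(CHUNK);
    const size_t overlap = needle.size() - 1;     // bytes carried across chunks
    size_t kept = 0;                              // bytes currently carried
    long long base = 0;                           // file offset of buf[0]

    while (f) {
        f.read(buf.data() + kept,
               static_cast<std::streamsize>(buf.size() - kept));
        size_t avail = kept + static_cast<size_t>(f.gcount());
        if (avail < needle.size()) break;

        auto it = buf.begin();
        auto end = buf.begin() + static_cast<std::ptrdiff_t>(avail);
        while ((it = std::search(it, end, needle.begin(), needle.end())) != end) {
            std::cout << "match at offset "
                      << base + (it - buf.begin()) << '\n';
            ++it;
        }

        // Keep the tail so a match crossing the boundary is seen next round;
        // the tail is too short to contain a full match, so nothing repeats.
        kept = overlap;
        base += static_cast<long long>(avail - kept);
        std::copy(end - static_cast<std::ptrdiff_t>(kept), end, buf.begin());
    }
}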
