Quickly searching large files for a string in Delphi
I have a FindFile routine in my program which will list files, but if the "Containing Text" field is filled in, then it should only list files containing that text.
If the "Containing Text" field is entered, then I search each file found for the text. My current method of doing that is:
var
  FileContents: TStringList;
begin
  FileContents := TStringList.Create;
  try
    FileContents.LoadFromFile(Filepath);
    Found := Pos(TextToFind, FileContents.Text) > 0;
  finally
    FileContents.Free;
  end;
end;
The above code is simple, and it generally works okay. But it has two problems:
It fails for very large files (e.g. 300 MB)
I feel it could be faster. It isn't bad, but why wait 10 minutes searching through 1000 files, if there might be a simple way to speed it up a bit?
I need this to work for Delphi 2009 and to search text files that may or may not be Unicode. It only needs to work for text files.
So how can I speed this search up and also make it work for very large files?
Bonus: I would also want to allow an "ignore case" option. That's a tougher one to make efficient. Any ideas?
Solution:
Well, mghie pointed out my earlier question How Can I Efficiently Read The First Few Lines of Many Files in Delphi, and as I answered, it was different and didn't provide the solution.
But he got me thinking that I had done this before, and I had. I built a block-reading routine for large files that breaks them into 32 MB blocks. I use it to read the input file of my program, which can be huge. That routine works well and is fast. So step one is to do the same for these files I am looking through.
So now the question was how to efficiently search within those blocks. Well I did have a previous question on that topic: Is There An Efficient Whole Word Search Function in Delphi? and RRUZ pointed out the SearchBuf routine to me.
That solves the "bonus" as well, because SearchBuf has options which include Whole Word Search (the answer to that question) and MatchCase/noMatchCase (the answer to the bonus).
So I'm off and running. Thanks once again SO community.
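For reference, here is a rough sketch of how the block-reading plus SearchBuf combination could look. The helper name FileContainsText, the 32 MB block size, the overlap handling and the assumption that each block can be treated as ANSI text are illustrative only; this is not the asker's actual routine.

uses SysUtils, Classes, StrUtils;

// Hypothetical helper: returns True as soon as SearchText is found in the file.
// Reads the file in 32 MB blocks and keeps a small tail from the previous
// block so a match spanning two blocks is not missed.
function FileContainsText(const FilePath, SearchText: string;
  MatchCase, WholeWord: Boolean): Boolean;
const
  BlockSize = 32 * 1024 * 1024;
var
  Stream: TFileStream;
  Buf: TBytes;
  AnsiBlock: AnsiString;
  Block, Carry: string;
  BytesRead: Integer;
  Options: TStringSearchOptions;
begin
  Result := False;
  if SearchText = '' then
    Exit;
  Options := [soDown];
  if MatchCase then Include(Options, soMatchCase);
  if WholeWord then Include(Options, soWholeWord);

  Stream := TFileStream.Create(FilePath, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Buf, BlockSize);
    Carry := '';
    repeat
      BytesRead := Stream.Read(Buf[0], BlockSize);
      if BytesRead <= 0 then
        Break;
      // Treat the bytes as ANSI text for this sketch; a real routine would
      // check for a BOM and decode the block accordingly.
      SetString(AnsiBlock, PAnsiChar(@Buf[0]), BytesRead);
      Block := Carry + string(AnsiBlock);
      if SearchBuf(PChar(Block), Length(Block), 0, 0, SearchText, Options) <> nil then
      begin
        Result := True;
        Exit; // stop at the first match, no need to read the rest of the file
      end;
      // keep the last Length(SearchText) - 1 characters for the next round
      if Length(Block) >= Length(SearchText) then
        Carry := Copy(Block, Length(Block) - Length(SearchText) + 2, MaxInt)
      else
        Carry := Block;
    until BytesRead < BlockSize;
  finally
    Stream.Free;
  end;
end;

With soWholeWord and soMatchCase in the Options set, the same call covers the whole-word search and the "ignore case" bonus mentioned above.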
Comments (6)
The best approach here is probably to use memory mapped files.
First you need a file handle; use the CreateFile Windows API function for that. Then pass that to CreateFileMapping to get a file mapping handle. Finally use MapViewOfFile to map the file into memory.
To handle large files, MapViewOfFile is able to map only a certain range into memory, so you can e.g. map the first 32 MB, then use UnmapViewOfFile to unmap it, followed by a MapViewOfFile for the next 32 MB, and so on. (EDIT: as was pointed out below, make sure that the blocks you map this way overlap by a multiple of 4 KB, and by at least as much as the length of the text you are searching for, so that you are not overlooking any text which might be split at a block boundary.)
To do the actual searching once (part of) the file is mapped into memory, you can make a copy of the source for StrPosLen from SysUtils.pas (it's unfortunately defined in the implementation section only and not exposed in the interface). Leave one copy as is and make another copy, replacing Wide with Ansi every time. Also, if you want to be able to search binary files which might contain embedded #0's, you can remove the "(Str1[I] <> #0) and" part of the condition.
Either find a way to identify whether a file is ANSI or Unicode, or simply call both the Ansi and the Unicode version on each mapped part of the file.
Once you are done with each file, make sure to call CloseHandle first on the file mapping handle and then on the file handle. (And don't forget to call UnmapViewOfFile first.)
EDIT:
A big advantage of using memory mapped files instead of using e.g. a TFileStream to read the file into memory in blocks is that the bytes will only end up in memory once.
Normally, on file access, Windows first reads the bytes into the OS file cache and then copies them from there into application memory.
If you use memory mapped files, the OS can directly map the physical pages from the OS file cache into the address space of the application without making another copy (reducing the time needed for the copy and halving memory usage).
Bonus Answer: By calling StrLIComp instead of StrLComp you can do a case insensitive search.
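A rough Delphi sketch of the mapping loop described above, under some assumptions: the function name is made up, the scan is a naive byte-by-byte comparison of an ANSI needle rather than the StrPosLen copies the answer suggests, and the views are stepped by 32 MB minus one 64 KB unit (MapViewOfFile requires file offsets to be aligned to the allocation granularity) so that matches spanning a view boundary are still seen.

uses Windows, SysUtils;

// Hypothetical helper: maps the file in overlapping 32 MB views and does a
// naive byte scan for an ANSI needle. Error handling is minimal on purpose.
function FileContainsAnsi(const FileName: string; const Needle: AnsiString): Boolean;
const
  ChunkSize = 32 * 1024 * 1024;  // bytes per mapped view
  Granularity = 64 * 1024;       // MapViewOfFile offsets must be aligned to this
var
  hFile, hMap: THandle;
  SizeLo, SizeHi: DWORD;
  FileSize, Offset: Int64;
  ToMap: Cardinal;
  View, P: PAnsiChar;
begin
  Result := False;
  if Needle = '' then
    Exit;
  hFile := CreateFile(PChar(FileName), GENERIC_READ, FILE_SHARE_READ, nil,
    OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
  if hFile = INVALID_HANDLE_VALUE then
    Exit;
  try
    SizeLo := GetFileSize(hFile, @SizeHi);
    FileSize := (Int64(SizeHi) shl 32) or SizeLo;
    if FileSize = 0 then
      Exit;
    hMap := CreateFileMapping(hFile, nil, PAGE_READONLY, 0, 0, nil);
    if hMap = 0 then
      Exit;
    try
      Offset := 0;
      while (Offset < FileSize) and not Result do
      begin
        if FileSize - Offset < ChunkSize then
          ToMap := Cardinal(FileSize - Offset)
        else
          ToMap := ChunkSize;
        View := MapViewOfFile(hMap, FILE_MAP_READ,
          Cardinal(Offset shr 32), Cardinal(Offset and $FFFFFFFF), ToMap);
        if View = nil then
          Break;
        try
          // naive scan; the answer suggests the Ansi/Wide StrPosLen copies here
          P := View;
          while P <= View + ToMap - Length(Needle) do
          begin
            if CompareMem(P, PAnsiChar(Needle), Length(Needle)) then
            begin
              Result := True;
              Break;
            end;
            Inc(P);
          end;
        finally
          UnmapViewOfFile(View);
        end;
        // step by one granularity unit less than the chunk size so the views
        // overlap and a match split across a boundary is not missed
        Inc(Offset, ChunkSize - Granularity);
      end;
    finally
      CloseHandle(hMap);
    end;
  finally
    CloseHandle(hFile);
  end;
end;

For the asker's mixed ANSI/Unicode files, the same loop would also run a Wide variant of the comparison over each view, as the answer suggests.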
May I suggest a component? If so, I would recommend ATStreamSearch.
It handles ANSI and UNICODE (and even EBCDIC, Korean and more).
Or the TUTBMSearch class from JclUnicode (in the JEDI JCL). It was mainly written by Mike Lischke (of VirtualTreeview). It uses a tuned Boyer-Moore algorithm to ensure speed. The drawback in your case is that it works entirely in Unicode (WideStrings), so the conversion from String to WideString risks being a penalty.
This is a problem connected with your previous question How Can I Efficiently Read The First Few Lines of Many Files in Delphi, and the same answers apply. If you don't read the files completely but in blocks then large files won't pose a problem. There's also a big speed-up to be had for files containing the text, in that you should cancel the search upon the first match. Currently you read the whole files even when the text to be found is in the first few lines.
If you are looking for text string searches, look at the Boyer-Moore search algorithm. It uses memory mapped files and a really fast search engine. There are some Delphi units around that contain implementations of this algorithm.
To give you an idea of the speed: I currently search through 10-20 MB files and it takes on the order of milliseconds.
Oh, I just read that it might be Unicode - not sure if it supports that - but definitely look down this path.
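As an illustration of the algorithm itself (not of any particular ready-made unit), here is a minimal Boyer-Moore-Horspool scan over an ANSI byte buffer; the function name and the case-sensitive, byte-oriented behaviour are assumptions made for the sketch.

// Minimal Boyer-Moore-Horspool scan over an ANSI byte buffer: an
// illustration of the algorithm only, not one of the ready-made units.
// Returns the 0-based offset of the first match, or -1 if not found.
function BMHFind(Haystack: PAnsiChar; HayLen: Integer;
  const Needle: AnsiString): Integer;
var
  Skip: array[AnsiChar] of Integer;
  c: AnsiChar;
  NeedleLen, i, Start: Integer;
begin
  Result := -1;
  NeedleLen := Length(Needle);
  if (NeedleLen = 0) or (NeedleLen > HayLen) then
    Exit;
  // bad-character table: how far the window may shift when the byte under
  // the last needle position is c
  for c := Low(AnsiChar) to High(AnsiChar) do
    Skip[c] := NeedleLen;
  for i := 1 to NeedleLen - 1 do
    Skip[Needle[i]] := NeedleLen - i;
  Start := 0;
  while Start <= HayLen - NeedleLen do
  begin
    i := NeedleLen;
    while (i > 0) and (Haystack[Start + i - 1] = Needle[i]) do
      Dec(i);
    if i = 0 then
    begin
      Result := Start;
      Exit;
    end;
    Inc(Start, Skip[Haystack[Start + NeedleLen - 1]]);
  end;
end;

The bad-character table lets the scan skip up to Length(Needle) bytes per mismatch, which is where the speed advantage over a plain Pos comes from.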
If the files are to be searched multiple times, it could be a good idea to use a word index.
This is called "Full Text Search".
It will be slower the first time (text must be parsed and indexes must be created), but any future search will be immediate: in short, it will use only the indexes, and not read all text again.
You have the exact parser you need in The Delphi Magazine Issue 78, February 2002:
"Algorithms Alfresco: Ask A Thousand Times
Julian Bucknall discusses word indexing and document searches: if you want to know how Google works its magic this is the page to turn to."
There are several FTS implementations for Delphi.
I'd like to add that most databases have an embedded FTS engine. SQLite3 even has a very small but efficient implementation, with page ranking and such.
We provide direct access to this full-text search engine, named FTS3/FTS4, from Delphi via ORM classes.
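As a toy illustration of the word-index idea (not the Delphi Magazine parser and not the SQLite3 FTS3/FTS4 engine mentioned above), something along these lines could be built with Delphi 2009's Generics.Collections; the names and the crude whitespace tokenizer are assumptions for the sketch.

uses SysUtils, Classes, Generics.Collections;

type
  // word (lower-cased) -> list of files that contain it
  TWordIndex = TDictionary<string, TList<string>>;

// Index one file; building the index is the slow part that happens once.
procedure IndexFile(Index: TWordIndex; const FileName: string);
var
  Lines, Words: TStringList;
  i: Integer;
  Key: string;
  Files: TList<string>;
begin
  Lines := TStringList.Create;
  Words := TStringList.Create;
  try
    Lines.LoadFromFile(FileName);
    Words.Delimiter := ' ';
    Words.DelimitedText := Lines.Text; // crude whitespace tokenizer
    for i := 0 to Words.Count - 1 do
    begin
      Key := LowerCase(Words[i]);      // case-insensitive index
      if Key = '' then
        Continue;
      if not Index.TryGetValue(Key, Files) then
      begin
        Files := TList<string>.Create; // freed by whoever owns the index
        Index.Add(Key, Files);
      end;
      if (Files.Count = 0) or (Files[Files.Count - 1] <> FileName) then
        Files.Add(FileName);
    end;
  finally
    Words.Free;
    Lines.Free;
  end;
end;

// Every later search is then a single dictionary lookup:
//   if Index.TryGetValue(LowerCase(TextToFind), Files) then
//     ... Files holds every indexed file containing that word ...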
It depends on what kind of data you are going to search. To get really efficient results you will need to let your program parse the interesting directories, including all the files in them, and keep the data in a database that you can query each time for a specific word against a specific list of files (which can be generated from the search path). A database statement can give you results in milliseconds.
The issue is that you will have to let it run and parse all the files after installation, which may take more than an hour depending on the amount of data you wish to parse.
This database should be updated each time your program starts; this can be done by comparing the MD5 value of each file to see whether it has changed, so you don't have to re-parse all your files every time.
This way of working is interesting if your data lives in a fixed place and you analyse the same files repeatedly rather than completely new files each time; some code analysers work like this and they are really efficient. You invest some time in parsing and saving the interesting data, and afterwards you can jump to the exact place where a search word appears and provide a list of all the places it occurs in very little time.
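A small sketch of the "only re-parse what changed" bookkeeping. The answer proposes an MD5 per file; to keep this self-contained the stand-in fingerprint below uses the file size and timestamp instead (a real implementation would plug in an MD5 routine from whatever hash library you use). Known is a hypothetical name=value list that the indexer would persist between runs.

uses SysUtils, Classes;

// Stand-in fingerprint: size + DOS timestamp. Swap in an MD5 of the file
// contents for the check the answer actually describes.
function FileFingerprint(const FileName: string): string;
var
  SR: TSearchRec;
begin
  Result := '';
  if FindFirst(FileName, faAnyFile, SR) = 0 then
  begin
    Result := Format('%d:%d', [SR.Size, SR.Time]);
    FindClose(SR);
  end;
end;

// Re-parse a file only when its fingerprint differs from the stored one.
procedure RefreshIndexEntry(Known: TStrings; const FileName: string);
var
  Current: string;
begin
  Current := FileFingerprint(FileName);
  if Known.Values[FileName] <> Current then
  begin
    // file is new or has changed since the last run: re-parse it here
    // (e.g. feed it to the word indexer sketched in the previous answer)
    Known.Values[FileName] := Current;
  end;
end;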