Delphi:高效快速的 Unicode 文本搜索
Unicode 文本/字符串中是否存在快速高效的文本搜索?我也需要搜索单词的一部分,而不仅仅是整个单词。
搜索缓冲区?
谢谢!
Is there a fast and efficient text search in a Unicode text/string? I need to search a part of a word too, not only a whole word.
SearchBuf?
Thanks!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
正如已经指出的那样,最快的方法取决于很多因素,最重要的是您是否需要能够重复搜索。第二个问题是,真正拥有“最快”的方法而不是相当快速的方法对您来说有多重要,以及您愿意在优化上投入的时间。
重复搜索
如果您需要重复搜索,据我所知,最有效的字符串搜索方法是使用 后缀数组(通常与 Burrows-Wheeler 结合使用变换)。这种方法广泛用于生物信息学,人们经常需要在非常大的数据集上处理大量的字符串搜索(例如这里)。一个非常好的后缀数组(和 BWT)库是 libdivsufsort C 库,但不幸的是我不知道 Delphi 端口这个图书馆。 (我相信这个库能够处理 unicode 字符串。)
单次搜索
如果您不需要重复搜索,强力字符串搜索算法可能会很有效,例如汇编优化的 <
Pos
和朋友的 href="http://fastcode.sourceforge.net/">FastCode 版本。然而,这些是在 Delphi 统一编码之前编写的,据我所知没有类似的优化统一编码实现。如果我今天要写一个并希望针对现代处理器(支持 SSE4.2 指令集)对其进行优化,我会认真查看PCMPESTRI
汇编指令(参考 pdf 这里< /a>;另见例如此处,但我不知道该代码是否有效),它可以处理您需要的 2 字节字符用于 unicode 字符串搜索。As has already been pointed out, the fastest way of doing this depends on a number of things, most importantly whether you need to be able to search repeatedly or not. The second question is how important is it to you to really have the "fastest" approach rather than a reasonably fast approach and the amount of time you are willing to invest in optimisations.
Repeated searches
If you need to search repeatedly, the most efficient way for string searching I know of is by the use of suffix arrays (often combined with Burrows-Wheeler transforms). This approach is used extensively in bioinformatics where one often has to deal with a huge number of string searches over really large data sets (e.g. here). A very good suffix array (and BWT) library is the libdivsufsort C library, but unfortunately I know of no Delphi port of this library. (I believe this library is capable of handling unicode strings.)
Single searches
If you don't need to search repeatedly, a brute-force string search algorithm can be efficient, for instance the assembly-optimised FastCode versions of
Pos
and friends. These were, however, written before Delphi was unicode-ified and I know of no similar optimised unicode implementations. If I were to write one today and wanted to optimise it for a modern processor (capable of the SSE4.2 instruction set), I would have a serious look at thePCMPESTRI
assembly instruction (reference pdf here; see also e.g. here, but I have no idea whether that code is working), which can handle the 2-byte characters you'd need for unicode string searching.从其实现来看,
SearchBuf
比Pos/PosEx
慢。但它确实有其他选项,例如全字搜索和不区分大小写的搜索。出于您的目的,
Pos
的 UnicodeString 版本恕我直言,比PosEx
慢(Pos
使用最慢的rep scansw
asm 而不是两个 Widechar-unrolled 比较PosEx
- 至少在 Delphi 2010 中)。由于我猜您希望为搜索提供一个起始偏移量(创建用于调用Pos
的子字符串非常慢),因此您希望使用PosEx
。鲍尔-更多的算法可能会更快。您必须在您的应用程序中尝试使用真实数据,以猜测是否值得。
如果您想搜索英文文本,使用 UTF-8 进行存储值得一试。搜索速度会更快,因为您将使用更少的内存。
但我猜想您的应用程序的主要瓶颈将在存储/检索部分(即访问磁盘或取消/压缩),而不是在搜索部分。使用软件分析器来猜测哪些内容值得尝试增强:
From its implementation,
SearchBuf
is slower thanPos/PosEx
. But it does have other options, like whole word search and case insensitive search.For your purpose, the UnicodeString version of
Pos
is IMHO slower thanPosEx
(Pos
use slowestrep scansw
asm instead of two widechar-unrolled compare forPosEx
- at least in Delphi 2010). Since I guess you'd like to have a start offset for your search (creating a sub-string for callingPos
is dead slow), you'd like to usePosEx
.A Bower-More algorithm could be probably faster. You have to try in your application, on real data, to guess if it's worth it.
And if you want to search for English text, using UTF-8 for your storage is worth making a try. The search will be faster, since you'll use less memory.
But I guess that the main bottleneck of your application will be in the storage / retrieval part (i.e. access to disk or un/compression), not in the search part. Use a software profiler to guess what is worth trying to enhance:
对于某些情况,一种快速有效的搜索算法是(如果正在搜索的数据没有改变,并且您只是在相同的缓冲区内容上一次又一次地搜索)搜索一次并构建某种查找表。一种可能的解决方案是 BoyerMoore(请参阅 Rudy 评论中的链接),但由于该解决方案对于真正高范围的 unicode 字符(例如范围 $0000...$FFFF)效果不佳,因此它仍然比重复的 Pos 或 SearchBuf 调用更快,但当我使用真正的高范围 Unicode 数据集(例如中文文本)对其进行测试时,它消耗了大量内存。我的观点是,不存在“灌篮高手”的解决方案。
也许您可以编写一个比 Pos 更快但内存不像 BoyerMoore 那么多的解决方案,如下所示:
为所有字符点构建一个表并存储每个字符出现的第一个位置。我将使用稀疏排序数组,对于您所做的每次搜索,存储该初始字符的每个起始位置。这将使您免于通过大型 unicode 字符串进行强力搜索以查找初始字符匹配。
如果表中不包含特定字符的条目,则必须进行强力搜索(仅一次)来构建该数据集。如果暴力搜索一次,代码点 U+1234 没有出现,那么 U+1234 的条目应该是一个空数组
[]
,表明您不必再次搜索,并且可以如果初始字符不匹配,搜索很快就会失败。一旦您找到了像这样的非空搜索初始字符列表
[123,456,789]
(初始代码点匹配),您可以继续在 string[123 处逐个字符检查...x] 然后 string[456..x] 等等,直到用完数组中的元素。任何不匹配都会导致跳到查找表中的下一个元素。再次出现病态情况,所有这些额外的工作都会导致代码比 Pos 慢,除非您可以确定需要在完全相同的大字符串中搜索许多不同的子字符串,否则所有这些“优化” “做事是浪费时间;忘记它吧,当您至少无法多次重复使用查找数据表时,即使 boyer moore 字符串搜索也会变慢。
如果您想要的只是一个在 Assembly 中手动优化的 Pos() 函数,以便在不生成或消耗任何中间表存储的库函数范围内尽可能快地工作,那么恭喜,Pos() 是可能已经尽可能快了(我相信几年前 FastCode 的人已经对其进行了优化)。
One kind of fast and efficient search algorithm for some cases, is one that (if the data being searched is not changing, and you just search again and again on the same buffer content) searches once and builds some kind of lookup table. One possible solution is BoyerMoore (see link in comment by Rudy) but as that one didn't work great for really high range unicode characters (say range $0000...$FFFF), it was still faster than repeated Pos or SearchBuf calls, but it consumed a LOT of memory when I tested it with truly high-range Unicode data-sets (Chinese text for instance). My opinion is there is no "slam dunk" solution.
Perhaps you could write a faster-than-Pos-but-not-so-much-memory-as-BoyerMoore solution like this:
Build a table for all character points and store the first location that each character appears. I would use a sparse sorted array, for each search you have done, store each starting location for that initial character. That will spare you the brute force search through the big unicode string looking for the initial character matches.
If the table contains NO entry for a particular character, you have to do a brute force search (just once) to build that data set up. If you search once by brute force and codepoint U+1234 does not appear, then the entry for U+1234 should be an empty array
[]
indicating that you don't have to search again, and can quickly fail out on searches where the initial character doesn't match.Once you have found a non-empty search initial-character list of positions like this
[123,456,789]
, an initial codepoint match, you can continue by doing a character by character check at string[123...x] then string[456..x] and so on until you run out of elements in the array. Any mismatch causes a skip to the next element in the table of lookups.Again there will be pathological cases where all this extra work results in code that is less fast than Pos is, and unless you can be sure you need to search for a lot of different substrings in exactly the same big string, then all this "optimization" stuff is a waste of time; forget it, even boyer moore string searches are slower, when you can't reuse the lookup data table multiple times at least.
If all you want is a Pos() function that is hand-optimized in Assembly to work as fast as is possible within the confines of a library function that doesn't produce or consume any intermediate table storage, then congratulations, Pos() is already probably about as fast as you can get (I believe it was optimized by the FastCode people a few years ago).