Delphi:高效快速的 Unicode 文本搜索

发布于 2024-11-25 10:03:44 字数 86 浏览 1 评论 0原文

Unicode 文本/字符串中是否存在快速高效的文本搜索?我也需要搜索单词的一部分,而不仅仅是整个单词。

搜索缓冲区?

谢谢!

Is there a fast and efficient text search in a Unicode text/string? I need to search a part of a word too, not only a whole word.

SearchBuf?

Thanks!

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

情丝乱 2024-12-02 10:03:44

正如已经指出的那样,最快的方法取决于很多因素,最重要的是您是否需要能够重复搜索。第二个问题是,真正拥有“最快”的方法而不是相当快速的方法对您来说有多重要,以及您愿意在优化上投入的时间。

重复搜索

如果您需要重复搜索,据我所知,最有效的字符串搜索方法是使用 后缀数组(通常与 Burrows-Wheeler 结合使用变换)。这种方法广泛用于生物信息学,人们经常需要在非常大的数据集上处理大量的字符串搜索(例如这里)。一个非常好的后缀数组(和 BWT)库是 libdivsufsort C 库,但不幸的是我不知道 Delphi 端口这个图书馆。 (我相信这个库能够处理 unicode 字符串。)

单次搜索

如果您不需要重复搜索,强力字符串搜索算法可能会很有效,例如汇编优化的 < Pos 和朋友的 href="http://fastcode.sourceforge.net/">FastCode 版本。然而,这些是在 Delphi 统一编码之前编写的,据我所知没有类似的优化统一编码实现。如果我今天要写一个并希望针对现代处理器(支持 SSE4.2 指令集)对其进行优化,我会认真查看 PCMPESTRI 汇编指令(参考 pdf 这里< /a>;另见例如此处,但我不知道该代码是否有效),它可以处理您需要的 2 字节字符用于 unicode 字符串搜索。

As has already been pointed out, the fastest way of doing this depends on a number of things, most importantly whether you need to be able to search repeatedly or not. The second question is how important is it to you to really have the "fastest" approach rather than a reasonably fast approach and the amount of time you are willing to invest in optimisations.

Repeated searches

If you need to search repeatedly, the most efficient way for string searching I know of is by the use of suffix arrays (often combined with Burrows-Wheeler transforms). This approach is used extensively in bioinformatics where one often has to deal with a huge number of string searches over really large data sets (e.g. here). A very good suffix array (and BWT) library is the libdivsufsort C library, but unfortunately I know of no Delphi port of this library. (I believe this library is capable of handling unicode strings.)

Single searches

If you don't need to search repeatedly, a brute-force string search algorithm can be efficient, for instance the assembly-optimised FastCode versions of Pos and friends. These were, however, written before Delphi was unicode-ified and I know of no similar optimised unicode implementations. If I were to write one today and wanted to optimise it for a modern processor (capable of the SSE4.2 instruction set), I would have a serious look at the PCMPESTRI assembly instruction (reference pdf here; see also e.g. here, but I have no idea whether that code is working), which can handle the 2-byte characters you'd need for unicode string searching.

§普罗旺斯的薰衣草 2024-12-02 10:03:44

从其实现来看,SearchBufPos/PosEx 慢。但它确实有其他选项,例如全字搜索和不区分大小写的搜索。

出于您的目的,PosUnicodeString 版本恕我直言,比 PosEx 慢(Pos 使用最慢的 rep scansw asm 而不是两个 Widechar-unrolled 比较 PosEx - 至少在 Delphi 2010 中)。由于我猜您希望为搜索提供一个起始偏移量(创建用于调用 Pos 的子字符串非常慢),因此您希望使用 PosEx

鲍尔-更多的算法可能会更快。您必须在您的应用程序中尝试使用真实数据,以猜测是否值得。

如果您想搜索英文文本,使用 UTF-8 进行存储值得一试。搜索速度会更快,因为您将使用更少的内存。

但我猜想您的应用程序的主要瓶颈将在存储/检索部分(即访问磁盘或取消/压缩),而不是在搜索部分。使用软件分析器来猜测哪些内容值得尝试增强:

  • 先做对,再做快。做之前先说清楚
    它更快。当你让它变得更快时,保持正确。 —Kernighan 和
    Plauger,《编程风格的要素》
  • 过早的优化是万恶之源。 — 唐纳德·高德纳 (Donald Knuth),
    引用 CAR Hoare
  • 性能的关键是优雅,而不是大量的特殊情况。
    应该抵制调整的可怕诱惑。 —乔恩·本特利
    和道格·麦克罗伊
  • 规则归结为:“1. 不要过早优化。2. 不要优化
    直到你知道它是需要的。 3. 即使如此,也不要进行优化,直到
    你知道需要什么,以及在哪里。”——赫伯·萨特

From its implementation, SearchBuf is slower than Pos/PosEx. But it does have other options, like whole word search and case insensitive search.

For your purpose, the UnicodeString version of Pos is IMHO slower than PosEx (Pos use slowest rep scansw asm instead of two widechar-unrolled compare for PosEx - at least in Delphi 2010). Since I guess you'd like to have a start offset for your search (creating a sub-string for calling Pos is dead slow), you'd like to use PosEx.

A Bower-More algorithm could be probably faster. You have to try in your application, on real data, to guess if it's worth it.

And if you want to search for English text, using UTF-8 for your storage is worth making a try. The search will be faster, since you'll use less memory.

But I guess that the main bottleneck of your application will be in the storage / retrieval part (i.e. access to disk or un/compression), not in the search part. Use a software profiler to guess what is worth trying to enhance:

  • Make it right before you make it fast. Make it clear before you make
    it faster. Keep it right when you make it faster. — Kernighan and
    Plauger, Elements of Programming Style
    .
  • Premature optimization is the root of all evil. — Donald Knuth,
    quoting C. A. R. Hoare
  • The key to performance is elegance, not battalions of special cases.
    The terrible temptation to tweak should be resisted. — Jon Bentley
    and Doug McIlroy
  • The rules boil down to: "1. Don't optimize early. 2. Don't optimize
    until you know that it's needed. 3. Even then, don't optimize until
    you know what's needed, and where." — Herb Sutter
芯好空 2024-12-02 10:03:44

对于某些情况,一种快速有效的搜索算法是(如果正在搜索的数据没有改变,并且您只是在相同的缓冲区内容上一次又一次地搜索)搜索一次并构建某种查找表。一种可能的解决方案是 BoyerMoore(请参阅 Rudy 评论中的链接),但由于该解决方案对于真正高范围的 unicode 字符(例如范围 $0000...$FFFF)效果不佳,因此它仍然比重复的 Pos 或 SearchBuf 调用更快,但当我使用真正的高范围 Unicode 数据集(例如中文文本)对其进行测试时,它消耗了大量内存。我的观点是,不存在“灌篮高手”的解决方案。

也许您可以编写一个比 Pos 更快但内存不像 BoyerMoore 那么多的解决方案,如下所示:

  1. 为所有字符点构建一个表并存储每个字符出现的第一个位置。我将使用稀疏排序数组,对于您所做的每次搜索,存储该初始字符的每个起始位置。这将使您免于通过大型 unicode 字符串进行强力搜索以查找初始字符匹配。

  2. 如果表中不包含特定字符的条目,则必须进行强力搜索(仅一次)来构建该数据集。如果暴力搜索一次,代码点 U+1234 没有出现,那么 U+1234 的条目应该是一个空数组 [] ,表明您不必再次搜索,并且可以如果初始字符不匹配,搜索很快就会失败。

  3. 一旦您找到了像这样的非空搜索初始字符列表 [123,456,789](初始代码点匹配),您可以继续在 string[123 处逐个字符检查...x] 然后 string[456..x] 等等,直到用完数组中的元素。任何不匹配都会导致跳到查找表中的下一个元素。

再次出现病态情况,所有这些额外的工作都会导致代码比 Pos 慢,除非您可以确定需要在完全相同的大字符串中搜索许多不同的子字符串,否则所有这些“优化” “做事是浪费时间;忘记它吧,当您至少无法多次重复使用查找数据表时,即使 boyer moore 字符串搜索也会变慢。

如果您想要的只是一个在 Assembly 中手动优化的 Pos() 函数,以便在不生成或消耗任何中间表存储的库函数范围内尽可能快地工作,那么恭喜,Pos() 是可能已经尽可能快了(我相信几年前 FastCode 的人已经对其进行了优化)。

One kind of fast and efficient search algorithm for some cases, is one that (if the data being searched is not changing, and you just search again and again on the same buffer content) searches once and builds some kind of lookup table. One possible solution is BoyerMoore (see link in comment by Rudy) but as that one didn't work great for really high range unicode characters (say range $0000...$FFFF), it was still faster than repeated Pos or SearchBuf calls, but it consumed a LOT of memory when I tested it with truly high-range Unicode data-sets (Chinese text for instance). My opinion is there is no "slam dunk" solution.

Perhaps you could write a faster-than-Pos-but-not-so-much-memory-as-BoyerMoore solution like this:

  1. Build a table for all character points and store the first location that each character appears. I would use a sparse sorted array, for each search you have done, store each starting location for that initial character. That will spare you the brute force search through the big unicode string looking for the initial character matches.

  2. If the table contains NO entry for a particular character, you have to do a brute force search (just once) to build that data set up. If you search once by brute force and codepoint U+1234 does not appear, then the entry for U+1234 should be an empty array [] indicating that you don't have to search again, and can quickly fail out on searches where the initial character doesn't match.

  3. Once you have found a non-empty search initial-character list of positions like this [123,456,789], an initial codepoint match, you can continue by doing a character by character check at string[123...x] then string[456..x] and so on until you run out of elements in the array. Any mismatch causes a skip to the next element in the table of lookups.

Again there will be pathological cases where all this extra work results in code that is less fast than Pos is, and unless you can be sure you need to search for a lot of different substrings in exactly the same big string, then all this "optimization" stuff is a waste of time; forget it, even boyer moore string searches are slower, when you can't reuse the lookup data table multiple times at least.

If all you want is a Pos() function that is hand-optimized in Assembly to work as fast as is possible within the confines of a library function that doesn't produce or consume any intermediate table storage, then congratulations, Pos() is already probably about as fast as you can get (I believe it was optimized by the FastCode people a few years ago).

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文