博耶摩尔算法的实现?

发布于 2024-10-26 11:22:56 字数 81 浏览 9 评论 0原文

有 C 语言的 Boyer-Moore 字符串搜索算法的工作示例吗? 我浏览了一些网站,但它们似乎有很多问题,包括维基百科。

谢谢。

Is there a working example of the Boyer-Moore string search algorithm in C?
I've looked at a few sites, but they seem pretty buggy, including wikipedia.

Thanks.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

许仙没带伞 2024-11-02 11:22:56

子字符串搜索算法的最佳网站:

http://igm.univ-mlv.fr/~ lecroq/字符串/

The best site for substring search algorithms:

http://igm.univ-mlv.fr/~lecroq/string/

朦胧时间 2024-11-02 11:22:56

这是我用很多奇怪的测试用例强调的 C90 实现:

#ifndef MAX
#define MAX(a,b)  ((a > b) ? (a) : (b))
#endif

void  fillBadCharIndexTable (
    /*----------------------------------------------------------------
    function:
       the table fits for 8 bit character only (including utf-8)
    parameters: */
    size_t  aBadCharIndexTable [],
    char const *  const pPattern,
    size_t  const patternLength)
    /*----------------------------------------------------------------*/
{
    size_t  i;
    size_t  remainingPatternLength = patternLength - 1;

    for (i = 0;  i < 256; ++i) {
        aBadCharIndexTable [i] = patternLength;
    }
    for (i = 0;  i < patternLength;  ++i) {
        aBadCharIndexTable [pPattern [i]] = remainingPatternLength--;
    }
}

void  fillGoodSuffixRuleTable (
    /*----------------------------------------------------------------
    function:
       the table fits for patterns of length < 256; for longer patterns ... (1 of)
       - increase the static size
       - use variable length arrays and >= C99 compilers
       - allocate (and finally release) heap according to demand
    parameters: */
    size_t  aGoodSuffixIndexTable [],
    char const *  const pPattern,
    size_t  const patternLength)
    /*----------------------------------------------------------------*/
{
    size_t  const highestPatternIndex = patternLength - 1;
    size_t  prefixLength = 1;

    /* complementary prefix length, i.e. difference from highest possible pattern index and prefix length */
    size_t  cplPrefixLength = highestPatternIndex;

    /* complementary length of recently inspected pattern substring which is simultaneously pattern prefix and suffix */
    size_t  cplPrefixSuffixLength = patternLength;

    /* too hard to explain in a C source ;-) */
    size_t  iRepeatedSuffixMax;

    aGoodSuffixIndexTable [cplPrefixLength] = patternLength;

    while (cplPrefixLength > 0) {
        if (!strncmp (pPattern,  pPattern + cplPrefixLength,  prefixLength)) {
            cplPrefixSuffixLength = cplPrefixLength;
        }

        aGoodSuffixIndexTable [--cplPrefixLength] = cplPrefixSuffixLength + prefixLength++;
    }

    if (pPattern [0] != pPattern [highestPatternIndex]) {
        aGoodSuffixIndexTable [highestPatternIndex] = highestPatternIndex;
    }

    for (iRepeatedSuffixMax = 1;  iRepeatedSuffixMax < highestPatternIndex;  ++iRepeatedSuffixMax) {
        size_t  iSuffix = highestPatternIndex;
        size_t  iRepeatedSuffix = iRepeatedSuffixMax;

        do {
            if (pPattern [iRepeatedSuffix] != pPattern [iSuffix]) {
                aGoodSuffixIndexTable [iSuffix] = highestPatternIndex - iRepeatedSuffix;
                break;
            }

            --iSuffix;
        } while (--iRepeatedSuffix > 0);
    }
}

char const *  boyerMoore (
    /*----------------------------------------------------------------
    function:
       find a pattern (needle) inside a text (haystack)
    parameters: */
    char const *  const pHaystack,
    size_t  const haystackLength,
    char const *  const pPattern)
    /*----------------------------------------------------------------*/
{
    size_t  const patternLength = strlen (pPattern);
    size_t  const highestPatternIndex = patternLength - 1;
    size_t  aBadCharIndexTable [256];
    size_t  aGoodSuffixIndexTable [256];

    if (*pPattern == '\0') {
        return pHaystack;
    }

    if (patternLength <= 1) {
        return strchr (pHaystack,  *pPattern);
    }

    if (patternLength >= sizeof aGoodSuffixIndexTable) {
        /* exit for too long patterns */
        return 0;
    }

    {
        char const *  pInHaystack = pHaystack + highestPatternIndex;

        /* search preparation */
        fillBadCharIndexTable (
            aBadCharIndexTable,
            pPattern,
            patternLength);
        fillGoodSuffixRuleTable (
            aGoodSuffixIndexTable,
            pPattern,
            patternLength);

        /* search execution */
        while (pInHaystack++ < pHaystack + haystackLength) {
            int  iPattern = (int) highestPatternIndex;

            while (*--pInHaystack == pPattern [iPattern]) {
                if (--iPattern < 0) {
                    return pInHaystack;
                }
            }

            pInHaystack += MAX (aBadCharIndexTable [*pInHaystack],  aGoodSuffixIndexTable [iPattern]);
        }
    }

    return 0;
}

Here is a C90 implementation that I have stressed with a lot of strange test cases:

#ifndef MAX
#define MAX(a,b)  ((a > b) ? (a) : (b))
#endif

void  fillBadCharIndexTable (
    /*----------------------------------------------------------------
    function:
       the table fits for 8 bit character only (including utf-8)
    parameters: */
    size_t  aBadCharIndexTable [],
    char const *  const pPattern,
    size_t  const patternLength)
    /*----------------------------------------------------------------*/
{
    size_t  i;
    size_t  remainingPatternLength = patternLength - 1;

    for (i = 0;  i < 256; ++i) {
        aBadCharIndexTable [i] = patternLength;
    }
    for (i = 0;  i < patternLength;  ++i) {
        aBadCharIndexTable [pPattern [i]] = remainingPatternLength--;
    }
}

void  fillGoodSuffixRuleTable (
    /*----------------------------------------------------------------
    function:
       the table fits for patterns of length < 256; for longer patterns ... (1 of)
       - increase the static size
       - use variable length arrays and >= C99 compilers
       - allocate (and finally release) heap according to demand
    parameters: */
    size_t  aGoodSuffixIndexTable [],
    char const *  const pPattern,
    size_t  const patternLength)
    /*----------------------------------------------------------------*/
{
    size_t  const highestPatternIndex = patternLength - 1;
    size_t  prefixLength = 1;

    /* complementary prefix length, i.e. difference from highest possible pattern index and prefix length */
    size_t  cplPrefixLength = highestPatternIndex;

    /* complementary length of recently inspected pattern substring which is simultaneously pattern prefix and suffix */
    size_t  cplPrefixSuffixLength = patternLength;

    /* too hard to explain in a C source ;-) */
    size_t  iRepeatedSuffixMax;

    aGoodSuffixIndexTable [cplPrefixLength] = patternLength;

    while (cplPrefixLength > 0) {
        if (!strncmp (pPattern,  pPattern + cplPrefixLength,  prefixLength)) {
            cplPrefixSuffixLength = cplPrefixLength;
        }

        aGoodSuffixIndexTable [--cplPrefixLength] = cplPrefixSuffixLength + prefixLength++;
    }

    if (pPattern [0] != pPattern [highestPatternIndex]) {
        aGoodSuffixIndexTable [highestPatternIndex] = highestPatternIndex;
    }

    for (iRepeatedSuffixMax = 1;  iRepeatedSuffixMax < highestPatternIndex;  ++iRepeatedSuffixMax) {
        size_t  iSuffix = highestPatternIndex;
        size_t  iRepeatedSuffix = iRepeatedSuffixMax;

        do {
            if (pPattern [iRepeatedSuffix] != pPattern [iSuffix]) {
                aGoodSuffixIndexTable [iSuffix] = highestPatternIndex - iRepeatedSuffix;
                break;
            }

            --iSuffix;
        } while (--iRepeatedSuffix > 0);
    }
}

char const *  boyerMoore (
    /*----------------------------------------------------------------
    function:
       find a pattern (needle) inside a text (haystack)
    parameters: */
    char const *  const pHaystack,
    size_t  const haystackLength,
    char const *  const pPattern)
    /*----------------------------------------------------------------*/
{
    size_t  const patternLength = strlen (pPattern);
    size_t  const highestPatternIndex = patternLength - 1;
    size_t  aBadCharIndexTable [256];
    size_t  aGoodSuffixIndexTable [256];

    if (*pPattern == '\0') {
        return pHaystack;
    }

    if (patternLength <= 1) {
        return strchr (pHaystack,  *pPattern);
    }

    if (patternLength >= sizeof aGoodSuffixIndexTable) {
        /* exit for too long patterns */
        return 0;
    }

    {
        char const *  pInHaystack = pHaystack + highestPatternIndex;

        /* search preparation */
        fillBadCharIndexTable (
            aBadCharIndexTable,
            pPattern,
            patternLength);
        fillGoodSuffixRuleTable (
            aGoodSuffixIndexTable,
            pPattern,
            patternLength);

        /* search execution */
        while (pInHaystack++ < pHaystack + haystackLength) {
            int  iPattern = (int) highestPatternIndex;

            while (*--pInHaystack == pPattern [iPattern]) {
                if (--iPattern < 0) {
                    return pInHaystack;
                }
            }

            pInHaystack += MAX (aBadCharIndexTable [*pInHaystack],  aGoodSuffixIndexTable [iPattern]);
        }
    }

    return 0;
}
若有似无的小暗淡 2024-11-02 11:22:56

Bob Stout 的 Snippets 网站上有一些 Boyer-Moore-Horspool 的实现(包括周日的变体)。据我所知,Ray Gardner 在 BMHSRCH.C 中的实现没有错误知道1,绝对是我见过或听说过的最快的。然而,这并不是最容易理解的——他使用了一些相当棘手的代码来使内部循环尽可能简单。我可能有偏见,但我认为我的版本2位于 PBMSRCH.C 更容易理解一点(尽管肯定慢一点)。

1 在其限制范围内 - 它最初是为 MS-DOS 编写的,并且可以针对提供更多内存的环境进行重写。
2 这不知何故被标记为“Pratt-Boyer-Moore”,但实际上是 Boyer-Moore-Horspool 的周日变体(尽管我当时不知道并且没有发布它) ,我相信我实际上是在周日之前大约一年发明了它)。

There are a couple of implementations of Boyer-Moore-Horspool (including Sunday's variant) on Bob Stout's Snippets site. Ray Gardner's implementation in BMHSRCH.C is bug-free as far as I know1, and definitely the fastest I've ever seen or heard of. It's not, however, the easiest to understand -- he uses some fairly tricky code to keep the inner loop as a simple as possible. I may be biased, but I think my version2 in PBMSRCH.C is a bit easier to understand (though definitely a bit slower).

1 Within its limits -- it was originally written for MS-DOS, and could use a rewrite for environments that provide more memory.
2 This somehow got labeled as "Pratt-Boyer-Moore", but is actually Sunday's variant of Boyer-Moore-Horspool (though I wasn't aware of it at the time and didn't publish it, I believe I actually invented it about a year before Sunday did).

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文