克服 Bitap 算法的搜索模式长度

发布于 2024-07-18 06:38:13 字数 511 浏览 9 评论 0原文

我是近似字符串匹配领域的新手。

我正在探索 Bitap 算法的用途，但到目前为止，其有限的模式长度让我感到困扰。我正在使用 Flash，并且处理 32 位无符号整数和 IEEE-754 双精度浮点数类型，该类型最多可以为整数提供 53 位。尽管如此，我还是宁愿有一个模糊匹配算法，它可以处理超过 50 个字符的模式。

Bitap 算法的维基百科页面提到了 libbitap，据称它演示了该算法的无限模式长度实现算法，但我很难从其来源中获得这个想法。

您是否对如何将 Bitap 泛化为无限长度的模式有任何建议，或者对另一种可以在大海捞针中建议位置附近执行针模糊字符串匹配的算法有什么建议吗？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

猫腻 2024-07-25 06:38:13

Google 代码提供了该算法的相当清晰的实现。
尝试一下。虽然我不明白如何获得模糊匹配的确切位置（文本中的起点和终点）。如果您有任何想法如何获得起点和终点，请分享。

回复收藏 0 原文

月牙弯弯 2024-07-25 06:38:13

模糊匹配最简单的形式可能是“匹配与不匹配”。不匹配有时称为替换。关键是我们没有考虑删除或插入。

Bitapp 多个版本的作者 Ricardo Baeza-Yates 还与 Chris Perleberg 一起编写了“匹配与不匹配”的算法。该算法使用链表而不是位数组，但算法的精神是相同的。该论文在评论中被引用。

下面是使用 GLib 的 Baeza-Yates-Perleberg“匹配与不匹配”算法的 C 实现。它的优化程度低于原始实现，但对图案或文本的大小没有限制。

https://gist.github.com/angstyloop/e4ca495542cd469790ca926ade2fc072

/* g_match_with_mismatches_v3.c
 *
 * Search for fuzzy matches of a pattern in text with @k or fewer mismatches
 * (substitutions). Uses doubly-linked list GList from GLib.
 *
 * COMPILE
 *
 * gcc `pkg-config --cflags glib-2.0` -o g_match_with_mismatches_v3 g_match_with_mismatches_v3.c `pkg-config --libs glib-2.0`
 *
 * RUN
 *
 * ./match_with_mismatches
 *
 * REFS
 *
 * The search and preprocess functions were taken from the example code from
 * "Fast and Practical Approximate String Matching" by Ricardo A. Baeza-Yates
 * and Chris H. Perleberg. I just modified the code.
 */

#include <glib-2.0/glib.h>
#include <stdio.h>

/* Size of alpha index and count array. This number should be twice the size
 * of the alphabet, which is 128 in this case for ASCII (not extended).
 * The purpose of this extra space is explained later.
 */
#define ALPHABET_SIZE 256

/* The Match object will hold the index @i of the first character of the match
 * and the number of mismatched characters @k within that match. The Match
 * objects are also linked list nodes.
 */
typedef struct Match Match;
struct Match {
    int i;
    int k;
};

Match *
Match_new( int i, int k )
{
    Match *t;
    t = g_malloc( sizeof( Match ) );
    t->i = i;
    t->k = k;
    return t;
}

void
Match_free( gpointer match_, gpointer user_data )
{
    Match *match = (Match *) match_;
    if ( match )
        g_free( match );
}

void
Match_print( gpointer match_, gpointer user_data )
{
    Match *match = (Match *) match_;
    printf( "position: %d, errors: %d\n", match->i, match->k );
}

/* An array of lists. There are 128 linked lists in total - one for each
 * character in our 128-character alphabet. The last 128 lists characters are
 * added to the linked lists as needed.
 *
 * Each list will contain the offsets for each occurence of that character, or a
 * single placeholder offset of -1 if no occurences are found.
 *
 * The offset is the distance of the character to the left of the end of
 * pattern, (i.e., measured by counting to the left from the end of pattern),
 * for a given character and a given instance of pattern.
 */
GList *alpha[ALPHABET_SIZE];

/* This function initializes the @alpha and @count arrays to look like this:
 *
 *     alpha = [ [ -1 ] ] * ALPHABET_SIZE   where the inner brackets are a GList.
 *
 *     count = [m] * m
 *
 * @alpha will be an array of linked lists. Characters in the pattern
 * that do not occur or that occur exactly once in the text will have
 * corresponding linked lists with length one. Characters in the pattern that
 * occur in the text more than once will have corresponding linked lists with
 * length greater than one.
 *
 * The first m - 1  elements of @count will be skipped on the first iteration of
 * the cyclic array (since no match can be shorter than the @pattern). Note that
 * the values in @count are reset to m once they are no longer needed, until the
 * next loop around @count.
 *
 * @p - pattern string
 * @m - pattern length
 * @alpha - array of GList. See above.
 * @count - circular buffer for counts of matches
 */
void preprocess( char *p, int m, GList *alpha[], int count[], int max_pattern_size )
{
    int i, j;

    for ( i = 0; i < ALPHABET_SIZE; i++ ) {
        alpha[i] = NULL;
        alpha[i] = g_list_append( alpha[i], GINT_TO_POINTER( -1 ) );
    }

    for ( i = 0, j = 128; i < m; i++, p++ ) {
        if ( GPOINTER_TO_INT( alpha[*p]->data ) == -1 )
            alpha[*p]->data = GINT_TO_POINTER( m - i - 1 );
        else
            alpha[*p] = g_list_append( alpha[*p],
                GINT_TO_POINTER( m - i - 1 ) );
    }

    for ( i = 0; i < max_pattern_size; i++ )
        count[i] = m;

}

void
increment_offset( gpointer off_, gpointer args_ )
{
    gpointer *args = (gpointer *) args_;
    int i = GPOINTER_TO_INT( args[0] );
    int max_pattern_size = GPOINTER_TO_INT( args[1] );
    int *count = (int *) args[2];
    gint off = GPOINTER_TO_INT( off_ ) ;
    count[(i + off) % max_pattern_size]--;
}

/* Find the position of the first character and number of mismatches of every
 * fuzzy match in a string @t with @k or fewer mismatches. Uses the array of
 * GList @alpha and the array of counts @count prepared by the preprocess
 * function.
 * @t - text string
 * @n - length of text string
 * @m - length of the pattern used to create @alpha and @count
 * @k - maximum number of allowed mismatches
 * @alpha - array of GList. See above.
 * @count - circular buffer for counts of matches
 */
GList *
search( char *t, int n, int m, int k, GList *alpha[], int count[], int max_pattern_size )
{
    int i, off, j;
    Match *match;
    GList *l0 = NULL, *l1 = NULL;

    /* Walk the text @t, which has length @n.
     */
    for ( i = 0; i < n; i++ ) {
        /* If the current character in @t is in pattern, its
         * corresponding list in @alpha will have a non-negative offset,
         * thanks to the workdone by the preprocess function. If so, we
         * need to decrement the counts in the circular buffer @count
         * corresponding to the index of the character in the text and
         * the offsets the lists corresponding to those characters,
         * which the preprocess function prepared.
         * 
         * Note that we will only ever need m counts at a time, and
         * we reset them to @m when we are done with them, in case
         * they are needed when the text wraps max_pattern_size
         * characters.
         */
        l0 = alpha[*t++];
        off = GPOINTER_TO_INT( l0->data );
        if ( off >= 0 ) {
            g_assert( l0 );
            gpointer t[3] = {
                GINT_TO_POINTER( i ),
                GINT_TO_POINTER( max_pattern_size ),
                (gpointer) count,
            };
            g_list_foreach( l0, increment_offset, t );
        }

        /* If the count in @count corresponding to the current index in
         * the text is no greater than @k, the number of mismatches we
         * allow, then the pattern instance is reported as a fuzzy
         * match. The position of the first letter in the match is
         * calculated using the pattern length and the index of the last
         * character in the match The number of mismatches is calculated
         * from the number of matches. The first m - 1 elements are
         * skipped.
         */
        if ( i >= m - 1 && count[i % max_pattern_size] <= k ) {
            g_assert( i - m + 1 >= 0 );
            match = Match_new( i - m + 1, count[i % max_pattern_size] );
            l1 = g_list_append( l1, match );
        }

        /* The count in @count corresponding to the current index in
         * text is no longer needed, so we reset it to @m until we
         * need it on the next wraparound.
         */
        count[i % max_pattern_size] = m;
    }

    return l1;
}

/* This is a test harness for the code in this file.
 */
int main()
{

    /* Define the max pattern size. This can be INT_MAX (65535) if you want.
     */
    const int max_pattern_size = 256;

    /* This array is used as a cyclic buffer for counts of the number of matching
     * characters in a given instance of the pattern. The counts are maintained at
     * the end of each pattern. When a pattern with k or fewer mismatches is found,
     * it is reported. As the algorithm steps through the count array, it resets the
     * counts it doesn't need anymore back to m, so they can be reused when the
     * index in the text exceeds reaches the end and needs to wrap around. The first m-1
     * characters will be initialized to max_pattern_size, so they never have a valid
     * number of mismatches.
     */
    int count[max_pattern_size];

    char *text = "xxxadcxxx", *pattern = "abc";
    int n  = strlen( text ), m = strlen( pattern ), k;
    Match *match = NULL;
    GList *l0, *list;

    /* Test Match class
     */
    printf( "\nTesting Match class..\n\n" );
    match = Match_new( 0, 0 );
    g_assert( match );
    Match_print( match, NULL );
    Match_free( match, NULL );
    printf( "\nDone testing Match class.\n\n" );

    /* Test preprocess and search functions.
     */
    printf( "\nTesting \"preprocess\" and \"search\" functions...\n" );

    k = 0;
    printf( "\n...with number of allowed errors k = %d\n", k );
    preprocess( pattern, m, alpha, count, max_pattern_size );
    list = search( text, n, m, k, alpha, count, max_pattern_size );
    g_list_foreach( list, Match_print, NULL );
    g_list_foreach( list, Match_free, NULL  );
    if ( !g_list_length( list ) )
        printf( "No matches.\n" );

    k = 1;
    printf( "\n...with number of allowed errors k = %d\n", k );
    preprocess( pattern, m, alpha, count, max_pattern_size );
    list = search( text, n, m, k, alpha, count, max_pattern_size );
    g_list_foreach( list, Match_print, NULL );
    match = (Match *) g_list_nth_data( list , 0 );
    g_assert(  GPOINTER_TO_INT( match->i ) == 3 );
    g_assert(  GPOINTER_TO_INT( match->k ) == 1 );
    g_list_foreach( list, Match_free, NULL  );

    k = 2;
    printf( "\n...with number of allowed errors k = %d\n", k );
    preprocess( pattern, m, alpha, count, max_pattern_size );
    list = search( text, n, m, k, alpha, count, max_pattern_size );
    g_list_foreach( list, Match_print, NULL );
    match = (Match *) g_list_nth_data( list , 0 );
    g_assert(  GPOINTER_TO_INT( match->i ) == 3 );
    g_assert(  GPOINTER_TO_INT( match->k ) == 1 );
    g_list_foreach( list, Match_free, NULL  );

    k = 3;
    printf( "\n...with number of allowed errors k = %d\n", k );
    preprocess( pattern, m, alpha, count, max_pattern_size );
    list = search( text, n, m, k, alpha, count, max_pattern_size );
    g_list_foreach( list, Match_print, NULL );
    match = (Match *) g_list_nth_data( list , 3 );
    g_assert(  GPOINTER_TO_INT( match->i ) == 3 );
    g_assert(  GPOINTER_TO_INT( match->k ) == 1 );
    g_list_foreach( list, Match_free, NULL  );

    printf( "\nDone testing \"preprocess\" and \"search\" functions.\n\n" );

    return 0;
}

输出

这是简单编译示例程序的输出：

Testing Match class..

position: 0, errors: 0

Done testing Match class.


Testing "preprocess" and "search" functions...

...with number of allowed errors k = 0
No matches.

...with number of allowed errors k = 1
position: 3, errors: 1

...with number of allowed errors k = 2
position: 3, errors: 1

...with number of allowed errors k = 3
position: 0, errors: 3
position: 1, errors: 3
position: 2, errors: 3
position: 3, errors: 1
position: 4, errors: 3
position: 5, errors: 3
position: 6, errors: 3

Done testing "preprocess" and "search" functions.

这是一个使用此代码的小型 GTK4 应用程序：

https ://gist.github.com/angstyloop/2281191a3e7fd7e4c615698661fbac24

通过动态选择模式的最大长度，如果您正在搜索的字符串是，您可以免费获得完整的模糊匹配就汉明距离而言，大多数情况下相距很远。即使进行插入和删除，汉明距离最接近的字符串与其他字符串相比也会有少量的不匹配。用户将不得不犯很多错误，或者两个字符串必须非常接近，才能破坏这种良好的行为。这是一个例子：

The simplest form of fuzzy match is probably "match with mismatches". Mismatches are sometimes called substitutions. The point is we are not considering deletions or insertions.

Ricardo Baeza-Yates, the author of many versions of Bitapp, also authored an algorithm for "match with mismatches" with Chris Perleberg. The algorithm uses linked lists instead of bit arrays, but the spirit of the algorithm is the same. The paper is cited in the comments.

Here is a C implementation of the Baeza-Yates-Perleberg "match with mismatches" algorithm that uses GLib. It is less optimized than the original implementation, but there are no limits on the size of the pattern or the text.

https://gist.github.com/angstyloop/e4ca495542cd469790ca926ade2fc072

/* g_match_with_mismatches_v3.c
 *
 * Search for fuzzy matches of a pattern in text with @k or fewer mismatches
 * (substitutions). Uses doubly-linked list GList from GLib.
 *
 * COMPILE
 *
 * gcc `pkg-config --cflags glib-2.0` -o g_match_with_mismatches_v3 g_match_with_mismatches_v3.c `pkg-config --libs glib-2.0`
 *
 * RUN
 *
 * ./match_with_mismatches
 *
 * REFS
 *
 * The search and preprocess functions were taken from the example code from
 * "Fast and Practical Approximate String Matching" by Ricardo A. Baeza-Yates
 * and Chris H. Perleberg. I just modified the code.
 */

#include <glib-2.0/glib.h>
#include <stdio.h>

/* Size of alpha index and count array. This number should be twice the size
 * of the alphabet, which is 128 in this case for ASCII (not extended).
 * The purpose of this extra space is explained later.
 */
#define ALPHABET_SIZE 256

/* The Match object will hold the index @i of the first character of the match
 * and the number of mismatched characters @k within that match. The Match
 * objects are also linked list nodes.
 */
typedef struct Match Match;
struct Match {
    int i;
    int k;
};

Match *
Match_new( int i, int k )
{
    Match *t;
    t = g_malloc( sizeof( Match ) );
    t->i = i;
    t->k = k;
    return t;
}

void
Match_free( gpointer match_, gpointer user_data )
{
    Match *match = (Match *) match_;
    if ( match )
        g_free( match );
}

void
Match_print( gpointer match_, gpointer user_data )
{
    Match *match = (Match *) match_;
    printf( "position: %d, errors: %d\n", match->i, match->k );
}

/* An array of lists. There are 128 linked lists in total - one for each
 * character in our 128-character alphabet. The last 128 lists characters are
 * added to the linked lists as needed.
 *
 * Each list will contain the offsets for each occurence of that character, or a
 * single placeholder offset of -1 if no occurences are found.
 *
 * The offset is the distance of the character to the left of the end of
 * pattern, (i.e., measured by counting to the left from the end of pattern),
 * for a given character and a given instance of pattern.
 */
GList *alpha[ALPHABET_SIZE];

/* This function initializes the @alpha and @count arrays to look like this:
 *
 *     alpha = [ [ -1 ] ] * ALPHABET_SIZE   where the inner brackets are a GList.
 *
 *     count = [m] * m
 *
 * @alpha will be an array of linked lists. Characters in the pattern
 * that do not occur or that occur exactly once in the text will have
 * corresponding linked lists with length one. Characters in the pattern that
 * occur in the text more than once will have corresponding linked lists with
 * length greater than one.
 *
 * The first m - 1  elements of @count will be skipped on the first iteration of
 * the cyclic array (since no match can be shorter than the @pattern). Note that
 * the values in @count are reset to m once they are no longer needed, until the
 * next loop around @count.
 *
 * @p - pattern string
 * @m - pattern length
 * @alpha - array of GList. See above.
 * @count - circular buffer for counts of matches
 */
void preprocess( char *p, int m, GList *alpha[], int count[], int max_pattern_size )
{
    int i, j;

    for ( i = 0; i < ALPHABET_SIZE; i++ ) {
        alpha[i] = NULL;
        alpha[i] = g_list_append( alpha[i], GINT_TO_POINTER( -1 ) );
    }

    for ( i = 0, j = 128; i < m; i++, p++ ) {
        if ( GPOINTER_TO_INT( alpha[*p]->data ) == -1 )
            alpha[*p]->data = GINT_TO_POINTER( m - i - 1 );
        else
            alpha[*p] = g_list_append( alpha[*p],
                GINT_TO_POINTER( m - i - 1 ) );
    }

    for ( i = 0; i < max_pattern_size; i++ )
        count[i] = m;

}

void
increment_offset( gpointer off_, gpointer args_ )
{
    gpointer *args = (gpointer *) args_;
    int i = GPOINTER_TO_INT( args[0] );
    int max_pattern_size = GPOINTER_TO_INT( args[1] );
    int *count = (int *) args[2];
    gint off = GPOINTER_TO_INT( off_ ) ;
    count[(i + off) % max_pattern_size]--;
}

/* Find the position of the first character and number of mismatches of every
 * fuzzy match in a string @t with @k or fewer mismatches. Uses the array of
 * GList @alpha and the array of counts @count prepared by the preprocess
 * function.
 * @t - text string
 * @n - length of text string
 * @m - length of the pattern used to create @alpha and @count
 * @k - maximum number of allowed mismatches
 * @alpha - array of GList. See above.
 * @count - circular buffer for counts of matches
 */
GList *
search( char *t, int n, int m, int k, GList *alpha[], int count[], int max_pattern_size )
{
    int i, off, j;
    Match *match;
    GList *l0 = NULL, *l1 = NULL;

    /* Walk the text @t, which has length @n.
     */
    for ( i = 0; i < n; i++ ) {
        /* If the current character in @t is in pattern, its
         * corresponding list in @alpha will have a non-negative offset,
         * thanks to the workdone by the preprocess function. If so, we
         * need to decrement the counts in the circular buffer @count
         * corresponding to the index of the character in the text and
         * the offsets the lists corresponding to those characters,
         * which the preprocess function prepared.
         * 
         * Note that we will only ever need m counts at a time, and
         * we reset them to @m when we are done with them, in case
         * they are needed when the text wraps max_pattern_size
         * characters.
         */
        l0 = alpha[*t++];
        off = GPOINTER_TO_INT( l0->data );
        if ( off >= 0 ) {
            g_assert( l0 );
            gpointer t[3] = {
                GINT_TO_POINTER( i ),
                GINT_TO_POINTER( max_pattern_size ),
                (gpointer) count,
            };
            g_list_foreach( l0, increment_offset, t );
        }

        /* If the count in @count corresponding to the current index in
         * the text is no greater than @k, the number of mismatches we
         * allow, then the pattern instance is reported as a fuzzy
         * match. The position of the first letter in the match is
         * calculated using the pattern length and the index of the last
         * character in the match The number of mismatches is calculated
         * from the number of matches. The first m - 1 elements are
         * skipped.
         */
        if ( i >= m - 1 && count[i % max_pattern_size] <= k ) {
            g_assert( i - m + 1 >= 0 );
            match = Match_new( i - m + 1, count[i % max_pattern_size] );
            l1 = g_list_append( l1, match );
        }

        /* The count in @count corresponding to the current index in
         * text is no longer needed, so we reset it to @m until we
         * need it on the next wraparound.
         */
        count[i % max_pattern_size] = m;
    }

    return l1;
}

/* This is a test harness for the code in this file.
 */
int main()
{

    /* Define the max pattern size. This can be INT_MAX (65535) if you want.
     */
    const int max_pattern_size = 256;

    /* This array is used as a cyclic buffer for counts of the number of matching
     * characters in a given instance of the pattern. The counts are maintained at
     * the end of each pattern. When a pattern with k or fewer mismatches is found,
     * it is reported. As the algorithm steps through the count array, it resets the
     * counts it doesn't need anymore back to m, so they can be reused when the
     * index in the text exceeds reaches the end and needs to wrap around. The first m-1
     * characters will be initialized to max_pattern_size, so they never have a valid
     * number of mismatches.
     */
    int count[max_pattern_size];

    char *text = "xxxadcxxx", *pattern = "abc";
    int n  = strlen( text ), m = strlen( pattern ), k;
    Match *match = NULL;
    GList *l0, *list;

    /* Test Match class
     */
    printf( "\nTesting Match class..\n\n" );
    match = Match_new( 0, 0 );
    g_assert( match );
    Match_print( match, NULL );
    Match_free( match, NULL );
    printf( "\nDone testing Match class.\n\n" );

    /* Test preprocess and search functions.
     */
    printf( "\nTesting \"preprocess\" and \"search\" functions...\n" );

    k = 0;
    printf( "\n...with number of allowed errors k = %d\n", k );
    preprocess( pattern, m, alpha, count, max_pattern_size );
    list = search( text, n, m, k, alpha, count, max_pattern_size );
    g_list_foreach( list, Match_print, NULL );
    g_list_foreach( list, Match_free, NULL  );
    if ( !g_list_length( list ) )
        printf( "No matches.\n" );

    k = 1;
    printf( "\n...with number of allowed errors k = %d\n", k );
    preprocess( pattern, m, alpha, count, max_pattern_size );
    list = search( text, n, m, k, alpha, count, max_pattern_size );
    g_list_foreach( list, Match_print, NULL );
    match = (Match *) g_list_nth_data( list , 0 );
    g_assert(  GPOINTER_TO_INT( match->i ) == 3 );
    g_assert(  GPOINTER_TO_INT( match->k ) == 1 );
    g_list_foreach( list, Match_free, NULL  );

    k = 2;
    printf( "\n...with number of allowed errors k = %d\n", k );
    preprocess( pattern, m, alpha, count, max_pattern_size );
    list = search( text, n, m, k, alpha, count, max_pattern_size );
    g_list_foreach( list, Match_print, NULL );
    match = (Match *) g_list_nth_data( list , 0 );
    g_assert(  GPOINTER_TO_INT( match->i ) == 3 );
    g_assert(  GPOINTER_TO_INT( match->k ) == 1 );
    g_list_foreach( list, Match_free, NULL  );

    k = 3;
    printf( "\n...with number of allowed errors k = %d\n", k );
    preprocess( pattern, m, alpha, count, max_pattern_size );
    list = search( text, n, m, k, alpha, count, max_pattern_size );
    g_list_foreach( list, Match_print, NULL );
    match = (Match *) g_list_nth_data( list , 3 );
    g_assert(  GPOINTER_TO_INT( match->i ) == 3 );
    g_assert(  GPOINTER_TO_INT( match->k ) == 1 );
    g_list_foreach( list, Match_free, NULL  );

    printf( "\nDone testing \"preprocess\" and \"search\" functions.\n\n" );

    return 0;
}

Output

Here is the output of the simple compiled example program:

Testing Match class..

position: 0, errors: 0

Done testing Match class.


Testing "preprocess" and "search" functions...

...with number of allowed errors k = 0
No matches.

...with number of allowed errors k = 1
position: 3, errors: 1

...with number of allowed errors k = 2
position: 3, errors: 1

...with number of allowed errors k = 3
position: 0, errors: 3
position: 1, errors: 3
position: 2, errors: 3
position: 3, errors: 1
position: 4, errors: 3
position: 5, errors: 3
position: 6, errors: 3

Done testing "preprocess" and "search" functions.

Here is a small GTK4 application that uses this code:

https://gist.github.com/angstyloop/2281191a3e7fd7e4c615698661fbac24

By dynamically picking the max length of the pattern, you can get a full fuzzy match for free if the strings you are searching are mostly far apart in terms of Hamming distance. Even with insertions and deletions, the string that is closest in terms of Hamming distance will have a small number of mismatches compared to the other strings. The user will have to make many errors, or two of the strings will have to be very close, in order to break that nice behavior. Here is an example:

回复收藏 0 原文

~没有更多了~