如何根据真实数据自动创建模式？

发布于 2024-12-26 13:26:54 字数 448 浏览 1 评论 0原文

我的数据库中有很多供应商，他们在数据的某些方面都有所不同。我想制定基于以前数据的数据验证规则。

示例：

A: XZ-4, XZ-23, XZ-217
B: 1276, 1899, 22711
C: 12-4, 12-75, 12

目标：如果用户输入供应商 B 的字符串“XZ-217”，算法应比较以前的数据并说：该字符串与供应商 B 以前的数据不相似。

有没有一些好的方法/工具来实现这样的比较？答案可能是一些通用算法或 Perl 模块。

编辑：我同意，“相似性”很难定义。但我想了解算法，它可以分析之前的大约 100 个样本，然后将分析结果与新数据进行比较。相似性可能基于长度、字符/数字的使用、字符串创建模式、相似的开头/结尾/中间、有一些分隔符。

我觉得这不是一件容易的事，但另一方面，我认为它有非常广泛的用途。所以我希望，已经有一些提示了。

原文

I have many vendors in database, they all differ in some aspect of their data. I'd like to make data validation rule which is based on previous data.

Example:

A: XZ-4, XZ-23, XZ-217
B: 1276, 1899, 22711
C: 12-4, 12-75, 12

Goal: if user inputs string 'XZ-217' for vendor B, algorithm should compare previous data and say: this string is not similar to vendor B previous data.

Is there some good way/tools to achieve such comparison? Answer could be some generic algoritm or Perl module.

Edit:
The "similarity" is hard to define, i agree. But i'd like to catch to algorithm, which could analyze previous ca 100 samples and then compare the outcome of analyze with new data. Similarity may based on length, on use of characters/numbers, string creation patterns, similar beginning/end/middle, having some separators in.

I feel it is not easy task, but on other hand, i think it has very wide use. So i hoped, there is already some hints.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

那些过往 2025-01-02 13:26:54

您可能想仔细阅读：
http://en.wikipedia.org/wiki/String_metric 和 http://search.cpan.org/dist/Text-Levenshtein/Levenshtein.pm （例如）

回复收藏 0 原文

对你的占有欲 2025-01-02 13:26:54

乔尔和我也提出了类似的想法。下面的代码区分了 3 种类型的区域。

一个或多个非单词字符
字母数字簇
一组数字

它创建字符串的配置文件和正则表达式来匹配输入。此外，它还包含扩展现有配置文件的逻辑。最后，在任务子中，它包含一些伪逻辑，指示如何将其集成到更大的应用程序中。

use strict;
use warnings;
use List::Util qw<max min>;

sub compile_search_expr { 
    shift;
    @_ = @{ shift() } if @_ == 1;
    my $str 
        = join( '|'
              , map { join( ''
                           , grep { defined; } 
                             map  {
                                 $_ eq 'P' ? quotemeta;
                               : $_ eq 'W' ? "\\w{$_->[1],$_->[2]}"
                               : $_ eq 'D' ? "\\d{$_->[1],$_->[2]}"
                               :             undef
                               ;
                            } @$_ 
                          )
                } @_ == 1 ? @{ shift } : @_
        );
    return qr/^(?:$str)$/;
}

sub merge_profiles {
    shift;
    my ( $profile_list, $new_profile ) = @_;
    my $found = 0;
    PROFILE:
    for my $profile ( @$profile_list ) { 
        my $profile_length = @$profile;

        # it's not the same profile.
        next PROFILE unless $profile_length == @$new_profile;
        my @merged;
        for ( my $i = 0; $i < $profile_length; $i++ ) { 
            my $old = $profile->[$i];
            my $new = $new_profile->[$i];
            next PROFILE unless $old->[0] eq $new->[0];
            push( @merged
                , [ $old->[0]
                  , min( $old->[1], $new->[1] )
                  , max( $old->[2], $new->[2] ) 
                  ]);
        }
        @$profile = @merged;
        $found = 1;
        last PROFILE;
    }
    push @$profile_list, $new_profile unless $found;
    return;
}

sub compute_info_profile { 
    shift;
    my @profile_chunks
        = map { 
              /\W/ ? [ P => $_ ]
            : /\D/ ? [ W => length, length ]
            :        [ D => length, length ]
        }
        grep { length; } split /(\W+)/, shift
        ;
}

# Psuedo-Perl
sub process_input_task { 
    my ( $application, $input ) = @_;

    my $patterns = $application->get_patterns_for_current_customer;
    my $regex    = $application->compile_search_expr( $patterns );

    if    ( $input =~ /$regex/ ) {}
    elsif ( $application->approve_divergeance( $input )) {
        $application->merge_profiles( $patterns, compute_info_profile( $input ));
    }
    else { 
        $application->escalate( 
           Incident->new( issue    => INVALID_FORMAT
                        , input    => $input
                        , customer => $customer 
                        ));
    }

    return $application->process_approved_input( $input );
}

Joel and I came up with similar ideas. The code below differentiates 3 types of zones.

one or more non-word characters
alphanumeric cluster
a cluster of digits

It creates a profile of the string and a regex to match input. In addition, it also contains logic to expand existing profiles. At the end, in the task sub, it contains some pseudo logic which indicates how this might be integrated into a larger application.

use strict;
use warnings;
use List::Util qw<max min>;

sub compile_search_expr { 
    shift;
    @_ = @{ shift() } if @_ == 1;
    my $str 
        = join( '|'
              , map { join( ''
                           , grep { defined; } 
                             map  {
                                 $_ eq 'P' ? quotemeta;
                               : $_ eq 'W' ? "\\w{$_->[1],$_->[2]}"
                               : $_ eq 'D' ? "\\d{$_->[1],$_->[2]}"
                               :             undef
                               ;
                            } @$_ 
                          )
                } @_ == 1 ? @{ shift } : @_
        );
    return qr/^(?:$str)$/;
}

sub merge_profiles {
    shift;
    my ( $profile_list, $new_profile ) = @_;
    my $found = 0;
    PROFILE:
    for my $profile ( @$profile_list ) { 
        my $profile_length = @$profile;

        # it's not the same profile.
        next PROFILE unless $profile_length == @$new_profile;
        my @merged;
        for ( my $i = 0; $i < $profile_length; $i++ ) { 
            my $old = $profile->[$i];
            my $new = $new_profile->[$i];
            next PROFILE unless $old->[0] eq $new->[0];
            push( @merged
                , [ $old->[0]
                  , min( $old->[1], $new->[1] )
                  , max( $old->[2], $new->[2] ) 
                  ]);
        }
        @$profile = @merged;
        $found = 1;
        last PROFILE;
    }
    push @$profile_list, $new_profile unless $found;
    return;
}

sub compute_info_profile { 
    shift;
    my @profile_chunks
        = map { 
              /\W/ ? [ P => $_ ]
            : /\D/ ? [ W => length, length ]
            :        [ D => length, length ]
        }
        grep { length; } split /(\W+)/, shift
        ;
}

# Psuedo-Perl
sub process_input_task { 
    my ( $application, $input ) = @_;

    my $patterns = $application->get_patterns_for_current_customer;
    my $regex    = $application->compile_search_expr( $patterns );

    if    ( $input =~ /$regex/ ) {}
    elsif ( $application->approve_divergeance( $input )) {
        $application->merge_profiles( $patterns, compute_info_profile( $input ));
    }
    else { 
        $application->escalate( 
           Incident->new( issue    => INVALID_FORMAT
                        , input    => $input
                        , customer => $customer 
                        ));
    }

    return $application->process_approved_input( $input );
}

回复收藏 0 原文

删除→记忆 2025-01-02 13:26:54

这是我的实现和对您的测试用例的循环。基本上，您为该函数提供了一个好的值列表，它会尝试为其构建一个正则表达式。

输出：

A: (?^:\w{2,2}(?:\-){1}\d{1,3})
B: (?^:\d{4,5})
C: (?^:\d{2,2}(?:\-)?\d{0,2})

代码：

#!/usr/bin/env perl

use strict;
use warnings;

use List::MoreUtils qw'uniq each_arrayref';

my %examples = (
  A => [qw/ XZ-4 XZ-23 XZ-217 /],
  B => [qw/ 1276 1899 22711 /],
  C => [qw/ 12-4 12-75 12 /],
);

foreach my $example (sort keys %examples) {
  print "$example: ", gen_regex(@{ $examples{$example} }) || "Generate failed!", "\n";
}

sub gen_regex {
  my @cases = @_;

  my %exploded;

  # ex. $case may be XZ-217
  foreach my $case (@cases) {
    my @parts = 
      grep { defined and length } 
      split( /(\d+|\w+)/, $case );

    # @parts are ( XZ, -, 217 )

    foreach (@parts) {
      if (/\d/) {
        # 217 becomes ['\d' => 3]
        push @{ $exploded{$case} }, ['\d' => length];

      } elsif (/\w/) {
        #XZ becomes ['\w' => 2]
        push @{ $exploded{$case} }, ['\w' => length];

      } else {
        # - becomes ['lit' => '-']
        push @{ $exploded{$case} }, ['lit' => $_ ];

      }
    }
  }

  my $pattern = '';

  # iterate over nth element (part) of each case
  my $ea = each_arrayref(values %exploded);
  while (my @parts = $ea->()) {

    # remove undefined (i.e. optional) parts
    my @def_parts = grep { defined } @parts;

    # check that all (defined) parts are the same type
    my @part_types = uniq map {$_->[0]} @def_parts;
    if (@part_types > 1) {
      warn "Parts not aligned\n";
      return;
    }
    my $type = $part_types[0]; #same so make scalar

    # were there optional parts?
    my $required = (@parts == @def_parts);

    # keep the values of each part
    # these are either a repitition or lit strings
    my @values = sort uniq map { $_->[1] } @def_parts;

    # these are for non-literal quantifiers
    my $min = $required ? $values[0] : 0;
    my $max = $values[-1];

    # write the specific pattern for each type
    if ($type eq '\d') {
      $pattern .= '\d' . "{$min,$max}";

    } elsif ($type eq '\w') {
      $pattern .= '\w' . "{$min,$max}";

    } elsif ($type eq 'lit') {
      # quote special characters, - becomes \-
      my @uniq = map { quotemeta } uniq @values;
      # join with alternations, surround by non-capture grouup, add quantifier
      $pattern .= '(?:' . join('|', @uniq) . ')' . ($required ? '{1}' : '?');
    }
  }


  # build the qr regex from pattern
  my $regex = qr/$pattern/;
  # test that all original patterns match (@fail should be empty)
  my @fail = grep { $_ !~ $regex } @cases;

  if (@fail) {
    warn "Some cases fail for generated pattern $regex: (@fail)\n";
    return '';
  } else {
    return $regex;
  }
}

为了简化查找模式的工作，可选部分可以出现在末尾，但可选部分之后不能出现必需的部分。这或许可以克服，但可能很难。

Here is my implementation and a loop over your test cases. Basically you give a list of good values to the function and it tries to build a regex for it.

output:

A: (?^:\w{2,2}(?:\-){1}\d{1,3})
B: (?^:\d{4,5})
C: (?^:\d{2,2}(?:\-)?\d{0,2})

code:

#!/usr/bin/env perl

use strict;
use warnings;

use List::MoreUtils qw'uniq each_arrayref';

my %examples = (
  A => [qw/ XZ-4 XZ-23 XZ-217 /],
  B => [qw/ 1276 1899 22711 /],
  C => [qw/ 12-4 12-75 12 /],
);

foreach my $example (sort keys %examples) {
  print "$example: ", gen_regex(@{ $examples{$example} }) || "Generate failed!", "\n";
}

sub gen_regex {
  my @cases = @_;

  my %exploded;

  # ex. $case may be XZ-217
  foreach my $case (@cases) {
    my @parts = 
      grep { defined and length } 
      split( /(\d+|\w+)/, $case );

    # @parts are ( XZ, -, 217 )

    foreach (@parts) {
      if (/\d/) {
        # 217 becomes ['\d' => 3]
        push @{ $exploded{$case} }, ['\d' => length];

      } elsif (/\w/) {
        #XZ becomes ['\w' => 2]
        push @{ $exploded{$case} }, ['\w' => length];

      } else {
        # - becomes ['lit' => '-']
        push @{ $exploded{$case} }, ['lit' => $_ ];

      }
    }
  }

  my $pattern = '';

  # iterate over nth element (part) of each case
  my $ea = each_arrayref(values %exploded);
  while (my @parts = $ea->()) {

    # remove undefined (i.e. optional) parts
    my @def_parts = grep { defined } @parts;

    # check that all (defined) parts are the same type
    my @part_types = uniq map {$_->[0]} @def_parts;
    if (@part_types > 1) {
      warn "Parts not aligned\n";
      return;
    }
    my $type = $part_types[0]; #same so make scalar

    # were there optional parts?
    my $required = (@parts == @def_parts);

    # keep the values of each part
    # these are either a repitition or lit strings
    my @values = sort uniq map { $_->[1] } @def_parts;

    # these are for non-literal quantifiers
    my $min = $required ? $values[0] : 0;
    my $max = $values[-1];

    # write the specific pattern for each type
    if ($type eq '\d') {
      $pattern .= '\d' . "{$min,$max}";

    } elsif ($type eq '\w') {
      $pattern .= '\w' . "{$min,$max}";

    } elsif ($type eq 'lit') {
      # quote special characters, - becomes \-
      my @uniq = map { quotemeta } uniq @values;
      # join with alternations, surround by non-capture grouup, add quantifier
      $pattern .= '(?:' . join('|', @uniq) . ')' . ($required ? '{1}' : '?');
    }
  }


  # build the qr regex from pattern
  my $regex = qr/$pattern/;
  # test that all original patterns match (@fail should be empty)
  my @fail = grep { $_ !~ $regex } @cases;

  if (@fail) {
    warn "Some cases fail for generated pattern $regex: (@fail)\n";
    return '';
  } else {
    return $regex;
  }
}

To simplify the work of finding the pattern, optional parts may come at the end, but no required parts may come after optional ones. This could probably be overcome but it might be hard.

回复收藏 0 原文