使用正则表达式仅从列表中提取不包含重复字母的单词

发布于 2024-11-01 07:45:44 字数 197 浏览 1 评论 0原文

我有一个很大的单词列表文件，每行一个单词。我想过滤掉重复字母的单词。

INPUT:
  abducts
  abe
  abeam
  abel
  abele

OUTPUT:
  abducts
  abe
  abel

我想使用正则表达式（grep 或 perl 或 python）来做到这一点。这可能吗？

原文

I have a large word list file with one word per line. I would like to filter out the words with repeating alphabets.

INPUT:
  abducts
  abe
  abeam
  abel
  abele

OUTPUT:
  abducts
  abe
  abel

I'd like to do this using Regex (grep or perl or python). Is that possible?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

聽兲甴掵 2024-11-08 07:45:44

编写一个正则表达式来匹配具有重复字母的单词，然后否定匹配要容易得多：（

my @input = qw(abducts abe abeam abel abele);
my @output = grep { not /(\w).*\1/ } @input;

此代码假设 @input 每个条目包含一个单词。）但这个问题不一定能通过正则表达式得到最好的解决。

我已经在 Perl 中给出了代码，但它可以很容易地转换为任何支持反向引用的正则表达式风格，包括 grep （它也有 -v 开关来否定匹配）。

It's much easier to write a regex that matches words that do have repeating letters, and then negate the match:

my @input = qw(abducts abe abeam abel abele);
my @output = grep { not /(\w).*\1/ } @input;

(This code assumes that @input contains one word per entry.) But this problem isn't necessarily best solved with a regex.

I've given the code in Perl, but it could easily be translated into any regex flavor that supports backreferences, including grep (which also has the -v switch to negate the match).

回复收藏 0 原文

紫罗兰の梦幻 2024-11-08 07:45:44

$ egrep -vi '(.).*\1' wordlist

$ egrep -vi '(.).*\1' wordlist

回复收藏 0 原文

滥情哥ㄟ 2024-11-08 07:45:44

可以使用正则表达式：

import re

inp = [
    'abducts'
,   'abe'
,   'abeam'
,   'abel'
,   'abele'
]

# detect word which contains a character at least twice
rgx = re.compile(r'.*(.).*\1.*') 

def filter_words(inp):
    for word in inp:
        if rgx.match(word) is None:
            yield word

print list(filter_words(inp))

It is possible to use regex:

import re

inp = [
    'abducts'
,   'abe'
,   'abeam'
,   'abel'
,   'abele'
]

# detect word which contains a character at least twice
rgx = re.compile(r'.*(.).*\1.*') 

def filter_words(inp):
    for word in inp:
        if rgx.match(word) is None:
            yield word

print list(filter_words(inp))

回复收藏 0 原文

蓝天白云 2024-11-08 07:45:44

简单的东西

尽管有不准确的说法称这对于正则表达式是不可能的，但它确实是不可能的。

虽然 @cjm 公正地指出，否定正匹配比将负匹配表示为单个模式要容易得多，但这样做的模型已经足够众所周知，以至于只需将一些东西插入该模式即可模型。假设：

/X/

表达条件的方法

    ! /X/

匹配某些东西，那么以单个正匹配模式

    /\A (?: (?! X ) . ) * \z /sx

是将其写为因此，鉴于正模式是

    / (\pL) .* \1 /sxi

相应的负需求，必须

    /\A (?: (?! (\pL) .* \1  ) . ) * \z /sxi

通过简单替换 X 来实现.

现实世界的担忧

尽管如此，有些令人情有可原的担忧有时可能需要更多的工作。例如，虽然 \pL 描述具有 GeneralCategory=Letter 属性的任何代码点，但它不考虑如何处理诸如 red‐violet–colored< /em>、'Tisn't 或 fiancée — 后者在其他等效的 NFD 与 NFC 形式中有所不同。

因此，您必须首先通过完全分解来运行它，以便像 "r\x{E9}sume\x{301}" 这样的字符串能够正确检测重复的“字母 é's”——即所有规范等效的字素簇单元。

为了解决这些问题，您至少必须首先通过 NFD 分解运行字符串，然后还通过 \X 使用字素簇，而不是通过 .< /代码>。

因此，对于英语，您需要遵循这些正匹配的内容，并根据上面给出的替换进行相应的负匹配：

    NFD($string) =~ m{
        (?<ELEMENT>
           (?= [\p{Alphabetic}\p{Dash}\p{Quotation_Mark}] ) \X 
        )
        \X *
        \k<ELEMENT>
    }xi

但即使如此，仍然存在某些未解决的未解决问题，例如是否 \ N{EN DASH} 和 \N{HYPHEN} 应被视为等效元素或不同元素。

这是因为正确的书写方式，将两个元素如 red-violet 和 colored 连接起来形成单个复合词 red-violet-colored，其中至少其中一个已经包含连字符，要求使用 EN DASH 作为分隔符，而不仅仅是连字符。

通常，EN DASH 保留用于类似性质的化合物，例如时空权衡。然而，使用打字机英语的人甚至不会这样做，他们使用超大规模重载的遗留代码点连字符减号来表示两者：红紫色。

这仅取决于您的文本是否来自 19 世纪的手动打字机，或者它是否代表在现代排版规则下正确呈现的英语文本。 :)

认真区分大小写

您会注意到，我在这里将仅大小写不同的字母视为同一字母。这是因为我使用 /i 正则表达式开关，ᴀᴋᴀ (?i) 模式修饰符。

这相当就像说它们与排序规则强度1相同 - 但又不完全一样，因为Perl仅使用大小写折叠（尽管完整大小写折叠并不简单) 对于不区分大小写的匹配，而不是比三级更高的排序规则强度，这可能是首选。

主要校对强度的完全等价是一种明显更强的说法，但在一般情况下可能需要这种说法才能完全解决问题。然而，在许多特定情况下，这需要比解决问题所需的工作多得多的工作。简而言之，对于实际出现的许多具体情况来说，它是多余的，无论假设的一般情况需要多少。

这变得更加困难，因为尽管你可以这样做：

    my $collator = new Unicode::Collate::Locale::
                       level => 1, 
                       locale => "de__phonebook",
                       normalization => undef,
                    ;

    if ($collator->cmp("müß", "MUESS") == 0) { ... }

并期望得到正确的答案 - 你确实得到了，万岁！ - 这种强大的字符串比较不容易扩展到正则表达式匹配。

然而。 :)

总结

选择是设计不足还是过度设计解决方案将根据个人情况而有所不同，没有人可以为您做出决定。

我喜欢 CJM 的解决方案，它否定了正匹配，尽管它对于重复字母的判断有些漫不经心。注意：

    while ("de__phonebook" =~ /(?=((\w).*?\2))/g) {
        print "The letter <$2> is duplicated in the substring <$1>.\n";
    }

产生：

    The letter <e> is duplicated in the substring <e__phone>.
    The letter <_> is duplicated in the substring <__>.
    The letter <o> is duplicated in the substring <onebo>.
    The letter <o> is duplicated in the substring <oo>.

这说明了为什么当您需要匹配字母时，您应该始终使用 \pL ᴀᴋᴀ \p{Letter} 而不是\w，实际上匹配 [\p{alpha}\p{GC=Mark}\p{NT=De}\p{GC=Pc}]。

当然，当你需要匹配字母时，你需要使用 \p{alpha} ᴀᴋᴀ\p{Alphabetic}，它与只是一封信——与普遍的误解相反。 :)

Simple Stuff

Despite the inaccurate protestation that this is impossible with a regex, it certainly is.

While @cjm justly states that it is a lot easier to negate a positive match than it is to express a negative one as a single pattern, the model for doing so is sufficiently well-known that it becomes a mere matter of plugging things into that model. Given that:

/X/

matches something, then the way to express the condition

    ! /X/

in a single, positively-matching pattern is to write it as

    /\A (?: (?! X ) . ) * \z /sx

Therefore, given that the positive pattern is

    / (\pL) .* \1 /sxi

the corresponding negative needs must be

    /\A (?: (?! (\pL) .* \1  ) . ) * \z /sxi

by way of simple substitution for X.

Real-World Concerns

That said, there are extenuating concerns that may sometimes require more work. For example, while \pL describes any code point having the GeneralCategory=Letter property, it does not consider what to do with words like red‐violet–colored, ’Tisn’t, or fiancée — the latter of which is different in otherwise-equivalent NFD vs NFC forms.

You therefore must first run it through full decomposition, so that a string like "r\x{E9}sume\x{301}" would correctly detect the duplicate “letter é’s” — that is, all canonically equivalent grapheme cluster units.

To account for such as these, you must at a bare minimum first run your string through an NFD decomposition, and then afterwards also use grapheme clusters via \X instead of arbitrary code points via ..

So for English, you would want something that followed along these lines for the positive match, with the corresponding negative match per the substitution give above:

    NFD($string) =~ m{
        (?<ELEMENT>
           (?= [\p{Alphabetic}\p{Dash}\p{Quotation_Mark}] ) \X 
        )
        \X *
        \k<ELEMENT>
    }xi

But even with that there still remain certain outstanding issues unresolved, such as for example whether \N{EN DASH} and \N{HYPHEN} should be considered equivalent elements or different ones.

That’s because properly written, hyphenating two elements like red‐violet and colored to form the single compound word red‐violet–colored, where at least one of the pair already contains a hyphen, requires that one employ an EN DASH as the separator instead of a mere HYPHEN.

Normally the EN DASH is reserved for compounds of like nature, such as a time–space trade‐off. People using typewriter‐English don’t even do that, though, using that super‐massively overloaded legacy code point, HYPHEN-MINUS, for both: red-violet-colored.

It just depends whether your text came from some 19th‐century manual typewriter — or whether it represents English text properly rendered under modern typesetting rules. :)

Conscientious Case Insensitivity

You will note I am here considering letter that differ in case alone to be the same one. That’s because I use the /i regex switch, ᴀᴋᴀ the (?i) pattern modifier.

That’s rather like saying that they are the same as collation strength 1 — but not quite, because Perl uses only case folding (albeit full case folding not simple) for its case insensitive matches, not some higher collation strength than the tertiary level as might be preferred.

Full equivalence at the primary collation strength is a significantly stronger statement, but one that may well be needed to fully solve the problem in the general case. However, that requires a lot more work than the problem necessarily requires in many specific instances. In short, it is overkill for many specific cases that actually arise, no matter how much it might be needed for the hypothetical general case.

This is made even more difficult because, although you can for example do this:

    my $collator = new Unicode::Collate::Locale::
                       level => 1, 
                       locale => "de__phonebook",
                       normalization => undef,
                    ;

    if ($collator->cmp("müß", "MUESS") == 0) { ... }

and expect to get the right answer — and you do, hurray! — this sort of robust string comparison is not easily extended to regex matches.

Yet. :)

Summary

The choice of whether to under‐engineer — or to over‐engineer — a solution will vary according to individual circumstances, which no one can decide for you.

I like CJM’s solution that negates a positive match, myself, although it’s somewhat cavalier about what it considers a duplicate letter. Notice:

    while ("de__phonebook" =~ /(?=((\w).*?\2))/g) {
        print "The letter <$2> is duplicated in the substring <$1>.\n";
    }

produces:

    The letter <e> is duplicated in the substring <e__phone>.
    The letter <_> is duplicated in the substring <__>.
    The letter <o> is duplicated in the substring <onebo>.
    The letter <o> is duplicated in the substring <oo>.

That shows why when you need to match a letter, you should alwasy use \pL ᴀᴋᴀ \p{Letter} instead of \w, which actually matches [\p{alpha}\p{GC=Mark}\p{NT=De}\p{GC=Pc}].

Of course, when you need to match an alphabetic, you need to use \p{alpha} ᴀᴋᴀ\p{Alphabetic}, which isn’t at all the same as a mere letter — contrary to popular misunderstanding. :)

回复收藏 0 原文

寒冷纷飞旳雪 2024-11-08 07:45:44

如果您正在处理可能包含重复字母的长字符串，尽快停止可能会有所帮助。

INPUT: for (@input) {
   my %seen;
   while (/(.)/sg) {
      next INPUT if $seen{$1}++;
   }
   say;
}

我会采用最简单的解决方案，除非发现性能确实不可接受。

my @output = grep !/(.).*?\1/s, @input;

If you're dealing with long strings that are likely to have duplicate letters, stopping ASAP may help.

INPUT: for (@input) {
   my %seen;
   while (/(.)/sg) {
      next INPUT if $seen{$1}++;
   }
   say;
}

I'd go with the simplest solution unless the performance is found to be really unacceptable.

my @output = grep !/(.).*?\1/s, @input;

回复收藏 0 原文

猫烠⑼条掵仅有一顆心 2024-11-08 07:45:44

我很好奇其他作者针对这个问题提交的各种基于 Perl 的方法的相对速度。所以，我决定对它们进行基准测试。

必要时，我稍微修改了每个方法，以便它填充一个 @output 数组，以保持输入和输出一致。我验证了所有方法都会产生相同的 @output，尽管我没有在此处记录该断言。

以下是对各种方法进行基准测试的脚本：

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark qw(cmpthese :hireswallclock);

# get a convenient list of words (on Mac OS X 10.6.6, this contains 234,936 entries)
open (my $fh, '<', '/usr/share/dict/words') or die "can't open words file: $!\n";
my @input = <$fh>;
close $fh;

# remove line breaks
chomp @input;

# set-up the tests (
my %tests = (

  # Author: cjm
  RegExp => sub { my @output = grep { not /(\w).*\1/ } @input },

  # Author: daotoad
  SplitCount => sub { my @output = grep { my @l = split ''; my %l; @l{@l} = (); keys %l == @l } @input; },

  # Author: ikegami
  NextIfSeen => sub {
    my @output;
    INPUT: for (@input) {
      my %seen;
      while (/(.)/sg) {
        next INPUT if $seen{$1}++;
      }
      push @output, $_;
    }

  },

  # Author: ysth
  BitMask => sub {
    my @output;
    for my $word (@input) {
      my $mask1 = $word x ( length($word) - 1 );
      my $mask2 = join( '', map { substr($word, $_), substr($word, 0, $_) } 1..length($word)-1 );
      if ( ( $mask1 ^ $mask2 ) !~ tr/\0// ) {
        push @output, $word;
      }
    }
  },

);

# run each test 100 times
cmpthese(100, \%tests);

以下是 100 次迭代的结果。

           s/iter SplitCount    BitMask NextIfSeen     RegExp
SplitCount   2.85         --       -11%       -58%       -85%
BitMask      2.54        12%         --       -53%       -83%
NextIfSeen   1.20       138%       113%         --       -64%
RegExp      0.427       567%       496%       180%         --

正如您所看到的，cjm 的“RegExp”方法是迄今为止最快的。它比第二快的方法（ikegami 的“NextIfSeen”方法）快 180%。我怀疑 RegExp 和 NextIfSeen 方法的相对速度会随着输入字符串的平均长度的增加而收敛。但对于“正常”长度的英语单词，RegExp 方法是最快的。

I was very curious about the relative speed of the various Perl-based methods submitted by other authors for this question. So, I decided to benchmark them.

Where necessary, I slightly modified each method so that it would populate an @output array, to keep the input and output consistent. I verified that all the methods produce the same @output, although I have not documented that assertion here.

Here is the script to benchmark the various methods:

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark qw(cmpthese :hireswallclock);

# get a convenient list of words (on Mac OS X 10.6.6, this contains 234,936 entries)
open (my $fh, '<', '/usr/share/dict/words') or die "can't open words file: $!\n";
my @input = <$fh>;
close $fh;

# remove line breaks
chomp @input;

# set-up the tests (
my %tests = (

  # Author: cjm
  RegExp => sub { my @output = grep { not /(\w).*\1/ } @input },

  # Author: daotoad
  SplitCount => sub { my @output = grep { my @l = split ''; my %l; @l{@l} = (); keys %l == @l } @input; },

  # Author: ikegami
  NextIfSeen => sub {
    my @output;
    INPUT: for (@input) {
      my %seen;
      while (/(.)/sg) {
        next INPUT if $seen{$1}++;
      }
      push @output, $_;
    }

  },

  # Author: ysth
  BitMask => sub {
    my @output;
    for my $word (@input) {
      my $mask1 = $word x ( length($word) - 1 );
      my $mask2 = join( '', map { substr($word, $_), substr($word, 0, $_) } 1..length($word)-1 );
      if ( ( $mask1 ^ $mask2 ) !~ tr/\0// ) {
        push @output, $word;
      }
    }
  },

);

# run each test 100 times
cmpthese(100, \%tests);

Here are the results for 100 iterations.

           s/iter SplitCount    BitMask NextIfSeen     RegExp
SplitCount   2.85         --       -11%       -58%       -85%
BitMask      2.54        12%         --       -53%       -83%
NextIfSeen   1.20       138%       113%         --       -64%
RegExp      0.427       567%       496%       180%         --

As you can see, cjm's "RegExp" method is the fastest by far. It is 180% faster than the next fastest method, ikegami's "NextIfSeen" method. I suspect that the relative speed of the RegExp and NextIfSeen methods will converge as the average length of the input strings increases. But for "normal" length English words, the RegExp method is the fastest.

回复收藏 0 原文

人生戏 2024-11-08 07:45:44

cjm 给出了正则表达式，但这是一个有趣的非正则表达式方式：

@words = qw/abducts abe abeam abel abele/;
for my $word (@words) {
    my $mask1 = $word x ( length($word) - 1 );
    my $mask2 = join( '', map { substr($word, $_), substr($word, 0, $_) } 1..length($word)-1 );
    if ( ( $mask1 ^ $mask2 ) !~ tr/\0// ) {
        print "$word\n";
    }
}

cjm gave the regex, but here's an interesting non-regex way:

@words = qw/abducts abe abeam abel abele/;
for my $word (@words) {
    my $mask1 = $word x ( length($word) - 1 );
    my $mask2 = join( '', map { substr($word, $_), substr($word, 0, $_) } 1..length($word)-1 );
    if ( ( $mask1 ^ $mask2 ) !~ tr/\0// ) {
        print "$word\n";
    }
}

回复收藏 0 原文

成熟的代价 2024-11-08 07:45:44

为了回应 cjm 的解决方案，我想知道它与一些相当简洁的 Perl 相比如何：

my @output = grep { my @l = split ''; my %l; @l{@l} = (); keys %l == @l } @input;

因为我在这里不受字符数和格式的限制，所以我会更清楚一点，甚至到过度记录的程度：

my @output = grep {

    # Split $_ on the empty string to get letters in $_. 
    my @letters = split '';

    # Use a hash to remove duplicate letters.
    my %unique_letters;
    @unique_letters{@letters} = ();  # This is a hash slice assignment.
                                     # See perldoc perlvar for more info

    # is the number of unique letters equal to the number of letters?
    keys %unique_letters == @letters

} @input;

并且，当然在生产代码中，请执行以下操作：

my @output = grep ! has_repeated_chars($_), @input;

sub has_repeated_letters {
    my $word = shift;
    #blah blah blah
    # see example above for the code to use here, with a nip and a tuck.
}

In response to cjm's solution, I wondered about how it compared to some rather terse Perl:

my @output = grep { my @l = split ''; my %l; @l{@l} = (); keys %l == @l } @input;

Since I am not constrained in character count and formatting here, I'll be a bit clearer, even to the point of over-documenting:

my @output = grep {

    # Split $_ on the empty string to get letters in $_. 
    my @letters = split '';

    # Use a hash to remove duplicate letters.
    my %unique_letters;
    @unique_letters{@letters} = ();  # This is a hash slice assignment.
                                     # See perldoc perlvar for more info

    # is the number of unique letters equal to the number of letters?
    keys %unique_letters == @letters

} @input;

And, of course in production code, please do something like this:

my @output = grep ! has_repeated_chars($_), @input;

sub has_repeated_letters {
    my $word = shift;
    #blah blah blah
    # see example above for the code to use here, with a nip and a tuck.
}

回复收藏 0 原文

少跟Wǒ拽 2024-11-08 07:45:44

在带有正则表达式的 python 中：

python -c 'import re, sys; print "".join(s for s in open(sys.argv[1]) if not re.match(r".*(\w).*\1", s))' wordlist.txt

在没有正则表达式的 python 中：

python -c 'import sys; print "".join(s for s in open(sys.argv[1]) if len(s) == len(frozenset(s)))' wordlist.txt

我使用硬编码的文件名执行了一些计时测试，并将输出重定向到 /dev/null 以避免在计时中包含输出：

没有正则表达式的计时：

python -m timeit 'import sys' 'print >> sys.stderr, "".join(s for s in open("wordlist.txt") if len(s) == len(frozenset(s)))' 2>/dev/null
10000 loops, best of 3: 91.3 usec per loop

带有正则表达式的计时：

python -m timeit 'import re, sys' 'print >> sys.stderr, "".join(s for s in open("wordlist.txt") if re.match(r".*(\w).*\1", s))' 2>/dev/null
10000 loops, best of 3: 105 usec per loop

显然正则表达式比 python 中简单的 freezeset 创建和 len 比较慢一点。

In python with a regex:

python -c 'import re, sys; print "".join(s for s in open(sys.argv[1]) if not re.match(r".*(\w).*\1", s))' wordlist.txt

In python without a regex:

python -c 'import sys; print "".join(s for s in open(sys.argv[1]) if len(s) == len(frozenset(s)))' wordlist.txt

I performed some timing tests with a hardcoded file name and output redirected to /dev/null to avoid including output in the timing:

Timings without the regex:

python -m timeit 'import sys' 'print >> sys.stderr, "".join(s for s in open("wordlist.txt") if len(s) == len(frozenset(s)))' 2>/dev/null
10000 loops, best of 3: 91.3 usec per loop

Timings with the regex:

python -m timeit 'import re, sys' 'print >> sys.stderr, "".join(s for s in open("wordlist.txt") if re.match(r".*(\w).*\1", s))' 2>/dev/null
10000 loops, best of 3: 105 usec per loop

Clearly the regex is a tiny bit slower than a simple frozenset creation and len comparison in python.

回复收藏 0 原文

以为你会在 2024-11-08 07:45:44

你不能用正则表达式来做到这一点。正则表达式是一个有限状态机，这需要一个堆栈来存储已看到的字母。

我建议使用 foreach 执行此操作，并使用代码手动检查每个单词。
像这样的东西

List chars
foreach word in list
    foreach letter in word
        if chars.contains letter then remove word from list
        else
            chars.Add letter
    chars.clear

You can't do this with Regex. Regex is a Finite State Machine, and this would require a stack to store what letters have been seen.

I would suggest doing this with a foreach and manually check each word with code.
Something like

List chars
foreach word in list
    foreach letter in word
        if chars.contains letter then remove word from list
        else
            chars.Add letter
    chars.clear

回复收藏 0 原文

~没有更多了~

关于作者

美煞众生

暂无简介

0 文章

0 评论

23 人气

关注发私信

友情链接

文江博客

使用正则表达式仅从列表中提取不包含重复字母的单词

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（10）

简单的东西

现实世界的担忧

认真区分大小写

总结

Simple Stuff

Real-World Concerns

Conscientious Case Insensitivity

Summary

关于作者

相关话题

热门标签

推荐作者

lorenzathorton8

Zero

萧瑟寒风

mylayout

tkewei

17818769742

友情链接

使用正则表达式仅从列表中提取不包含重复字母的单词

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（10）

简单的东西

现实世界的担忧

认真区分大小写

总结

Simple Stuff

Real-World Concerns

Conscientious Case Insensitivity

Summary

关于作者

相关话题

热门标签

推荐作者

lorenzathorton8

Zero

萧瑟寒风

mylayout

tkewei

17818769742

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。