Using a regex to extract only the words with no repeated letters from a list
I have a large word list file with one word per line. I would like to filter out the words with repeated letters.
INPUT:
abducts
abe
abeam
abel
abele
OUTPUT:
abducts
abe
abel
I'd like to do this using a regex (grep, perl, or python). Is that possible?
It's much easier to write a regex that matches words that do have repeating letters, and then negate the match:
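The code block itself did not survive here; judging from the description that follows (and the later answers that refer to this as a \w-based pattern negated with grep), it was presumably something along these lines:

    # Keep only the words in which no \w character occurs twice.
    my @output = grep { not /(\w).*\1/ } @input;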
(This code assumes that @input contains one word per entry.) But this problem isn't necessarily best solved with a regex. I've given the code in Perl, but it could easily be translated into any regex flavor that supports backreferences, including grep (which also has the -v switch to negate the match).
It is possible to use a regex:
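The original snippet is missing. One way to express the test as a single, positive pattern (a sketch, not necessarily what was posted) is a negative lookahead ruling out any character that recurs later in the word:

    # Keep a word only if no character in it is repeated further on.
    my @output = grep { /\A (?! .* (.) .* \1 ) /x } @input;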
Simple Stuff
Despite the inaccurate protestation that this is impossible with a regex, it certainly is.
While @cjm justly states that it is a lot easier to negate a positive match than it is to express a negative one as a single pattern, the model for doing so is sufficiently well known that it becomes a mere matter of plugging things into that model. Given some pattern X that matches the thing you do not want, the way to express "contains no match for X" as a single, positively-matching pattern is to test X with a negative lookahead at every position of the string. Therefore, given the positive pattern for a repeated letter, the corresponding negative pattern needs must follow by way of simple substitution for X, as sketched below.
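Reconstructing the inline patterns that did not survive extraction, the model looks roughly like this in Perl (the exact spacing, flags, and anchoring are assumptions):

    # Positive pattern X: some letter occurs, and the same letter occurs again later.
    my $X = qr/ (\pL) .* \1 /sxi;

    # "Contains no match for X": at every position assert that X cannot match
    # from here, consume one character, and repeat until the end of the string.
    my $not_X = qr/ \A (?: (?! $X ) . ) * \z /sx;

    my @output = grep { /$not_X/ } @input;    # @input holds one word per entry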
Real-World Concerns
That said, there are extenuating concerns that may sometimes require more work. For example, while \pL describes any code point having the GeneralCategory=Letter property, it does not consider what to do with words like red‐violet–colored, ’Tisn’t, or fiancée — the latter of which is different in otherwise-equivalent NFD vs NFC forms. You therefore must first run it through full decomposition, so that a string like "r\x{E9}sume\x{301}" would correctly detect the duplicate “letter é’s” — that is, all canonically equivalent grapheme cluster units. To account for such as these, you must at a bare minimum first run your string through an NFD decomposition, and then afterwards also use grapheme clusters via \X instead of arbitrary code points via . (the dot metacharacter). So for English, you would want something that followed along these lines for the positive match, with the corresponding negative match per the substitution given above:
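The pattern itself is missing at this point; here is a sketch of the sort of thing being described (NFD first, then grapheme clusters via \X; the variable names and exact formulation are my own):

    use Unicode::Normalize qw(NFD);

    # Positive pattern: some alphabetic grapheme cluster occurs twice in the word.
    my $repeated_grapheme = qr/ ( (?= \p{Alphabetic} ) \X ) \X* \1 /xi;

    my @output;
    for my $word (@input) {
        my $decomposed = NFD($word);            # canonical decomposition first
        push @output, $word
            unless $decomposed =~ $repeated_grapheme;   # then negate the match
    }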
But even with that there still remain certain outstanding issues unresolved, such as for example whether \N{EN DASH} and \N{HYPHEN} should be considered equivalent elements or different ones. That’s because properly written, hyphenating two elements like red‐violet and colored to form the single compound word red‐violet–colored, where at least one of the pair already contains a hyphen, requires that one employ an EN DASH as the separator instead of a mere HYPHEN.
Normally the EN DASH is reserved for compounds of like nature, such as a time–space trade‐off. People using typewriter‐English don’t even do that, though, using that super‐massively overloaded legacy code point, HYPHEN-MINUS, for both: red-violet-colored.
It just depends whether your text came from some 19th‐century manual typewriter — or whether it represents English text properly rendered under modern typesetting rules. :)
Conscientious Case Insensitivity
You will note I am here considering letters that differ in case alone to be the same letter. That’s because I use the /i regex switch, ᴀᴋᴀ the (?i) pattern modifier. That’s rather like saying that they are the same at collation strength 1 — but not quite, because Perl uses only case folding (albeit full case folding, not simple) for its case-insensitive matches, not some higher collation strength than the tertiary level as might be preferred.
Full equivalence at the primary collation strength is a significantly stronger statement, but one that may well be needed to fully solve the problem in the general case. However, that requires a lot more work than the problem necessarily requires in many specific instances. In short, it is overkill for many specific cases that actually arise, no matter how much it might be needed for the hypothetical general case.
This is made even more difficult because, although you can for example do this:
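The comparison shown at this point is missing; it was presumably something like the following sketch using the standard Unicode::Collate module at primary strength (the example strings are my own):

    use Unicode::Collate;

    # Strength 1 ignores case, accents, and other above-primary differences.
    my $collator = Unicode::Collate->new(level => 1);

    if ( $collator->eq("r\x{E9}sum\x{E9}", "RESUME") ) {
        print "equivalent at the primary collation strength\n";
    }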
and expect to get the right answer — and you do, hurray! — this sort of robust string comparison is not easily extended to regex matches.
Yet. :)
Summary
The choice of whether to under‐engineer — or to over‐engineer — a solution will vary according to individual circumstances, which no one can decide for you.
I like CJM’s solution that negates a positive match, myself, although it’s somewhat cavalier about what it considers a duplicate letter: because it tests \w rather than actual letters, a repeated digit, underscore, or combining mark counts as a “repeated letter” too, as the sketch below illustrates.
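A hypothetical illustration (not the original demo) of what that cavalier treatment means in practice:

    use strict;
    use warnings;

    # The \w-based pattern flags repeated digits and underscores
    # just as it flags repeated letters.
    for my $w (qw( abducts ab_cd_ef route66 )) {
        printf "%-10s %s\n", $w,
            $w =~ /(\w).*\1/ ? "rejected (has a 'repeat')" : "kept";
    }
    # ab_cd_ef and route66 are rejected even though no letter repeats in them.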
That shows why when you need to match a letter, you should always use \pL ᴀᴋᴀ \p{Letter} instead of \w, which actually matches [\p{alpha}\p{GC=Mark}\p{NT=De}\p{GC=Pc}]. Of course, when you need to match an alphabetic, you need to use \p{alpha} ᴀᴋᴀ \p{Alphabetic}, which isn’t at all the same as a mere letter — contrary to popular misunderstanding. :)
If you're dealing with long strings that are likely to have duplicate letters, stopping ASAP may help.
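The code for this answer is missing; here is a sketch of the "stop as soon as possible" idea, with @input and @output assumed as in the other answers:

    my @output;
    WORD: for my $word (@input) {
        my %seen;
        for my $letter ( split //, $word ) {
            next WORD if $seen{$letter}++;   # bail out on the first repeat
        }
        push @output, $word;
    }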
I'd go with the simplest solution unless the performance is found to be really unacceptable.
I was very curious about the relative speed of the various Perl-based methods submitted by other authors for this question. So, I decided to benchmark them.
Where necessary, I slightly modified each method so that it would populate an @output array, to keep the input and output consistent. I verified that all the methods produce the same @output, although I have not documented that assertion here. Here is the script to benchmark the various methods:
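The original script did not survive extraction. Below is a minimal sketch of how such a comparison might be set up with the core Benchmark module; the file name, the sub names, and the particular methods shown are assumptions, not the author's actual code:

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    open my $fh, '<', 'words.txt' or die "Cannot open words.txt: $!";
    chomp( my @input = <$fh> );
    close $fh;

    cmpthese( 100, {
        RegExp => sub {
            my @output = grep { not /(\w).*\1/ } @input;
        },
        NextIfSeen => sub {
            my @output;
            WORD: for my $word (@input) {
                my %seen;
                for my $letter ( split //, $word ) {
                    next WORD if $seen{$letter}++;
                }
                push @output, $word;
            }
        },
        # ...the remaining submitted methods would be added here the same way.
    } );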
Here are the results for 100 iterations.
As you can see, cjm's "RegExp" method is the fastest by far. It is 180% faster than the next fastest method, ikegami's "NextIfSeen" method. I suspect that the relative speed of the RegExp and NextIfSeen methods will converge as the average length of the input strings increases. But for "normal" length English words, the RegExp method is the fastest.
cjm gave the regex, but here's an interesting non-regex way:
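The snippet is missing; one possible non-regex formulation (not necessarily the one that was posted) is to compare the number of distinct characters with the word's length:

    my @output = grep {
        my $word = $_;
        my %seen = map { $_ => 1 } split //, $word;
        keys(%seen) == length($word);    # true only when every character is distinct
    } @input;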
In response to cjm's solution, I wondered about how it compared to some rather terse Perl:
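The terse snippet itself is lost; as a stand-in guess at the kind of thing meant, here is a one-liner that uses a hash of seen characters rather than a regex:

    # Keep a word unless any character in it has already been seen.
    my @output = grep { my %c; !grep { $c{$_}++ } split //, $_ } @input;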
Since I am not constrained in character count and formatting here, I'll be a bit clearer, even to the point of over-documenting:
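Again a sketch rather than the author's original, spelled out and commented:

    my @output;
    for my $word (@input) {
        my %count;                         # how many times each character appears
        my $has_repeat = 0;
        for my $char ( split //, $word ) {
            if ( ++$count{$char} > 1 ) {   # second sighting of this character
                $has_repeat = 1;
                last;                      # no need to look any further
            }
        }
        push @output, $word unless $has_repeat;
    }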
And, of course in production code, please do something like this:
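And for the "production" framing, presumably something with the usual pragmas and error checking added (the file name is a placeholder):

    use strict;
    use warnings;

    open my $fh, '<', 'words.txt' or die "Cannot open words.txt: $!";
    while ( my $word = <$fh> ) {
        chomp $word;
        my %count;
        print "$word\n" unless grep { ++$count{$_} > 1 } split //, $word;
    }
    close $fh;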
In python with a regex:
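The code is missing; here is a sketch of the regex version (the file name is a placeholder, and the pattern mirrors the backreference approach used in the Perl answers above):

    import re

    # A word is kept only if no \w character in it appears again later.
    repeat = re.compile(r'(\w).*\1')

    with open('words.txt') as f:
        for line in f:
            word = line.strip()
            if not repeat.search(word):
                print(word)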
In python without a regex:
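And a sketch of the non-regex version described at the end of this answer (frozenset creation plus a length comparison):

    # A word has no repeated letters exactly when the set of its characters
    # is as long as the word itself.
    with open('words.txt') as f:
        for line in f:
            word = line.strip()
            if len(frozenset(word)) == len(word):
                print(word)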
I performed some timing tests with a hardcoded file name and output redirected to /dev/null to avoid including output in the timing:
Timings without the regex:
Timings with the regex:
Clearly the regex is a tiny bit slower than a simple frozenset creation and len comparison in python.
You can't do this with Regex. Regex is a Finite State Machine, and this would require a stack to store what letters have been seen.
I would suggest doing this with a foreach loop, manually checking each word in code. Something like this:
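The snippet is missing; here is a sketch of the loop-based check being suggested (written in Perl, like the rest of the thread, with @input assumed to hold the words):

    foreach my $word (@input) {
        my %seen;
        my $has_repeat = 0;
        foreach my $letter ( split //, $word ) {
            if ( $seen{$letter}++ ) {   # this letter was already seen
                $has_repeat = 1;
                last;
            }
        }
        print "$word\n" unless $has_repeat;
    }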