Regex to match "lol" to "lolllll" and "omg" to "omggg", etc.

Posted on 2024-09-26 14:21:45

Hey there, I love regular expressions, but I'm just not good at them at all.

I have a list of some 400 shortened words such as lol, omg, lmao...etc. Whenever someone types one of these shortened words, it is replaced with its English counterpart ([laughter], or something to that effect). Anyway, people are annoying and type these short-hand words with the last letter(s) repeated x number of times.

examples:
omg -> omgggg, lol -> lollll, haha -> hahahaha, lol -> lololol

I was wondering if anyone could hand me the regex (in Python, preferably) to deal with this?

Thanks all.

(It's a Twitter-related project for topic identification, if anyone's curious. If someone tweets "Let's go shoot some hoops", how do you know the tweet is about basketball, etc.?)

Comments (2)

以往的大感动 2024-10-03 14:21:46

FIRST APPROACH -

Well, using regular expressions you could do something like this -

import re
# Collapse each run of a repeated letter into a single letter.
re.sub(r'g+', 'g', 'omgggg')  # 'omg'
re.sub(r'l+', 'l', 'lollll')  # 'lol'

etc.

Let me point out that using regular expressions this way is a fragile and basic approach to the problem. Users can easily type strings that break the regular expressions above. In other words, this approach requires a lot of maintenance: you have to observe the patterns of mistakes users make and then write case-specific regular expressions for them.
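To make the fragility concrete, here is a minimal sketch that generalizes the substitutions above to "collapse a run of the same final character". The helper name collapse_trailing_repeat is only illustrative and is not part of the answer. It handles the simple cases but leaves multi-letter repeats alone and damages ordinary words that legitimately end in a doubled letter.

```python
import re

def collapse_trailing_repeat(word):
    # Replace a run of the same final character with a single copy.
    return re.sub(r"(.)\1+$", r"\1", word)

print(collapse_trailing_repeat("omgggg"))   # omg
print(collapse_trailing_repeat("lollll"))   # lol
print(collapse_trailing_repeat("lololol"))  # lololol (repeated "ol" is not handled)
print(collapse_trailing_repeat("all"))      # al (an ordinary word gets damaged)
```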

SECOND APPROACH -

Instead, have you considered using the difflib module? It's a module with helpers for computing deltas between objects. Of particular importance for you here is SequenceMatcher. To quote the official documentation -

SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. SequenceMatcher tries to compute a "human-friendly diff" between two sequences. The fundamental notion is the longest contiguous & junk-free matching subsequence.

import difflib as dl

# The first argument marks spaces as junk; compare in both directions and average.
x   = dl.SequenceMatcher(lambda ch: ch == ' ', "omg", "omgggg")
y   = dl.SequenceMatcher(lambda ch: ch == ' ', "omgggg", "omg")
avg = (x.ratio() + y.ratio()) / 2.0
if avg >= 0.6:
    print('Match!')
else:
    print('Sorry!')

According to the documentation, a ratio() over 0.6 means the sequences are a close match. You may need to tune the threshold for your data. If you need stricter matching, I found that a value over 0.8 serves well.
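One way to apply this idea to a whole abbreviation list is difflib.get_close_matches, which ranks candidates by the same ratio() and uses 0.6 as its default cutoff. The following is only a sketch: the ABBREVIATIONS list and the closest_abbreviation helper are made-up names for illustration, and the real list would hold the roughly 400 entries from the question.

```python
import difflib

# Hypothetical sample of the abbreviation list.
ABBREVIATIONS = ["lol", "omg", "lmao", "haha"]

def closest_abbreviation(token, cutoff=0.6):
    # Return the best-scoring abbreviation, or None if nothing reaches the cutoff.
    matches = difflib.get_close_matches(token.lower(), ABBREVIATIONS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(closest_abbreviation("omgggg"))      # omg
print(closest_abbreviation("omgggggggg"))  # None
```

Note that the ratio falls as the elongation grows, so a heavily stretched token such as the second one drops below the cutoff. Lowering the cutoff catches more of these but also raises the chance of false matches.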

乱世争霸 2024-10-03 14:21:46

How about

\b(?=lol)\S*(\S+)(?<=\blol)\1*\b

(replace lol with omg, haha etc.)

This will match lol, lololol, lollll, lollollol etc. but fail lolo, lollllo, lolly and so on.

The rules:

  1. Match the word lol completely.
  2. Then allow any repetition of one or more characters at the end of the word (i.e. l, ol or lol).

So \b(?=zomg)\S*(\S+)(?<=\bzomg)\1*\b will match zomg, zomggg, zomgmgmg, zomgomgomg etc.

In Python, with comments:

import re

subject = "lol lollll lololol lolly"  # example input
result = re.sub(
    r"""(?ix)\b    # assert position at a word boundary
    (?=lol)        # assert that "lol" can be matched here
    \S*            # match any number of characters except whitespace
    (\S+)          # match at least one character (to be repeated later)
    (?<=\blol)     # until we have reached exactly the position after the 1st "lol"
    \1*            # then repeat the preceding character(s) any number of times
    \b             # and ensure that we end up at another word boundary""",
    "lol", subject)
print(result)

This will also match the "unadorned" version (i.e. lol without any repetition). If you don't want this, use \1+ instead of \1*.
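Since the question involves a list of about 400 abbreviations, the same pattern can be generated per word rather than written by hand. Below is a minimal sketch under stated assumptions: the EXPANSIONS mapping, its replacement strings, and the build_pattern and expand helpers are hypothetical and are not part of the answer.

```python
import re

# Hypothetical mapping from abbreviation to its replacement text.
EXPANSIONS = {"lol": "[laughter]", "omg": "[surprise]", "haha": "[laughter]"}

def build_pattern(word):
    # Same regex as above, with the abbreviation escaped and substituted in.
    w = re.escape(word)
    return re.compile(r"\b(?=%s)\S*(\S+)(?<=\b%s)\1*\b" % (w, w), re.IGNORECASE)

PATTERNS = [(build_pattern(word), repl) for word, repl in EXPANSIONS.items()]

def expand(text):
    # Apply each abbreviation's pattern in turn.
    for pattern, repl in PATTERNS:
        # A function replacement avoids backslash processing in repl.
        text = pattern.sub(lambda match: repl, text)
    return text

print(expand("omgggg that was funny lollll"))  # [surprise] that was funny [laughter]
print(expand("lolly"))                         # lolly
```

Running one pattern per word keeps each regex simple, at the cost of one pass over the text per abbreviation. Because the passes run in sequence, a replacement string that itself looks like an abbreviation could be rewritten by a later pass, so the replacement texts should be chosen with that in mind.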
