Regex to match "lol" to "lolllll" and "omg" to "omggg", etc.
Hey there, I love regular expressions, but I'm just not good at them at all.
I have a list of some 400 shortened words such as lol, omg, lmao...etc. Whenever someone types one of these shortened words, it is replaced with its English counterpart ([laughter], or something to that effect). Anyway, people are annoying and type these short-hand words with the last letter(s) repeated x number of times.
examples:
omg -> omgggg, lol -> lollll, haha -> hahahaha, lol -> lololol
I was wondering if anyone could hand me the regex (in Python, preferably) to deal with this?
Thanks all.
(It's a Twitter-related project for topic identification if anyone's curious. If someone tweets "Let's go shoot some hoops", how do you know the tweet is about basketball, etc)
Comments (2)
FIRST APPROACH -
Well, using regular expressions, you could write a separate substitution for each shortened word, and so on down the list.
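As a rough sketch of that per-word style (the three patterns and the `normalize` helper are illustrative assumptions, not the snippet originally posted here), each rule collapses one stretched form back to its base word:

```python
import re

# Hypothetical hand-written rules: one pattern per shortened word.
# Each pattern only covers the stretching style it was written for.
RULES = [
    (re.compile(r"\bomg+\b", re.IGNORECASE), "omg"),         # omg, omgggg
    (re.compile(r"\blol+\b", re.IGNORECASE), "lol"),         # lol, lollll
    (re.compile(r"\b(?:ha){2,}\b", re.IGNORECASE), "haha"),  # haha, hahahaha
]


def normalize(text):
    """Apply every rule in turn and return the cleaned-up text."""
    for pattern, canonical in RULES:
        text = pattern.sub(canonical, text)
    return text


print(normalize("omgggg that was funny lollll hahahaha"))
print(normalize("lololol"))  # left untouched: no rule covers this variant
```

The second call shows the weakness: `lololol` repeats `ol` rather than `l`, so it slips through until someone adds yet another rule.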
Let me point out that using regular expressions is a very fragile and basic approach to this problem. You could easily get strings from users that break the above regular expressions. What I am trying to say is that this approach requires a lot of maintenance: you have to observe the patterns of mistakes users make and then create case-specific regular expressions for them.
SECOND APPROACH -
Have you considered using the `difflib` module instead? It's a module with helpers for computing deltas between objects. Of particular importance here is `SequenceMatcher`. To paraphrase the official documentation: as a rule of thumb, any `ratio()` over 0.6 is a close match. You might need to tune that threshold for your data. If you need stricter matching, I found that any value over 0.8 serves well.
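A minimal sketch of how that could look, assuming a tiny stand-in list instead of the real 400 entries; the `closest_abbreviation` helper and its `cutoff` parameter are made-up names for illustration, while `SequenceMatcher` and `ratio()` come from the standard library:

```python
from difflib import SequenceMatcher

# Stand-in for the real list of roughly 400 shortened words (assumption).
ABBREVIATIONS = ["lol", "omg", "lmao"]


def closest_abbreviation(token, abbreviations=ABBREVIATIONS, cutoff=0.6):
    """Return the abbreviation most similar to token, or None if nothing
    reaches the cutoff ratio."""
    best, best_ratio = None, 0.0
    for abbr in abbreviations:
        ratio = SequenceMatcher(None, token.lower(), abbr).ratio()
        if ratio > best_ratio:
            best, best_ratio = abbr, ratio
    return best if best_ratio >= cutoff else None


print(closest_abbreviation("omgggg"))      # similar enough to "omg"
print(closest_abbreviation("basketball"))  # nothing is close, so None
```

One thing to watch: the ratio drops as the tail gets longer, so a very stretched word such as `omggggggg` scores only 0.5 against `omg` and is rejected at the default cutoff. Lowering the cutoff catches more of those but also lets in more false positives.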
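How about `\b(?=lol)\S*(\S+)(?<=\blol)\1*\b` (replace `lol` with `omg`, `haha` etc.)
This will match `lol`, `lololol`, `lollll`, `lollollol` etc. but fail `lolo`, `lollllo`, `lolly` and so on.
The rules:
1. Match the word `lol` completely.
2. Then allow any repetition of its trailing characters (`l`, `ol` or `lol`).
So `\b(?=zomg)\S*(\S+)(?<=\bzomg)\1*\b` will match `zomg`, `zomggg`, `zomgmgmg`, `zomgomgomg` etc.
In Python, with comments: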
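The commented snippet did not come through with this copy of the answer, so here is a sketch of the same pattern in verbose mode. The `stretched_pattern` helper is a hypothetical name, and building the regex from an arbitrary word with `re.escape` is an assumption added for convenience; the pattern itself is the one quoted above.

```python
import re


def stretched_pattern(word):
    """Compile the pattern above for any base word (hypothetical helper)."""
    w = re.escape(word)
    return re.compile(
        rf"""
        \b            # start at a word boundary
        (?={w})       # the base word must begin right here
        \S*           # skip the leading part of the base word
        (\S+)         # capture the trailing piece that may be repeated
        (?<=\b{w})    # we must now be exactly at the end of the first base word
        \1*           # any number of repetitions of the captured piece
        \b            # and finish at a word boundary
        """,
        re.IGNORECASE | re.VERBOSE,
    )


lol = stretched_pattern("lol")
for candidate in ["lol", "lololol", "lollll", "lollollol", "lolo", "lolly"]:
    print(candidate, bool(lol.fullmatch(candidate)))

# Collapsing a stretched word back to its base form:
print(stretched_pattern("omg").sub("omg", "omgggg so good"))
```

With 400 words you would compile one such pattern per entry and run them in a loop, which is slower than a single regex but keeps each pattern readable.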
This will also match the "unadorned" version (i.e. `lol` without any repetition). If you don't want this, use `\1+` instead of `\1*`.