A question about regular expressions and tokenization
I need to make a tokenizer that is able to recognize English words.
Currently, I'm stuck on characters that can be part of a URL.
For instance, if the characters ':', '?', '=' are part of a URL, I shouldn't segment them.
My question is: can this be expressed in a regex? I have the regex
\b(?:(?:https?|ftp|file)://|www\.|ftp\.)
(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*
(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])
from here
but I don't know how to piece everything together so that, when these characters appear inside a match of the expression above, no spaces are inserted between them.
Help!
Comments (1)
I would approach this problem by doing a sweep with a different regexp, putting the hits into an array, removing those hits from the string, and then running your tokenizer as normal.
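
As a rough illustration of that sweep, here is a minimal Python sketch. It makes several assumptions: the three regex lines from the question are joined into one pattern and compiled case-insensitively (the pattern only lists A-Z), `TOKEN_RE` is a hypothetical stand-in for your real tokenizer, and each hit is swapped for a NUL placeholder rather than deleted outright, so the URL can be put back in its original position afterwards.

```python
import re

# The URL pattern from the question, joined into one expression.
# It only lists A-Z, so it is compiled with IGNORECASE (an assumption).
URL_RE = re.compile(
    r"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)"
    r"(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*"
    r"(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])",
    re.IGNORECASE,
)

# Hypothetical stand-in for "your tokenizer":
# a run of word characters, or a single non-space symbol.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

# Placeholder left where a URL was; assumes NUL never occurs in the input.
PLACEHOLDER = "\x00"


def tokenize(text):
    hits = []  # URLs found by the sweep, in order of appearance

    def stash(match):
        # Record the URL and leave a placeholder in its place.
        hits.append(match.group(0))
        return " " + PLACEHOLDER + " "

    stripped = URL_RE.sub(stash, text)

    # Tokenize as normal, then swap each placeholder back for its URL.
    urls = iter(hits)
    return [
        next(urls) if tok == PLACEHOLDER else tok
        for tok in TOKEN_RE.findall(stripped)
    ]


if __name__ == "__main__":
    print(tokenize("See http://example.com/a?b=c:d, ok?"))
```

Outside a URL, ':' and '?' are still split off as separate tokens; inside a match they stay together because the tokenizer never sees them.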