Question about regular expressions and tokenization

Posted on 2024-09-18 13:39:10

I need to build a tokenizer that can recognize English words.

Currently, I'm stuck on characters that can be part of a URL.

For instance, if the characters ':', '?', and '=' are part of a URL, I shouldn't segment on them.

My question is: can this be expressed with a regex? I have the following regex

\b(?:(?:https?|ftp|file)://|www\.|ftp\.)
  (?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*
  (?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])

(found elsewhere)

but I don't know how to piece everything together so that, when these characters appear inside a match of the expression above, no spaces are inserted between them.

Help!
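
One way this could be pieced together, as a rough sketch rather than a definitive answer: put the URL regex first in an alternation, so a URL is consumed as a single token before any punctuation rule gets a chance to split it. The example below is in Python; the name `tokenize` and the fallback patterns for words, digits, and single symbols are assumptions made for illustration. The URL regex is the one quoted above, joined onto adjacent string literals (so the layout whitespace is dropped) and compiled case-insensitively, because its character classes only list `A-Z`.

```python
import re

# The URL regex from the question, split across adjacent string literals.
URL = (
    r"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)"
    r"(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*"
    r"(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])"
)

# Alternatives are tried left to right at each position, so the URL
# branch wins before ':' '?' '=' can be split off as separate tokens.
# The word, digit, and single-symbol branches are illustrative assumptions.
TOKEN = re.compile(
    "(?:" + URL + ")"              # a whole URL as one token
    r"|[A-Za-z]+(?:'[A-Za-z]+)?"   # an English word, optionally with an apostrophe part
    r"|\d+"                        # a run of digits
    r"|\S",                        # any other single non-space character
    re.IGNORECASE,
)

def tokenize(text):
    """Return the tokens of `text`, keeping each URL in one piece."""
    return [m.group(0) for m in TOKEN.finditer(text)]

if __name__ == "__main__":
    sample = "See http://example.com/a?b=c:d, ok?"
    print(" ".join(tokenize(sample)))
```

Joining the tokens with spaces gives the segmented text. Trailing punctuation such as the comma after the URL is still split off, because the last group of the URL regex does not allow `,`, `.`, `?`, `!`, or `:` as the final character.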
