Weighted disjunction in Perl regular expressions?

Posted 2024-10-15 08:25:10


I am fairly experienced with regular expressions, but I am having some difficulty with a current application involving disjunction.

My situation is this: I need to separate an address into its component parts based on a regular expression match on the "identifier elements" of the address. A comparable English example would be words like "state", "road", or "boulevard", if we wrote these out in our addresses. Imagine we have an address like the following, where (and this would never happen in English) we specify the identifier type after each name:

United States COUNTRY California STATE San Francisco CITY Mission STREET 345 NUMBER

(Where the words in CAPS are what I have called "identifiers").

We want to parse it into:
United States COUNTRY
California STATE
San Francisco CITY
Mission STREET
345 NUMBER

OK, this is certainly contrived for English, but here's the catch: I am working with Chinese data, where in fact this style of identifier specification happens all the time. An example below:

云南-省 ; 丽江-市 ; 古城-区 ; 西安-街 ; 杨春-巷
Yunnan-Province ; LiJiang-City ; GuCheng-District ; Xi'An-Street ; Yangchun-Alley

This is easy enough: a lazy match on the candidate identifier names, combined into a disjunctive list.

For China, the following are the "province-level" entities:

省 (Province) , 自治区 (Autonomous Region) , 市 (Municipality)

So my regex so far looks like this:

(.+?(?:(?:省)|(?:自治区)|(?:市)))

I have a series of these, in order to account for different portions of the address. The next level, corresponding to cities, for instance, is:

(.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))

So to match a province entity followed by a city entity:

(.+?(?:(?:省)|(?:自治区)|(?:市)))(.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))

With named capture groups:
(?<Province>.+?(?:(?:省)|(?:自治区)|(?:市)))(?<City>.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))

For the above, this yields:
$+{Province} = 云南省
$+{City} = 丽江市
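
A minimal sketch of the named-capture pattern as a runnable Perl script. It assumes the raw input is the unsegmented string 云南省丽江市古城区西安街杨春巷 (the example address with the hyphens and separators removed); the variable names are illustrative only.

```perl
use strict;
use warnings;
use utf8;                                  # the source contains UTF-8 literals
binmode(STDOUT, ':encoding(UTF-8)');

# Assumed raw form of the example address (no hand segmentation).
my $address = '云南省丽江市古城区西安街杨春巷';

# Province-level entity followed by city-level entity, both with lazy prefixes.
my $re = qr/(?<Province>.+?(?:省|自治区|市))(?<City>.+?(?:地区|自治州|市|盟))/;

my ($province, $city);
if ($address =~ $re) {
    # Copy the named captures out of %+ while the match is still in scope.
    ($province, $city) = ($+{Province}, $+{City});
    print "Province = $province\n";        # 云南省
    print "City     = $city\n";            # 丽江市
}
```

The inner `(?:...)` wrappers around each single identifier are dropped here because they do not change what the alternation matches.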

This is all good and well, and gets me pretty far. The problem, however, is when I try to account for identifiers that can be a substring of other identifiers. A common street-level entity, for instance, is "村委会", which means village organizing committee. In the set of addresses I wish to separate, not every address has this written out in full. In fact, I find "村委" and just plain "村" as well.

The problem? If I have a pure disjunction of these elements, we have the following:

(?<Street>.+?(?:(?:村委会)|(?:村委)|(?:村)))

What happens, though, is that if you have an entity 保定-村委会 (Baoding Village organizing committee), this lazy regex stops at 村 and calls it a day, orphaning our poor 委会 because 村 is one of the potential disjunctive elements.

Imagine an English equivalent like the following:
(?<Animal>.+?(?:(?:Cat)|(?:Elephant)|(?:CatElephant)|(?:City)))

We have two input strings:
1. "crap catelephant crap city", where we wanted "crap catelephant" and "crap city"
2. "crap catelephant city", where we wanted "crap cat" and "elephant city"

Ah, the solution, you say, is to make the pre-identifier capture greedy. But! There are entities that share the same identifier yet are not at the same level.

Take 市 for example. It means simply "city". But in China, there are county-level, province-level, and municipality-level cities. If this character occurs twice in the string, especially in two adjacent entities, a greedy search would incorrectly swallow both and tag the combined match as the first entity. As in the following:

广东-省 ; 江门-市 ; 开平-市 ; 三埠-区 ; 石海管-区
Guangdong-Province ; Jiangmen-City ; Kaiping-City ; Sanbu-District ; Shihaiguan-District

(Note that, as above, this has been hand-segmented. The raw data would simply be a string of concatenated characters.)

The match for a greedy search would be
江门市开平市

This is wrong, as the two adjacent entities should be separated into their constituent parts. One is at the level of a provincial city, the other is a county-level city.
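
A small sketch contrasting the lazy and greedy prefixes on this address. It assumes the raw data is the concatenated string 广东省江门市开平市三埠区石海管区, and it anchors on 省 for the province to keep the example short; both are assumptions for illustration.

```perl
use strict;
use warnings;
use utf8;                                  # the source contains UTF-8 literals
binmode(STDOUT, ':encoding(UTF-8)');

# Assumed raw form of the hand-segmented address.
my $address = '广东省江门市开平市三埠区石海管区';

# City-level identifiers from the question.
my $ids = qr/(?:地区|自治州|市|盟)/;

# Lazy prefix: stops at the first 市 after the province.
my ($lazy_city)   = $address =~ /^.+?省(.+?$ids)/;
# Greedy prefix: runs to the last 市 it can still match.
my ($greedy_city) = $address =~ /^.+?省(.+$ids)/;

print "lazy:   $lazy_city\n";              # 江门市
print "greedy: $greedy_city\n";            # 江门市开平市
```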

Back to the original point, and I thank you for reading this far: is there a way to put a weighting on disjunctive entities? I would want the regex to find the highest "weighted" identifier first. For example, 村委会 instead of plain 村, or "catelephant" instead of just "cat". In preliminary experiments, the regex engine apparently proceeds left to right through the alternatives when looking for a match. Is this a valid assumption to make? Should I put the most frequently occurring identifiers first in the disjunctive list?

If I have lost anyone with Chinese-related details, I apologize, and can clarify further if needed. The example really doesn't have to be Chinese. I think more generally it is a question about the mechanics of the regex disjunctive match: in what order does it prefer the disjunctive entities, and how does it decide when to "call it a day" in the context of a lazy search?

In a way, is there some sort of middle ground between lazy and greedy searches? Find the smallest bit you can find before the longest / highest weighted disjunctive entity? Be lazy, but put in that little bit of extra effort if you can for the sake of thoroughness?
(Incidentally, my work philosophy in college?)


花想c 2024-10-22 08:25:10

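How alternation is handled depends on the particular regex engine. In almost all engines (including Perl's), alternation is eager: the engine tries the leftmost alternative first and moves on to the next one only if that fails. For example, /(cat|catelephant)/ on its own will never match catelephant, because cat always succeeds first. The solution is to reorder the alternatives so that the most specific one comes first.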
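
The ordering effect can be checked with a short script. This is a sketch under stated assumptions: `build_alternation` is a hypothetical helper (not part of Perl or of the question) that sorts identifiers longest first, which is one simple way to make "most specific first" automatic when one identifier is a prefix of another.

```perl
use strict;
use warnings;
use utf8;                                  # the source contains UTF-8 literals
binmode(STDOUT, ':encoding(UTF-8)');

# Ordered alternation: the first alternative that matches wins.
my ($animal) = 'catelephant' =~ /(cat|catelephant)/;

my $entity = '保定村委会';

# Shortest identifier listed first: the match ends at 村.
my ($short_first) = $entity =~ /(.+?(?:村|村委|村委会))/;
# Longest identifier listed first: the full 村委会 is captured.
my ($long_first)  = $entity =~ /(.+?(?:村委会|村委|村))/;

# Hypothetical helper: build the alternation from an unordered list,
# longest identifier first, so none is shadowed by its own prefix.
sub build_alternation {
    my @ids = sort { length($b) <=> length($a) } @_;
    my $alt = join '|', map { quotemeta($_) } @ids;
    return qr/(?:$alt)/;
}

my $street_ids = build_alternation('村', '村委会', '村委');
my ($sorted) = $entity =~ /(.+?$street_ids)/;

print "$animal\n";                         # cat
print "$short_first\n";                    # 保定村
print "$long_first\n";                     # 保定村委会
print "$sorted\n";                         # 保定村委会
```

The lazy prefix still stops at the earliest position where any identifier can match. The order of the alternatives only decides which identifier is taken at that position, so a longest-first list keeps the prefix lazy while preferring the fuller identifier.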