Expanding [optional], grouping and the | or operator in text
I am trying to expand sentences that use [ ] to indicate optionals, ( ) to indicate grouping, and | to indicate the or operator, and to enumerate all possibilities. For example:
"Hey [there] you [hood]."
should return four sentences:
Hey there you hood.
Hey there you.
Hey you hood.
Hey you.
The end goal would look like:
Input: "(His|Her) dog was [very|extremely] confused."
Output: His dog was very confused.
His dog was extremely confused.
His dog was confused.
Her dog was very confused.
Her dog was extremely confused.
Her dog was confused.
I am doing it using regex matching and recursion. I have searched both CPAN and SO under the phrases:
Expanding text
expanding sentences
expanding conditionals
expanding optionals
expanding groupings
with no luck.
Thanks.
I have edited this question largely to better reflect its evolution and removed large portions which were made obsolete as the question evolved. The question above is the question that most of the answers below are attempting to address.
My current state is the following:
After wrestling with the problem above for a day I have two solutions very close to what I want. One is my own and the second is PLT's below. However, I have decided to try a fundamentally different approach.
Using regular expressions and manually parsing these sentences seems like a very ugly way of doing things. So I have decided to instead write a grammar for my "language" and use a parser-generator to parse it for me.
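To make the grammar idea concrete, here is a minimal sketch of a hand-written recursive-descent expander. It is written in Python purely for illustration and is not the parser-generator setup described above; the grammar in the comments and the names `expand` and `tidy` are assumptions made for this example.

```python
import re

def tidy(sentence):
    """Collapse runs of whitespace and drop the space left before punctuation."""
    return re.sub(r"\s+([.,!?;:])", r"\1", " ".join(sentence.split()))

def expand(template):
    """Return every sentence the template can produce."""
    # Tokens are either one of ( ) [ ] | or a run of ordinary text.
    tokens = re.findall(r"[()\[\]|]|[^()\[\]|]+", template)
    pos = 0

    def alternation():
        # alternation := sequence ('|' sequence)*
        nonlocal pos
        results = sequence()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            results += sequence()
        return results

    def sequence():
        # sequence := (text | '(' alternation ')' | '[' alternation ']')*
        nonlocal pos
        results = [""]
        while pos < len(tokens) and tokens[pos] not in ("|", ")", "]"):
            tok = tokens[pos]
            pos += 1
            if tok in ("(", "["):
                closer = ")" if tok == "(" else "]"
                choices = alternation()
                if pos >= len(tokens) or tokens[pos] != closer:
                    raise ValueError(f"expected {closer!r}")
                pos += 1
                if tok == "[":
                    choices.append("")  # an optional may also be left out
            else:
                choices = [tok]
            # Append every choice to every partial sentence built so far.
            results = [r + c for r in results for c in choices]
        return results

    out = alternation()
    if pos != len(tokens):
        raise ValueError(f"unexpected {tokens[pos]!r}")
    return [tidy(s) for s in out]

for line in expand("(His|Her) dog was [very|extremely] confused."):
    print(line)
```

Because each construct has its own rule, supporting a new construct later would mean adding one branch in `sequence` rather than reworking a regular expression.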
This gives me an additional layer of abstraction and avoids the following scenario described by Damian Conway in Perl Best Practices: [about regexps]
cut-and-paste-and-modify-slightly-and-oh-now-it-doesn't-work-at-all-so-let's-modify-it-some-more-and-see-if-that-helps-no-it-didn't-but-we're-committed-now-so-maybe-if-we-change-that-bit-instead-hmmmm-that's-closer-but-still-not-quite-right-maybe-if-I-made-that-third-repetition-non-greedy-instead-oops-now-it's-back-to-not-matching-at-all-perhaps-I-should-just-post-it-to-PerlMonks.org-and-see-if-they-know-what's-wrong
It also makes it much easier if the grammar of these expressions were to change and I needed to support other constructs later on.
Last update:
I solved my problem using an open source toolkit. It takes a JSGF version of my input and generates a finite-state transducer (FST). From there you can walk through the FST to generate all possible outcomes.
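As a rough illustration of what walking such a graph looks like, the sketch below enumerates every path through a small hand-built, acyclic state graph. It is Python for illustration only and does not use the toolkit's API or output format: the `ARCS` table, the state numbers and the `walk` function are all hypothetical, and an empty label stands in for an epsilon (skip) arc.

```python
def walk(arcs, state, finals, prefix=()):
    """Yield the sentence spelled out by each path from `state` to a final state.

    Assumes the graph has no cycles, so the recursion always terminates.
    """
    if state in finals:
        yield " ".join(word for word in prefix if word)
    for label, next_state in arcs.get(state, []):
        yield from walk(arcs, next_state, finals, prefix + (label,))

# Hypothetical graph for "(His|Her) dog was [very|extremely] confused."
ARCS = {
    0: [("His", 1), ("Her", 1)],
    1: [("dog", 2)],
    2: [("was", 3)],
    3: [("very", 4), ("extremely", 4), ("", 4)],  # "" = skip the optional word
    4: [("confused.", 5)],
}

sentences = list(walk(ARCS, 0, {5}))
for sentence in sentences:
    print(sentence)
```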
Comments (3)
OK, another complete revision of the answer. This will work as intended. :) It now also expands nested parens. Newline is still the delimiter, but I added a way to quickly change it to something more complicated if the need arises.
Basically, I started by replacing brackets with parens + pipe, since [word ] and (|word ) are equivalent. I then extracted all the encapsulating parens, e.g. both (you |my friend) and (you |my (|friendly ) friend ). I then expanded the nested parens into regular parens, e.g. (you |my (|friendly ) friend ) was replaced with (you |my friendly friend |my friend ). With that done, the words could be processed with the original subroutine.
Remains to be tested on more complicated expansions, but it works fine during my testing.
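The sketch below is not the answerer's code. It is a simplified Python variant of the same idea, offered only as an illustration: brackets are first rewritten as parens plus an empty alternative, then the first outermost group in each pending string is replaced by each of its alternatives until no parens remain. The function names are hypothetical, and the empty alternative is placed last, as `(word|)`, rather than first.

```python
def split_top_level(body):
    """Split on '|' characters that are not inside nested parens."""
    parts, current, depth = [], [], 0
    for ch in body:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "|" and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts

def expand_groups(template):
    # [word] and (word|) are equivalent, so normalise everything to parens.
    pending = [template.replace("[", "(").replace("]", "|)")]
    done = []
    while pending:
        s = pending.pop(0)
        open_at = s.find("(")
        if open_at == -1:
            done.append(" ".join(s.split()))  # fully expanded; tidy the spacing
            continue
        # Find the paren that closes the group opened at open_at.
        depth, close_at = 0, None
        for i in range(open_at, len(s)):
            if s[i] == "(":
                depth += 1
            elif s[i] == ")":
                depth -= 1
                if depth == 0:
                    close_at = i
                    break
        if close_at is None:
            raise ValueError("unbalanced parentheses")
        # Substitute each alternative; nested groups surface on a later pass.
        for alt in split_top_level(s[open_at + 1:close_at]):
            pending.append(s[:open_at] + alt + s[close_at + 1:])
    return done

print(expand_groups("Hello (you|my [old ]friend)"))
```

Expanding the outermost group first keeps a nested group attached to its own alternative, so sibling alternatives are not repeated in the result.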
Here's the revised code:
Will produce the output:
Data::Generate. I found this while searching for "combination", which is the mathematical term for what you're doing with your sets of words there.
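To show the combinatorial view without relying on any particular module, here is a small Python sketch (not Data::Generate's interface) that treats the sentence as a sequence of slots and takes their Cartesian product. The `slots` list is an assumed, hand-written breakdown of the example sentence, with an empty string marking an optional word that is left out.

```python
from itertools import product

# Each slot lists the alternatives for one position in the sentence.
slots = [
    ["His", "Her"],
    ["dog was"],
    ["very", "extremely", ""],  # "" = optional word omitted
    ["confused."],
]

# One sentence per combination, skipping the empty choices when joining.
sentences = [" ".join(word for word in combo if word) for combo in product(*slots)]

for sentence in sentences:
    print(sentence)
```

This covers flat templates only; nested groups would first have to be flattened into slots.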
Here is a rather simple solution, if you can get past some ugly regexps, which are due to collisions between your syntax and the regexp syntax. It allows both the [] and the () syntax, which are in fact very similar: [foo] is the same as (foo| ).
The basis is to replace each alternation with a marker #0, #1, #2... while storing the alternations in an array. Then replace the last marker, generating several phrases, then replace the next-to-last marker in each of those phrases, and so on until all markers have been replaced. Attentive readers of Higher-Order Perl will no doubt find a more elegant way to do this.
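The marker technique described above can be sketched roughly as follows. This is a Python illustration rather than the answerer's Perl, and it makes two assumptions: groups are not nested, and the input text contains no literal "#" followed by digits. The name `expand_with_markers` is hypothetical.

```python
import re

def expand_with_markers(template):
    stored = []  # stored[i] holds the alternatives behind marker #i

    def stash(match):
        if match.group(1) is not None:
            body = match.group(1)        # (a|b)
        else:
            body = match.group(2) + "|"  # [a] is the same as (a|)
        stored.append(body.split("|"))
        return f"#{len(stored) - 1}"

    # Replace every non-nested ( ) or [ ] group with a numbered marker.
    marked = re.sub(r"\(([^()\[\]]*)\)|\[([^()\[\]]*)\]", stash, template)

    # Substitute from the last marker down to #0, so that replacing #1
    # can never clobber part of a later marker such as #10.
    phrases = [marked]
    for index in reversed(range(len(stored))):
        phrases = [
            phrase.replace(f"#{index}", alt)
            for phrase in phrases
            for alt in stored[index]
        ]

    # Tidy the spacing left behind by omitted optionals.
    return [re.sub(r"\s+([.,!?])", r"\1", " ".join(p.split())) for p in phrases]

for line in expand_with_markers("(His|Her) dog was [very|extremely] confused."):
    print(line)
```

The sentences come out in a different order than in the question, because the last group is expanded first.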