How to tokenize a word given tokens that are not fully contained in the word?
I understand how to use regex in Perl in the following way:
$str =~ s/expression/replacement/g;
I understand that if any part of the expression is enclosed in parentheses, it can be used and captured in the replacement part, like this:
$str =~ s/(a)/($1)dosomething/;
But is there a way to capture the ($1) above outside of the regular expression?
I have a full word which is a string of consonants, e.g. bEdmA, its vowelized version baEodamaA (where a and o are vowels), as well as its split-up form of two tokens separated by a space, bEd mA. I want to pick up just the vowelized form of the tokens from the full word, like so: baEoda, maA. I'm trying to capture the token within the full word expression, so I have:
my $unvowelizedword = "bEdmA";
my @tokens = ("bEd", "mA");
my $vowelizedword = "baEodamaA";
foreach my $t (@tokens) {
    # find the token within the full word, and capture its vowels
}
I'm trying to do something like this:
$vowelizedword = m/($t)/;
This is completely wrong for two reasons: the token $t is not present in exactly its own form, such as bEd, but something like m/b.E.d/ would be more relevant. Also, how do I capture it in a variable outside the regular expression?
The real question is: how can I capture the vowelized sequences baEoda and maA, given the tokens bEd and mA, from the full word baEodamaA?
Edit
I realized from all the answers that I missed out two important details.
- Vowels are optional. So if the tokens are "Al" and "ywm", and the fully vowelized word is "Alyawmi", then the output tokens would be "Al" and "yawmi".
- I only mentioned two vowels, but there are more, including symbols made up of two characters, like '~a'. The full list (although I don't think I need to mention it here) is:
@vowels = ('a', 'i', 'u', 'o', '~', '~a', '~i', '~u', 'N', 'F', 'K', '~N', '~K');
Comments (5)
The following seems to do what you want:
Update as per your updated question (vowels are optional). It works from the end of the string so you'll have to gather the tokens into an array and print them in reverse:
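The listing that belonged to this answer is not reproduced here. As a minimal sketch of the approach it describes (working from the end of the string, collecting tokens, then reversing them), something like the following could work. The variable names and the compact vowel pattern are assumptions made for illustration, not the answerer's code.

```perl
use strict;
use warnings;

# Assumption: a compact pattern covering the question's vowel list.
# '~?[aiuNK]' is tried first so that '~a' is preferred over a bare '~'.
my $vowel  = qr/(?:~?[aiuNK]|[oF~])/;
my @tokens = ('Al', 'ywm');
my $rest   = 'Alyawmi';

my @out;
for my $token (reverse @tokens) {
    # Each character of the token may be followed by one optional vowel.
    my $body = join '', map { quotemeta($_) . "$vowel?" } split //, $token;

    # Strip the vowelized token off the end of what is left of the word.
    $rest =~ s/($body)\z// or die "token '$token' not found at end of '$rest'";
    push @out, $1;
}

# Tokens were collected back to front, so reverse them for output.
my @vowelized = reverse @out;
print "@vowelized\n";
```

Anchoring each substitution with `\z` means every token is matched against the current end of the string, which is why the results come out in reverse order.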
Use the m// operator in so-called "list context", like this:
my @tokens = ($input =~ m/capturing_regex_here/modifiershere);
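To make the list-context idea concrete, here is a small sketch using the word from the question. The pattern is hand-written for this one example and is only an illustration. The point is that the captured groups land in ordinary variables that remain usable after the match.

```perl
use strict;
use warnings;

my $input = 'baEodamaA';

# In list context, m// returns the contents of the capture groups,
# so they can be assigned straight to variables.
my ($first, $second) = $input =~ m/\A(ba?Eo?da?)(ma?A)\z/;

print "$first $second\n";
```

If the match fails, the list is empty and both variables are left undefined, so the result should be checked before it is used.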
ETA: From what I understand now, what you were trying to say is that you want to match an optional vowel after each character of the tokens.
With this, you can tweak the $vowels variable to only contain the letters you seek. Optionally, you may also just use . to capture any character.
Output:
Note that
does not require capturing groups in the regex.
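The code for this answer is missing from the page. A rough sketch of the idea it describes, an optional vowel after each character of each token, might look like the following. The helper name `vowelized_tokens` is hypothetical, and the way the pattern is built here (one capture group per token) is an assumption rather than the answerer's original code.

```perl
use strict;
use warnings;

my @vowels = ('a', 'i', 'u', 'o', '~', '~a', '~i', '~u', 'N', 'F', 'K', '~N', '~K');

# Longest alternatives first, so that '~a' is tried before '~'.
my $vowels = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @vowels;

# Hypothetical helper: returns the vowelized form of each token,
# or an empty list if the tokens do not account for the whole word.
sub vowelized_tokens {
    my ($word, @tokens) = @_;

    # One capture group per token; every character of the token
    # may be followed by a single optional vowel.
    my $pattern = join '', map {
        '(' . join('', map { quotemeta($_) . "(?:$vowels)?" } split //, $_) . ')'
    } @tokens;

    return $word =~ /\A$pattern\z/;
}

print join(' ', vowelized_tokens('baEodamaA', 'bEd', 'mA')), "\n";
print join(' ', vowelized_tokens('Alyawmi',   'Al',  'ywm')), "\n";
```

Replacing `$vowels` with `.` would allow any single character after each consonant, as the answer suggests, at the cost of accepting non-vowel characters too.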
I suspect that there is an easier way to do whatever you're trying to accomplish. The trick is not to make the regex generation code so tricky that you forget what it's actually doing.
I can only begin to guess at your task, but from your single example, it looks like you want to check that the two subtokens are in the larger token, ignoring certain characters. I'm going to guess that those sub tokens have to be in order and can't have anything else between them besides those vowel characters.
To match the tokens, I can use the \G anchor with the /g global flag in scalar context. This anchors the match to the character one after the end of the last match for the same scalar. This way allows me to have separate patterns for each sub token. This is much easier to manage since I only need to change the list of values in @subtokens.
Once you go through each of the pairs and find which ones match all the patterns, I can extract the original string from the pair.
Now, here's the nice thing about this structure. I've probably guessed wrong about your task. If I have, it's easy to fix without changing the setup. Let's say that the subtokens don't have to be in order. That's an easy change to the pattern I created. I just get rid of the \G anchor and the /g flag.
Or, suppose that the tokens have to be in order, but other things may be between them. I can insert a .*? to match that stuff, effectively skipping over it.
It would be much nicer if I could manage all of this from the map where I create the patterns, but the /g flag isn't a pattern flag. It has to go with the operator.
I find it much easier to manage changing requirements when I don't wrap everything in a single regular expression.
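The listings for this answer did not survive on the page. The following is a minimal sketch of the \G technique described above, assuming only the two vowels from the question's example and a single word rather than a list of pairs. The variable names other than @subtokens are made up for illustration.

```perl
use strict;
use warnings;

my $vowel     = qr/[ao]/;     # assumption: only the two vowels from the example
my @subtokens = qw(bEd mA);
my $word      = 'baEodamaA';

# One pattern per subtoken; \G pins each match to where the previous one ended.
my @patterns = map {
    my $body = join '', map { quotemeta($_) . "$vowel?" } split //, $_;
    qr/\G($body)/;
} @subtokens;

my @found;
pos($word) = 0;
for my $pattern (@patterns) {
    # Scalar context plus /g: a successful match advances pos($word).
    if ( $word =~ /$pattern/g ) {
        push @found, $1;
    }
    else {
        @found = ();
        last;
    }
}

# Accept the result only if the patterns consumed the whole word.
@found = () unless @found && pos($word) == length $word;

print "@found\n";
```

Because each subtoken has its own pattern, relaxing the rules is a local change: dropping `\G` and `/g` removes the ordering requirement, and inserting `.*?` after `\G` allows other material between tokens.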
Assuming the tokens need to appear in order and without anything (other than a vowel) between them: