Parsing natural language music citations with regular expressions
I am struggling with nailing down a fairly complex regular expression to parse song titles with optional artist attribution from loosely-typed English. The user input comes from a single text field and the regex matches will be used to query a song database to get unique track IDs. I need to be able to get these matches:
\1 = song title
\2 = artist
while being fairly liberal in allowed formats.
Examples
The word "by" should split the string into song title and artist (but only on word boundaries), as should a comma with or without trailing whitespace:
baby one more time by britney spears
baby one more time, britney spears
baby one more time,britney spears
\1 = baby one more time
\2 = britney spears
False positives like these are acceptable:
down by the bay
\1 = down
\2 = the bay
whatever people say i am, that's what i'm not
\1 = whatever people say i am
\2 = that's what i'm not
…assuming quotes can be used to mark a run of text as a song title explicitly:
"down by the bay"
\1 = down by the bay
\2 not matched
"whatever people say i am, that's what i'm not" by arctic monkeys
\1 = whatever people say i am, that's what i'm not
\2 = arctic monkeys
Single quotes should work too, but obviously not if they appear within the title:
'whatever people say i am, that's what i'm not'
\1 = whatever people say i am, that
\2 = s what i'm not'
Additionally, if quotes are in use, the word "by" or a comma is optional:
"down by the bay" raffi
\1 = down by the bay
\2 = raffi
However, if there are no quotes, and more than one "by", then only the last "by" should be used as a delimiter:
down by the bay by raffi
\1 = down by the bay
\2 = raffi
Is this even possible with a single regex? Or would the more sane way be to split it up into multiple expressions? Either way, what might this look like?
Here is an example, using C#:
Output matches your specification, as far as I can tell:
You can actually make it better for the single-quote case by allowing apostrophes inside words:
Which fixes this case:
Here's a commented version of the regex, which explains what each part does (it should be used with RegexOptions.ExplicitCapture|RegexOptions.IgnorePatternWhitespace):
Edit:
I've played around a bit with the PHP code, but I can't get it to use named capturing groups properly. Here is a version using unnamed capturing groups:
The title will be in group 1, 2, or 3, and the artist in group 4.
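As a rough illustration of the single-pattern approach, here is a minimal Python sketch. It is not the C# or PHP code referred to above; the pattern, the name `CITATION` and the helper `parse_citation` are assumptions made for this example. It keeps the same group layout (title in group 1, 2 or 3, artist in group 4) and includes the apostrophe-inside-words allowance for single-quoted titles.

```python
import re

# Hypothetical pattern for illustration only. The title lands in group 1, 2 or 3
# and the artist in group 4.
CITATION = re.compile(
    r"""
    ^\s*
    (?:
        "([^"]+)"                      \s* (?: ,\s* | by\s+ )?  # 1: "title", delimiter optional
      | '((?:[^']|(?<=\w)'(?=\w))+)'   \s* (?: ,\s* | by\s+ )?  # 2: 'title', apostrophes inside words allowed
      | (.+) (?: \s+by\s+ | \s*,\s* )                           # 3: bare title; greedy, so the last delimiter wins
    )
    (\S.*?)?                                                    # 4: artist, if any
    \s*$
    """,
    re.VERBOSE | re.IGNORECASE,
)


def parse_citation(text):
    """Return (title, artist); artist is None when no artist was found."""
    m = CITATION.match(text)
    if not m:
        # No quotes and no delimiter: treat the whole input as the title.
        return text.strip(), None
    title = m.group(1) or m.group(2) or m.group(3)
    artist = m.group(4)
    return title.strip(), (artist.strip() if artist else None)


if __name__ == "__main__":
    for sample in (
        "baby one more time,britney spears",
        '"down by the bay" raffi',
        "down by the bay by raffi",
    ):
        print(sample, "->", parse_citation(sample))
```

Because the bare-title branch is greedy, an unquoted input containing both "by" and a comma is split at whichever delimiter comes last, which may or may not be what you want.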
Based on the examples you've posted, I certainly wouldn't try to write a single regex for all cases unless there was some compelling reason to do so. Such an expression, which I do imagine is possible, would be very brittle and would likely be a hassle to maintain.
Sounds like you just have some simple rule-based processing, which I would treat as such. You could have each individual rule be a regex, store the rules in whatever order you like, and then, as you get more experience with the input, try to figure out whether there is a better order, perhaps based on the percentage of inputs that were parsed the way you would like.
Just iteratively try to refine your rules. You might start to notice more complex patterns, and you could expand your rule classes to take multiple steps into account for one rule. For example, you may notice that a particular rule is failing, but that adding one additional check to that rule would weed out most of the failures.
As for each regex, I think simplest is probably best, and none of the individual rules would likely need to be that complicated, especially at first. Regular expressions are very powerful tools, but I wouldn't focus too much on trying to shoehorn something like parsing natural language into a tool that is better suited to parsing well-defined formal languages. (Thus, the "regular" part.)
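A minimal Python sketch of such an ordered rule list follows. The rule names, the patterns and the `apply_rules` helper are all hypothetical, chosen only to cover the examples in the question; the first rule that matches wins, so reordering the list changes the behaviour.

```python
import re

# Hypothetical rules, tried in order. Each pattern captures (title, artist).
RULES = [
    # "title" optionally followed by a comma or "by", then an optional artist.
    ("quoted", re.compile(r'^\s*"([^"]+)"\s*(?:,|by\b)?\s*(.*?)\s*$', re.IGNORECASE)),
    # Greedy title, so the LAST " by " is the delimiter.
    ("last by", re.compile(r'^\s*(.+)\s+by\s+(.+?)\s*$', re.IGNORECASE)),
    # Lazy title, so the FIRST comma is the delimiter.
    ("comma", re.compile(r'^\s*(.+?)\s*,\s*(.+?)\s*$')),
]


def apply_rules(text):
    """Return (rule_name, title, artist) for the first rule that matches."""
    for name, pattern in RULES:
        m = pattern.match(text)
        if m:
            return name, m.group(1).strip(), (m.group(2).strip() or None)
    # No rule matched: fall back to treating everything as the title.
    return "title only", text.strip(), None


if __name__ == "__main__":
    for sample in ('"down by the bay" raffi', "down by the bay by raffi", "hello"):
        print(sample, "->", apply_rules(sample))
```

Returning the rule name alongside the result makes it easy to log which rule fired for each input, which is one way to gather the statistics needed to decide on a better order.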
One more idea, off the top of my head: you might find that in certain cases running some sort of conformance (normalization) pass on the input text makes the processing easier, for instance by reducing the number of cases you have to process. To use a (possibly good or bad) example from the provided examples: instead of having a rule to process X by Y, a rule to process X, Y and a rule to process "X" Y, you could run a filter that replaces by[space] with a comma, one that replaces ,[space] with a comma, and one that replaces "X"[space] with X followed by a comma. Then at the end you're only left with X,Y, which means you only have to process the one case. This is likely too simplistic an example to be useful, but it's a good pattern to look for; sometimes conformance can greatly simplify this kind of processing.
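The conformance idea could be sketched in Python roughly as follows. The function names `normalize` and `split_normalized` are hypothetical, and two choices here are assumptions rather than part of the suggestion: only the last "by" is rewritten (to match the question's requirement), and the canonical X,Y form is split on its last comma so that a quoted title may itself contain commas.

```python
import re


def normalize(text):
    """Rewrite the supported input shapes into the single form 'title,artist'."""
    text = text.strip()
    # "X" Y  ->  X,Y  (an optional comma or "by" after the quotes is absorbed)
    m = re.match(r'^"([^"]+)"\s*(?:,|by\b)?\s*(.*)$', text, re.IGNORECASE)
    if m:
        return f"{m.group(1)},{m.group(2)}"
    # X by Y  ->  X,Y  (greedy match, so only the last " by " is replaced)
    text = re.sub(r'^(.+)\s+by\s+', r'\1,', text, count=1, flags=re.IGNORECASE)
    # X, Y  ->  X,Y
    return re.sub(r'\s*,\s*', ',', text)


def split_normalized(normalized):
    """Split the canonical form on its last comma; artist may be None."""
    title, sep, artist = normalized.rpartition(",")
    if not sep:
        return normalized, None
    return title, artist or None


if __name__ == "__main__":
    for sample in ("down by the bay by raffi", '"down by the bay" raffi'):
        print(sample, "->", split_normalized(normalize(sample)))
```

After normalization only one shape remains to be parsed, at the cost of garbling unquoted inputs that contain several delimiters.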
I would go a more statistical/spam-filter way and reduce the natural language to an array of words, then measure the distance among the words that compose the title and the artist's name.
In regexp terms this may mean transforming every normal word (\w+) into a single - and every word in the title and author into a !. But that's just a fancy way to visualize word runs.
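To make the word-run picture concrete, here is a small Python sketch. It assumes you already have a set of words known to occur in titles and artist names (the `known_words` argument, which in practice might come from the song database); the function names and the sample vocabulary are hypothetical.

```python
import re


def mask_words(text, known_words):
    """Return the lowercase words of text and a mask with one '!' or '-' per word."""
    words = re.findall(r"\w+", text.lower())
    mask = "".join("!" if word in known_words else "-" for word in words)
    return words, mask


def candidate_span(words, mask):
    """Return the words from the first '!' to the last '!', or None if there are none."""
    first, last = mask.find("!"), mask.rfind("!")
    if first == -1:
        return None
    return " ".join(words[first:last + 1])


if __name__ == "__main__":
    # Hypothetical vocabulary; a real one would be built from titles and artists.
    known = {"down", "bay", "raffi"}
    words, mask = mask_words("play down by the bay by raffi please", known)
    print(mask)                         # -!--!-!-
    print(candidate_span(words, mask))  # down by the bay by raffi
```

The distance between the `!` marks in the mask is what a statistical approach would then score; this sketch only takes the outermost span.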