Regex - Extracting volume and chapter numbers from book titles
Hey,
I'm trying to import some legacy data into a brand new system. It's almost done, but there's a huge problem! Assume data like this:
Blabla Vol.1 chapter 2
ABCD in the era of XYZ volume 2 First Chapter
A really useless book Eighth vol
Blala Sixth Vol Chapter 5
Lablah V6C7 2002
FooBar Vol6 C3 by Dr. Foo Bar
Regex: A tool in Hell V1 Eleventh Chapter
Confused! I tried to write a regex to extract the volume and chapter numbers, but you know, it's REGEX! Can anyone please guide me through this?
Comments (4)
Here is a regular expression that will match your examples:
You can live-edit the regex and/or add tests at this link:
I assumed that volumes always come before chapters, as in your examples.
In my opinion, it is best to break this into separate steps. In the first pass, you might convert the titles matching the pattern "/Vol\.[0-9]+\s+chapter\s[0-9]+$/i" (note the escaped dot, since a bare "." matches any character). In the second pass, you might convert the titles matching the pattern "/[a-z]+(th|nd|st|rd)\svol/i". And so on.
Trying to write one regular expression to capture all of these cases usually does not end well, and the result is almost always buggy. Here's an interesting article I found the other day detailing the perils of overly complex regexes.
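To make the multi-pass idea concrete, here is a rough Python sketch of the two passes described above. The function name `extract_in_passes`, the `ORDINALS` table, and the slightly loosened patterns (optional dot, no end-of-string anchor) are my own assumptions for illustration, not something from the original data or answer.

```python
import re

# Assumed lookup table: ordinal words mapped to numbers (extend as needed).
ORDINALS = {
    "first": 1, "second": 2, "third": 3, "fourth": 4, "fifth": 5,
    "sixth": 6, "seventh": 7, "eighth": 8, "ninth": 9, "tenth": 10,
    "eleventh": 11, "twelfth": 12,
}

# Pass 1: titles like "Vol.1 chapter 2" (dot escaped and optional).
NUMERIC_PASS = re.compile(r"vol\.?\s*(\d+)\s+chapter\s+(\d+)", re.IGNORECASE)
# Pass 2: titles like "Eighth vol" (an ordinal word followed by "vol").
ORDINAL_VOL_PASS = re.compile(r"\b([a-z]+(?:th|nd|st|rd))\s+vol", re.IGNORECASE)


def extract_in_passes(title):
    """Return (volume, chapter); either value is None if no pass found it."""
    m = NUMERIC_PASS.search(title)
    if m:
        return int(m.group(1)), int(m.group(2))
    m = ORDINAL_VOL_PASS.search(title)
    if m:
        volume = ORDINALS.get(m.group(1).lower())
        if volume is not None:
            return volume, None
    # Further passes (e.g. "V6C7") would be added here, one pattern at a time.
    return None, None
```

With only these two passes, "Blala Sixth Vol Chapter 5" yields the volume but not the chapter, which is the point of the approach: each remaining format gets its own small, testable pass instead of being squeezed into one pattern.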
As these titles are not "regular" at all, a single regular expression will be difficult to write. If there is a finite set of "ways" in which the chapter and volume are displayed, you could use multiple regular expressions to try to extract that information.
Or, if you can define some rules, such as "the chapter number is always in the format [chapter #]", that would also help!
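As a sketch of the "finite set of ways" idea, the Python below keeps one short list of patterns for volumes and another for chapters, and tries them in order. Everything here is an assumption based only on the seven sample titles: the keywords (volume, vol, v, chapter, c), the `ORDINALS` table, and the names `first_number` and `extract_volume_chapter` are hypothetical, and real data will likely need more rules.

```python
import re

# Assumed lookup table: ordinal words mapped to numbers (extend as needed).
ORDINALS = {
    "first": 1, "second": 2, "third": 3, "fourth": 4, "fifth": 5,
    "sixth": 6, "seventh": 7, "eighth": 8, "ninth": 9, "tenth": 10,
    "eleventh": 11, "twelfth": 12,
}
ORD = "|".join(ORDINALS)

# Each list is one "way" per pattern: keyword + digits, then ordinal + keyword.
# (?<![a-z]) keeps "c" or "v" inside a word from matching, but still allows
# a marker glued to a digit, as in "V6C7".
VOLUME_PATTERNS = [
    re.compile(r"(?<![a-z])(?:volume|vol|v)\.?\s*(\d+)", re.IGNORECASE),
    re.compile(rf"\b({ORD})\s+(?:volume|vol)\b", re.IGNORECASE),
]
CHAPTER_PATTERNS = [
    re.compile(r"(?<![a-z])(?:chapter|c)\.?\s*(\d+)", re.IGNORECASE),
    re.compile(rf"\b({ORD})\s+chapter\b", re.IGNORECASE),
]


def first_number(title, patterns):
    """Return the number captured by the first matching pattern, else None."""
    for pattern in patterns:
        m = pattern.search(title)
        if m:
            token = m.group(1).lower()
            return int(token) if token.isdigit() else ORDINALS[token]
    return None


def extract_volume_chapter(title):
    """Return (volume, chapter) for one title; missing parts are None."""
    return (first_number(title, VOLUME_PATTERNS),
            first_number(title, CHAPTER_PATTERNS))
```

For example, `extract_volume_chapter("Lablah V6C7 2002")` gives `(6, 7)` and ignores the trailing year, while a title with no chapter marker, such as "A really useless book Eighth vol", gives `(8, None)`. Handling volume and chapter separately also avoids assuming that the volume always comes first.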
If the output always has the same things on the same lines, the first thing I would do is explode("\n", $data) and work with the correct line. If it's consistent, you could then match for
or something.
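For anyone not using PHP, the split-by-line step might look roughly like this in Python, where `str.splitlines()` plays the role of explode("\n", $data). The `VOLUME` pattern and the `volumes_by_line` helper are placeholders I made up for the sketch; they are not the pattern from the post above, which did not survive.

```python
import re

# Assumed pattern: "v", "vol" or "volume", an optional dot, then digits.
VOLUME = re.compile(r"\bv(?:ol(?:ume)?)?\.?\s*(\d+)", re.IGNORECASE)


def volumes_by_line(data):
    """Split the raw text into lines and pair each line with its volume."""
    results = []
    for line in data.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        m = VOLUME.search(line)
        results.append((line, int(m.group(1)) if m else None))
    return results


SAMPLE = """Blabla Vol.1 chapter 2
Lablah V6C7 2002
FooBar Vol6 C3 by Dr. Foo Bar"""

for title, volume in volumes_by_line(SAMPLE):
    print(volume, title)
```

Working one line at a time keeps each match local to a single title, so a stray number on one line cannot be picked up as the volume of another.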
BTW, this page has always helped me with regex testing.
http://www.quanetic.com/Regex