Matching subtitle lines and timestamps of .srt files with a regular expression
As the title states, I want to match the timestamps and text lines of .srt subtitle files.
Some of these files are not formatted properly, so I need something that works with almost all of them.
The correct formatting of a file looks like this:
1
00:00:02,160 --> 00:00:04,994
You really don't remember
what happened last year?

2
00:00:06,440 --> 00:00:07,920
- School. Now.
- I dropped out.

3
00:00:08,120 --> 00:00:10,510
- Get your diploma, I'll get mine.
- What you doing?

4
00:00:10,680 --> 00:00:13,514
- Studying.
- You taking your GED? All right, Fi.
The regex pattern that I came up with works very well for this kind of file.
As I said, some of the files are not formatted properly: some don't have the line number, and some don't have an empty line after each subtitle block. The regex that I came up with does not work properly for those.
There are other questions like this that have already been answered, but I want to match each timestamp and the text lines in separate capture groups. So my groups for the first subtitle of the example above would be something like this:
group 1: 00:00:02,160
group 2: 00:00:04,994
group 3: You really don't remember\nwhat happened last year?
This is what I've got so far:
LINE_RE = (
# group 1:
r"^\s*(\d+:\d+:\d+,\d+)" # line starts with optional whitespace,
# followed by a timestamp like 00:00:00,000
r"(?:\s*-{2,3}>\s*)" # non-capturing group for ' --> ':
# two or three '-' followed by a '>'
# group 2:
r"(\d+:\d+:\d+,\d+)\s*\n" # timestamp again,
# ended with optional whitespace and a \n
# group 3:
r"([\s\S]*?(?:^\s*$|\d+:\d+:\d+,\d+|^\s*\d+\s*\n))"
# lazily matches any characters until it hits an empty line, a timestamp, or a line with only a number in it
)
I think my exact problem is in the last non-capturing group: it does not work properly when the next line is not an empty line.
This is an example file; I did some mangling in the file so I could show the problem better.
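The failure described above can be reproduced with a short sketch. It assumes the pattern is compiled with re.MULTILINE (the question does not say which flags are used), and the MANGLED input is a made-up sample without index numbers or blank lines, not the linked example file. Because the terminating alternatives sit inside group 3 and are consumed rather than only checked, the next timestamp ends up in the captured text and the following subtitle is never matched.

```python
import re

# LINE_RE as posted in the question; re.MULTILINE is an assumption,
# needed so that ^ and $ can match at line boundaries.
LINE_RE = re.compile(
    r"^\s*(\d+:\d+:\d+,\d+)"
    r"(?:\s*-{2,3}>\s*)"
    r"(\d+:\d+:\d+,\d+)\s*\n"
    r"([\s\S]*?(?:^\s*$|\d+:\d+:\d+,\d+|^\s*\d+\s*\n))",
    re.MULTILINE,
)

# Hypothetical mangled input: no index numbers, no blank line between blocks.
MANGLED = (
    "00:00:02,160 --> 00:00:04,994\n"
    "You really don't remember\n"
    "00:00:06,440 --> 00:00:07,920\n"
    "- School. Now.\n"
)

matches = list(LINE_RE.finditer(MANGLED))
for m in matches:
    # Group 3 swallows the start timestamp of the next block.
    print(m.groups())
```

With this input only one match is found: its third group ends with the start timestamp of the second block, so the search resumes in the middle of that timestamp line and the second block is skipped.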
Comments (1)
In that case, you can match the lines that start with a timestamp-like pattern, and capture all following lines up to either another timestamp-like pattern or one or more newlines followed by a line of only digits.
The pattern in parts matches:
^                          Start of string
\s*                        Match optional whitespace chars
(\d+:\d+:\d+,\d+)          Capture group 1, match a timestamp-like pattern
[^\S\n]+-->[^\S\n]+        Match --> between 1 or more spaces
(\d+:\d+:\d+,\d+)          Capture group 2, same pattern as for group 1
(                          Capture group 3
(?:                        Non capture group
\n                         Match a newline
(?!                        Negative lookahead, assert what is to the right is not
\d+:\d+:\d+,\d+\b|\n+\d+$  Match either a timestamp or 1+ newlines followed by only digits
)                          Close lookahead
.*                         Match the whole line
)*                         Close the non capture group and optionally repeat it
)                          Close group 3
See a regex demo.
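Put together, the parts listed above can be tried out in Python as in the sketch below. Several things here are assumptions rather than part of the answer: the re.MULTILINE flag (so that ^ and $ also match at line boundaries), the names SRT_RE, parse_cues and SAMPLE, the made-up sample text, and the .strip() call, which is added because group 3 starts with the newline that follows the timestamp line.

```python
import re

# The pattern assembled from the parts above; re.MULTILINE is an assumption.
SRT_RE = re.compile(
    r"^\s*(\d+:\d+:\d+,\d+)"                      # group 1: start timestamp
    r"[^\S\n]+-->[^\S\n]+"                        # ' --> ' without crossing lines
    r"(\d+:\d+:\d+,\d+)"                          # group 2: end timestamp
    r"((?:\n(?!\d+:\d+:\d+,\d+\b|\n+\d+$).*)*)",  # group 3: the text lines
    re.MULTILINE,
)


def parse_cues(text):
    """Return (start, end, text) tuples; parse_cues is a hypothetical helper."""
    return [
        (m.group(1), m.group(2), m.group(3).strip())
        for m in SRT_RE.finditer(text)
    ]


# Made-up sample: the third block has no index number and no blank line before it.
SAMPLE = (
    "1\n"
    "00:00:02,160 --> 00:00:04,994\n"
    "You really don't remember\n"
    "what happened last year?\n"
    "\n"
    "2\n"
    "00:00:06,440 --> 00:00:07,920\n"
    "- School. Now.\n"
    "- I dropped out.\n"
    "00:00:08,120 --> 00:00:10,510\n"
    "- Get your diploma, I'll get mine.\n"
    "- What you doing?\n"
)

cues = parse_cues(SAMPLE)
for start, end, text in cues:
    print(start, end, repr(text))
```

Because the next timestamp and the index line are only checked in a lookahead and never consumed, each block stays available for the following match, whether or not a blank line or an index number separates the blocks.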