Parsing text with R
I have some txt files which were originally srt files, the format in which subtitles are published.
The pattern they usually follow is:
Subtitle_number
Beginning_min --> Ending_min
Text
As an example, this might be the structure of an srt file:
1
00:00:00,100 --> 00:00:01,500
This is the first subtitle
2
00:00:01,700 --> 00:00:02,300
of the movie
Now, I have some "modified" srt files, which differ from normal ones in that the character's name appears right after the subtitle number. Here is an example:
1 Matt
00:00:00,100 --> 00:00:01,500
This is said by Matt
2 Lucas
00:00:01,700 --> 00:00:02,300
While this is said by Lucas
What I would like to do is parse these files in order to create a data.frame like the following:
+---------------------------------------------+
| CHARACTER | TEXT |
|--------------+------------------------------|
| Matt | This is said by Matt |
|--------------+------------------------------|
| Lucas | While this is said by Lucas |
+---------------------------------------------+
So I want neither the subtitle number nor the timestamps.
I have been able to read the text with the readtext library, resulting in something like this:
1 Matt\n00:00:00,100 --> 00:00:01,500\nThis is said by Matt.\n\n2 Lucas\n00:00:01,700 --> 00:00:02,300\nWhile this is said by Lucas
Note that the texts themselves may also contain \n, as well as any other (readable) character.
This is where I am stuck: I guess I would have to use some kind of regex to extract all names and then all texts, but I have no clue how to do this.
Any help is highly appreciated!
Comments (2)
Here is a step-by-step way to do this without regex. It's a bit sloppy, but it is meant to show the logic of how to approach a file like this. The end result is a data frame from which you can grab the info you want.
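A minimal sketch of what such a regex-free approach could look like in base R (not necessarily the code originally posted). It assumes the whole file is held in one string, as in the question, that subtitle blocks are separated by a blank line, and that no subtitle text contains a blank line itself. The names txt, blocks, parse_block and result are made up for illustration.

```r
# Sample input in the shape shown in the question (one string, "\n" line breaks).
txt <- "1 Matt\n00:00:00,100 --> 00:00:01,500\nThis is said by Matt.\n\n2 Lucas\n00:00:01,700 --> 00:00:02,300\nWhile this is said by Lucas"

# Split into subtitle blocks on blank lines; fixed = TRUE disables regex matching.
# Assumption: a blank line never occurs inside a subtitle text.
blocks <- strsplit(txt, "\n\n", fixed = TRUE)[[1]]

# Hypothetical helper: turn one block into a one-row data frame.
parse_block <- function(block) {
  lines <- strsplit(block, "\n", fixed = TRUE)[[1]]
  # Line 1 is "<number> <name>": drop the first space-separated token.
  header <- strsplit(lines[1], " ", fixed = TRUE)[[1]]
  character_name <- paste(header[-1], collapse = " ")
  # Line 2 is the timestamp; everything after it is the subtitle text,
  # re-joined with "\n" so multi-line texts are preserved.
  text <- paste(lines[-(1:2)], collapse = "\n")
  data.frame(CHARACTER = character_name, TEXT = text, stringsAsFactors = FALSE)
}

# Stack the one-row data frames into the final result.
result <- do.call(rbind, lapply(blocks, parse_block))
print(result)
```

Working on positions (first line, second line, rest) rather than patterns keeps every step easy to inspect, at the cost of breaking if a block deviates from the three-part layout.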
And you get this data frame. You can name things whatever you want, of course.
EDIT
Here is a more streamlined way using a bit of tidyverse.
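One way such a tidyverse version might look, offered as a sketch rather than the answer's original code. It assumes dplyr, tidyr and stringr are installed, that blocks are separated by one or more blank lines, and that each block has the layout header line, timestamp line, text. The column names block, header and time are placeholders.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Sample input in the shape shown in the question.
txt <- "1 Matt\n00:00:00,100 --> 00:00:01,500\nThis is said by Matt.\n\n2 Lucas\n00:00:01,700 --> 00:00:02,300\nWhile this is said by Lucas"

result <- tibble(block = str_split(txt, "\n{2,}")[[1]]) %>%
  # Split each block at the first two line breaks only; extra = "merge"
  # keeps any further "\n" inside the TEXT column.
  separate(block, into = c("header", "time", "TEXT"), sep = "\n", extra = "merge") %>%
  # Strip the leading subtitle number from the header to get the name.
  mutate(CHARACTER = str_remove(header, "^\\d+\\s+")) %>%
  select(CHARACTER, TEXT)

print(result)
```

The column names in select() can be changed freely; only the order of the three pieces in into matters.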
You are right that you can use regular expressions to accomplish this. Using the stringr package is usually a good idea for this. It depends heavily on how consistent your texts are, but this works for your example. It might not work if there are exceptions to the rule, but you can tweak the patterns. regex101 is a great help for that.
After your feedback, I think splitting the text into chunks first using strsplit makes it easier to process. Then using dplyr and stringr:
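A possible sketch of this chunk-then-extract idea, given as an illustration under assumptions rather than the exact code from the answer. It assumes every block starts with a number, a space and a name, and that a new block begins wherever a blank line is followed by digits and a space. The patterns and the names txt, chunks and result are illustrative and may need tweaking for real files.

```r
library(dplyr)
library(stringr)

# Sample input in the shape shown in the question.
txt <- "1 Matt\n00:00:00,100 --> 00:00:01,500\nThis is said by Matt.\n\n2 Lucas\n00:00:01,700 --> 00:00:02,300\nWhile this is said by Lucas"

# Split only at blank lines that are followed by "<digits><space>",
# so a blank line inside a subtitle text usually does not start a new chunk.
chunks <- strsplit(txt, "\n\n(?=\\d+ )", perl = TRUE)[[1]]

result <- tibble(chunk = chunks) %>%
  mutate(
    # Name: whatever follows the leading number on the first line.
    CHARACTER = str_match(chunk, "^\\d+ +([^\n]+)")[, 2],
    # Text: the chunk with its first two lines (header, timestamp) removed.
    TEXT = str_remove(chunk, "^[^\n]*\n[^\n]*\n")
  ) %>%
  select(CHARACTER, TEXT)

print(result)
```

A subtitle text that itself contains a blank line followed by a number and a space would still be split incorrectly, which is the kind of exception the patterns would have to be adjusted for.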