Parsing text with R

Posted on 2025-02-05 14:14:17

I have some txt files that were originally SRT files, the format in which subtitles are published.
They usually follow this pattern:

Subtitle_number
Beginning_min --> Ending_min
Text

As an example, this might be the structure of an srt file:

1
00:00:00,100 --> 00:00:01,500
This is the first subtitle

2
00:00:01,700 --> 00:00:02,300
of the movie

Now, I have some "modified" SRT files, which differ from normal ones in that the character's name appears right after the subtitle number. Here is an example:

1 Matt
00:00:00,100 --> 00:00:01,500
This is said by Matt

2 Lucas
00:00:01,700 --> 00:00:02,300
While this is said by Lucas

What I would like to do is parse these files to create a data.frame like the following:

+---------------------------------------------+
| CHARACTER    |  TEXT                        |
|--------------+------------------------------|
| Matt         |  This is said by Matt        | 
|--------------+------------------------------|
| Lucas        |  While this is said by Lucas |
+---------------------------------------------+

So, I do not want the subtitle number or the timestamps.
I have been able to read the text with the readtext library, which gives something like this:

1 Matt\n00:00:00,100 --> 00:00:01,500\nThis is said by Matt.\n\n2 Lucas\n00:00:01,700 --> 00:00:02,300\nWhile this is said by Lucas

Note that the texts themselves may also contain \n, as well as any other (readable) character.

This is where I am stuck. I guess I will have to use some kind of regex to extract all the names and then all the texts, but I have no clue how to do this.

Any help is highly appreciated!

夏见 2025-02-12 14:14:18

Here is a step-by-step way to do this without regex. It's a bit sloppy, but it shows the logic of how to approach a file like this. The end result is a data frame from which you can grab the info you want.

txt <- "1 Matt
00:00:00,100 --> 00:00:01,500
This is said by Matt

2 Lucas
00:00:01,700 --> 00:00:02,300
While this is said by Lucas\nand another line

3
00:00:01,700 --> 00:00:02,300
While this is said by nobody"

library(readr)
library(tidyr)
library(tibble)
library(dplyr)
library(purrr)

df <- tibble(txt = read_lines(txt))

df %>% 
  rowid_to_column("row") %>% 
  # a blank line starts a new subtitle block
  group_by(group = cumsum(txt == "")) %>% 
  filter(!(txt == "")) %>% 
  # line 1 = number and name, line 2 = timestamps, lines 3+ = text
  mutate(field = pmin(row_number(), 3)) %>% 
  group_by(group, field) %>% 
  # glue multi-line text back together
  summarize(txt = paste(txt, collapse = "\n"), .groups = "drop") %>% 
  pivot_wider(names_from = "field",
              values_from = "txt") %>% 
  select(-group) %>% 
  set_names(c("Col1", "Col2", "Col3")) %>% 
  # split "1 Matt" into number and name; a missing name becomes NA
  separate(Col1, c("Col1A", "Col1B"), extra = "merge", fill = "right")

And you get this data frame. You can name things whatever you want, of course.

# A tibble: 3 x 4
  Col1A Col1B Col2                          Col3                                           
  <chr> <chr> <chr>                         <chr>                                          
1 1     Matt  00:00:00,100 --> 00:00:01,500 "This is said by Matt"                         
2 2     Lucas 00:00:01,700 --> 00:00:02,300 "While this is said by Lucas\nand another line"
3 3     NA    00:00:01,700 --> 00:00:02,300 "While this is said by nobody"

EDIT

Here is a more streamlined way using a bit of tidyverse.

library(tidyr)
library(dplyr)

tibble(txt = txt) %>% 
  separate_rows(txt, sep = "\\n\\n") %>% 
  separate(txt, c("A", "B", "C"), sep = "\n", extra = "merge") %>% 
  separate(A, c("A1", "B2"), extra = "merge", fill = "right")
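
As a sketch of the final step, assuming the same sample text as above, the streamlined pipeline can be trimmed down to the two columns the question asks for. The intermediate column names (header, time, number) are placeholders introduced here for illustration; they are not part of the original answer.

```r
library(tidyr)
library(dplyr)

# Sample input, rebuilt here so the example is self-contained
txt <- paste(
  c("1 Matt",
    "00:00:00,100 --> 00:00:01,500",
    "This is said by Matt",
    "",
    "2 Lucas",
    "00:00:01,700 --> 00:00:02,300",
    "While this is said by Lucas",
    "and another line",
    "",
    "3",
    "00:00:01,700 --> 00:00:02,300",
    "While this is said by nobody"),
  collapse = "\n"
)

result <- tibble(txt = txt) %>%
  # one row per subtitle block (blocks are separated by a blank line)
  separate_rows(txt, sep = "\\n\\n") %>%
  # first line, second line, and everything else (merged) as the text
  separate(txt, c("header", "time", "TEXT"), sep = "\n", extra = "merge") %>%
  # split the first line into number and name; a missing name becomes NA
  separate(header, c("number", "CHARACTER"), extra = "merge", fill = "right") %>%
  # keep only what the question asks for
  select(CHARACTER, TEXT)

print(result)
```

Subtitles without a character name end up with NA in CHARACTER; they could be dropped with filter(!is.na(CHARACTER)) if they are not wanted.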


金兰素衣 2025-02-12 14:14:18

You are right that you can use regular expressions to try and accomplish this. Using the stringr package is usually a good idea for this. It highly depends on how consistent your texts are, but this works for your example. It might not work if there are exceptions to the rule, but you can tweak the patterns. Using regex101 is a great help.

After your feedback, I think splitting the text into chunks first with strsplit makes it easier to process. Then use dplyr and stringr:

library(dplyr)

input_string <- "1 Matt\n00:00:00,100 --> 00:00:01,500\nThis is said by Matt.\n\n
                 2 Lucas\n00:00:01,700 --> 00:00:02,300\nWhile this is said by Lucas:\n Hi I'm Lucas\n Lucas is my name\n\n
                 1237 VvdL\n00:00:02,701 --> 00:00:02,900\nI'm\nHappy\nThis\nSeems\nTo\nWork"

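# Split into one chunk per subtitle; strsplit() returns a list, so take its first element
tmp <- data.frame(
  full_line = strsplit(input_string, split = "\\n\\n", perl = TRUE)[[1]],
  stringsAsFactors = FALSE
)

tmp %>%
  # str_extract() returns a character vector; str_extract_all() would create list columns
  mutate(CHARACTER = stringr::str_extract(full_line, "(?<=\\d )[a-zA-Z]+?(?=\\n)"), 
         TEXT = stringr::str_extract(full_line, "(?<=\\d\\n)(.|\\s)*")) %>%
  select(CHARACTER, TEXT)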


 CHARACTER                                                           TEXT
1      Matt                                          This is said by Matt.
2     Lucas While this is said by Lucas:\n Hi I'm Lucas\n Lucas is my name
3      VvdL                              I'm\nHappy\nThis\nSeems\nTo\nWork
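
The same split-then-extract idea can also be sketched in base R alone. This is a minimal sketch, not the answer's own method: parse_srt is a hypothetical helper name, and it assumes every block has the number (and optional name) on line 1, the timestamps on line 2, and the text on all remaining lines.

```r
# Hypothetical helper: turn "modified" SRT text into CHARACTER / TEXT columns.
parse_srt <- function(x) {
  # Normalise Windows line endings
  x <- gsub("\r\n", "\n", x, fixed = TRUE)
  # Split into blocks on blank lines, then drop surrounding whitespace and empty blocks
  blocks <- strsplit(x, "\n\\s*\n", perl = TRUE)[[1]]
  blocks <- trimws(blocks)
  blocks <- blocks[nzchar(blocks)]

  rows <- lapply(blocks, function(block) {
    lines <- strsplit(block, "\n", fixed = TRUE)[[1]]
    # Line 1: strip the leading subtitle number, what remains is the name
    name <- trimws(sub("^\\d+", "", lines[1]))
    data.frame(
      CHARACTER = if (nzchar(name)) name else NA_character_,
      # Lines 3 and onward are the text; line 2 (timestamps) is skipped
      TEXT = paste(lines[-(1:2)], collapse = "\n"),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

input <- "1 Matt\n00:00:00,100 --> 00:00:01,500\nThis is said by Matt.\n\n2 Lucas\n00:00:01,700 --> 00:00:02,300\nWhile this is said by Lucas\nand a second line"
result <- parse_srt(input)
print(result)
```

Splitting each block by line avoids lookbehind patterns, so names containing spaces or non-ASCII letters are kept as they are, as long as the three-part block layout holds.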
