How to remove lowercase sentence fragments from text?

Posted on 2024-08-25 00:46:00

I'm trying to remove lowercase sentence fragments from standard text files using regular expressions or a simple Perl one-liner.

These are commonly referred to as speech or attribution tags, for example - he said, she said, etc.

This example shows before and after using manual deletion:

  1. Original:

"Ah, that's perfectly true!" exclaimed Alyosha.

"Oh, do leave off playing the fool! Some idiot comes in, and you put us
to shame!" cried the girl by the window, suddenly turning to her father
with a disdainful and contemptuous air.

"Wait a little, Varvara!" cried her father, speaking peremptorily but
looking at them quite approvingly. "That's her character," he said,
addressing Alyosha again.

"Where have you been?" he asked him.

"I think," he said, "I've forgotten something... my handkerchief, I
think.... Well, even if I've not forgotten anything, let me stay a
little."

He sat down. Father stood over him.

"You sit down, too," said he.


  2. All lowercase sentence fragments manually removed:

"Ah, that's perfectly true!"

"Oh, do leave off playing the fool! Some idiot comes in, and you put us
to shame!"

"Wait a little, Varvara!" "That's her character,"

"Where have you been?"

"I think," "I've forgotten something... my handkerchief, I
think.... Well, even if I've not forgotten anything, let me stay a
little."

He sat down. Father stood over him.

"You sit down, too,"


I've changed the straight quotes " to balanced (curly) quotes and tried: ” (...)+[.]

Of course, this removes some fragments, but it also deletes some text inside balanced quotes and some text starting with uppercase letters. Adding [^A-Z] to the above expression didn't work.

I realize that it may be impossible to achieve 100% accuracy but any useful expression, perl, or python script would be deeply appreciated.

Cheers,

Aaron


Comments (5)

栀梦 2024-09-01 00:46:00

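Here's a Python snippet that should do:

 import re

 thetext = """triple quoted paste of your sample text"""
 for line in thetext.split('\n'):
     # Collect every straight-quoted span on this line
     m = re.findall(r'(".*?")', line)
     if m:
         print(' '.join(m))
     else:
         # No quotes on this line: print it unchanged
         print(line)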
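
The snippet above works one line at a time, so a quote that wraps across lines (as in the second sample paragraph) is not picked up. A minimal sketch of a paragraph-based variant follows. It assumes paragraphs are separated by blank lines and that quotes are straight double quotes; `strip_attribution` is a hypothetical name used only for illustration.

```python
import re

# A straight-quoted span; [^"] also matches newlines, so wrapped quotes are found
QUOTE = re.compile(r'"[^"]*"')

def strip_attribution(text):
    """Keep quoted speech and quote-free paragraphs; drop text around quotes."""
    cleaned = []
    # Assumption: paragraphs are separated by one or more blank lines.
    for para in re.split(r"\n\s*\n", text.strip()):
        quotes = QUOTE.findall(para)
        # Paragraphs without any quotes are narration and are kept unchanged.
        cleaned.append(" ".join(quotes) if quotes else para)
    return "\n\n".join(cleaned)

sample = '''"Oh, do leave off playing the fool! Some idiot comes in, and you put us
to shame!" cried the girl by the window, suddenly turning to her father
with a disdainful and contemptuous air.

He sat down. Father stood over him.

"I think," he said, "I've forgotten something."'''

print(strip_attribution(sample))
```

Because it keeps only what sits between quote marks, this also drops capitalised attribution such as `He said,` before a quote. It will misbehave on unbalanced quotes or on a quotation that runs across several paragraphs.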
策马西风 2024-09-01 00:46:00

This works for all cases shown in the question:

sed -n '/"/!{p;b}; s/\(.*\)"[^"]*/\1" /;s/\(.*"\)\([^"]*\)\(".*"\)/\1 \3/;p' textfile

It fails for cases such as these:

He said, "It doesn't always work."

"Secondly," I said, "it fails for three quoted phrases..." He completed my thought, "with two unquoted ones."

I replied, "That's right." dejectedly.
旧人 2024-09-01 00:46:00

The Text::Balanced module is what you seem to be after if you're looking to use Perl. The following should be able to extract all the quoted speech in your example (not pretty, but gets the job done).

It also works for Dennis' test cases.

The advantage of the code below is that the quotes are grouped by paragraph, which may or may not be useful for later analysis.

Script

use strict;
use warnings;
use Text::Balanced qw/extract_quotelike extract_multiple/;

my %quotedSpeech;

{
    local $/ = '';
    while (my $text = <DATA>) { # one paragraph at a time

        while (my $speech = extract_multiple(
                            $text,
                            [sub{extract_quotelike($_[0])},],
                            undef,
                            1))
        {   push @{$quotedSpeech{$.}}, $speech; }
    }
}

# Print total number of paragraphs in DATA filehandle

print "Total paragraphs: ", (sort {$a <=> $b} keys %quotedSpeech)[-1];

# Print quotes grouped by paragraph:

foreach my $paraNumber (sort {$a <=> $b} keys %quotedSpeech) {
    print "\n\nPara ",$paraNumber;
    foreach my $speech (@{$quotedSpeech{$paraNumber}}) {
        print "\t",$speech,"\n";
    }
}
# How many quotes in paragraph 8?
print "Number of quotes in Paragraph 8: ", scalar @{$quotedSpeech{8}};

__DATA__

"Ah, that's perfectly true!" exclaimed Alyosha.

"Oh, do leave off playing the fool!
Some idiot comes in, and you put us to
shame!" cried the girl by the window,
suddenly turning to her father with a
disdainful and contemptuous air.

"Wait a little, Varvara!" cried her
father, speaking peremptorily but
looking at them quite approvingly.
"That's her character," he said,
addressing Alyosha again.

"Where have you been?" he asked him.

"I think," he said, "I've forgotten
something... my handkerchief, I
think.... Well, even if I've not
forgotten anything, let me stay a
little."

He sat down. Father stood over him.

"You sit down, too," said he.

He said, "It doesn't always work."

"Secondly," I said, "it fails for
three quoted phrases..." He completed
my thought, "with two unquoted ones."

I replied, "That's right." dejectedly.

Output

Total paragraphs: 10

Para 1  "Ah, that's perfectly true!"


Para 2  "Oh, do leave off playing the fool! Some idiot comes in, and you put us
to shame!"


Para 3  "Wait a little, Varvara!"
        "That's her character,"


Para 4  "Where have you been?"


Para 5  "I think,"
        "I've forgotten something... my handkerchief, I think.... Well, even if
I've not forgotten anything, let me stay a little."


Para 7  "You sit down, too,"


Para 8  "It doesn't always work."


Para 9  "Secondly,"
        "it fails for three quoted phrases..."
        "with two unquoted ones."


Para 10 "That's right."
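
For readers working in Python, a similar grouping by paragraph can be sketched with a plain regular expression. This is not a port of Text::Balanced: it assumes straight double quotes, blank-line paragraph separators, and no nested or escaped quotes. `quotes_by_paragraph` is a hypothetical helper name.

```python
import re

def quotes_by_paragraph(text):
    """Map 1-based paragraph number -> list of quoted spans in that paragraph."""
    grouped = {}
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    for number, para in enumerate(paragraphs, start=1):
        quotes = re.findall(r'"[^"]*"', para)
        if quotes:
            # Collapse line breaks inside a wrapped quote into single spaces
            grouped[number] = [" ".join(q.split()) for q in quotes]
    return grouped

sample = '''"Wait a little, Varvara!" cried her
father. "That's her character," he said.

He sat down. Father stood over him.

"You sit down, too," said he.'''

for number, quotes in quotes_by_paragraph(sample).items():
    print("Para", number, quotes)
```

Paragraphs without quotes simply have no entry, so the number of quotes in a given paragraph can be read with `len(grouped.get(n, []))`.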
£烟消云散 2024-09-01 00:46:00

I am not entirely sure which editor you are using. If you are using an editor that supports atomic grouping (e.g. EditorPad Pro), you can use the regular expression below to do the search and replace:

Search for

(".+?"|^[A-Z].+\r\n)(.(?!"))* 
Note: you should replace \r\n with \n or \r according to your line breaks

Replace with

\1

Here is a brief explanation of the regular expression:

The first capturing group matches either the characters between a pair of quotes or a whole line starting with a capital letter. The second capturing group matches any characters that come after a quote but before the next quote.
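
Outside an editor, the same search and replace can be tried from a script. A minimal sketch in Python, assuming `\n` line endings and straight double quotes; the pattern is the one quoted above with `\r\n` replaced by `\n`, and `strip_tags` is a hypothetical helper name.

```python
import re

# re.MULTILINE lets ^ match at the start of every line, as an editor would
PATTERN = re.compile(r'(".+?"|^[A-Z].+\n)(.(?!"))*', re.MULTILINE)

def strip_tags(text):
    # Keep group 1 (the quote or the capitalised line) and drop what follows it
    return PATTERN.sub(r"\1", text)

print(strip_tags('"Where have you been?" he asked him.\n'))
# prints: "Where have you been?"
```

Two limits are worth knowing. A line that begins with a capital letter is kept whole by the second alternative, so `He said, "It doesn't always work."` passes through unchanged. And because `.` does not match a newline here, a quote that wraps onto the next line is left alone.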

箹锭⒈辈孓 2024-09-01 00:46:00

If I understand what you are after... passing each line through a regex like this should work...

You can use the Perl debugger to play around with this. Hop into the debugger by running perl -de 42 on the command line in Linux/Mac. (The "42" is just a valid expression - it could be anything, but why not choose the meaning of life?)

Anyway:

use strict;
use warnings;

open my $fh, "<", "filename.txt" or die $!;
while (my $line = <$fh>) {
    chomp $line;
    # Capture each quoted span, or the whole line when it contains no quotes
    my @fixed_text = grep { defined } $line =~ m{ (" .+? ") | (\A [^"]* \z) }xmsg;
    print join(" ", @fixed_text), "\n";
}
close $fh;

NOTE: Sorry I had to edit it - didn't see you wanted lines without any quotes at all...

Yes, Regex and Perl are amazing. It should be 100% accurate and get all of your instances, except in the case where a quote extends across lines or paragraphs.
