How do I remove lowercase sentence fragments from text?
I'm trying to remove lowercase sentence fragments from standard text files using regular expressions or a simple Perl one-liner.
These are commonly referred to as speech or attribution tags, for example: he said, she said, etc.
This example shows before and after using manual deletion:
- Original:
"Ah, that's perfectly true!" exclaimed Alyosha.
"Oh, do leave off playing the fool! Some idiot comes in, and you put us
to shame!" cried the girl by the window, suddenly turning to her father
with a disdainful and contemptuous air.
"Wait a little, Varvara!" cried her father, speaking peremptorily but
looking at them quite approvingly. "That's her character," he said,
addressing Alyosha again.
"Where have you been?" he asked him.
"I think," he said, "I've forgotten something... my handkerchief, I
think.... Well, even if I've not forgotten anything, let me stay a
little."
He sat down. Father stood over him.
"You sit down, too," said he.
- All lower case sentence fragments manually removed:
"Ah, that's perfectly true!"
"Oh, do leave off playing the fool! Some idiot comes in, and you put us
to shame!"
"Wait a little, Varvara!" "That's her character,"
"Where have you been?"
"I think," "I've forgotten something... my handkerchief, I
think.... Well, even if I've not forgotten anything, let me stay a
little."
He sat down. Father stood over him.
"You sit down, too,"
I've changed the straight quotes (") to balanced (curly) quotes and tried: ” (...)+[.]
Of course, this removes some fragments, but it also deletes some text inside the balanced quotes and some text starting with uppercase letters. Using [^A-Z] in the above expression didn't help.
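To illustrate why the attempted pattern over-deletes, here is a small sketch in Python. It assumes the pattern is used with Python's `re` module and that each match is replaced by the closing quote alone; the question does not say which tool or replacement was used, and the sample string is made up for the demonstration. The `.` wildcard matches any character, including quotes and capital letters, and `+` is greedy, so a match can run from the first closing quote through a later period and swallow a whole quoted span on the way.

```python
import re

# The pattern from the question: a closing curly quote, a space, one or more
# groups of exactly three arbitrary characters, then a literal period.
attempt = re.compile(r"” (...)+[.]")

# Made-up sample: two short quotes, each followed by an attribution tag.
sample = "“A!” he said. “B,” he said."

# Assumption: each match is replaced by the closing quote alone.
result = attempt.sub("”", sample)
print(result)  # prints “A!” (the second quote is gone as well)
```

The match only ends on a period that sits a multiple of three characters after the quote, which is why the pattern catches some tags and misses others.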
I realize that it may be impossible to achieve 100% accuracy, but any useful expression or Perl or Python script would be deeply appreciated.
Cheers,
Aaron
Comments (5)
Here's a Python snippet that should do the job:
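The answerer's snippet is not reproduced in this copy. As a stand-in, the following is a minimal sketch of one possible approach, not the original code. It assumes speech is delimited by straight or curly double quotes and that an attribution tag is a piece of unquoted text that directly follows a quote and starts with a lowercase letter. The names `QUOTED`, `TAG`, and `strip_attribution_tags` are hypothetical.

```python
import re

# A quoted span in straight or curly double quotes (may span line breaks).
QUOTED = re.compile(r'("[^"]*"|“[^”]*”)')

# A fragment that begins with a lowercase letter and runs to the next
# sentence-ending punctuation, or to the end of the unquoted piece.
TAG = re.compile(r'^\s*[a-z][^.!?]*?(?:[.!?]+|(?=\s*$))')

def strip_attribution_tags(text):
    """Hypothetical helper: drop lowercase fragments that follow a quote."""
    pieces = QUOTED.split(text)  # odd indices hold the quoted spans
    for i in range(2, len(pieces), 2):  # unquoted pieces after a quote
        pieces[i] = TAG.sub("", pieces[i], count=1)
    return "".join(pieces)

sample = (
    '"Wait a little, Varvara!" cried her father, speaking peremptorily but\n'
    'looking at them quite approvingly. "That\'s her character," he said,\n'
    'addressing Alyosha again.\n'
    'He sat down. Father stood over him.\n'
    '"You sit down, too," said he.\n'
)
print(strip_attribution_tags(sample))
```

Unquoted sentences that start with a capital letter, such as "He sat down.", are left alone. The sketch stops a tag at the first period, so a tag containing an abbreviation (for example "said Mr. Smith.") would only be partly removed, and an apostrophe-style single quote is never treated as a speech delimiter.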
This works for all cases shown in the question:
It fails for cases such as these:
The Text::Balanced module is what you seem to be after if you're looking to use Perl. The following should be able to extract all the quoted speech in your example (not pretty, but it gets the job done). It also works for Dennis' test cases.
The advantage of the code below is that the quotes are grouped by paragraph, which may or may not be useful for later analysis.
Script
Output
I am not entirely sure which editor you are using. If you are using an editor that supports atomic grouping (e.g. EditPad Pro), you can use the regular expression below to do the search and replace:
Search for
Replace with
Here is a brief explanation of the regular expression:
If I understand what you are after, passing each line through a regex like this should work...
You can use the Perl debugger to play around with this. Hop into the Perl debugger with just a
perl -de 42
on the command line in Linux/Mac. (The "42" is just a valid expression; it could be anything, but why not choose the meaning of life?) Anyways...
NOTE: Sorry, I had to edit it. I didn't see that you also wanted lines without any quotes at all...
Yes, regex and Perl are amazing. It should be 100% accurate and get all of your instances, except in the case where a quote extends across paragraphs.
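The line-by-line idea described in this answer can be sketched roughly as follows. This is a Python rendering under stated assumptions, not the answerer's Perl regex, which is not reproduced here: each paragraph is assumed to sit on a single line, quotes are assumed to be straight double quotes, and `keep_quoted_speech` is a hypothetical name.

```python
import re

# A complete double-quoted span that opens and closes on the same line.
QUOTE = re.compile(r'"[^"]*"')

def keep_quoted_speech(line):
    """Hypothetical helper: keep only the quotes, or the whole line if it has none."""
    quotes = QUOTE.findall(line)
    return " ".join(quotes) if quotes else line

lines = [
    '"I think," he said, "I\'ve forgotten something."',
    'He sat down. Father stood over him.',
    '"You sit down, too," said he.',
]
for line in lines:
    print(keep_quoted_speech(line))
```

Because every line is examined on its own, a quote that opens on one line and closes on a later one is never matched, so those lines pass through unchanged. That corresponds to the exception noted above for quotes that extend across paragraphs.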