How do I remove text from the beginning of a file using a regular expression?
I have a bunch of files that contain a semi-standard header. That is, the look of it is very similar but the text changes somewhat.
I want to remove this header from all of the files.
From looking at the files, I know that what I want to remove is encapsulated between similar words.
So, for instance, I have:
Foo bar...some text here...
more text
Foo bar...I want to keep everything after this point
I tried this command in perl:
perl -pi -e "s/\A.*?Foo.bar*?Foo.bar//simxg" 00ws110.txt
But it doesn't work. I'm not a regex expert but hoping someone knows how to basically remove a chunk of text from the beginning of a file based on a text match and not the number of characters...
By default, ARGV (aka <>, which -p uses behind the scenes) reads only a single line at a time, so a pattern can never match across lines.

Workarounds:

1. Unset $/, which tells Perl to read a whole file at a time. BEGIN is necessary to have that code run before the first read is done.
2. Use -0, which sets $/ = "\0", so a text file containing no NUL bytes is read as a single record.
3. Take advantage of the flip-flop operator. This will skip printing starting from line 1 to /^Foo.bar/.
If your header stretches across more than one line, you must tell perl how much to read. If the files are small in comparison to memory, you may want to just slurp the whole file into memory:

The -0777 option sets perl to slurp mode, so $_ will hold each whole file each time through the loop. Also, always remember to set the backup extension. If you don't, you may find that you have wiped out your data accidentally and have no way to get it back. See perldoc perlrun for more information.

Given information from the comments, it looks like you are trying to strip all of the annoying stuff from the front of a Project Gutenberg ebook. If you understand all of the copyright issues involved, you should be able to get rid of the front matter like this:

The Project Gutenberg header ends with a marker line. A safer regex would also take into account the *END* at the end of that line, but I am lazy.
I might be misinterpreting what you're asking for, but it looks that simple to me:
Here you go! This replaces the first line of the file:
You can operate on the array and you will see the modifications in the array. You can delete elements from the array and it will erase the line from the file. Applying substitution on elements will substitute text from the lines.
If you want to delete the first two lines and keep something from the third, you can do something like this:
And this will do exactly what you need!