How can I remove text from the beginning of a file using a regular expression?

Posted 2024-07-15 19:38:23

I have a bunch of files that contain a semi-standard header. That is, the look of it is very similar but the text changes somewhat.

I want to remove this header from all of the files.

From looking at the files, I know that what I want to remove is encapsulated between similar words.

So, for instance, I have:

Foo bar...some text here...
more text
Foo bar...I want to keep everything after this point

I tried this command in perl:

perl -pi -e "s/\A.*?Foo.bar*?Foo.bar//simxg" 00ws110.txt

But it doesn't work. I'm not a regex expert but hoping someone knows how to basically remove a chunk of text from the beginning of a file based on a text match and not the number of characters...

Comments (4)

墨洒年华 2024-07-22 19:38:24

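By default, ARGV (aka <>, which -p uses behind the scenes) reads only a single line at a time, so the substitution never sees more than one line. The pattern also needs .*? between the two markers: as written, bar*? matches "ba" followed by optional r's, not arbitrary text.

Workarounds:

  1. Unset $/, which tells Perl to read a whole file at a time.

    perl -pi -e "BEGIN{undef $/}s/\A.*?Foo.bar.*?Foo.bar//simxg" 00ws110.txt

    BEGIN is necessary to have that code run before the first read is done.

  2. Use -0, which sets $/ = "\0", so a text file with no NUL bytes is read as a single record.

    perl -pi -0 -e "s/\A.*?Foo.bar.*?Foo.bar//simxg" 00ws110.txt

  3. Take advantage of the flip-flop operator.

    perl -ni -e "print unless 1 ... /^Foo.bar/" 00ws110.txt

    This skips printing from line 1 through the next line that matches /^Foo.bar/, so that matching line is dropped as well.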

终陌 2024-07-22 19:38:24

If your header stretches across more than one line you must tell perl how much to read. If the files are small in comparison to memory you may want to just slurp the whole file into memory:

perl -0777pi.orig -e 's/your regex/your replace/s' file1 file2 file3

The -0777 option sets perl to slurp mode, so $_ will hold each whole file in turn as the loop runs. Also, always remember to set the backup extension. If you don't, you may find that you have wiped out your data accidentally and have no way to get it back. See perldoc perlrun for more information.

Given information from the comments, it looks like you are trying to strip all of the annoying stuff from the front of a Project Gutenberg ebook. If you understand all of the copyright issues involved, you should be able to get rid of the front matter like this:

perl -ni.orig -e 'print unless 1 .. /^\*END/' 00ws110.txt

The Project Gutenberg header ends with

*END*THE SMALL PRINT! FOR PUBLIC DOMAIN ETEXTS*Ver.04.29.93*END*

A safer regex would take into account the *END* at the end of the line as well, but I am lazy.

又爬满兰若 2024-07-22 19:38:24

I might be misinterpreting what you're asking for, but it looks that simple to me:

perl -ni -e 'print unless 1..($. > 1 && /^Foo bar/)' 00ws110.txt

冷默言语 2024-07-22 19:38:24

Here you go! This applies a substitution to the first line of the file:


use Tie::File;

tie my @array, "Tie::File", "path_to_file" or die("can't tie the file");
$array[0] =~ s/text_i_want_to_replace/replacement_text/gi;
untie @array;

You can operate on the array and you will see the modifications in the file. Deleting an element from the array erases that line from the file, and applying a substitution to an element changes the text of the corresponding line.

If you want to delete the first two lines and keep something from the third, you can do something like this:


# tie the @array before this
shift @array;
shift @array;
$array[0] =~ s/foo bar\.\.\.//gi;
# untie the @array

This does what you need as long as the header is always exactly two lines long.
