How to remove a block of text (a pattern) from files using sed/awk

Posted on 2024-12-03 09:02:18

I have imported thousands of text files that contain a block of text I would like to remove.

It is not a fixed block of text but a pattern:

<!--
# Translator(s):
#
# username1 <email1>
# username2 <email2>
# usernameN <emailN>
#
-->

If the block appears, it lists one or more users together with their email addresses.

Comments (5)

寂寞清仓 2024-12-10 09:02:18

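I have another small awk program that accomplishes the task in very few lines of code. It can be used to remove patterns of text from a file, and both the start and the stop regexp can be changed. As written, it removes every block from '<!--' through '-->', not only the translator blocks.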

# remove_email.awk
# Usage: awk -f remove_email.awk yourfile
#
# The first rule uses a range pattern: it selects every line from a line
# matching '<!--' through the next line matching '-->' (both included).
# The action runs once per selected line; $0 holds only the current line.

# The if statement is not needed to accomplish the task, but may be useful.
# If uncommented, it prints the message for each line in the range that
# contains a '@'.

# 'next' stops processing the current line, reads the next input line and
# restarts the program from its first rule, so the line is never printed.

/<!--/, /-->/ {
    #if ($0 ~ /@/) {
    #    print "Found an email and removed that!"
    #}
    next
}

# Print every line that was not skipped by the rule above.
1 {
    print
}

Save the code in 'remove_email.awk' and run it with:
awk -f remove_email.awk yourfile
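
Because awk only writes the result to standard output, the files themselves stay unchanged. The following is a minimal sketch of one way to apply the same range pattern to many files in place. The function name strip_comment_blocks is hypothetical, and like the program above it drops every '<!--' ... '-->' block, so it assumes the files contain no other comments worth keeping.

```shell
# Hypothetical helper (not part of the original answer):
#   strip_comment_blocks FILE...
# Rewrites each file in place, dropping every line from '<!--' through '-->'.
strip_comment_blocks() {
    for f in "$@"; do
        tmp=$(mktemp) || return 1
        # Same range pattern as remove_email.awk, inlined to stay self-contained.
        if awk '/<!--/, /-->/ { next } { print }' "$f" > "$tmp"; then
            # Copy the content back instead of moving the temp file,
            # so the original file keeps its permissions.
            cat "$tmp" > "$f"
        fi
        rm -f "$tmp"
    done
}
```

It could be called as strip_comment_blocks *.txt, assuming the imported files match that glob. Trying it on a copy of the data first is advisable, since the original content is overwritten.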

浪漫人生路 2024-12-10 09:02:18

This sed solution might work (GNU sed):

sed '/^<!--/,/^-->/{/^<!--/{h;d};H;/^-->/{x;/^<!--\n# Translator(s):\n#\(\n# [^<]*<[^>]\+>\)\+\n#\n-->$/!p};d}' file

An alternative (perhaps a better solution):

sed '/^<!--/{:a;N;/^-->/M!ba;/^<!--\n# Translator(s):\n#\(\n# \w\+ <[^>]\+>\)\+\n#\n-->/d}' file

This gathers up the lines from the one that starts with <!-- through the one that starts with -->, then matches a pattern against the whole collection: the second line is # Translator(s):, the third line is #, the fourth and possibly further lines have the form # username <email address>, the penultimate line is #, and the last line is -->. If the collection matches, it is deleted; otherwise it is printed as normal.
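
The same gather-then-match idea can also be written in awk, which may help on systems without GNU sed. This is only a sketch under assumptions: the function name remove_translator_blocks is hypothetical, usernames are assumed to contain no spaces, and the block is assumed to follow the exact layout shown in the question.

```shell
# Hypothetical helper (not from the original answer):
#   remove_translator_blocks [FILE...]
# Reads the files (or standard input) and writes the result to standard output.
remove_translator_blocks() {
    awk '
    # Outside a block: start buffering at a line that begins with "<!--".
    !inblock && /^<!--/ { inblock = 1; buf = $0; next }

    # Inside a block: keep collecting lines until one begins with "-->".
    inblock {
        buf = buf "\n" $0
        if (/^-->/) {
            # The whole block is collected; print it only if it is NOT
            # a translator block of the expected shape.
            if (buf !~ /^<!--\n# Translator\(s\):\n#(\n# [^ <]+ <[^>]+>)+\n#\n-->$/)
                print buf
            inblock = 0
        }
        next
    }

    # Everything else passes through unchanged.
    { print }

    # An unterminated block at end of input is printed as it was read.
    END { if (inblock) print buf }
    ' "$@"
}
```

For example, remove_translator_blocks file > file.new would leave other HTML comments in place and drop only the blocks that match the translator layout.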

随波逐流 2024-12-10 09:02:18

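For this task you need look-ahead, which is normally done with a parser.

Another solution, though not a very efficient one, is to put each comment block into its own paragraph with sed and then let awk, in paragraph mode, print only the paragraphs that are not translator blocks:

sed "s/-->/&\n/;s/<!--/\n&/" file | awk 'BEGIN {RS = ""; FS = "\n"} !/# Translator\(s\):/ {print}'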

HTH Chris

和影子一齐双人舞 2024-12-10 09:02:18
perl -i.orig -00 -pe 's/<!--\s+#\s*Translator.*?\s-->//gs' file1 file2 file3
内心激荡 2024-12-10 09:02:18

If I understood your problem correctly, here is my solution. Save the following to a file called remove_blocks.awk:

# See the beginning of the block, mark it
/<!--/ {
    state = "block_started" 
}

# At the end of the block, if the block does not contain email, print
# out the whole block.
/^-->/ {
    if (!block_contains_user_email) {
        for (i = 0; i < count; i++) {
            print saved_line[i];
        }
        print
    }

    count = 0
    block_contains_user_email = 0
    state = ""
    next
}

# Encounter a block: save the lines and wait until the end of the block
# to decide if we should print it out
state == "block_started" {
    saved_line[count++] = $0
    if (NF>=3 && $3 ~ /@/) {
        block_contains_user_email = 1
    }
    next
}

# For everything else, print the line
1

Assuming your text is in data.txt (you can also pass several files):

awk -f remove_blocks.awk data.txt

The command above prints everything in the text file except the blocks that contain a user email. It writes to standard output and does not modify data.txt.
