Speed up searching large files with sed or an alternative

Posted 2025-02-01 12:28:43

I have several large files in which I need to find a specific string and extract everything from the line containing that string up to the next line that begins with a date. The file looks like this:

20220520-11:53:01.242: foofoobar
20220520-11:53:01.244: foo_bar blah: this_i_need
what 
to 
do
20220520-11:53:01.257: blablabla
20220520-11:53:01.257: bla this_i_need bla
20220520-11:53:01.258: barbarfooo

The output I need is this:

20220520-11:53:01.244: foo_bar blah: this_i_need
what 
to 
do
20220520-11:53:01.257: bla this_i_need bla

Now I'm using sed '/'"$string"'/,/'"$date"'/!d', which works as intended except that it also takes the next row with the date even if that row doesn't contain the string, but that's not a big problem.

The problem is that searching the files takes a really long time.
Is it possible to edit the sed command so it runs faster, or is there another option with a better runtime? Maybe using awk or grep?

EDIT: I forgot to add that the expected results occur multiple times in one file, so exiting after one match is not suitable. I am looping through multiple files in a for loop with the same $string and the same $date. There are a lot of factors slowing the script down that I can't change (extracting files one by one from a 7z archive, then searching and removing them after the search, all in one loop).

Comments (4)

篱下浅笙歌 2025-02-08 12:28:43

Using sed you might use:

sed -n '/this_i_need/{:a;N;/\n20220520/!ba;p;q}' file

Explanation

  • -n Prevent default printing of a line
  • /this_i_need/ When matching this_i_need
  • :a Set a label a to be able to jump back to
  • N pull the next line into the pattern space
  • /\n20220520/! If not matching a newline followed by the date
  • ba Jump back to the label (like a loop and process what is after the label again)
  • p When we do match a newline and the date, then print the pattern space
  • q Exit sed

Output

20220520-11:53:01.244: foo_bar blah: this_i_need
what 
to 
do
20220520-11:53:01.257: blablabla
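
The question's edit says that matches occur several times per file, so quitting after the first block would miss the later ones. A minimal sketch, assuming GNU sed and the question's $string and $date variables, is to drop the q. The temporary sample file is created here purely for illustration:

```shell
# Throwaway sample file mirroring the question's data (hypothetical path).
sample=$(mktemp)
cat > "$sample" <<'EOF'
20220520-11:53:01.242: foofoobar
20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: blablabla
20220520-11:53:01.257: bla this_i_need bla
20220520-11:53:01.258: barbarfooo
EOF

string='this_i_need'
date='20220520'

# Same N loop as above, but without q, so sed keeps scanning after each block.
# Every block still ends with the dated line that terminated it.
out=$(sed -n '/'"$string"'/{:a;N;/\n'"$date"'/!ba;p}' "$sample")
printf '%s\n' "$out"
rm -f "$sample"
```

As with the original range command, each block includes the dated line that closes it. One caveat of this sketch: if a block runs to the end of the file without a following dated line, GNU sed's N finds no next line and, under -n, exits without printing that block.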

习惯成性 2025-02-08 12:28:43

You might use the exit statement to tell GNU AWK to stop processing, which should give a speed gain if the lines you are looking for end well before the end of the file. Let the content of file.txt be

20220520-11:53:01.242: foofoobar
20220520-11:53:01.244: foo_bar blah: this_i_need
what 
to 
do
20220520-11:53:01.257: blablabla
20220520-11:53:01.257: bla this_i_need bla
20220520-11:53:01.258: barbarfooo

then

awk 's&&/^[[:digit:]]{8}.*this_i_need/{print;exit}/this_i_need/{p=1;s=1;next}p&&/^[[:digit:]]{8}/{p=0}p{print}' file.txt

gives output

what 
to 
do
20220520-11:53:01.257: bla this_i_need bla

Explanation: I use two flag variables, p (printing) and s (seen). I tell GNU AWK to

  • print the current line and exit if s is set and the line starts with 8 digits followed by zero or more characters and then this_i_need
  • set the p flag to 1 (true) and the s flag to 1 (true), then go to the next line, if this_i_need was found in the line
  • set the p flag to 0 (false) if the p flag is 1 and the line starts with 8 digits
  • print the current line if the p flag is set to 1

Note that the order of the actions is crucial.

Disclaimer: this solution assumes that a line starting with 8 digits is a line beginning with a date; if that is not the case, adjust the regular expression to your needs.

(tested in gawk 4.2.1)

农村范ル 2025-02-08 12:28:43

With sed it has to delete all the lines outside the matching ranges from the buffer, which is inefficient when the file is large.

You can instead use awk to output the desired lines directly by setting a flag upon matching the specific string and clearing the flag when matching a date pattern, and outputting the line when the flag is set:

awk '/[0-9]{8}/{f=0}/this_i_need/{f=1}f' file

Demo: https://ideone.com/J2ISVD
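
To check the two approaches against each other on something larger than the sample, a rough harness could look like the following. This is only a sketch: the repetition count is arbitrary, the temporary files are hypothetical, and the awk line swaps the [0-9]{8} regex for a literal prefix test on $date (my substitution, so that it also runs in awk builds without interval-expression support).

```shell
# Build a larger test file by repeating the question's sample (size is arbitrary).
sample=$(mktemp)
big=$(mktemp)
cat > "$sample" <<'EOF'
20220520-11:53:01.242: foofoobar
20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: blablabla
20220520-11:53:01.257: bla this_i_need bla
20220520-11:53:01.258: barbarfooo
EOF
reps=2000
awk -v n="$reps" '
    { line[NR] = $0 }                      # remember every sample line
    END {
        for (r = 0; r < n; r++)            # write the sample n times
            for (i = 1; i <= NR; i++)
                print line[i]
    }' "$sample" > "$big"

string='this_i_need'
date='20220520'

# The question's range command: keeps the closing dated line of each range.
sed_lines=$(sed '/'"$string"'/,/'"$date"'/!d' "$big" | wc -l)

# Flag approach: a line starting with $date clears the flag, a line containing
# $string sets it, and lines are printed while the flag is set.
awk_lines=$(awk -v s="$string" -v d="$date" '
    index($0, d) == 1 { f = 0 }
    index($0, s)      { f = 1 }
    f' "$big" | wc -l)

echo "sed range: $sed_lines lines, awk flag: $awk_lines lines"
rm -f "$sample" "$big"
```

The line counts differ because the sed range keeps the dated line that ends each range while the flag approach drops it. To compare runtimes, prefix each pipeline with time on your own files; no timing is claimed here.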

浅浅 2025-02-08 12:28:43

Assumptions:

  • start printing when we find the desired string
  • stop printing when we read a line that starts with any date (ie, any 8-digit string)

One awk idea:

string='this_i_need'

awk -v ptn="${string}" '         # pass bash variable "$string" in as awk variable "ptn"
/^[0-9]{8}/ { printme=0 }        # clear printme flag if line starts with 8-digit string
$0 ~ ptn    { printme=1 }        # set printme flag if we find "ptn" in the current line
printme                          # only print current line if printme==1
' foo.dat

Or as a one-liner sans comments:

awk -v ptn="${string}" '/^[0-9]{8}/ {printme=0} $0~ptn {printme=1} printme' foo.dat

NOTE: OP can rename the awk variables (ptn, printme) as desired, as long as they are not reserved keywords (see 'Keyword' in the awk glossary)

This generates:

20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: bla this_i_need bla
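
Since the question loops over many files with the same $string and $date, the flag idea can be wrapped in a small function. The sketch below is one possible arrangement, not a measured optimum. The function name extract_blocks is hypothetical, index() is swapped in for the regex match so that the string is treated literally, and the grep -qF pre-check and LC_ALL=C are assumptions about what may help: the pre-check only pays off when many files contain no match at all, and the C locale is often faster for byte-oriented matching on plain ASCII logs.

```shell
string='this_i_need'
date='20220520'

# extract_blocks FILE: print each block that starts at a line containing
# $string and stops before the next line beginning with $date.
extract_blocks() {
    # Cheap fixed-string check first: skip files that cannot match at all.
    LC_ALL=C grep -qF -- "$string" "$1" || return 0
    LC_ALL=C awk -v ptn="$string" -v d="$date" '
        index($0, d) == 1 { printme = 0 }   # a dated line closes any open block
        index($0, ptn)    { printme = 1 }   # a literal match opens (or reopens) one
        printme                             # print while a block is open
    ' "$1"
}

# Demonstration on a throwaway copy of the sample data (hypothetical path).
sample=$(mktemp)
cat > "$sample" <<'EOF'
20220520-11:53:01.242: foofoobar
20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: blablabla
20220520-11:53:01.257: bla this_i_need bla
20220520-11:53:01.258: barbarfooo
EOF
out=$(extract_blocks "$sample")
printf '%s\n' "$out"
rm -f "$sample"
```

In the real loop this would be called as extract_blocks "$f" for each extracted file. Note that awk -v interprets backslash escapes in the assigned value, so a search string containing backslashes would need a different way of being passed in.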