I have several large files in which I need to find a specific string and extract everything from the line that contains the string up to the next line that begins with a date. A file looks like this:
20220520-11:53:01.242: foofoobar
20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: blablabla
20220520-11:53:01.257: bla this_i_need bla
20220520-11:53:01.258: barbarfooo
The output I need is this:
20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: bla this_i_need bla
Now I'm using sed '/'"$string"'/,/'"$date"'/!d', which works as intended, except that it also takes the next row with the date even if that row doesn't contain the string, but that's not a big problem.
The problem is that it takes a really long time to search the files.
Is it possible to edit the sed command so it runs faster, or is there another option with a better runtime? Maybe using awk or grep?
EDIT: I forgot to add that the expected results occur multiple times in one file, so exiting after one match is not suitable. I am looping through multiple files in a for loop with the same $string and the same $date. There are a lot of factors slowing the script down that I can't change (extracting files one by one from a 7z, then searching and removing them after the search in one loop).
Comments (4)
Using sed you might use:
Explanation
  -n               Prevent default printing of a line
  /this_i_need/    When matching this_i_need
  :a               Set a label a to be able to jump back to
  N                Pull the next line into the pattern space
  /\n20220520/!    If not matching a newline followed by the date
  ba               Jump back to the label (like a loop, processing what is after the label again)
  p                When we do match a newline and the date, print the pattern space
  q                Exit sed
Output
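Read together, the pieces explained above would assemble into one sed invocation roughly like the sketch below. This is a reconstruction from the explanation, not necessarily the exact command the answer used. The temporary file, the variable names and the hard-coded date prefix 20220520 are assumptions made only to keep the sketch self-contained, and the script is split into separate -e fragments so that it does not rely on how a particular sed parses labels.

```shell
# Sample log copied from the question (written to a temporary file).
log=$(mktemp)
cat > "$log" <<'EOF'
20220520-11:53:01.242: foofoobar
20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: blablabla
20220520-11:53:01.257: bla this_i_need bla
20220520-11:53:01.258: barbarfooo
EOF

# On the first line containing this_i_need, keep appending lines (N)
# until an appended line starts with the date, then print and quit.
sed_out=$(sed -n \
  -e '/this_i_need/{' \
  -e ':a' \
  -e 'N' \
  -e '/\n20220520/!ba' \
  -e 'p' \
  -e 'q' \
  -e '}' "$log")

printf '%s\n' "$sed_out"
rm -f "$log"
```

On the sample this prints the first this_i_need line, the three continuation lines and the date line that closes the block. Because of q, sed stops after that first block, so later occurrences in the same file are not reported.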
You might use the exit statement to instruct GNU AWK to stop processing, which should give a speed gain if the lines you are looking for end far before the end of the file. Let file.txt content be
then
gives output
Explanation: I use 2 flag variables, p as printing and s as seen. I inform GNU AWK to:
- print the current line and exit if s (seen) is set and the line starts with 8 digits followed by 0 or more of any characters followed by this_i_need
- set the p flag to 1 (true) and the s flag to 1 (true) and go to the next line if this_i_need was found in the line
- set the p flag to 0 (false) if the p flag is 1 and the line starts with 8 digits
- print the current line if the p flag is set to 1
Note that the order of actions is crucial.
Disclaimer: this solution assumes that if a line starts with 8 digits, then it is a line beginning with a date; if this is not the case, adjust the regular expression according to your needs.
(tested in gawk 4.2.1)
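The four actions listed above can be sketched as the awk program below. It is a reconstruction under assumptions, not the answer's original code: the print inside the second rule is added here so that the first matching line is not lost before next skips the remaining rules, the eight digits are spelled out as a repeated bracket expression so the sketch does not depend on interval support such as {8}, and the names log, d8 and exit_out are made up for the example.

```shell
# Sample log copied from the question (written to a temporary file).
log=$(mktemp)
cat > "$log" <<'EOF'
20220520-11:53:01.242: foofoobar
20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: blablabla
20220520-11:53:01.257: bla this_i_need bla
20220520-11:53:01.258: barbarfooo
EOF

# Regex for "line starts with 8 digits" (assumed to mean a date line).
d8='^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'

exit_out=$(awk -v d8="$d8" '
  # Already seen one match and this date line matches again: print and stop.
  s && $0 ~ (d8 ".*this_i_need") { print; exit }
  # First match: start printing, remember that a match was seen.
  /this_i_need/ { p = 1; s = 1; print; next }
  # A new date line ends the block that is being printed.
  p && $0 ~ d8 { p = 0 }
  # Print continuation lines while the flag is set.
  p { print }
' "$log")

printf '%s\n' "$exit_out"
rm -f "$log"
```

The rule order matters exactly as the answer notes: the exit rule has to come before the rule that sets the flags, and the flag is cleared before the final print rule. Since the program exits on the second match, it fits the sample but would miss a third or later occurrence in the same file.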
With sed it has to delete all the lines outside the matching ranges from the buffer, which is inefficient when the file is large.
You can instead use awk to output the desired lines directly by setting a flag upon matching the specific string and clearing the flag when matching a date pattern, and outputting the line when the flag is set:
Demo: https://ideone.com/J2ISVD
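A minimal sketch of that flag approach, assuming $string and $date hold regular expressions as in the question's sed command. The values used here, the anchored form of the date pattern and the variable names found and flag_out are assumptions for illustration, and the code at the demo link may differ.

```shell
# Sample log copied from the question (written to a temporary file).
log=$(mktemp)
cat > "$log" <<'EOF'
20220520-11:53:01.242: foofoobar
20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: blablabla
20220520-11:53:01.257: bla this_i_need bla
20220520-11:53:01.258: barbarfooo
EOF

string='this_i_need'   # assumed search string
date='^20220520-'      # assumed date pattern, anchored at line start

flag_out=$(awk -v string="$string" -v date="$date" '
  $0 ~ date   { found = 0 }   # every date line first clears the flag
  $0 ~ string { found = 1 }   # a line with the string sets it again
  found                       # print the line while the flag is set
' "$log")

printf '%s\n' "$flag_out"
rm -f "$log"
```

Clearing the flag before testing for the string means a date line is printed only when it contains the string itself, so the extra trailing date line produced by the sed range does not appear, and every occurrence in the file is handled in a single pass.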
Assumptions:
One awk idea:
Or as a one-liner sans comments:
NOTE: OP can rename the awk variables (ptn, printme) as desired as long as they are not reserved keywords (see 'Keyword' in the awk glossary).
This generates:
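The answer's own code is not shown here, so the following is only a guess at what an awk program using the variables ptn and printme might look like. The generic date test (eight digits followed by a dash at the start of the line) and the names log and ptn_out are assumptions made for this sketch.

```shell
# Sample log copied from the question (written to a temporary file).
log=$(mktemp)
cat > "$log" <<'EOF'
20220520-11:53:01.242: foofoobar
20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: blablabla
20220520-11:53:01.257: bla this_i_need bla
20220520-11:53:01.258: barbarfooo
EOF

ptn_out=$(awk -v ptn='this_i_need' '
  # A line that starts with a date stamp ends any block being printed.
  /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]-/ { printme = 0 }
  # A line matching the pattern starts (or restarts) printing.
  $0 ~ ptn { printme = 1 }
  # Emit the current line while printme is set.
  printme
' "$log")

printf '%s\n' "$ptn_out"
rm -f "$log"
```

Passing the search pattern with -v keeps the awk program itself constant, so the same script text can be reused unchanged inside the for loop over the extracted files.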