How to print the nth match using awk
I am trying to split a large XML file into several smaller files. I found a solution that splits each node into its own file:
awk '/<mono/{close("row"count".xml");count++}count{f="row"count".xml";print $0 > f}' file.xml
The above code matches every "mono" node and outputs it to a file named row{rownumber}.xml. How can I print every 20 matches to a file?
Comments (2)
I would say keep your "count" variable; you just need to change the way you build your filename:
f="row" int(count/20) ".xml"
I originally said that you don't have to explicitly close the file, because all open files are closed when awk exits. Given the comments, I'll strike that remark. Note that in the code below, a file will be closed up to 20 times, but reopened as required.
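As a minimal sketch of the filename idea above (an illustration, not the answerer's own listing), the full command could look like the following. Two things here are assumptions beyond the answer: the group number is computed as int((count - 1) / 20) so that matches 1-20 share the first file, and a file is closed only when the group number changes, so each output file is opened once. The directory name demo_answer1 and the generated sample input are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample input: 45 single-line <mono> nodes (not from the original post).
mkdir -p demo_answer1 && rm -f demo_answer1/row*.xml
awk 'BEGIN { for (i = 1; i <= 45; i++) printf "<mono id=\"%d\"/>\n", i }' > demo_answer1/file.xml

(
    cd demo_answer1 || exit 1
    awk '
    /<mono/ {
        count++
        # Matches 1-20 map to group 0, 21-40 to group 1, and so on.
        next_f = "row" int((count - 1) / 20) ".xml"
        if (next_f != f) {
            if (f != "") close(f)   # done with the previous group
            f = next_f
        }
    }
    # Once the first match has been seen, copy every line to the current file.
    count { print > f }
    ' file.xml
)

wc -l demo_answer1/row*.xml
```

With this sample, row0.xml and row1.xml should each hold 20 lines and row2.xml the remaining 5.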
Maintain two counts: the current one and a repeat count. Only do the current activity (print the tag) when the repeat count modulo 20 is at the appropriate value (0 and 1, in the code shown):
The '== 1' condition in the second condition is a little untidy; there's probably a better way to handle that logic.
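One possible way to tidy that logic, offered only as a sketch under my own assumptions, is to compute a flag when a match is seen and let the print rule test the flag instead of comparing the counter with 1. The variable names seen, keep and file_no, the directory demo_nth and the sample input are all hypothetical. This version writes every 20th matching node (the 1st, 21st, 41st, ...) to its own file.

```shell
#!/bin/sh
# Hypothetical sample input: 100 single-line <mono> nodes.
mkdir -p demo_nth && rm -f demo_nth/row*.xml
awk 'BEGIN { for (i = 1; i <= 100; i++) printf "<mono id=\"%d\"/>\n", i }' > demo_nth/file.xml

(
    cd demo_nth || exit 1
    awk '
    /<mono/ {
        seen++                        # repeat count: every match
        keep = (seen % 20 == 1)       # true for matches 1, 21, 41, ...
        if (keep) {
            if (f != "") close(f)
            file_no++                 # current count: kept matches only
            f = "row" file_no ".xml"
        }
    }
    # Print the kept node and any lines that follow it, up to the next match.
    keep { print > f }
    ' file.xml
)

ls demo_nth/row*.xml
```

Because keep stays set until the next match, a node that spans several lines is copied in full.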
Note that your code detects '<monotonous>' as being Mono too.

Grouping records 1-20 in file1, 21-40 in file2, etc...
The same general idea applies: you have a file number and a matching record number, and you handle them appropriately. Tested code:
The first file will be row.xml. Subsequent files are row1.xml, etc.

I tested this on a file like this:
It contained 100 <mono> lines and a sprinkling of ignore lines (some repeated). It produced files row.xml, row1.xml, ... row4.xml with 20 lines in each. This was tested on MacOS X 10.6.6 with the standard (BSD) awk.
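The grouping approach described above can be sketched roughly as follows. This is my own illustration under stated assumptions, not the answerer's tested listing: the names file_no and rec, the directory demo_groups and the generated test file are hypothetical, and the pattern "<mono[ />]" is one way to avoid also matching '<monotonous>'. Only matching lines are written, which is consistent with the reported 20 lines per file.

```shell
#!/bin/sh
# Hypothetical test file: 100 <mono> lines, some "ignore" lines,
# and one <monotonous> line that must not be treated as a match.
mkdir -p demo_groups && rm -f demo_groups/row*.xml
awk 'BEGIN {
    for (i = 1; i <= 100; i++) {
        print "<mono>" i "</mono>"
        if (i % 7 == 0) print "ignore"
    }
    print "<monotonous>not a mono node</monotonous>"
}' > demo_groups/file.xml

(
    cd demo_groups || exit 1
    awk '
    $0 ~ "<mono[ />]" {
        if (rec == 20) {             # current file already holds 20 records
            close(f)
            file_no++
            rec = 0
        }
        # file_no is empty until the first rollover, so the first file is row.xml.
        f = "row" file_no ".xml"
        print > f
        rec++
    }
    ' file.xml
)

wc -l demo_groups/row*.xml
```

With this sample the run should produce row.xml and row1.xml through row4.xml, each with 20 lines, and the ignore and <monotonous> lines should appear in none of them.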