How to print the nth match using awk

Posted on 2024-10-20 10:17:40

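I am trying to split a large XML file into several smaller files. I found a solution that splits each node into its own file:

awk '/<mono/{close("row"count".xml");count++}count{f="row"count".xml";print $0 > f}' file.xml

The code above matches every "mono" node and writes it to a file named row{rownumber}.xml. How can I write every 20 matches to one file?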

陌生 2024-10-27 10:17:40

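I would say keep your "count" variable; you just need to change the way you build the filename: f = "row" int((count - 1) / 20) ".xml". Because count is incremented before the name is built, subtracting 1 keeps matches 1-20 together in row0.xml.

~~You don't have to explicitly close the file. All open files will be closed when awk exits.~~ Given the comments, I'll strike that remark. Note that in the code below, a file will be closed up to 20 times, but reopened as required.

awk '
  /<mono/ {close(f); count++; f = "row" int((count - 1) / 20) ".xml"}
  count {print >> f}
' file.xml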
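
If closing and reopening a file on every match is a concern, one variation is to close a file only when the output name changes. The following is a minimal sketch under that assumption; the `split_by_20` function and its two arguments (input file, output prefix) are made up for illustration. It uses `(count - 1)` so that matches 1-20 share a file, and `>` rather than `>>` so that a rerun overwrites earlier output instead of appending to it.

```shell
# Hypothetical helper: split_by_20 INPUT PREFIX
# Writes matches 1-20 to PREFIX0.xml, 21-40 to PREFIX1.xml, and so on.
split_by_20() {
  awk -v prefix="$2" '
    /<mono/ {
      count++
      name = prefix int((count - 1) / 20) ".xml"
      if (name != f) {            # first match of a new group of 20
        if (f != "") close(f)     # done with the previous file
        f = name
      }
    }
    count { print > f }           # ">" truncates on first open, then keeps writing
  ' "$1"
}

# Usage: split_by_20 file.xml row
```

Since each file is opened exactly once and never reopened, `>` is safe here; with the close-on-every-match approach, `>>` is needed to avoid truncating the file on each reopen.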

半透明的墙 2024-10-27 10:17:40

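Maintain two counts: the current one and a repeat count. Only do the current activity (print the tag) when the repeat count modulo 20 is at the appropriate value (0 and 1, in the code shown):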

awk '/<mono/ { if (repeat++ % 20 == 0) { close("row"count".xml"); count++ } }
     count && repeat % 20 == 1 { f = "row"count".xml"; print $0 > f}' file.xml

The '== 1' test in the second pattern is a little untidy; there's probably a better way to handle that logic.
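
One possible tidier form, offered as a sketch rather than a drop-in replacement, is to compute a flag once per match and let the second rule test only that flag. The names `keep` and `n`, and the `every_20th` wrapper with its two arguments (input file, output prefix), are assumptions introduced here.

```shell
# Hypothetical wrapper: every_20th INPUT PREFIX
# Writes match 1 to PREFIX1.xml, match 21 to PREFIX2.xml, and so on.
every_20th() {
  awk -v prefix="$2" '
    /<mono/ {
      keep = (n++ % 20 == 0)      # true for matches 1, 21, 41, ...
      if (keep) {
        if (f != "") close(f)     # finish the previous output file
        count++
        f = prefix count ".xml"
      }
    }
    keep { print > f }            # the match plus any lines that follow it
  ' "$1"
}

# Usage: every_20th file.xml row
```

As in the code above, lines that follow a selected match are written to the same file until the next match arrives.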

Note that your code detects '<monotonous>' as being Mono too.
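
A stricter pattern can avoid that false match. The sketch below assumes a `<mono` tag name is always followed by a space, a tab, `/`, `>`, or the end of the line; the `mono_lines` helper name is hypothetical. A dynamic regex string is used so that `/` needs no escaping.

```shell
# Hypothetical filter: print only lines containing a real <mono ...> tag.
mono_lines() {
  awk '$0 ~ "<mono([ \t/>]|$)"' "$@"
}

# <monotonous> is filtered out; the other three lines are printed.
printf '%s\n' '<mono>' '<monotonous>' '<mono id="1">' '<mono/>' | mono_lines
```

The same pattern could replace `/<mono/` in any of the commands on this page.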

Grouping records 1-20 in file1, 21-40 in file2, etc.

The same general idea applies: you have a file number and a matching record number, and you handle them appropriately. Tested code:

awk '/<mono/ {   if (recno > 1 && recno % 20 == 0) { close(file); count++;}
                 if (recno % 20 == 0) { file = "row" count ".xml" }
                 print $0 > file
                 recno++
             }' file.xml

The first file will be row.xml. Subsequent files are row1.xml, etc.

I tested this on a file like this:

<mono> <tonous val=001/> </mono>
ignore
<mono> <tonous val=002/> </mono>
<mono> <tonous val=003/> </mono>
<mono> <tonous val=004/> </mono>
<mono> <tonous val=005/> </mono>
ignore
<mono> <tonous val=006/> </mono>
<mono> <tonous val=007/> </mono>
<mono> <tonous val=008/> </mono>
<mono> <tonous val=009/> </mono>
ignore
<mono> <tonous val=010/> </mono>
<mono> <tonous val=011/> </mono>
<mono> <tonous val=012/> </mono>
<mono> <tonous val=013/> </mono>
<mono> <tonous val=014/> </mono>
ignore
<mono> <tonous val=015/> </mono>
<mono> <tonous val=016/> </mono>
<mono> <tonous val=017/> </mono>
<mono> <tonous val=018/> </mono>
<mono> <tonous val=019/> </mono>
ignore
<mono> <tonous val=020/> </mono>
<mono> <tonous val=021/> </mono>
<mono> <tonous val=022/> </mono>
<mono> <tonous val=023/> </mono>
ignore
<mono> <tonous val=024/> </mono>
...

It contained 100 <mono> lines and a sprinkling of ignore lines (some repeated). It produced files row.xml, row1.xml, ... row4.xml with 20 lines in each. This was tested on MacOS X 10.6.6 with the standard (BSD) awk.
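
To repeat a check along these lines, something like the following could be used. It is only an approximation of the test described: the sample generator places an ignore line after every fifth <mono> line, which is an assumption and not the exact layout of the original test file, and `tmp` is a scratch directory created for the purpose.

```shell
tmp=$(mktemp -d)

# Hypothetical sample: 100 <mono> lines, with an "ignore" line after every fifth one.
awk 'BEGIN {
  for (i = 1; i <= 100; i++) {
    printf "<mono> <tonous val=%03d/> </mono>\n", i
    if (i % 5 == 0) print "ignore"
  }
}' > "$tmp/file.xml"

# Run the grouping code from above inside the scratch directory.
(
  cd "$tmp" &&
  awk '/<mono/ {   if (recno > 1 && recno % 20 == 0) { close(file); count++;}
                   if (recno % 20 == 0) { file = "row" count ".xml" }
                   print $0 > file
                   recno++
               }' file.xml
)

# Line counts for row.xml, row1.xml, ... row4.xml
wc -l "$tmp"/row*.xml
```

Each of the five output files should report 20 lines, and none of them should contain an ignore line, since only matching lines are printed.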
