What is wrong with this awk regex replacement?

Posted 2024-09-10 06:32:21

I have a peculiar problem replacing some text in an xml file using awk regex matching.

The xml files are simple. There's a paragraph of text in the <narrative> node of each xml, and the awk program replaces this text with another paragraph picked from the text file rtxt. But for some reason the text in rtxt (labeled '42') that substitutes the text in 42.xml does not produce the proper substitution.

toxml.awk writes to stdout. It first prints the xml as it has read it, and then the final replaced result.

I actually have a collection of these xml files where I do a replacement with text picked from a longer rtxt. It so happens that this particular replacement (for 42.xml) doesn't work. Instead of the text in the <narrative> element being replaced, another tag gets nested within the existing one.


toxml.awk

BEGIN{
    srcfile = "rtxt"
    FS = "|"

    while (getline <srcfile) {
    xmlfile = $1 ".xml"
    rep = "<narrative>" $2 "</narrative>"

    ## read in the xml file in one go.
    ## (the last tag will be missing.)
    RS = "</topic>"
    FS = "</topic>"

    getline <xmlfile
    #print $0
    close(xmlfile)

    ## replace
    subs = gsub(/<narrative>.*<\/narrative>/, rep, $0)

    ## append the closing tag
    subs = gsub(/[ \n\r\t]+$/, "\n</topic>", $0)
    print $0

    ## restore them before reading rtxt.
    RS = "\n"
    FS = "|"
    }

    close(srcfile)
}

rtxt

42|Results showing details of training institutes for Java, and IT companies which provide Java solutions are also considered non-relevant. Java is a popular programming language developed at Sun Microsystems. I am interested to know about this programming language, and also to learn programming with it. To be relevant, a result should give information on history of Java & on different versions of Java, and on different concepts in Java. Its good if I find tutorials for learning Java. Results related only to Sun Microsystems but not Java are considered non-relevant. I like to find articles that discuss this programming language and various concepts & versions of it.


42.xml

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE topic SYSTEM "topic.dtd">
<topic id="2009042" ct_no="227">

  <title>sun java</title>

  <castitle>//article[about(.//language, java) or about(.,sun)]//sec[about(.//language, java)]</castitle>

  <phrasetitle>"sun java"</phrasetitle>

  <description>Find information about Sun Microsystem's Java language</description>

  <narrative>Java is a popular programming language developed at Sun Microsystems. I am interested to know about this programming language, and also to learn programming with it.    To be relevant, a result should give information on history of Java &amp; on different versions of Java, and on different concepts in Java. Its good if I find tutorials for learning Java. Results related only to Sun Microsystems but not Java are considered non-relevant. Results showing details of training institutes for Java, and IT companies which provide Java solutions are also considered non-relevant. I like to find articles that discuss this programming language and various concepts &amp; versions of it.  </narrative>

</topic>

Comments (1)

救赎№ 2024-09-17 06:32:21

Just a start

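#!/bin/bash

awk 'BEGIN{FS="|"}
# First file (rtxt): remember the replacement text for each id.
FNR==NR{  nar[$1]=$2; next }
END{
  # The remaining arguments are the xml files; re-read each one line by line.
  for(i=2;i<ARGC;i++){
     xmlfile=ARGV[i]
     split(xmlfile,fname,".")
     print "Doing file: "xmlfile
     print "---------------------------------"
     while( (getline line < xmlfile ) > 0)  {
         # Assumes the whole <narrative> element sits on a single line.
         if ( line ~ /<narrative>/ ){
            line="<narrative>"nar[fname[1]]"</narrative>"
         }
         print line
     }
     close(xmlfile)
  }
}' rtxt 42.xml 71.xml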
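
A possible explanation for the nested tag, offered as a sketch rather than a confirmed diagnosis: in the replacement string of sub() and gsub(), an unescaped & stands for the text the regex matched. The rtxt line for 42 contains &, so each & would expand to the whole matched <narrative>...</narrative> element, which fits the symptom described in the question. The snippet below uses made-up sample strings and a hypothetical function name (amp_demo), not the real files, and shows the effect of escaping & before the replacement.

```shell
#!/bin/bash

# Hypothetical demo: the sample strings are invented, not the real rtxt/42.xml.
amp_demo() {
    awk 'BEGIN {
        xml = "<topic><narrative>old text</narrative></topic>"
        rep = "<narrative>history & versions</narrative>"

        # Unescaped: each & in rep expands to the text matched by the regex.
        bad = xml
        gsub(/<narrative>.*<\/narrative>/, rep, bad)
        print "unescaped: " bad

        # Turn every & into \& first, so gsub inserts a literal ampersand.
        safe = rep
        gsub(/&/, "\\\\&", safe)
        good = xml
        gsub(/<narrative>.*<\/narrative>/, safe, good)
        print "escaped:   " good
    }'
}

amp_demo
```

The first printed line should show the old element nested inside the new one, and the second a clean replacement. The line-by-line script above sidesteps the issue because it builds the line by plain string concatenation instead of gsub(). Note that in well-formed XML the ampersand would be written &amp;, which still contains & and would need the same escaping if passed through gsub().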