Extracting data from a simple XML file

Posted on 2024-08-20 20:06:24

I have an XML file with the following contents:

<?xml version="1.0" encoding="utf-8"?>
<job xmlns="http://www.sample.com/">programming</job>

I need a way to extract what is inside the <job ...> </job> tags, "programming" in this case. This should be done at the Linux command prompt, using grep/sed/awk.


Comments (11)

蝶舞 2024-08-27 20:06:25

Please don't use line- and regex-based parsing on XML. It is a bad idea. You can have semantically identical XML with different formatting, and regex- and line-based parsing simply cannot cope with it.

Consider self-closing (unary) tags and variable line wrapping. These snippets all 'say' the same thing:

<root>
  <sometag val1="fish" val2="carrot" val3="narf"></sometag>
</root>


<root>
  <sometag
      val1="fish"
      val2="carrot"
      val3="narf"></sometag>
</root>

<root
><sometag
val1="fish"
val2="carrot"
val3="narf"
></sometag></root>

<root><sometag val1="fish" val2="carrot" val3="narf"/></root>

Hopefully this makes it clear why writing a regex- or line-based parser is difficult. Fortunately, you don't need to. Many scripting languages have at least one parser available, and sometimes several.
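To make the fragility concrete, here is a minimal sketch in plain shell. The helper name naive_val1 and its sed pattern are illustrative assumptions, not part of any tool; the helper pulls the val1 attribute out of <sometag> the way a quick line-based one-liner typically would, and it is fed two of the equivalent documents from above.

```shell
#!/bin/sh
# naive_val1 (hypothetical helper): print the val1 attribute of <sometag>,
# assuming the tag name and the attribute sit on the same line.
naive_val1() {
    sed -n 's/.*<sometag[^>]*val1="\([^"]*\)".*/\1/p'
}

one_line='<root><sometag val1="fish" val2="carrot" val3="narf"/></root>'
wrapped='<root>
  <sometag
      val1="fish"
      val2="carrot"
      val3="narf"></sometag>
</root>'

printf '%s\n' "$one_line" | naive_val1   # prints: fish
printf '%s\n' "$wrapped"  | naive_val1   # prints nothing: the tag spans several lines
```

Both inputs carry the same data, yet only the first one yields a result, which is exactly the failure mode a real XML parser avoids.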

As a previous poster has alluded to, xml_grep is available. It is a tool built on the XML::Twig Perl library. It uses 'XPath expressions' to find things, and it differentiates between document structure, attributes and 'content'.

E.g.:

xml_grep 'job' jobs.xml --text_only

However, in the interest of giving a better answer, here are a couple of 'roll your own' examples based on your source data:

First way:

Use twig handlers, which catch elements of a particular type and act on them. The advantage of doing it this way is that it parses the XML 'as you go', and lets you modify it in flight if you need to. This is particularly useful for discarding 'processed' XML when you're working with large files, using purge or flush:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

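XML::Twig->new(
    twig_handlers => {
        # Called once for every complete <job> element
        'job' => sub { print $_->text }
    }
    # Slurp all input into one string: parse() takes a single source argument
    )->parse( do { local $/; <> } );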

This uses <> to take input (piped in, or from a file named on the command line, as in ./myscript somefile.xml) and processes it: for each job element, it extracts and prints the associated text. (You might want print $_->text, "\n" to add a linefeed.)

Because it's matching on 'job' elements, it'll also match on nested job elements:

<job>programming
    <job>anotherjob</job>
</job>

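This input will match twice, and some of the text will be printed twice too, because the outer element's text includes the inner element's text. You can, however, match on /job instead if you prefer, which matches only the root element. Handlers are also useful because they let you, for example, print and delete an element, or copy and paste one, modifying the XML structure.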

Alternatively - parse first, and 'print' based on structure:

# Slurp all input into one string: parse() takes a single source argument
my $twig = XML::Twig->new->parse( do { local $/; <> } );
print $twig->root->text;

As job is your root element, all we need to do is print its text.

But we can be a bit more discerning, and look for job or /job and print that specifically instead:

# Slurp all input into one string: parse() takes a single source argument
my $twig = XML::Twig->new->parse( do { local $/; <> } );
print $twig->findnodes('/job', 0)->text;

You can also use XML::Twig's pretty_print option to reformat your XML:

XML::Twig->new( pretty_print => 'indented_a' )->parse( do { local $/; <> } )->print;

There's a variety of output format options, but for simpler XML (like yours) most will look pretty similar.

孤独陪着我 2024-08-27 20:06:25

Just use awk; no other external tools are needed. The command below also works when the desired tag spans multiple lines.

$ cat file
test
<job xmlns="http://www.sample.com/">programming</job>
<job xmlns="http://www.sample.com/">
programming</job>

$ awk -vRS="</job>" '{gsub(/.*<job.*>/,"");print}' file
programming

programming
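
One caveat about the command above: a multi-character RS is not specified by POSIX, so it relies on an awk such as gawk that accepts it. As a hedged alternative, the sketch below collects the whole input first and then walks it with match(). The function name extract_job is a hypothetical helper, and the approach is still regex-based, so comments, CDATA sections, nested job elements, or a different tag whose name merely starts with "job" will confuse it.

```shell
#!/bin/sh
# extract_job (hypothetical helper): print the text of every <job> element on stdin.
extract_job() {
    awk '
        { doc = doc $0 "\n" }                      # collect the whole input
        END {
            # Find each <job ...>text</job>, even when it spans lines
            while (match(doc, /<job[^>]*>[^<]*<\/job>/)) {
                elem = substr(doc, RSTART, RLENGTH)
                sub(/^<job[^>]*>/, "", elem)       # drop the opening tag
                sub(/<\/job>$/, "", elem)          # drop the closing tag
                gsub(/^[ \t\n]+|[ \t\n]+$/, "", elem)  # trim whitespace
                print elem
                doc = substr(doc, RSTART + RLENGTH)
            }
        }'
}

printf '%s\n' 'test' \
    '<job xmlns="http://www.sample.com/">programming</job>' \
    '<job xmlns="http://www.sample.com/">' \
    'programming</job>' | extract_job
```

Trimming the captured text also removes the blank lines that appear in the output of the RS-based version.
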
音栖息无 2024-08-27 20:06:25

Using the sed command:

Example:

$ cat file.xml
<note>
        <to>Tove</to>
                <from>Jani</from>
                <heading>Reminder</heading>
        <body>Don't forget me this weekend!</body>
</note>

$ cat file.xml | sed -ne '/<heading>/s#\s*<[^>]*>\s*##gp'
Reminder

Explanation:

cat file.xml | sed -ne '/<pattern_to_find>/s#\s*<[^>]*>\s*##gp'

-n - suppresses the automatic printing of every line
-e - the next argument is the script

/<pattern_to_find>/ - selects lines that contain the specified pattern, e.g. <heading>

Next is the substitution s///p, which removes everything except the desired value and prints the result; # is used as the delimiter instead of / for better readability:

s#\s*<[^>]*>\s*##gp
\s* - matches surrounding whitespace, if any (same at the end); \s is a GNU sed extension
<[^>]*> - matches an <xml_tag>; it stands in for the non-greedy regex <.*?>, which sed does not support
g - replaces every match on the line, so the closing </xml_tag> is removed too
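
The same idea can be wrapped in a small function so that the tag name becomes a parameter. This is only a sketch: tag_text is a hypothetical helper name, and [[:space:]] is swapped in for \s so that the pattern does not depend on GNU sed. Like the one-liner above, it only handles elements that open and close on one line and have no attributes.

```shell
#!/bin/sh
# tag_text TAG (hypothetical helper): print the text of <TAG>...</TAG>
# elements that open and close on the same line. TAG is pasted into the
# sed pattern as-is, so it must not contain regex metacharacters or '#'.
tag_text() {
    sed -n "/<$1>/s#[[:space:]]*<[^>]*>[[:space:]]*##gp"
}

tag_text heading <<'EOF'
<note>
        <to>Tove</to>
                <from>Jani</from>
                <heading>Reminder</heading>
        <body>Don't forget me this weekend!</body>
</note>
EOF
```

Run as written, this prints Reminder; calling tag_text body on the same input would print the body text instead.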

逐鹿 2024-08-27 20:06:25

Assuming the element sits on a single line, with input from stdin:

sed -ne '/<\/job>/ { s/<[^>]*>\(.*\)<\/job>/\1/; p }'

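Notes: -n stops sed from printing every line automatically; -e says the next argument is the script itself (as opposed to a script file given with -f); /<\/job>/ acts like a grep, selecting lines that contain the closing tag; s strips the opening tag (with its attributes) and the closing tag; ; separates statements; p prints; and {} groups the two statements so that the address applies to both.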

甜是你 2024-08-27 20:06:25

How about:

cat a.xml | grep '<job' | cut -d '>' -f 2 | cut -d '<' -f 1
半夏半凉 2024-08-27 20:06:25

A bit late to the show.

xmlcutty cuts out nodes from XML:

$ cat file.xml
<?xml version="1.0" encoding="utf-8"?>
<job xmlns="http://www.sample.com/">programming</job>
<job xmlns="http://www.sample.com/">designing</job>
<job xmlns="http://www.sample.com/">managing</job>
<job xmlns="http://www.sample.com/">teaching</job>

The path argument names the path to the element you want to cut out. In this case, since we are not interested in the tags at all, we rename the tag to \n, so we get a nice list:

$ xmlcutty -path /job -rename '\n' file.xml
programming
designing
managing
teaching

Note that the XML was not well-formed to begin with (it has no single root element). xmlcutty can work with slightly broken XML, too.

蹲墙角沉默 2024-08-27 20:06:25

yourxmlfile.xml

<item> 
  <title>15:54:57 - George:</title>
  <description>Diane DeConn? You saw Diane DeConn!</description> 
</item> 
<item> 
  <title>15:55:17 - Jerry:</title> 
  <description>Something huh?</description>
</item>

grep 'title' yourxmlfile.xml

  <title>15:54:57 - George:</title>
  <title>15:55:17 - Jerry:</title>

grep 'title' yourxmlfile.xml | awk -F">" '{print $2}'

  15:54:57 - George:</title
  15:55:17 - Jerry:</title

grep 'title' yourxmlfile.xml | awk -F">" '{print $2}' | awk -F"<" '{print $1}'

  15:54:57 - George:
  15:55:17 - Jerry:
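
The three-stage pipeline above can also be collapsed into a single awk call by treating both < and > as field separators. This is a sketch under the same assumption as the original (each title element sits on one line and has no attributes); title_text is a hypothetical helper name.

```shell
#!/bin/sh
# title_text (hypothetical helper): print the text of one-line <title> elements.
title_text() {
    # With < and > as field separators, "  <title>text</title>" splits into
    # $1 = indentation, $2 = "title", $3 = text, $4 = "/title".
    awk -F'[<>]' '$2 == "title" { print $3 }'
}

title_text <<'EOF'
<item>
  <title>15:54:57 - George:</title>
  <description>Diane DeConn? You saw Diane DeConn!</description>
</item>
<item>
  <title>15:55:17 - Jerry:</title>
  <description>Something huh?</description>
</item>
EOF
```

Comparing $2 with the tag name, rather than grepping for the word title, avoids picking up lines that merely mention "title" in their text.
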
つ低調成傷 2024-08-27 20:06:25

Use xml2 to use line-oriented tools with XML

Example:

xml2 <foo.xml | sed -n 's#.*/job=##p'

Output:

programming

Where to get xml2

The xml2 command often can be installed using your system's package manager (for example, apt install xml2). It can also be downloaded from https://github.com/cryptorick/xml2.

xml2 Documentation

Why xml2 is needed

Naive use of grep, sed, and awk is brittle. Consider the following XML file which would break such solutions:

<?xml version="1.0" encoding="utf-8"?>
<root>
  <job xmlns=
       "http://www.people.com/"
       val1="fish" val2="carrot"
       val3="narf"
       >teaching<!-- A comment about the </job> tag --></job>
</root>

Why not xml_grep and friends

Most of the robust answers to this question suggest using tools, such as xml_grep, which search using the XPath syntax. XPath is designed especially for searching XML documents and is a fine solution if you already know XPath or don't know anything else.

However, if you just need to search XML files and know the standard UNIX tools, it may not be worth your time to learn XPath which has limited utility beyond XML. Fortunately, xml2 provides an easy way to leverage the power of UNIX and regular expressions by converting the XML syntax to a "flat file" format in which each record is on a single line.

Example xml2 output

For example, running xml2 < foo.xml on the following file:

<?xml version="1.0" encoding="utf-8"?>
<root>
  <job xmlns="http://www.sample.com/">programming</job>
  <job xmlns="http://www.supple.com/">designing</job>
  <job xmlns="http://www.simple.com/">managing</job>
  <job xmlns=
       "http://www.people.com/"
       val1="fish" val2="carrot"
       val3="narf"
       >teaching<!-- A comment about the </job> tag --></job>
</root>

would output the following text file:

/root/job/@xmlns=http://www.sample.com/
/root/job=programming
/root/job
/root/job/@xmlns=http://www.supple.com/
/root/job=designing
/root/job
/root/job/@xmlns=http://www.simple.com/
/root/job=managing
/root/job
/root/job/@xmlns=http://www.people.com/
/root/job/@val1=fish
/root/job/@val2=carrot
/root/job/@val3=narf
/root/job=teaching
/root/job/!= A comment about the </job> tag 

As you can see, the peculiarities of the XML file have been normalized, and the output can be easily parsed by grep, sed, or awk. In particular, the command xml2 <foo.xml | sed -n 's#.*/job=##p' outputs:

programming
designing
managing
teaching

Side note: I added a <root> node to make the file valid XML although xml2 works fine either way.
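
If xml2 is not at hand, the filtering step can still be tried on its own. The sketch below feeds a few of the flat records shown above straight into the same sed expression; job_values is a hypothetical helper name. One assumption worth stating: the pattern .*/job= would also fire on a text value that happens to contain "/job=", so anchoring it (for example ^/root/job=) is safer when the full path is known.

```shell
#!/bin/sh
# job_values (hypothetical helper): read xml2-style flat records on stdin
# and print the text of every element named job.
job_values() {
    sed -n 's#.*/job=##p'
}

# A few records copied from the xml2 output shown above
job_values <<'EOF'
/root/job/@xmlns=http://www.sample.com/
/root/job=programming
/root/job
/root/job/@xmlns=http://www.people.com/
/root/job/@val1=fish
/root/job=teaching
/root/job/!= A comment about the </job> tag 
EOF
```

Attribute lines and the comment line are skipped, because none of them contains the literal text /job= that marks an element's value.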

Limitations

While xml2 is very handy for search and replace, if you are going to be doing a lot of work with XML, you'll probably want to learn XPath and XSLT which can perform more powerful hierarchical transformations.

莫多说 2024-08-27 20:06:24

Do you really have to use only those tools? They're not designed for XML processing, and although it's possible to get something that works OK most of the time, it will fail on edge cases, like encoding, line breaks, etc.

I recommend xml_grep:

xml_grep 'job' jobs.xml --text_only

Which gives the output:

programming

On Ubuntu/Debian, xml_grep is in the xml-twig-tools package.

凉风有信 2024-08-27 20:06:24
 grep '<job' file_name | cut -f2 -d">"|cut -f1 -d"<"
嘿嘿嘿 2024-08-27 20:06:24

Using xmlstarlet:

echo '<job xmlns="http://www.sample.com/">programming</job>' | \
   xmlstarlet sel -N var="http://www.sample.com/" -t -m "//var:job" -v '.'