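Extracting data from a simple XML file
I have an XML file with the following content:
<?xml version="1.0" encoding="utf-8"?>
<job xmlns="http://www.sample.com/">programming</job>
I need a way to extract the content of the <job> tag, which is programming in this case. This should be done at the Linux command prompt using grep/sed/awk.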
Please don't use line and regex based parsing on XML. It is a bad idea. You can have semantically identical XML with different formatting, and regex and line based parsing simply cannot cope with it.
Things like unary tags and variable line wrapping - these snippets 'say' the same thing:
Hopefully this makes it clear why making a regex/line based parser is difficult? Fortunately, you don't need to. Many scripting languages have at least one, sometimes more parser options.
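To make the fragility concrete, here is a small sketch. The two sample documents are my own illustration (not taken from the answer): both carry the same job element, but only the first keeps it on one line, and the naive sed extractor is an assumed stand-in for a typical line-based attempt.

```shell
# Two semantically identical documents (illustrative samples).
one_line='<job xmlns="http://www.sample.com/">programming</job>'
wrapped='<job
  xmlns="http://www.sample.com/">programming</job>'

# A naive extractor: assumes the open tag, the text, and the close tag
# all sit on the same line.
extract() {
    sed -n 's#.*<job[^>]*>\(.*\)</job>.*#\1#p'
}

from_one_line=$(printf '%s\n' "$one_line" | extract)
from_wrapped=$(printf '%s\n' "$wrapped" | extract)

echo "one line: [$from_one_line]"
echo "wrapped:  [$from_wrapped]"
```

The first call yields programming; the second yields nothing, because no single line contains both the opening tag and the closing tag.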
As a previous poster has alluded to, xml_grep is available. It is a tool built on the XML::Twig Perl library. What it does is use 'XPath expressions' to find something, and it differentiates between document structure, attributes and 'content'. E.g.:
However, in the interest of making better answers, here are a couple of 'roll your own' examples based on your source data:
First way:
Use twig handlers, which catch elements of a particular type and act on them. The advantage of doing it this way is that it parses the XML 'as you go' and lets you modify it in flight if you need to. This is particularly useful for discarding 'processed' XML when you're working with large files, using purge or flush:
This uses <> to take input (piped in, or specified on the command line as ./myscript somefile.xml) and processes it: for each job element, it extracts and prints any associated text. (You might want print $_ -> text,"\n" to insert a linefeed.)
Because it's matching on 'job' elements, it'll also match on nested job elements:
That will match twice, and print some of the output twice too. You can, however, match on /job instead if you prefer. Usefully, this lets you e.g. print and delete an element, or copy and paste one, modifying the XML structure.
Alternatively, parse first, and 'print' based on structure:
As job is your root element, all we need to do is print its text.
But we can be a bit more discerning, and look for job or /job and print that specifically instead:
You can use XML::Twig's pretty_print option to reformat your XML too:
There's a variety of output format options, but for simpler XML (like yours) most will look pretty similar.
Just use awk; no other external tools are needed. The approach below works even if the desired tag spans multiple lines.
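A sketch of one way to do this with plain POSIX awk, assuming a single job element whose text contains no nested tags. The sample input and the variable names are illustrative, not taken from the answer.

```shell
# Sample input with the job element spread over several lines (illustrative).
xml='<?xml version="1.0" encoding="utf-8"?>
<job xmlns="http://www.sample.com/">
programming
</job>'

# Join all lines, then cut out whatever sits between <job ...> and </job>.
result=$(printf '%s\n' "$xml" | awk '
    { doc = doc $0 " " }                       # accumulate the input as one string
    END {
        if (match(doc, /<job[^>]*>/)) {
            rest = substr(doc, RSTART + RLENGTH)   # text after the open tag
            stop = index(rest, "</job>")           # position of the close tag
            if (stop > 0) {
                text = substr(rest, 1, stop - 1)
                gsub(/^[ \t]+|[ \t]+$/, "", text)  # trim surrounding blanks
                print text
            }
        }
    }')

echo "$result"
```

Only the first job element is handled here; comments, CDATA sections, or a literal </job> inside the text would still defeat it.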
Using the sed command:
Example:
Explanation:
cat file.xml | sed -ne '/<pattern_to_find>/s#\s*<[^>]*>\s*##gp'
n - suppresses printing of all lines
e - script
/<pattern_to_find>/ - finds lines that contain the specified pattern, which could be e.g. <heading>
Next is the substitution part s///p, which removes everything except the desired value, where / is replaced with # for better readability:
s#\s*<[^>]*>\s*##gp
\s* - includes whitespace if it exists (same at the end)
<[^>]*> represents <xml_tag>; it is the non-greedy regex alternative, because <.*?> does not work in sed
g - substitutes every match, e.g. the closing </xml_tag> tag
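A minimal sketch of that command applied to the file from the question. Two assumptions of mine: /<job/ stands in for <pattern_to_find>, and [[:space:]] replaces \s so that the command is not tied to GNU sed.

```shell
# The file from the question, inlined so the sketch is self-contained.
xml='<?xml version="1.0" encoding="utf-8"?>
<job xmlns="http://www.sample.com/">programming</job>'

# /<job/ selects lines containing the tag; the s command then deletes every
# <...> tag (plus surrounding whitespace) and p prints what is left.
result=$(printf '%s\n' "$xml" |
    sed -n '/<job/s#[[:space:]]*<[^>]*>[[:space:]]*##gp')

echo "$result"
```

This still relies on the opening tag, the text, and the closing tag sharing one line.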
Assuming same line, input from stdin:
Notes:
-n stops it outputting everything automatically;
-e means it's a one-liner (as opposed to a script);
/<\/job> acts like a grep;
s strips the open tag + attributes and the end tag;
; is a new statement;
p prints;
{} makes the grep apply to both statements, as one.
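A minimal sketch consistent with the notes above, using the file from the question. The exact substitution pattern is my assumption; only the overall shape (address, braces, s, then p) comes from the notes.

```shell
# The file from the question, inlined so the sketch is self-contained.
xml='<?xml version="1.0" encoding="utf-8"?>
<job xmlns="http://www.sample.com/">programming</job>'

# /<\/job>/ picks lines holding the end tag; inside the braces, s removes
# every <...> tag on that line and p prints the remainder.
result=$(printf '%s\n' "$xml" | sed -n -e '/<\/job>/ { s#<[^>]*>##g; p; }')

echo "$result"
```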
How about:
A bit late to the show.
xmlcutty cuts out nodes from XML:
The path argument names the path to the element you want to cut out. In this case, since we are not interested in the tags at all, we rename the tag to \n, so we get a nice list:
Note that the XML was not valid to begin with (no root element). xmlcutty can work with slightly broken XML, too.
yourxmlfile.xml
grep 'title' yourxmlfile.xml
grep 'title' yourxmlfile.xml | awk -F">" '{print $2}'
grep 'title' yourxmlfile.xml | awk -F">" '{print $2}' | awk -F"<" '{print $1}'
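To see what each stage of that pipeline does, here is a sketch on a hypothetical sample (the item/title/description elements and their text are made up for illustration). It reads from stdin rather than from yourxmlfile.xml so that it is self-contained.

```shell
# Hypothetical sample data; element names and text are only for illustration.
xml='<item>
  <title>first entry</title>
  <description>some text</description>
</item>
<item>
  <title>second entry</title>
</item>'

# Step 1: keep lines mentioning "title".
# Step 2: split on ">" and keep field 2, e.g. "first entry</title".
# Step 3: split on "<" and keep field 1, e.g. "first entry".
result=$(printf '%s\n' "$xml" |
    grep 'title' |
    awk -F">" '{print $2}' |
    awk -F"<" '{print $1}')

echo "$result"
```

Like the other line-based approaches, this assumes each title element sits on its own line.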
Use xml2 to use line-oriented tools with XML
Example:
Output:
Where to get xml2
The xml2 command can often be installed using your system's package manager (for example, apt install xml2). It can also be downloaded from https://github.com/cryptorick/xml2.
xml2 Documentation
Why xml2 is needed
Naive use of grep, sed, and awk is brittle. Consider the following XML file which would break such solutions:
Why not xml_grep and friends
Most of the robust answers to this question suggest using tools, such as xml_grep, which search using the XPath syntax. XPath is designed especially for searching XML documents and is a fine solution if you already know XPath or don't know anything else.
However, if you just need to search XML files and know the standard UNIX tools, it may not be worth your time to learn XPath, which has limited utility beyond XML. Fortunately, xml2 provides an easy way to leverage the power of UNIX and regular expressions by converting the XML syntax to a "flat file" format in which each record is on a single line.
Example xml2 output
For example, running xml2 < foo.xml on the following file:
would output the following text file:
As you can see, the peculiarities of the XML file have been normalized, and the output can be easily parsed by grep, sed, or awk. In particular, the command
xml2 <foo.xml | sed -n 's#.*/job=##p'
outputs:
Side note: I added a <root> node to make the file valid XML, although xml2 works fine either way.
Limitations
While xml2 is very handy for search and replace, if you are going to be doing a lot of work with XML, you'll probably want to learn XPath and XSLT, which can perform more powerful hierarchical transformations.
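The flat-file idea described above can be sketched without xml2 installed. The two records below are written by hand and are only an assumption about what xml2 would emit for the sample file wrapped in a <root> node; the sed filter is the one quoted in this answer.

```shell
# Assumed xml2-style flat output, one path=value record per line.
# Hand-written for illustration, not captured from xml2 itself.
flat='/root/job/@xmlns=http://www.sample.com/
/root/job=programming'

# Drop everything up to "/job=" and print only the lines where that
# substitution actually happened.
result=$(printf '%s\n' "$flat" | sed -n 's#.*/job=##p')

echo "$result"
```

Because every record already sits on one line, the filter no longer depends on how the original XML was wrapped.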
Do you really have to use only those tools? They're not designed for XML processing, and although it's possible to get something that works OK most of the time, it will fail on edge cases, like encoding, line breaks, etc.
I recommend xml_grep:
Which gives the output:
On Ubuntu/Debian, xml_grep is in the xml-twig-tools package.
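A sketch of how such an invocation might look for the file from the question. I am assuming the --text_only option and a bare job condition are what is wanted here; the temporary file and the fallback message are my own additions.

```shell
# The file from the question, written to a temporary location.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<job xmlns="http://www.sample.com/">programming</job>
EOF

if command -v xml_grep >/dev/null 2>&1; then
    # --text_only prints the text of each match instead of the XML element.
    result=$(xml_grep --text_only 'job' "$tmp")
    echo "$result"
else
    echo "xml_grep not found (on Ubuntu/Debian: package xml-twig-tools)" >&2
fi
rm -f "$tmp"
```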
Using xmlstarlet:
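A sketch of one possible invocation for the file from the question. Since the job element lives in the http://www.sample.com/ namespace, I bind that URI to a prefix with -N; the prefix name s, the temporary file, and the fallback message are my own choices.

```shell
# The file from the question, written to a temporary location.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<job xmlns="http://www.sample.com/">programming</job>
EOF

if command -v xmlstarlet >/dev/null 2>&1; then
    # sel = select; -N binds prefix s to the namespace; -t -v prints the
    # string value of the node matched by the XPath expression.
    result=$(xmlstarlet sel -N s="http://www.sample.com/" -t -v '/s:job' "$tmp")
    echo "$result"
else
    echo "xmlstarlet not found on this system" >&2
fi
rm -f "$tmp"
```

On some systems the binary is installed under the name xml rather than xmlstarlet.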