How to parse XML to CSV where the data is only in attributes
The XML file I am trying to parse has all the data contained in attributes. I found how to build the string to insert into the text file.
I have this XML file:
<ig:prescribed_item class_ref="0161-1#01-765557#1">
<ig:prescribed_property property_ref="0161-1#02-016058#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
<ig:prescribed_property property_ref="0161-1#02-016059#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
</ig:prescribed_item>
</ig:identification_guide>
And I want to parse it into a text file like this with the class ref duplicated for each property:
class_ref|property_ref|is_required|UOM_ref
0161-1#01-765557#1|0161-1#02-016058#1|false|0161-1#05-003260#1
0161-1#01-765557#1|0161-1#02-016059#1|false|0161-1#05-003260#1
This is the code I have so far:
require 'nokogiri'

doc = Nokogiri::XML(File.open("file.xml"), 'UTF-8') do |config|
  config.strict
end

content = doc.xpath("//ig:prescribed_item/@class_ref").map { |i|
  i.search("//ig:prescribed_item/ig:prescribed_property/@property_ref").map { |d| d.text }
}
puts content.inspect
content.each do |c|
  puts c.join('|')
end
Comments (2)
I'd simplify it a bit using CSS accessors:
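Which outputs:
class_ref|property_ref|is_required|UOM_ref
0161-1#01-765557#1|0161-1#02-016058#1|false|0161-1#05-003260#1
0161-1#01-765557#1|0161-1#02-016059#1|false|0161-1#05-003260#1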
In Nokogiri, there are two types of "find a node" methods: the "search" methods return all nodes that match a particular accessor as a `NodeSet`, and the "at" methods return the first `Node` of the `NodeSet`, which will be the first encountered Node that matched the accessor.

The "search" methods are things like `search`, `css`, `xpath` and `/`. The "at" methods are things like `at`, `at_css`, `at_xpath` and `%`. Both `search` and `at` accept either XPath or CSS accessors.

Back to `pp.at_css('|prescribed_unit_of_measure')['UOM_ref']`: at that point in the code `pp` is a local variable containing a "prescribed_property" Node. So, I'm telling the code to find the first node under `pp` that matches the CSS `|prescribed_unit_of_measure` accessor, in other words the first `<dt:prescribed_unit_of_measure>` tag contained by the `pp` node. When Nokogiri finds that node, it returns the value of the `UOM_ref` attribute of the node.

As an FYI, the `/` and `%` operators are aliased to `search` and `at` respectively in Nokogiri. They're part of its "Hpricot" compatibility; we used to use them a lot when Hpricot was the XML/HTML parser of choice, but they're not idiomatic for most Nokogiri developers. I suspect it's to avoid confusion with the regular use of the operators, at least it is in my case.

Also, Nokogiri's CSS accessors have some extra-special juiciness: they support namespaces, like the XPath accessors do, only they use `|`. Nokogiri will let us ignore the namespaces, which is what I did. You'll want to nose around in the Nokogiri docs for CSS and namespaces for more information.
There are definitely ways of parsing based on attributes.
The Engine Yard article "Getting started with Nokogiri" has a full description.
But quickly, the examples they give are: