How to parse XML to CSV when the data is only in attributes

Posted 2024-11-16 15:25:28

The XML file I am trying to parse keeps all of its data in attributes. I have figured out how to build the string to insert into the text file.

I have this XML file:

<ig:prescribed_item class_ref="0161-1#01-765557#1">
  <ig:prescribed_property property_ref="0161-1#02-016058#1" is_required="false" combination_allowed="false" one_of_allowed="false">
    <dt:measure_number_type representation_ref="0161-1#04-000005#1">
      <dt:real_type>
        <dt:real_format pattern="\d(1,)\.\d(1,)"/>
      </dt:real_type>
      <dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
    </dt:measure_number_type>
  </ig:prescribed_property>
  <ig:prescribed_property property_ref="0161-1#02-016059#1" is_required="false" combination_allowed="false" one_of_allowed="false">
    <dt:measure_number_type representation_ref="0161-1#04-000005#1">
      <dt:real_type>
        <dt:real_format pattern="\d(1,)\.\d(1,)"/>
      </dt:real_type>
      <dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
    </dt:measure_number_type>
  </ig:prescribed_property>
</ig:prescribed_item>
  </ig:identification_guide>

I want to parse it into a text file like this, with the class_ref repeated for each property:

class_ref|property_ref|is_required|UOM_ref
0161-1#01-765557#1|0161-1#02-016058#1|false|0161-1#05-003260#1
0161-1#01-765557#1|0161-1#02-016059#1|false|0161-1#05-003260#1

This is the code I have so far:

require 'nokogiri'

doc = Nokogiri::XML(File.open("file.xml"), 'UTF-8') do |config|
  config.strict
end

content = doc.xpath("//ig:prescribed_item/@class_ref").map {|i|
  i.search("//ig:prescribed_item/ig:prescribed_property/@property_ref").map { |d| d.text }
}

puts content.inspect

content.each do |c|
  puts c.join('|')
end

森罗 2024-11-23 15:25:28

I'd simplify it a bit using CSS accessors:

# Single-quoted heredoc so the backslashes in the pattern attributes are kept verbatim.
xml = <<'EOT'
<ig:prescribed_item class_ref="0161-1#01-765557#1">
    <ig:prescribed_property property_ref="0161-1#02-016058#1" is_required="false" combination_allowed="false" one_of_allowed="false">
        <dt:measure_number_type representation_ref="0161-1#04-000005#1">
            <dt:real_type>
                <dt:real_format pattern="\d(1,)\.\d(1,)"/>
            </dt:real_type>
            <dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
        </dt:measure_number_type>
    </ig:prescribed_property>
    <ig:prescribed_property property_ref="0161-1#02-016059#1" is_required="false" combination_allowed="false" one_of_allowed="false">
        <dt:measure_number_type representation_ref="0161-1#04-000005#1">
            <dt:real_type>
                <dt:real_format pattern="\d(1,)\.\d(1,)"/>
            </dt:real_type>
            <dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
        </dt:measure_number_type>
    </ig:prescribed_property>
</ig:prescribed_item>
</ig:identification_guide>
EOT

require 'nokogiri'

doc = Nokogiri::XML(xml)

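# First row is the header; data rows are appended below.
data = [ %w[ class_ref property_ref is_required UOM_ref] ]

# Walk each prescribed_item, then each prescribed_property inside it,
# so the item's class_ref is repeated on every property row.
doc.css('|prescribed_item').each do |pi|
  pi.css('|prescribed_property').each do |pp|
    data << [
      pi['class_ref'],
      pp['property_ref'],
      pp['is_required'],
      # First prescribed_unit_of_measure under this property.
      pp.at_css('|prescribed_unit_of_measure')['UOM_ref']
    ]
  end
end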

puts data.map{ |row| row.join('|') }

Which outputs:

class_ref|property_ref|is_required|UOM_ref
0161-1#01-765557#1|0161-1#02-016058#1|false|0161-1#05-003260#1
0161-1#01-765557#1|0161-1#02-016059#1|false|0161-1#05-003260#1
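
The question asks for a text file rather than console output. As a minimal sketch, assuming rows shaped like the data array above, Ruby's csv library can write them with | as the column separator. The write_pipe_delimited helper and the output.txt file name are hypothetical, chosen only for illustration:

```ruby
require 'csv'

# Rows shaped like the `data` array above (values copied from the sample output).
ROWS = [
  %w[class_ref property_ref is_required UOM_ref],
  ['0161-1#01-765557#1', '0161-1#02-016058#1', 'false', '0161-1#05-003260#1'],
  ['0161-1#01-765557#1', '0161-1#02-016059#1', 'false', '0161-1#05-003260#1']
]

# Hypothetical helper: writes one pipe-delimited line per row to `path`.
def write_pipe_delimited(path, rows)
  CSV.open(path, 'w', col_sep: '|') do |csv|
    rows.each { |row| csv << row }
  end
end

write_pipe_delimited('output.txt', ROWS)
puts File.read('output.txt')
```

Compared with row.join('|'), CSV also quotes any field that happens to contain the separator, which may or may not matter for this data.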

Could you explain this line in greater detail: "pp.at_css('|prescribed_unit_of_measure')['UOM_ref']"?

In Nokogiri, there are two types of "find a node" methods. The "search" methods return all nodes that match a particular accessor as a NodeSet; the "at" methods return only the first Node of that NodeSet, that is, the first node encountered that matches the accessor, or nil if nothing matches.

The "search" methods are things like search, css, xpath and /. The "at" methods are things like at, at_css, at_xpath and %. Both search and at accept either XPath or CSS accessors.

Back to pp.at_css('|prescribed_unit_of_measure')['UOM_ref']: at that point in the code, pp is a local variable holding a "prescribed_property" Node. So I'm telling the code to find the first node under pp that matches the CSS accessor |prescribed_unit_of_measure, in other words the first <dt:prescribed_unit_of_measure> tag contained by the pp node. Once Nokogiri finds that node, ['UOM_ref'] returns the value of its UOM_ref attribute.

As an FYI, the / and % operators are aliases for search and at, respectively, in Nokogiri. They're part of its "Hpricot" compatibility; we used to use them a lot when Hpricot was the XML/HTML parser of choice, but they're not idiomatic for most Nokogiri developers. I suspect that's to avoid confusion with the regular use of the operators; at least it is in my case.

Also, Nokogiri's CSS accessors have some extra-special juiciness: they support namespaces, like the XPath accessors do, only they use | as the separator between prefix and element name. Nokogiri will let us ignore the namespaces, which is what I did. You'll want to nose around in the Nokogiri docs for CSS and namespaces for more information.
