How to parse XML in Google Refine using Jython/Python ElementTree

Posted on 2024-12-21 10:09:45

I'm trying to parse some XML in Google Refine using Jython and ElementTree, but I'm struggling to find any documentation to help me get this working (probably not helped by my not being a Python coder).

Here's an extract of the XML I'm trying to parse. I'm trying to return a joined string of all the dc:identifier values:

<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:creator>J. Koenig</dc:creator>
  <dc:date>2010-01-13T15:47:38Z</dc:date>
  <dc:date>2010-01-13T15:47:38Z</dc:date>
  <dc:date>2010-01-13T15:47:38Z</dc:date>
  <dc:identifier>CCTL0059</dc:identifier>
  <dc:identifier>CCTL0059</dc:identifier>
  <dc:identifier>http://open.jorum.ac.uk:80/xmlui/handle/123456789/335</dc:identifier>
  <dc:format>application/pdf</dc:format>
</oai_dc:dc>

Here's the code I've got so far. This is just a test to return anything at all, as right now all I'm getting is 'Error: null':

from elementtree import ElementTree as ET
element = ET.parse(value)

namespace = "{http://www.openarchives.org/OAI/2.0/oai_dc/}"
e = element.findall('{0}identifier'.format(namespace))
for i in e:
   count += 1
return count

Comments (3)

不如归去 2024-12-28 10:09:45

You can use a GREL expression like this; try it:

forEach(value.parseHtml().select("dc|identifier"),v,v.htmlText()).join(",")

For each identifier found, it takes the htmlText and joins them all with commas.
parseHtml() uses the Jsoup library and really just parses tags and structure. It also understands namespaced selectors in the form ns|identifier, which makes it a nice way to get what you're after in this case.
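
For comparison, the same prefix-based selection can be sketched in Python's ElementTree by passing a prefix-to-URI map to findall(). This is a minimal sketch that assumes a recent CPython; the older ElementTree bundled with Jython 2.5 may not accept the namespaces argument. The function name identifiers_by_prefix and the trimmed XML constant are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the record from the question (assumption: same structure).
XML = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:creator>J. Koenig</dc:creator>
  <dc:identifier>CCTL0059</dc:identifier>
  <dc:identifier>CCTL0059</dc:identifier>
  <dc:identifier>http://open.jorum.ac.uk:80/xmlui/handle/123456789/335</dc:identifier>
</oai_dc:dc>"""

# Prefix-to-URI map: the prefix is our choice, the URI must match the document.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def identifiers_by_prefix(xml_text):
    """Hypothetical helper: text of every dc:identifier child of the root."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall("dc:identifier", NS)]


print(",".join(identifiers_by_prefix(XML)))
```

Only the URI has to match the document; the "dc" prefix in the map is a local alias and could be any name.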

披肩女神 2024-12-28 10:09:45

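You've used the wrong namespace: the dc:identifier elements belong to the dc namespace (http://purl.org/dc/elements/1.1/), not the oai_dc one. This works on Jython 2.5.1: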

from xml.etree import ElementTree as ET
element = ET.fromstring(value) # `value` is a string with the xml from question

namespace = "{http://purl.org/dc/elements/1.1/}"
for e in element.getiterator(namespace+'identifier'):
    print e.text

Output

CCTL0059
CCTL0059
http://open.jorum.ac.uk:80/xmlui/handle/123456789/335
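
To see why the namespace matters, it may help to print the tag names ElementTree actually stores. The sketch below assumes a recent CPython rather than Jython, so it uses iter() (getiterator() was removed in Python 3.9); on Jython 2.5.1 keep getiterator() as above. The constants OAI_DC and DC are names chosen here for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the record from the question (assumption: same structure).
XML = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>CCTL0059</dc:identifier>
  <dc:identifier>CCTL0059</dc:identifier>
  <dc:identifier>http://open.jorum.ac.uk:80/xmlui/handle/123456789/335</dc:identifier>
</oai_dc:dc>"""

OAI_DC = "{http://www.openarchives.org/OAI/2.0/oai_dc/}"
DC = "{http://purl.org/dc/elements/1.1/}"

root = ET.fromstring(XML)

# ElementTree stores each tag as "{namespace-uri}localname".
# Only the root element lives in the oai_dc namespace.
print(root.tag)

# Searching with the oai_dc URI matches nothing; the dc URI matches all three.
wrong = list(root.iter(OAI_DC + "identifier"))
right = list(root.iter(DC + "identifier"))
print(len(wrong), len(right))
```

The prefix written in the XML (dc:) never appears in the stored tag; only the URI it is bound to does, which is why the search string has to carry the right URI.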

尤怨 2024-12-28 10:09:45

Here's a slight tweak on J.F. Sebastian's version which can be pasted directly into Google Refine:

from xml.etree import ElementTree as ET
element = ET.fromstring(value)
namespace = "{http://purl.org/dc/elements/1.1/}"
return ','.join([e.text for e in element.getiterator(namespace+'identifier')])

It returns a comma-separated string, but you can change the delimiter used in the return statement.
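
One thing to watch for: an empty element such as <dc:identifier/> has e.text set to None, and str.join() raises a TypeError on None. Below is a minimal sketch of a variant that takes the delimiter as a parameter and skips empty identifiers. It assumes a recent CPython (hence iter() instead of getiterator()); the function name join_identifiers and the sample record are made up for illustration, and inside Google Refine the body would be pasted inline with a bare return.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"


def join_identifiers(xml_text, sep=","):
    """Hypothetical helper: join the non-empty dc:identifier values with sep."""
    root = ET.fromstring(xml_text)
    texts = [
        e.text.strip()
        for e in root.iter(DC + "identifier")
        if e.text and e.text.strip()  # skip <dc:identifier/> and blank values
    ]
    return sep.join(texts)


# Made-up sample with one empty identifier to exercise the filter.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>CCTL0059</dc:identifier>
  <dc:identifier/>
  <dc:identifier>http://open.jorum.ac.uk:80/xmlui/handle/123456789/335</dc:identifier>
</oai_dc:dc>"""

print(join_identifiers(SAMPLE, sep=" | "))
```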
