Using lxml to extract data when not all elements are known in advance

Posted 2024-10-03 01:45:46 | 638 words | 2 views | 0 comments


I have some SGML files that are roughly standardized. However, a file can contain tags that I do not know exist until I open it and read it myself. For example, the files have addresses, and an address generally has a street, a city, a state, a zip and a phone. Each element of the address is marked with its own tag:

 <ADDRESS>
 <STREET>One Main Street
 <CITY>Gotham City
 <ZIP>99999 0123
 <PHONE>555-123-5467
 </ADDRESS>

But I have discovered, for example, that there are also tags for Country, STREET1 and STREET2. I have over 200K files to process, and I want to know whether it is possible to pull out all of the elements of the addresses without having to know in advance which tags exist.

What I have done so far is

from lxml.html import fromstring  # text_content() is provided by lxml.html elements

h = fromstring(my_data_in_a_string)
for each in h.cssselect('mail_address'):
    print(each.text_content())

but what I get is problematic because I can't identify where one element ends and the next begins

One Main StreetGotham City99999 0123555-123-5467


幸福丶如此 2024-10-10 01:45:46

To get all the tags, iterate over the whole document. Suppose your XML is structured like this:

<ADDRESS>
    <STREET>One Main Street</STREET>
    <CITY>Gotham City</CITY>
    <ZIP>99999 0123</ZIP>
    <PHONE>555-123-5467</PHONE>
</ADDRESS>

We parse it:

>>> from lxml import etree
>>> f = etree.parse('foo.xml')   # path to the XML file
>>> root = f.getroot()           # get the root element
>>> for element in root.iter():  # walk the root and all of its descendants
...     print(element.tag)       # print each tag name
...
ADDRESS
STREET
CITY
ZIP
PHONE

Now suppose your XML also has extra tags that you are not aware of. Because the loop visits every element in the document, the code above returns those tags as well.

<ADDRESS>
    <STREET>One Main Street</STREET>
    <STREET1>One Second Street</STREET1>
    <CITY>Gotham City</CITY>
    <ZIP>99999 0123</ZIP>
    <PHONE>555-123-5467</PHONE>
    <COUNTRY>USA</COUNTRY>
</ADDRESS>

The above code returns:

ADDRESS
STREET
STREET1
CITY
ZIP
PHONE
COUNTRY
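
Since the goal is to find out which tags exist across a large batch of files, the same kind of loop can feed a tally. The sketch below is one possible way to do that, assuming each record is well-formed XML held in a string. The names `samples` and `tally_tags` are hypothetical and used only for illustration; they are not part of lxml.

```python
from collections import Counter

from lxml import etree

# Two hypothetical address records; the second uses tags the first lacks.
samples = [
    "<ADDRESS><STREET>One Main Street</STREET><CITY>Gotham City</CITY></ADDRESS>",
    "<ADDRESS><STREET1>One Second Street</STREET1><CITY>Gotham City</CITY>"
    "<COUNTRY>USA</COUNTRY></ADDRESS>",
]


def tally_tags(documents):
    """Count how often each direct child tag of the root appears."""
    counts = Counter()
    for text in documents:
        root = etree.fromstring(text)
        # Iterating an element yields its direct children.
        counts.update(child.tag for child in root)
    return counts


tag_counts = tally_tags(samples)
for tag, count in sorted(tag_counts.items()):
    print(tag, count)
```

Rarely seen tags show up with low counts, which may help decide which fields deserve special handling before processing the full set of files.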

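If we want the text of each tag, the procedure is the same: print each element's text attribute instead of its tag. The output starts with blank space because the text of ADDRESS itself is only the whitespace before its first child.

>>> for element in root.iter():
...     print(element.text)
...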

One Main Street
One Second Street
Gotham City
99999 0123
555-123-5467
USA
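
The files in the question omit the closing tags on the individual fields, which a strict XML parser such as `etree.parse` rejects. One possible workaround, sketched below, is to parse with the lenient HTML parser in `lxml.html` (the module whose `fromstring` and `text_content` the question appears to use) and read each element's own `text`, so that every value stays attached to its tag. The helper name `address_fields` is hypothetical, and the sketch assumes each field's value directly follows its opening tag. The HTML parser typically reports tag names in lowercase.

```python
from lxml import html

# The address as it appears in the question: fields have no closing tags.
raw = """<ADDRESS>
<STREET>One Main Street
<CITY>Gotham City
<ZIP>99999 0123
<PHONE>555-123-5467
</ADDRESS>"""


def address_fields(markup):
    """Return (tag, text) pairs for every element that has text of its own."""
    root = html.fromstring(markup)
    fields = []
    for element in root.iter():
        # element.text is the text between this tag and the next tag.
        text = (element.text or "").strip()
        if text:
            fields.append((element.tag, text))
    return fields


fields = address_fields(raw)
print(" | ".join(f"{tag}: {text}" for tag, text in fields))
```

Unlike `text_content()`, which concatenates everything into one string, the pairs keep the boundary between one field and the next, and unknown tags such as STREET1 or COUNTRY are picked up without being named in the code.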
