Using lxml to pull out data when not all elements are known ahead of time
I have some SGML files that are roughly standardized. However, a file can contain tags that I do not know exist until I open it and read it myself. For example, the files have addresses, and an address generally has a street, a city, a state, a zip and a phone. Each element of the address is indicated with a tag:
<ADDRESS>
<STREET>One Main Street
<CITY>Gotham City
<ZIP>99999 0123
<PHONE>555-123-5467
</ADDRESS>
But I have discovered, for example, that there are also tags for Country, STREET1 and STREET2. I have over 200K files to process, and I want to know whether it is possible to pull out all of the elements of the addresses without knowing in advance which tags exist.
What I have done so far is:
from lxml.html import fromstring

h = fromstring(my_data_in_a_string)
for each in h.cssselect('mail_address'):
    print(each.text_content())
But what I get is problematic, because I can't identify where one element ends and the next begins:
One Main StreetGotham City99999 0123555-123-5467
Comments (1)
To get all the tags, we iterate through the document like this:
Suppose your XML structure is like this:
We parse it:
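The snippets that originally accompanied these steps were not preserved. A minimal sketch of all three, assuming the sample is the address record from the question with every tag closed (lxml.etree.fromstring() only accepts well-formed XML) and using the illustrative name SAMPLE_XML:

```python
from lxml import etree

# Assumed sample: the question's address record, with each tag closed so
# that the strict XML parser accepts it.
SAMPLE_XML = """<ADDRESS>
  <STREET>One Main Street</STREET>
  <CITY>Gotham City</CITY>
  <ZIP>99999 0123</ZIP>
  <PHONE>555-123-5467</PHONE>
</ADDRESS>"""

# Parse the string into an element tree and get its root element.
root = etree.fromstring(SAMPLE_XML)

# iter() yields the root and then every descendant, in document order,
# so no tag name has to be known in advance.
tags = [element.tag for element in root.iter()]
print(tags)
# ['ADDRESS', 'STREET', 'CITY', 'ZIP', 'PHONE']
```

Nothing in the loop names a specific tag, which is why it keeps working when a file contains tags you have never seen.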
Now suppose your XML has extra tags as well, tags you are not aware of. Since we are iterating through the whole XML tree, the above code will return those tags too.
The above code returns:
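The original listing and its output were not preserved here. As a stand-in, the sketch below runs the same kind of loop over two hypothetical records that use the extra tags mentioned in the question (STREET1, STREET2, COUNTRY). The values, the RECORDS list and the list_tags helper are made up for illustration.

```python
from collections import Counter
from lxml import etree

# Hypothetical records; the tag names come from the question, the values
# are invented for this example.
RECORDS = [
    """<ADDRESS>
  <STREET>One Main Street</STREET>
  <CITY>Gotham City</CITY>
</ADDRESS>""",
    """<ADDRESS>
  <STREET1>One Main Street</STREET1>
  <STREET2>Suite 100</STREET2>
  <CITY>Gotham City</CITY>
  <COUNTRY>USA</COUNTRY>
</ADDRESS>""",
]


def list_tags(xml_text):
    """Return every tag name in the document, in document order."""
    return [element.tag for element in etree.fromstring(xml_text).iter()]


print(list_tags(RECORDS[1]))
# ['ADDRESS', 'STREET1', 'STREET2', 'CITY', 'COUNTRY']

# Tally tag names across all records to discover which tags exist at all.
tag_counts = Counter(tag for record in RECORDS for tag in list_tags(record))
print(sorted(tag_counts))
# ['ADDRESS', 'CITY', 'COUNTRY', 'STREET', 'STREET1', 'STREET2']
```

Running a tally like this over a sample of the 200K files is one way to learn the full set of tags before deciding how to store them.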
Now if we want to get the text of the tags, the procedure is the same. Just print tag.text like this:
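The original snippet is missing here as well. The sketch below shows the tag.text idea applied directly to the unclosed markup from the question. Note the swap: it uses lxml.html.fromstring (the lenient HTML parser the question already calls) rather than the strict XML parser, because the question's tags are never closed. The address_fields helper and the RAW_SGML name are illustrative, and the sketch assumes the block is named ADDRESS as in the sample.

```python
from lxml import html

# The record from the question, unclosed tags and all.
RAW_SGML = """<ADDRESS>
<STREET>One Main Street
<CITY>Gotham City
<ZIP>99999 0123
<PHONE>555-123-5467
</ADDRESS>"""


def address_fields(sgml_text):
    """Return (tag, text) pairs for every tag inside each ADDRESS block."""
    root = html.fromstring(sgml_text)
    pairs = []
    # The HTML parser lowercases tag names, hence 'address'.
    for address in root.iter('address'):
        for element in address.iter():
            # Skip comments and other non-element nodes.
            if not isinstance(element.tag, str):
                continue
            # .text is the text that directly follows the opening tag.
            text = (element.text or '').strip()
            if text:
                pairs.append((element.tag, text))
    return pairs


for tag, text in address_fields(RAW_SGML):
    print(tag, '=', text)
```

Each value now comes back next to its own tag instead of being run together as with text_content(). Tag names are lowercased by the HTML parser, so STREET is reported as street. If a file mixes closed and unclosed tags, some text may end up in an element's .tail rather than its .text, which this sketch does not collect.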