The last element in the XML is not picked up
I have a Python 3 script below that is supposed to download an XML file and split it into smaller files of 500 items each. I am having two problems:
- the last item in the original XML is not present in the split files,
- if the original XML is exactly 1000 items long, the script creates a third, empty XML file.
Can anyone tell me which part of my code could cause these symptoms?
import urllib.request as urllib2
from lxml import etree


def _yield_str_from_net(url, car_tag):
    xml_file = urllib2.urlopen(url)
    for _, element in etree.iterparse(xml_file, tag=car_tag):
        yield etree.tostring(element, pretty_print=True).decode('utf-8')
        element.clear()


def split_xml(url, car_tag, save_as):
    output_file_num = 1
    net_file_iter = _yield_str_from_net(url, car_tag)
    while True:
        file_name = "%s%s.xml" % (save_as, output_file_num)
        print("Making %s" % file_name)
        with open(file_name, mode='w', encoding='utf-8') as the_file:
            for elem_count in range(500):  # want only 500 items
                try:
                    elem = next(net_file_iter)
                except StopIteration:
                    return
                the_file.write(elem)
                print("processing element #%s" % elem_count)
        output_file_num += 1


if __name__ == '__main__':
    split_xml("http://www.my_xml_url.com/",
              'my_tag',
              'my_file')
Comments (1)
The second one is not an error; it follows from the design. After reading 1000 elements, the iterator does not yet know that there are no further items, so the while True loop runs again and opens a new file before discovering that the iterator is exhausted. It would be great if iterators had a hasNext; then you could replace the loop condition with while hasNext to overcome this issue. Unfortunately, there is no such thing in Python.

For the first question: currently I can't see anything in your code that explains this issue.
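
Although Python iterators have no hasNext, one possible workaround for the empty third file is to fetch each batch of items before opening the output file, and to open a file only when the batch is non-empty. The following is a minimal sketch of that idea using itertools.islice, not a definitive fix. The helper names chunked and write_chunks are hypothetical and do not appear in the original script. The sketch holds up to 500 element strings in memory at a time, and it does not address the first problem (the missing last item).

```python
from itertools import islice


def chunked(iterable, size):
    """Yield lists of up to `size` items; an empty list is never yielded."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:  # source exhausted: stop before any file is opened
            return
        yield chunk


def write_chunks(items, save_as, size=500):
    """Write each chunk to its own numbered file and return the file names."""
    file_names = []
    for file_num, chunk in enumerate(chunked(items, size), start=1):
        file_name = "%s%s.xml" % (save_as, file_num)
        # The file is only opened once we know the chunk has content.
        with open(file_name, mode="w", encoding="utf-8") as the_file:
            the_file.writelines(chunk)
        file_names.append(file_name)
    return file_names
```

Under these assumptions, the body of split_xml could be reduced to a call such as write_chunks(_yield_str_from_net(url, car_tag), save_as), so that an input of exactly 1000 items produces two files rather than three.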