Parsing a huge XML file with Python's etree.iterparse() doesn't work properly. Is there a logic error in the code?
I want to parse a huge XML file. The records in this huge file look, for example, like this. In general, the file looks like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
record_1
...
record_n
</dblp>
I wrote some code that should get me a selection of records from this file.
When I let the code run (it takes nearly 50 minutes, including storage in the MySQL database), I notice that there is a record which seems to have nearly a million authors. This must be wrong. I even checked up on it by looking into the file to make sure it has no errors in it. The paper has only 5 or 6 authors, so all is fine with dblp.xml. So I assume a logic error in my code, but I can't figure out where it could be. Perhaps someone can tell me where the error is?
The code stops in the line "if len(auth) > 2000".
import sys

import MySQLdb
from lxml import etree

elements = ['article', 'inproceedings', 'proceedings', 'book', 'incollection']
tags = ["author", "title", "booktitle", "year", "journal"]

def fast_iter(context, cursor):
    mydict = {}  # represents a paper with all its tags.
    auth = []    # a list of authors who have written the paper "together".
    counter = 0  # counts the papers
    for event, elem in context:
        if elem.tag in elements and event == "start":
            mydict["element"] = elem.tag
            mydict["mdate"] = elem.get("mdate")
            mydict["key"] = elem.get("key")
        elif elem.tag == "title" and elem.text != None:
            mydict["title"] = elem.text
        elif elem.tag == "booktitle" and elem.text != None:
            mydict["booktitle"] = elem.text
        elif elem.tag == "year" and elem.text != None:
            mydict["year"] = elem.text
        elif elem.tag == "journal" and elem.text != None:
            mydict["journal"] = elem.text
        elif elem.tag == "author" and elem.text != None:
            auth.append(elem.text)
        elif event == "end" and elem.tag in elements:
            counter += 1
            print counter
            #populate_database(mydict, auth, cursor)
            mydict.clear()
            auth = []
            if mydict or auth:
                sys.exit("Program aborted because auth or mydict was not deleted properly!")
        if len(auth) > 200:  # There are up to ~150 authors per paper.
            sys.exit("auth: It seams there is a paper which has too many authors.!")
        if len(mydict) > 50:  # A paper can have much metadata.
            sys.exit("mydict: It seams there is a paper which has too many tags.")
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def main():
    cursor = connectToDatabase()
    cursor.execute("""SET NAMES utf8""")
    context = etree.iterparse(PATH_TO_XML, dtd_validation=True, events=("start", "end"))
    fast_iter(context, cursor)
    cursor.close()

if __name__ == '__main__':
    main()
EDIT:
I was totally misguided when I wrote this function. I made a huge mistake by overlooking that, while trying to skip some unwanted records, I messed up some wanted ones. At a certain point in the file, where I skipped nearly a million records in a row, the following wanted record got blown up.
With the help of John and Paul I managed to rewrite my code. It is parsing right now and seems to do it well. I'll report back if some unexpected errors remain unsolved. Otherwise, thank you all for your help! I really appreciate it!
def fast_iter2(context, cursor):
    elements = set([
        'article', 'inproceedings', 'proceedings', 'book', 'incollection',
        'phdthesis', "mastersthesis", "www"
    ])
    childElements = set(["title", "booktitle", "year", "journal", "ee"])
    paper = {}    # represents a paper with all its tags.
    authors = []  # a list of authors who have written the paper "together".
    paperCounter = 0
    for event, element in context:
        tag = element.tag
        if tag in childElements:
            if element.text:
                paper[tag] = element.text
                # print tag, paper[tag]
        elif tag == "author":
            if element.text:
                authors.append(element.text)
                # print "AUTHOR:", authors[-1]
        elif tag in elements:
            paper["element"] = tag
            paper["mdate"] = element.get("mdate")
            paper["dblpkey"] = element.get("key")
            # print tag, element.get("mdate"), element.get("key"), event
            if paper["element"] in ['phdthesis', "mastersthesis", "www"]:
                pass
            else:
                populate_database(paper, authors, cursor)
                paperCounter += 1
                print paperCounter
            paper = {}
            authors = []
            # if paperCounter == 100:
            #     break
            element.clear()
            while element.getprevious() is not None:
                del element.getparent()[0]
    del context
Please eliminate one source of confusion: you haven't actually said that the code you showed does actually trip over one of your "count of things > 2000" tests. If not, then the problem lies in the database-update code (which you haven't shown us).
If it does so trip over:

(1) Reduce the limits from 2000 to reasonable values (about 20 for auth, and exactly 7 for mydict).

(2) When the trip happens, do print repr(mydict); print; print repr(auth) and analyse the contents in comparison with your file.

Aside: with iterparse(), elem.text is NOT guaranteed to have been parsed when the "start" event happens. To save some running time, you should access elem.text only when the "end" event happens. In fact, there seems to be no reason why you would want "start" events at all. Also, you define a list tags but never use it. The start of your function could be written much more concisely; don't forget to fix the call to iterparse() to remove the events arg.

Also, I'm reasonably certain that the elem.clear() should be done only when the event is "end", and needs to be done only when tag in tagset1. Read the relevant docs carefully. Doing the cleanup in a "start" event could very well be damaging your tree.
Add print statements in the blocks of code where you detect the start and end of a tag in elements, to make sure you are detecting these properly. I suspect that for some reason you aren't reaching the code that clears the authors list.
Try commenting out this code (or at least, move it into the "end" handling block):
Python should take care of clearing these elements for you as you traverse the XML. The "del context" is also superfluous; let the reference counters do the work for you here.
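A minimal, self-contained sketch of that diagnostic idea, using the standard library's ElementTree for illustration; the trace_records name and the lists that collect events (standing in for print statements) are hypothetical:

```python
import io
import xml.etree.ElementTree as ET

ELEMENTS = {'article', 'inproceedings', 'proceedings', 'book', 'incollection'}

def trace_records(xml_bytes):
    """Record each start/end of a paper element, plus the author list per record,
    to verify that detection and clearing happen exactly once per record."""
    events = []
    auth = []
    per_record_authors = []
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=('start', 'end')):
        if elem.tag in ELEMENTS:
            events.append((event, elem.tag))
            if event == 'end':
                per_record_authors.append(list(auth))
                auth = []  # every record must start with an empty author list
        elif event == 'end' and elem.tag == 'author' and elem.text:
            auth.append(elem.text)
    return events, per_record_authors

xml = (b'<dblp>'
       b'<article><author>A</author></article>'
       b'<article><author>B</author><author>C</author></article>'
       b'</dblp>')
events, per_record = trace_records(xml)
```

If a record ever shows up with the previous record's authors still attached, the reset is being skipped, which is exactly the symptom the question describes.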