Is there a faster way to extract XML data than nested loops?

Posted 2025-02-09 10:19:33

I have an XML file that has the below format:

<JMdict>
...
<entry>
        <ent_seq>2232410</ent_seq>
        <k_ele>
                <keb>筆おろし</keb>
        </k_ele>
        <k_ele>
                <keb>筆下ろし</keb>
        </k_ele>
        <k_ele>
                <keb>筆降ろし</keb>
                <ke_inf>&iK;</ke_inf>
        </k_ele>
        <r_ele>
                <reb>ふでおろし</reb>
        </r_ele>
        <sense>
                <pos>&n;</pos>
                <pos>&vs;</pos>
                <gloss>using a new brush for the first time</gloss>
        </sense>
        <sense>
                <pos>&n;</pos>
                <pos>&vs;</pos>
                <gloss>doing something for the first time</gloss>
        </sense>
        <sense>
                <pos>&n;</pos>
                <pos>&vs;</pos>
                <gloss>man losing his virginity (esp. to an older woman)</gloss>
        </sense>
</entry>
...
</JMdict>

Link to the whole XML file: http://ftp.edrdg.org/pub/Nihongo/JMdict_e.gz

This is basically an electronic Japanese/English dictionary. There are many entry tags. I'm trying to create a search function that will return the ent_seq number based on the text values in any of the keb, reb, and gloss tags.

I have the code below, which does what I need but is somewhat slow (438 ms). The seq number will then be used to look up data in another dataset, and if I plan on using it in a web app, I would like it to be faster. Is there a way?

from xml.etree import ElementTree as ET

tree = ET.parse("../../resources/JMdict_e.xml")
root = tree.getroot()

search_term = '筆おろし'
seq_tags = []

for dictionary in root.iter('JMdict'):
    
    for child in dictionary:
        
        for grandchild in child:
            if grandchild.tag == 'ent_seq':
                ent_seq = grandchild.text
                
            for greatgrandchild in grandchild:
                if greatgrandchild.tag in ['keb','reb','gloss']:
                    if greatgrandchild.text == search_term:
                        seq_tags.append(ent_seq)

                    
print(seq_tags)

Any help and tips would be most appreciated.

Comments (1)

鹿港小镇 2025-02-16 10:19:33

Use XPath to search on more than one element in the same expression (it could be an AND/OR expression with more than two conditions):

from lxml import etree
import time

start = time.process_time()
# resolve_entities=False leaves entity references such as &n; unexpanded
parser = etree.XMLParser(compact=True, huge_tree=True, resolve_entities=False)
with open('/home/luis/tmp/JMdict_e', 'rb') as f:
    t_open = time.process_time()
    print('open file:', t_open - start, 'seconds')

    tree = etree.parse(f, parser)
    t_parse = time.process_time()
    print('parse:', t_parse - t_open, 'seconds')

    # The union operator (|) combines the results of both conditions
    slist = tree.xpath('//entry[k_ele/keb = "筆おろし"]/ent_seq | //entry[r_ele/reb = "エヌきょう"]/ent_seq')
    t_xpath = time.process_time()
    print('xpath:', t_xpath - t_parse, 'seconds')

    for d in slist:
        print(d.text)

print('CPU Execution time:', time.process_time() - start, 'seconds')
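
For a web app that searches the same file repeatedly, it may also help to parse once and build a lookup table, so each search becomes a dictionary access rather than a walk over the tree. The sketch below is a different technique from the XPath query above, not part of the original answer. It uses the standard-library ElementTree from the question, and like the question's code it only matches exact text. `SAMPLE`, `SEARCH_TAGS` and `build_index` are hypothetical names introduced for illustration. The sample is a trimmed copy of the entry in the question, with the `<pos>` elements left out so that no entity declarations are needed.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Trimmed sample in the same shape as the entry shown in the question;
# <pos> elements are omitted so no entity declarations are needed here.
SAMPLE = """<JMdict>
<entry>
    <ent_seq>2232410</ent_seq>
    <k_ele><keb>筆おろし</keb></k_ele>
    <k_ele><keb>筆下ろし</keb></k_ele>
    <r_ele><reb>ふでおろし</reb></r_ele>
    <sense><gloss>using a new brush for the first time</gloss></sense>
    <sense><gloss>doing something for the first time</gloss></sense>
</entry>
</JMdict>"""

SEARCH_TAGS = ("keb", "reb", "gloss")


def build_index(root):
    """Map each keb/reb/gloss text to the ent_seq values of entries containing it."""
    index = defaultdict(list)
    for entry in root.iter("entry"):
        ent_seq = entry.findtext("ent_seq")
        seen = set()  # avoid listing one entry twice for the same text
        for elem in entry.iter():
            if elem.tag in SEARCH_TAGS and elem.text and elem.text not in seen:
                seen.add(elem.text)
                index[elem.text].append(ent_seq)
    return index


root = ET.fromstring(SAMPLE)
index = build_index(root)            # done once, e.g. at application start-up
seq_tags = index.get("筆おろし", [])  # each search is a single dict lookup
print(seq_tags)
```

With the real file, `ET.fromstring(SAMPLE)` would be replaced by the `ET.parse(...).getroot()` call from the question. The one-off cost of parsing and indexing remains, but it is paid once instead of on every search.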