NotXMLError: Failed to parse the XML data

Posted on 2025-01-25 10:41:18

I'm trying to use the Entrez module from Biopython to retrieve full-text articles from PubMed Central. Here is my code:

import urllib3
import json
import requests
from Bio import Entrez
from Bio.Entrez import efetch, Parser
print(Parser.__file__)
pmcid = 'PMC2837563'

def print_text(pmcid):
    handle = efetch(db='pmc', id=pmcid, retmode='xml', rettype=None)
    #print(handle.read())
    record = Entrez.read(handle)
    print(record)

print_text(pmcid)


Calling handle.read() works, which means the data is being fetched properly. However, Entrez.read(handle) fails to convert the fetched data into a Python object and gives me the error below:

NotXMLError: Failed to parse the XML data (syntax error: line 1036, column 69). Please make sure that the input data are in XML format.

Could someone tell me what to do about this? This seems to be the correct syntax according to the Biopython documentation.

Comments (2)

梦情居士 2025-02-01 10:41:18

The reason is that the latest released Biopython version (1.79) does not recognise the ali namespace prefix, which is declared with the URI http://www.niso.org/schemas/ali/1.0/. The GitHub version has a corrected Parser, but it is not yet available from pip.
Compare:

Current release (1.79):

    def startNamespaceDeclHandler(self, prefix, uri):
        """Handle start of an XML namespace declaration."""
        if prefix == "xsi":
            # This is an xml schema
            self.schema_namespace = uri
            self.parser.StartElementHandler = self.schemaHandler
        else:
            # Note that the DTD for MathML specifies a default attribute
            # that declares the namespace for each MathML element. This means
            # that MathML element in the XML has an invisible MathML namespace
            # declaration that triggers a call to startNamespaceDeclHandler
            # and endNamespaceDeclHandler. Therefore we need to count how often
            # startNamespaceDeclHandler and endNamespaceDeclHandler were called
            # to find out their first and last invocation for each namespace.
            if prefix == "mml":
                assert uri == "http://www.w3.org/1998/Math/MathML"
            elif prefix == "xlink":
                assert uri == "http://www.w3.org/1999/xlink"
            else:
                raise ValueError("Unknown prefix '%s' with uri '%s'" % (prefix, uri))
            self.namespace_level[prefix] += 1
            self.namespace_prefix[uri] = prefix

GitHub version:

    def startNamespaceDeclHandler(self, prefix, uri):
        """Handle start of an XML namespace declaration."""
        if prefix == "xsi":
            # This is an xml schema
            self.schema_namespace = uri
            self.parser.StartElementHandler = self.schemaHandler
        else:
            # Note that the DTD for MathML specifies a default attribute
            # that declares the namespace for each MathML element. This means
            # that MathML element in the XML has an invisible MathML namespace
            # declaration that triggers a call to startNamespaceDeclHandler
            # and endNamespaceDeclHandler. Therefore we need to count how often
            # startNamespaceDeclHandler and endNamespaceDeclHandler were called
            # to find out their first and last invocation for each namespace.
            if prefix == "mml":
                assert uri == "http://www.w3.org/1998/Math/MathML"
            elif prefix == "xlink":
                assert uri == "http://www.w3.org/1999/xlink"
            elif prefix == "ali":
                assert uri == "http://www.niso.org/schemas/ali/1.0/"
            else:
                raise ValueError(f"Unknown prefix '{prefix}' with uri '{uri}'")
            self.namespace_level[prefix] += 1
            self.namespace_prefix[uri] = prefix
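
To make the effect of the missing branch concrete, here is a minimal sketch that mirrors the prefix check using the standard library's expat parser. It is not Biopython's code: the names KNOWN_1_79, KNOWN_GITHUB and check_namespaces are made up for illustration, and the sample document is a tiny hand-written stand-in for a PMC record.

```python
from xml.parsers import expat

# Prefix tables mirroring the two versions of the check shown above
# (hypothetical names, for illustration only).
KNOWN_1_79 = {
    "mml": "http://www.w3.org/1998/Math/MathML",
    "xlink": "http://www.w3.org/1999/xlink",
}
KNOWN_GITHUB = {**KNOWN_1_79, "ali": "http://www.niso.org/schemas/ali/1.0/"}


def check_namespaces(xml_bytes, known):
    """Parse xml_bytes and return the declared prefixes.

    Raises ValueError for any prefix/URI pair that is not in `known`.
    """
    parser = expat.ParserCreate(namespace_separator=" ")
    seen = []

    def on_namespace(prefix, uri):
        if known.get(prefix) != uri:
            raise ValueError(f"Unknown prefix '{prefix}' with uri '{uri}'")
        seen.append(prefix)

    parser.StartNamespaceDeclHandler = on_namespace
    parser.Parse(xml_bytes, True)
    return seen


# Hand-written sample that declares the same prefixes a PMC record may use.
SAMPLE = (
    b'<article xmlns:xlink="http://www.w3.org/1999/xlink" '
    b'xmlns:ali="http://www.niso.org/schemas/ali/1.0/">'
    b"<ali:free_to_read/></article>"
)

if __name__ == "__main__":
    print(check_namespaces(SAMPLE, KNOWN_GITHUB))
    try:
        check_namespaces(SAMPLE, KNOWN_1_79)
    except ValueError as err:
        print("1.79-style table rejects the document:", err)
```

With the 1.79-style table the ali declaration raises ValueError, while the extended table lets the same document through.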

So you can either replace or edit your local Parser.py file, or use a third-party library to convert the fetched data into built-in Python objects.
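
As a sketch of the second option, using the standard library rather than a third-party package, the code below skips Entrez.read entirely and parses the fetched XML with xml.etree.ElementTree, which accepts any declared namespace prefix. The helper names fetch_pmc_xml and extract_paragraphs are hypothetical, and the code assumes that paragraphs in the returned record are un-namespaced p elements; check this against the XML you actually get back.

```python
import xml.etree.ElementTree as ET


def fetch_pmc_xml(pmcid, email):
    """Download the raw XML for one PMC article (needs Biopython and network access)."""
    # Imported here so that extract_paragraphs also works without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.efetch(db="pmc", id=pmcid, retmode="xml")
    try:
        return handle.read()
    finally:
        handle.close()


def extract_paragraphs(xml_data):
    """Return the text of every <p> element, with whitespace normalised.

    Assumes paragraphs are un-namespaced <p> elements.
    """
    root = ET.fromstring(xml_data)
    paragraphs = []
    for p in root.iter("p"):
        # itertext() also collects text inside inline markup such as <italic>.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return paragraphs
```

Usage would look like extract_paragraphs(fetch_pmc_xml('PMC2837563', 'you@example.com')), where the e-mail address is a placeholder for your own. The result is a plain list of strings rather than the nested structure Entrez.read would return, so this only suits cases where the article text is all you need.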


If you only want the full text of the article, you could try downloading the PDF through metapub and then extracting the text with textract.

import metapub
from urllib.request import urlretrieve
import textract

pmcid = 'PMC2837563'

# Placeholder paths: where to save the PDF and the extracted text
any_path = 'article.pdf'
another_path = 'article.txt'

fetch = metapub.PubMedFetcher()
article_metadata = fetch.article_by_pmcid(pmcid)

# Get just the abstract
abstract = article_metadata.abstract

# Download the full article as a PDF
pmid = article_metadata.pmid
url = metapub.FindIt(pmid).url  # may be None if no PDF link is found

urlretrieve(url, any_path)

# textract.process returns bytes, so write the file in binary mode
with open(another_path, "wb") as textfile:
    textfile.write(textract.process(
        any_path,
        extension='pdf',
        method='pdftotext',
        encoding="utf_8",
    ))

卷耳 2025-02-01 10:41:18

This error came up again after I updated Biopython. My code that retrieves XML via esummary used to work well, but recently I could not reproduce the results with the same code. I looked into Parser.py, and it now includes the ali prefix check. But Biopython still fails to parse the XML-formatted output.
