NotXMLError: Failed to parse the XML data

Posted 2025-01-25 10:41:18 | 737 words | 5 views | 0 comments

I'm trying to use the Entrez module from Biopython to retrieve full-text articles from PubMed Central. This is the code I'm using:

import urllib3
import json
import requests
from Bio import Entrez
from Bio.Entrez import efetch, Parser
print(Parser.__file__)
pmcid = 'PMC2837563'

def print_text(pmcid):
    handle = efetch(db='pmc', id=pmcid, retmode='xml', rettype=None)
    #print(handle.read())
    record = Entrez.read(handle)
    print(record)

print_text(pmcid)

handle.read() works, which means the data is being fetched properly. However, I'm not able to call Entrez.read(handle) to convert the fetched data into a Python object. It gives me the error below:

NotXMLError: Failed to parse the XML data (syntax error: line 1036, column 69). Please make sure that the input data are in XML format.

Could someone tell me what to do about this? This seems to be the correct syntax as per the Biopython documentation.

梦情居士 2025-02-01 10:41:18

The reason is that the latest released Biopython version (1.79) does not recognise the ali namespace prefix, whose URI is http://www.niso.org/schemas/ali/1.0/. The GitHub version has a corrected Parser, but it is not yet available from pip.
Compare:

Current release (1.79):

    def startNamespaceDeclHandler(self, prefix, uri):
        """Handle start of an XML namespace declaration."""
        if prefix == "xsi":
            # This is an xml schema
            self.schema_namespace = uri
            self.parser.StartElementHandler = self.schemaHandler
        else:
            # Note that the DTD for MathML specifies a default attribute
            # that declares the namespace for each MathML element. This means
            # that MathML element in the XML has an invisible MathML namespace
            # declaration that triggers a call to startNamespaceDeclHandler
            # and endNamespaceDeclHandler. Therefore we need to count how often
            # startNamespaceDeclHandler and endNamespaceDeclHandler were called
            # to find out their first and last invocation for each namespace.
            if prefix == "mml":
                assert uri == "http://www.w3.org/1998/Math/MathML"
            elif prefix == "xlink":
                assert uri == "http://www.w3.org/1999/xlink"
            else:
                raise ValueError("Unknown prefix '%s' with uri '%s'" % (prefix, uri))
            self.namespace_level[prefix] += 1
            self.namespace_prefix[uri] = prefix

GitHub version:

    def startNamespaceDeclHandler(self, prefix, uri):
        """Handle start of an XML namespace declaration."""
        if prefix == "xsi":
            # This is an xml schema
            self.schema_namespace = uri
            self.parser.StartElementHandler = self.schemaHandler
        else:
            # Note that the DTD for MathML specifies a default attribute
            # that declares the namespace for each MathML element. This means
            # that MathML element in the XML has an invisible MathML namespace
            # declaration that triggers a call to startNamespaceDeclHandler
            # and endNamespaceDeclHandler. Therefore we need to count how often
            # startNamespaceDeclHandler and endNamespaceDeclHandler were called
            # to find out their first and last invocation for each namespace.
            if prefix == "mml":
                assert uri == "http://www.w3.org/1998/Math/MathML"
            elif prefix == "xlink":
                assert uri == "http://www.w3.org/1999/xlink"
            elif prefix == "ali":
                assert uri == "http://www.niso.org/schemas/ali/1.0/"
            else:
                raise ValueError(f"Unknown prefix '{prefix}' with uri '{uri}'")
            self.namespace_level[prefix] += 1
            self.namespace_prefix[uri] = prefix
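
To see which prefixes a document declares, and therefore which ones the 1.79 handler above would not know, here is a minimal sketch using only the standard library's xml.parsers.expat. The SAMPLE string, the KNOWN_PREFIXES set, and the declared_namespaces helper are made up for illustration; they are not part of Biopython, and the sketch does not reproduce how Biopython turns the problem into NotXMLError.

```python
import xml.parsers.expat

# Prefixes accepted by the 1.79 handler quoted above (besides "xsi")
KNOWN_PREFIXES = {"mml", "xlink"}

# Made-up fragment that declares the same namespaces as a PMC article
SAMPLE = (
    '<article xmlns:xlink="http://www.w3.org/1999/xlink" '
    'xmlns:ali="http://www.niso.org/schemas/ali/1.0/">'
    '<ali:free_to_read/>'
    '</article>'
)


def declared_namespaces(xml_text):
    """Return {prefix: uri} for every namespace declared in xml_text."""
    found = {}

    def on_namespace(prefix, uri):
        # expat calls this once per xmlns:prefix="uri" declaration
        found[prefix] = uri

    # Namespace processing must be enabled for the handler to fire
    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    parser.StartNamespaceDeclHandler = on_namespace
    parser.Parse(xml_text, True)
    return found


namespaces = declared_namespaces(SAMPLE)
unknown = sorted(set(namespaces) - KNOWN_PREFIXES)
print(unknown)  # prints ['ali']
```

Running the same helper on the output of handle.read() should show whether an article declares prefixes other than mml and xlink.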

So you can either replace or edit the Parser.py file in your installation, or use a third-party library to convert the handle into built-in Python objects.
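
As one possible way to avoid Entrez.read altogether, the sketch below parses the raw XML with the standard library's xml.etree.ElementTree, which does not check namespace prefixes against a fixed list. The article_paragraphs helper and the SAMPLE document are hypothetical, and the sketch assumes that paragraphs are plain, un-namespaced p elements; real PMC records may need extra handling.

```python
import xml.etree.ElementTree as ET


def article_paragraphs(xml_bytes):
    """Return the text of every <p> element found in the XML."""
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter("p"):
        # Join nested inline text (italic, links, ...) and collapse whitespace
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return paragraphs


# Tiny made-up stand-in for an efetch response, including the ali prefix
SAMPLE = b"""<pmc-articleset>
  <article xmlns:ali="http://www.niso.org/schemas/ali/1.0/">
    <front><ali:free_to_read/></front>
    <body>
      <sec><title>Intro</title><p>First <italic>paragraph</italic>.</p></sec>
      <sec><p>Second paragraph.</p></sec>
    </body>
  </article>
</pmc-articleset>"""

for paragraph in article_paragraphs(SAMPLE):
    print(paragraph)
```

With the code from the question, this would be called as article_paragraphs(handle.read()) in place of Entrez.read(handle).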

If you just want to download the full text of the article, you could try downloading the PDF through metapub and then extracting the text with textract.

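import metapub
from urllib.request import urlretrieve
import textract

pmcid = 'PMC2837563'

# Example output paths (placeholders); choose any locations you like
pdf_path = 'article.pdf'
text_path = 'article.txt'

fetch = metapub.PubMedFetcher()
article_metadata = fetch.article_by_pmcid(pmcid)

# Get just the abstract
abstract = article_metadata.abstract

# Look up a full-text PDF link via the article's PMID
pmid = article_metadata.pmid
url = metapub.FindIt(pmid).url
if url is None:
    raise RuntimeError(f'No downloadable PDF found for {pmcid}')

# Download the PDF
urlretrieve(url, pdf_path)

# textract.process returns bytes, so write the file in binary mode
with open(text_path, "wb") as textfile:
    textfile.write(textract.process(
        pdf_path,
        extension='pdf',
        method='pdftotext',
        encoding="utf_8",
    ))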
