NotXMLError: Failed to parse the XML data
I'm trying to use the Entrez module from Biopython to retrieve full-text articles from PubMed Central. This is the code I'm using:

import urllib3
import json
import requests
from Bio import Entrez
from Bio.Entrez import efetch, Parser

print(Parser.__file__)
pmcid = 'PMC2837563'

def print_text(pmcid):
    handle = efetch(db='pmc', id=pmcid, retmode='xml', rettype=None)
    #print(handle.read())
    record = Entrez.read(handle)
    print(record)

print_text(pmcid)

handle.read() works, which means the data is being fetched properly. But I'm not able to call Entrez.read(handle) to convert the fetched data into a Python object. It gives me the following error:
NotXMLError: Failed to parse the XML data (syntax error: line 1036, column 69). Please make sure that the input data are in XML format.
Could someone tell me what to do about this? The syntax seems to be correct according to the Biopython documentation.
Comments (2)
The reason is that the latest available Biopython release (1.79) does not recognise the DTD with the URI http://www.niso.org/schemas/ali/1.0/. The GitHub version has the corrected Parser, but it is not yet available from pip. Compare the Parser in the current 1.79 release with the one on GitHub.
So you can either swap in or edit the Parser.py file, or use a third-party library to convert your handle into a built-in Python object.
If you just want to download the full text of the article, you could try downloading a PDF through metapub and then extracting the text with textract.

This error came up again after I updated Biopython. My code for retrieving XML via esummary used to work well, but recently I could not reproduce the results with the same code. I looked into Parser.py, and it now includes the ali notation. But Biopython still fails to parse the XML-formatted output.
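
Since Entrez.read can apparently still fail after an upgrade, one fallback that does not depend on Biopython's Parser.py is to take the raw bytes from handle.read() (which the question reports as working) and parse them directly. The sketch below uses the standard library's xml.etree.ElementTree rather than a third-party library. The helper names extract_paragraphs and fetch_pmc_xml, the email argument, and the SAMPLE document are illustrative assumptions, not part of Biopython. It also assumes that paragraph elements in the PMC response are plain, un-namespaced p tags.

```python
import xml.etree.ElementTree as ET


def extract_paragraphs(xml_data):
    """Return the whitespace-normalised text of every <p> element.

    Assumes <p> is not in an XML namespace. Elements from other
    namespaces (such as ali:) are simply skipped rather than rejected.
    """
    root = ET.fromstring(xml_data)
    paragraphs = []
    for p in root.iter("p"):
        # itertext() also collects text inside nested inline tags
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return paragraphs


def fetch_pmc_xml(pmcid, email):
    """Fetch the raw XML for one PMC article (needs Biopython and network)."""
    # Imported here so extract_paragraphs stays usable without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.efetch(db="pmc", id=pmcid, retmode="xml")
    try:
        return handle.read()
    finally:
        handle.close()


# Hypothetical miniature of a PMC response that uses the ali namespace.
SAMPLE = b"""<?xml version="1.0"?>
<pmc-articleset>
  <article xmlns:ali="http://www.niso.org/schemas/ali/1.0/">
    <front><article-meta><permissions>
      <ali:free_to_read/>
    </permissions></article-meta></front>
    <body>
      <sec><title>Intro</title>
        <p>First <italic>paragraph</italic>.</p>
        <p>Second paragraph.</p>
      </sec>
    </body>
  </article>
</pmc-articleset>"""

if __name__ == "__main__":
    for paragraph in extract_paragraphs(SAMPLE):
        print(paragraph)
```

With network access, extract_paragraphs(fetch_pmc_xml('PMC2837563', 'you@example.org')) would apply the same parsing to the article from the question; the email address is a placeholder. This returns a flat list of strings rather than the nested record that Entrez.read builds, so it fits the "I just want the text" case more than a drop-in replacement.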