Help parsing between tags using BeautifulSoup
I am attempting to parse information from a website using BeautifulSoup and Python. The HTML looks like the following. I want my parsed data to look like:
ID Definition
Lysine biosynthesis - Burkholderia pseudomallei 17
... with the rest of the data in the same position (inside the "pre" tags and outside the "a" tags).
How can I do this?
<pre>ID Definition
----------------------------------------------------------------------------------------------------
<a href="/kegg-bin/show_pathway?bpm00300">bpm00300</a> Lysine biosynthesis - Burkholderia pseudomallei 17
<a href="/kegg-bin/show_pathway?bpm00330">bpm00330</a> Arginine and proline metabolism - Burkholderia pse
<a href="/kegg-bin/show_pathway?bpm01100">bpm01100</a> Metabolic pathways - Burkholderia pseudomallei 171
<a href="/kegg-bin/show_pathway?bpm01110">bpm01110</a> Biosynthesis of secondary metabolites - Burkholder
</pre>
I have tried:
y = soup.find('pre')  # returns the <pre> element; specific to KEGG
for a in y:
    z = a.string
This gave me:
ID Definition
----------------------------------------------------------------------------------------------------
Thanks for the help!
BeautifulSoup() and its search methods return a hierarchical parse-tree object, not just a string. Iterating through findChildren() on the node you found gives you the "a" tags inside the "pre" element, which skips the header line, and the text you want is the text node that follows each tag.
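The original code for this answer is not preserved here, so the following is only a minimal sketch of the approach described, not the answerer's exact code. It assumes BeautifulSoup 4 (the `bs4` package) with the built-in `html.parser`, and it uses `find_all()`, for which `findChildren()` is an older alias. The `html` string is a shortened copy of the sample from the question, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample page from the question (definitions are
# truncated exactly as they appear there).
html = """<pre>ID Definition
----------------------------------------------------------------------------------------------------
<a href="/kegg-bin/show_pathway?bpm00300">bpm00300</a> Lysine biosynthesis - Burkholderia pseudomallei 17
<a href="/kegg-bin/show_pathway?bpm00330">bpm00330</a> Arginine and proline metabolism - Burkholderia pse
</pre>"""

soup = BeautifulSoup(html, "html.parser")
pre = soup.find("pre")  # the <pre> element as a parse-tree node

rows = []
# find_all("a") yields only the <a> tags, so the header text and the
# dashed separator line (plain text nodes) are skipped automatically.
for link in pre.find_all("a"):
    trailing = link.next_sibling  # the text node right after </a>
    # Guard against a missing or non-text sibling (assumption: the
    # definition is always plain text directly after the link).
    definition = trailing.strip() if isinstance(trailing, str) else ""
    rows.append((link.get_text(), definition))

for pathway_id, definition in rows:
    print(pathway_id, definition)
```

If only the definitions are needed, as in the desired output in the question, the `link.get_text()` part can be dropped and `definition` kept on its own.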