Depth-first traversal of a BeautifulSoup parse tree
Is there a way to do a depth-first traversal (DFT) of a BeautifulSoup parse tree? I'm trying to start at the root, usually <html>, get all of its child elements, then get each child's children, and so on until I hit a terminal node, at which point I'll build my way back up the tree. The problem is that I can't find a method that lets me do this. I found the findChildren method, but it seems to just put the entire page in a list multiple times, with each subsequent entry getting smaller. I might be able to use this for a traversal, but apart from the last entry in the list there doesn't appear to be any way to tell whether an entry is a terminal node. Any ideas?
mytag.find_all() already does that. Its output confirms that it is a depth-first traversal.
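The answer's original snippet did not survive on this page, so here is a minimal sketch of the same claim. The sample markup and the variable names (html, soup, order) are made up for illustration; the only library call relied on is find_all() with no arguments, which returns every descendant tag in document order.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, chosen only to make the visiting order obvious.
html = (
    "<html><body>"
    "<div><p>one</p><span>two</span></div>"
    "<ul><li>three</li></ul>"
    "</body></html>"
)

# "html.parser" is the stdlib-backed parser, so no extra dependency is needed.
soup = BeautifulSoup(html, "html.parser")

# With no filter, find_all() yields tags in document order, which is a
# pre-order depth-first walk. Text nodes are not included in the result.
order = [tag.name for tag in soup.find_all()]
print(order)  # ['html', 'body', 'div', 'p', 'span', 'ul', 'li']
```

Note that div's children (p, span) are listed before its sibling ul, which is what distinguishes a depth-first order from a breadth-first one.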
Old Beautiful Soup 3 answer:
recursiveChildGenerator() already does that. For the HTML from @msalvadores's answer, note that html is printed twice in the output because the example contains two opening <html> tags.
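For readers on Beautiful Soup 4, a rough equivalent of the Beautiful Soup 3 call is the .descendants attribute (the BS4 porting notes list it as the replacement for recursiveChildGenerator()). The sketch below is not the answer's original code; the markup and the visited list are illustrative assumptions. Unlike find_all(), this walk also yields text nodes.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample markup for illustration.
html = "<div><p>one</p><b>two</b></div>"
soup = BeautifulSoup(html, "html.parser")

visited = []
# .descendants walks every node below `soup` depth-first, in document order.
for node in soup.descendants:
    if isinstance(node, NavigableString):
        visited.append(("text", str(node)))  # terminal text node
    else:
        visited.append(("tag", node.name))   # element node

print(visited)
# [('tag', 'div'), ('tag', 'p'), ('text', 'one'), ('tag', 'b'), ('text', 'two')]
```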
I think you can use the childGenerator method and call it recursively to walk the tree in a DFT fashion.
With "childGenerator" in dir(x) we make sure that an element is a container; terminal nodes such as NavigableStrings are not containers and do not contain children.
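The recursive idea can be sketched as follows for Beautiful Soup 4. This is an illustration under stated assumptions, not the answer's original script: it swaps the "childGenerator" in dir(x) check for isinstance(node, Tag) and iterates .children, and the walk function, its parameters, and the sample markup are all hypothetical names. Nodes are recorded after their children, which matches the "build my way back up the tree" order the question asks for.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def walk(node, depth=0, out=None):
    """Recursive depth-first walk; collects (depth, kind, label) in post-order."""
    if out is None:
        out = []
    if isinstance(node, Tag):
        # A Tag is a container: visit all children first, then record the tag.
        for child in node.children:
            walk(child, depth + 1, out)
        out.append((depth, "tag", node.name))
    else:
        # Anything else (e.g. NavigableString) is a terminal node.
        out.append((depth, "leaf", str(node)))
    return out


# Hypothetical sample markup for illustration.
html = "<html><body><p>hi</p><br/></body></html>"
soup = BeautifulSoup(html, "html.parser")

result = walk(soup.html)
for depth, kind, label in result:
    print("  " * depth + f"{kind}: {label}")
```

An empty tag such as br is still a Tag here; it simply has no children, so it is recorded right away. If empty tags should count as terminal nodes too, checking whether node.contents is empty is one way to tell.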