BeautifulSoup 解析树上的深度优先遍历

发布于 2024-10-14 16:15:20 字数 241 浏览 2 评论 0原文

有没有办法在 BeautifulSoup 解析树上进行 DFT?我试图做一些事情,比如从根开始,通常获取所有子元素,然后为每个子元素获取它们的子元素,等等,直到我到达终端节点,此时我将构建返回树的方式。问题是我似乎找不到一种方法可以让我做到这一点。我找到了 findChildren 方法,但这似乎只是将整个页面多次放入列表中,每个后续条目都会减少。我也许可以使用它来进行遍历,但是除了列表中的最后一个条目之外,似乎没有任何方法可以将条目识别为终端节点。有什么想法吗?

Is there a way to do a DFT on a BeautifulSoup parse tree? I'm trying to do something like starting at the root, usually , get all the child elements and then for each child element get their children, etc until I hit a terminal node at which point I'll build my way back up the tree. Problem is I can't seem to find a method that will allow me to do this. I found the findChildren method but that seems to just put the entire page in a list multiple times with each subsequent entry getting reduced. I might be able to use this to do a traversal however other than the last entry in the list it doesn't appear there is any way to identify entries as terminal nodes or not. Any ideas?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

很快妥协 2024-10-21 16:15:20

mytag.find_all() 已经这样做了:

如果您调用 mytag.find_all(),Beautiful Soup 将检查 mytag 的所有后代:它的子级、子级的子级等等

from bs4 import BeautifulSoup  # pip install beautifulsoup4

soup = BeautifulSoup("""<!doctype html>
<div id=a>A
  <div id=1>A1</div>
  <div id=2>A2</div>
</div>
<div id=b>B
  <div id=I>BI</div>
  <div id=II>BII</div>
</div>
""")

for div in soup.find_all("div", recursive=True):
    print(div.get('id'))

输出

a
1
2
b
I
II

输出确认,这是深度优先遍历。


Old Beautiful Soup 3 答案:

recursiveChildGenerator() 已经做到了:

soup = BeautifulSoup.BeautifulSoup(html)
for child in soup.recursiveChildGenerator():
     name = getattr(child, "name", None)
     if name is not None:
         print name
     elif not child.isspace(): # leaf node, don't print spaces
         print child

Output

For the html from @msalvadores 的答案

html
ul
li
Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
li
Aliquam tincidunt mauris eu risus.
li
Vestibulum auctor dapibus neque.
html

注意:html 由于 示例包含两个打开 标签。

mytag.find_all() already does that:

If you call mytag.find_all(), Beautiful Soup will examine all the descendants of mytag: its children, its children’s children, and so on

from bs4 import BeautifulSoup  # pip install beautifulsoup4

soup = BeautifulSoup("""<!doctype html>
<div id=a>A
  <div id=1>A1</div>
  <div id=2>A2</div>
</div>
<div id=b>B
  <div id=I>BI</div>
  <div id=II>BII</div>
</div>
""")

for div in soup.find_all("div", recursive=True):
    print(div.get('id'))

Output

a
1
2
b
I
II

The output confirms that, it is a depth first traversal.


Old Beautiful Soup 3 answer:

recursiveChildGenerator() already does that:

soup = BeautifulSoup.BeautifulSoup(html)
for child in soup.recursiveChildGenerator():
     name = getattr(child, "name", None)
     if name is not None:
         print name
     elif not child.isspace(): # leaf node, don't print spaces
         print child

Output

For the html from @msalvadores's answer:

html
ul
li
Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
li
Aliquam tincidunt mauris eu risus.
li
Vestibulum auctor dapibus neque.
html

NOTE: html is printed twice due to the example contains two opening <html> tags.

等风来 2024-10-21 16:15:20

我认为您可以使用“childGenerator”方法并递归地使用该方法以 DFT 方式解析树。

def recursiveChildren(x):
   if "childGenerator" in dir(x):
      for child in x.childGenerator():
          name = getattr(child, "name", None)
          if name is not None:
             print "[Container Node]",child.name
          recursiveChildren(child)
    else:
       if not x.isspace(): #Just to avoid printing "\n" parsed from document.
          print "[Terminal Node]",x

if __name__ == "__main__":
    soup = BeautifulSoup(your_data)
    for child in soup.childGenerator():
        recursiveChildren(child)

通过 dir(x) 中的“childGenerator”,我们可以确保元素是容器,而终端​​节点(例如 NavigableStrings)不是容器并且不包含子节点。

对于一些 HTML 示例:

<html>
<ul>
   <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li>
   <li>Aliquam tincidunt mauris eu risus.</li>
   <li>Vestibulum auctor dapibus neque.</li>
</ul>
</html>

此脚本打印...

[Container Node] ul
[Container Node] li
[Terminal Node] Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
[Container Node] li
[Terminal Node] Aliquam tincidunt mauris eu risus.
[Container Node] li
[Terminal Node] Vestibulum auctor dapibus neque.

I think you can use the method "childGenerator" and recursively use this one to parse the tree in a DFT fashion.

def recursiveChildren(x):
   if "childGenerator" in dir(x):
      for child in x.childGenerator():
          name = getattr(child, "name", None)
          if name is not None:
             print "[Container Node]",child.name
          recursiveChildren(child)
    else:
       if not x.isspace(): #Just to avoid printing "\n" parsed from document.
          print "[Terminal Node]",x

if __name__ == "__main__":
    soup = BeautifulSoup(your_data)
    for child in soup.childGenerator():
        recursiveChildren(child)

With "childGenerator" in dir(x) we make sure that an element is a container, terminal nodes such as NavigableStrings are not containers and do not contain children.

For some example HTML like:

<html>
<ul>
   <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li>
   <li>Aliquam tincidunt mauris eu risus.</li>
   <li>Vestibulum auctor dapibus neque.</li>
</ul>
</html>

This scripts prints ...

[Container Node] ul
[Container Node] li
[Terminal Node] Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
[Container Node] li
[Terminal Node] Aliquam tincidunt mauris eu risus.
[Container Node] li
[Terminal Node] Vestibulum auctor dapibus neque.
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文