How to preserve order with Beautiful Soup?

Posted on 2025-02-10 00:58:14

I'm using Beautiful Soup to extract the visible text from a webpage, so I tried to implement the following solution:

from bs4 import Comment  # needed for the isinstance check below


def filter_visible_texts(element):
    # Skip text nodes that live inside non-visible containers.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # Skip HTML comments.
    if isinstance(element, Comment):
        return False
    return True


def extract_visible_text(soup):
    visible_texts = soup.find_all(text=True)
    print(visible_texts)
    filtered_visible_texts = filter(filter_visible_texts, visible_texts)
    return set(text.strip() for text in filtered_visible_texts)

The problem is that preserving the order is critical for me.

The Beautiful Soup documentation doesn't mention any optional parameter to preserve order. Isn't this possible?

Comments (1)

栀子花开つ 2025-02-17 00:58:14

Your problem is the set structure. According to the documentation it's an unordered collection, i.e. you'll never be sure you get the same order again.
To keep the order, you could use a dict with the index as key. To remove duplicates (if needed), you'd need to write a little loop, as sketched below.
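A minimal sketch of that idea, assuming Python 3.7+ (where dicts preserve insertion order); the function and variable names are illustrative and not part of the original answer:

def unique_in_order(texts):
    # Dict keyed by the original index, as suggested above; the loop skips
    # values that were already seen, which removes the duplicates, while the
    # dict's insertion order preserves the document order.
    ordered = {}
    seen = set()
    for index, text in enumerate(texts):
        stripped = text.strip()
        if stripped not in seen:
            seen.add(stripped)
            ordered[index] = stripped
    return list(ordered.values())

Calling unique_in_order(soup.find_all(text=True)) would then return the stripped strings in the order they appear in the document, with at most one empty string for all the whitespace-only nodes.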

I built a little test HTML, since I don't know what your website looks like, to check whether the level in the XML tree affects the order. What I noticed is that the order is correct, from top to bottom as the text nodes appear in the HTML file.

<html>
  <body>
    <div/>
    <div class="some-class">
        <div>
            <a href="example.com" title="Title of the link">link 1</a>
        </div>
        <div>
            <div>text inside div</div>
        </div>
        <a href="example.com"  title="Some more title">link 2</a>
    </div>
  </body>
</html>

The script used for the extraction is basically your script, but without the filtering:

from bs4 import BeautifulSoup


with open("test_order.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

# Collect every text node, then show the effect of converting to a set.
texts = soup.find_all(text=True)
print(texts)
print(set(text.strip() for text in texts))

And the output:

['\n', '\n', '\n', '\n', '\n', 'link 1', '\n', '\n', '\n', 'text inside div', '\n', '\n', 'link 2', '\n', '\n', '\n']
{'', 'link 1', 'link 2', 'text inside div'}

As you can see in the first output, the order is link 1, text inside div, link 2. After converting to a set, the order changes.

For your example, it may be that some text appears closer to the top of the page because it is styled that way with CSS, but in the HTML itself it is defined at a later point.
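For completeness, here is one possible way (an untested sketch, not part of the original answer) to adapt the extract_visible_text function from the question so that it returns the visible texts in document order instead of as a set; it reuses filter_visible_texts from the question:

def extract_visible_text(soup):
    visible_texts = soup.find_all(text=True)
    filtered = filter(filter_visible_texts, visible_texts)
    # dict.fromkeys() keeps the first occurrence of each stripped string in
    # document order (dicts preserve insertion order since Python 3.7),
    # unlike set(), which throws the order away.
    ordered_unique = dict.fromkeys(text.strip() for text in filtered)
    return [text for text in ordered_unique if text]  # drop empty strings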
