How do I preserve order in BeautifulSoup?
I'm using BeautifulSoup to extract the visible text from a web page, so I tried to implement the following solution:

    from bs4.element import Comment

    def filter_visible_texts(element):
        # Skip text inside tags that are never rendered.
        if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
            return False
        # Skip HTML comments.
        if isinstance(element, Comment):
            return False
        return True

    def extract_visible_text(soup):
        visible_texts = soup.find_all(text=True)
        print(visible_texts)
        filtered_visible_texts = filter(filter_visible_texts, visible_texts)
        return set(text.strip() for text in filtered_visible_texts)

The problem is that preserving order is critical for me.

The BeautifulSoup documentation doesn't mention any optional parameter for preserving order. Is this not possible?
Comments (1)
Your problem is the `set` structure. According to the documentation, a set is an unordered collection, i.e. you can never be sure you'll get the same order again. To keep the order, you could use a `dict` with the index as key. To remove duplicates (if needed), you'd need to write a little loop.

I built a little test HTML, since I don't know what your website looks like, to check whether the level in the XML tree affects the order. What I noticed is that the order is correct: the texts come out from top to bottom, as they appear in the HTML file.
The script I used for extraction was basically your script without the filtering. In its first output, the list returned by find_all, the order is link1, text, link2. After converting to a set, the order changes.
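A minimal sketch of such a check, assuming the `bs4` package is installed. The markup in `HTML` is hypothetical and only mirrors the link1, text, link2 layout described above; it is not the original test file.

```python
from bs4 import BeautifulSoup

# Hypothetical test markup (an assumption, not the original test HTML).
HTML = (
    '<html><body>'
    '<a href="#one">link1</a>'
    '<p>text</p>'
    '<a href="#two">link2</a>'
    '</body></html>'
)

soup = BeautifulSoup(HTML, "html.parser")

# find_all(string=True) returns the text nodes in document order
# (string= is the newer name of the text= argument used in the question).
texts = [str(node).strip() for node in soup.find_all(string=True)]

print(texts)       # ['link1', 'text', 'link2']
print(set(texts))  # same three items, but iteration order is not guaranteed
```

The list keeps the order in which the nodes appear in the markup; only the conversion to a set discards it.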
For your example, it may be that some text appears closer to the top of the rendered page because it is positioned that way with CSS, while in the HTML itself it is defined at a later point.
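One way to drop duplicates while keeping the order is sketched below. It is a variant of the dict idea above: instead of using the index as key, it uses the text itself as key and relies on dicts keeping insertion order (guaranteed since Python 3.7). The helper name `unique_in_order` is hypothetical, and skipping empty strings is an added assumption that the original `set(...)` call did not make.

```python
def unique_in_order(texts):
    """Strip each string, drop empty ones and duplicates, keep first-seen order."""
    stripped = (text.strip() for text in texts)
    # dict.fromkeys keeps only the first occurrence of each key,
    # and dict keys iterate in insertion order.
    return list(dict.fromkeys(s for s in stripped if s))


print(unique_in_order(["link1 ", "text", "\n", "link1", "link2"]))
# ['link1', 'text', 'link2']
```

In the question's `extract_visible_text`, the last line could then presumably become `return unique_in_order(filtered_visible_texts)` instead of building a set.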