Python: filter a list to remove certain links from HTML source code

Posted 2024-10-08 05:37:19


I have HTML source code from which I want to filter out one or more links while keeping the others.

I have set up my filter with "*" as the wildcard:

<a*>Link1</a>, <a*>Link2</a>, or <a*>Link3</a>
<a*>A bad link*</a>
some text* <a*>update*</a>
other text right before link <a*>click here</a>

I would like to filter out every instance of the link from the HTML source code using Python. I'm OK with loading the list into an array, but I need some help with the filter. Each line break signifies a separate filter, and I only want to remove the link(s), not the surrounding text.

I am still very new to Python and to regex/BeautifulSoup. Even just a pointer in the right direction would be greatly appreciated.


Comments (2)

潇烟暮雨 2024-10-15 05:37:19


To remove <a> tags and keep only the text not contained within those tags:

>>> from BeautifulSoup import BeautifulSoup as bs
>>> markup = """<a*>Link1</a> <a*>Link2</a> or <a*>Link3</a>
... <a*>A bad link*</a>
... some text* <a*>update*</a>
... other text right before link <a*>click here</a>"""
>>> soup = bs(markup)
>>> TAGS_TO_EXTRACT = ('a',)
>>> for tag in soup.findAll():
...   if tag.name in TAGS_TO_EXTRACT:
...     tag.extract()
...
>>> soup
  or

some text*
other text right before link

It's not clear to me whether you want to keep the text within the tags or not. If you do want the text contained within the tags, do something like this instead:

>>> for tag in soup.findAll():
...   if tag.name in TAGS_TO_EXTRACT:
...     tag.replaceWith(tag.text)
...
>>> soup
Link1 Link2 or Link3
A bad link*
some text* update*
other text right before link click here
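The session above uses the legacy BeautifulSoup 3 API (`findAll`, `replaceWith`). On the current `bs4` package the same two behaviours map onto `decompose()` (drop the tag and its text) and `unwrap()` (keep the text, drop the tag). A minimal sketch of the `unwrap()` variant, assuming valid `<a href>` markup rather than the question's `<a*>` placeholders:

```python
from bs4 import BeautifulSoup  # bs4, the modern successor to BeautifulSoup 3

markup = """<a href="#">Link1</a>, <a href="#">Link2</a>, or <a href="#">Link3</a>
<a href="#">A bad link</a>
some text <a href="#">update</a>
other text right before link <a href="#">click here</a>"""

soup = BeautifulSoup(markup, "html.parser")

# unwrap() replaces each <a> tag with its children, so the link
# text survives while the tag itself disappears. Use decompose()
# instead if the text should be removed along with the tag.
for tag in soup.find_all("a"):
    tag.unwrap()

print(soup)
```

Swapping `tag.unwrap()` for `tag.decompose()` reproduces the first behaviour shown above (tag and text both removed).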
成熟稳重的好男人 2024-10-15 05:37:19


Parsing the document with the only purpose of reassembling the whole thing while discarding just a part of the information would yield a lot of unneeded code.

So, I think this is a better job for regular expressions. Python's regular expressions can take a callback function that lets you customize the substitution string. In this case, it is a simple matter of writing a regexp that matches the "bad link" opening tag, the text in between, and the closing link markup, and preserves only the text in between.

import re

markup = """<a*>Link1</a>, <a*>Link2</a>, or <a*>Link3</a>
<a*>A bad link*</a>
some text* <a*>update*</a>
other text right before link <a*>click here</a>"""

filtered = re.sub(r"(<a.*?>)(.*?)(</a\s*>)", lambda match: match.group(2), markup)
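The substitution above strips every link. Since the question asks to remove only certain links, the callback can decide per match whether to keep the tag. A sketch of that idea, using hypothetical text patterns to stand in for the question's filter list:

```python
import re

markup = '<a href="#">Link1</a> <a href="/bad">A bad link</a> <a href="#">click here</a>'

# Hypothetical filter: link texts that mark a link for removal.
bad_text = re.compile(r"^(A bad link|click here)$")

def strip_if_bad(match):
    text = match.group(2)  # the text between the <a ...> and </a> tags
    # Replace the whole link with just its text when it matches a filter;
    # otherwise return the full match unchanged, keeping the link intact.
    return text if bad_text.match(text) else match.group(0)

filtered = re.sub(r"(<a[^>]*>)(.*?)(</a\s*>)", strip_if_bad, markup)
print(filtered)  # → <a href="#">Link1</a> A bad link click here
```

Here `Link1` keeps its tag while the two filtered links are reduced to bare text.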