Python filter list to remove certain links from HTML source code
I have HTML source code from which I want to filter out one or more links while keeping the others.
I have set up my filters with "*" as the wildcard:
<a*>Link1</a>, <a*>Link2</a>, or <a*>Link3</a>
<a*>A bad link*</a>
some text* <a*>update*</a>
other text right before link <a*>click here</a>
I would like to filter out every instance of these links from the HTML source code using Python. I'm OK with loading the list into an array, but I need some help with the filter. Each line break signifies a separate filter, and I only want to remove the link(s), not the text.
I am still very new to Python and regex/BeautifulSoup. Even if you could just point me in the right direction, it would be greatly appreciated.
To remove <a> tags and keep only the text not contained within those tags:
It's not clear to me whether you want the text within the tags or not. If you want to keep the text contained within the tags, do something like this instead:
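A minimal sketch of the two behaviours described in this answer, written with the standard-library re module. It is not the answerer's own code: the helper names drop_links and unwrap_links are hypothetical, and a pattern like this only copes with simple, non-nested markup.

```python
import re

# Matches a whole link: opening <a ...> tag, its content, and the closing </a>.
# The content is captured in group 1 so it can be kept when needed.
A_LINK = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def drop_links(html):
    """Remove each link together with the text it wraps."""
    return A_LINK.sub("", html)


def unwrap_links(html):
    """Remove only the <a> markup and keep the text it wrapped."""
    return A_LINK.sub(r"\1", html)


sample = 'Read <a href="page.html">this page</a> first.'
without_links = drop_links(sample)
text_kept = unwrap_links(sample)
```

Which variant fits depends on whether the link text should survive. The question asks to remove the link but not the text, which points to the second one.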
Parsing it with the sole purpose of reassembling the whole document while discarding just a part of the information would yield a lot of unneeded code.
So, I think this is a better job for regular expressions. Python's regular expressions can take a callback function (re.sub accepts a function as the replacement) that lets you customize the substitution string. In this case, it is a simple matter of creating a regexp that matches the "bad link" opening tag, the text in between, and the closing link markup, and preserves only the text in between.
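The idea can be sketched roughly as follows. The wildcard semantics are an assumption: here "*" is taken to mean "any run of characters that does not cross a tag boundary". The helper names filter_to_regex, strip_anchor_tags and apply_filters are made up for illustration. The first filter line in the question, which lists alternatives separated by commas and "or", would have to be split into one filter per link before being passed in.

```python
import re


def filter_to_regex(filter_line):
    """Turn one wildcard filter line into a compiled regex (assumed semantics)."""
    # Assumption: "*" stands for any characters except "<" and ">", so the same
    # wildcard works inside "<a*>" (attributes) and inside the link text.
    parts = [re.escape(part) for part in filter_line.split("*")]
    return re.compile("[^<>]*?".join(parts), re.IGNORECASE)


# Matches an opening or closing anchor tag on its own.
ANCHOR_TAG = re.compile(r"</?a\b[^>]*>", re.IGNORECASE)


def strip_anchor_tags(match):
    """Callback for re.sub: keep the matched text, drop only the <a> markup."""
    return ANCHOR_TAG.sub("", match.group(0))


def apply_filters(html, filter_lines):
    """Apply every non-empty filter line to the HTML, one after another."""
    for line in filter_lines:
        line = line.strip()
        if line:
            html = filter_to_regex(line).sub(strip_anchor_tags, html)
    return html


filters = [
    "<a*>A bad link*</a>",
    "some text* <a*>update*</a>",
    "other text right before link <a*>click here</a>",
]

html = (
    '<p><a href="/ok">Good link</a> and <a href="/bad">A bad link here</a>.</p>'
    '<p>some text first <a href="/u">update now</a></p>'
)
cleaned = apply_filters(html, filters)
```

Because the callback receives the whole match, the text surrounding the link in a filter such as "some text* <a*>update*</a>" is left in place and only the anchor tags disappear. Links that match no filter, like "Good link" above, are untouched.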