如何使用 lxml 和 iterlinks 替换链接

发布于 2024-11-03 09:13:32 字数 386 浏览 3 评论 0 原文

我是 lxml 新手，我正在尝试弄清楚如何使用 iterlinks() 重写链接。

import lxml.html
html = lxml.html.document_fromstring(doc)
for element, attribute, link, pos in html.iterlinks():
    if attibute == "src":
         link = link.replace('foo', 'bar')
print lxml.html.tostring(html)

然而，这实际上并没有取代链接。我知道我可以使用 .rewrite_links，但 iterlinks 提供有关每个链接的更多信息，所以我更愿意使用它。

提前致谢。

原文

I'm new to lxml and I'm trying to figure how to rewrite links using iterlinks().

import lxml.html
html = lxml.html.document_fromstring(doc)
for element, attribute, link, pos in html.iterlinks():
    if attibute == "src":
         link = link.replace('foo', 'bar')
print lxml.html.tostring(html)

However, this doesn't actually replace the links. I know I can use .rewrite_links, but iterlinks provides more information about each link, so I would prefer to use this.

Thanks in advance.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

鸢与 2024-11-10 09:13:32

您必须更改元素本身，而不是仅仅为变量名称 link 分配新的（字符串）值，在本例中是通过设置其 src 属性：

new_src = link.replace('foo', 'bar') # or element.get('src').replace('foo', 'bar')
element.set('src', new_src)

请注意 -如果您知道您感兴趣的“链接”，例如，仅 img 元素 - 您还可以使用 .findall() （或 xpath 或 css）获取元素选择器）而不是使用.iterlinks()。

Instead of just assigning a new (string) value to the variable name link, you have to alter the element itself, in this case by setting its src attribute:

new_src = link.replace('foo', 'bar') # or element.get('src').replace('foo', 'bar')
element.set('src', new_src)

Note that - if you know which "links" you are interested in, for example, only img elements - you can also get the elements by using .findall() (or xpath or css selectors) instead of using .iterlinks().

回复收藏 0 原文

一桥轻雨一伞开 2024-11-10 09:13:32

这是使用 rewrite_links 的工作代码：

from lxml.html import fromstring, tostring

e = fromstring("<html><body><a href='http://localhost'>hello</body></html>")

def my_rewriter(link):
  return "http://newlink.com"

e.rewrite_links(my_rewriter)
print(tostring(e))

输出：

    b'<html><body><a href="http://newlink.com">hello</a></body></html>'

Here is working code with rewrite_links:

from lxml.html import fromstring, tostring

e = fromstring("<html><body><a href='http://localhost'>hello</body></html>")

def my_rewriter(link):
  return "http://newlink.com"

e.rewrite_links(my_rewriter)
print(tostring(e))

Output:

    b'<html><body><a href="http://newlink.com">hello</a></body></html>'

回复收藏 0 原文

可是我不能没有你 2024-11-10 09:13:32

lxml 提供了一个 rewrite_links 方法（或将要解析的文本传递到文档中的函数）来提供更改文档中所有链接的方法：

.rewrite_links（link_repl_func，resolve_base_href = True，base_href =无）：
这将使用给定的链接替换功能重写文档中的所有链接。如果给定一个base_href值，所有的链接都会在与这个URL连接后传入。
对于每个链接，都会调用 link_repl_func(link) 。然后，该函数返回新链接，或者返回 None 以删除包含该链接的属性或标签。请注意，所有链接都将被传入，包括“#anchor”（纯粹是内部的）之类的链接，以及“mailto:[电子邮件受保护]"（或JavaScript:...)。