How to replace links using lxml and iterlinks

Posted 2024-11-03 09:13:32

I'm new to lxml and I'm trying to figure out how to rewrite links using iterlinks().

import lxml.html

html = lxml.html.document_fromstring(doc)  # doc holds the HTML source text
for element, attribute, link, pos in html.iterlinks():
    if attribute == "src":
        link = link.replace('foo', 'bar')
print(lxml.html.tostring(html))

However, this doesn't actually replace the links. I know I can use .rewrite_links, but iterlinks provides more information about each link, so I would prefer to use this.

Thanks in advance.

Comments (4)

鸢与 2024-11-10 09:13:32

Instead of just assigning a new (string) value to the variable name link, you have to alter the element itself, in this case by setting its src attribute:

new_src = link.replace('foo', 'bar') # or element.get('src').replace('foo', 'bar')
element.set('src', new_src)

Note that if you know which "links" you are interested in (for example, only img elements), you can also get the elements with .findall() (or xpath or css selectors) instead of using .iterlinks().
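
As a minimal sketch of the approach above, the loop below writes the new value back with element.set() instead of rebinding link. The helper names (replace_in_src, replace_img_src), the sample HTML, and the foo/bar strings are made up for illustration. The second function shows the xpath alternative for the case where only img elements matter.

```python
import lxml.html

# Hypothetical sample input, used only for this illustration.
SAMPLE = (
    '<html><body>'
    '<img src="/foo/a.png">'
    '<a href="/foo/page">page</a>'
    '</body></html>'
)

def replace_in_src(html_text, old, new):
    """Rewrite src attributes while iterating with iterlinks()."""
    doc = lxml.html.document_fromstring(html_text)
    for element, attribute, link, pos in doc.iterlinks():
        if attribute == "src":
            # Change the element itself; assigning to `link` would only
            # rebind the local variable.
            element.set(attribute, link.replace(old, new))
    return lxml.html.tostring(doc, encoding="unicode")

def replace_img_src(html_text, old, new):
    """Same effect for img elements only, selected with xpath."""
    doc = lxml.html.document_fromstring(html_text)
    for img in doc.xpath("//img[@src]"):
        img.set("src", img.get("src").replace(old, new))
    return lxml.html.tostring(doc, encoding="unicode")

via_iterlinks = replace_in_src(SAMPLE, "foo", "bar")
via_xpath = replace_img_src(SAMPLE, "foo", "bar")
print(via_iterlinks)
```

In this sample only the img src changes; the a href is left alone because its attribute is href, not src.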

一桥轻雨一伞开 2024-11-10 09:13:32

Here is working code with rewrite_links:

from lxml.html import fromstring, tostring

e = fromstring("<html><body><a href='http://localhost'>hello</body></html>")

def my_rewriter(link):
  return "http://newlink.com"

e.rewrite_links(my_rewriter)
print(tostring(e))

Output:

    b'<html><body><a href="http://newlink.com">hello</a></body></html>'

可是我不能没有你 2024-11-10 09:13:32

lxml provides a rewrite_links method (also available as a function that takes the text and parses it into a document for you) for changing all links in a document:

.rewrite_links(link_repl_func, resolve_base_href=True, base_href=None):
This rewrites all the links in the document using your given link replacement function. If you give a base_href value, all links will be passed in after they are joined with this URL.
For each link, link_repl_func(link) is called. That function then returns the new link, or None to remove the attribute or tag that contains the link. Note that all links will be passed in, including links like "#anchor" (which is purely internal), and things like "mailto:" addresses (or javascript:...).
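
The behaviour described above can be sketched as follows. This is only an illustration: the repl function, the sample markup, and the example.org address are assumptions, not part of the lxml documentation. It returns internal and mailto:/javascript: links unchanged, returns None for one link to drop its attribute, and rewrites the rest.

```python
from lxml.html import fromstring, tostring

# Hypothetical sample fragment; the address is a placeholder.
html_text = (
    '<div>'
    '<a href="#top">top</a>'
    '<a href="mailto:someone@example.org">mail</a>'
    '<img src="/foo/logo.png">'
    '<a href="/foo/old">old</a>'
    '</div>'
)

def repl(link):
    # Every link is passed in, so filter out the ones to leave alone.
    if link.startswith(("#", "mailto:", "javascript:")):
        return link
    # Returning None removes the attribute that holds the link.
    if link.endswith("/old"):
        return None
    return link.replace("foo", "bar")

doc = fromstring(html_text)
doc.rewrite_links(repl)
rewritten = tostring(doc, encoding="unicode")
print(rewritten)
```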

琴流音 2024-11-10 09:13:32

link is probably just a copy of the actual value. Try replacing the attribute on the element in your loop. Even element could be just a copy, but it is worth a try...
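
A quick way to check this guess is to compare the two variants side by side. The sketch below uses a made-up one-image fragment: rebinding link leaves the tree untouched, while calling set() on the yielded element does change it.

```python
import lxml.html

# Hypothetical fragment for illustration only.
doc = lxml.html.fromstring('<div><img src="/foo/a.png"></div>')

# Variant 1: rebind the loop variable; the document is not modified.
for element, attribute, link, pos in doc.iterlinks():
    link = link.replace("foo", "bar")
after_rebinding = doc.find(".//img").get("src")

# Variant 2: set the attribute on the yielded element.
for element, attribute, link, pos in doc.iterlinks():
    element.set(attribute, link.replace("foo", "bar"))
after_set = doc.find(".//img").get("src")

print(after_rebinding)  # /foo/a.png
print(after_set)        # /bar/a.png
```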
