使用 Parser 替换所有 IMG 元素的 SRC

发布于 2024-08-07 19:33:53 字数 231 浏览 6 评论 0原文

我正在寻找一种方法来替换所有不使用正则表达式的 IMG 标签中的 SRC 属性。（想要使用默认 Python 安装中包含的任何开箱即用的 HTML 解析器）我需要减少源代码：

<img src="cid:imagename">

我正在尝试替换所有 src 标签以指向附件的 cid对于 HTML 电子邮件，因此我还需要更改源内容，因此它只是不带路径或扩展名的文件名。

原文

I am looking for a way to replace the SRC attribute in all IMG tags not using Regular expressions. (Would like to use any out-of-the box HTML parser included with default Python install) I need to reduce the source from what ever it may be to:

<img src="cid:imagename">

I am trying to replace all src tags to point to the cid of an attachment for an HTML email so I will also need to change whatever the source is so it's simply the file name without the path or extension.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

伏妖词 2024-08-14 19:33:53

Python 标准库中有一个 HTML 解析器，但它不是很有用，并且自 Python 2.6 以来已被弃用。使用 BeautifulSoup 做这种事情非常简单：

from BeautifulSoup import BeautifulSoup
from os.path import basename, splitext
soup = BeautifulSoup(my_html_string)
for img in soup.findAll('img'):
    img['src'] = 'cid:' + splitext(basename(img['src']))[0]
my_html_string = str(soup)

There is a HTML parser in the Python standard library, but it’s not very useful and it’s deprecated since Python 2.6. Doing this kind of things with BeautifulSoup is really easy:

from BeautifulSoup import BeautifulSoup
from os.path import basename, splitext
soup = BeautifulSoup(my_html_string)
for img in soup.findAll('img'):
    img['src'] = 'cid:' + splitext(basename(img['src']))[0]
my_html_string = str(soup)

回复收藏 0 原文

悲歌长辞 2024-08-14 19:33:53

这是解决您的问题的 pyparsing 方法。您需要编写自己的代码来转换 http src 属性。

from pyparsing import *
import urllib2

imgtag = makeHTMLTags("img")[0]

page = urllib2.urlopen("http://www.yahoo.com")
html = page.read()
page.close()

# print html

def modifySrcRef(tokens):
    ret = "<img"
    for k,i in tokens.items():
        if k in ("startImg","empty"): continue
        if k.lower() == "src":
            # or do whatever with this
            i = i.upper() 
        ret += ' %s="%s"' % (k,i)
    return ret + " />"

imgtag.setParseAction(modifySrcRef)

print imgtag.transformString(html)

标签转换为：

<img src="HTTP://L.YIMG.COM/A/I/WW/BETA/Y3.GIF" title="Yahoo" height="44" width="232" alt="Yahoo!" />
<a href="r/xy"><img src="HTTP://L.YIMG.COM/A/I/WW/TBL/ALLYS.GIF" height="20" width="138" alt="All Yahoo! Services" border="0" /></a>

Here is a pyparsing approach to your problem. You'll need to do your own code to transform the http src attribute.

from pyparsing import *
import urllib2

imgtag = makeHTMLTags("img")[0]

page = urllib2.urlopen("http://www.yahoo.com")
html = page.read()
page.close()

# print html

def modifySrcRef(tokens):
    ret = "<img"
    for k,i in tokens.items():
        if k in ("startImg","empty"): continue
        if k.lower() == "src":
            # or do whatever with this
            i = i.upper() 
        ret += ' %s="%s"' % (k,i)
    return ret + " />"

imgtag.setParseAction(modifySrcRef)

print imgtag.transformString(html)

The tags convert to:

<img src="HTTP://L.YIMG.COM/A/I/WW/BETA/Y3.GIF" title="Yahoo" height="44" width="232" alt="Yahoo!" />
<a href="r/xy"><img src="HTTP://L.YIMG.COM/A/I/WW/TBL/ALLYS.GIF" height="20" width="138" alt="All Yahoo! Services" border="0" /></a>

回复收藏 0 原文

~没有更多了~