Parsing an HTML file with BeautifulSoup in Python

Posted 2024-12-04 08:53:56

I am using BeautifulSoup to replace all the commas in an HTML file with &sbquo;. Here is my code:

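import re
import sys

from bs4 import BeautifulSoup  # assumes BeautifulSoup 4

with open(sys.argv[1], "r") as f:
    data = f.read()

soup = BeautifulSoup(data, "html.parser")

comma = re.compile(',')

# Replace the commas in every text node that contains one
for t in soup.findAll(text=comma):
    t.replaceWith(t.replace(',', '&sbquo;'))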

This code works except when the HTML file includes JavaScript. In that case it also replaces the commas inside the JavaScript code, which I do not want. I only want to replace the commas in the text content of the HTML file.


征棹 2024-12-11 08:53:56

soup.findAll can take a callable, which lets you skip the tags whose text should be left alone:

tags_to_skip = set(["script", "style"])
# Add to this set as needed

def valid_tags(tag):
    """Filter tags on the basis of their tag names

    If the tag name is found in ``tags_to_skip`` then
    the tag is dropped.  Otherwise, it is kept.
    """
    return tag.name.lower() not in tags_to_skip

for tag in soup.findAll(valid_tags):
    # Only touch text nodes that are direct children of a kept tag;
    # ``comma`` is the compiled pattern from the question
    for t in tag.findAll(text=comma, recursive=False):
        t.replaceWith(t.replace(',', '‚'))
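For reference, the same idea can be sketched as a self-contained script. This is only an illustration under some assumptions: it targets BeautifulSoup 4 (the bs4 package) with the built-in html.parser, it uses the newer find_all / string= / replace_with spellings, and the function name replace_commas is made up for the example. It inserts the U+201A character itself rather than the literal text "&sbquo;", because bs4 escapes a bare "&" when it serializes the document.

```python
import re

from bs4 import BeautifulSoup

# Tags whose text should be left untouched (extend as needed).
TAGS_TO_SKIP = {"script", "style"}
COMMA = re.compile(",")


def replace_commas(html, replacement="\u201a"):
    """Return html with commas in ordinary text nodes replaced.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list, so replacing nodes while looping is safe.
    for node in soup.find_all(string=COMMA):
        parent = node.parent
        if parent is not None and parent.name in TAGS_TO_SKIP:
            continue  # leave JavaScript and CSS alone
        node.replace_with(node.replace(",", replacement))
    return str(soup)


sample = "<p>a, b</p><script>f(1, 2);</script>"
result = replace_commas(sample)
print(result)
```

Checking the parent of each text node, instead of filtering tags first, also covers text that sits directly at the top level of the document. If the named entity is needed in the output, serializing with soup.decode(formatter="html") instead of str(soup) should emit &sbquo; for that character.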
