Cleaning up text after html2txt

Posted 2024-12-11 12:43:27

I am using lxml to convert HTML into plain text. The parsing, the conversion, and part of the cleanup (tabs, spaces, empty lines) are working, and the program is up and running.

However, after running my code on about a hundred HTML pages (all from different sites), I noticed some unwanted output, i.e. lines like:

#wrapper #PrimaryNav {margin:0;*overflow:hidden;}
a.scbbtnred{background-position:right -44px;}
a.scbbtnblack{background-position:right -176px;}
.ghsearch{width:58px;height:21px;line-height:21px;background-position:0 -80px;}
a.scbbtnred span span{background-color:#f00;background-position:0 -22px;}

I assume these are CSS, or something else related to web programming, but I am totally unfamiliar with it.

Questions: What are these lines? And any suggestions for how to get rid of them?

Edit: For anyone who finds this post in the future, here is the code I had working before asking this question (I am new to Python, so a lot here can be improved, but it works OK for me):

# Function for html2txt using lxml
# Author:
# http://groups.google.com/group/cn.bbs.comp.lang.python/browse_thread/thread/781a357e2ce66ce8
import os

import lxml.etree


def html2text(html):
    # Accept either raw markup (str or bytes) or an already parsed tree
    if isinstance(html, (str, bytes)):
        tree = lxml.etree.fromstring(html, lxml.etree.HTMLParser())
    else:
        tree = html
    # Drop elements whose content is code rather than readable text
    for skiptag in ('//script', '//iframe', '//style'):
        for node in tree.xpath(skiptag):
            # Note: remove() also discards the text that follows the node (its tail)
            node.getparent().remove(node)
    # return lxml.etree.tounicode(tree, method='text')
    return lxml.etree.tostring(tree, encoding='unicode', method='text')


# Function for cleaning up the text:
# 1: strip leading and trailing tabs and spaces from every line
# 2: remove short lines (this also removes empty lines)
def textcleanup(text):
    # Temporary list of the lines that are kept
    text_list = []
    for s in text.splitlines():
        # Strip out meaningless spaces and tabs
        s = s.strip()
        # Keep only lines longer than 35 characters
        if len(s) > 35:
            text_list.append(s)
    return os.linesep.join(text_list)

戏剧牡丹亭 2024-12-18 12:43:27

That is indeed CSS. You're getting a document like this:

<style>
#wrapper #PrimaryNav {margin:0;*overflow:hidden;}
a.scbbtnred{background-position:right -44px;}
a.scbbtnblack{background-position:right -176px;}
.ghsearch{width:58px;height:21px;line-height:21px;background-position:0 -80px;}
a.scbbtnred span span{background-color:#f00;background-position:0 -22px;}
</style>
<div>
    <p>This bit is HTML</p>
</div>

You need to remove all <style> elements, including their contents, before extracting the text.
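
One possible way to do that with lxml is sketched below. It relies on lxml.etree.strip_elements, which deletes the named elements together with everything inside them. The function name strip_non_text and the sample markup are made up for illustration, and treating script and iframe the same way as style is an assumption carried over from the question's code.

```python
import lxml.etree


def strip_non_text(html):
    """Return the text of an HTML string without style/script/iframe content.

    Hypothetical helper for illustration; not part of lxml.
    """
    tree = lxml.etree.fromstring(html, lxml.etree.HTMLParser())
    # with_tail=False keeps the text that follows each removed element
    lxml.etree.strip_elements(tree, 'style', 'script', 'iframe', with_tail=False)
    return lxml.etree.tostring(tree, encoding='unicode', method='text')


# Made-up sample modelled on the document shown above
sample = """<html><head><style>
a.scbbtnred{background-position:right -44px;}
</style></head>
<body><div><p>This bit is HTML</p></div></body></html>"""

text = strip_non_text(sample)
print(text.strip())
```

Passing with_tail=False matters when a removed element sits in the middle of a paragraph: by default strip_elements also drops the text that immediately follows the element.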
