How can I render HTML as text with Perl, the way Lynx does?

Posted on 2024-08-15 20:21:29

Possible Duplicate:
Which CPAN module would you recommend for turning HTML into plain text?

Question:

Is there a module that renders HTML specifically to gather the text, while honoring font-style tags such as <tt>, <b>, <i>, etc. and the line-break tag <br>, similar to Lynx?

For example:

# cat test.html

<body>  
<div id="foo" class="blah">  
<tt>test<br>
<b>test</b><br>
whatever<br>
test</tt>
</div>
</body>

# lynx.exe --dump test.html

test
test
whatever
test

Note: the second line should be bold.

〗斷ホ乔殘χμё〖 2024-08-22 20:21:29

Lynx is a big program, and reproducing its HTML rendering would be non-trivial.

How about this:

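my $lynx = '/path/to/lynx';
my $html = '<p>html here</p>';    # placeholder: the HTML to render
# Quoting the heredoc delimiter ('EOF') stops the shell from expanding
# $variables or backticks that happen to appear in the HTML.
my $txt = `$lynx --dump --width 9999 -stdin <<'EOF'\n$html\nEOF\n`;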
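
If you would rather not route the HTML through a shell at all, one option is to write it to a temporary file and run the renderer with a list-form pipe open. The following is only a sketch under a few assumptions: the helper name dump_html_with is made up for illustration, the lynx path is a placeholder just as above, and list-form pipe open may need a fairly recent Perl on Windows. It uses only the core File::Temp module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp ();

# Hypothetical helper: write $html to a temporary .html file, run the
# given command on that file, and return whatever the command prints.
sub dump_html_with {
    my ($cmd, $html) = @_;    # $cmd is an array ref, e.g. ['/path/to/lynx', '-dump']

    # The .html suffix lets the renderer recognise the file as HTML.
    my $tmp = File::Temp->new(SUFFIX => '.html');
    print {$tmp} $html;
    close $tmp or die "Can't write temp file: $!\n";

    # List-form pipe open: no shell is involved, so nothing in $html
    # or in the file name is interpreted as shell syntax.
    open(my $out, '-|', @$cmd, $tmp->filename)
        or die "Can't run $cmd->[0]: $!\n";
    local $/;                 # slurp mode: read all output at once
    my $txt = <$out>;
    close $out;
    return $txt;              # the temp file is removed when $tmp goes out of scope
}

# Example call (the lynx path is a placeholder):
# my $txt = dump_html_with(['/path/to/lynx', '-dump', '-width=9999'], $html);
```

Because the command is passed as a list, the same helper works with any renderer that takes a file name as its last argument.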

染年凉城似染瑾 2024-08-22 20:21:29

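Go to search.cpan.org and search for "HTML text"; that will turn up plenty of options for your particular needs. HTML::FormatText is a good baseline, and from there you can branch out to specific variants such as HTML::FormatText::WithLinks if you want to keep links as footnotes.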
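
As a rough illustration of the baseline, here is a minimal sketch that runs the sample markup from the question through HTML::FormatText. It assumes the HTML-Formatter distribution (and HTML::TreeBuilder, which it relies on) is installed from CPAN; the margin values are arbitrary. As far as I know this module emits plain text only, so the bold line will not be marked in any way.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::FormatText;    # from the HTML-Formatter distribution on CPAN

# The sample markup from the question.
my $html = <<'HTML';
<body>
<div id="foo" class="blah">
<tt>test<br>
<b>test</b><br>
whatever<br>
test</tt>
</div>
</body>
HTML

# format_string parses the markup and returns the rendered plain text.
# leftmargin and rightmargin set the text columns; the values are arbitrary.
my $text = HTML::FormatText->format_string(
    $html,
    leftmargin  => 0,
    rightmargin => 72,
);

print $text;
```

To read from a file instead, format_file takes a file name in place of the string.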

余生一个溪 2024-08-22 20:21:29

I am on Windows, so I cannot fully test this, but you can adapt the htext example that comes with HTML::Parser:

#!/usr/bin/perl

use strict;
use warnings;

use HTML::Parser 3.00 ();
use Term::ANSIColor;

my %inside;

sub tag {
   my($tag, $num) = @_;
   $inside{$tag} += $num;
   print " ";  # crude word separator; ideally printed only for some tags, not all
}

sub text {
    return if $inside{script} || $inside{style};
    my $esc = 1;
    if ( $inside{b} or $inside{strong} ) {
        print color 'blue';
    }
    elsif ( $inside{i} or $inside{em} ) {
        print color 'yellow';
    }
    else {
        $esc = 0;
    }
    print $_[0];
    print color 'reset' if $esc;
}

HTML::Parser->new(api_version => 3,
    handlers => [
        start => [\&tag, "tagname, '+1'"],
        end   => [\&tag, "tagname, '-1'"],
        text  => [\&text, "dtext"],
    ],
    marked_sections => 1,
)->parse_file(shift) || die "Can't open file: $!\n";
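
The script above prints a space for every tag and colours bold text blue. For output closer to the Lynx dump in the question, a variation along the following lines might work. It is a sketch, not a tested replacement: the helper name html_to_ansi is made up, it handles only <br>, <b> and <strong>, it does not skip script or style content the way the script above does, and it assumes a terminal that understands ANSI escapes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser 3.00 ();
use Term::ANSIColor qw(color);

# Hypothetical helper: returns the text of $html with <br> rendered as a
# newline and <b>/<strong> content wrapped in ANSI bold escapes.
sub html_to_ansi {
    my ($html) = @_;
    my $out  = '';
    my $bold = 0;    # nesting depth of <b>/<strong>

    my $p = HTML::Parser->new(
        api_version   => 3,
        unbroken_text => 1,    # deliver each text run in one piece
        start_h => [ sub {
            my ($tag) = @_;
            $out .= "\n" if $tag eq 'br';
            $bold++      if $tag eq 'b' || $tag eq 'strong';
        }, 'tagname' ],
        end_h => [ sub {
            my ($tag) = @_;
            $bold-- if $bold && ($tag eq 'b' || $tag eq 'strong');
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            $text =~ s/\s+/ /g;    # collapse source whitespace
            # Drop a leading space when the output is empty or already
            # ends in whitespace (for example right after a <br>).
            $text =~ s/^ // if $out eq '' || $out =~ /\s\z/;
            return if $text eq '';
            $out .= $bold ? color('bold') . $text . color('reset') : $text;
        }, 'dtext' ],
    );
    $p->parse($html);
    $p->eof;
    return $out;
}

print html_to_ansi("<tt>test<br>\n<b>test</b><br>\nwhatever<br>\ntest</tt>\n"), "\n";
```

Collecting the text into a string rather than printing from the handlers makes it easier to post-process the result or write it to a file.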
