How can I use a Perl regular expression to replace multiple words in an HTML attribute, where each word maps to a replacement through a hash?

Posted on 2024-07-30 20:41:33

I'm writing an HTML obfuscator, and I have a hash that maps user-friendly names (ids and classes) to obfuscated names (such as a, b, c, etc.). I'm having trouble coming up with a regexp that replaces something like

<div class="left tall">

with

<div class="a b">

If tags could only accept one class, the regexp would simply be something like

s/(class|id)="(.*?)"/$1="$hash{$2}"/

How should I correct this to account for multiple class names within the quotes? Preferably, the solution should be Perl compatible.

Comments (2)

﹏半生如梦愿梦如真 2024-08-06 20:41:33

You shouldn't be using a regex for this in the first place. You are trying to do too much with one regex (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
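To make the concern concrete, here is a minimal sketch of two inputs where the one-line substitution from the question goes wrong. The `%hash` contents, the `naive` helper, and the sample strings are assumptions made up for illustration; they are not from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical mapping, for illustration only.
my %hash = (foo => 'f');

# The single-class substitution from the question, applied globally.
sub naive {
    my ($html) = @_;
    $html =~ s/(class|id)="(.*?)"/$1="$hash{$2}"/g;
    return $html;
}

my @samples = (
    q{<span class='foo'>x</span>},     # single-quoted attribute: not matched
    q{<p>Write class="foo" here</p>},  # plain text, not an attribute: rewritten anyway
);

for my $html (@samples) {
    print $html, "\n  => ", naive($html), "\n";
}
```

The first sample passes through unchanged even though it carries a class, and the second has its visible text altered. A parser sidesteps both cases because it hands you real attributes only.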

Take a look at HTML::Parser. Here is a, probably incomplete, implementation:

#!/usr/bin/perl

use strict;
use warnings;

use HTML::Parser;

{
    my %map = (
        foo => "f",
        bar => "b",
    );

    sub start {
        my ($tag, $attr) = @_;
        my $attr_string = '';
        for my $key (keys %$attr) {
            if ($key eq 'class') {
                my @classes = split " ", $attr->{$key};
                # Use exists rather than || so that a mapped value
                # of "0" is not mistaken for a missing entry;
                # unmapped class names pass through unchanged.
                $attr->{$key} = join " ",
                    map { exists $map{$_} ? $map{$_} : $_ } @classes;
            }
            $attr_string .= qq/ $key="$attr->{$key}"/;
        }

        print "<$tag$attr_string>";
    }
}

sub text {
    print shift;
}

sub end {
    my $tag = shift;
    print "</$tag>";
}

my $p = HTML::Parser->new(
    start_h => [ \&start, "tagname,attr" ],
    text_h  => [ \&text, "text" ],  # "text" passes entities through as written; "dtext" would decode them
    end_h   => [ \&end, "tagname" ],
);

$p->parse_file(\*DATA);

__DATA__
<html>
    <head>
        <title>foo</title>
    </head>
    <body>
        <span class="foo">Foo!</span> <span class="bar">Bar!</span>
        <span class="foo bar">Foo Bar!</span>
        This should not be touched: class="foo"
    </body>
</html>
东京女 2024-08-06 20:41:33

I guess I'd do this:

s/  
    (class|id)="([^"]+)"
/   
    $1 . '="' . (
        join ' ', map { $hash{$_} } split m!\s+!, $2
    ) . '"'
/ex;
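
The substitution above can be tried out with a small script. This is only a sketch: the `%hash` contents and the `obfuscate` wrapper are assumptions for illustration, and it differs from the snippet above in three labeled ways. It adds `/g` so every attribute on a line is handled, it copies `$1` and `$2` into lexicals before doing any further work, and it leaves names that are missing from the hash unchanged instead of substituting an undefined value.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical mapping from readable names to obfuscated ones.
my %hash = (left => 'a', tall => 'b', main => 'c');

sub obfuscate {
    my ($html) = @_;
    $html =~ s{
        \b(class|id)="([^"]+)"
    }{
        # Copy the captures first so nothing later can disturb them.
        my ($attr, $names) = ($1, $2);
        # split ' ' drops leading whitespace and splits on runs of whitespace.
        my @mapped = map { exists $hash{$_} ? $hash{$_} : $_ }
                     split ' ', $names;
        $attr . '="' . join(' ', @mapped) . '"';
    }gex;
    return $html;
}

print obfuscate('<div id="main" class="left  tall other">'), "\n";
# prints: <div id="c" class="a b other">
```

Like any regex-only approach, this still rewrites text that merely looks like an attribute and ignores single-quoted or unquoted values, so it is best suited to markup you control.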