Removing whitespace from XML tags

Posted 2024-11-27 16:16:15

I'm trying to write a Perl script that removes whitespace from XML tags but leaves whitespace inside the values alone. For example, let's say I have:

<Example>This is an example.</Exampl   e>

What I'm looking to accomplish is to remove the whitespace specifically in </Exampl e>. Since this will be working on an entire XML document, I figured I'd do something with the substitution operator, but I can't quite figure out how to match only the whitespace that is inside the XML tags themselves.

Any help is greatly appreciated!

Edit: I've added a real example of what is occurring:

not well-formed (invalid token) at line 42, column 25, byte 1456:
                    <Artist>Eminem</Artist>
                    <FileName>eminem feat lil wayne - no love -
hotnewhiphop com(2).mp3</    FileName>
========================^
                    <FileSize>4804478</FileSize>

尴尬癌患者 2024-12-04 16:16:15

s!(</?\w+)\s+(\w+\s*/?>)!$1$2!g;
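
As a rough sketch of how this substitution behaves on the inputs from the question: it only joins a tag name that was split in two (with `\s*` before the closing bracket so that `e>` can match), so the `</    FileName>` case, where the whitespace comes directly after the slash, needs a second pattern. The helper name `fix_simple_tags` and that extra pattern are assumptions added for illustration, and both are only safe for tags without attributes.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer): repairs attribute-free tags only.
sub fix_simple_tags {
    my ($xml) = @_;
    # Assumed extra step: drop whitespace directly after "<" or "</",
    # as in "</    FileName>".
    $xml =~ s!(</?)\s+(?=\w)!$1!g;
    # Join a tag name that was split in two, as in "</Exampl   e>".
    $xml =~ s!(</?\w+)\s+(\w+\s*/?>)!$1$2!g;
    return $xml;
}

print fix_simple_tags('<Example>This is an example.</Exampl   e>'), "\n";
# prints: <Example>This is an example.</Example>
print fix_simple_tags('<FileName>no love.mp3</    FileName>'), "\n";
# prints: <FileName>no love.mp3</FileName>
```

Whitespace in the element text ("This is an example.") is left alone because neither pattern can start matching anywhere except at a `<`.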

If the tags can carry attributes, it gets more complex, because whitespace is legitimate inside a tag: it separates the tag name from the attributes. You pretty much have to find the "words" that are not followed by an equals sign (or by whitespace and then an equals sign) and marry them to the previous unquoted word.

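sub marry_inner_splits {
    # "my $_" is a compile error since Perl 5.24, so localize the global $_ instead.
    local $_ = shift;
    # Join a split tag name; the optional leading slash is captured so closing tags keep it.
    s|^(/?\w+)\s+(\w+)\b(?!\s*=)|$1$2|;
    # Find the first remaining space, which is where the attributes start.
    my $pos = index( $_, ' ' );
    # Return if there is no whitespace left.
    return $_ if $pos == -1;
    # Join split attribute names in the rest of the string (substr used as an lvalue).
    substr( $_, $pos ) =~ s/(\s*\w+)\s+(\w+\s*=\s*(?:"[^"]+"|'[^']+')\s*)/$1$2/g;
    return $_;
}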

my $tag_str = q{Some stuff before the tag <ta g attr1="val1" att   r2="value #2"     /></Escap   e>};
$tag_str =~ s/<([^>]+)>/'<' . marry_inner_splits($1) . '>'/ge;
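
For reference, here is a self-contained variant of the same idea that can be run as-is. It is only a sketch: the names `repair_tag_body` and `repair_tags` are made up, the first substitution (stripping whitespace right after the slash, as in the `</    FileName>` line from the question) is an assumption that goes beyond the routine above, and the attribute pattern is applied to the whole tag body instead of through `substr`. It relies on every XML attribute having an `=value` part, and it can still be fooled by quoted values that themselves contain something like `a b="c"`.

```perl
use strict;
use warnings;

# Hypothetical variant of marry_inner_splits, working on the text between "<" and ">".
sub repair_tag_body {
    my ($body) = @_;
    # Assumed extra step: drop whitespace between the optional slash and the name.
    $body =~ s{^(/?)\s+}{$1};
    # Join a split tag name unless the second piece is followed by "=".
    $body =~ s{^(/?\w+)\s+(\w+)\b(?!\s*=)}{$1$2};
    # Join split attribute names: 'att   r2="v"' becomes 'attr2="v"'.
    $body =~ s{(\s\w+)\s+(\w+\s*=\s*(?:"[^"]*"|'[^']*'))}{$1$2}g;
    return $body;
}

sub repair_tags {
    my ($xml) = @_;
    # /e evaluates the replacement as code, so each tag body is passed to the helper.
    $xml =~ s{<([^>]+)>}{'<' . repair_tag_body($1) . '>'}ge;
    return $xml;
}

print repair_tags(q{<ta g attr1="val1" att   r2="value #2" /></Escap   e>}), "\n";
# prints: <tag attr1="val1" attr2="value #2" /></Escape>
print repair_tags(q{<FileName>a - b.mp3</    FileName>}), "\n";
# prints: <FileName>a - b.mp3</FileName>
```

Running the result through a real XML parser afterwards is still a good idea, since a heuristic like this cannot tell every broken tag from a valid one.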

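The /e flag on the $tag_str substitution means that the replacement part is evaluated as a Perl expression instead of being treated as an interpolated string, so marry_inner_splits is called for every tag that matches.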
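
To make the difference concrete, here is a tiny standalone sketch of a substitution with and without /e (the sample string and variable names are made up for illustration):

```perl
use strict;
use warnings;

my $text = 'width=3 height=4';

# Without /e the replacement is just an interpolated string.
(my $plain = $text) =~ s/(\d+)/$1 * 2/g;
# With /e the replacement is evaluated as Perl code for each match.
(my $doubled = $text) =~ s/(\d+)/$1 * 2/ge;

print "$plain\n";    # prints: width=3 * 2 height=4 * 2
print "$doubled\n";  # prints: width=6 height=8
```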
