Cleaning up a string with multiple angle brackets

Posted on 2024-12-28 15:10:53

I have the following bit of HTML:

<div class="article">this is a div article content</div>

which is being "tagged" by an HTML-agnostic program on the words div, class and article, resulting in:

<<hl>div</hl> <hl>class</hl>="<hl>article</hl>">this is a <hl>div</hl> <hl>article</hl> content</<hl>div</hl>>

although what I really need is:

<div class="article">this is a <hl>div</hl> <hl>article</hl> content</div>

Since the output is utter garbage (even tools like HTML Tidy choke on it), I figured a regex replace would help strip out the extra <hl>s inside the HTML tag:

replace(/<([^>]*)<hl>([^<]*?)<\/hl>([^>]*?)>/g, '<$1$2$3>')

Now, this works but only replaces the first occurrence in the tag, that is, the div:

<div <hl>class</hl>="<hl>article</hl>">this is a <hl>div</hl> <hl>article</hl> content</div>

My question is: how do I replace all <hl>s inside the tag, so as to make sure the HTML remains valid?
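A likely reason only one pair is removed per tag is that matches in a global replace cannot overlap: each match runs from the tag's opening < to the first > after the highlighted word, and the scan then resumes after that point without revisiting the replaced text. One way around this, sketched here in Python rather than JavaScript, is to re-apply the same substitution until a pass changes nothing. The function name strip_hl_inside_tags is illustrative and not part of the original post.

```python
import re

# Same pattern as the JavaScript regex above, written as a Python raw string.
NESTED_HL = re.compile(r'<([^>]*)<hl>([^<]*?)</hl>([^>]*?)>')

def strip_hl_inside_tags(html):
    """Re-apply the substitution until a pass makes no replacement."""
    while True:
        # subn returns the new string and the number of replacements made.
        html, count = NESTED_HL.subn(r'<\1\2\3>', html)
        if count == 0:
            return html

marked = ('<<hl>div</hl> <hl>class</hl>="<hl>article</hl>">'
          'this is a <hl>div</hl> <hl>article</hl> content</<hl>div</hl>>')
print(strip_hl_inside_tags(marked))
```

Every replacement shortens the string, so the loop always terminates. This sketch assumes the text between tags contains no stray < or > characters of its own.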

Additional notes:

I don't need the tag attributes at all (i.e. class="article" can disappear)
I can change <hl> and </hl> for any other strings
Yes, the output comes from Solr
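Since the markers can be changed, one option worth considering is to have the highlighter emit placeholder tokens that contain no angle brackets, so the injected markers can no longer be confused with HTML. The sketch below assumes hypothetical tokens [[HL]] and [[/HL]] and a hypothetical helper named restore_highlights; neither comes from the original post.

```python
import re

# Hypothetical bracket-free markers; any strings that never occur in the
# real content would do.
PRE, POST = "[[HL]]", "[[/HL]]"

def restore_highlights(marked):
    """Drop markers that landed inside tags, then turn the rest into <hl>."""
    def clean_tag(match):
        # Remove every marker found between a tag's '<' and '>'.
        return match.group(0).replace(PRE, "").replace(POST, "")

    # With bracket-free markers, each tag is a plain '<...>' run again.
    cleaned = re.sub(r"<[^<>]*>", clean_tag, marked)
    # Markers left in the text become the real highlight tags.
    return cleaned.replace(PRE, "<hl>").replace(POST, "</hl>")

marked = ('<[[HL]]div[[/HL]] [[HL]]class[[/HL]]="[[HL]]article[[/HL]]">'
          'this is a [[HL]]div[[/HL]] [[HL]]article[[/HL]] content'
          '</[[HL]]div[[/HL]]>')
print(restore_highlights(marked))
```

This relies on attribute values and text containing no literal < or > characters; input that does would need a real HTML parser instead.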

UPDATE: I accepted jcollado's answer, but I needed it in JavaScript. This is the equivalent code:

var stripIllegalTags = function(html) {
  var output = '',
    parsingTag = false;

  for (var i = 0; i < html.length; i++) {
    var character = html[i];

    if (character === '<') {
      if (parsingTag) {
        // A '<' inside an open tag starts an injected tag: skip to its '>'.
        // The length check ends the scan if that '>' is missing.
        while (i < html.length && html[i] !== '>') {
          i++;
        }
        continue;
      }
      parsingTag = true;
    } else if (character === '>') {
      parsingTag = false;
    }

    output += character;
  }

  return output;
};


习惯那些不曾习惯的习惯 2025-01-04 15:10:53

Maybe the piece of code below is helpful for you:

class HTMLCleaner(object):
    def parse(self, html):
        output = []
        parsing_tag = False

        html = iter(html)
        for char in html:
            if char == '<':
                if parsing_tag:
                    # A '<' inside an open tag starts an injected tag:
                    # consume characters up to and including its '>'.
                    # The default ends the loop if the input runs out first.
                    drop_char = next(html, '>')
                    while drop_char != '>':
                        drop_char = next(html, '>')
                    continue
                parsing_tag = True
            elif char == '>':
                parsing_tag = False

            output.append(char)

        return ''.join(output)

html = '<<hl>div</hl> <hl>class</hl>="<hl>article</hl>">this is a <hl>div</hl> <hl>article</hl> content</<hl>div</hl>>'

parser = HTMLCleaner()
print(parser.parse(html))

The output for the given input is:

<div class="article">this is a <hl>div</hl> <hl>article</hl> content</div>

which I believe is what you're looking for.

The code drops any tag that begins while another tag is still open, that is, before the closing > of the enclosing tag has been read.

About the author: 雨后彩虹
