Cleaning up a string with multiple angle brackets
I have the following bit of HTML:
<div class="article">this is a div article content</div>
which is being "tagged" by an HTML-agnostic program on the words div, class and article, resulting in:
<<hl>div</hl> <hl>class</hl>="<hl>article</hl>">this is a <hl>div</hl> <hl>article</hl> content</<hl>div</hl>>
although what I really need is:
<div class="article">this is a <hl>div</hl> <hl>article</hl> content</div>
Since the output is utter garbage (even tools like HTML Tidy choke on it), I figured a regex replace would help strip out the extra <hl>s inside the HTML tag:
replace(/<([^>]*)<hl>([^<]*?)<\/hl>([^>]*?)>/g, '<$1$2$3>')
Now, this works, but it only replaces the first occurrence in each tag (here, the one around div):
<div <hl>class</hl>="<hl>article</hl>">this is a <hl>div</hl> <hl>article</hl> content</div>
My question is: how do I replace all <hl>s inside the tag, so as to make sure the HTML remains valid?
Additional notes:
- I don't need the tag attributes at all (i.e. class="article" can disappear)
- I can change <hl> and </hl> for any other strings
- Yes, the output comes from Solr
UPDATE: I accepted jcollado's answer, but I needed this in JavaScript. This is the equivalent code:
var stripIllegalTags = function(html) {
    var output = '',
        parsingTag = false;
    for (var i = 0; i < html.length; i++) {
        var character = html[i];
        if (character === '<') {
            if (parsingTag) {
                // A '<' inside an open tag starts an illegal nested tag:
                // skip ahead to its closing '>' (or to the end of the input,
                // so an unterminated tag cannot cause an endless loop).
                while (i < html.length && html[i] !== '>') {
                    i++;
                }
                continue;
            }
            parsingTag = true;
        } else if (character === '>') {
            parsingTag = false;
        }
        output += character;
    }
    return output;
};
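For comparison, the regex from the question can probably be made to do the job as well. Each pass removes only one <hl> pair per enclosing tag, so one option is to reapply it until the string stops changing. The sketch below assumes that a stray '<' never appears in ordinary text and that attribute values contain no '>'; the names HL_IN_TAG and stripHlInsideTags are made up for illustration.

```javascript
// Regex from the question: removes one <hl>...</hl> pair per enclosing tag per pass.
const HL_IN_TAG = /<([^>]*)<hl>([^<]*?)<\/hl>([^>]*?)>/g;

// Hypothetical helper: reapply the replacement until a pass changes nothing.
// Every successful pass shortens the string, so the loop always terminates.
function stripHlInsideTags(html) {
  let previous;
  do {
    previous = html;
    html = html.replace(HL_IN_TAG, '<$1$2$3>');
  } while (html !== previous);
  return html;
}

const tagged = '<<hl>div</hl> <hl>class</hl>="<hl>article</hl>">' +
  'this is a <hl>div</hl> <hl>article</hl> content</<hl>div</hl>>';
console.log(stripHlInsideTags(tagged));
// <div class="article">this is a <hl>div</hl> <hl>article</hl> content</div>
```

The <hl> pairs in the text content survive because the pattern cannot cross a '>' between the opening '<' and the <hl> it is looking for.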
Comments (1)
Maybe the piece of code below is helpful for you:
The output for the given input is:
<div class="article">this is a <hl>div</hl> <hl>article</hl> content</div>
which I believe is what you're looking for.
The code basically drops any tag that starts while another tag is still being parsed.
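The question also notes that the <hl> and </hl> markers can be swapped for any other strings, which suggests a different route: have the highlighter emit markers that contain no angle brackets, delete them inside tags, and only then turn the remaining ones into real <hl> elements. The sketch below is one possible way to do that, not the accepted answer's code. The placeholder strings [[hl]] and [[/hl]] and the function name restoreHighlights are assumptions made for illustration (in Solr the markers are typically set through the hl.simple.pre and hl.simple.post parameters), and the tag pattern assumes attribute values never contain '>'.

```javascript
// Hypothetical placeholder markers; any strings without '<' or '>' would do.
const OPEN = '[[hl]]';
const CLOSE = '[[/hl]]';

// Hypothetical helper, not part of the original answer.
function restoreHighlights(html) {
  // Step 1: inside every tag, delete the placeholder markers.
  const cleaned = html.replace(/<[^>]*>/g, (tag) =>
    tag.split(OPEN).join('').split(CLOSE).join(''));
  // Step 2: the markers left over sit in text content; make them real elements.
  return cleaned.split(OPEN).join('<hl>').split(CLOSE).join('</hl>');
}

const marked = '<[[hl]]div[[/hl]] [[hl]]class[[/hl]]="[[hl]]article[[/hl]]">' +
  'this is a [[hl]]div[[/hl]] [[hl]]article[[/hl]] content</[[hl]]div[[/hl]]>';
console.log(restoreHighlights(marked));
// <div class="article">this is a <hl>div</hl> <hl>article</hl> content</div>
```

Because the markers never introduce angle brackets of their own, the tagged string stays parseable with a simple tag pattern and no nesting state has to be tracked.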