Why doesn't the Swing Parser's handleText handle nested tags?

Posted on 2024-08-08 09:59:22

I need to transform some HTML text that has nested tags, decorating each search match with a CSS class so that it is highlighted (like Firefox's find).
I can't just do a simple replace (think of a user searching for "img", for example), so I'm trying to replace only within the body text, not inside tag attributes.

I have a pretty straightforward HTML parser that I think should do this:

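// lastPos is assumed to be an instance field: the anonymous callback below
// assigns to it, which a local variable would not allow.
final Pattern pat = Pattern.compile(srch, Pattern.CASE_INSENSITIVE);
Matcher m = pat.matcher(output);
if (m.find()) {
    final StringBuffer ret = new StringBuffer(output.length() + 100);
    lastPos = 0;
    try {
        new ParserDelegator().parse(new StringReader(output.toString()),
            new HTMLEditorKit.ParserCallback() {
                @Override
                public void handleText(char[] data, int pos) {
                    // Copy the markup between the previous text run and this one unchanged.
                    ret.append(output.subSequence(lastPos, pos));
                    // Wrap every match inside this text run in a highlight span.
                    Matcher textMatcher = pat.matcher(new String(data));
                    ret.append(textMatcher.replaceAll("<span class=\"search\">$0</span>"));
                    lastPos = pos + data.length;
                }
            }, false);
        // Copy whatever markup follows the last text run.
        ret.append(output.subSequence(lastPos, output.length()));
        return ret;
    } catch (Exception e) {
        // Fall back to the unmodified input if parsing fails.
        return output;
    }
}
return output;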

My problem is, when I debug this, the handleText is getting called with text that includes tags! It's like it's only going one level deep. Anyone know why? Is there some simple thing I need to do to HTMLParser (haven't used it much) to enable 'proper' behavior of nested tags?

PS - I figured it out myself - see answer below. Short answer is, it works fine if you pass it HTML, not pre-escaped HTML. Doh! Hope this helps someone else.

<span>example with <a href="#">nested</a> <p>more nesting</p>
</span> <!-- all this gets thrown together -->
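
Since handleText evidently receives text with entities already decoded (that is how pre-escaped markup ended up looking like tags), anything appended back into the output probably needs re-escaping. Below is a minimal sketch of a helper that could be called from handleText. The names HighlightSketch, escapeHtml and highlight are made up for illustration, and it assumes the search term is literal text rather than a regex, hence Pattern.quote.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original code.
class HighlightSketch {
    // Escape the characters that would otherwise be read back as markup.
    static String escapeHtml(CharSequence text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Wrap every match in a span, escaping matched and unmatched text alike.
    static String highlight(String text, Pattern pat) {
        StringBuilder out = new StringBuilder(text.length() + 64);
        Matcher m = pat.matcher(text);
        int last = 0;
        while (m.find()) {
            if (m.end() == m.start()) {
                continue; // ignore empty matches
            }
            out.append(escapeHtml(text.substring(last, m.start())));
            out.append("<span class=\"search\">")
               .append(escapeHtml(m.group()))
               .append("</span>");
            last = m.end();
        }
        out.append(escapeHtml(text.substring(last)));
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumption: the user's search term is literal text, so quote it.
        Pattern pat = Pattern.compile(Pattern.quote("img"), Pattern.CASE_INSENSITIVE);
        System.out.println(highlight("an <IMG> tag & more", pat));
    }
}
```

Inside the callback, `ret.append(HighlightSketch.highlight(new String(data), pat))` would take the place of the replaceAll call. The offset bookkeeping (lastPos) is left as in the question and is not addressed by this sketch.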

Comments (2)

空城仅有旧梦在 2024-08-15 09:59:22

Seems to work fine for me using JDK6 on XP. I wrapped your sample HTML with head and body tags. I got three lines of output:

a) example with
b) nested
c) more nesting

Here's the code I used:

import java.io.*;
import java.net.*;
import javax.swing.text.html.parser.*;
import javax.swing.text.html.*;

public class ParserCallbackText extends HTMLEditorKit.ParserCallback
{
    // Called once per run of text; text inside nested elements arrives in separate calls.
    @Override
    public void handleText(char[] data, int pos)
    {
        System.out.println( data );
    }

    public static void main(String[] args)
        throws Exception
    {
        Reader reader = getReader(args[0]);
        ParserCallbackText parser = new ParserCallbackText();
        new ParserDelegator().parse(reader, parser, true);
    }

    static Reader getReader(String uri)
        throws IOException
    {
        // Retrieve from Internet (http or https).
        if (uri.startsWith("http:") || uri.startsWith("https:"))
        {
            URLConnection conn = new URL(uri).openConnection();
            return new InputStreamReader(conn.getInputStream());
        }
        // Retrieve from file.
        else
        {
            return new FileReader(uri);
        }
    }
}

╭⌒浅淡时光〆 2024-08-15 09:59:22

Sorry for the misleading question - I found my problem, and it wasn't included in my description - my input string had been pre-processed, so I was looking at text such as:

<span>example with &lt;a href="#"&gt;nested&lt;/a&gt; &lt;p&gt;more nesting&lt;/p&gt;
</span> <!-- well of course it all gets thrown together -->
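
To see the difference for yourself, here is a small sketch that feeds the parser both the real markup and a pre-escaped copy and collects what handleText receives. The class and method names (EscapedInputDemo, textChunks) are invented for this example, and the html/body wrapper is added the same way as in the other answer.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class, not from the original post.
class EscapedInputDemo {
    // Collect every text run the parser hands to handleText.
    static List<String> textChunks(String html) {
        final List<String> chunks = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                chunks.add(new String(data));
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chunks;
    }

    public static void main(String[] args) {
        String raw = "<html><body><span>example with <a href=\"#\">nested</a> "
                + "<p>more nesting</p></span></body></html>";
        // Same content, but the inner tags were entity-escaped beforehand.
        String escaped = "<html><body><span>example with &lt;a href=\"#\"&gt;nested&lt;/a&gt; "
                + "&lt;p&gt;more nesting&lt;/p&gt;</span></body></html>";

        System.out.println("raw:");
        for (String chunk : textChunks(raw)) {
            System.out.println("  [" + chunk + "]");
        }
        System.out.println("pre-escaped:");
        for (String chunk : textChunks(escaped)) {
            System.out.println("  [" + chunk + "]");
        }
    }
}
```

With the raw input, each chunk should be plain text with no angle brackets. With the pre-escaped input, the entities are decoded before the callback sees them, so the "tags" show up inside the text, which is what I was seeing in the debugger.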