How do I write a regex pattern that matches simple text in HTML?

Posted on 2024-10-06 22:38:50

I am trying to learn Regex patterns for a class. I am making a simple HTML Lexer/Parser. I know this is not the best or most efficient way to make a Lexer/Parser but it is only to understand Regex patterns.

So my question is: how do I create a pattern that checks that a String contains no HTML tags (e.g. <TAG>) and no HTML entities (e.g. &ENT;)?

This is what I could come up with so far but it still does not work:

.+?(^(?:&[A-Za-z0-9#]+;)^(?:<.*?>))

EDIT: The only problem is that I can't negate the final outcome. I need a single, complete pattern that accomplishes this task, if that is possible, even if it is not pretty. I never mentioned it, but the pattern is supposed to match pretty much any simple text in an HTML page.

痴情换悲伤 2024-10-13 22:38:50

You could use the expression <.+?>|&.+?; to search for a match, and then negate the result.

<.+?> matches a <, then any characters (one or more, as few as possible), then a >
&.+?; matches an &, then any characters (one or more, as few as possible), then a ;

Here is a complete example (the original answer linked to a demo on ideone.com).

import java.util.regex.*;

public class Test {
    public static void main(String[] args) {
        String[] tests = { "hello", "hello <b>world</b>!", "Hello&nbsp;world" };
        // A tag (<...>) or an entity (&...;); the lazy .+? stops at the first > or ;
        Pattern p = Pattern.compile("<.+?>|&.+?;");
        for (String test : tests) {
            Matcher m = p.matcher(test);
            // find() searches anywhere in the string, so no match means no HTML
            if (m.find())
                System.out.printf("\"%s\" has HTML: %s%n", test, m.group());
            else
                System.out.printf("\"%s\" has no HTML%n", test);
        }
    }
}

Output:

"hello" has no HTML
"hello <b>world</b>!" has HTML: <b>
"Hello&nbsp;world" has HTML: &nbsp;
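
If negating the result in code is not an option, as the EDIT in the question suggests, the same test can probably be folded into one pattern with a negative lookahead. The following is only a sketch under that assumption: the class name NoMarkupPattern and the method hasNoMarkup are made up for illustration and are not part of the answer above. The idea is to consume the string one character at a time, refusing to step onto any position where the tag-or-entity expression would match.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original answer.
class NoMarkupPattern {
    // (?!<.+?>|&.+?;) fails at any position where a tag or entity starts.
    // [\s\S] then consumes one character of any kind, including line breaks,
    // so the inner expression keeps its default (no DOTALL) behaviour.
    static final Pattern NO_MARKUP =
            Pattern.compile("(?:(?!<.+?>|&.+?;)[\\s\\S])*");

    static boolean hasNoMarkup(String s) {
        // matches() requires the pattern to cover the whole string.
        return NO_MARKUP.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] tests = { "hello", "hello <b>world</b>!", "Hello&nbsp;world", "" };
        for (String test : tests)
            System.out.printf("\"%s\" -> %b%n", test, hasNoMarkup(test));
    }
}
```

With matches(), this should accept exactly the strings for which find() on <.+?>|&.+?; reports nothing. It is harder to read than the search-and-negate version and does a lookahead at every character, so it is mostly useful when a single pattern is a hard requirement.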

莫多说 2024-10-13 22:38:50

If you're looking to match strings that do NOT follow a pattern, the simplest thing to do is to match the pattern and then negate the result of the test.

<[^>]+>|&[^;]+;

Any string that matches this pattern will have AT LEAST ONE tag (as you've defined it) or entity (as you've defined it). So the strings you want are strings that DO NOT match this pattern (they will have NO tags or entities).
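
As a rough illustration of "match, then negate" with this pattern, here is a minimal Java sketch. The class name PlainTextCheck and the method isPlainText are hypothetical names chosen for the example, and the sample strings are invented.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating "match the pattern, then negate the result".
class PlainTextCheck {
    // A tag is '<', one or more characters other than '>', then '>';
    // an entity is '&', one or more characters other than ';', then ';'.
    static final Pattern MARKUP = Pattern.compile("<[^>]+>|&[^;]+;");

    static boolean isPlainText(String s) {
        // find() searches anywhere in the string; negating it means
        // "no tag and no entity was found".
        return !MARKUP.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isPlainText("just some text"));    // true
        System.out.println(isPlainText("a <i>tag</i> here")); // false
        System.out.println(isPlainText("fish &amp; chips"));  // false
        System.out.println(isPlainText("1 < 2 and 3 & 4"));   // true
    }
}
```

One difference from a dot-based pattern: in Java, a negated class such as [^>] also matches line breaks, whereas . does not by default, so this version also flags a tag that is split across lines.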
