Extracting everything but the tags from a web page without a parser - using Scanner and regex?

Posted 2024-09-17 16:05:39

I'm working with the Android SDK, which is Java minus some things.

I have a solution that pulls out two regex patterns from web pages. The problem I'm having is that it's finding things inside HTML tags. I tried jTidy, but it was just too slow on Android. Not sure why, but my Scanner regex match solution whips it many times over.

Currently, I grab the page source into an InputStream:

is = uconn.getInputStream();

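and then match and extract like this:

Scanner scanner = new Scanner(is, "UTF-8");
String match = "";
while (match != null) {
    // A horizon of 0 means "search the rest of the input, no limit".
    match = scanner.findWithinHorizon(extractPattern, 0);
    if (match != null) {
        // grp is the index of the capturing group to extract.
        String matchit = scanner.match().group(grp);
        // ... use matchit ...
    }
}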

It works very nicely and is fast.

My regex pattern is already kinda crazy; it's actually two patterns joined by an OR, like this: (p1|p2)

Any ideas on how to add a "but not inside HTML tags" condition, or how to exclude HTML tags up front?
If I can exclude HTML tags from my source, that will likely speed up my interface significantly, as I have a few other things I need to do with the raw data.

Comments (2)

笨笨の傻瓜 2024-09-24 16:08:56

Why don't you use javax.xml.parsers to parse the HTML (ergo XML)?

鱼窥荷 2024-09-24 16:07:57

One thing you can do is add a lookahead for the closing angle bracket:

(p1|p2)(?![^<>]*+>)

The idea is, after you find a match you scan forward a bit; if you find a closing bracket without first seeing an opening bracket, the match must have occurred inside a tag, so reject it. But be aware that even in well-formed HTML there are many things that can mess you up, like SGML comments, CDATA sections, or even angle brackets in attribute values.
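As a minimal sketch of the lookahead idea (not a definitive implementation): the class name LookaheadDemo, the method findOutsideTags, and the literals foo and bar are hypothetical stand-ins for the asker's p1 and p2. It uses a plain Matcher on a String rather than a Scanner, purely to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // (foo|bar) is a hypothetical stand-in for (p1|p2).
    // The negative lookahead rejects a hit when a '>' is reached
    // before any other angle bracket, i.e. the hit sits inside a tag.
    static final Pattern OUTSIDE_TAGS =
            Pattern.compile("(foo|bar)(?![^<>]*+>)");

    // Collects every hit that is not followed by a bare closing bracket.
    static List<String> findOutsideTags(String html) {
        List<String> hits = new ArrayList<>();
        Matcher m = OUTSIDE_TAGS.matcher(html);
        while (m.find()) {
            hits.add(m.group(1));
        }
        return hits;
    }

    public static void main(String[] args) {
        // Both words also appear in attribute values; only the text hits survive.
        String html = "<a class=\"foo\" href=\"bar.html\">foo and bar</a>";
        System.out.println(findOutsideTags(html)); // prints [foo, bar]
    }
}
```

The caveat about angle brackets in attribute values is easy to reproduce with this sketch: for an input such as `<img alt="foo < 3">`, the lookahead stops at the `<` inside the attribute, so the foo inside the tag is accepted.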

Another approach would be to match the tags and ignore those matches:

((?:<[^<>]++>)++)|(p1|p2)

Then you test whether it was group #1 that matched:

MatchResult match = scanner.match();
if (match.start(1) != -1) {
    // keep searching
}
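
Put together with the Scanner loop from the question, this approach might look like the sketch below. It assumes the two groups are joined by an alternation (`|`), since testing whether group #1 matched only makes sense in that case. The class name SkipTagsDemo, the method extract, and the literals foo and bar (standing in for p1 and p2) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

class SkipTagsDemo {
    // Group 1 swallows a run of tags; group 2 is a hypothetical
    // stand-in for (p1|p2). Only one of the two groups participates
    // in any given match.
    static final Pattern TAGS_OR_TARGET =
            Pattern.compile("((?:<[^<>]++>)++)|(foo|bar)");

    static List<String> extract(InputStream in) {
        List<String> hits = new ArrayList<>();
        Scanner scanner = new Scanner(in, "UTF-8");
        // A horizon of 0 means the search is not bounded.
        while (scanner.findWithinHorizon(TAGS_OR_TARGET, 0) != null) {
            MatchResult match = scanner.match();
            if (match.start(1) != -1) {
                continue; // group 1 matched, so this was a tag run: keep searching
            }
            hits.add(match.group(2));
        }
        scanner.close();
        return hits;
    }

    public static void main(String[] args) {
        String html = "<a class=\"foo\" href=\"bar.html\">foo and bar</a>";
        InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        System.out.println(extract(in)); // prints [foo, bar]
    }
}
```

Because the tag alternative consumes each tag whole, the Scanner resumes after the closing bracket and never looks for foo or bar inside the tag itself.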

But again, as a general solution this is way too fragile, for the reasons I cited above. You should only use one of these solutions (or any regex solution) if you're sure it's compatible with the particular pages you're working on.
