
Extracting web page elements with XPath

Posted 2021-12-09 04:56:37, 252 characters, 794 views, 9 comments

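@黄亿华 Hello, I would like to ask you a question:

When I use your webmagic to select page elements with XPath, nothing is selected.

Description: it works fine on pages laid out with CSS, but on sites that use tables for page layout, the same kind of XPath query no longer returns the page elements.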


高跟鞋的旋律 2021-12-10 02:57:21

You can try HtmlCleaner to see whether it supports siblingsindex: Spider.xsoupOff()


残花月 2021-12-10 02:57:20

@黄亿华

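I have read your code, including the jsoup documentation you studied, but I am still not clear on jsoup as a whole. xsoup is a wrapper around jsoup, and a jsoup node has a siblingsindex attribute. Could you wrap Evaluator.IndexEquals so that data can be fetched like this: page.getHtml().xpath("/html/body/table[5]/tbody/tr/td[3]/table/tbody/tr[2]/td[1]/table/tbody/tr/td/a").links().all(); ?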
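For reference, here is a minimal sketch of what positional steps such as table[2] and td[1] select under standard XPath 1.0 rules. It uses the JDK's built-in javax.xml.xpath on a small, well-formed sample page, not webmagic or xsoup. The class name, the PAGE string, and the hrefs helper are all made up for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class PositionalXPathDemo {
    // Hypothetical sample page: two layout tables directly under <body>.
    static final String PAGE =
        "<html><body>"
      + "<table><tbody><tr><td><a href='/nav'>nav</a></td></tr></tbody></table>"
      + "<table><tbody><tr><td><a href='/a'>A</a></td><td><a href='/b'>B</a></td></tr></tbody></table>"
      + "</body></html>";

    // Evaluates an XPath expression that selects attribute nodes and returns their values.
    static List<String> hrefs(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Second table under body, first cell of each row, links inside it.
        System.out.println(hrefs(PAGE, "/html/body/table[2]/tbody/tr/td[1]/a/@href")); // prints [/a]
    }
}
```

The index in each step counts only siblings with the same tag name under the same parent, which is the behaviour the query above relies on.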


疑心病 2021-12-10 02:57:09

@黄亿华

webmagic supports "/div[@class='Question']//div[@class='Content']/div[@class='detail']", so why does it not support this form: page.getHtml().xpath("/html/body/table[5]/tbody/tr/td[3]/table/tbody/tr[2]/td[1]/table/tbody/tr/td/a").links().all(); ?

In xpathparser:

    private void findElements() {
        if (tq.matches("@")) {
            // attribute step, e.g. @href
            consumeAttribute();
        } else if (tq.matches("*")) {
            allElements();
        } else if (tq.matchesRegex("\\w+\\(.*\\)")) {
            // function call, e.g. text()
            consumeFunction();
        } else if (tq.matchesWord()) {
            // element name, e.g. table
            byTag();
        } else if (tq.matches("[@")) {
            // attribute predicate, e.g. [@class='detail']
            byAttribute();
        } else if (tq.matchesRegex("\\[\\d+\\]")) {
            // positional predicate, e.g. [5]
            byNth();
        } else {
            // unhandled
            throw new Selector.SelectorParseException("Could not parse query '%s': unexpected token at '%s'", query, tq.remainder());
        }
    }

Doesn't the branch tq.matchesRegex("\\[\\d+\\]") handle forms such as "table[5]"? Why can it not be used?
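To see what that pattern accepts on its own, here is a small check written with plain java.util.regex. It is only a sketch: it assumes matchesRegex tests the start of the remaining query text, which the snippet above does not confirm, and the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class NthPredicateCheck {
    // Same pattern as the parser branch, with backslashes escaped for a Java string literal.
    private static final Pattern NTH = Pattern.compile("\\[\\d+\\]");

    // True if the remaining query text begins with a positional predicate such as "[5]".
    static boolean startsWithNth(String remainder) {
        return NTH.matcher(remainder).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(startsWithNth("[5]/tbody/tr"));      // prints true
        System.out.println(startsWithNth("[@class='detail']")); // prints false
        System.out.println(startsWithNth("table[5]"));          // prints false
    }
}
```

Under that assumption, the tag name has to be consumed by the byTag() branch first, and only then does the "[5]" part reach the positional branch.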


永不分离 2021-12-10 02:56:02

webmagic supports "/div[@class='Question']//div[@class='Content']/div[@class='detail']", so why does it not support this form: page.getHtml().xpath("/html/body/table[5]/tbody/tr/td[3]/table/tbody/tr[2]/td[1]/table/tbody/tr/td/a").links().all(); ?