Getting only the text from multiple pages with JSoup

Posted on 2024-12-27 17:49:12

I have a set of 1000 pages (links) that I got by sending a query to Google. I am using JSoup. I want to get rid of images, links, menus, videos, etc. and keep only the main article from every page.

My problem is that every page has a different DOM tree, so I cannot use the same command for every page! Do you know of any way to do this for all 1000 pages at once? I guess I have to use regular expressions. Something like this, perhaps:

textdoc.body().select("[id*=main]").text();    // text of elements whose id contains "main"
textdoc.body().select("[class*=main]").text(); // text of elements whose class contains "main"
textdoc.body().select("[id*=content]").text(); // text of elements whose id contains "content"

But I feel that I will always miss something this way. Any better ideas?
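One way to make the selector idea above less brittle is to try a list of candidate selectors in order and fall back to the whole body when none matches. The following is only a rough sketch of that heuristic, not a complete article extractor: the class name ArticleText, the candidate list, the list of tags removed up front, and the minLength threshold are all assumptions chosen for illustration, and real pages will need tuning.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ArticleText {
    // Hypothetical candidate selectors, tried in this order. Not exhaustive.
    private static final String[] CANDIDATES = {
        "article", "main", "[id*=main]", "[class*=main]",
        "[id*=content]", "[class*=content]"
    };

    // Returns the text of the best-matching candidate element, or the
    // text of the whole body if no candidate has at least minLength characters.
    static String extract(String html, int minLength) {
        Document doc = Jsoup.parse(html);

        // Remove elements that rarely hold article text (assumed list).
        // Link tags need no special handling: text() drops markup anyway.
        doc.select("script, style, nav, header, footer, aside, form, iframe, img, video").remove();

        for (String selector : CANDIDATES) {
            // Among the elements matching this selector, keep the longest text.
            String best = "";
            for (Element e : doc.select(selector)) {
                String text = e.text();
                if (text.length() > best.length()) {
                    best = text;
                }
            }
            if (best.length() >= minLength) {
                return best;
            }
        }
        // Fallback: everything left in the body after the cleanup above.
        return doc.body().text();
    }
}
```

For pages fetched over the network, the same method could be called on the HTML of each of the 1000 links in a loop. The threshold only guards against tiny matches such as an empty wrapper div, so a value of a few hundred characters is a plausible starting point.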

一桥轻雨一伞开 2025-01-03 17:49:12

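Element main = doc.select("div.main").first(); // first <div class="main">; null if the page has none
Elements links = main.select("a[href]");       // every link with an href inside that element

Do all of the different pages have a main class for the main article?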

Question author: 情绪少女
