Extracting the main part of a page in Java

Posted 2024-10-20 23:51:31

Hello,

I have a Wikipedia page about a personality and I would like to extract the HTML code of its main content section using Java.

Do you have any ideas?


仅冇旳回忆 2024-10-27 23:51:31

Use Jsoup, specifically the selector syntax.

import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Whitelist;
import org.jsoup.select.Elements;

// parse the page, with a 10-second timeout
Document doc = Jsoup.parse(new URL("http://en.wikipedia.org/"), 10000);
Elements interestingParts = doc.select("div.interestingClass");

//get the combined HTML fragments as a String
String selectedHtmlAsString = interestingParts.html();

//get all the links
Elements links = interestingParts.select("a[href]");

//filter the document to include certain tags only
Whitelist allowedTags = Whitelist.simpleText().addTags("blockquote","code", "p");
Cleaner cleaner = new Cleaner(allowedTags);
Document filteredDoc = cleaner.clean(doc);

It's a very useful API for parsing HTML pages and extracting the desired data.

一城柳絮吹成雪 2024-10-27 23:51:31
  • Analyze the web page's structure
  • Use JSoup to parse the HTML (see the sketch below)
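
As a minimal sketch of those two steps: inspecting a Wikipedia article usually shows the body inside a container div, commonly with the id mw-content-text in the default skin. The URL and that id are assumptions for illustration; confirm them in your browser's developer tools before relying on them.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class WikipediaMainPart {
  public static void main(String[] args) throws Exception {
    // Example article URL (an assumption, not taken from the question)
    String url = "https://en.wikipedia.org/wiki/Alan_Turing";

    // Step 1: fetch and parse the page
    Document doc = Jsoup.connect(url).timeout(10000).get();

    // Step 2: select the container that holds the article body
    // ("mw-content-text" is assumed here; adjust after inspecting the page)
    Element main = doc.getElementById("mw-content-text");

    if (main != null) {
      // the HTML of the main part as a String
      System.out.println(main.html());
    }
  }
}
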
安稳善良 2024-10-27 23:51:31

Note that this returns a STRING (blob of a sort) of the HTML source code, not a nicely formatted content item.

I use this myself - a little snippet I have for whatever I need. Pass in the URL, any start and stop text, or the boolean to get everything.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

// Returns the raw HTML between booleanStart and booleanStop (inclusive),
// or the whole page when getAll is true.
public static String getPage(
      String url, 
      String booleanStart, 
      String booleanStop, 
      boolean getAll) throws Exception {
    StringBuilder page = new StringBuilder();
    URL iso3 = new URL(url);
    URLConnection iso3conn = iso3.openConnection();
    BufferedReader in = new BufferedReader(
        new InputStreamReader(
            iso3conn.getInputStream()));
    String inputLine;

    if (getAll) {
      // no markers: keep every line of the page
      while ((inputLine = in.readLine()) != null) {
        page.append(inputLine);
      }
    } else {
      // copy only the lines from the start marker through the stop marker
      boolean save = false;
      while ((inputLine = in.readLine()) != null) {
        if (inputLine.contains(booleanStart)) 
          save = true;
        if (save) 
          page.append(inputLine);
        if (save && inputLine.contains(booleanStop)) {
          break;
        }
      }
    }
    in.close();
    return page.toString();
  }
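
For example, to grab roughly the main part of a Wikipedia article with this method, you might bracket the body with markers taken from the page source. The URL and marker strings below are assumptions for illustration, not values given in this answer; check them against the actual HTML of your page.

// Hypothetical call: "mw-content-text" and "catlinks" are guesses at the
// divs that open and follow the article body in Wikipedia's markup.
String mainPart = getPage(
    "https://en.wikipedia.org/wiki/Alan_Turing",
    "<div id=\"mw-content-text\"",
    "<div id=\"catlinks\"",
    false);

// Or fetch the entire page and post-process it yourself:
String fullPage = getPage(
    "https://en.wikipedia.org/wiki/Alan_Turing", null, null, true);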