How to get complete information for all tags in an HTML page using HTML Parser

Posted on 2024-08-22 04:49:28

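I am using HTML Parser to develop an application. The code below does not retrieve the full set of tags in the page: some tags are missed, along with their attributes and text content. Please help me understand why this happens, or suggest another approach.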

 URL url = new URL("...");
 PrintWriter pw = new PrintWriter(new FileWriter("HTMLElements.txt"));

 URLConnection connection = url.openConnection();
 InputStream is = connection.getInputStream();
 InputStreamReader isr = new InputStreamReader(is);
 BufferedReader br = new BufferedReader(isr);

 HTMLEditorKit htmlKit = new HTMLEditorKit();
 HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
 HTMLEditorKit.Parser parser = new ParserDelegator();
 HTMLEditorKit.ParserCallback callback = htmlDoc.getReader(0);
 parser.parse(br, callback, true);

 ElementIterator iterator = new ElementIterator(htmlDoc);
 Element element;
 while ((element = iterator.next()) != null)
 {
     AttributeSet attributes = element.getAttributes();
     Enumeration e = attributes.getAttributeNames();

     pw.println("Element Name :" + element.getName());
     while (e.hasMoreElements())
     {
         Object key = e.nextElement();
         Object val = attributes.getAttribute(key);
         int startOffset = element.getStartOffset();
         int endOffset = element.getEndOffset();
         int length = endOffset - startOffset;
         String text = htmlDoc.getText(startOffset, length);

         pw.println("Key :" + key.toString() + " Value :" + val.toString() + "\r\n" + "Text :" + text + "\r\n");
     }
 }

}
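
Two things in the snippet above may explain the missing tags, as far as can be told from the code alone. First, the reader returned by htmlDoc.getReader(0) is never flushed (HTMLEditorKit.read calls flush() on it after parsing), so pending elements may never reach the document. Second, HTMLDocument is a rendering model: inline tags such as <a> or <b> are stored as attributes of content elements rather than as elements of their own, so ElementIterator does not list them. A minimal sketch of one alternative, assuming only the JDK's Swing parser, is to receive the parser events directly in a callback. The class name TagLister and its methods are hypothetical names for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: lists every tag the Swing parser reports, with its attributes.
final class TagLister {

    /** Returns one line per reported tag, e.g. "a href=/x". */
    static List<String> listTags(String html) {
        List<String> out = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                out.add(describe(t, a));
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                // Empty tags such as <br> and <img> arrive here. Tags the parser's DTD
                // does not know are also reported here; their closing form carries an
                // "endtag" attribute, which is skipped to avoid counting them twice.
                if (a.getAttribute(HTML.Attribute.ENDTAG) == null) {
                    out.add(describe(t, a));
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    // Formats "tagname key=value key=value" while the attribute set is still valid.
    private static String describe(HTML.Tag t, MutableAttributeSet a) {
        StringBuilder sb = new StringBuilder(t.toString());
        Enumeration<?> names = a.getAttributeNames();
        while (names.hasMoreElements()) {
            Object key = names.nextElement();
            sb.append(' ').append(key).append('=').append(a.getAttribute(key));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><div id=\"main\"><a href=\"/x\">link</a></div></body></html>";
        for (String line : listTags(html)) {
            System.out.println(line);
        }
    }
}
```

Text content can be collected the same way by overriding handleText in the callback. Tags that the parser adds on its own (for example a missing head) show up with an "_implied_" attribute.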



寄意 2024-08-29 04:49:28

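I am doing this fairly reliably with HTML Parser (provided that the HTML document does not change its structure). A web service with a stable API is much better, but sometimes we just do not have one.

General idea:

You first have to know which tags (div, meta, span, etc.) contain the information you want, and which attributes identify those tags. Example: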

 <span class="price"> $7.95</span>

If you are looking for this price, then you are interested in span tags with the class "price".

HTML Parser can filter by attribute:

filter = new HasAttributeFilter("class", "price");

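When you parse with a filter, you get back a list of Nodes. You can use instanceof on each node to check whether it is of the type you are interested in; for a span you would do something like: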

if (node instanceof Span) // or any other supported element.

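The supported tag classes live in the library's org.htmlparser.tags package.

An example using HTML Parser to grab the meta tag that holds a site's description:

Sample tag: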

<meta name="description" content="Amazon.com: frankenstein: Books"/>

Code:

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.MetaTag;

public class HTMLParserTest {
    public static void main(String... args) {
        Parser parser = new Parser();
        //<meta name="description" content="Some text about the site." />
        HasAttributeFilter filter = new HasAttributeFilter("name", "description");
        try {
            parser.setResource("http://www.youtube.com");
            NodeList list = parser.parse(filter);
            Node node = list.elementAt(0);

            if (node instanceof MetaTag) {
                MetaTag meta = (MetaTag) node;
                String description = meta.getAttribute("content");

                System.out.println(description);
                // Prints: "YouTube is a place to discover, watch, upload and share videos."
            }

        } catch (ParserException e) {
            e.printStackTrace();
        }
    }

}



摘星┃星的人 2024-08-29 04:49:28

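As per the comments:

"Actually I want to extract information such as the product name and price for all products listed on an online shopping site (e.g. amazon.com). How should I go about it?"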

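Step 1: Read their robots file. It is usually found at the root of the site, e.g. http://amazon.com/robots.txt. If the URL you are trying to access is covered by a Disallow rule for a User-Agent of *, then stop here. Contact them, explain in detail what you are trying to achieve, and ask for ways, alternatives, or web services that can provide the information you need. Otherwise you are violating the law and risk being blacklisted by the site and/or your ISP, or worse. If the URL is not disallowed, proceed to step 2.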
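
The check in step 1 can be sketched in a few lines of plain Java. This is a simplified illustration, not a complete robots.txt implementation: it only looks at Disallow lines in the "User-agent: *" group, treats each rule as a plain path prefix, and ignores Allow lines and wildcards, so it errs on the side of reporting a path as disallowed. The class name RobotsCheck and the sample paths are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper: a conservative check against the "User-agent: *" group.
final class RobotsCheck {

    /** Returns true if path starts with a Disallow rule of the "User-agent: *" group. */
    static boolean isDisallowed(String robotsTxt, String path) {
        boolean inStarGroup = false;   // currently inside a group that names "*"
        boolean lastWasAgent = false;  // previous rule line was a User-agent line
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw.replaceFirst("#.*", "").trim(); // strip comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Consecutive User-agent lines belong to the same group.
                if (!lastWasAgent) {
                    inStarGroup = false;
                }
                if (value.equals("*")) {
                    inStarGroup = true;
                }
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // An empty Disallow value means "nothing is disallowed".
                if (inStarGroup && field.equals("disallow")
                        && !value.isEmpty() && path.startsWith(value)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Sample content made up for illustration; fetch the real file from the site root.
        String robots = "User-agent: *\nDisallow: /gp/cart\n\nUser-agent: FooBot\nDisallow: /\n";
        System.out.println(isDisallowed(robots, "/gp/cart/view"));
        System.out.println(isDisallowed(robots, "/books/1"));
    }
}
```

Even when a path passes this check, the site's terms of use may still restrict automated access, so the advice to contact the site owner still applies.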

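Step 2: Check whether the site in question already offers a public web service, which is much easier to use than parsing a whole HTML page. With a web service you get exactly the information you are looking for, in a concise format (JSON or XML), based on a simple set of parameters. Look around or contact them for details about any web services. If there is none, proceed to step 3.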
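
To show why a web service is easier to consume than a full HTML page, here is a minimal sketch that reads product names and prices from an XML response using the JDK's built-in DOM parser. The response format (products, product, name, price elements) is invented for illustration and does not describe any real service; the class name ProductXml is likewise hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for a made-up XML response format.
final class ProductXml {

    /** Returns "name : price" for every product element in the response. */
    static List<String> namesAndPrices(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList products = doc.getElementsByTagName("product");
            List<String> out = new ArrayList<>();
            for (int i = 0; i < products.getLength(); i++) {
                Element product = (Element) products.item(i);
                // Assumes each product has exactly one name and one price child.
                String name = product.getElementsByTagName("name").item(0).getTextContent();
                String price = product.getElementsByTagName("price").item(0).getTextContent();
                out.add(name + " : " + price);
            }
            return out;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse response", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<products>"
                + "<product><name>Frankenstein</name><price>7.95</price></product>"
                + "</products>";
        namesAndPrices(xml).forEach(System.out::println);
    }
}
```

The structured response needs no filtering by class attributes and does not break when the page layout changes, which is the main advantage over scraping.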

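Step 3: Learn how HTML/CSS/JS work, learn how to use web developer tools such as Firebug, and learn how to read the HTML/CSS/JS source that you see via right-click > View Page Source. My bet is that the site in question uses JS/Ajax to load and populate the information you would like to gather. In that case you need an HTML parser that is also capable of parsing and executing JS (the one you are using cannot do that). This is not an easy job, so I won't go into detail until it is fully clear what you want to achieve, whether that is allowed, and whether there really is no easier-to-use web service available.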