How to get complete information for all tags in an HTML page using HTML Parser

Posted 2024-08-22 04:49:28 · 1419 words · 7 views · 0 comments

I am using HTML Parser to develop an application.
The code below does not retrieve the full set of tags in the page.
Some tags are missed, along with their attributes and text bodies.
Please help me understand why this is happening, or suggest another approach.

 URL url = new URL("...");
 PrintWriter pw = new PrintWriter(new FileWriter("HTMLElements.txt"));

 URLConnection connection = url.openConnection();
 InputStream is = connection.getInputStream();
 InputStreamReader isr = new InputStreamReader(is);
 BufferedReader br = new BufferedReader(isr);

 HTMLEditorKit htmlKit = new HTMLEditorKit();
 HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
 HTMLEditorKit.Parser parser = new ParserDelegator();
 HTMLEditorKit.ParserCallback callback = htmlDoc.getReader(0);
 parser.parse(br, callback, true);

 ElementIterator iterator = new ElementIterator(htmlDoc);
 Element element;
 while ((element = iterator.next()) != null)
 {
     AttributeSet attributes = element.getAttributes();
     Enumeration e = attributes.getAttributeNames();

     pw.println("Element Name :" + element.getName());
     while (e.hasMoreElements())
     {
         Object key = e.nextElement();
         Object val = attributes.getAttribute(key);
         int startOffset = element.getStartOffset();
         int endOffset = element.getEndOffset();
         int length = endOffset - startOffset;
         String text = htmlDoc.getText(startOffset, length);

         pw.println("Key :" + key.toString() + " Value :" + val.toString() + "\r\n" + "Text :" + text + "\r\n");
     }
 }

}

Comments (4)

寄意 2024-08-29 04:49:28

I am doing this fairly reliably with HTML Parser (provided that the HTML document does not change its structure). A web service with a stable API is much better, but sometimes we just do not have one.

General idea:

You first have to know which tags (div, meta, span, etc.) contain the information you want, and which attributes identify those tags. Example:

 <span class="price"> $7.95</span>

If you are looking for this "price", then you are interested in span tags with class "price".

HTML Parser has a filter-by-attribute functionality.

filter = new HasAttributeFilter("class", "price");

When you parse using a filter, you get a list of Nodes. You can run an instanceof check on each one to determine whether it is of the type you are interested in. For a span, you'd do something like:

if (node instanceof Span) // or any other supported element.

See list of supported tags here.

An example that uses HTML Parser to grab the meta tag holding a site's description:

Tag sample:

<meta name="description" content="Amazon.com: frankenstein: Books"/> 

Code:

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.MetaTag;

public class HTMLParserTest {
    public static void main(String... args) {
        Parser parser = new Parser();
        // Target tag: <meta name="description" content="Some text about the site." />
        HasAttributeFilter filter = new HasAttributeFilter("name", "description");
        try {
            parser.setResource("http://www.youtube.com");
            NodeList list = parser.parse(filter);

            // Guard against pages that have no matching tag.
            if (list.size() > 0 && list.elementAt(0) instanceof MetaTag) {
                MetaTag meta = (MetaTag) list.elementAt(0);
                String description = meta.getAttribute("content");

                System.out.println(description);
                // Printed at the time of writing: "YouTube is a place to discover, watch, upload and share videos."
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}

摘星┃星的人 2024-08-29 04:49:28

As per the comments:

"Actually, I want to extract information such as product name, price, etc. for all products listed in an online shopping site such as amazon.com. How should I go about it?"

Step 1: read their robots file. It's usually found at the root of the site, for example http://amazon.com/robots.txt. If the URL you're trying to access is covered by a Disallow under a User-agent of *, then stop here. Contact them, explain in detail what you're trying to do, and ask for ways/alternatives/web services that can provide the information you need. Otherwise you risk violating the law and getting blacklisted by the site and/or your ISP, or worse. If the URL is not disallowed, proceed to step 2.

Step 2: check whether the site in question already has a public web service, which is much easier to use than parsing a whole HTML page. Using a web service, you'll get exactly the information you're looking for in a concise format (JSON or XML) based on a simple set of parameters. Look around or contact them for details about any web services. If there is none, proceed to step 3.

Step 3: learn how HTML/CSS/JS work, learn how to work with web developer tools like Firebug, and learn how to interpret the HTML/CSS/JS source you see via right-click > View Page Source. My bet is that the site in question uses JS/Ajax to load/populate the information you'd like to gather. In that case, you'll need an HTML parser that can parse and execute JS as well (the one you're using doesn't do that). This isn't going to be an easy job, so I won't explain it in detail until it's entirely clear what you're trying to achieve, whether it is allowed, and whether an easier-to-use web service is available.
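The robots.txt check from Step 1 could be sketched roughly as below. This is only an illustration under stated assumptions: the class and method names (`RobotsCheck`, `disallowedForAll`, `isDisallowed`) and the sample rules are made up for this example, and the matching is deliberately simplified to plain prefix checks on `Disallow` lines in the `User-agent: *` group. It ignores `Allow` rules and wildcard patterns, so a real crawler would want a complete robots.txt implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: simplified robots.txt check for "User-agent: *" only.
class RobotsCheck {

    /** Collects the Disallow path prefixes listed under "User-agent: *". */
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inStarGroup = false;  // true while the current group names "*"
        boolean lastWasAgent = false; // consecutive User-agent lines share one group
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw;
            int hash = line.indexOf('#');          // strip trailing comments
            if (hash >= 0) line = line.substring(0, hash);
            line = line.trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue;               // blank or malformed line
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!lastWasAgent) inStarGroup = false; // a new group starts here
                if (value.equals("*")) inStarGroup = true;
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // An empty Disallow value means "nothing is disallowed".
                if (field.equals("disallow") && inStarGroup && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    /** Returns true if the path starts with any Disallow prefix for "*". */
    static boolean isDisallowed(String robotsTxt, String path) {
        for (String prefix : disallowedForAll(robotsTxt)) {
            if (path.startsWith(prefix)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up rules, not taken from any real site.
        String sample = String.join("\n",
                "User-agent: *",
                "Disallow: /private/",
                "",
                "User-agent: ExampleBot",
                "Disallow: /");
        System.out.println(isDisallowed(sample, "/private/page.html"));
        System.out.println(isDisallowed(sample, "/products/1"));
    }
}
```

Fetching the real file and deciding what to do when it is missing or unreadable are left out on purpose; those choices depend on the site and on how cautious you want to be.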

趁微风不噪 2024-08-29 04:49:28

You seem to be using the Swing HTMLDocument. That may not be the smartest idea.
I believe you would get better results using, for example, NekoHTML.
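For comparison, here is a minimal JDK-only sketch that listens to the parser's events directly instead of reading them back from an HTMLDocument. The reasoning behind it is an assumption about the question rather than a confirmed diagnosis: HTMLDocument builds its own element model, in which inline tags can end up as attributes of text runs rather than as elements of their own, so walking it with an ElementIterator may not show every tag the parser saw. The class and method names (`TagLister`, `listEvents`, `describe`) and the sample markup are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: records every tag and text run the Swing parser reports.
class TagLister {

    /** Parses the HTML string and returns one entry per reported event. */
    static List<String> listEvents(String html) {
        List<String> events = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                events.add("start " + tag + describe(attrs));
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Empty tags and tags unknown to the parser's DTD arrive here.
                events.add("simple " + tag + describe(attrs));
            }

            @Override
            public void handleText(char[] data, int pos) {
                events.add("text " + new String(data));
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return events;
    }

    /** Formats an attribute set as " key=value key=value". */
    static String describe(MutableAttributeSet attrs) {
        StringBuilder sb = new StringBuilder();
        Enumeration<?> names = attrs.getAttributeNames();
        while (names.hasMoreElements()) {
            Object key = names.nextElement();
            sb.append(' ').append(key).append('=').append(attrs.getAttribute(key));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "<html><body><span class=\"price\"> $7.95</span></body></html>";
        for (String event : listEvents(sample)) {
            System.out.println(event);
        }
    }
}
```

The Swing parser is still an old and fairly lenient HTML parser, so this only removes the document-model layer. It does not replace a dedicated library such as the one suggested above.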

夏天碎花小短裙 2024-08-29 04:49:28

Another simple library you can use is JTidy, which can clean up your HTML before you parse it.
Hope this helps.

http://sourceforge.net/projects/jtidy/

Ciao!
