How to get complete information about all tags in an HTML page using HTML Parser
I am using HTML Parser to develop an application.
The code below does not retrieve the full set of tags in the page.
Some tags are missed, along with their attributes and body text.
Please help me understand why this is happening, or suggest another approach.
URL url = new URL("...");
PrintWriter pw=new PrintWriter(new FileWriter("HTMLElements.txt"));
URLConnection connection = url.openConnection();
InputStream is = connection.getInputStream();
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr);
HTMLEditorKit htmlKit = new HTMLEditorKit();
HTMLDocument htmlDoc = (HTMLDocument)htmlKit.createDefaultDocument();
HTMLEditorKit.Parser parser = new ParserDelegator();
HTMLEditorKit.ParserCallback callback = htmlDoc.getReader(0);
parser.parse(br, callback, true);
ElementIterator iterator = new ElementIterator(htmlDoc);
Element element;
while ((element = iterator.next()) != null)
{
    AttributeSet attributes = element.getAttributes();
    Enumeration e=attributes.getAttributeNames();
    pw.println("Element Name :"+element.getName());
    while(e.hasMoreElements())
    {
        Object key=e.nextElement();
        Object val=attributes.getAttribute(key);
        int startOffset = element.getStartOffset();
        int endOffset = element.getEndOffset();
        int length = endOffset - startOffset;
        String text=htmlDoc.getText(startOffset, length);
        pw.println("Key :"+key.toString()+" Value :"+val.toString()+"\r\n"+"Text :"+text+"\r\n");
    }
}
}
Comments (4)
I am doing this fairly reliably with HTML Parser (provided that the HTML document does not change its structure). A web service with a stable API is much better, but sometimes we just do not have one.
General idea:
You first have to know which tags (div, meta, span, etc.) contain the information you want, and which attributes identify those tags. Example: if you are looking for a price, you are interested in span tags with class "price". HTML Parser has filter-by-attribute functionality.
When you parse using a filter, you get a list of Nodes, and you can apply instanceof to each one to determine whether it is of the type you are interested in; for span, you would test each node against the span tag type. See the list of supported tags here.
An example with HTML Parser to grab the meta tag that holds the description of a site:
Tag Sample:
Code:
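A minimal sketch of the filter-by-attribute idea described above. It is not the HTML Parser code from this answer: it uses the JDK's own javax.swing.text.html.parser.ParserDelegator so that it runs without extra libraries. The class MetaTagSketch, the method findMetaContent, and the sample markup are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of HTML Parser: picks <meta> tags by their "name" attribute.
class MetaTagSketch {

    /** Returns the "content" value of every <meta> tag whose "name" matches wantedName. */
    static List<String> findMetaContent(String html, String wantedName) {
        List<String> found = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // <meta> has no end tag, so the parser reports it as a "simple" tag.
                if (!HTML.Tag.META.equals(tag)) {
                    return;
                }
                Object name = attrs.getAttribute(HTML.Attribute.NAME);
                Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
                if (name != null && content != null
                        && wantedName.equalsIgnoreCase(name.toString())) {
                    found.add(content.toString());
                }
            }
        };
        try {
            // true = ignore charset directives instead of restarting the parse
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch.
        String html = "<html><head><title>Frogs</title>"
                + "<meta name=\"description\" content=\"Frogs for sale\">"
                + "<meta name=\"keywords\" content=\"frogs, garden\">"
                + "</head><body><p>Hello</p></body></html>";
        System.out.println(findMetaContent(html, "description"));
    }
}
```

The callback receives tags directly from the parser, so the filter does not depend on which elements end up in an HTMLDocument.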
As per the comments:
Step 1: read their robots file. It's usually found at the root of the site, for example http://amazon.com/robots.txt. If the URL you're trying to access is covered by a Disallow on a User-Agent of *, then stop here. Contact them, explain in detail what you're trying to do, and ask for ways, alternatives, or web services that can provide the information you need. Otherwise you're violating the law and you risk getting blacklisted by the site and/or your ISP, or worse. If the URL is not disallowed, proceed to step 2.
Step 2: check whether the site in question already has a public web service, which is much easier to use than parsing a whole HTML page. With a web service, you get exactly the information you're looking for in a concise format (JSON or XML) based on a simple set of parameters. Look around or contact them for details about any web services. If there is none, proceed to step 3.
Step 3: learn how HTML/CSS/JS work, learn how to use web developer tools like Firebug, and learn how to interpret the HTML/CSS/JS source you see via right-click > View Page Source. My bet is that the site in question uses JS/Ajax to load or populate the information you'd like to gather. In that case, you'll need an HTML parser that can parse and execute JS as well (the one you're using doesn't do that). This isn't going to be an easy job, so I won't explain it in detail until it's entirely clear what you're trying to achieve, whether it is allowed, and whether there is no easier-to-use web service available.
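Step 1 can be roughly sketched in code. This is a simplified reading of robots.txt under stated assumptions: it only looks at Disallow lines in groups for User-agent: *, treats each rule as a plain path prefix, and ignores Allow rules and wildcards. RobotsSketch and isDisallowedForAll are hypothetical names, and the robots.txt text is assumed to have been downloaded already.

```java
import java.util.Locale;

// Hypothetical helper: a deliberately simplified robots.txt check, not a full parser.
class RobotsSketch {

    /** Returns true if path starts with a Disallow rule listed under "User-agent: *". */
    static boolean isDisallowedForAll(String robotsTxt, String path) {
        boolean inStarGroup = false;      // are we inside a group that applies to "*"?
        boolean lastWasUserAgent = false; // consecutive User-agent lines share one group
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw;
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // strip trailing comment
            }
            line = line.trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!lastWasUserAgent) {
                    inStarGroup = false; // a new group starts here
                }
                if (value.equals("*")) {
                    inStarGroup = true;
                }
                lastWasUserAgent = true;
            } else {
                lastWasUserAgent = false;
                // An empty Disallow value means "nothing is disallowed".
                if (inStarGroup && field.equals("disallow")
                        && !value.isEmpty() && path.startsWith(value)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Sample rules invented for this sketch.
        String robots = "User-agent: *\nDisallow: /private/\n";
        System.out.println(isDisallowedForAll(robots, "/private/page.html"));
        System.out.println(isDisallowedForAll(robots, "/public/page.html"));
    }
}
```

A result of false here only means this simplified check found no matching rule; it does not replace reading the site's terms or asking the site owner.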
You seem to be using Swing's HTMLDocument. That may not be the smartest idea.
I believe you would get better results using, for example, NekoHTML.
Another simple library you can use is JTidy, which can clean up your HTML before you parse it.
Hope this helps.
http://sourceforge.net/projects/jtidy/
Ciao!