I want to read the source code (HTML tags) of a given URL from my servlet.
For example, the URL is http://www.google.com and my servlet needs to read its HTML source. The reason I need this is that my web application is going to read other web pages, extract useful content, and do something with it.
Let's say my application shows a list of shops of one category in a city. The list is generated by having my web application (servlet) go through a given web page that displays various shops and read its content. From the source code, my servlet filters out the useful details and finally creates the list (because my servlet has no access to the database behind the given URL's web application).
Does anyone know a solution? (In particular, I need to do this in servlets.) If you think there is a better way to get details from another site, please let me know.
Thank you
You don't need a servlet to read data from a remote server. You can just use the java.net.URL or java.net.URLConnection class to read remote content from an HTTP server. For example:
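A minimal sketch of that approach, using only the JDK (http://www.google.com is just the example URL from the question):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PageSource {

    // Reads everything from the given reader into a single string.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.google.com");
        // openStream() issues a GET request and returns the response body.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            System.out.println(readAll(in));
        }
    }
}
```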
Take a look at jsoup for fetching and parsing the HTML.
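For example, a sketch assuming the jsoup jar is on the classpath; the URL and the "div.shop" selector are made up and would depend on the target page's markup:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ShopScraper {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page in one call.
        Document doc = Jsoup.connect("http://www.example.com/shops").get();
        // Select elements with CSS selectors.
        for (Element shop : doc.select("div.shop")) {
            System.out.println(shop.text());
        }
    }
}
```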
What you are trying to do is called web scraping. Kayak and similar websites do it; search the web for the term ;) In Java this is straightforward.
Reading the response will give you the complete HTML content returned by that URL.
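A sketch of such a fetch with plain HttpURLConnection, where the response variable ends up holding the returned HTML (the URL is just the example from the question):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class Scraper {

    // Collects an entire input stream into one string, line by line.
    static String slurp(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://www.google.com").openConnection();
        conn.setRequestMethod("GET");
        if (conn.getResponseCode() == HttpURLConnection.HTTP_OK) {
            String response = slurp(conn.getInputStream());
            System.out.println(response); // the page's full HTML
        }
        conn.disconnect();
    }
}
```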
As written above, you don't need a servlet for this purpose. The Servlet API is for responding to requests, and a servlet container runs on the server side. If I understand you correctly, you don't need any server here; a simple HTTP client emulator is enough.
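A minimal sketch of such a client (it sets a User-Agent header so the request looks like it comes from a browser; the URL is just the example from the question):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class HttpClientEmulator {

    // Reads all lines from the reader, joining them with '\n'.
    static String readBody(BufferedReader in) throws IOException {
        StringBuilder body = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            body.append(line).append('\n');
        }
        return body.toString();
    }

    // Fetches the body behind urlString, sending a browser-like User-Agent.
    static String fetch(String urlString) throws IOException {
        URLConnection conn = new URL(urlString).openConnection();
        // Some servers answer differently (or not at all) without a User-Agent.
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            return readBody(in);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetch("http://www.google.com"));
    }
}
```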
There are several solutions.
The simplest one is using regular expressions. If you just want to extract links from tags like <a href="THE URL">, use a regular expression like <a\s+href\s*=\s*["']?(.*?)["'] — group(1) contains the URL. Now just create a Matcher and iterate over your document while matcher.find() is true.
The next solution is using an XML parser to parse the HTML. This works fine if the sites are written in well-formed HTML (XHTML). Since that is not always the case, this solution is applicable to selected sites only.
The next solution is using the built-in Java HTML parser: http://java.sun.com/products/jfc/tsc/articles/bookmarks/
The next and most flexible way is using a "real" HTML parser, or even better a Java-based HTML browser: Java HTML Parsing
Now it depends on the details of your task. If parsing static anchor tags is enough, use regular expressions. If not, choose one of the other ways.
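A sketch of the regular-expression approach; the HTML snippet and shop URLs are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {

    // Matches <a href="..."> (or single-quoted) and captures the URL in group(1).
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\s+href\\s*=\\s*[\"']?(.*?)[\"']",
                    Pattern.CASE_INSENSITIVE);

    // Returns every href value found in the given HTML.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = ANCHOR.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1)); // group(1) is the URL itself
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://shop-one.example\">One</a>"
                + " <a href='http://shop-two.example'>Two</a></p>";
        // prints [http://shop-one.example, http://shop-two.example]
        System.out.println(extractLinks(html));
    }
}
```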
As people said, you may use the core classes java.net.URL and java.net.URLConnection to fetch web pages.
But even more useful for that purpose is Apache HttpClient. Look for docs and examples here: http://hc.apache.org/
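For example, a sketch with HttpClient 4.x (assumes the httpclient jar and its dependencies are on the classpath; the URL is just the example from the question):

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpClientFetch {
    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpGet get = new HttpGet("http://www.google.com");
            try (CloseableHttpResponse response = client.execute(get)) {
                // EntityUtils.toString drains the response body into a string.
                String html = EntityUtils.toString(response.getEntity());
                System.out.println(html);
            }
        }
    }
}
```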