First you need to get familiar with an HTML DOM parser in Java, such as JTidy. This will help you extract the content you want from an HTML file. Once you have the essential pieces, you can use JDBC to put them into a database.

It might be tempting to use regular expressions for this job, but don't: HTML is not a regular language, so regular expressions are not the way to go.
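As a minimal sketch of the JTidy-then-extract approach: JTidy can parse even messy HTML into a standard W3C DOM, which you can then walk with the usual `org.w3c.dom` API. The class name and the choice of extracting `<a href>` values are illustrative, not part of JTidy itself.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class TidyExtract {
    // Parse (possibly malformed) HTML into a W3C DOM using JTidy,
    // then collect the href attribute of every <a> element.
    public static List<String> linkHrefs(String html) {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);          // suppress JTidy's progress output
        tidy.setShowWarnings(false);  // suppress markup warnings on stderr
        Document doc = tidy.parseDOM(
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)), null);
        NodeList anchors = doc.getElementsByTagName("a");
        List<String> hrefs = new ArrayList<>();
        for (int i = 0; i < anchors.getLength(); i++) {
            hrefs.add(((Element) anchors.item(i)).getAttribute("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Note the missing </body></html> -- JTidy repairs the markup first.
        System.out.println(linkHrefs("<html><body><a href='/a'>one</a><a href='/b'>two</a>"));
    }
}
```

From here, inserting each extracted value into the database is a plain JDBC call.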
I am running a scraper using JSoup. I'm a noob, yet I found it to be very intuitive and easy to work with. It is also capable of parsing a wide range of sources: HTML, XML, RSS, etc.

I experimented with HtmlUnit with little to no success.
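To illustrate why JSoup feels intuitive: it lets you query the parsed document with CSS-style selectors. This is a small sketch; the class name and the selector are just examples.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupScrape {
    // Parse an HTML string and pull out link text with a CSS-style selector.
    public static List<String> linkTexts(String html) {
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("a[href]");  // every <a> that has an href attribute
        List<String> texts = new ArrayList<>();
        for (Element link : links) {
            texts.add(link.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // For a live page you would use Jsoup.connect(url).get() instead of parse().
        System.out.println(linkTexts("<p><a href='/x'>first</a> <a href='/y'>second</a></p>"));
    }
}
```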
I successfully used the Lobo Browser API in a project that scraped HTML pages. The Lobo Browser project offers a browser, but you can also use the API behind it very easily. It will also execute JavaScript, and if that JavaScript manipulates the DOM, that will be reflected when you inspect the DOM. In short, the API allows you to mimic a browser, and you can also work with cookies and the like.

Now, for getting the data out of the HTML, I would first transform the HTML to valid XHTML; you can use JTidy for this. Since XHTML is valid XML, you can use XPath to retrieve the data you want very easily. If you try to write code that parses the data out of the raw HTML, your code will quickly become a mess, so I'd use XPath.
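Once the markup is valid XHTML, the XPath step needs nothing beyond the JDK. A small sketch, with an illustrative document and query (real XHTML declares the `http://www.w3.org/1999/xhtml` namespace, which would require a namespace-aware setup):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathScrape {
    // Return the text of the first node matching an XPath expression
    // in a valid XHTML/XML string.
    public static String extract(String xhtml, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        // XPath.evaluate returns the string-value of the first match.
        return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div id='price'>42</div></body></html>";
        System.out.println(extract(page, "//div[@id='price']"));
    }
}
```

Compare this one-line query to the string-slicing code you would otherwise write against raw HTML.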
Once you have the data, you can insert it into a DB with JDBC, or maybe use Hibernate if you want to avoid writing too much SQL.
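The JDBC side can stay small. A sketch under assumptions: the table `pages(url, title)` and the connection details are hypothetical, and a `PreparedStatement` keeps scraped strings from being interpreted as SQL.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

public class PageDao {
    // Hypothetical target table: pages(url VARCHAR, title VARCHAR).
    static final String INSERT_SQL = "INSERT INTO pages (url, title) VALUES (?, ?)";

    // Insert one scraped record; the caller supplies the Connection,
    // e.g. DriverManager.getConnection("jdbc:mysql://localhost/scraper", user, pass).
    public static void insert(Connection conn, String url, String title) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, url);    // parameters are bound, never concatenated
            ps.setString(2, title);
            ps.executeUpdate();
        }
    }
}
```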
A HUGE percentage of websites are built on malformed HTML. It is essential that you use something like HtmlCleaner to clean up the source code you want to parse. Then you can successfully use XPath to extract nodes, and regular expressions to parse specific parts of the strings you extracted from the page.
At least this is the technique I used.
You can use the XHTML returned by HtmlCleaner as a sort of interface between your application and the remote page you're trying to parse. You should test against this, and if the remote page changes you just have to extract the new XHTML cleaned by HtmlCleaner, re-adapt the XPath queries to extract what you need, and re-test your application code against the new interface.

If you want to create a multithreaded scraper, be aware that HtmlCleaner is not thread-safe (see my post here). That post can give you an idea of how to parse correctly formatted XHTML using XPath.
Good Luck! ;)
Note: at the time I implemented my scraper, HtmlCleaner did a better job of normalizing the pages I wanted to parse. In some cases JTidy failed at the same job, so I'd suggest you give HtmlCleaner a try.
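The clean-then-query flow above can be sketched in a few lines. HtmlCleaner returns a `TagNode` tree with its own built-in XPath evaluation; the class name and the deliberately broken table markup are just illustration.

```java
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class CleanerScrape {
    // Clean malformed HTML, then query the repaired tree with XPath.
    public static String firstCellText(String html) throws Exception {
        TagNode root = new HtmlCleaner().clean(html);   // repairs unclosed tags etc.
        Object[] cells = root.evaluateXPath("//td");    // HtmlCleaner's built-in XPath
        return ((TagNode) cells[0]).getText().toString().trim();
    }

    public static void main(String[] args) throws Exception {
        // Deliberately malformed input: no closing tags at all.
        System.out.println(firstCellText("<table><tr><td>cell one<td>cell two"));
    }
}
```

Remember the thread-safety caveat above: create one `HtmlCleaner` instance per thread rather than sharing it.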
Using JTidy you can scrape data from HTML. Then you can use JDBC.