Delphi: best way to extract information from the web
I want to know if there is a better way of extracting information from a web page than parsing the HTML for what I'm searching for, e.g. extracting a movie rating from 'imdb.com'.
I'm currently using the IndyHttp components to get the page and strUtils to parse the text, but the content is limited.
Comments (6)
I have found plain regular expressions to be highly intuitive and simple when dealing with good web sites, and IMDB is a good web site.
For example, the movie rating on IMDB's movie HTML page is in a <DIV> with class="star-box-giga-star". That's very easy to extract using a regular expression. The following regular expression will extract the movie rating from the raw HTML into capture group 1:
It's not pretty, but it does the job. The regex looks for the "star-box-giga-star" class name, then it looks for the > that terminates the DIV tag, and then captures everything until the following <.
To create a new regex like this you should use a web browser that allows inspecting elements (for example Chrome or Opera). With Chrome you can simply look at the web page, right-click on the element you want to capture and choose Inspect element, then look around for easily identifiable elements that can be used to create a good regex. In this case the "star-box-giga-star" class is obviously easy to identify. You'll usually have no problem finding such identifiable elements on good web sites, because good web sites use CSS, and CSS requires IDs or classes to be able to style the elements properly.
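A pattern along the lines described above could be tried from Delphi roughly as follows. This is only a sketch: the pattern is rebuilt from the description in the answer rather than copied from it, the sample HTML is made up, and it assumes a Delphi version that ships the System.RegularExpressions unit (XE2 or later).

```pascal
program ExtractRating;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.RegularExpressions;

const
  // Hypothetical fragment standing in for the downloaded page source.
  SampleHtml = '<div class="star-box-giga-star"> 8.7 </div>';

  // Find the class name, skip to the ">" that closes the opening tag,
  // then capture everything up to the next "<" (capture group 1).
  RatingPattern = 'star-box-giga-star[^>]*>([^<]*)<';

var
  M: TMatch;
begin
  M := TRegEx.Match(SampleHtml, RatingPattern);
  if M.Success then
    Writeln('Rating: ', Trim(M.Groups[1].Value))
  else
    Writeln('Rating not found');
end.
```

In real use the sample constant would be replaced by the string returned from the HTTP component. The pattern is tied to the page markup, so it needs revisiting whenever the site changes its layout.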
Processing an RSS feed is more comfortable.
As of the time of posting, the only RSS feeds available on the site are:
However, you can request a new one by getting in touch with the help desk.
Resources on RSS feed processing:
When scraping websites, you cannot rely on the availability of the information. IMDB may detect your scraping and attempt to block you, or they may frequently change the format to make it more difficult.
Therefore, you should always try to use a supported API or RSS feed, or at least get permission from the web site to aggregate their data, and ensure that you're abiding by their terms. Often, you will have to pay for this type of access. Scraping a website without permission may open you up to liability on a couple of legal fronts (denial of service and intellectual property).
Here's IMDB's statement:
To answer your question, the better way is to use the method provided by the website. For non-commercial use, and if you abide by their terms, you can download the IMDB database directly and use the data from there instead of scraping their site. Simply update your database frequently, and it's a better solution than scraping the site. You could even wrap your own web API around it. Ratings are available as a standalone table.
Use HTML Tidy to convert any HTML to valid XML and then use an XML parser, maybe using XPath or developing your own code (which is what I do).
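Once the page has been turned into well-formed XML, an XPath query could look roughly like the sketch below. It assumes Windows with MSXML 6 installed and uses it through late binding; the sample document is made up, the class name is borrowed from the regex answer, and the Tidy step itself is not shown.

```pascal
program XPathRating;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.Variants,
  System.Win.ComObj,
  Winapi.ActiveX;

const
  // Hypothetical, already tidied (well-formed) fragment of a page.
  SampleXml =
    '<html><body>' +
    '<div class="star-box-giga-star">8.7</div>' +
    '</body></html>';

procedure PrintRating(const Xml: string);
var
  Doc, Node: OleVariant;
  Loaded: Boolean;
  Rating: string;
begin
  // Late-bound MSXML 6 document; XPath is its default selection language.
  Doc := CreateOleObject('Msxml2.DOMDocument.6.0');
  Doc.async := False;
  Loaded := Doc.loadXML(Xml);
  if not Loaded then
    raise Exception.Create('Input is not well-formed XML');

  // Select the first div carrying the class we are interested in.
  Node := Doc.selectSingleNode('//div[@class="star-box-giga-star"]');
  if VarIsClear(Node) then
    Writeln('Rating not found')
  else
  begin
    Rating := Node.text;
    Writeln('Rating: ', Trim(Rating));
  end;
end;

begin
  CoInitialize(nil);
  try
    // The COM objects live only inside PrintRating, so they are
    // released before CoUninitialize runs.
    PrintRating(SampleXml);
  finally
    CoUninitialize;
  end;
end.
```

One thing to watch for: XHTML produced by Tidy usually declares a default namespace, in which case a plain //div matches nothing and a prefix has to be registered through the document's SelectionNamespaces property first.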
All the answers posted cover your general question well. I usually follow a strategy similar to the one detailed by Cosmin. I use wininet and regex for most of my web extraction needs.
But let me add my two cents on the specific subquestion of extracting the IMDB rating. IMDBAPI.COM provides a query interface returning JSON, which is very handy for this type of search.
So a very simple command line program for getting an IMDB rating would be...
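The JSON handling part of such a program might be sketched as below. This is not the program from the answer: the response body and the "Rating" field name are assumptions made for illustration, the HTTP request is left out, and the code assumes a Delphi version that ships the System.JSON unit.

```pascal
program JsonRating;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.JSON;

const
  // Hypothetical response body; the real field names may differ.
  SampleJson = '{"Title":"Some Movie","Rating":"8.7"}';

var
  Parsed, Field: TJSONValue;
begin
  // ParseJSONValue returns nil when the text is not valid JSON.
  Parsed := TJSONObject.ParseJSONValue(SampleJson);
  try
    Field := nil;
    if Parsed is TJSONObject then
      Field := TJSONObject(Parsed).GetValue('Rating');

    if Field <> nil then
      Writeln('Rating: ', Field.Value)
    else
      Writeln('Rating not found');
  finally
    // Field is owned by Parsed, so only the root is freed.
    Parsed.Free;
  end;
end.
```

In a real program SampleJson would be replaced by the text returned from the query interface, fetched with whichever HTTP component is already in use.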
If the page you are crawling is valid XML, I use SimpleXML to extract the information. It works pretty well.
Resource: