Delphi: best way to extract information from the web

Posted 2024-12-27 06:06:47 | 121 words | 1 view | 0 comments

I want to know if there is a better way of extracting information from a web page than parsing the HTML for what I'm looking for. For example: extracting a movie rating from 'imdb.com'.

I'm currently using the IndyHttp components to fetch the page and StrUtils to parse the text, but what I can extract this way is limited.

Comments (6)

沫离伤花 2025-01-03 06:06:47

I have found plain regular expressions to be intuitive and simple when dealing with well-built web sites, and IMDB is a well-built web site.

For example, the movie rating on an IMDB movie page is in a <DIV> with class="star-box-giga-star". That is very easy to extract using a regular expression. The following regular expression extracts the movie rating from the raw HTML into capture group 1:

star-box-giga-star[^>]*>([^<]*)<

It's not pretty, but it does the job. The regex looks for the "star-box-giga-star" class name, then for the > that closes the opening DIV tag, and then captures everything up to the next <.

To create a new regex like this, use a web browser that allows inspecting elements (for example Chrome or Opera). In Chrome, open the web page, right-click the element you want to capture and choose Inspect element, then look around for easily identifiable markers that can anchor a good regex. In this case the "star-box-giga-star" class is obviously easy to identify. You will usually have no problem finding such markers on well-built web sites, because they use CSS, and CSS needs IDs or classes to style the elements properly.
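
As a rough illustration, the regex above could be applied from Delphi along these lines. This is only a sketch: it assumes Delphi XE or later (for the System.RegularExpressions unit), and the function name ExtractRating and the sample HTML string are made up for the example rather than taken from IMDB.

```pascal
program RegexRating;
{$APPTYPE CONSOLE}
uses
  System.SysUtils, System.RegularExpressions;

// Hypothetical helper: returns the text captured by group 1 of the regex
// from the answer, or an empty string when the marker is not found.
function ExtractRating(const Html: string): string;
var
  M: TMatch;
begin
  M := TRegEx.Match(Html, 'star-box-giga-star[^>]*>([^<]*)<');
  if M.Success then
    Result := Trim(M.Groups[1].Value) // drop whitespace around the rating
  else
    Result := '';
end;

begin
  // Made-up fragment standing in for the downloaded page.
  Writeln(ExtractRating('<div class="star-box-giga-star"> 8.5 </div>'));
end.
```

In real use the string passed to ExtractRating would be the page body fetched with Indy or WinInet. If the site changes its markup, the function simply returns an empty string, so the caller should check for that.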

丶视觉 2025-01-03 06:06:47

Processing an RSS feed is more convenient.

As of the time of posting, the only RSS feeds available on the site are:

  • Born on this Date
  • Died on this Date
  • Daily Poll

However, you can ask for a new feed to be added by getting in touch with the help desk.

自演自醉 2025-01-03 06:06:47

When scraping websites, you cannot rely on the availability of the information. IMDB may detect your scraping and attempt to block you, or they may frequently change the format to make it more difficult.

Therefore, you should always try to use a supported API or RSS feed, or at least get permission from the web site to aggregate their data, and ensure that you're abiding by their terms. Often, you will have to pay for this type of access. Scraping a website without permission may open you up to liability on a couple of legal fronts (denial of service and intellectual property).

Here's IMDB's statement:

You may not use data mining, robots, screen scraping, or similar
online data gathering and extraction tools on our website.

To answer your question, the better way is to use the method provided by the website. For non-commercial use, and if you abide by their terms, you can download the IMDB database directly and use the data from there instead of scraping their site. Simply update your database frequently, and it's a better solution than scraping the site. You could even wrap your own web API around it. Ratings are available as a standalone table.

九厘米的零° 2025-01-03 06:06:47

Use HTML Tidy to convert any HTML to valid XML, and then use an XML parser, perhaps with XPath or with your own code (which is what I do).

微凉徒眸意 2025-01-03 06:06:47

The answers already posted cover your general question well. I usually follow a strategy similar to the one detailed by Cosmin: I use WinInet and regex for most of my web extraction needs.

But let me add my two cents on the specific sub-question of extracting the IMDB rating. IMDBAPI.COM provides a query interface that returns JSON, which is very handy for this type of search.

So a very simple command-line program for getting an IMDB rating would be...

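program imdbrating;
{$apptype console}
uses htmlutils; // not a standard Delphi unit; presumably the answerer's own, supplying HttpGet and UrlEncode

// Returns the string value stored under "parm" in the JSON text h, or 'N/A' if absent.
// Deliberately naive: expects "parm":"value" with no whitespace and no escaped quotes.
function ExtractJsonParm(parm,h:string):string;
 var r:integer;
 begin
  r:=pos('"'+parm+'":"',h);
  if r<>0 then
   begin
    // skip past "parm":" to the first character of the value
    result:=copy(h,r+length(parm)+4,length(h));
    // cut at the closing quote, so values containing commas and the last field both work
    result:=copy(result,1,pos('"',result)-1);
   end
  else
    result:='N/A';
 end;

var h:string;
begin
  // the movie title is taken from the first command-line argument
  h:=HttpGet('http://www.imdbapi.com/?t=' + UrlEncode(ParamStr(1)));
  writeln(ExtractJsonParm('Rating',h));
end.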
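
Since the service returns JSON, the string search above could also be swapped for a real JSON parser. The following is a minimal sketch, assuming a recent Delphi version that ships the System.JSON unit (older releases expose similar classes in a differently named unit). The function name JsonField and the sample response are invented for illustration and are not the actual output of IMDBAPI.COM.

```pascal
program JsonRating;
{$APPTYPE CONSOLE}
uses
  System.JSON;

// Hypothetical helper: returns the value stored under Name, or 'N/A' when
// the text is not a JSON object or the key is missing.
function JsonField(const Json, Name: string): string;
var
  Parsed, Field: TJSONValue;
begin
  Result := 'N/A';
  Parsed := TJSONObject.ParseJSONValue(Json); // nil if the text is not valid JSON
  try
    if Parsed is TJSONObject then
    begin
      Field := TJSONObject(Parsed).GetValue(Name);
      if Field <> nil then
        Result := Field.Value;
    end;
  finally
    Parsed.Free; // safe even when Parsed is nil
  end;
end;

begin
  // Made-up response body, used only for illustration.
  Writeln(JsonField('{"Title":"Example, The","Rating":"8.5"}', 'Rating'));
end.
```

A parser copes with whitespace, escaped quotes and field order, none of which the string-search version handles.
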
晌融 2025-01-03 06:06:47

If the page you are crawling is valid XML, I use SimpleXML to extract the information. It works pretty well.
