
Delphi: best way to extract information from the web

Posted 2024-12-27 06:06:47

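I'd like to know whether there is a better way to extract information from a web page than parsing the HTML for what I'm looking for. For example: extracting a movie rating from "imdb.com".

I'm currently using the IndyHttp component to fetch the page and strUtils to parse the text, but what I can get out of it is limited.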


沫离伤花 2025-01-03 06:06:47

I find plain, simple regular expressions intuitive and easy to use when dealing with well-built web sites, and IMDB is a well-built web site.

For example, the movie rating on IMDB's movie HTML page is in a <DIV> with class="star-box-giga-star". That's very easy to extract using a regular expression. The following regular expression extracts the movie rating from the raw HTML into capture group 1:

star-box-giga-star[^>]*>([^<]*)<

It's not pretty, but it does the job. The regex looks for the "star-box-giga-star" class name, then for the > that closes the opening <DIV> tag, and then captures everything up to the next <. To create a new regex like this, use a web browser that lets you inspect elements (for example Chrome or Opera). In Chrome, right-click the element you want to capture and choose Inspect element, then look around for easily identifiable markers that can anchor a good regex. Here the "star-box-giga-star" class is an obvious one. You'll usually have no trouble finding such markers on good web sites, because they use CSS, and CSS typically relies on IDs or classes to style the elements properly.
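As a minimal sketch of how the regex above could be applied from Delphi, the program below runs it through TRegEx from System.RegularExpressions (available in Delphi XE and later). The function name ExtractRating and the sample HTML fragment are assumptions made up for illustration; in practice the input would be the page text already fetched over HTTP.

```delphi
program RatingRegexDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

// Returns the trimmed text of capture group 1, or '' when nothing matches.
// ExtractRating is a hypothetical helper name, not part of any library.
function ExtractRating(const Html: string): string;
var
  M: TMatch;
begin
  Result := '';
  M := TRegEx.Match(Html, 'star-box-giga-star[^>]*>([^<]*)<');
  if M.Success then
    Result := Trim(M.Groups[1].Value);
end;

const
  // Made-up fragment for illustration; not copied from a real IMDB page.
  SampleHtml = '<div class="titlePageSprite star-box-giga-star"> 8.8 </div>';

begin
  Writeln(ExtractRating(SampleHtml));
end.
```

Returning an empty string on a failed match makes it easy for the caller to notice when the site has changed its markup and the pattern needs updating.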


丶视觉 2025-01-03 06:06:47

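Working with an RSS feed is more comfortable.

At the time of posting, the only RSS feeds available on the site are:

Born on this date
Died on this date
Daily poll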

However, you can get in touch with the help desk to ask about others.

Resources on RSS feed processing:

A related post
SuperObject
Wikipedia


自演自醉 2025-01-03 06:06:47

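When scraping a web site, you cannot rely on the information staying available. IMDB may detect your scraping and try to block you, or they may change the format frequently to make scraping harder.

Therefore, you should always try to use a supported API or RSS feed, or at least get permission from the web site to aggregate its data, and make sure you comply with its terms. Often you will have to pay for this kind of access. Scraping a web site without permission may expose you to liability on a couple of legal fronts (denial of service and intellectual property).

Here is IMDB's statement:

"You may not use data mining, robots, screen scraping, or similar online data gathering and extraction tools on our website."

To answer your question: the better way is to use the method the web site provides. For non-commercial use, and if you abide by their terms, you can download the IMDB database directly and use the data from there instead of scraping their site. Simply update your copy of the database frequently; that is a better solution than scraping the site. You could even wrap your own web API around it. Ratings are available as a standalone table.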


九厘米的零° 2025-01-03 06:06:47

Use HTML Tidy to convert any HTML into valid XML, then use an XML parser, perhaps with XPath or with code you develop yourself (which is what I do).
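The XPath half of that approach might look like the sketch below, which assumes the page has already been run through HTML Tidy and is well-formed XML. It uses the Windows MSXML 6 parser through late-bound COM, which is only one possible XML parser; the helper name ExtractByXPath, the sample XML, and the XPath expression are all made up for illustration. The unit names assume Delphi XE2 or later.

```delphi
program TidyXPathDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Variants, System.Win.ComObj, Winapi.ActiveX;

const
  // Made-up stand-in for Tidy output; the XHTML namespace declaration is
  // left out here to keep the XPath expression simple.
  SampleXml =
    '<html><body>' +
    '<div class="star-box-giga-star">8.8</div>' +
    '</body></html>';

// Returns the text of the first node matching XPath, or '' if the XML does
// not parse or nothing matches. ExtractByXPath is a hypothetical helper.
function ExtractByXPath(const Xml, XPath: string): string;
var
  Doc, Node: OleVariant;
begin
  Result := '';
  Doc := CreateOleObject('Msxml2.DOMDocument.6.0');
  Doc.async := False;
  if not Boolean(Doc.loadXML(Xml)) then
    Exit;
  Node := Doc.selectSingleNode(XPath);
  if not VarIsClear(Node) then
    Result := Node.text;
end;

begin
  CoInitialize(nil);  // MSXML is a COM component
  try
    Writeln(ExtractByXPath(SampleXml, '//div[@class="star-box-giga-star"]'));
  finally
    CoUninitialize;
  end;
end.
```

XHTML produced by Tidy usually declares the XHTML namespace on the root element. In that case the XPath expression needs a namespace prefix, which MSXML lets you register through the document's SelectionNamespaces property.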


微凉徒眸意 2025-01-03 06:06:47

All the answers posted cover your general question well. I usually follow a strategy similar to the one detailed by Cosmin. I use wininet and regex for most of my web extraction needs.

But let me add my two cents on the specific sub-question of extracting an IMDB rating. IMDBAPI.COM provides a query interface that returns JSON, which is very handy for this type of search.

So a very simple command-line program for getting an IMDB rating would be...

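program imdbrating;
{$apptype console}
// HttpGet and UrlEncode come from htmlutils, which is not a standard Delphi
// unit; it is presumably the answer author's own helper unit.
uses htmlutils;

// Returns the string value stored under "parm" in the JSON text h, or 'N/A'.
// Naive scan: assumes the form "Name":"value" with no whitespace around the
// colon and no escaped quotes inside the value.
function ExtractJsonParm(const parm, h: string): string;
var
  start, len: integer;
begin
  result := 'N/A';
  start := pos('"' + parm + '":"', h);
  if start = 0 then exit;
  start := start + length(parm) + 4;               // first character of the value
  len := pos('"', copy(h, start, length(h))) - 1;  // stop at the closing quote
  if len >= 0 then
    result := copy(h, start, len);
end;

var
  h: string;
begin
  h := HttpGet('http://www.imdbapi.com/?t=' + UrlEncode(ParamStr(1)));
  writeln(ExtractJsonParm('Rating', h));
end.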
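The hand-rolled scan in ExtractJsonParm depends on the exact layout of the response. As a hedged alternative, the sketch below reads the same kind of field with the System.JSON unit that ships with recent Delphi versions (older releases expose similar classes in Data.DBXJSON). The helper name ExtractJsonString and the sample JSON are assumptions for illustration only; the real service's field names may differ.

```delphi
program JsonRatingDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.JSON;

// Returns the value stored under Name in a JSON object, or 'N/A' when the
// text is not a JSON object or the field is missing.
// ExtractJsonString is a hypothetical helper name.
function ExtractJsonString(const Json, Name: string): string;
var
  Root, Field: TJSONValue;
begin
  Result := 'N/A';
  Root := TJSONObject.ParseJSONValue(Json);  // nil if the text is not valid JSON
  try
    if Root is TJSONObject then
    begin
      Field := TJSONObject(Root).GetValue(Name);
      if Field <> nil then
        Result := Field.Value;
    end;
  finally
    Root.Free;  // the root owns its child values; Free is safe on nil
  end;
end;

const
  // Made-up response for illustration, not captured from the real service.
  SampleJson = '{"Title":"Example","Rating":"8.8","Votes":"1,234"}';

begin
  Writeln(ExtractJsonString(SampleJson, 'Rating'));
  Writeln(ExtractJsonString(SampleJson, 'Votes'));
end.
```

A real parser copes with values that contain commas (such as the "Votes" field in the sample), with whitespace around the colon, and with a field that happens to be last in the object, all of which can trip up plain string searching.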
