Searching for keywords in an XML feed

Posted 2024-09-15 14:50:57


All,

I'm building a site which will gather news stories from about 35 different RSS feeds, storing them in an array. I'm using a foreach() loop to search the title and description of each article to see if it contains one of about 40 keywords, using substr(). If the search is successful, that article is stored in a DB, and ultimately will appear on the site.
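For concreteness, a minimal sketch of the kind of loop being described, assuming stripos() does the actual keyword test (substr() only extracts characters; it cannot search), and with $stories and $keywords as stand-ins for the parsed feed items and the keyword list:

$matches = array();
foreach ($stories as $story) {
    $haystack = $story['title'] . ' ' . $story['description'];
    foreach ($keywords as $keyword) {
        if (stripos($haystack, $keyword) !== false) {
            $matches[] = $story;  // this is where the article would be written to the DB
            break;                // stop checking further keywords for this article
        }
    }
}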

The script runs every 30 mins. Trouble is, it takes 1-3 mins depending on how many stories are returned. Not 'terrible', but in a shared hosting environment I can see this causing plenty of issues, especially as the site grows and more feeds/keywords are added.

Are there any ways that I can optimize the 'searching' of keywords, so that I can speed up the 'indexing'?

Thanks!!


Comments (2)

人心善变 2024-09-22 14:50:57


35-40 RSS feeds are a lot of requests for one script to handle and parse all at once. Your bottleneck is most likely the requests, not the parsing. You should separate the concerns. Have one script that requests an RSS feed one at a time every minute or so, and store the results locally. Then another script should parse and save/remove the temporary results every 15-30 minutes.
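For illustration, a rough sketch of the fetch-and-cache script this answer suggests, run from cron roughly once a minute; the $feedUrls list, the state file, and the cache directory are hypothetical names:

$feedUrls = array(
    'http://example.com/feed1.rss',   // hypothetical feed URLs
    'http://example.com/feed2.rss',
    // ... the remaining ~35 feeds
);

// Track which feed was fetched last in a small state file, so each cron run
// requests only one feed instead of all of them at once.
$stateFile = __DIR__ . '/feed_index.txt';
$index     = is_file($stateFile) ? (int) file_get_contents($stateFile) : 0;
$index     = $index % count($feedUrls);

$xml = @file_get_contents($feedUrls[$index]);
if ($xml !== false) {
    // Cache the raw XML locally; the separate parsing script reads these files
    // every 15-30 minutes. Assumes a writable ./cache directory exists.
    file_put_contents(__DIR__ . '/cache/feed_' . $index . '.xml', $xml);
}

// Advance to the next feed for the next run.
file_put_contents($stateFile, ($index + 1) % count($feedUrls));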

如日中天 2024-09-22 14:50:57


You could use XPath to search the XML directly... Something like:

$dom = new DomDocument();
$dom->loadXml($feedXml);
$xpath = new DomXpath($dom);

$query = '//item[contains(title, "foo")] | //item[contains(description, "foo")]';
$matchingNodes = $xpath->query($query);

Then $matchingNodes will be a DOMNodeList of all the matching item nodes, which you can save to the database...

So to adjust this to your real world example, you could either build the query to do all the searching for you in one shot:

$query = array();
foreach($keywords as $keyword) {
    $query[] = '//item[contains(title, "'.$keyword.'")]';
    $query[] = '//item[contains(description, "'.$keyword.'")]';
}
$query = implode('|', $query);

Or just re-query for each keyword... Personally, I'd build one giant query, since then all the matching is done in compiled C code (and hence should be more efficient than looping in PHP land and aggregating the results there)...
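Putting the two snippets above together, a hedged sketch of the combined query plus the loop over the resulting DOMNodeList; $feedXml and $keywords are assumed to exist already, and keywords containing double quotes would need extra escaping before going into the XPath expression:

$dom = new DOMDocument();
$dom->loadXML($feedXml);
$xpath = new DOMXPath($dom);

$parts = array();
foreach ($keywords as $keyword) {
    // Note: contains() is case-sensitive, and a keyword containing a double
    // quote would break this expression as written.
    $parts[] = '//item[contains(title, "' . $keyword . '")]';
    $parts[] = '//item[contains(description, "' . $keyword . '")]';
}
$matchingNodes = $xpath->query(implode(' | ', $parts));

foreach ($matchingNodes as $item) {
    $title = $item->getElementsByTagName('title')->item(0)->nodeValue;
    $link  = $item->getElementsByTagName('link')->item(0)->nodeValue;
    // Insert $title / $link (and whatever else you need) into the DB here.
}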
