An efficient way to check 10,000 blog feeds in Perl

Posted 2024-10-06 23:40:55

We have 10,000s of blogs we want to check multiple times a day for new posts. I'd love some ideas with example code on the most efficient way to do this using Perl.

Currently we are just using LWP::UserAgent to download each RSS feed and then checking each URL from the resulting feed, one at a time, against a MySQL table of already-seen URLs. Needless to say, this doesn't scale well and is super inefficient.

Thanks in advance for your help & advice!

Comments (4)

寒冷纷飞旳雪 2024-10-13 23:40:55

Unfortunately, there is probably no other way than to do some kind of polling.

Luckily, implementing the PubSubHubbub protocol can greatly reduce the amount of polling for the feeds that support it.

For those feeds that don't support PubSubHubbub, you'll have to make sure you use HTTP-level mechanisms (like the ETag or If-Modified-Since headers) to know if/when a resource has been updated.
Also make sure you implement some kind of back-off mechanism, so that feeds which repeatedly fail or rarely change get polled less often.
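
For the HTTP-level part, here is a minimal, untested sketch of a conditional GET with LWP::UserAgent. The in-memory %cache and the fetch_if_changed name are made up for illustration; in practice you'd persist the ETag/Last-Modified values in your MySQL table alongside each feed:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new( timeout => 20 );

    # Stand-in for whatever you persist per feed (these would be
    # columns in your MySQL table instead).
    my %cache;    # url => { etag => ..., last_modified => ... }

    sub fetch_if_changed {
        my ($url) = @_;
        my $c = $cache{$url} || {};

        # Send the validators from the last successful fetch, if any.
        my @cond;
        push @cond, 'If-None-Match'     => $c->{etag}          if $c->{etag};
        push @cond, 'If-Modified-Since' => $c->{last_modified} if $c->{last_modified};

        my $res = $ua->get( $url, @cond );
        return undef if $res->code == 304;     # not modified: nothing to parse
        return undef unless $res->is_success;  # good candidate for back-off

        $cache{$url} = {
            etag          => scalar $res->header('ETag'),
            last_modified => scalar $res->header('Last-Modified'),
        };
        return $res->decoded_content;          # changed feed body, worth parsing
    }

A 304 response costs you one cheap round trip instead of a full download plus a parse.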

酒解孤独 2024-10-13 23:40:55

Perhaps look at AnyEvent::Feed. It is asynchronous (using the AnyEvent event loop), has configurable polling intervals as well as built-in support for 'seen' articles, and handles both RSS and Atom feeds. You could create a single process that polls every feed, or multiple processes that each poll a different section of your feed list.

From the synopsis:

      use AnyEvent;
      use AnyEvent::Feed;

      my $feed_reader =
         AnyEvent::Feed->new (
            url      => 'http://example.com/atom.xml',
            interval => $seconds,

            on_fetch => sub {
               my ($feed_reader, $new_entries, $feed, $error) = @_;

               if (defined $error) {
                  warn "ERROR: $error\n";
                  return;
               }

               for (@$new_entries) {
                  # Each element is a [$hash, $entry] pair.
                  my ($hash, $entry) = @$_;
                  # $hash is a unique hash identifying $entry.
                  # $entry is the XML::Feed::Entry object of an entry
                  # that is new since the last fetch.
               }
            }
         );
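
To make the single-process idea concrete, here is an untested sketch that keeps one reader per feed on a shared event loop. The @urls list and the 600-second interval are invented; with 10,000 feeds you would likely stagger the intervals or shard the list across several processes:

      use AnyEvent;
      use AnyEvent::Feed;

      my @urls = ('http://example.com/atom.xml');   # hypothetical feed list

      # Keep every reader alive in one array; each polls on the shared loop.
      my @readers;
      for my $url (@urls) {
         push @readers, AnyEvent::Feed->new (
            url      => $url,
            interval => 600,
            on_fetch => sub {
               my ($reader, $new_entries, $feed, $error) = @_;
               return warn "ERROR ($url): $error\n" if defined $error;
               for (@$new_entries) {
                  my ($hash, $entry) = @$_;
                  print $entry->link, "\n";   # e.g. record the new post URL
               }
            },
         );
      }

      AnyEvent->condvar->recv;   # run the event loop forever
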
转身泪倾城 2024-10-13 23:40:55

Seems like two questions rolled into one: fetching and comparing. Others have answered the fetching part. As for comparing:

  • I've been reading about redis lately and it seems like a good fit for you, since it can do a lot of simple operations per second (let's say ~80k/s). So checking whether you already have a URL should be really fast. Never actually used it though ;)

  • An idea: have you tried comparing sizes before parsing the RSS? It might save you some time if the feeds change infrequently. (Both ideas are sketched below.)
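
Since neither bullet came with code, here is an untested sketch of both ideas, assuming the Redis and LWP::UserAgent modules from CPAN and a redis-server on localhost; the key names (seen_urls, feedlen:...) and function names are made up:

    use strict;
    use warnings;
    use Redis;             # CPAN client; assumes a local redis-server
    use LWP::UserAgent;

    my $redis = Redis->new;
    my $ua    = LWP::UserAgent->new;

    # Cheap pre-check: a HEAD request to compare the advertised size
    # before downloading and parsing. Content-Length is not always
    # present or reliable, so treat a missing header as "changed".
    sub feed_size_changed {
        my ($url) = @_;
        my $res = $ua->head($url);
        return 1 unless $res->is_success;
        my $len = $res->header('Content-Length');
        return 1 unless defined $len;
        my $old = $redis->getset( "feedlen:$url", $len ) // '';
        return $len ne $old;
    }

    # SADD returns 1 only the first time a member is added, so a single
    # round trip both tests and records a URL.
    sub is_new_url {
        my ($url) = @_;
        return $redis->sadd( 'seen_urls', $url );
    }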

若相惜即相离 2024-10-13 23:40:55

10,000 is not that many.

You could probably handle them with some simple approach, like forking a few worker processes that get RSS URLs from the database, fetch them, and update the database:

my @pids;
for (1 .. $n) {
  my $pid = fork;
  defined $pid or die "fork failed: $!";
  if ($pid) {                 # parent: remember the child and keep forking
     push @pids, $pid;
     next;
  }
  # child: drain the work queue, then exit
  my $db = open_db();                        # stub: connect to MySQL
  while ( my $url = get_next_url($db) ) {    # stub: claim the next feed URL
     my $rss = fetch_rss($url);              # stub: download with LWP::UserAgent
     update_rss($db, $rss);                  # stub: store any new post URLs
  }
  exit 0;
}
waitpid $_, 0 for @pids;      # reap all the workers

That is, assuming you can't use one of the existing applications already pointed out by other responders.
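
If you'd rather not hand-roll the fork/wait bookkeeping, Parallel::ForkManager from CPAN wraps the same pattern and caps the number of concurrent workers for you; a minimal untested sketch (the @urls list stands in for your feeds table):

use strict;
use warnings;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(20);      # at most 20 concurrent workers

my @urls = ('http://example.com/atom.xml');   # stand-in for your feeds table

for my $url (@urls) {
  $pm->start and next;    # parent spawns a child and moves on
  # --- child process: fetch and process one feed, e.g. with LWP::UserAgent ---
  $pm->finish;            # child exits
}
$pm->wait_all_children;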
