How to skip known entries when syncing with Google Reader?
For writing an offline client to the Google Reader service, I would like to know how best to sync with the service.
There doesn't seem to be official documentation yet and the best source I found so far is this: http://code.google.com/p/pyrfeed/wiki/GoogleReaderAPI
Now consider this: With the information from above I can download all unread items, I can specify how many items to download and using the atom-id I can detect duplicate entries that I already downloaded.
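The duplicate detection described above (using the atom-id of each entry) can be sketched like this; the entry dicts and their field names are assumptions about how a client might store downloaded items, not part of any Google Reader API:

```python
def new_entries(downloaded, seen_ids):
    """Filter out entries whose atom-id is already in the local store.

    `downloaded` is a list of dicts with an "id" key holding the atom-id;
    `seen_ids` is the set of ids already saved locally. The set is updated
    in place with the ids of the entries we keep.
    """
    fresh = [e for e in downloaded if e["id"] not in seen_ids]
    seen_ids.update(e["id"] for e in fresh)
    return fresh

# Hypothetical entries as they might come back from the reading-list feed.
batch = [
    {"id": "tag:google.com,2005:reader/item/0001", "title": "A"},
    {"id": "tag:google.com,2005:reader/item/0002", "title": "B"},
]
seen = {"tag:google.com,2005:reader/item/0001"}
print([e["title"] for e in new_entries(batch, seen)])  # ['B']
```

The drawback, as noted below, is that you still have to download every item before you can reject it.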
What's missing for me is a way to specify that I just want the updates since my last sync.
I can say: give me the 10 latest entries (parameters n=10 and r=d). If I specify r=o (date ascending) instead, I can also specify ot=[time of last sync], but only in that case, and ascending order makes no sense when I only want to read some items rather than all of them.
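The parameter combinations just described can be sketched as a small URL builder. The base endpoint here is the unofficial reading-list URL documented on the pyrfeed wiki linked above; treat both it and the exact parameter semantics as assumptions:

```python
from urllib.parse import urlencode

# Unofficial endpoint from the pyrfeed wiki; not a published, stable API.
BASE = "http://www.google.com/reader/atom/user/-/state/com.google/reading-list"

def feed_url(count=10, order="d", older_than=None):
    """Build a reading-list request URL.

    order "d" = date descending (newest first), "o" = date ascending;
    older_than (the ot parameter) is only meaningful with r=o.
    """
    params = {"n": count, "r": order}
    if older_than is not None:
        params["ot"] = older_than
    return BASE + "?" + urlencode(params)

print(feed_url(10, "d"))
# e.g. ...reading-list?n=10&r=d
```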
Any idea how to solve that without downloading all items again and rejecting duplicates? That's not a very economical way of polling.
Someone proposed that I could specify that I only want the unread entries. But for that solution to work in such a way that Google Reader does not offer these entries again, I would need to mark them as read. In turn, that would mean I need to keep my own read/unread state on the client, and entries would already be marked as read when the user logs on to the online version of Google Reader. That doesn't work for me.
Cheers,
Mariano
Comments (2)
To get the latest entries, use the standard from-newest-date-descending download, which will start from the latest entries. You will receive a "continuation" token in the XML result.
Scan through the results, pulling out anything new to you. You should find that either all results are new, or everything up to a point is new, and all after that are already known to you.
In the latter case, you're done, but in the former you need to find the new stuff that is older than what you've already retrieved. Do this by using the continuation to get the results starting from just after the last result in the set you just retrieved, passing it in the GET request as the "c" parameter. Continue this way until you have everything.
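The loop just described can be sketched as follows. The fetch function here is a stand-in for an actual HTTP request that returns the parsed entries and the continuation token (None when there are no more pages); how you issue and parse the real request is up to your client:

```python
def fetch_all(fetch_page):
    """Page through results using the continuation token.

    `fetch_page(c)` is assumed to return (entries, continuation), where
    `c` is the continuation token to pass in the GET request (None for
    the first page) and `continuation` is None on the last page.
    """
    entries, c = fetch_page(None)
    all_entries = list(entries)
    while c is not None:
        entries, c = fetch_page(c)
        all_entries.extend(entries)
    return all_entries

# Fake three-page feed to show the shape of the loop.
pages = {
    None:   ([1, 2], "tok1"),
    "tok1": ([3, 4], "tok2"),
    "tok2": ([5], None),
}
print(fetch_all(lambda c: pages[c]))  # [1, 2, 3, 4, 5]
```

In practice you would stop paging as soon as a page contains only items you already know, rather than always fetching everything.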
The "n" parameter, which is a count of the number of items to retrieve, works well with this, and you can change it as you go. If the frequency of checking is user-set, and thus could be very frequent or very rare, you can use an adaptive algorithm to reduce network traffic and your processing load. Initially request a small number of the latest entries, say five (add n=5 to the URL of your GET request). If all are new, in the next request, where you use the continuation, ask for a larger number, say 20. If those are still all new, either the feed has a lot of updates or it's been a while, so continue on in groups of 100 or whatever.
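A minimal sketch of that adaptive batch sizing; the exact growth factor and cap are free choices, not anything prescribed by the service:

```python
def next_batch_size(current, all_new, cap=100):
    """Grow the request size while the previous batch was entirely new;
    drop back to a small probe otherwise (5 -> 20 -> 80 -> 100 here).
    """
    if not all_new:
        return 5              # last batch had known items: small probe next time
    return min(cap, current * 4)

size = 5
for all_new in (True, True, True, False):
    size = next_batch_size(size, all_new)
    print(size)  # 20, 80, 100, 5
```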
However, and correct me if I'm wrong here, you also want to know, after you've downloaded an item, whether its state changes from "unread" to "read" due to the person reading it using the Google Reader interface.
One approach to this would be:
If the user subscribes to a lot of different blogs, it's also likely he labels them extensively, so you can do this whole thing on a per-label basis rather than for the entire feed, which should help keep the amount of data down, since you won't need to do any transfers for labels where the user didn't read anything new on google reader.
This whole scheme can be applied to other statuses, such as starred or unstarred, as well.
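One way to decide which labels actually need a transfer is to compare per-label unread counts between syncs and skip the unchanged ones. How you obtain those counts is an assumption here (Google Reader exposed unread counts, but this sketch just takes them as dicts):

```python
def labels_to_sync(prev_counts, curr_counts):
    """Return the labels whose unread count changed since the last sync.

    Labels the user didn't touch keep the same count and need no
    transfer at all; new labels (absent from prev_counts) are included.
    """
    return [label for label, n in curr_counts.items()
            if prev_counts.get(label) != n]

print(labels_to_sync({"news": 3, "tech": 0}, {"news": 3, "tech": 5}))  # ['tech']
```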
Now, as you say, this doesn't work for you. True enough. Neither keeping a local read/unread state (since you're keeping a database of all of the items anyway) nor marking items read in Google (which the API supports) seems very difficult, so why doesn't this work for you?
There is one further hitch, however: the user may mark something already read as unread on Google. This throws a bit of a wrench into the system. My suggestion there, if you really want to try to take care of this, is to assume that the user in general will be touching only more recent stuff, and download the latest couple hundred or so items every time, checking the status on all of them. (This isn't all that bad; downloading 100 items took me anywhere from 0.3s for 300KB to 2.5s for 2.5MB, albeit on a very fast broadband connection.)
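That status check amounts to diffing a freshly downloaded window of recent items against the locally stored flags. A sketch, with item structure and field names assumed:

```python
def state_changes(local_state, latest_items):
    """Compare stored read/unread flags against a fresh download window.

    `local_state` maps atom-id -> bool (True = read); `latest_items` is
    the newest couple hundred entries with their current "read" flag.
    Returns {id: new_flag} for every item whose state flipped either way,
    so read->unread changes are caught too.
    """
    changed = {}
    for item in latest_items:
        item_id, read = item["id"], item["read"]
        if item_id in local_state and local_state[item_id] != read:
            changed[item_id] = read
    return changed

local = {"item1": True, "item2": False}
window = [{"id": "item1", "read": False}, {"id": "item2", "read": False}]
print(state_changes(local, window))  # {'item1': False}
```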
Again, if the user has a large number of subscriptions, he's also probably got a reasonably large number of labels, so doing this on a per-label basis will speed things up. I'd suggest, actually, that not only do you check on a per-label basis, but you also spread out the checks, checking a single label each minute rather than everything once every twenty minutes. You can also do this "big check" for status changes on older items less often than you do a "new stuff" check, perhaps once every few hours, if you want to keep bandwidth down.
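Spreading the checks out, as suggested, is just a round-robin over the labels, one per tick (e.g. per minute):

```python
from itertools import cycle

# Hypothetical label list; in a real client this would come from the
# user's subscriptions.
labels = ["news", "tech", "blogs"]
ticker = cycle(labels)

# Each tick of your polling timer checks exactly one label.
print([next(ticker) for _ in range(5)])  # ['news', 'tech', 'blogs', 'news', 'tech']
```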
This is a bit of a bandwidth hog, mainly because you need to download the full article from Google merely to check its status. Unfortunately, I can't see any way around that in the API docs we have available to us. My only real advice is to minimize the checking of status on non-new items.
The Google Reader API hasn't officially been released yet; when it is, this answer may change.
Currently, you would have to call the API and disregard items already downloaded, which, as you said, isn't terribly efficient, since you will be re-downloading items every time even if you already have them.