Aggregating from various sources

Posted on 2024-09-18 23:46:11

This may be a project well beyond my current skills, but I have about a full month to spend on it, so I think I can do it. What I want to build is this: gather news about a specific subject from various sources. Easy, right? Just get the RSS feeds and display them on a page. Well, I want something more advanced: duplicates removed and a customized presentation (that is, being able to define or change the format in which the news headlines are displayed).

I've played a bit with Yahoo Pipes and some other tools, and I am facing two big problems:

  1. Some sources don't provide RSS feeds. How do I create one?
  2. What's the best method to find and remove duplicates? I thought about comparing the headlines and checking whether the match is greater than, say, 50%. Is that good practice, though?

Please add any other things (problems, suggestions, whatever) I might not have considered.

Comments (2)

信愁 2024-09-25 23:46:11

Duplication is a nasty issue. What I eventually ended up doing:

  1. Strip out all HTML tags except for links (I started out using regex and got burned, so I eventually moved to custom parsing to remove tags)
  2. Strip out all whitespace
  3. Normalize case (make the comparison case-insensitive)
  4. Hash all of that with MD5

Here's why you leave the link in:
A comment might be as simple as "Yes, this sucks". "Yes, this sucks" could be a common comment. BUT if the text "this sucks" is linked to different things, then it is not a duplicate comment.
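
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code I actually used: the names `fingerprint` and `_LinkKeepingStripper` are made up for the example, and it leans on Python's standard `html.parser` instead of a hand-written parser. Note that lowercasing also lowercases the link target, which is acceptable for a fingerprint but not for a URL you intend to follow.

```python
import hashlib
from html.parser import HTMLParser


class _LinkKeepingStripper(HTMLParser):
    """Hypothetical helper: keeps text and <a href=...> markers, drops all other tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            href = dict(attrs).get("href") or ""
            self.parts.append(f"<a href={href}>")

    def handle_endtag(self, tag):
        if tag == "a":
            self.parts.append("</a>")

    def handle_data(self, data):
        self.parts.append(data)


def fingerprint(html_text: str) -> str:
    """Return an MD5 hex digest of the text with tags (except links) removed."""
    parser = _LinkKeepingStripper()
    parser.feed(html_text)   # step 1: drop every tag except links
    parser.close()
    kept = "".join(parser.parts)
    normalized = "".join(kept.split()).lower()  # steps 2 and 3
    # Step 4: MD5 is used here only as a fingerprint, not for security.
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()
```

Two items whose fingerprints are equal are treated as duplicates. The same words linked to a different URL produce a different fingerprint, which is the point of keeping the link.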

Additionally, you will find that HTML tag escaping is weird with RSS feeds. You would think that a stray < would be double-encoded as (I think) &amp;lt;
But it is not. It is encoded as &lt;
But so too are HTML tags! For example:

&lt;p&gt;
I eventually copied all the known HTML tags as parsed by Mozilla Firefox and manually recognized those tags.
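
To illustrate the idea of recognizing only known tags, here is a small sketch. The tag set below is a tiny illustrative subset, not the Firefox list, and the function name `strip_known_tags` is hypothetical. It uses a short regex for brevity, which is exactly the kind of shortcut that burned me on real feeds, so treat it as a demonstration of the ambiguity and not as a robust parser.

```python
import html
import re

# Illustrative subset only; a real list would be much longer.
KNOWN_TAGS = {"a", "b", "br", "div", "em", "i", "img", "li", "ol", "p", "span", "strong", "ul"}

# Matches things that look like an opening or closing tag, e.g. <p>, </b>, <a href="...">.
_TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)\b[^<>]*>")


def strip_known_tags(escaped: str) -> str:
    """Decode entities once, then remove only tags whose name is in KNOWN_TAGS."""
    decoded = html.unescape(escaped)

    def drop_if_known(match):
        # Unknown "tags" are left alone: they are probably literal text.
        return "" if match.group(1).lower() in KNOWN_TAGS else match.group(0)

    return _TAG_RE.sub(drop_if_known, decoded)
```

After decoding, a stray < and a real tag look the same, so the only signal left is whether the name after the < is a tag you know about.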

Creating an RSS feed from HTML is quite nasty, and I can only point you to services such as Spinn3r, which are fantastic at de-duplication and content extraction. These services typically use probability-based algorithms that are beyond me. I know of one provider that got away with regexing pages (they had to know that a certain page was MySpace-based or Blogger-based), but it did not perform admirably.

浪推晚风 2024-09-25 23:46:11

You might want to try to use the YQL module to scrape a webpage that doesn't provide RSS. Here's a sample of a YQL statement to scrape HTML.

About duplicates, take a look at this pipe.

Customized presentation: if you want it truly customized, you'll have to manipulate the pipe results yourself, e.g. get them as JSON and manipulate them with JavaScript, or process them server-side.
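
As a rough server-side sketch of that last option: assuming the results arrive as JSON with an `items` list whose entries have `title`, `link`, and `source` fields (this shape and the function name `render_headlines` are assumptions for illustration, not the actual pipe output format), the headline format can be driven by a template string you change at will.

```python
import json


def render_headlines(feed_json: str, template: str) -> list[str]:
    """Format each item of an assumed {"items": [...]} JSON document with a template."""
    items = json.loads(feed_json)["items"]
    # Each item is a dict; its keys fill the {placeholders} in the template.
    return [template.format(**item) for item in items]
```

For example, the template `"{source}: {title} ({link})"` puts the source first, and swapping it for `"{title} [{source}]"` changes the presentation without touching the code.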
