Aggregating from various sources
It could be a project well beyond my skills right now, but I've got about a full month to spend on it, so I think I can do it. What I want to build is this: gather news about a specific subject from various sources. Easy, right? Just get the RSS feeds and display them on a page. Well, I want something more advanced: duplicates removed and customized presentation (that is, being able to define or change the format in which the news headlines are displayed).
I've played a bit with Yahoo Pipes and some other tools, and I am facing two big problems:
- Some sources don't provide RSS feeds. How do I create one?
- What's the best method to find and remove duplicates? I thought about comparing the headlines and checking whether they match by more than, say, 50%. Is that good practice, though?
Please add any other things (problems, suggestions, whatever) I might not have considered.
Comments (2)
Duplication is a nasty issue. What I eventually ended up doing:
Here's why you leave the link in:
A comment might be as simple as "Yes, this sucks". "Yes, this sucks" could be a common comment. BUT if the text "this sucks" is linked to different things, then it is not a duplicate comment.
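To make the point about links concrete, here is a minimal sketch of a link-aware fingerprint. It is only one way to do it, not necessarily what I ran: the `fingerprint` function, the `[href]` marker format, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import re
from html.parser import HTMLParser


class _TextAndLinks(HTMLParser):
    """Collects visible text and, for <a> tags, the href target."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Every tag is dropped except that the link target is kept.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.parts.append(f"[{href}]")

    def handle_data(self, data):
        # Lowercase the text only; URL paths can be case-sensitive.
        self.parts.append(data.lower())


def fingerprint(html_fragment: str) -> str:
    """Hash of the normalized text plus link targets (hypothetical helper)."""
    parser = _TextAndLinks()
    parser.feed(html_fragment)
    parser.close()
    text = re.sub(r"\s+", " ", " ".join(parser.parts)).strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


same_text_a = fingerprint('Yes, <a href="http://example.com/a">this sucks</a>')
same_text_b = fingerprint('Yes, <a href="http://example.com/b">this sucks</a>')
print(same_text_a == same_text_b)  # the link targets differ, so the hashes differ
```

Two comments with identical wording but different link targets get different fingerprints, while markup-only differences (extra whitespace, a wrapping paragraph tag) do not change the result.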
Additionally, you will find that HTML tag escaping is weird with RSS feeds. You would think that a stray < would be double-encoded, (I think) as &amp;lt;
But it is not. It is encoded as &lt;
But so too are HTML tags, for example: &lt;p&gt;
I eventually copied all the known HTML tags as parsed by Mozilla Firefox and manually recognized those tags.
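A rough sketch of that idea, assuming Python's standard `html.unescape` for the decoding step. The `KNOWN_TAGS` set here is a tiny illustrative subset (not the full Firefox list), and `strip_known_tags` is a hypothetical helper name.

```python
import html
import re

# Illustrative subset only; the real list would be far longer.
KNOWN_TAGS = {"a", "b", "br", "div", "em", "i", "img", "li",
              "ol", "p", "span", "strong", "ul"}

# Something that looks like an opening or closing tag: <name ...> or </name>.
_TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)\b[^<>]*>")


def strip_known_tags(escaped: str) -> str:
    """Decode entities once, then drop only tags whose name is recognized."""
    decoded = html.unescape(escaped)

    def _replace(match):
        name = match.group(1).lower()
        # Unknown "tags" are left alone; they may just be text with < and >.
        return "" if name in KNOWN_TAGS else match.group(0)

    return _TAG_RE.sub(_replace, decoded)


print(strip_known_tags("&lt;p&gt;1 &lt; 2 and a &lt;b&gt;bold&lt;/b&gt; claim&lt;/p&gt;"))
```

After decoding, a stray `<` and a real tag look the same, so the tag-name whitelist is what tells them apart: `< 2` survives because it is not followed by a known tag name.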
Creating an RSS feed from HTML is quite nasty, and I can only point you to services such as Spinn3r, which are fantastic at de-duplication and content extraction. These services typically use probability-based algorithms that are beyond me. I know of one provider that got away with regexing pages (they had to know that a certain page was MySpace-based or Blogger-based), but they did not perform admirably.
You might want to try to use the YQL module to scrape a webpage that doesn't provide RSS. Here's a sample of a YQL statement to scrape HTML.
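As far as I know, Yahoo has since shut down YQL, so here is a hedged standard-library sketch of the same idea: scrape headline links from a page and emit a minimal RSS 2.0 document. The page layout (headlines as links inside `<h2>` elements), the class name `HeadlineLinks`, and the function `html_to_rss` are all assumptions; a real site will need its own selectors.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin


class HeadlineLinks(HTMLParser):
    """Collects (title, href) for each <a> inside an <h2> (assumed layout)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_h2 = False
        self.href = None
        self.text = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
        elif tag == "a" and self.in_h2:
            self.href = dict(attrs).get("href")
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            title = " ".join("".join(self.text).split())
            if title:
                self.items.append((title, self.href))
            self.href = None
        elif tag == "h2":
            self.in_h2 = False


def html_to_rss(page_html, page_url, feed_title):
    """Build an RSS 2.0 string from the headline links found in page_html."""
    parser = HeadlineLinks()
    parser.feed(page_html)
    parser.close()

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = page_url
    ET.SubElement(channel, "description").text = f"Scraped from {page_url}"
    for title, href in parser.items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        # Relative links are resolved against the page URL.
        ET.SubElement(item, "link").text = urljoin(page_url, href)
    return ET.tostring(rss, encoding="unicode")
```

Fetching the page (for example with `urllib.request.urlopen`) is left out on purpose, so the sketch works on any HTML string you already have.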
About duplicates, take a look at this pipe.
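If you would rather do it in code, the headline-comparison idea from the question can be sketched with Python's `difflib`. The 0.5 threshold is simply the asker's 50% figure, not a recommended value, and `dedupe_headlines` is a hypothetical name.

```python
from difflib import SequenceMatcher


def _normalize(headline):
    # Lowercase and collapse whitespace so trivial differences don't count.
    return " ".join(headline.lower().split())


def dedupe_headlines(headlines, threshold=0.5):
    """Keep the first of any group of headlines whose similarity exceeds threshold."""
    kept = []
    for headline in headlines:
        norm = _normalize(headline)
        is_duplicate = any(
            SequenceMatcher(None, norm, _normalize(other)).ratio() > threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(headline)
    return kept
```

This compares every headline against every kept one, so it is quadratic in the number of headlines; that is fine for a few hundred items per run. A threshold as low as 0.5 will likely merge unrelated headlines that share common words, so expect to tune it upward.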
Customized presentation: if you want it truly customized, you'll have to manipulate the pipe results yourself, e.g. get them as JSON and manipulate them with JavaScript, or process them server-side.
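For the server-side route, a minimal sketch could look like this. It assumes the JSON is a list of objects with `title`, `link`, and `source` keys, which is a made-up shape; adjust the keys to whatever your feed output actually contains. `render_headlines` is a hypothetical helper.

```python
import json
from string import Template


def render_headlines(json_text, line_format="$title ($source) - $link"):
    """Render each item with a user-defined format string."""
    items = json.loads(json_text)
    template = Template(line_format)
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    return [template.safe_substitute(item) for item in items]


sample = json.dumps([
    {"title": "Hello", "link": "http://example.com/1", "source": "Example"},
])
for line in render_headlines(sample, "* $title <$link>"):
    print(line)
```

Because the format is just a string, it can be stored per user or per page and changed without touching the code.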