Distinguishing features of a blog: what sets a blog apart from a normal website
I'm looking for features that distinguish a blog from a normal website. They need to be things a program can identify from a site's HTML, or from particular features the site supports, for example pings. I need the same for news websites.
I'm working on a blog/news monitor program. It will index sites, automatically determine whether each one is a blog or a news site, and then monitor user feedback (comments and so on) on posts from the sites it classifies as blogs or news.
So what I'm really after is suggestions on what I can use or look out for when identifying these sites.
It's going to be a desktop app written in Java, so any Java-specific code would be great.
Thanks in advance.
Comments (2)
You can search the page for the word "blog", as it will probably be present. Specifically, you can look for it in particular parts of the HTML page, or exclude certain parts, such as links. This will give you a decent starting point.
Ultimately, though, this is something that will have to be done manually. You should construct an interface where people can specify, when a site is submitted, whether it's a blog or a news site and which features it has. Then you should create a database of sites and features, and flag entries so that you or another administrator can review them and make changes. Once you do this for a site, you'll never need to do it again; for example, everything under http://*.wordpress.com/ can be treated as a blog.
Some features you can detect automatically, or at least with a good chance of success, but ultimately you will need a manual review.
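As a rough illustration of the two ideas above (a table of manually reviewed hosts, plus a keyword check that ignores link text), here is a minimal Java sketch. The class name `SiteClassifier`, the `Kind` enum, and the method names are all made up for this example, and the regex-based tag stripping is only an approximation; a real app would probably use a proper HTML parser.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper: names and structure are assumptions for illustration.
public class SiteClassifier {
    public enum Kind { BLOG, NEWS, UNKNOWN }

    // Host suffixes that a person has already reviewed, e.g. "wordpress.com" -> BLOG.
    private final Map<String, Kind> reviewed = new HashMap<>();

    // Whole <a>...</a> elements, so link text such as a "Blog" menu item is ignored.
    private static final Pattern ANCHORS =
            Pattern.compile("<a\\b[^>]*>.*?</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Any remaining tag (crude; does not handle scripts or comments specially).
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");
    private static final Pattern BLOG_WORD =
            Pattern.compile("\\bblogs?\\b", Pattern.CASE_INSENSITIVE);

    /** Records the result of a manual review for a host suffix. */
    public void markReviewed(String hostSuffix, Kind kind) {
        reviewed.put(hostSuffix.toLowerCase(Locale.ROOT), kind);
    }

    /** True if the word "blog" appears in the page text outside of links. */
    public static boolean mentionsBlogOutsideLinks(String html) {
        String withoutLinks = ANCHORS.matcher(html).replaceAll(" ");
        String text = TAGS.matcher(withoutLinks).replaceAll(" ");
        return BLOG_WORD.matcher(text).find();
    }

    /** Reviewed hosts win; otherwise fall back to the keyword guess. */
    public Kind classify(String host, String html) {
        String h = host.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, Kind> e : reviewed.entrySet()) {
            if (h.equals(e.getKey()) || h.endsWith("." + e.getKey())) {
                return e.getValue();
            }
        }
        // Only a guess: it should still be flagged for manual review.
        return mentionsBlogOutsideLinks(html) ? Kind.BLOG : Kind.UNKNOWN;
    }
}
```

After calling `markReviewed("wordpress.com", SiteClassifier.Kind.BLOG)`, any host ending in `.wordpress.com` is classified as a blog without looking at the page at all, which matches the "never need to do it again" point.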
Look for a discoverable RSS or Atom feed, which should be present on a blog or a serially updated news site.
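"Discoverable" usually means the page advertises its feed with a `<link rel="alternate">` tag whose `type` is `application/rss+xml` or `application/atom+xml`. A minimal Java sketch of that check follows; the class name `FeedFinder` and its methods are hypothetical, and it uses plain regexes only to stay self-contained (an HTML parser would be more robust against unusual markup).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names are assumptions for illustration.
public class FeedFinder {
    // Matches each <link ...> tag; its attributes are inspected separately.
    private static final Pattern LINK_TAG =
            Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Returns the quoted value of the named attribute in a tag, or null if absent. */
    static String attr(String tag, String name) {
        Pattern p = Pattern.compile(
                "(?<![\\w-])" + name + "\\s*=\\s*(\"([^\"]*)\"|'([^']*)')",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(tag);
        if (!m.find()) {
            return null;
        }
        return m.group(2) != null ? m.group(2) : m.group(3);
    }

    /** Collects the href of every RSS or Atom feed advertised in the HTML. */
    public static List<String> findFeeds(String html) {
        List<String> feeds = new ArrayList<>();
        Matcher m = LINK_TAG.matcher(html);
        while (m.find()) {
            String tag = m.group();
            String rel = attr(tag, "rel");
            String type = attr(tag, "type");
            String href = attr(tag, "href");
            if (rel == null || type == null || href == null) {
                continue;
            }
            boolean alternate = rel.toLowerCase(Locale.ROOT).contains("alternate");
            String t = type.toLowerCase(Locale.ROOT).trim();
            if (alternate
                    && (t.equals("application/rss+xml") || t.equals("application/atom+xml"))) {
                feeds.add(href); // may be relative; resolve against the page URL before fetching
            }
        }
        return feeds;
    }
}
```

A non-empty result from `FeedFinder.findFeeds(html)` is a hint that the site is a blog or a regularly updated news site, not proof, since other kinds of sites publish feeds as well.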