Algorithm: determining the type of homepage?
I've been thinking about this for a while now, so I thought I would ask for suggestions:
I have a crawler which enters the root of some site (could be anything from www.StackOverFlow.com, www.SomeDudesPersonalSite.se or even www.Facebook.com). Then I need to determine what "kind of homepage" I'm visiting. Different types could for instance be:
- Forum
- Blog
- Link catalog
- Social media site
- News site
- "One man site"
I've been brainstorming for a while, and the best solution seems to be some heuristic with a point system. By this I mean that different trends give points to the different types, and the program then makes a guess based on the totals.
But this is where I get stuck. How do you detect trends?
- Catalogs could be easy: If sitesIndexed/Outgoing links is very high, catalogs should get several points.
- News sites/Blogs could be easy: If a large share of the sites indexed have a datetime, those types should get several points.
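The point system described above could be sketched roughly like this. It is only an illustration under stated assumptions: the statistic names (`pages_indexed`, `outgoing_links`, `pages_with_datetime`), the thresholds, and the point values are all hypothetical, and the catalog rule is read here as "many outgoing links per indexed page", which is one possible interpretation of the ratio mentioned above.

```python
from collections import defaultdict

# Each rule is (predicate over crawl statistics, site types to credit, points).
# All names, thresholds and point values are made-up placeholders to be tuned.
RULES = [
    # Assumption: a catalog has many outgoing links per indexed page.
    (lambda s: s["outgoing_links"] / max(s["pages_indexed"], 1) > 20,
     ["link catalog"], 3),
    # Assumption: news sites and blogs put a datetime on most of their pages.
    (lambda s: s["pages_with_datetime"] / max(s["pages_indexed"], 1) > 0.5,
     ["news site", "blog"], 2),
]


def guess_site_type(stats):
    """Add up the points from every matching rule and return the best guess."""
    scores = defaultdict(int)
    for predicate, site_types, points in RULES:
        if predicate(stats):
            for site_type in site_types:
                scores[site_type] += points
    if not scores:
        return "unknown", {}
    best = max(scores, key=scores.get)
    return best, dict(scores)


if __name__ == "__main__":
    stats = {"pages_indexed": 10, "outgoing_links": 500, "pages_with_datetime": 1}
    print(guess_site_type(stats))
```

Every new trend then becomes one more entry in `RULES`, so the guessing logic itself does not need to change.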
But I can't really find many more trends.
So my question is: any ideas on how to do this?
Thanks so much.
Comments (2)
I believe you are attempting document classification, which is a well-researched topic.
http://en.wikipedia.org/wiki/Document_classification
You will see a long list of different methods there. But suggesting any one of them (or neural networks or the like) before you have determined the "trends," as you call them, would be premature. I would recommend looking into "web document classification" or similar terms. It is evidently a sizable subfield of document classification, and if you have access to academic journals there are plenty of incomprehensible articles for your enjoyment.
I also found your idea posed as a homework assignment. If you are particularly audacious, you could contact the professor.
http://uhaweb.hartford.edu/compsci/ccli/wdc.htm
Lastly, I believe that this is an accessible (if strangely formatted) website that has a general and perhaps outdated discussion:
http://www.webology.ir/2008/v5n1/a52.html
I'm afraid I don't have much personal knowledge of the topic, so the most I can do is give you the keyword "document classification" and some quick googling. However, if I wanted to play around with this concept, I think simply looking at the rate of certain keywords is a decent starting "trend" ("sale," "purchase," or "customers" for shopping sites; "my," "opinion," or "comment" for blogs; and so on).
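To make the keyword-rate idea concrete, here is a minimal sketch. The category names, the keyword sets beyond the words quoted above, and the function name `keyword_rates` are my own placeholders, and the tokenizer is deliberately naive (lowercase ASCII words only).

```python
import re
from collections import Counter

# Hypothetical keyword lists; only the words themselves come from the
# examples above. A real list would be much longer.
KEYWORDS = {
    "shopping": {"sale", "purchase", "customers"},
    "blog": {"my", "opinion", "comment"},
}


def keyword_rates(text):
    """Return, per category, the fraction of words that are category keywords."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {category: 0.0 for category in KEYWORDS}
    counts = Counter(words)
    return {
        category: sum(counts[word] for word in keywords) / total
        for category, keywords in KEYWORDS.items()
    }


if __name__ == "__main__":
    print(keyword_rates("Big sale today: purchase now, customers love it"))
```

The resulting rates could be fed into a point system as one "trend" each, or used directly as features for a classifier.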
You could train a neural network to recognise them. Give it number/types of links, maybe types of HTML tags as well.
I think otherwise you're just going to be second-guessing what makes a site what it is.
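As a rough idea of what the network's inputs might look like, the sketch below turns one page into a numeric feature vector of link counts and tag counts using only the Python standard library. The class and function names, the choice of tags, and the internal/external split as the "types of links" are assumptions for illustration; training the network itself is not shown.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class FeatureParser(HTMLParser):
    """Counts start tags and classifies http(s) links as internal or external."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.base_host = urlparse(base_url).netloc
        self.tag_counts = Counter()
        self.internal_links = 0
        self.external_links = 0

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        target = urlparse(urljoin(self.base_url, href))
        if target.scheme not in ("http", "https"):
            return  # skip mailto:, javascript: and similar links
        if target.netloc == self.base_host:
            self.internal_links += 1
        else:
            self.external_links += 1


def extract_features(html, base_url, tags=("a", "p", "form", "img", "time")):
    """Return [internal links, external links, count of each tag in `tags`]."""
    parser = FeatureParser(base_url)
    parser.feed(html)
    parser.close()
    return [parser.internal_links, parser.external_links] + [
        parser.tag_counts[tag] for tag in tags
    ]


if __name__ == "__main__":
    page = '<p>Hi</p><a href="/about">About</a><a href="http://other.org/x">Other</a>'
    print(extract_features(page, "http://example.com/"))
```

Vectors like this, labelled by hand for a set of known sites, would be the training data.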