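3.4 Solving our initial problem
We now bring together everything we have learned so far and demonstrate our system on the following new post, which we assign to the variable new_post.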
Disk drive problems. Hi, I have a problem with my hard disk. After 1 year it is working only sporadically now. I tried to format it, but now it doesn't boot any more. Any ideas? Thanks.
As described earlier, we first vectorize this post and then predict its label:
>>> new_post_vec = vectorizer.transform([new_post])
>>> new_post_label = km.predict(new_post_vec)[0]
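The two calls above rely on a vectorizer and a KMeans model that were fitted earlier. As a minimal, self-contained sketch of the same flow, the following uses a hypothetical four-post corpus and scikit-learn's plain TfidfVectorizer instead of the vectorizer built earlier in the chapter; the corpus, the number of clusters, and the random seed are all assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the newsgroup posts.
posts = [
    "my hard disk does not boot after format",
    "hard disk drive fails to boot",
    "how to render 3d graphics with shading",
    "graphics card renders 3d images slowly",
]

# Fit the vocabulary and the clustering on the training posts.
vectorizer = TfidfVectorizer(stop_words="english")
vectorized = vectorizer.fit_transform(posts)
km = KMeans(n_clusters=2, n_init=10, random_state=3)
km.fit(vectorized)

# A new post must be transformed with the already fitted vectorizer,
# so that its vector has the same columns as the training matrix.
new_post = "disk drive problems, the hard disk does not boot"
new_post_vec = vectorizer.transform([new_post])
new_post_label = km.predict(new_post_vec)[0]
print(new_post_vec.shape, new_post_label)
```

Note that transform is used for the new post rather than fit_transform; refitting would change the vocabulary and make the new vector incomparable with the clustered ones.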
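Now that we have the clustering, we do not need to compare new_post_vec against the vectors of all posts. Instead, we can focus only on the posts in the same cluster. Let us fetch their indices in the original dataset.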
>>> similar_indices = (km.labels_==new_post_label).nonzero()[0]
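The comparison in parentheses yields a Boolean array, and nonzero converts that array into a smaller array containing the indices of the True elements.
Using similar_indices, we then simply build a list of posts together with their similarity scores: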
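To make the mask-then-nonzero idiom concrete, here is a small NumPy sketch. The label array is hypothetical and merely stands in for km.labels_.

```python
import numpy as np

# Hypothetical cluster labels for six posts, standing in for km.labels_.
labels = np.array([2, 0, 1, 0, 2, 0])
new_post_label = 0

mask = labels == new_post_label      # Boolean array, one entry per post
similar_indices = mask.nonzero()[0]  # positions where the mask is True

print(similar_indices)  # [1 3 5]
```

nonzero returns a tuple with one index array per dimension, which is why the [0] is needed for a one-dimensional label array.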
>>> similar = []
>>> for i in similar_indices:
...     dist = sp.linalg.norm((new_post_vec - vectorized[i]).toarray())
...     similar.append((dist, dataset.data[i]))
>>> similar = sorted(similar)
>>> print(len(similar))
44
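We find 44 posts in the cluster. To give the user a quick impression of what the similar posts look like, we present the most similar post (show_at_1), the least similar one (show_at_3), and one in between (show_at_2), all of which come from the same cluster: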
>>> show_at_1 = similar[0]
>>> show_at_2 = similar[len(similar) // 2]
>>> show_at_3 = similar[-1]
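The following table shows these posts together with their similarity values. Each value is a Euclidean distance, so a smaller number means a closer match: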
Position | Similarity | Excerpt from post |
1 | 1.018 | BOOT PROBLEM with IDE controller Hi, I've got a Multi I/O card (IDE controller + serial/parallel interface) and two floppy drives (5 1/4, 3 1/2) and a Quantum ProDrive 80AT connected to it. I was able to format the hard disk, but I could not boot from it. I can boot from drive A: (which disk drive does not matter) but if I remove the disk from drive A and press the reset switch, the LED of drive A: continues to glow, and the hard disk is not accessed at all. I guess this must be a problem of either the Multi I/O card or floppy disk drive settings (jumper configuration?) Does someone have any hint what could be the reason for it. […] |
2 | 1.294 | IDE Cable I just bought a new IDE hard drive for my system to go with the one I already had. My problem is this. My system only had a IDE cable for one drive, so I had to buy cable with two drive connectors on it, and consequently have to switch cables. The problem is, the new hard drive's manual refers to matching pin 1 on the cable with both pin 1 on the drive itself and pin 1 on the IDE card. But for the life of me I cannot figure out how to tell which way to plug in the cable to align these. Secondly, the cable has like a connector at two ends and one between them. I figure one end goes in the controller and then the other two go into the drives. Does it matter which I plug into the "master" drive and which into the "Slave"? any help appreciated […] |
3 | 1.375 | Conner CP3204F info please How to change the cluster size Wondering if somebody could tell me if we can change the cluster size of my IDE drive. Normally I can do it with Norton's Calibrat on MFM/RLL drives but dunno if I can on IDE too. […] |
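It is interesting to see how the posts reflect their similarity scores. The first post contains all the salient words of our new post. The second also revolves around hard disks but lacks concepts such as formatting. Finally, the third is only loosely related. Still, all three posts can be said to belong to the same domain as the new post.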
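The ranking and sampling steps used to produce such a table can be sketched on toy data as follows. The dense vectors and post names are hypothetical; in the text the vectors are sparse TF-IDF rows, which is why toarray() is called there before taking the norm.

```python
import numpy as np

# Hypothetical dense vectors for a new post and three posts in its cluster.
new_post_vec = np.array([1.0, 0.0, 1.0])
cluster_vecs = {
    "post a": np.array([1.0, 0.0, 0.0]),
    "post b": np.array([1.0, 0.0, 1.0]),
    "post c": np.array([0.0, 1.0, 0.0]),
}

# Pair every post with its Euclidean distance to the new post and sort,
# so that the closest post comes first.
similar = sorted(
    (float(np.linalg.norm(new_post_vec - vec)), name)
    for name, vec in cluster_vecs.items()
)

show_at_1 = similar[0]                  # closest post
show_at_2 = similar[len(similar) // 2]  # middle of the ranking
show_at_3 = similar[-1]                 # farthest post
print(show_at_1, show_at_2, show_at_3)
```

Integer division (//) keeps the middle index an integer under both Python 2 and Python 3, whereas a plain / would produce a float index and fail under Python 3.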
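Another look at noise
We should not expect a perfect clustering, in the sense that posts from the same newsgroup (for example, comp.graphics) would always end up in the same cluster. One example gives a quick impression of the noise we have to deal with: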
>>> post_group = zip(dataset.data, dataset.target)
>>> # Sort by post length so that z[5] and z[6] are the two short posts shown here
>>> z = sorted([(len(post[0]), post[0], dataset.target_names[post[1]]) for post in post_group])
>>> print(z[5:7])
[(107, 'From: "kwansik kim" <kkim@cs.indiana.edu>\nSubject: Where is FAQ ?\n\nWhere can I find it ?\n\nThanks, Kwansik\n\n', 'comp.graphics'),
 (110, 'From: lioness@maple.circa.ufl.edu\nSubject: What is 3dO?\n\n\nSomeone please fill me in on what 3do.\n\nThanks,\n\nBH\n', 'comp.graphics')]
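For both of these posts, judging only by the words that are left after the preprocessing step, there is no real indication that they belong to comp.graphics: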
>>> analyzer = vectorizer.build_analyzer()
>>> list(analyzer(z[5][1]))
[u'kwansik', u'kim', u'kkim', u'cs', u'indiana', u'edu', u'subject', u'faq', u'thank', u'kwansik']
>>> list(analyzer(z[6][1]))
[u'lioness', u'mapl', u'circa', u'ufl', u'edu', u'subject', u'3do', u'3do', u'thank', u'bh']
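This is only after tokenization, lowercasing, and stop word removal (the analyzer output also reflects stemming, as in 'mapl' and 'thank'). If we also remove the words that will later be filtered out via min_df and max_df, which happens in fit_transform, it gets even worse: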
>>> list(set(analyzer(z[5][1])).intersection(vectorizer.get_feature_names()))
[u'cs', u'faq', u'thank']
>>> list(set(analyzer(z[6][1])).intersection(vectorizer.get_feature_names()))
[u'bh', u'thank']
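To illustrate why so few words survive, the following sketch mimics document-frequency filtering in plain Python. It assumes min_df is an absolute document count and max_df a fraction of all posts, which is how scikit-learn interprets integer and float values respectively; the function name surviving_terms and the tokenized toy posts are hypothetical.

```python
from collections import Counter

def surviving_terms(tokenized_posts, min_df=2, max_df=0.5):
    """Keep terms whose document frequency lies within the given bounds."""
    n_posts = len(tokenized_posts)
    # Count in how many posts each term occurs (not how often overall).
    df = Counter(term for post in tokenized_posts for term in set(post))
    return {term for term, count in df.items()
            if min_df <= count <= max_df * n_posts}

# Hypothetical posts, already tokenized and stemmed.
posts = [
    ["thank", "faq", "kwansik"],
    ["thank", "3do", "bh"],
    ["thank", "faq", "disk"],
    ["thank", "disk", "boot"],
]

vocabulary = surviving_terms(posts, min_df=2, max_df=0.5)
kept = sorted(set(posts[0]) & vocabulary)
print(kept)  # ['faq']
```

In this toy corpus, thank is dropped for being too common and kwansik for being too rare, so the first post is left with a single feature.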
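Furthermore, most of these words also occur frequently in other posts, as we can check with the IDF values. Remember that the higher the TF-IDF value, the more discriminative a term is for a given post. And since IDF is a multiplicative factor here, a low value signals that the term is of little value in general.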
>>> for term in ['cs', 'faq', 'thank', 'bh', 'thank']:
...     print('IDF(%s)=%.2f' % (term, vectorizer._tfidf.idf_[vectorizer.vocabulary_[term]]))
IDF(cs)=3.23
IDF(faq)=4.17
IDF(thank)=2.23
IDF(bh)=6.57
IDF(thank)=2.23
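So, except for bh, whose value is close to the overall maximum IDF of 6.74, these terms do not have much discriminative power. It is therefore understandable that posts from different newsgroups end up clustered together.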
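For our goal, however, this does not matter much, because we are only interested in cutting down the number of posts that a new post has to be compared with. After all, the particular newsgroup a training post came from is of no special interest to us.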
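As a rough illustration of how an IDF value relates to document frequency, the sketch below uses the add-one smoothed formula that scikit-learn's TfidfTransformer applies by default (smooth_idf=True). Whether exactly this variant produced the values quoted above is an assumption, and the corpus size and document frequencies are hypothetical.

```python
import math

def smoothed_idf(n_posts, doc_freq):
    """IDF with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_posts) / (1 + doc_freq)) + 1

n_posts = 1000  # hypothetical corpus size
for doc_freq in (10, 100, 1000):
    # Rare terms (small doc_freq) get a high IDF, common ones a low IDF.
    print("df=%4d  idf=%.2f" % (doc_freq, smoothed_idf(n_posts, doc_freq)))
```

Under this formula, a term that appears in every post gets the minimum IDF of 1, and the maximum is reached by terms that occur in exactly min_df posts.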