3.4 Solving our initial problem

Published 2024-01-30 22:34:09

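Let's now put together everything we have learned so far and demonstrate our system on the following new post, which we assign to the variable new_post: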

Disk drive problems. Hi, I have a problem with my hard disk.
After 1 year it is working only sporadically now.
I tried to format it, but now it doesn't boot any more.
Any ideas? Thanks.

As described earlier, we first have to vectorize this post before we predict its label:

>>> new_post_vec = vectorizer.transform([new_post])
>>> new_post_label = km.predict(new_post_vec)[0]

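Now that we have the clustering, we do not need to compare new_post_vec with the vectors of all posts. Instead, we can focus only on the posts in the same cluster. Let's fetch their indices in the original dataset: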

>>> similar_indices = (km.labels_==new_post_label).nonzero()[0]

The comparison in the parentheses yields a Boolean array, and nonzero converts that array into a smaller array containing the indices of its True elements.
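To see what the comparison and nonzero do on their own, here is a minimal sketch on a made-up label array. The values in labels are invented for illustration and only stand in for km.labels_:

```python
import numpy as np

# Hypothetical cluster labels for six posts (stand-in for km.labels_)
labels = np.array([2, 0, 2, 1, 0, 2])
new_post_label = 2

# The element-wise comparison gives one Boolean per post
mask = (labels == new_post_label)

# nonzero() returns a tuple with one index array per dimension;
# [0] selects the indices along the first (and only) axis
similar_indices = mask.nonzero()[0]

print(mask.tolist())             # [True, False, True, False, False, True]
print(similar_indices.tolist())  # [0, 2, 5]
```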

Using similar_indices, we then simply build a list of posts together with their similarity scores:

>>> similar = []
>>> for i in similar_indices:
...     dist = sp.linalg.norm((new_post_vec - vectorized[i]).toarray())
...     similar.append((dist, dataset.data[i]))
>>> similar = sorted(similar)
>>> print(len(similar))
44

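We found 44 posts in the cluster. To give the user a quick idea of what kind of similar posts are available, we present the most similar post (show_at_1), the least similar one (show_at_3), and one in between (show_at_2), all of which come from the same cluster: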

>>> show_at_1 = similar[0]
>>> show_at_2 = similar[len(similar) // 2]
>>> show_at_3 = similar[-1]
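The ranking and the three picks can be tried out in isolation. The sketch below is only an illustration: it assumes small dense NumPy vectors and made-up post names in place of the sparse TF-IDF vectors and dataset.data, and it uses integer division (//) for the middle index so that the index stays an integer under Python 3.

```python
import numpy as np

# Hypothetical dense stand-ins for the TF-IDF vectors of four posts
vectorized = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
])
posts = ["post a", "post b", "post c", "post d"]  # made-up names
labels = np.array([0, 1, 0, 0])                   # stand-in for km.labels_
new_post_vec = np.array([0.8, 0.6, 0.0])
new_post_label = 0

similar = []
for i in (labels == new_post_label).nonzero()[0]:
    # Euclidean distance: the smaller the value, the more similar the post
    dist = np.linalg.norm(new_post_vec - vectorized[i])
    similar.append((float(dist), posts[i]))
similar.sort()

show_at_1 = similar[0]                  # most similar
show_at_2 = similar[len(similar) // 2]  # middle of the ranking
show_at_3 = similar[-1]                 # least similar

print([post for _, post in similar])    # ['post c', 'post a', 'post d']
```

Note that "post b" never enters the ranking because it sits in a different cluster, which is exactly the saving the clustering is meant to provide.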

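The following table shows these posts together with their similarity values. The values are Euclidean distances, so a smaller value means a more similar post: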

Position	Similarity	Excerpt from post
1	1.018	BOOT PROBLEM with IDE controller Hi, I've got a Multi I/O card (IDE controller + serial/parallel interface) and two floppy drives (5 1/4, 3 1/2) and a Quantum ProDrive 80AT connected to it. I was able to format the hard disk, but I could not boot from it. I can boot from drive A: (which disk drive does not matter) but if I remove the disk from drive A and press the reset switch, the LED of drive A: continues to glow, and the hard disk is not accessed at all. I guess this must be a problem of either the Multi I/O card or floppy disk drive settings (jumper configuration?) Does someone have any hint what could be the reason for it. […]
2	1.294	IDE Cable I just bought a new IDE hard drive for my system to go with the one I already had. My problem is this. My system only had a IDE cable for one drive, so I had to buy cable with two drive connectors on it, and consequently have to switch cables. The problem is, the new hard drive's manual refers to matching pin 1 on the cable with both pin 1 on the drive itself and pin 1 on the IDE card. But for the life of me I cannot figure out how to tell which way to plug in the cable to align these. Secondly, the cable has like a connector at two ends and one between them. I figure one end goes in the controller and then the other two go into the drives. Does it matter which I plug into the "master" drive and which into the "Slave"? any help appreciated […]
3	1.375	Conner CP3204F info please How to change the cluster size Wondering if somebody could tell me if we can change the cluster size of my IDE drive. Normally I can do it with Norton's Calibrat on MFM/RLL drives but dunno if I can on IDE too. […]

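It is interesting to see how the posts reflect the similarity score. The first post contains all the salient words of our new post. The second one also revolves around hard disks but lacks concepts such as formatting. Finally, the third one is only slightly related. Still, we can say that all three posts belong to the same domain as the new post.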

Another look at noise

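We should not expect a perfect clustering, in the sense that posts from the same newsgroup (for example, comp.graphics) all end up in the same cluster. One example gives a quick impression of the noise we have to deal with: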

>>> post_group = zip(dataset.data, dataset.target)
>>> # Sort (length, text, newsgroup) tuples so the shortest posts come first
>>> z = sorted((len(post[0]), post[0], dataset.target_names[post[1]])
...            for post in post_group)
>>> print(z[5:7])
[(107, 'From: "kwansik kim" <kkim@cs.indiana.edu>\nSubject: Where is FAQ ?\n\nWhere can I find it ?\n\nThanks, Kwansik\n\n', 'comp.graphics'),
 (110, 'From: lioness@maple.circa.ufl.edu\nSubject: What is 3dO?\n\n\nSomeone please fill me in on what 3do.\n\nThanks,\n\nBH\n', 'comp.graphics')]

For both of these posts, if we consider only the words that survive the preprocessing step, there is no real indication that they belong to comp.graphics:

>>> analyzer = vectorizer.build_analyzer()
>>> list(analyzer(z[5][1]))
[u'kwansik', u'kim', u'kkim', u'cs', u'indiana', u'edu', u'subject',
u'faq', u'thank', u'kwansik']
>>> list(analyzer(z[6][1]))
[u'lioness', u'mapl', u'circa', u'ufl', u'edu', u'subject', u'3do',
u'3do', u'thank', u'bh']

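This is only after steps such as tokenization, lowercasing, and stop word removal. If we also remove the words that min_df and max_df will filter out (which fit_transform does later), it gets even worse: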

>>> list(set(analyzer(z[5][1])).intersection(
...     vectorizer.get_feature_names()))
[u'cs', u'faq', u'thank']
>>> list(set(analyzer(z[6][1])).intersection(
...     vectorizer.get_feature_names()))
[u'bh', u'thank']
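To make the effect of min_df and max_df more tangible, the following pure-Python sketch filters a tiny, made-up corpus by document frequency. The corpus and both thresholds are invented for illustration. The sketch assumes the convention that an integer min_df is an absolute document count and a float max_df is a proportion of documents; it is not the vectorizer's actual implementation.

```python
# Hypothetical, already tokenized toy corpus
docs = [
    ["thank", "faq", "cs"],
    ["thank", "3do", "bh"],
    ["thank", "faq", "graphics"],
    ["thank", "graphics", "render"],
]
min_df = 2      # keep terms that occur in at least 2 documents
max_df = 0.75   # drop terms that occur in more than 75% of the documents

# Document frequency: in how many documents does each term occur?
df = {}
for doc in docs:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

n_docs = len(docs)
vocabulary = {term for term, count in df.items()
              if count >= min_df and count <= max_df * n_docs}

# Keep only the surviving tokens, preserving their order in each post
survivors = [[t for t in doc if t in vocabulary] for doc in docs]

print(sorted(vocabulary))  # ['faq', 'graphics']
print(survivors)           # [['faq'], [], ['faq', 'graphics'], ['graphics']]
```

In this toy corpus the second post ends up with no tokens at all, which mirrors how little is left of the two short posts above.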

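Furthermore, most of these words also occur frequently in other posts, which we can check via their IDF scores. Remember that the higher the TF-IDF value, the more discriminative a term is for a given post. And because IDF is a multiplicative factor here, a low value signals that the term is not of great value in general.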

>>> for term in ['cs', 'faq', 'thank', 'bh', 'thank']:
...     print('IDF(%s)=%.2f' % (term,
...         vectorizer._tfidf.idf_[vectorizer.vocabulary_[term]]))
IDF(cs)=3.23
IDF(faq)=4.17
IDF(thank)=2.23
IDF(bh)=6.57
IDF(thank)=2.23

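So, except for bh, which is close to the maximum IDF value of 6.74, these terms do not have much discriminative power. As a result, posts from different newsgroups will end up clustered together.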
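Why frequent terms get a low IDF can be checked with a few lines of arithmetic. The sketch below assumes the smoothed form ln((1 + n) / (1 + df)) + 1, which is the default in scikit-learn's TfidfTransformer (smooth_idf=True); whether the vectorizer used in this chapter applies exactly this variant is an assumption. The corpus size and document frequencies are made up.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0

n_docs = 1000  # hypothetical corpus size
for doc_freq in (10, 100, 500, 1000):
    # The more documents contain a term, the lower its IDF
    print('df=%4d  IDF=%.2f' % (doc_freq, smoothed_idf(n_docs, doc_freq)))
```

A term that occurs in every document bottoms out at an IDF of exactly 1, while rare terms score much higher, which matches the gap between thank and bh above.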

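For our goal, however, this is not a big problem, because we are only interested in cutting down the number of posts that a new post has to be compared with. After all, the particular newsgroup that a post in the training data came from is of no special interest to us.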
