返回介绍

建立数据

发布于 2025-01-01 12:38:39 字数 3217 浏览 0 评论 0 收藏 0

Scikit Learn 附带了许多内置数据集,以及加载工具来加载多个标准外部数据集。 这是一个很好的资源,数据集包括波士顿房价,人脸图像,森林斑块,糖尿病,乳腺癌等。 我们将使用新闻组数据集。

新闻组是 Usenet 上的讨论组,它在网络真正起飞之前的 80 年代和 90 年代很流行。 该数据集包括 18,000 个新闻组帖子,带有 20 个主题。

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=remove)

newsgroups_train.filenames.shape, newsgroups_train.target.shape

# ((2034,), (2034,))

我们来看看一些数据。 你能猜出这些消息属于哪个类别?

print("\n".join(newsgroups_train.data[:3]))

'''
Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych


Seems to be, barring evidence to the contrary, that Koresh was simply
another deranged fanatic who thought it neccessary to take a whole bunch of
folks with him, children and all, to satisfy his delusional mania. Jim
Jones, circa 1993.


Nope - fruitcakes like Koresh have been demonstrating such evil corruption
for centuries.

 >In article <1993Apr19.020359.26996@sq.sq.com>, msb@sq.sq.com (Mark Brader) 

MB>                                                             So the
MB> 1970 figure seems unlikely to actually be anything but a perijove.

JG>Sorry, _perijoves_...I'm not used to talking this language.

Couldn't we just say periapsis or apoapsis?
'''

提示:perijove 的定义是离木星中心最近的木星卫星轨道上的点。

np.array(newsgroups_train.target_names)[newsgroups_train.target[:3]]

'''
array(['comp.graphics', 'talk.religion.misc', 'sci.space'], 
      dtype='<U18')
'''

target 属性是类别的整数索引。

newsgroups_train.target[:10]

# array([1, 3, 2, 0, 2, 0, 2, 1, 2, 1])

num_topics, num_top_words = 6, 8

接下来,scikit learn 有一个方法可以为我们提取所有字数。

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vectorizer = CountVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(newsgroups_train.data).todense() # (documents, vocab)
vectors.shape #, vectors.nnz / vectors.shape[0], row_means.shape

# (2034, 26576)

print(len(newsgroups_train.data), vectors.shape)

# 2034 (2034, 26576)

vocab = np.array(vectorizer.get_feature_names())

vocab.shape

# (26576,)

vocab[7000:7020]

'''
array(['cosmonauts', 'cosmos', 'cosponsored', 'cost', 'costa', 'costar',
       'costing', 'costly', 'costruction', 'costs', 'cosy', 'cote',
       'couched', 'couldn', 'council', 'councils', 'counsel', 'counselees',
       'counselor', 'count'],
      dtype='<U80')
'''

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。
列表为空,暂无数据
    我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
    原文