How to include words as numerical features in classification
What's the best method to use the words themselves as features in any machine learning algorithm?

The problem: I have to extract word-related features from a particular paragraph. Should I use the index in the dictionary as the numerical feature? If so, how would I normalize these?

In general, how are words themselves used as features in NLP?
3 Answers
There are several conventional techniques by which words are mapped to features (columns in a 2D data matrix in which the rows are the individual data vectors) for input to machine learning models for classification (a minimal sketch of these encodings follows the list):

- a Boolean field which encodes the presence or absence of that word in a given document;
- a frequency histogram of a predetermined set of words, often the X most commonly occurring words from among all documents comprising the training data (more about this one in the last paragraph of this answer);
- the juxtaposition of two or more words (e.g., 'alternative' and 'lifestyle' in consecutive order have a meaning not related to either component word); this juxtaposition can either be captured in the data model itself, e.g., a Boolean feature that represents the presence or absence of two particular words directly adjacent to one another in a document, or this relationship can be exploited in the ML technique, as a naive Bayesian classifier would do in this instance;
- words as raw data from which to extract latent features, e.g., LSA, or Latent Semantic Analysis (also sometimes called LSI, for Latent Semantic Indexing). LSA is a matrix decomposition-based technique which derives latent variables from the text that are not apparent from the words of the text itself.
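A minimal sketch of these four encodings, assuming scikit-learn; the two-document corpus and the parameter choices (X = 5 most common words, one latent component) are illustrative assumptions, not part of the original answer:

```python
# Sketch of the four encodings above using scikit-learn.
# The two-document corpus is purely illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "an alternative lifestyle is not an alternative",
    "the alternative was a conventional lifestyle",
]

# 1. Boolean presence/absence of each word in each document
presence = CountVectorizer(binary=True).fit_transform(docs)

# 2. Frequency histogram over a predetermined set: the X most common words (X = 5 here)
counts = CountVectorizer(max_features=5).fit_transform(docs)

# 3. Juxtaposition of two adjacent words captured as binary bigram features
bigram_vec = CountVectorizer(ngram_range=(2, 2), binary=True)
bigrams = bigram_vec.fit_transform(docs)
print(bigram_vec.get_feature_names_out())  # includes 'alternative lifestyle'

# 4. Latent features via LSA: truncated SVD of the term-document matrix
latent = TruncatedSVD(n_components=1).fit_transform(counts)
```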
A common reference data set in machine learning consists of the frequencies of the 50 or so most common words, aka "stop words" (e.g., a, an, of, and, the, there, if), in the published works of Shakespeare, London, Austen, and Milton. A basic multi-layer perceptron with a single hidden layer can separate this data set with 100% accuracy. This data set and variations on it are widely available in ML data repositories, and academic papers presenting classification results on it are likewise common.
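To make that concrete, here is a hedged sketch of building such stop-word frequency features and training a single-hidden-layer perceptron on them; the texts, labels, and seven-word list are stand-ins for the actual reference data set, which this sketch does not reproduce:

```python
# Sketch: stop-word frequency features + a single-hidden-layer MLP.
# Texts, labels, and the word list are illustrative stand-ins.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

stop_words = ["a", "an", "of", "and", "the", "there", "if"]   # subset of the ~50
texts = ["the cat and the dog sat", "if there is a way out"]  # placeholder excerpts
authors = ["Shakespeare", "Austen"]                           # placeholder labels

# Restrict the vocabulary to the fixed stop-word list; the token pattern is
# widened so one-letter words like "a" are counted too.
vec = CountVectorizer(vocabulary=stop_words, token_pattern=r"(?u)\b\w+\b")
X = vec.fit_transform(texts).toarray().astype(float)
X /= X.sum(axis=1, keepdims=True)  # normalize counts to relative frequencies

clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
clf.fit(X, authors)
```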
The standard approach is the "bag-of-words" representation, where you have one feature per word, giving "1" if the word occurs in the document and "0" if it doesn't.

This gives lots of features, but if you have a simple learner like Naive Bayes, that's still OK.

"Index in the dictionary" is a useless feature; I wouldn't use it.
tf-idf is a pretty standard way of turning words into numeric features.

You need to remember to use a learning algorithm that supports numeric features, like an SVM. Naive Bayes doesn't support numeric features.
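A minimal sketch of this combination, assuming scikit-learn's TfidfVectorizer chained to a linear SVM; the corpus and labels are placeholders:

```python
# Sketch: tf-idf features fed to a linear SVM via a scikit-learn pipeline.
# Corpus and labels are illustrative placeholders.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["great movie loved it", "terrible plot waste of time"]
labels = ["pos", "neg"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)
print(model.predict(["loved it"]))
```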