Why is the length of word_index greater than num_words?
I have some code for text preprocessing for deep learning:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(num_words = 10000)
tokenizer.fit_on_texts(X)
tokenizer.word_index
But when I check the length of tokenizer.word_index, expecting to get 10000, I get 13233. The length of X is 11541 (it is a dataframe column with 11541 entries, if that matters). So my question is: which one is the vocabulary size, num_words or the length of word_index? I seem to have confused myself. Any help is appreciated.
1 Answer
According to the official docs, the num_words argument is the maximum number of words to keep, based on word frequency: only the most common num_words-1 words are used. word_index, however, will hold all the words which are present in texts, which is why its length can be larger than num_words. The difference is only observed when you use Tokenizer.texts_to_sequences. For instance, let us consider some sentences.
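The original code snippet is not preserved on this page, so the following is a minimal reconstruction; the training texts are hypothetical and chosen so that hello and python are the two most frequent words:

from keras.preprocessing.text import Tokenizer

# Hypothetical training texts, assumed for illustration:
# 'hello' and 'python' each appear twice, 'world' and 'java' once.
texts = [
    'hello world',
    'hello python',
    'python java',
]

tokenizer = Tokenizer(num_words=3)
tokenizer.fit_on_texts(texts)
print(tokenizer.word_index)

The output of the above snippet will be something like,

{'hello': 1, 'python': 2, 'world': 3, 'java': 4}

Note that word_index contains all four words, even though num_words=3.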
According to the docs, only the top num_words-1 words (based on their frequency) are used while transforming words to indices. In our case num_words=3, and hence we'd expect the Tokenizer to only use 2 words for the transformation. The two most common words in texts are hello and python. Consider this example to inspect the output of texts_to_sequences.
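The exact sentences from the original answer are likewise not shown here; the four test sentences below are assumed so that they match the observations that follow.

# Four hypothetical test sentences, reusing the tokenizer fitted above.
test_texts = [
    'hello',         # only 'hello'
    'hello java',    # 'java' is not among the top num_words-1 words
    'hello python',  # both words are among the top num_words-1 words
    'python java',   # 'java' is dropped again
]

print(tokenizer.texts_to_sequences(test_texts))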
The output, for the sentences assumed above, would be:
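[[1], [1], [1, 2], [2]]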
Observe that in the first sentence, hello is encoded as expected. In the second sentence, the word java isn't encoded, as it is not among the top num_words-1 words. In the third sentence, both hello and python are included, which is the expected behavior as per our assumption. In the fourth sentence, the word java again isn't encoded in the output. As you might have understood, num_words acts as the vocabulary size, since only that many words are ever encoded in the output; the rest of the words, in our case java and world, are simply omitted from the transformation.