Why is the length of word_index greater than num_words?

Posted 2025-01-12 05:13:14


I have some code for text preprocessing for deep learning:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words = 10000)
tokenizer.fit_on_texts(X)
tokenizer.word_index

But when I check the length of tokenizer.word_index, expecting it to be 10000, I get 13233. The length of X is 11541 (it is a dataframe column with 11541 entries, if that matters). So my question arises: which is the vocabulary size, num_words or the length of word_index? It seems I have confused myself! Any help is appreciated.


Comments (1)

风月客 2025-01-19 05:13:14


According to the official docs, the argument num_words is,

the maximum number of words to keep, based on word frequency. Only the most common num_words-1 words will be kept.

word_index will hold all the words which are present in texts. But the difference is observed when you use Tokenizer.texts_to_sequences. For instance, let us consider some sentences,

import tensorflow as tf

texts = [
    'hello world',
    'hello python',
    'python',
    'hello java',
    'hello java',
    'hello python'
]
# Word frequencies: hello -> 5, python -> 3, java -> 2, world -> 1
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=3)
tokenizer.fit_on_texts(texts)
print(tokenizer.word_index)

The output of the above snippet will be,

{'hello': 1, 'python': 2, 'java': 3, 'world': 4}
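If you want to double-check the frequencies this ranking is based on, the fitted tokenizer also exposes a word_counts mapping; a quick sketch, reusing the tokenizer fitted above:

# word_counts records how often each word was seen during fit_on_texts
print(tokenizer.word_counts)
# roughly: OrderedDict([('hello', 5), ('world', 1), ('python', 3), ('java', 2)])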

According to the docs, only the top num_words-1 words (based on their frequency) are used when transforming words to indices. In our case num_words=3, so we'd expect the Tokenizer to use only 2 words for the transformation. The two most common words in texts are hello and python. Consider this example to inspect the output of texts_to_sequences:

input_seq = [
    'hello',
    'hello java',
    'hello python',
    'hello python java'
]
print(tokenizer.texts_to_sequences(input_seq))

The output,

[[1], [1], [1, 2], [1, 2]]

Observe that in the first sentence, hello is encoded as expected. In the second sentence, the word java is not encoded, as it was not included in the vocabulary. In the third sentence, both hello and python are included, which is the expected behavior given our assumption. In the fourth sentence, the word java is again missing from the encoded output.
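If silently dropping out-of-vocabulary words is not what you want, one option is the oov_token argument of Tokenizer. The following is only a sketch (it reuses tf and texts from above and is not part of the original question): the OOV token always takes index 1, so unknown words are mapped to it instead of being omitted.

# Reserve an explicit out-of-vocabulary token; it always takes index 1,
# so only num_words - 2 "real" words survive the cap in this setup.
oov_tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=4, oov_token='<OOV>')
oov_tokenizer.fit_on_texts(texts)
print(oov_tokenizer.texts_to_sequences(['hello python java']))
# java falls outside the kept indices, so it becomes the OOV index: [[2, 3, 1]]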

So my question arises: which is the vocabulary size, num_words or the length of word_index?

As you might have gathered, num_words is the vocabulary size, since only that many words (strictly, num_words-1 of them) end up encoded in the output. The rest of the words, in our case java and world, are simply omitted from the transformation.
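One practical consequence, sketched below under the assumption that the sequences are fed into a Keras Embedding layer (not something the question asked about): the layer's input_dim should be derived from num_words rather than from len(word_index), because the indices produced by texts_to_sequences never reach num_words.

# Indices emitted by texts_to_sequences are always smaller than num_words
# (index 0 is reserved for padding), so num_words is a safe Embedding input_dim.
vocab_size = tokenizer.num_words  # 3 in this toy example, 10000 in the question
# If the corpus has fewer distinct words than num_words, the effective size is smaller:
effective_vocab = min(tokenizer.num_words, len(tokenizer.word_index) + 1)
print(vocab_size, effective_vocab)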
