tf-idf and previously unseen terms
TF-IDF (term frequency - inverse document frequency) is a staple of information retrieval. It's not a proper model, though, and it seems to break down when new terms are introduced into the corpus. How do people handle it when queries or new documents contain new terms, especially if they are high frequency? Under traditional cosine matching, those would have no impact on the total match.
Er, nope, doesn't break down.
Say I have two documents, A "weasel goat" and B "cheese gopher". If we actually represented these as vectors, they might look something like:
A [1,1,0,0]
B [0,0,1,1]
and if we've allocated these vectors in an index file, yeah, we've got a problem when it comes time to add a new term. But the trick is that that vector never actually exists. The key is the inverted index; see the sketch below.
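A minimal sketch of that idea in Python (the `add_document` helper and the document IDs are just made up for illustration): the index maps each term to the documents containing it, so a document with a brand-new term only adds a new posting list and never forces existing entries to be resized.

```python
from collections import defaultdict

# Toy inverted index: term -> {doc_id: term frequency}.
index = defaultdict(dict)

def add_document(doc_id, text):
    # A new term simply creates a new posting list; nothing
    # already in the index has to change shape.
    for term in text.split():
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

add_document("A", "weasel goat")
add_document("B", "cheese gopher")
add_document("C", "weasel marmoset")  # "marmoset" is new to the corpus

print(index["marmoset"])  # {'C': 1}
```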
As far as new terms not affecting a cosine match, that might be true depending on what you mean. If I search my corpus of (A,B) with the query "marmoset kungfu", neither marmoset nor kungfu exist in the corpus. So the vector representing my query will be orthogonal to all the documents in the collection, and get a bad cosine similarity score. But considering none of the terms match, that seems pretty reasonable.
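To make the orthogonality point concrete, here is a small self-contained sketch of TF-IDF weighting plus cosine similarity (the `tfidf` and `cosine` helpers are toy versions for illustration, not any particular library's API). A query made entirely of terms that never occur in the corpus ends up as the zero vector and scores 0 against every document, which is the behaviour described above.

```python
import math
from collections import Counter

# Toy corpus: the two documents from the example above.
docs = {"A": "weasel goat".split(), "B": "cheese gopher".split()}

def tfidf(tokens):
    # Terms with zero document frequency get no weight, so a query made
    # only of unseen terms becomes the empty (zero) vector.
    vec = {}
    for term, tf in Counter(tokens).items():
        df = sum(1 for doc in docs.values() if term in doc)
        if df:
            vec[term] = tf * math.log(len(docs) / df)
    return vec

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

query = tfidf("marmoset kungfu".split())
for doc_id, tokens in docs.items():
    print(doc_id, cosine(query, tfidf(tokens)))  # 0.0 for both A and B
```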
When you talk about "break down" I think you mean that the new terms have no impact on the similarity measure, because they do not have any representation in the vector space defined by the original vocabulary.
One approach to handling this smoothing problem would be to fix the vocabulary to a smaller one and treat all words rarer than a certain threshold as belonging to a special _UNKNOWN_ word. However, I don't think your definition of "break down" is very clear; could you clarify what you mean there? If you could clear that up, perhaps we could discuss ways to work around those problems.
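One way that thresholding idea could look in code, assuming a hypothetical `min_count` cutoff and the _UNKNOWN_ token mentioned above: any term rarer than the threshold in the corpus, including terms never seen at all, is mapped onto the single shared _UNKNOWN_ entry, so new words at query time still land on something the model has a representation for.

```python
from collections import Counter

def build_vocab(corpus_tokens, min_count=2, unk="_UNKNOWN_"):
    # Keep only terms seen at least min_count times; everything rarer,
    # including terms never seen before, will map to the shared unk token.
    counts = Counter(t for doc in corpus_tokens for t in doc)
    vocab = {t for t, c in counts.items() if c >= min_count}
    vocab.add(unk)
    return vocab

def map_tokens(tokens, vocab, unk="_UNKNOWN_"):
    return [t if t in vocab else unk for t in tokens]

corpus = [["weasel", "goat", "weasel"], ["cheese", "gopher", "goat"]]
vocab = build_vocab(corpus, min_count=2)
print(map_tokens(["weasel", "marmoset", "kungfu"], vocab))
# ['weasel', '_UNKNOWN_', '_UNKNOWN_']
```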