相关功能数据结构的建议
我们有一组 Documents ,每个 Documents 都有一组 Features 。 给定特征 A,我们需要知道在同一文档中具有特征 B 的概率是多少。
我想到建立一个概率矩阵 , st: M(i,j) = 假设特征 A 存在,文档中具有特征 B 的概率。
但是,我们还有一个额外的要求: 给定文档中的特征 A,那么概率 > 的所有特征是什么? P 位于同一文档中。
与此同时,我能想到的只是概率矩阵的稀疏矩阵,在计算之后,对于运行在所有列上的每个特征,按 P 对其进行排序,并将其保存在某个链接列表中。 (所以现在,我们对于每个特征,都有一个对应特征的列表
。这个空间复杂度相当大(最坏的情况:N^2,而且N很大!),并且每次搜索的时间复杂度是O(N)。
任何更好的主意吗?
We have a set of Documents , each has a set of Features.
Given feature A, we need to know what is the probability of having feature B in the same document.
I thought of building a probability matrix , s.t:
M(i,j) = Probability of having feature B in a document , given that feature A is there.
However , we have an additional requirement:
Given feature A is in the document , what are all the features that have a probability > P of being in the same document.
In the mean while all I could think off is a sparse matrix for the Probability matrix , and after it's computed , for each feature run over all the column , sort it by P , and keep it in a linked list somewhere. (So now , we have for each feature , a list of corresponding features
This space complexity is quite big (worst case: N^2, and N is large!) , and the time complexity for each search is O(N).
Any better idea?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
如果特征的数量与文档的数量相当或更大,请考虑保留倒排索引:对于每个特征,保存它所在的文档(例如,排序的列表)。然后,您可以通过对特征 A 和 B 的排序列表进行合并来算出给定 A 的 B 的概率。
对于“给定 A 预期的共同特征”问题,我能想到没有什么比预先计算每个特征的答案更好的了。 A 并希望最终的功能列表不会太长。
If the number of features is comparable with the number of documents, or greater, consider holding an inverted index: for each feature hold (e.g. a sorted list of) the documents in which it is present. You can then work out the probability of B given A by running a merge on the sorted lists for features A and B.
For the "common features expected given A" question, I can think of nothing better than pre-computing the answer for each A and hoping that the resulting list of features isn't too long.