Algorithm: binarization of data
I have a huge dataset of words word_i and weights weight[i,j], where each weight is the "connection strength" between the words word_i and word_j. I'd like to binarize this data: is there an existing algorithm that assigns each word a binary code in such a way that the Hamming distance between two words' codes correlates with their weight?
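To make the goal concrete, here is a minimal sketch with made-up data (numpy assumed): a coding is good when strongly connected words receive nearby codes, i.e. when Hamming distance and weight are negatively correlated.

```python
import numpy as np

# Made-up data, just to state the goal precisely: K words, an
# (asymmetric) weight matrix, and candidate N-bit codes per word.
K, N = 5, 8
rng = np.random.default_rng(0)
weight = rng.random((K, K))              # weight[i, j] = connection strength
codes = rng.integers(0, 2, size=(K, N))  # one N-bit code per word

def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return int(np.sum(a != b))

# A coding is good when strongly connected words get close codes,
# i.e. when Hamming distance and weight are negatively correlated.
dists = [hamming(codes[i], codes[j]) for i in range(K) for j in range(K) if i != j]
ws = [weight[i, j] for i in range(K) for j in range(K) if i != j]
print("corr(distance, weight) =", np.corrcoef(dists, ws)[0, 1])
```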
Added:
The problem I am working on: I want to try to teach a neural net or an SVM to make associations between words, and that's why I've decided to binarize the data.
Don't ask why I don't want to use Markov models or plain graphs; I've tried them and want to compare them with neural nets.
So,

- I want my NN, given a word "a", to return its closest association, or any set of words together with their probabilities;
- I've tried simply binarizing the words and using "ab" as the input with the weight as the preferred answer; this worked badly (see the sketch after this list);
- I was thinking of a threshold on the weights for changing one more bit: the smaller this threshold, the more bits you need;
- I have a situation where a->b has weight w1 and b->a has weight w2 with w1 >> w2, so direction is significant.
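For reference, this is roughly what the failed attempt looked like, as a sketch under my assumptions (numpy; the codes array holds one binary code per word): the input is code(a) concatenated with code(b) and the target is weight[a, b], so the a->b and b->a directions stay distinguishable.

```python
import numpy as np

def make_training_pairs(codes, weight):
    """Build (input, target) pairs for the NN: the input is the
    concatenation of the two words' binary codes, the target is the
    weight of the directed pair. (a, b) and (b, a) give different
    inputs, so direction survives the encoding."""
    X, y = [], []
    K = len(codes)
    for a in range(K):
        for b in range(K):
            if a != b:
                X.append(np.concatenate([codes[a], codes[b]]))
                y.append(weight[a, b])
    return np.array(X), np.array(y)
```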
Comments (1)
What you can do is use a self-organizing map (SOM) with the topology of fixed-length, say N-bit, words, so that e.g. if N = 8 then every cell in the SOM has exactly 8 neighbors (the cells where one bit has been flipped). Now if you have K [dictionary] words, you can encode every [dictionary] word as a vector of K real numbers between 0 and 1, such that the i-th word has its i-th element set to 1 and the others set to 0. You can then calculate the "distance" between two arbitrary vectors a1...aK and b1...bK by summing, over all pairs of elements, terms derived from the weights; this gives you the distance metric for running the SOM algorithm. When the SOM has stabilized, [dictionary] words that are near each other in your metric are also near each other in the map's topology, from which you trivially read off the encoding as [binary] words.
Note that the map must have more cells than there are words, i.e. 2**N > K.
This answer of course assumes some background in self-organizing maps. See
http://en.wikipedia.org/wiki/Self-organizing_map
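For concreteness, here is a minimal runnable sketch of the whole procedure (numpy, made-up data). Since the exact summation term is not spelled out above, it assumes a bilinear dissimilarity d(a, b) = sum over i, j of a_i * b_j * D[i, j], with D derived from the symmetrized weights; this is one plausible choice of metric, not necessarily the intended one.

```python
import numpy as np

# Made-up data: K words, N-bit codes, with 2**N > K as noted above.
K, N = 5, 4
rng = np.random.default_rng(0)
weight = rng.random((K, K))

# Assumed dissimilarity: symmetrize the weights, rescale to [0, 1],
# and invert so that a high weight means a small distance.
S = (weight + weight.T) / 2
D = 1.0 - S / S.max()
np.fill_diagonal(D, 0.0)

def dist(a, b):
    """Assumed bilinear 'distance' between two vectors in [0, 1]^K."""
    return float(a @ D @ b)

def neighbors(c):
    """Cells whose N-bit index differs from c by one flipped bit."""
    return [c ^ (1 << bit) for bit in range(N)]

# One prototype vector per cell, kept on the probability simplex so
# that dist() is the expected dissimilarity between word distributions.
cells = rng.random((2 ** N, K))
cells /= cells.sum(axis=1, keepdims=True)

onehot = np.eye(K)

# Simplified SOM loop: fixed learning rates, neighborhood radius 1.
for step in range(3000):
    x = onehot[rng.integers(K)]
    bmu = min(range(2 ** N), key=lambda c: dist(x, cells[c]))
    cells[bmu] += 0.5 * (x - cells[bmu])      # pull BMU toward the input
    for c in neighbors(bmu):
        cells[c] += 0.1 * (x - cells[c])      # and its bit-flip neighbors

# Read off the code of each word: the index of its best matching cell.
for w in range(K):
    bmu = min(range(2 ** N), key=lambda c: dist(onehot[w], cells[c]))
    print(f"word_{w} -> {bmu:0{N}b}")
```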