Algorithm: binarization of data
I have a huge dataset of words word_i and weights weight[i,j], where each weight is the "connection strength" between the words word_i and word_j. I'd like to binarize this data: is there an existing algorithm that assigns each word a binary code in such a way that the Hamming distance between two words' codes correlates with their weight?
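To make the goal concrete, here is a minimal sketch with made-up data (numpy assumed): a coding is good when strongly connected words receive nearby codes, i.e. when Hamming distance and weight are negatively correlated.

```python
import numpy as np

# Made-up data, just to state the goal precisely: K words, an
# (asymmetric) weight matrix, and candidate N-bit codes per word.
K, N = 5, 8
rng = np.random.default_rng(0)
weight = rng.random((K, K))              # weight[i, j] = connection strength
codes = rng.integers(0, 2, size=(K, N))  # one N-bit code per word

def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return int(np.sum(a != b))

# A coding is good when strongly connected words get close codes,
# i.e. when Hamming distance and weight are negatively correlated.
dists = [hamming(codes[i], codes[j]) for i in range(K) for j in range(K) if i != j]
ws = [weight[i, j] for i in range(K) for j in range(K) if i != j]
print("corr(distance, weight) =", np.corrcoef(dists, ws)[0, 1])
```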
Added:
The problem I am working on: I want to try to teach a neural net or an SVM to make associations between words, and that's why I've decided to binarize the data.
Don't ask why I don't want to use Markov models or plain graphs; I've tried them and want to compare them with neural nets.
So,

- I want my NN, given a word "a", to return its closest association, or any set of words together with their probabilities;
- I've tried simply binarizing the words and using "ab" as the input with the weight as the preferred answer; this worked badly (see the sketch after this list);
- I was thinking of a threshold on the weights for changing one more bit: the smaller this threshold, the more bits you need;
- I have a situation where a->b has weight w1 and b->a has weight w2 with w1 >> w2, so direction is significant.
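For reference, this is roughly what the failed attempt looked like, as a sketch under my assumptions (numpy; the codes array holds one binary code per word): the input is code(a) concatenated with code(b) and the target is weight[a, b], so the a->b and b->a directions stay distinguishable.

```python
import numpy as np

def make_training_pairs(codes, weight):
    """Build (input, target) pairs for the NN: the input is the
    concatenation of the two words' binary codes, the target is the
    weight of the directed pair. (a, b) and (b, a) give different
    inputs, so direction survives the encoding."""
    X, y = [], []
    K = len(codes)
    for a in range(K):
        for b in range(K):
            if a != b:
                X.append(np.concatenate([codes[a], codes[b]]))
                y.append(weight[a, b])
    return np.array(X), np.array(y)
```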
Comments (1)
What you can do is use a self-organizing map (SOM) with the topology of fixed-length, say N-bit, words, so that e.g. if N = 8 then every cell in the SOM has exactly 8 neighbors (the cells where one bit has been flipped). Now if you have K [dictionary] words, you can encode every [dictionary] word as a vector of K real numbers between 0 and 1, such that the i-th word has its i-th element set to 1 and the others set to 0. You can then calculate the "distance" between two arbitrary vectors a1...aK and b1...bK by summing, over all pairs of elements, terms derived from the weights; this gives you the distance metric for running the SOM algorithm. When the SOM has stabilized, [dictionary] words that are near each other in your metric are also near each other in the map's topology, from which you trivially read off the encoding as [binary] words.
Note that the map must have more cells than there are words, i.e. 2**N > K.
This answer of course assumes some background in self-organizing maps. See
http://en.wikipedia.org/wiki/Self-organizing_map
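For concreteness, here is a minimal runnable sketch of the whole procedure (numpy, made-up data). Since the exact summation term is not spelled out above, it assumes a bilinear dissimilarity d(a, b) = sum over i, j of a_i * b_j * D[i, j], with D derived from the symmetrized weights; this is one plausible choice of metric, not necessarily the intended one.

```python
import numpy as np

# Made-up data: K words, N-bit codes, with 2**N > K as noted above.
K, N = 5, 4
rng = np.random.default_rng(0)
weight = rng.random((K, K))

# Assumed dissimilarity: symmetrize the weights, rescale to [0, 1],
# and invert so that a high weight means a small distance.
S = (weight + weight.T) / 2
D = 1.0 - S / S.max()
np.fill_diagonal(D, 0.0)

def dist(a, b):
    """Assumed bilinear 'distance' between two vectors in [0, 1]^K."""
    return float(a @ D @ b)

def neighbors(c):
    """Cells whose N-bit index differs from c by one flipped bit."""
    return [c ^ (1 << bit) for bit in range(N)]

# One prototype vector per cell, kept on the probability simplex so
# that dist() is the expected dissimilarity between word distributions.
cells = rng.random((2 ** N, K))
cells /= cells.sum(axis=1, keepdims=True)

onehot = np.eye(K)

# Simplified SOM loop: fixed learning rates, neighborhood radius 1.
for step in range(3000):
    x = onehot[rng.integers(K)]
    bmu = min(range(2 ** N), key=lambda c: dist(x, cells[c]))
    cells[bmu] += 0.5 * (x - cells[bmu])      # pull BMU toward the input
    for c in neighbors(bmu):
        cells[c] += 0.1 * (x - cells[c])      # and its bit-flip neighbors

# Read off the code of each word: the index of its best matching cell.
for w in range(K):
    bmu = min(range(2 ** N), key=lambda c: dist(onehot[w], cells[c]))
    print(f"word_{w} -> {bmu:0{N}b}")
```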