Mapping arbitrary strings to RGB values

Posted on 2024-07-13 03:37:49

I have a huge set of arbitrary natural language strings. For my tool to analyze them, I need to convert each string to a unique color value (RGB or other). I need the color contrast to depend on string similarity (the more two strings differ, the more their respective colors should differ). It would be perfect if I always got the same color value for the same string.

Any advice on how to approach this problem?

Update on distance between strings

I probably need "similarity" defined as a Levenshtein-like distance. No natural language parsing is required.

That is:

"I am going to the store" and 
"We are going to the store"

Similar.

"I am going to the store" and 
"I am going to the store today"

Similar as well (but slightly less).

"I am going to the store" and 
"J bn hpjoh up uif tupsf"

Not similar at all.

(Thanks, Welbog!)

I will probably know exactly which distance function I need only once I see the program output. So let's start with simpler things.

Update on task simplification

I've removed my own suggestion to split the task into two parts, absolute distance calculation and color distribution. This would not work well, because we would first reduce the information to a single dimension and then try to expand it back to three dimensions.

夕色琉璃 2024-07-20 03:37:49

You need to elaborate more on what you mean by "similar strings" in order to come up with an appropriate conversion function. Are the strings

 "I am going to the store" and 
"We are going to the store"

considered similar? What about the strings

 "I am going to the store" and 
"J bn hpjoh up uif tupsf"

(all of the letters in the original +1), or

 "I am going to the store" and 
"I am going to the store today"

? Based on what you mean by "similar", you might consider different functions.

If the difference can be based solely on the values of the characters (in Unicode or whatever space they are from), then you can try summing the values up and using the result as a hue in HSV space. If a longer string should cause the colours to be more different, you might consider weighting characters by their position in the string.
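A minimal Python sketch of the summing idea, for illustration only. The function name `string_to_rgb`, the `hue_steps` modulus, and the `weight_by_position` switch are assumptions made for this example, not part of the answer; saturation and value are simply fixed at 1.0.

```python
import colorsys


def string_to_rgb(text: str, hue_steps: int = 360,
                  weight_by_position: bool = False) -> tuple[int, int, int]:
    """Use the (optionally position-weighted) sum of code points as an HSV hue."""
    if weight_by_position:
        # Multiply each code point by its 1-based position so anagrams differ.
        total = sum(ord(ch) * (i + 1) for i, ch in enumerate(text))
    else:
        total = sum(ord(ch) for ch in text)
    hue = (total % hue_steps) / hue_steps  # hue in [0, 1)
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return round(r * 255), round(g * 255), round(b * 255)


if __name__ == "__main__":
    for s in ("I am going to the store", "We are going to the store"):
        print(s, string_to_rgb(s))
```

The same string always yields the same colour, but a plain sum does not track edit distance: two anagrams collide unless position weighting is switched on, and the modulus makes the hue wrap around.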

If the difference is more complex, such as the occurrence of certain letters or words, then you need to identify this. Maybe you can decide the red, green and blue values based on the number of Es, Ss and Rs in a string, if your domain has a lot of these. Or pick a hue based on the ratio of vowels to consonants, or words to syllables.
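One way the letter-count idea could look in Python. This is only a sketch: the name `letter_count_rgb`, the default letters, and the `full_share` threshold (the fraction of the string at which a channel saturates) are all assumptions chosen for the example.

```python
def letter_count_rgb(text: str, red: str = "e", green: str = "s",
                     blue: str = "r", full_share: float = 0.2) -> tuple[int, int, int]:
    """Derive R, G and B from how often three chosen letters occur in the text."""
    lowered = text.lower()
    if not lowered:
        return (0, 0, 0)

    def channel(letter: str) -> int:
        # Share of the string made up of this letter, scaled so that
        # `full_share` (hypothetical threshold) or more gives 255.
        share = lowered.count(letter) / len(lowered)
        return min(255, round(share / full_share * 255))

    return channel(red), channel(green), channel(blue)


if __name__ == "__main__":
    print(letter_count_rgb("I am going to the store"))
```

Strings that contain none of the chosen letters all map to black, so the letters need to be picked to suit the data.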

There are many, many different ways to approach this, but the best one really depends on what you mean by "similar" strings.

烟雨扶苏 2024-07-20 03:37:49

It sounds like you want a hash of some sort. It doesn't need to be secure (so nothing as complicated as MD5 or SHA) but something along the lines of:

(char1 + char2 + char3 + ... + charN) % MAX_COLOUR_VALUE

would work as a simple first step. You could also do fancier things along the lines of having each character act as an 'amplitude' for R,G and B (e could be +1R, +2G and -4B, etc.) and then simply add up all the values in a string... clamp them at the end and you have a method of turning arbitrary length strings into colours as a 'colour hash' sort of process.
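Both variants can be sketched in Python as follows. The names, the `AMPLITUDES` table, the neutral grey starting point, and the choice of `256 ** 3` for `MAX_COLOUR_VALUE` are assumptions made for illustration; the answer only gives "+1R, +2G and -4B" for `e` as an example.

```python
MAX_COLOUR_VALUE = 256 ** 3  # assumed: number of distinct 24-bit RGB colours

# Hypothetical per-character (dR, dG, dB) amplitudes; only "e" comes from the text.
AMPLITUDES = {
    "e": (1, 2, -4),
    "a": (3, -1, 2),
    "t": (-2, 4, 1),
    "o": (2, -3, 3),
}


def sum_hash_colour(text: str) -> tuple[int, int, int]:
    """Sum the code points, wrap the total, and split it into three bytes."""
    value = sum(ord(ch) for ch in text) % MAX_COLOUR_VALUE
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF


def amplitude_colour(text: str, base: int = 128) -> tuple[int, int, int]:
    """Start from grey, add each character's amplitudes, then clamp to 0..255."""
    r = g = b = base
    for ch in text.lower():
        dr, dg, db = AMPLITUDES.get(ch, (0, 0, 0))  # unknown characters add nothing
        r, g, b = r + dr, g + dg, b + db
    return tuple(max(0, min(255, v)) for v in (r, g, b))


if __name__ == "__main__":
    print(sum_hash_colour("I am going to the store"))
    print(amplitude_colour("I am going to the store"))
```

Note that with short strings the plain sum stays small, so `sum_hash_colour` mostly varies the green and blue bytes; the amplitude variant spreads changes over all three channels.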

非要怀念 2024-07-20 03:37:49

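First, you'll need to pick a way to measure string similarity. Minimal edit distance is the traditional choice, but it isn't sufficient to put the strings in a well-defined order, which is what you will need if you want to allocate the same colours to the same strings every time. Perhaps you could weight the edit costs by alphabetic distance. Also, minimal edit distance by itself may not be very useful if what you are after is similarity in speech rather than in written form (if so, consider a stemming/Soundex pass first), or some other sense of "similarity".

Then you need to pick a way of traversing the visible colour space based on that metric. It may be helpful to consider an HSL or HSV colour representation. The algorithm could then be as simple as picking a starting hue and walking the sorted corpus, assigning the current hue to each string and then offsetting it by the string's difference from the previous one.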
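The walk described above might be sketched like this in Python. Several things here are assumptions for the sake of a runnable example: plain lexicographic sorting stands in for the ordering step, unweighted Levenshtein distance is the metric, and `span` (the fraction of the hue circle the walk covers) is a hypothetical parameter that keeps the last hue from wrapping onto the first.

```python
import colorsys


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def assign_colours(strings, start_hue: float = 0.0, span: float = 0.8):
    """Walk the sorted corpus, advancing the hue by each string's distance
    from its predecessor (normalised so the whole walk covers `span`)."""
    ordered = sorted(set(strings))
    gaps = [levenshtein(p, q) for p, q in zip(ordered, ordered[1:])]
    total = sum(gaps) or 1  # avoid dividing by zero for a single string
    colours = {}
    walked = 0
    for index, text in enumerate(ordered):
        if index > 0:
            walked += gaps[index - 1]
        hue = (start_hue + span * walked / total) % 1.0
        r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
        colours[text] = (round(r * 255), round(g * 255), round(b * 255))
    return colours


if __name__ == "__main__":
    corpus = ["I am going to the store",
              "I am going to the store today",
              "J bn hpjoh up uif tupsf"]
    for text, colour in assign_colours(corpus).items():
        print(colour, text)
```

Because the hues are normalised over the corpus, a string keeps its colour only as long as the corpus stays the same.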

泪意 2024-07-20 03:37:49

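How important is it that two different strings never end up with the same colour?

If it's not that important, then maybe this could work?

You could pick a one-dimensional colour space that is "homotopic" to the circle: say the colour function c(x) is defined for x between 0 and 1. Then you would need c(0) == c(1).

Now you take the sum of all character values modulo some scaling factor and wrap it back into the colour space:

c( (SumOfCharValues(word) modulo ScalingFactor) / ScalingFactor )

This might work even better if you defined a higher-dimensional "wrapped" colour space and chose a different SumOfCharValues function for each dimension; someone suggested alternating between sum and length.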
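A rough Python sketch of the wrapped colour space, under some assumptions: the hue circle of HLS plays the role of the one-dimensional space with c(0) == c(1), and a triangle wave on lightness is one possible second "wrapped" dimension driven by string length. The names `triangle`, `wrapped_colour` and `length_factor` are made up for this example.

```python
import colorsys


def sum_of_char_values(word: str) -> int:
    return sum(ord(ch) for ch in word)


def triangle(x: float) -> float:
    """Continuous periodic wave: 0 at x = 0 and x = 1, peaking at x = 0.5."""
    return 1.0 - abs(2.0 * (x % 1.0) - 1.0)


def c(x: float, y: float = 0.5) -> tuple[int, int, int]:
    """Colour function that wraps in both arguments: c(0, y) == c(1, y)
    and c(x, 0) == c(x, 1)."""
    hue = x % 1.0
    lightness = 0.35 + 0.3 * triangle(y)  # stays within 0.35..0.65
    r, g, b = colorsys.hls_to_rgb(hue, lightness, 1.0)
    return round(r * 255), round(g * 255), round(b * 255)


def wrapped_colour(word: str, scaling_factor: int = 360,
                   length_factor: int = 40) -> tuple[int, int, int]:
    x = (sum_of_char_values(word) % scaling_factor) / scaling_factor
    y = (len(word) % length_factor) / length_factor  # second dimension: length
    return c(x, y)


if __name__ == "__main__":
    print(wrapped_colour("I am going to the store"))
    print(wrapped_colour("I am going to the store today"))
```

Collisions remain possible: any two strings with the same character sum and the same length, such as anagrams, get the same colour.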

Just a thought... HTH

冷…雨湿花 2024-07-20 03:37:49

Here is my suggestion (I think there is a general name for this algorithm, but I'm too tired to remember it):

You want to transform each string to a 3D point node(r, g, b) (you can scale the values so that they fit your range) such that the following error is minimized:

Error = \sum_i{\sum_j{(dist(node_i, node_j) - dist(str_i, str_j))^2}}

You can do this:

First assign each string a random color (r, g, b)
Repeat until you see fit (e.g. until the error changes by less than \epsilon = 0.0001):
1. Pick a random node
2. Adjust its position (r, g, b) such that the error is minimized
Scale the coordinate system such that each node's coordinates are in the range [0., 1.) or [0, 255]
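Minimising this kind of error is what metric multidimensional scaling does. The steps above can be sketched in plain Python as below, with several simplifications that are assumptions of this example rather than part of the answer: step 2 takes one small gradient step instead of fully minimising, the loop runs a fixed number of iterations instead of testing against \epsilon, the random generator is seeded so results repeat, and `difflib.SequenceMatcher` stands in for a Levenshtein-like `dist`.

```python
import difflib
import math
import random


def string_dist(a: str, b: str) -> float:
    """Stand-in distance in [0, 1]: 0 for identical strings."""
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()


def embed_colours(strings, dist=string_dist, iterations=2000,
                  step=0.05, seed=0):
    """Place each string at an (r, g, b) point so that point distances
    approximate string distances, then rescale to 0..255."""
    n = len(strings)
    if n == 0:
        return []
    rng = random.Random(seed)  # fixed seed: same corpus gives same colours
    target = [[dist(a, b) for b in strings] for a in strings]
    largest = max(max(row) for row in target) or 1.0
    target = [[t / largest for t in row] for row in target]  # normalise to 0..1
    points = [[rng.random() for _ in range(3)] for _ in range(n)]

    for _ in range(iterations):
        i = rng.randrange(n)  # 1. pick a random node
        grad = [0.0, 0.0, 0.0]
        for j in range(n):
            if j == i:
                continue
            diff = [points[i][k] - points[j][k] for k in range(3)]
            d = math.sqrt(sum(x * x for x in diff)) or 1e-9
            # Gradient of (d - target)^2 with respect to node i.
            coeff = 2.0 * (d - target[i][j]) / d
            for k in range(3):
                grad[k] += coeff * diff[k]
        # 2. move the node a little way downhill
        points[i] = [p - step * g for p, g in zip(points[i], grad)]

    lo = min(min(p) for p in points)
    hi = max(max(p) for p in points)
    span = (hi - lo) or 1.0
    # One common scale for all axes keeps relative distances intact.
    return [tuple(round((x - lo) / span * 255) for x in p) for p in points]


if __name__ == "__main__":
    corpus = ["I am going to the store",
              "I am going to the store today",
              "J bn hpjoh up uif tupsf"]
    for text, colour in zip(corpus, embed_colours(corpus)):
        print(colour, text)
```

The computing the full distance table costs O(n^2) string comparisons, so for a huge corpus some sampling or approximation would be needed; and, as with any corpus-dependent embedding, adding strings changes the colours of existing ones.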
