Mapping arbitrary strings to RGB values

Posted on 2024-07-13 03:37:49

I have a huge set of arbitrary natural-language strings. For my tool to analyze them, I need to convert each string to a unique color value (RGB or other). I need the color contrast to depend on string similarity (the more two strings differ, the more their respective colors should differ). It would be perfect if I always got the same color value for the same string.

Any advice on how to approach this problem?

Update on distance between strings

I probably need "similarity" defined as a Levenshtein-like distance. No natural language parsing is required.

That is:

"I am going to the store" and 
"We are going to the store"

Similar.

"I am going to the store" and 
"I am going to the store today"

Similar as well (but slightly less).

"I am going to the store" and 
"J bn hpjoh up uif tupsf"

Not similar at all.

(Thanks, Welbog!)

I will probably know exactly what distance function I need only once I see program output. So let's start with simpler things.

Update on task simplification

I've removed my own suggestion to split the task into two parts, absolute distance calculation and color distribution. This would not work well, because we would first reduce the dimensional information to a single dimension and then try to synthesize it back up to three dimensions.

Comments (8)

夕色琉璃 2024-07-20 03:37:49

You need to elaborate more on what you mean by "similar strings" in order to come up with an appropriate conversion function. Are the strings

 "I am going to the store" and 
"We are going to the store" 

considered similar? What about the strings

 "I am going to the store" and 
"J bn hpjoh up uif tupsf" 

(all of the letters in the original +1), or

 "I am going to the store" and 
"I am going to the store today"

? Based on what you mean by "similar", you might consider different functions.

If the difference can be based solely on the values of the characters (in Unicode or whatever space they are from), then you can try summing the values up and using the result as a hue for HSV space. If having a longer string should cause the colours to be more different, you might consider weighting characters by their position in the string.
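As a rough illustration of the character-sum idea, here is a minimal Python sketch. The function name `string_to_rgb`, the `weighted` flag, and the choice to wrap the sum at 360 are my own assumptions, not something the answer specifies.

```python
import colorsys


def string_to_rgb(text, weighted=False):
    """Map a string to an (r, g, b) tuple by using a character sum as the hue."""
    if weighted:
        # Weight each code point by its 1-based position so that order matters.
        total = sum((i + 1) * ord(ch) for i, ch in enumerate(text))
    else:
        # Plain sum of code points: anagrams get the same colour.
        total = sum(ord(ch) for ch in text)
    hue = (total % 360) / 360.0  # wrap the sum onto the hue circle
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return tuple(round(c * 255) for c in (r, g, b))
```

The same string always yields the same colour, but note that a plain sum ignores character order, so unrelated strings can collide.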

If the difference is more complex, such as by the occurrences of certain letters or words, then you need to identify this. Maybe you can decide red, green and blue values based on the number of Es, Ss and Rs in a string, if your domain has a lot of these. Or pick a hue based on the ratio of vowels to consonants, or words to syllables.

There are many, many different ways to approach this, but the best one really depends on what you mean by "similar" strings.

烟雨扶苏 2024-07-20 03:37:49

It sounds like you want a hash of some sort. It doesn't need to be secure (so nothing as complicated as MD5 or SHA) but something along the lines of:

(char1 + char2 + char3 + ... + charN) % MAX_COLOUR_VALUE

would work as a simple first step. You could also do fancier things along the lines of having each character act as an 'amplitude' for R, G and B (e could be +1R, +2G and -4B, etc.) and then simply add up all the values in a string. Clamp them at the end and you have a method of turning arbitrary-length strings into colours, as a 'colour hash' sort of process.
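A minimal sketch of the 'amplitude' variant in Python might look like the following. The per-character amplitudes here are derived from the code point with arbitrary moduli, which is purely my assumption; a hand-tuned table per letter would fit the description just as well.

```python
def amplitude(ch):
    """Hypothetical per-character contribution to (R, G, B); the moduli are arbitrary."""
    code = ord(ch)
    return (code % 7 - 3, code % 11 - 5, code % 13 - 6)


def colour_hash(text, base=128):
    """Add up every character's amplitudes, then clamp each channel to 0..255."""
    r = g = b = base  # start from mid-grey so channels can move both ways
    for ch in text:
        dr, dg, db = amplitude(ch)
        r, g, b = r + dr, g + dg, b + db

    def clamp(value):
        return max(0, min(255, value))

    return (clamp(r), clamp(g), clamp(b))
```

Because contributions are summed, strings that differ by a few characters move only a little in colour space, while long strings tend to drift toward the clamped extremes.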

非要怀念 2024-07-20 03:37:49

First, you'll need to pick a way to measure string similarity. Minimal edit distance is traditional, but is not sufficient to well-order the strings, which is what you will need if you want to allocate the same colours to the same strings every time - perhaps you could weight the edit costs by alphabetic distance. Also minimal edit distance by itself may not be very useful if what you are after is similarity in speech rather than in written form (if so, consider a stemming/soundex pass first), or some other sense of "similarity".

Then you need to pick a way of traversing the visible colour space based on that metric. It may be helpful to consider using HSL or HSV colour representation - the algorithm could then become as simple as picking a starting hue and walking the sorted corpus, assigning the current hue to each string before offsetting it by the string's difference from the previous one.
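The hue walk could be sketched roughly as below. I am assuming plain lexicographic order as the sort and unweighted Levenshtein distance as the difference; `assign_hues` and the `span` parameter (which stops the walk short of a full circle so the last string does not land back on the first hue) are hypothetical names of my own.

```python
import colorsys


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def assign_hues(strings, start_hue=0.0, span=0.8):
    """Walk the sorted corpus, advancing the hue by each string's distance from the previous one."""
    ordered = sorted(set(strings))
    steps = [levenshtein(p, q) for p, q in zip(ordered, ordered[1:])]
    total = sum(steps) or 1  # avoid dividing by zero for a one-string corpus
    hues, walked = {}, 0
    for s, step in zip(ordered, [0] + steps):
        walked += step
        hues[s] = (start_hue + span * walked / total) % 1.0
    return hues


def hue_to_rgb(hue):
    """Convert a hue in [0, 1) to an 8-bit RGB tuple at full saturation and value."""
    return tuple(round(c * 255) for c in colorsys.hsv_to_rgb(hue, 1.0, 1.0))
```

One caveat with this sketch: a string's colour depends on the rest of the corpus, so it stays stable only as long as the corpus does.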

泪意 2024-07-20 03:37:49

How important is it that you never end up with two dissimilar strings having the same colour?

If it's not that important then maybe this could work?

You could pick a one-dimensional color space that is "homeomorphic" to the circle (that is, it forms a closed loop): Say the color function c(x) is defined for x between 0 and 1. Then you'd want c(0) == c(1).

Now you take the sum of all character values modulo some scaling factor and wrap this back to the color space:

c( (SumOfCharValues(word) modulo ScalingFactor) / ScalingFactor )

This might work even better if you defined a "wrapping" color space of higher dimensions and for each dimension picked a different SumOfCharValues function; someone suggested alternating sum and length.
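Here is one way this could look in Python with two wrapping dimensions. Using hue for the character sum and a triangle wave on brightness for the string length is my own reading of "alternating sum and length"; the scaling factors 256 and 16 are arbitrary.

```python
import colorsys


def wrap(value, scale):
    """Map an integer cyclically onto [0, 1)."""
    return (value % scale) / scale


def triangle(x):
    """Continuous wave on [0, 1] with triangle(0) == triangle(1), so it wraps cleanly."""
    return 1.0 - abs(2.0 * x - 1.0)


def c(x, y):
    """Colour function that wraps in both arguments: c(0, y) == c(1, y) and c(x, 0) == c(x, 1)."""
    value = 0.5 + 0.5 * triangle(y % 1.0)  # keep brightness in [0.5, 1.0]
    return colorsys.hsv_to_rgb(x % 1.0, 1.0, value)


def word_colour(word, hue_scale=256, len_scale=16):
    """First dimension from the character sum, second from the string length."""
    x = wrap(sum(ord(ch) for ch in word), hue_scale)
    y = wrap(len(word), len_scale)
    return c(x, y)  # floats in [0, 1]; multiply by 255 for 8-bit RGB
```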

Just a thought... HTH

冷…雨湿花 2024-07-20 03:37:49

Here is my suggestion (I think there is a general name for this algorithm, but I'm too tired to remember it):

You want to transform each string to a 3D point node(r, g, b) (you can scale the values so that they fit your range) such that the following error is minimized:

Error = \sum_i{\sum_j{(dist(node_i, node_j) - dist(str_i, str_j))^2}}

You can do this:

  1. First assign each string a random color (r, g, b)
  2. Repeat until you see fit (e.g. the error changes by less than \epsilon = 0.0001):
    1. Pick a random node
    2. Adjust its position (r, g, b) such that the error is minimized
  3. Scale the coordinate system such that each node's coordinates are in the range [0., 1.) or [0, 255]
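The error above is the raw stress minimized in multidimensional scaling (MDS). The steps can be sketched in Python as follows, with some assumptions of mine: Levenshtein distance normalized by the longest string as `dist(str_i, str_j)`, a small gradient step on the chosen node instead of an exact minimization, and a fixed iteration count instead of an epsilon test. Requires Python 3.8+ for `math.dist`.

```python
import math
import random


def levenshtein(a, b):
    """Dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def stress(points, target):
    """Sum of squared differences between colour distances and string distances."""
    n = len(points)
    return sum((math.dist(points[i], points[j]) - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def embed(strings, iterations=2000, seed=0):
    """Place each string at a 3D point whose distances approximate the string distances."""
    rng = random.Random(seed)  # fixed seed keeps the colours reproducible
    n = len(strings)
    longest = max(map(len, strings)) or 1
    target = [[levenshtein(a, b) / longest for b in strings] for a in strings]
    points = [[rng.random() for _ in range(3)] for _ in range(n)]  # step 1
    rate = 0.25 / max(1, n - 1)  # conservative step size (an assumption)
    for _ in range(iterations):  # step 2
        i = rng.randrange(n)  # pick a random node
        grad = [0.0, 0.0, 0.0]
        for j in range(n):
            if j == i:
                continue
            d = math.dist(points[i], points[j]) or 1e-9
            coeff = 2.0 * (d - target[i][j]) / d
            for k in range(3):
                grad[k] += coeff * (points[i][k] - points[j][k])
        for k in range(3):  # move the node downhill on the error
            points[i][k] -= rate * grad[k]
    return points, target


def to_rgb(points):
    """Step 3: scale all axes uniformly into 0..255."""
    lo = min(min(p) for p in points)
    hi = max(max(p) for p in points)
    span = (hi - lo) or 1.0
    return [tuple(round(255 * (c - lo) / span) for c in p) for p in points]
```

Calling `to_rgb(embed(strings)[0])` gives one colour per input string. The result depends on the whole corpus and on the seed, so adding strings later changes existing colours.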

冰魂雪魄 2024-07-20 03:37:49

You can use something like MinHash or some other LSH method, and define similarity as the overlap between the strings' sets of shingles, measured by the Jaccard coefficient.
There is a good description in Mining of Massive Datasets, Ch. 3, by Rajaraman and Ullman.
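To make the idea concrete, here is a small Python sketch of character shingles, Jaccard similarity and a MinHash signature. Seeding SHA-256 with an integer to get a family of hash functions, and taking the first three signature values as R, G and B, are illustrative assumptions on my part and not something from the book.

```python
import hashlib


def shingles(text, k=3):
    """Set of all k-character substrings (the whole string if it is shorter than k)."""
    text = text.lower()
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def _hash(seed, shingle):
    """One member of a hash family, selected by an integer seed."""
    digest = hashlib.sha256(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(text, num_hashes=64, k=3):
    """Signature: the minimum hash of the shingle set under each hash function."""
    sh = shingles(text, k)
    return [min(_hash(seed, s) for s in sh) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching positions, which estimates the Jaccard coefficient."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def signature_to_rgb(sig):
    """Crude colour: each of the first three min-hashes becomes one channel."""
    return tuple(v % 256 for v in sig[:3])
```

With this mapping, two strings with Jaccard similarity J agree on any given channel with probability of roughly J, so similar strings tend to share channels rather than land on smoothly nearby colours.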

长途伴 2024-07-20 03:37:49

I would maybe define some delta between two strings. I don't know what you define as the difference (or "inequality") of two strings, but the most obvious things I can think of are string length and the number of occurrences of particular letters (and their index in the string). It should not be tricky to implement it such that it returns the same color code for equal strings (if you do an equality check first, and return before further comparison).

When it comes to the actual RGB value, I would try to convert the string data into 4 bytes (RGBA), or 3 bytes if you only use RGB. I don't know if every string would fit into them (as that may be language specific?).

把时间冻结 2024-07-20 03:37:49

Sorry, but you can't do what you're looking for with Levenshtein distance or similar. RGB and HSV are 3-dimensional geometric spaces, but Levenshtein distance describes a metric space - a much looser set of constraints with no fixed number of dimensions. There's no way to map a metric space into a fixed number of dimensions while always preserving locality.

As far as approximations go, though, for single terms you could use a modification of an algorithm like soundex or metaphone to pick a color; for multiple terms, you could, for example, apply soundex or metaphone to each word individually, then sum them up (with overflow).
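A rough Python sketch of the per-word Soundex sum follows. The Soundex here is a simplified American Soundex that handles ASCII letters only; packing each code into a number, scaling it to 24 bits and splitting the sum into three bytes are my own assumptions about how "sum them up (with overflow)" could be turned into a colour.

```python
_CODES = {}
for _digit, _letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for _letter in _letters:
        _CODES[_letter] = str(_digit)

# Spread the 26 * 7**3 possible codes over the 24-bit colour range.
_SCALE = (1 << 24) // (26 * 343)


def soundex(word):
    """Simplified American Soundex: first letter plus three digits, '' if no ASCII letters."""
    letters = [ch for ch in word.lower() if ch.isascii() and ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    last = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != last:
            result += code
        if ch not in "hw":  # h and w do not separate letters with the same code
            last = code
    return (result + "000")[:4]


def soundex_value(code):
    """Pack a Soundex code into an integer (digits 0..6 read as base 7)."""
    return ((ord(code[0]) - ord("A")) * 343 + int(code[1:], 7)) * _SCALE


def phrase_to_rgb(phrase):
    """Sum the per-word values with 24-bit overflow and split into R, G, B bytes."""
    total = 0
    for word in phrase.split():
        code = soundex(word)
        if code:
            total = (total + soundex_value(code)) % (1 << 24)
    return (total >> 16) & 0xFF, (total >> 8) & 0xFF, total & 0xFF
```

Words that sound alike, such as "Robert" and "Rupert", map to the same colour, and word order does not matter because the values are summed.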
