特定有限整数集的有效映射
我正在寻找以下整数列表和 0-127 范围的子集之间的小型、快速(双向)双射映射:
0x200C, 0x200D, 0x200E, 0x200F,
0x2013, 0x2014, 0x2015, 0x2017,
0x2018, 0x2019, 0x201A, 0x201C,
0x201D, 0x201E, 0x2020, 0x2021,
0x2022, 0x2026, 0x2030, 0x2039,
0x203A, 0x20AA, 0x20AB, 0x20AC,
0x20AF, 0x2116, 0x2122
一个明显的解决方案是:
y = x>>2 & 0x40 | x & 0x3f;
x = 0x2000 | y<<2 & 0x100 | y & 0x3f;
编辑:< /strong> 我缺少一些值,特别是 0x20Ax,它不适用于上述值。
另一个明显的解决方案是查找表,但在不使其变得不必要的大的情况下,查找表无论如何都需要一些位重新排列,我怀疑通过简单的位重新排列可以更好地完成整个任务。
出于好奇,这些幻数是旧版 ISO-8859 和 Windows 代码页中唯一出现的“大”Unicode 代码点。
I'm looking for a small, fast (in both directions) bijective mapping between the following list of integers and a subset of the range 0-127:
0x200C, 0x200D, 0x200E, 0x200F,
0x2013, 0x2014, 0x2015, 0x2017,
0x2018, 0x2019, 0x201A, 0x201C,
0x201D, 0x201E, 0x2020, 0x2021,
0x2022, 0x2026, 0x2030, 0x2039,
0x203A, 0x20AA, 0x20AB, 0x20AC,
0x20AF, 0x2116, 0x2122
One obvious solution is:
y = x>>2 & 0x40 | x & 0x3f;
x = 0x2000 | y<<2 & 0x100 | y & 0x3f;
Edit: I was missing some of the values, particularly 0x20Ax, which don't work with the above.
Another obvious solution is a lookup table, but without making it unnecessarily large, a lookup table would require some bit rearrangement anyway and I suspect the whole task can be better accomplished with simple bit rearrangement.
For the curious, those magic numbers are the only "large" Unicode codepoints that appear in legacy ISO-8859 and Windows codepages.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
此方法在有限域中使用乘法:
map()
将 unicode 点转换为唯一的 7 位数字,unmap()
则执行相反的操作。请注意,gcc 至少能够将其编译为不使用任何除法运算的 x86 代码,因为模数是一个常数。This method uses multiplication in a finite field:
map()
converts the unicode points to the unique 7 bit numbers, andunmap()
does the reverse. Note thatgcc
at least is able to compile this to x86 code which does not use any division operations, since the modulus is a constant.我知道它很难看,但除了最后一个值之外,如果考虑最低 6 位,所有其他值都已经是唯一的,因此您可以构建和逆映射:
在逆映射计算之后,两个变换函数是
函数
direct(x)
将返回 0 到 63 之间的数字,而给定 0 到 63 之间的数字的函数inverse(x)
将返回一个整数。对于列表中的所有 27 个值inverse(direct(x)) == x
。I know it's ugly, but except for last value all others are already unique if you consider lowest 6 bits, so you can just build and inverse map:
After this inverse map computation the two transform functions are
The function
direct(x)
will return a number between 0 and 63, and the functioninverse(x)
given a number between 0 and 63 will return an integer. For all the 27 values in your listinverse(direct(x)) == x
.我会选择一些简单(且便宜)的哈希函数
f
,您可以从映射到值f0、f1、...
的此类函数系列中选择它比如说,代码>0..255。如果您的哈希函数是随机的,那么根据生日悖论,您感兴趣的值会发生一些冲突,但不会很多。现在,一个简单的 perl(无论什么)脚本将允许您预处理固定值数据,以通过从集合中选择适当的函数来减少(甚至消除)冲突。
这种方法的优点是,如果您发现忘记了某个值(就像您已经做过的那样),或者某个奇怪的国家决定将 € 等奇怪的 unicode 字符映射到 8 位字符集中,您可以重新运行预处理。
而且,顺便说一句,我认为某些 iso-8859- 中的特殊字符数量是多少?集合一定比你这里拥有的大得多,不是吗?我会把它们全部带走。
编辑:在做了一些实验后,一个小perl脚本告诉我,当以10007或10009为模进行缩减时,出现在iso-8859编码之一中的所有577个unicode代码点都会映射到不同的位置。
编辑:对于有限的集合,下表可以解决问题:
I'd go for some simple (and cheap) hash function
f
that you choose out of a familyf0, f1, ...
of such functions that map to values0..255
, say. If your hash function would be random, by the birthday paradox you'd have some collisions for the values that you are interested in, but not many.Now a simple perl (of whatever) script will allow you to preprocess your fixed valued data to reduce (or even eliminate) collisions by choosing an appropriate function from your set.
This approach has the advantage that you can renew the preprocessing run if you find out that you forgot a value (as you already did) or some weird country decides to map bizarre unicode characters like € into an 8bit character set.
And, BTW, I think the amount of special characters that are in some of the iso-8859-? sets must be much larger than what you have, here, no? I'd take them all.
Edit: After doing some experiments a little perl script tells me that all 577 unicode code points that appear in one of the iso-8859 encodings map to different positions when reduced modulo 10007 or 10009.
Edit: The following table does the trick, for the limited set:
通过试用&错误,我得出了以下算法:
有时,即使在编写代码时,进化也会击败智能设计;)
By trial & error, I arrived at the following algorithm:
Sometimes, evolution beats intelligent design even when writing code ;)