特定有限整数集的有效映射

发布于 2024-10-15 20:22:39 字数 670 浏览 2 评论 0原文

我正在寻找以下整数列表和 0-127 范围的子集之间的小型、快速(双向)双射映射:

0x200C, 0x200D, 0x200E, 0x200F,
0x2013, 0x2014, 0x2015, 0x2017,
0x2018, 0x2019, 0x201A, 0x201C,
0x201D, 0x201E, 0x2020, 0x2021,
0x2022, 0x2026, 0x2030, 0x2039,
0x203A, 0x20AA, 0x20AB, 0x20AC,
0x20AF, 0x2116, 0x2122

一个明显的解决方案是:

y = x>>2 & 0x40 | x & 0x3f;
x = 0x2000 | y<<2 & 0x100 | y & 0x3f;

编辑:< /strong> 我缺少一些值,特别是 0x20Ax,它不适用于上述值。

另一个明显的解决方案是查找表,但在不使其变得不必要的大的情况下,查找表无论如何都需要一些位重新排列,我怀疑通过简单的位重新排列可以更好地完成整个任务。

出于好奇,这些幻数是旧版 ISO-8859 和 Windows 代码页中唯一出现的“大”Unicode 代码点。

I'm looking for a small, fast (in both directions) bijective mapping between the following list of integers and a subset of the range 0-127:

0x200C, 0x200D, 0x200E, 0x200F,
0x2013, 0x2014, 0x2015, 0x2017,
0x2018, 0x2019, 0x201A, 0x201C,
0x201D, 0x201E, 0x2020, 0x2021,
0x2022, 0x2026, 0x2030, 0x2039,
0x203A, 0x20AA, 0x20AB, 0x20AC,
0x20AF, 0x2116, 0x2122

One obvious solution is:

y = x>>2 & 0x40 | x & 0x3f;
x = 0x2000 | y<<2 & 0x100 | y & 0x3f;

Edit: I was missing some of the values, particularly 0x20Ax, which don't work with the above.

Another obvious solution is a lookup table, but without making it unnecessarily large, a lookup table would require some bit rearrangement anyway and I suspect the whole task can be better accomplished with simple bit rearrangement.

For the curious, those magic numbers are the only "large" Unicode codepoints that appear in legacy ISO-8859 and Windows codepages.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

狼亦尘 2024-10-22 20:22:39

此方法在有限域中使用乘法:

#define PRIME 0x119
#define OFFSET1 0x00f
#define OFFSET2 0x200c
#define OFFSET3 (OFFSET2 - OFFSET1)
#define MULTIPLIER 2
#define INVERSE 0x8d

unsigned map(unsigned n)
{
    return ((n - OFFSET3) * MULTIPLIER) % PRIME;
}

unsigned unmap(unsigned m)
{
    return ((m * INVERSE) + PRIME - OFFSET1) % PRIME + OFFSET2;
}

map() 将 unicode 点转换为唯一的 7 位数字,unmap() 则执行相反的操作。请注意,gcc 至少能够将其编译为不使用任何除法运算的 x86 代码,因为模数是一个常数。

This method uses multiplication in a finite field:

#define PRIME 0x119
#define OFFSET1 0x00f
#define OFFSET2 0x200c
#define OFFSET3 (OFFSET2 - OFFSET1)
#define MULTIPLIER 2
#define INVERSE 0x8d

unsigned map(unsigned n)
{
    return ((n - OFFSET3) * MULTIPLIER) % PRIME;
}

unsigned unmap(unsigned m)
{
    return ((m * INVERSE) + PRIME - OFFSET1) % PRIME + OFFSET2;
}

map() converts the unicode points to the unique 7 bit numbers, and unmap() does the reverse. Note that gcc at least is able to compile this to x86 code which does not use any division operations, since the modulus is a constant.

傲世九天 2024-10-22 20:22:39

我知道它很难看,但除了最后一个值之外,如果考虑最低 6 位,所有其他值都已经是唯一的,因此您可以构建和逆映射:

int ints[] = {0x200C, 0x200D, 0x200E, 0x200F,
              0x2013, 0x2014, 0x2015, 0x2017,
              0x2018, 0x2019, 0x201A, 0x201C,
              0x201D, 0x201E, 0x2020, 0x2021,
              0x2022, 0x2026, 0x2030, 0x2039,
              0x203A, 0x20AA, 0x20AB, 0x20AC,
              0x20AF, 0x2116, 0x2122};

int invmap[64];

void mkinvmap()
{
    for (int i=0; i<26; i++)
        invmap[ints[i]&63] = ints[i];
    invmap[0] = 0x2122;
}

在逆映射计算之后,两个变换函数是

int direct(int x)  { return x==0x2122 ? 0 : (x & 63); }
int inverse(int x) { return invmap[x]; }

函数 direct(x) 将返回 0 到 63 之间的数字,而给定 0 到 63 之间的数字的函数 inverse(x) 将返回一个整数。对于列表中的所有 27 个值 inverse(direct(x)) == x

I know it's ugly, but except for last value all others are already unique if you consider lowest 6 bits, so you can just build and inverse map:

int ints[] = {0x200C, 0x200D, 0x200E, 0x200F,
              0x2013, 0x2014, 0x2015, 0x2017,
              0x2018, 0x2019, 0x201A, 0x201C,
              0x201D, 0x201E, 0x2020, 0x2021,
              0x2022, 0x2026, 0x2030, 0x2039,
              0x203A, 0x20AA, 0x20AB, 0x20AC,
              0x20AF, 0x2116, 0x2122};

int invmap[64];

void mkinvmap()
{
    for (int i=0; i<26; i++)
        invmap[ints[i]&63] = ints[i];
    invmap[0] = 0x2122;
}

After this inverse map computation the two transform functions are

int direct(int x)  { return x==0x2122 ? 0 : (x & 63); }
int inverse(int x) { return invmap[x]; }

The function direct(x) will return a number between 0 and 63, and the function inverse(x) given a number between 0 and 63 will return an integer. For all the 27 values in your list inverse(direct(x)) == x.

宫墨修音 2024-10-22 20:22:39

我会选择一些简单(且便宜)的哈希函数 f,您可以从映射到值 f0、f1、... 的此类函数系列中选择它比如说,代码>0..255。如果您的哈希函数是随机的,那么根据生日悖论,您感兴趣的值会发生一些冲突,但不会很多。

现在,一个简单的 perl(无论什么)脚本将允许您预处理固定值数据,以通过从集合中选择适当的函数来减少(甚至消除)冲突。

这种方法的优点是,如果您发现忘记了某个值(就像您已经做过的那样),或者某个奇怪的国家决定将 € 等奇怪的 unicode 字符映射到 8 位字符集中,您可以重新运行预处理。

而且,顺便说一句,我认为某些 iso-8859- 中的特殊字符数量是多少?集合一定比你这里拥有的大得多,不是吗?我会把它们全部带走。

编辑:在做了一些实验后,一个小perl脚本告诉我,当以10007或10009为模进行缩减时,出现在iso-8859编码之一中的所有577个unicode代码点都会映射到不同的位置。

编辑:对于有限的集合,下表可以解决问题:

wchar_t const uniqTable[91] = {
[0x7] = L'\u2116' /* № */,
[0xD] = L'\uFFFD' /* � */,
[0xE] = L'\u200C' /* ‌ */,
[0xF] = L'\u200D' /* ‍ */,
[0x10] = L'\u200E' /* ‎ */,
[0x11] = L'\u200F' /* ‏ */,
[0x13] = L'\u2122' /* ™ */,
[0x15] = L'\u2013' /* – */,
[0x16] = L'\u2014' /* — */,
[0x17] = L'\u2015' /* ― */,
[0x19] = L'\u2017' /* ‗ */,
[0x1A] = L'\u2018' /* ‘ */,
[0x1B] = L'\u2019' /* ’ */,
[0x1C] = L'\u201A' /* ‚ */,
[0x1E] = L'\u201C' /* “ */,
[0x1F] = L'\u201D' /* ” */,
[0x20] = L'\u201E' /* „ */,
[0x22] = L'\u2020' /* † */,
[0x23] = L'\u2021' /* ‡ */,
[0x24] = L'\u2022' /* • */,
[0x28] = L'\u2026' /* … */,
[0x32] = L'\u2030' /* ‰ */,
[0x3B] = L'\u2039' /* ‹ */,
[0x3C] = L'\u203A' /* › */,
[0x51] = L'\u20AA' /* ₪ */,
[0x52] = L'\u20AB' /* ₫ */,
[0x53] = L'\u20AC' /* € */,
[0x56] = L'\u20AF' /* ₯ */,
};

I'd go for some simple (and cheap) hash function f that you choose out of a family f0, f1, ... of such functions that map to values 0..255, say. If your hash function would be random, by the birthday paradox you'd have some collisions for the values that you are interested in, but not many.

Now a simple perl (of whatever) script will allow you to preprocess your fixed valued data to reduce (or even eliminate) collisions by choosing an appropriate function from your set.

This approach has the advantage that you can renew the preprocessing run if you find out that you forgot a value (as you already did) or some weird country decides to map bizarre unicode characters like € into an 8bit character set.

And, BTW, I think the amount of special characters that are in some of the iso-8859-? sets must be much larger than what you have, here, no? I'd take them all.

Edit: After doing some experiments a little perl script tells me that all 577 unicode code points that appear in one of the iso-8859 encodings map to different positions when reduced modulo 10007 or 10009.

Edit: The following table does the trick, for the limited set:

wchar_t const uniqTable[91] = {
[0x7] = L'\u2116' /* № */,
[0xD] = L'\uFFFD' /* � */,
[0xE] = L'\u200C' /* ‌ */,
[0xF] = L'\u200D' /* ‍ */,
[0x10] = L'\u200E' /* ‎ */,
[0x11] = L'\u200F' /* ‏ */,
[0x13] = L'\u2122' /* ™ */,
[0x15] = L'\u2013' /* – */,
[0x16] = L'\u2014' /* — */,
[0x17] = L'\u2015' /* ― */,
[0x19] = L'\u2017' /* ‗ */,
[0x1A] = L'\u2018' /* ‘ */,
[0x1B] = L'\u2019' /* ’ */,
[0x1C] = L'\u201A' /* ‚ */,
[0x1E] = L'\u201C' /* “ */,
[0x1F] = L'\u201D' /* ” */,
[0x20] = L'\u201E' /* „ */,
[0x22] = L'\u2020' /* † */,
[0x23] = L'\u2021' /* ‡ */,
[0x24] = L'\u2022' /* • */,
[0x28] = L'\u2026' /* … */,
[0x32] = L'\u2030' /* ‰ */,
[0x3B] = L'\u2039' /* ‹ */,
[0x3C] = L'\u203A' /* › */,
[0x51] = L'\u20AA' /* ₪ */,
[0x52] = L'\u20AB' /* ₫ */,
[0x53] = L'\u20AC' /* € */,
[0x56] = L'\u20AF' /* ₯ */,
};
幽蝶幻影 2024-10-22 20:22:39

通过试用&错误,我得出了以下算法:

#include <assert.h>
#include <stdio.h>

static const unsigned CODES[] = {
    0x200C, 0x200D, 0x200E, 0x200F,
    0x2013, 0x2014, 0x2015, 0x2017,
    0x2018, 0x2019, 0x201A, 0x201C,
    0x201D, 0x201E, 0x2020, 0x2021,
    0x2022, 0x2026, 0x2030, 0x2039,
    0x203A, 0x20AA, 0x20AB, 0x20AC,
    0x20AF, 0x2116, 0x2122
};

static unsigned enc(unsigned value)
{
    return (value & 0x3F) + (value & 0x180) / 4;
}

static unsigned dec(unsigned value)
{
    return 0x2000 + value + ((value & 0x40) >> 6) * 3 *
        (0x20 + (value & 0x10) * 2 + (value & 0x20));
}

int main(void)
{
    const unsigned *const END = CODES + sizeof CODES / sizeof *CODES;
    const unsigned *current = CODES;
    for(; current < END; ++current)
    {
        printf("%04x -> %02x -> %04x\n",
            *current, enc(*current), dec(enc(*current)));

        assert(enc(*current) < 0x80);
        assert(dec(enc(*current)) == *current);
    }

    return 0;
}

有时,即使在编写代码时,进化也会击败智能设计;)

By trial & error, I arrived at the following algorithm:

#include <assert.h>
#include <stdio.h>

static const unsigned CODES[] = {
    0x200C, 0x200D, 0x200E, 0x200F,
    0x2013, 0x2014, 0x2015, 0x2017,
    0x2018, 0x2019, 0x201A, 0x201C,
    0x201D, 0x201E, 0x2020, 0x2021,
    0x2022, 0x2026, 0x2030, 0x2039,
    0x203A, 0x20AA, 0x20AB, 0x20AC,
    0x20AF, 0x2116, 0x2122
};

static unsigned enc(unsigned value)
{
    return (value & 0x3F) + (value & 0x180) / 4;
}

static unsigned dec(unsigned value)
{
    return 0x2000 + value + ((value & 0x40) >> 6) * 3 *
        (0x20 + (value & 0x10) * 2 + (value & 0x20));
}

int main(void)
{
    const unsigned *const END = CODES + sizeof CODES / sizeof *CODES;
    const unsigned *current = CODES;
    for(; current < END; ++current)
    {
        printf("%04x -> %02x -> %04x\n",
            *current, enc(*current), dec(enc(*current)));

        assert(enc(*current) < 0x80);
        assert(dec(enc(*current)) == *current);
    }

    return 0;
}

Sometimes, evolution beats intelligent design even when writing code ;)

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文