Is there any way to make this hash lookup faster?

Posted on 2024-09-13 00:12:17

I have a requirement to (very) quickly process strings of a limited range, tallying their values. The input file is of the form:

January    7
March     22
September 87
March     36

and so forth. Because the line widths are identical, I can simply read in a line with fread reasonably fast, and I've developed a perfect hashing function which works, but I wanted to see if anyone could offer any advice on how to make it even faster. I'll profile each suggestion to see how it goes.

The hashing function is based on the month name to allow fast allocation of the value to a bucket. Bear with me here. I first figured out the minimal number of characters for a perfect hash:

January
February
March
April
May
June
July
August
September
October
November
December

Keep in mind that the months are all nine characters due to the fact I have the entire input line.

Unfortunately, there is no single column to mark a month unique. Column 1 duplicates J, column 2 duplicates a, column 3 duplicates r, column 4 duplicates u and columns 5 onwards duplicate <space> (there are other duplicates but one is enough to prevent a single-column hash key).

However, by using the first and fourth column, I get the values Ju, Fr, Mc, Ai, M<space>, Je, Jy, Au, St, Oo, Ne and De, which are unique. There will be no invalid values in this file so I don't have to worry about incorrect buckets for the input data.
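The column analysis above can be checked mechanically. The following is a minimal sketch, not part of the original post; the names `MONTHS` and `columns_are_unique` are made up for illustration, and it assumes the nine-character, space-padded month names described above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Month names padded to nine characters, as they appear in the input. */
static const char *const MONTHS[12] = {
    "January  ", "February ", "March    ", "April    ",
    "May      ", "June     ", "July     ", "August   ",
    "September", "October  ", "November ", "December "
};

/* True if the characters at columns a and b (0-based) together
   identify every month uniquely. Pass a == b to test one column. */
static bool columns_are_unique(size_t a, size_t b) {
    for (size_t i = 0; i < 12; i++)
        for (size_t j = i + 1; j < 12; j++)
            if (MONTHS[i][a] == MONTHS[j][a] &&
                MONTHS[i][b] == MONTHS[j][b])
                return false;   /* months i and j share the same key */
    return true;
}
```

Calling `columns_are_unique(0, 3)` corresponds to the first-and-fourth-column choice made here, while passing the same column twice tests a single-column key.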

By viewing the hex codes for the characters, I found I could get low unique values by just ANDing with strategic values:

FirstChar  Hex  Binary     &0x0f
---------  ---  ---------  -----
   A       x41  0100 0001      1
   D       x44  0100 0100      4
   F       x46  0100 0110      6
   J       x4a  0100 1010     10
   M       x4d  0100 1101     13
   N       x4e  0100 1110     14
   O       x4f  0100 1111     15
   S       x53  0101 0011      3

FourthChar  Hex  Binary     &0x1f
----------  ---  ---------  -----
 <space>    x20  0010 0000      0
    c       x63  0110 0011      3
    e       x65  0110 0101      5
    i       x69  0110 1001      9
    o       x6f  0110 1111     15
    r       x72  0111 0010     18
    t       x74  0111 0100     20
    u       x75  0111 0101     21
    y       x79  0111 1001     25

and this allowed me to set up a static array to create a (hopefully) blindingly-fast hash function:

#define __ -1
static unsigned int hash (const char *str) {
    static unsigned char bucket[] = {
        //   A       S   D       F               J           M   N   O
        __, __, __, __, __, __, __, __, __, __, __, __, __,  4, __, __, // space
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __,  2, __, __, // c
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, 11, __, __, __, __, __,  5, __, __, __, 10, __, // e
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __,  3, __, __, __, __, __, __, __, __, __, __, __, __, __, __, // i
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __,  9, // o
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __,  1, __, __, __, __, __, __, __, __, __, // r
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __,  8, __, __, __, __, __, __, __, __, __, __, __, __, // t
        __,  7, __, __, __, __, __, __, __, __,  0, __, __, __, __, __, // u
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __,  6, __, __, __, __, __  // y
    };
    return bucket[((unsigned int)(str[3]&0x1f)<<4)|(str[0]&0xf)];
}

Testing that with the code:

#include <stdio.h>

// Hash function here.

static const char *months[] = {
    "January  ", "February ", "March    ", "April    ", "May      ", "June     ",
    "July     ", "August   ", "September", "October  ", "November ", "December "
};

int main (void) {
    size_t i;   // size_t matches the type of the sizeof expression
    for (i = 0; i < sizeof(months)/sizeof(*months); i++)
        printf ("%-10s -> %2u\n", months[i], hash(months[i]));
    return 0;
}

shows that it's functionally correct:

January    ->  0
February   ->  1
March      ->  2
April      ->  3
May        ->  4
June       ->  5
July       ->  6
August     ->  7
September  ->  8
October    ->  9
November   -> 10
December   -> 11

but I want to know if it can be made faster.

Any suggestions out there? I'm open to any simple optimisations or even a total rewrite if there's something inherently bad with my hashing function.


I don't think this is that important but the final version will be using EBCDIC. The theory will still stand but the AND operation may change slightly since the characters have different code points. I'll be happy with any assistance only on the ASCII front since I'm confident whatever advice is offered will translate okay to EBCDIC.

Comments (8)

落日海湾 2024-09-20 00:12:17

I agree with the others that there is not much room for improvement. All I can suggest is a smaller lookup table that needs the same number of operations, which might help it stay in the CPU cache longer. Additionally, it does not rely on the space padding at the end, and it works with any mixture of uppercase and lowercase characters. I have found that adding some reasonable robustness against likely changes in the requirements often pays off later, especially once the implementation has been optimized to the point where small changes are no longer easy.

#define __ -1
static unsigned int hash (const char *str)
{
    static unsigned char tab[] = {
        __, __,  1, 11, __, __, __, __,  7, __, __, __, __,  6,  0,  5,
         8, __,  2,  3,  9, __, 10, __, __,  4, __, __, __, __, __, __
    };
    return tab[ ( ( str[ 1 ] >> 4 ) & 1 ) + ( str[ 2 ] & 0x1f ) ];
}

This works similarly to your original idea but leaves fewer unused table entries:

Month  s[1]          s[2]          s[1].4  s[2].4-0  sum  lookup
-----  ------------  ------------  ------  --------  ---  ------
Jan    61:0110 0001  6e:0110 1110       0        14   14       0
Feb    65:0110 0101  62:0110 0010       0         2    2       1
Mar    61:0110 0001  72:0111 0010       0        18   18       2
Apr    70:0111 0000  72:0111 0010       1        18   19       3
May    61:0110 0001  79:0111 1001       0        25   25       4
Jun    75:0111 0101  6e:0110 1110       1        14   15       5
Jul    75:0111 0101  6c:0110 1100       1        12   13       6
Aug    75:0111 0101  67:0110 0111       1         7    8       7
Sep    65:0110 0101  70:0111 0000       0        16   16       8
Oct    63:0110 0011  74:0111 0100       0        20   20       9
Nov    6f:0110 1111  76:0111 0110       0        22   22      10
Dec    65:0110 0101  63:0110 0011       0         3    3      11
             ^             ^ ^^^^
bits:        4             4 3210

拧巴小姐 2024-09-20 00:12:17

Here's the smallest sequence I could find for EBCDIC-US:

It has 24 elements in the bucket and uses only 2 operations to compute the index:

static unsigned int hash (const char *str)
{
 static unsigned char tab[] = {
    11, 4,__, 7,__,__, 9, 1,
    __,__,__,__,__,__,__,__,
     3, 5, 2,10, 8,__, 0, 6
 };
 return tab[0x17 & (str[ 1 ] + str[ 2 ])];
}

Second best, 25 items with xor:

static unsigned int hash(const char *str)
{
 static unsigned char tab[] = {
  9,__,__, 7,__,__,11, 1,
 __, 4,__,__,__,__, 3,__,
 __, 5, 8,10, 0,__,__, 6, 2
 };
 return tab[0x1f & (str[ 1 ] ^ str[ 2 ])];
}

(Actually, tab[] should be 32 entries long here, because masking with 0x1f can produce an index past the end of the table for incorrect inputs.)
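As a rough illustration of that remark, here is a sketch of the XOR variant with the table padded to 32 entries, so that any masked index stays in bounds. The names `NA` and `hash_xor_padded` are hypothetical, and taking `const unsigned char *` is an assumption made here to avoid sign-extension surprises; the table values are the ones from the answer.

```c
#define NA 0xFF   /* marks an unused slot; stands in for the answer's __ */

/* XOR variant of the EBCDIC hash with the table padded to 32 entries,
   so that every index produced by the 0x1f mask is in bounds. */
static unsigned int hash_xor_padded(const unsigned char *str) {
    static const unsigned char tab[32] = {
         9, NA, NA,  7, NA, NA, 11,  1,
        NA,  4, NA, NA, NA, NA,  3, NA,
        NA,  5,  8, 10,  0, NA, NA,  6,
         2, NA, NA, NA, NA, NA, NA, NA
    };
    return tab[0x1f & (str[1] ^ str[2])];
}
```

The seven extra bytes mean an unexpected input yields `NA` rather than a read past the end of the array.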


Update from Pax: For what it's worth, the first option worked perfectly for EBCDIC code page 500:

## Month     str[1] str[2] Lookup
-- --------- ------ ------ ------
 0 January   a (81) n (95)      0
 1 February  e (85) b (82)      1
 2 March     a (81) r (99)      2
 3 April     p (97) r (99)      3
 4 May       a (81) y (a8)      4
 5 June      u (a4) n (95)      5
 6 July      u (a4) l (93)      6
 7 August    u (a4) g (87)      7
 8 September e (85) p (97)      8
 9 October   c (83) t (a3)      9
10 November  o (96) v (a5)     10
11 December  e (85) c (83)     11

北方的巷 2024-09-20 00:12:17

This is tested for EBCDIC (CCSID 500); the table is 32 bytes (smaller than yours, same size as x4u's):

#define __ -1
static unsigned int hash(const char *str)
{
    static unsigned char bucket[] = {
        __, __, __, __, __, __,  1,  8,
        __,  7, __, __, __,  3, __, __,
        11,  6, __, __,  4, __,  2, __,
        __,  0, __,  5,  9, __, __, 10,
    };
    return bucket[(unsigned int)(str[0]|str[3]<<1)&0x1f];
}

小糖芽 2024-09-20 00:12:17

I would start with a detailed profile of your larger process to make sure you aren't engaging in premature optimization.

This looks pretty fast on the face of it, but if memory is really cheap it might be better to just use an even sparser array and let your cache do some of the work. For instance (and thinking off the cuff here), what if you simply add the short found in the first two bytes to the short in the next two? That includes both the first and fourth characters, so at a guess it should produce your 12 distinct values, and it doesn't involve bit-field extractions, which may not optimize well. Then make the matching bucket[] array have 64K entries, only 12 of which are ever hit. If it works out right, those 12 entries end up occupying some of your D-cache, and you've traded a handful of arithmetic operations for an index into a larger, cached array.

But do profile both before and after any mucking about trying to make the arithmetic faster, and don't bother optimizing where it won't actually save time. (I know Pax knows this, but it's the obligatory warning attached to any optimization discussion.)
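A minimal sketch of the add-two-shorts idea, assuming ASCII input and the nine-character padded month names from the question. The names `sum_key`, `sum_table_init` and `hash_sum` are invented for illustration. The table is filled at start-up instead of being written out by hand, and `memcpy` is used for the 16-bit loads so that alignment does not matter.

```c
#include <stdint.h>
#include <string.h>

static unsigned char sum_table[65536];   /* 64K entries, only 12 used */

/* Add the 16-bit word in bytes 0-1 to the one in bytes 2-3.
   memcpy avoids alignment and aliasing problems. */
static uint16_t sum_key(const char *str) {
    uint16_t a, b;
    memcpy(&a, str, sizeof a);
    memcpy(&b, str + 2, sizeof b);
    return (uint16_t)(a + b);
}

/* Fill the table once. Returns 1 on success, 0 if two months collide. */
static int sum_table_init(void) {
    static const char *const names[12] = {
        "January  ", "February ", "March    ", "April    ",
        "May      ", "June     ", "July     ", "August   ",
        "September", "October  ", "November ", "December "
    };
    memset(sum_table, 0xFF, sizeof sum_table);   /* 0xFF = unused */
    for (unsigned i = 0; i < 12; i++) {
        uint16_t k = sum_key(names[i]);
        if (sum_table[k] != 0xFF)
            return 0;
        sum_table[k] = (unsigned char)i;
    }
    return 1;
}

/* Month index 0..11 for a valid month; call sum_table_init() first. */
static unsigned int hash_sum(const char *str) {
    return sum_table[sum_key(str)];
}
```

Whether the 64K table actually beats the smaller ones depends on cache behaviour on the target machine, so it is something to measure rather than assume.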

夏の忆 2024-09-20 00:12:17

OK, like everyone on SO, I'm in it for the rep... ;*) As I wrote in the comments above, the lower end of your target architectures has a cache line size of 256 bytes, so you might end up with some cache thrashing in your table lookups (your table is more than 256 bytes). An attempt to fold the table using some cheap bit trick might actually gain some performance.

I've been playing around with your data. You also have the option of column 2 and 3. Haven't figured out a way to get that under 8 bits yet though.

... and as always, profile, make sure it's the best point to apply effort, and profile again afterward, make sure it's faster.

... and you are reading more than one line at a time, right? Fixed record sizes are good for that: you don't have to search for separators (newlines), and you can read a big chunk of records at a time.

You can reduce the array size by using:

#define __ -1
static unsigned int hash (const char *str) {
    static unsigned char alloc_to[] = {
        //   A       S   D       F               J           M   N   O
        __, __, __, __, __, __, __, __, __, __, __, __, __,  4, __, __, // space
        __, __, __, __, __, __, __, __, __, __, __, __, __,  2, __, __, // c
        __, __, __, __, 11, __, __, __, __, __,  5, __, __, __, 10, __, // e
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __,  3, __, __, __, __, __, __, __, __, __, __, __, __, __, __, // i
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __,  9, // o
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __,  1, __, __, __, __, __, __, __, __, __, // r
        __,  7, __,  8, __, __, __, __, __, __,  0, __, __, __, __, __, // t/u
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __,  6, __, __, __, __, __  // y
    };
    return alloc_to[((unsigned int)(str[3]&0x1e)<<3)|(str[0]&0xf)];
}

which changes it from 16-by-26 to 16-by-13.

EDIT

If, as suggested by other posts, your strings ARE aligned, so that you may use them as shorts, you may add the first and second short, xor the two bytes together and you'll have a unique 8-bit key (well, seven, actually). Might be worth your while too. This is ASCII though, so might not work in EBCDIC. In ASCII, the keys turn out to be:

6e Jan
7f Feb
7b Mar
6a Apr
47 May
62 Jun
58 Jul
42 Aug
1a Sep
11 Oct
10 Nov
6d Dec
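The key computation described in the edit can be sketched as follows, assuming ASCII input. The name `fold_key` is made up here, and `memcpy` replaces a direct cast to `short` so that the sketch does not depend on alignment.

```c
#include <stdint.h>
#include <string.h>

/* Add the 16-bit word in bytes 0-1 to the one in bytes 2-3, then XOR
   the two bytes of the sum together to get a key below 0x80. */
static unsigned int fold_key(const char *str) {
    uint16_t a, b;
    memcpy(&a, str, sizeof a);
    memcpy(&b, str + 2, sizeof b);
    uint16_t sum = (uint16_t)(a + b);
    return (unsigned int)((sum >> 8) ^ (sum & 0xFF));
}
```

For the twelve month names neither byte-wise addition carries, so the key comes out the same on little-endian and big-endian machines and could index a 128-entry table.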

述情 2024-09-20 00:12:17

Looks good enough to me. The question is whether the hash function itself is enough of a bottleneck to justify the ongoing effort of eliminating one or two more simple binary operations from it. Given that file access seems to be involved, I doubt it, though of course I don't know any details about the overall processing.

EDIT:

Maybe you could see if you find any pair of characters that results in unique lower bits (4, 5 or 6) when added:

(str[1] + str[2]) & 0x1f

If addition won't do, maybe one of the other operations (&, | or ^) will. If that doesn't help, maybe try using three characters.
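A search along these lines is easy to automate. The sketch below is only one way to do it: `MONTH_NAMES`, `combine` and `pair_is_perfect` are names invented here, and it assumes ASCII and the space-padded month names from the question.

```c
#include <stdbool.h>
#include <stddef.h>

static const char *const MONTH_NAMES[12] = {
    "January  ", "February ", "March    ", "April    ",
    "May      ", "June     ", "July     ", "August   ",
    "September", "October  ", "November ", "December "
};

/* Combine two characters with the operator named by op:
   '+', '&' or '|'; anything else means XOR. */
static unsigned combine(unsigned char x, unsigned char y, char op) {
    switch (op) {
    case '+': return (unsigned)x + y;
    case '&': return (unsigned)x & y;
    case '|': return (unsigned)x | y;
    default:  return (unsigned)x ^ y;
    }
}

/* True if (str[a] op str[b]) & mask is different for all twelve months. */
static bool pair_is_perfect(size_t a, size_t b, char op, unsigned mask) {
    for (size_t i = 0; i < 12; i++) {
        unsigned ki = combine((unsigned char)MONTH_NAMES[i][a],
                              (unsigned char)MONTH_NAMES[i][b], op) & mask;
        for (size_t j = i + 1; j < 12; j++) {
            unsigned kj = combine((unsigned char)MONTH_NAMES[j][a],
                                  (unsigned char)MONTH_NAMES[j][b], op) & mask;
            if (ki == kj)
                return false;   /* two months map to the same key */
        }
    }
    return true;
}
```

Looping `pair_is_perfect` over all column pairs, the four operators and a few masks lists every candidate; in ASCII, `pair_is_perfect(1, 2, '+', 0x1f)` holds, while the same pair with `'^'` does not.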

冬天的雪花 2024-09-20 00:12:17

In ASCII, if you take month[0] ^ month[2] ^ month[3] then you get a unique hash with a maximum value of 95 (July), which should allow you to reduce your table size a fair bit (and a minimum value of 20 (May), so a subtraction makes it smaller again).

The same might not be true in EBCDIC, but something similar might be.
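A sketch of how that could look in ASCII, with the minimum subtracted so the table needs only 76 entries. The function name `month_number_xor3` is hypothetical. As a simplification chosen here, the table stores 1-based month numbers so that 0 can mean "unused slot", and the range check is an extra safeguard that the question's valid-input guarantee would make unnecessary.

```c
/* Returns the month number 1..12 for a valid ASCII month name,
   or 0 if the key falls on an unused slot or outside the table. */
static unsigned int month_number_xor3(const char *month) {
    /* Indexed by key - 20; designated initializers leave the rest 0. */
    static const unsigned char tab[76] = {
        [81 - 20] = 1,  [86 - 20] = 2,  [92 - 20] = 3,  [90 - 20] = 4,
        [20 - 20] = 5,  [65 - 20] = 6,  [95 - 20] = 7,  [83 - 20] = 8,
        [87 - 20] = 9,  [84 - 20] = 10, [93 - 20] = 11, [66 - 20] = 12
    };
    unsigned int key = (unsigned char)(month[0] ^ month[2] ^ month[3]);
    if (key - 20u > 75u)        /* also catches key < 20 via wraparound */
        return 0;
    return tab[key - 20u];
}
```

The keys in the initializer were worked out by hand from the ASCII codes; May gives the minimum of 20 and July the maximum of 95, matching the figures above.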

謌踐踏愛綪 2024-09-20 00:12:17

Do you really need the mapping between the hash and the month index to do the tallying? You could eliminate a lookup if instead of returning the month you returned the hash and use that for tallying. In x4u's answer the last line of the hash function could look like

return ( ( str[ 1 ] >> 4 ) & 1 ) + ( str[ 2 ] & 0x1f )

and you'd still be able to do the sums, sorting the results only at the end, not inside the loop.
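One possible shape for this, as a sketch only: it assumes ASCII, 13-byte records (nine-character month, a space, a right-justified two-digit value, and a newline), and values of at most two digits, none of which the question states outright. The names `raw_slot`, `tally_raw`, `reorder` and `MONTH_SLOT` are invented for illustration; the slot numbers come from the "sum" column in x4u's table.

```c
#include <stddef.h>

enum { SLOTS = 33, RECORD_LEN = 13 };   /* both are assumptions, see above */

/* x4u's index computation without the table lookup; result is 0..32. */
static unsigned int raw_slot(const char *str) {
    return ((str[1] >> 4) & 1u) + (str[2] & 0x1fu);
}

/* Add each record's value to the slot its month hashes to. */
static void tally_raw(const char *buf, size_t nrecords,
                      long slot_totals[SLOTS]) {
    for (size_t r = 0; r < nrecords; r++) {
        const char *rec = buf + r * RECORD_LEN;
        int tens = (rec[10] == ' ') ? 0 : rec[10] - '0';
        slot_totals[raw_slot(rec)] += tens * 10 + (rec[11] - '0');
    }
}

/* Slot used by each month, in calendar order. */
static const unsigned char MONTH_SLOT[12] = {
    14, 2, 18, 19, 25, 15, 13, 8, 16, 20, 22, 3
};

/* Done once after the loop: put the totals into calendar order. */
static void reorder(const long slot_totals[SLOTS], long month_totals[12]) {
    for (int m = 0; m < 12; m++)
        month_totals[m] = slot_totals[MONTH_SLOT[m]];
}
```

The per-record work then has no table lookup at all, and the twelve-entry reordering happens once at the end.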
