PHP metaphone implementation bug

Posted 01-04 19:31 · 1205 words · 3 views · 0 comments

I'm testing a Metaphone implementation for C# and comparing its results against PHP's built-in metaphone() function. I've run into a bug (previously documented in PHP's issue tracker and discussed on a mailing list), and for my own interest I'm trying to understand the C code behind it.

Basically, according to the Metaphone algorithm, most instances of -gh- should be silent. In the specific test case of "wright", I expect (and my own algorithm generates) a metaphone key of "RT":

"wr" => R
"i"  => ignored
"gh" => ignored
"t"  => T

Result: RT

However, PHP's metaphone function returns RFT. Clearly, it's converting the -gh- to an F, as if it were at the end of a word (e.g. "rough"), but in the case of the word "wright", this is incorrect, because the -gh- does not come at the end of the word. Looking at the metaphone.c file in the PHP source distribution, I see a few key things:

/* These prevent GH from becoming F */
#define NOGHTOF(c)  (ENCODE(c) & 16)    /* BDH */

...

/* Go N letters back. */
#define Look_Back_Letter(n) (w_idx >= n ? toupper(word[w_idx-n]) : '\0')

And then on line 342:

case 'G':
    if (Next_Letter == 'H') {
        if (!(NOGHTOF(Look_Back_Letter(3)) || Look_Back_Letter(4) == 'H')) {
            Phonize('F');
            skip_letter++;

Can someone help me understand what exactly the NOGHTOF function does and why this code is incorrectly rendering an F for the -gh- in "wright"? I'm not really a C guy, so the code isn't at all clear to me.

Comments (1)

浮萍、无处依 · 2025-01-11 19:31:20

The meaning of NOGHTOF(c) is actually determined by the code starting at line 81:

char _codes[26] = {
        1, 16, 4, 16, 9, 2, 4, 16, 9, 2, 0, 2, 2, 2, 1, 4, 0, 2, 4, 4, 1, 0, 0, 0, 8, 0
    /*  a  b   c  d   e  f  g  h   i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z */
};

#define ENCODE(c) (isalpha(c) ? _codes[((toupper(c)) - 'A')] : 0)

Essentially, each letter of the alphabet is assigned a value in order (A = 1, B = 16, and so on). The ENCODE macro then checks whether the character passed in is a letter; if so, it evaluates to that letter's code, otherwise to 0. (Strictly speaking nothing is "returned": ENCODE is a macro, so the preprocessor expands it in place before compilation instead of generating a call.) NOGHTOF(c) then tests whether the bit with value 16 is set in that code, which according to the table is true only for B, D, and H.

The way I'm reading the code for 'G' is this (without trying to understand why):

If current letter is G then
    If next letter is H then
        Take the "_codes" value of the letter three letters back (why?) and check the bit with value 16 (the fifth-lowest bit)
        If this bit is not set AND the letter four letters back (why?) is not 'H' then
            Add 'F' to the result
            skip one more character (the letter 'H' following the 'G')

Why it is written this way is beyond me. I'm quite sure somebody had a reason for it, but it looks like an obvious bug to me.
