PHP metaphone implementation bug
I'm testing a metaphone implementation for C# and comparing its results against PHP's built-in metaphone() function. However, I've come across a bug (which was previously documented in PHP's issue tracker and discussed on a mailing list), and for my own interest I'm trying to understand the C code behind it.
Basically, according to the metaphone algorithm, most instances of -gh- should be rendered silent. In the specific test case of "wright", I expect (and generate with my own algorithm) a metaphone key of "RT":
"wr" => R
"i" => ignored
"gh" => ignored
"t" => T
Result: RT
However, PHP's metaphone function returns RFT. Clearly, it's converting the -gh- to an F, as if it were at the end of a word (e.g. "rough"), but in the case of the word "wright", this is incorrect, because the -gh- does not come at the end of the word. Looking at the metaphone.c file in the PHP source distribution, I see a few key things:
/* These prevent GH from becoming F */
#define NOGHTOF(c) (ENCODE(c) & 16) /* BDH */
...
/* Go N letters back. */
#define Look_Back_Letter(n) (w_idx >= n ? toupper(word[w_idx-n]) : '\0')
And then on line 342:
case 'G':
    if (Next_Letter == 'H') {
        if (!(NOGHTOF(Look_Back_Letter(3)) || Look_Back_Letter(4) == 'H')) {
            Phonize('F');
            skip_letter++;
Can someone help me understand what exactly the NOGHTOF function does and why this code is incorrectly rendering an F for the -gh- in "wright"? I'm not really a C guy, so the code isn't at all clear to me.

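The meaning of NOGHTOF(c) is determined by the code starting at line 81 of metaphone.c. Essentially, a value is assigned to each letter of the alphabet in order (A = 1, B = 16, etc.). The ENCODE macro then checks whether the character passed to it is a letter; if so, it yields the corresponding code for that letter, otherwise it yields the null character. (It doesn't really return anything: it is a macro, so the preprocessor substitutes its body at each point of use before the code is compiled.) NOGHTOF(c) then tests whether bit 16 is set in that code, which, going by the comment in the source, is the case for B, D and H.

The way I'm reading the code for 'G' is this (without trying to understand why): when the next letter is H, an F is emitted unless the letter three positions back is B, D or H, or the letter four positions back is H. In "wright", the letter three positions before the G is W, and there is no letter four positions back, so Look_Back_Letter(4) yields '\0'. Neither condition holds, so the -gh- becomes F.

Why it is like this is beyond me. I'm quite sure somebody had a good reason to write it this way, but it seems an obvious bug to me.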
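The check quoted in the question can be traced with a small C sketch. This is only an illustration under stated assumptions: the function names (encode, no_gh_to_f, look_back_letter, gh_becomes_f) are hypothetical stand-ins for the macros, and the code table is filled in only with the values mentioned here (A = 1; B, D and H carrying bit 16), not with the full table from metaphone.c.

```c
#include <ctype.h>

/* Hypothetical, partial stand-in for ENCODE: only the values quoted in
 * the discussion are filled in. Every other letter is left at 0 here,
 * which is an assumption made for brevity; what matters for this check
 * is only that those letters do not carry bit 16. */
static int encode(int c)
{
    if (!isalpha((unsigned char)c))
        return 0;
    switch (toupper((unsigned char)c)) {
    case 'A':
        return 1;
    case 'B': case 'D': case 'H':
        return 16;
    default:
        return 0; /* placeholder, not the real table value */
    }
}

/* Stand-in for NOGHTOF: non-zero only for B, D and H. */
static int no_gh_to_f(int c)
{
    return encode(c) & 16;
}

/* Stand-in for Look_Back_Letter: the letter n positions before w_idx,
 * or '\0' when that would fall before the start of the word. */
static int look_back_letter(const char *word, int w_idx, int n)
{
    return w_idx >= n ? toupper((unsigned char)word[w_idx - n]) : '\0';
}

/* Mirrors the condition in the quoted 'G' case. The caller is expected
 * to pass the index of a 'G' that is directly followed by 'H'.
 * Returns 1 when the quoted code would emit 'F', 0 otherwise. */
int gh_becomes_f(const char *word, int w_idx)
{
    return !(no_gh_to_f(look_back_letter(word, w_idx, 3))
             || look_back_letter(word, w_idx, 4) == 'H');
}
```

With this sketch, gh_becomes_f("wright", 3) evaluates to 1, because the letter three back is W and nothing sits four back, whereas gh_becomes_f("bought", 3) evaluates to 0, because the letter three back is B. That matches the RFT result reported for "wright": the check looks only at fixed positions before the G and never at what follows the H.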