Soundex algorithm in Python (homework help request)

Posted on 2024-08-08 19:10:03

The US census bureau uses a special encoding called “soundex” to locate information about a person. The soundex is an encoding of surnames (last names) based on the way a surname sounds rather than the way it is spelled. Surnames that sound the same, but are spelled differently, like SMITH and SMYTH, have the same code and are filed together. The soundex coding system was developed so that you can find a surname even though it may have been recorded under various spellings.

In this lab you will design, code, and document a program that produces the soundex code when input with a surname. A user will be prompted for a surname, and the program should output the corresponding code.

Basic Soundex Coding Rules

Every soundex encoding of a surname consists of a letter and three numbers. The letter used is always the first letter of the surname. The numbers are assigned to the remaining letters of the surname according to the soundex guide shown below. Zeroes are added at the end if necessary to always produce a four-character code. Additional letters are disregarded.

Soundex Coding Guide

Soundex assigns a number for various consonants. Consonants that sound alike are assigned the same number:

Number  Consonants
1       B, F, P, V
2       C, G, J, K, Q, S, X, Z
3       D, T
4       L
5       M, N
6       R

Soundex disregards the letters A, E, I, O, U, H, W, and Y.
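
The coding guide maps naturally onto a dictionary lookup. The following is only a minimal sketch of one way to build it; the names `SOUNDEX_GROUPS`, `SOUNDEX_CODES`, and `code_for` are illustrative and do not come from the assignment.

```python
# Hypothetical names; the letter groups are copied from the coding guide.
SOUNDEX_GROUPS = {
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}

# Invert the groups into a letter -> digit lookup table.
SOUNDEX_CODES = {
    letter: digit
    for digit, letters in SOUNDEX_GROUPS.items()
    for letter in letters
}


def code_for(letter):
    """Return the Soundex digit for a letter, or None if it is disregarded."""
    return SOUNDEX_CODES.get(letter.upper())


print(code_for("s"))  # 2
print(code_for("A"))  # None
```

Returning `None` for the disregarded letters (A, E, I, O, U, H, W, Y) keeps them distinguishable from coded consonants, which is useful once the separator rules come into play.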

There are 3 additional Soundex Coding Rules that are followed. A good program design would implement these each as one or more separate functions.

Rule 1. Names With Double Letters

If the surname has any double letters, they should be treated as one letter. For example:

Gutierrez is coded G362 (G, 3 for the T, 6 for the first R, second R ignored, 2 for the Z).

Rule 2. Names with Letters Side-by-Side that have the Same Soundex Code Number

If the surname has different letters side-by-side that have the same number in the soundex coding guide, they should be treated as one letter. Examples:

Pfister is coded as P236 (P, F ignored since it is considered same as P, 2 for the S, 3 for the T, 6 for the R).

Jackson is coded as J250 (J, 2 for the C, K ignored same as C, S ignored same as C, 5 for the N, 0 added).
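
Rules 1 and 2 both amount to merging side-by-side letters that share a code. A rough sketch of that single step, using `str.translate` and `itertools.groupby`, might look like this. The `TABLE` and `collapse_adjacent` names and the "." placeholder for disregarded letters are assumptions of this sketch, and it deliberately does not handle the H/W case of Rule 3.

```python
from itertools import groupby

# Letter -> digit table from the coding guide. Disregarded letters map to "."
# (the "." placeholder is an assumption of this sketch, not part of Soundex).
TABLE = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    ".123.12..22455.12623.1.2.2",
)


def collapse_adjacent(surname):
    """Translate letters to digits and merge side-by-side equal digits."""
    digits = surname.upper().translate(TABLE)
    # groupby yields one key per run of identical characters.
    return "".join(key for key, _ in groupby(digits))


print(collapse_adjacent("Gutierrez"))  # 2.3.6.2
print(collapse_adjacent("Jackson"))    # 2.2.5
```

The first character of the result belongs to the first letter of the surname, so a caller would replace it with that letter and drop the "." placeholders to get the remaining digits (G362 and J25 here, before zero padding).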

Rule 3. Consonant Separators

3.a. If a vowel (A, E, I, O, U) separates two consonants that have the same soundex code, the consonant to the right of the vowel is coded. Example:

Tymczak is coded as T522 (T, 5 for the M, 2 for the C, Z ignored (see "Side-by-Side" rule above), 2 for the K). Since the vowel "A" separates the Z and K, the K is coded.

3.b. If "H" or "W" separates two consonants that have the same soundex code, the consonant to the right is not coded. Example:

Ashcraft is coded A261 (A, 2 for the S, C ignored since same as S with H in between, 6 for the R, 1 for the F). It is not coded A226.
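
One possible reading of rules 3.a and 3.b together is that H and W are skipped entirely, while a vowel resets the "last code seen". The sketch below follows that reading and checks it against the examples above. The names `CODES` and `soundex` are illustrative, the input is assumed to start with a letter, and treating Y like a vowel is an assumption, since the rules do not say how Y separates consonants.

```python
# Build the letter -> digit table from the coding guide.
CODES = {}
for digit, letters in (("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
                       ("4", "L"), ("5", "MN"), ("6", "R")):
    for letter in letters:
        CODES[letter] = digit


def soundex(surname):
    """Encode a surname (assumed to start with a letter) as four characters."""
    name = surname.upper()
    result = name[0]
    # The first letter takes part in rules 1 and 2 (e.g. the F in Pfister).
    last_code = CODES.get(name[0])
    for letter in name[1:]:
        if letter in "HW":
            continue  # rule 3.b: H and W do not break a run of equal codes
        code = CODES.get(letter)  # None for vowels (and Y, by assumption)
        if code is not None and code != last_code:
            result += code
        last_code = code  # a vowel resets this to None (rule 3.a)
    # Pad with zeros, then discard any additional digits.
    return (result + "000")[:4]


for name in ("Gutierrez", "Pfister", "Jackson", "Tymczak", "Ashcraft"):
    print(name, soundex(name))
```

Under these assumptions the loop prints G362, P236, J250, T522, and A261 for the five example surnames.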

So far this is my code:

# Python 2 (raw_input and the print statement)
surname = raw_input("Please enter surname:")
outstring = ""

# The code always starts with the first letter of the surname.
outstring = outstring + surname[0]

# Append the digit for each remaining consonant; other letters are skipped.
for i in range(1, len(surname)):
    nextletter = surname[i]
    if nextletter in ['B','F','P','V']:
        outstring = outstring + '1'

    elif nextletter in ['C','G','J','K','Q','S','X','Z']:
        outstring = outstring + '2'

    elif nextletter in ['D','T']:
        outstring = outstring + '3'

    elif nextletter in ['L']:
        outstring = outstring + '4'

    elif nextletter in ['M','N']:
        outstring = outstring + '5'

    elif nextletter in ['R']:
        outstring = outstring + '6'

print outstring

It sufficiently does what it is asked to; I am just not sure how to code the three rules. That is where I need help, so any help is appreciated.

最偏执的依靠 2024-08-15 19:10:03

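I suggest you try the following.

Store CurrentCoded and LastCoded variables to work with before appending to your output.
Break the system down into useful functions, such as:
1. Boolean IsVowel(Char)
2. Int Coded(Char)
3. Boolean IsRule1(Char, Char)

Once you break it down nicely, it should become easier to manage.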
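
The helper functions suggested above could be sketched in Python roughly as follows. The snake_case names, the choice to return 0 for disregarded letters, and the short demonstration loop are assumptions made for illustration; the loop covers only side-by-side letters and leaves the H/W rule to the caller.

```python
VOWELS = "AEIOU"
GROUPS = ("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R")


def is_vowel(char):
    """True for A, E, I, O, U in either case."""
    return char.upper() in VOWELS


def coded(char):
    """Return the Soundex number (1-6) for char, or 0 if it is disregarded."""
    for number, letters in enumerate(GROUPS, start=1):
        if char.upper() in letters:
            return number
    return 0


def is_rule1(previous, current):
    """Rule 1: True when the same letter appears twice in a row."""
    return previous.upper() == current.upper()


# Comparing the current and last codes handles side-by-side letters (rule 2).
last_coded = coded("J")
kept = []
for char in "ACKSON":
    current_coded = coded(char)
    if current_coded and current_coded != last_coded:
        kept.append(current_coded)
    last_coded = current_coded

print(kept)  # [2, 5]
```

Each helper can be tested on its own, which is what the assignment means by implementing the rules as separate functions.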

神爱温柔 2024-08-15 19:10:03

This is hardly perfect (for instance, it produces the wrong result if the input doesn't start with a letter), and it doesn't implement the rules as independently-testable functions, so it's not really going to serve as an answer to the homework question. But this is how I'd implement it:

import re

def soundex_prepare(s):
    """Prepare string for Soundex encoding.

    Remove non-alpha characters (and the not-of-interest W/H/Y),
    convert to upper case, and remove all runs of repeated letters."""
    p = re.compile("[^a-gi-vxz]", re.IGNORECASE)
    s = re.sub(p, "", s).upper()
    for c in set(s):
        s = re.sub(c + "{2,}", c, s)
    return s

def soundex_encode(s):
    """Encode a name string using the Soundex algorithm."""
    result = s[0].upper()
    s = soundex_prepare(s[1:])
    letters = 'ABCDEFGIJKLMNOPQRSTUVXZ'
    codes   = '.123.12.22455.12623.122'
    d = dict(zip(letters, codes))
    # Start from the first letter's code so that a following letter with the
    # same code is skipped (e.g. the F in Pfister).
    prev_code = d.get(result, "")
    for c in s:
        code = d[c]
        if code != "." and code != prev_code:
            result += code
        if len(result) >= 4: break
        prev_code = code
    # Pad with zeros and truncate to four characters.
    return (result + "0000")[:4]

静待花开 2024-08-15 19:10:03

import itertools  # groupby collapses runs of identical characters
import re  # sub performs the letter-to-digit substitutions

surname = input("Enter surname of the author: ")  # ask the user for a surname

while surname != "":  # keep looping until the user enters an empty line

    name = surname.upper()
    str_ini = name[0]  # initial letter of the surname

    # Drop H and W after the initial so that the consonants they separate are
    # still treated as side by side (rule 3.b).
    mod_str1 = str_ini + re.sub(r'[HW]', '', name[1:])

    # Mark vowels with '0' instead of deleting them, so they keep separating
    # equal codes (rule 3.a). Treating Y like a vowel is an assumption here.
    # Only the initial can still be H or W at this point.
    mod_str2 = re.sub(r'[AEIOUYHW]', '0', mod_str1)

    # Substitute the remaining letters with their Soundex digits.
    mod_str21 = re.sub(r'[BFPV]', '1', mod_str2)
    mod_str22 = re.sub(r'[CGJKQSXZ]', '2', mod_str21)
    mod_str23 = re.sub(r'[DT]', '3', mod_str22)
    mod_str24 = re.sub(r'L', '4', mod_str23)
    mod_str25 = re.sub(r'[MN]', '5', mod_str24)
    mod_str26 = re.sub(r'R', '6', mod_str25)

    # Collapse each run of identical digits into a single digit (rules 1 and 2).
    mod_str3 = ''.join(char for char, rep in itertools.groupby(mod_str26))

    # Put the initial back in place of its digit and drop the '0' markers.
    mod_str4 = str_ini + mod_str3[1:].replace('0', '')

    # Truncate to four characters and pad with trailing zeros.
    print(mod_str4[:4].ljust(4, '0') + "\n")

    print("Press enter to exit")  # an empty line ends the loop
    surname = input("Enter surname of the author: ")  # ask for the next surname
