Could use some help with this soundex coding

Posted 2024-08-07 20:38:06

The US Census Bureau uses a special encoding called "soundex" to locate information about a person. The soundex is an encoding of surnames (last names) based on the way a surname sounds rather than the way it is spelled. Surnames that sound the same but are spelled differently, like SMITH and SMYTH, have the same code and are filed together. The soundex coding system was developed so that you can find a surname even though it may have been recorded under various spellings.

In this lab you will design, code, and document a program that produces the soundex code for a given surname. A user will be prompted for a surname, and the program should output the corresponding code.

Basic Soundex Coding Rules

Every soundex encoding of a surname consists of a letter and three numbers. The letter used is always the first letter of the surname. The numbers are assigned to the remaining letters of the surname according to the soundex guide shown below. Zeroes are added at the end if necessary to always produce a four-character code. Additional letters are disregarded.

Soundex Coding Guide

Soundex assigns a number for various consonants. Consonants that sound alike are assigned the same number:

Number Consonants

1 B, F, P, V
2 C, G, J, K, Q, S, X, Z
3 D, T
4 L
5 M, N
6 R

Soundex disregards the letters A, E, I, O, U, H, W, and Y.
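The coding guide above can be held in a single lookup table instead of a chain of comparisons. This is a minimal sketch in Python 3; the names `GROUPS`, `CODES` and `code_for` are made up for illustration and are not part of the assignment.

```python
# Digit -> consonants, copied from the coding guide above.
GROUPS = {
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}

# Invert the guide into a letter -> digit dictionary.
CODES = {letter: digit for digit, letters in GROUPS.items() for letter in letters}


def code_for(letter):
    """Return the Soundex digit for a letter, or '' if the letter is disregarded."""
    return CODES.get(letter.upper(), "")
```

With this table, `code_for("b")` gives `'1'`, and a disregarded letter such as `"A"` or `"H"` gives the empty string.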

There are 3 additional Soundex Coding Rules that are followed. A good program design would implement these each as one or more separate functions.

Rule 1. Names With Double Letters

If the surname has any double letters, they should be treated as one letter. For example:

  • Gutierrez is coded G362 (G, 3 for the T, 6 for the first R, second R ignored, 2 for the Z).
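One possible way to handle Rule 1 on its own is to collapse repeated letters before any digits are assigned. The function name `collapse_double_letters` is hypothetical, and the sketch assumes that "double letters" means the same letter appearing twice (or more) in a row.

```python
def collapse_double_letters(name):
    """Drop any letter that repeats the letter immediately before it."""
    result = []
    for letter in name.upper():
        # Keep the letter only if it differs from the last one kept.
        if not result or letter != result[-1]:
            result.append(letter)
    return "".join(result)
```

For example, `collapse_double_letters("Gutierrez")` leaves a single R, so only one 6 would be produced when the remaining letters are encoded.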

Rule 2. Names with Letters Side-by-Side that have the Same Soundex Code Number

If the surname has different letters side-by-side that have the same number in the soundex coding guide, they should be treated as one letter. Examples:

  • Pfister is coded as P236 (P, F ignored since it is considered same as P, 2 for the S, 3 for the T, 6 for the R).

  • Jackson is coded as J250 (J, 2 for the C, K ignored same as C, S ignored same as C, 5 for the N, 0 added).
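Rule 2 can be sketched by comparing digits rather than letters: a letter is skipped when its digit equals the digit of the letter just before it. Because a doubled letter always has the same digit as itself, this also covers Rule 1. The names `CODES` and `digits_side_by_side` are made up for illustration. The sketch assumes a non-empty name, does not pad or truncate the result, and does not handle the H/W case of Rule 3.

```python
# Letter -> digit table built from the coding guide.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def digits_side_by_side(name):
    """Encode a name, treating side-by-side letters with the same digit as one.

    Returns the first letter followed by the digits, without zero padding
    or truncation. Any disregarded letter (including H and W) ends a run here.
    """
    name = name.upper()
    out = name[0]
    prev = CODES.get(name[0], "")  # the first letter's digit counts as "previous"
    for letter in name[1:]:
        digit = CODES.get(letter, "")
        if digit and digit != prev:
            out += digit
        prev = digit
    return out
```

Here `digits_side_by_side("Pfister")` returns `'P236'`, and `digits_side_by_side("Jackson")` returns `'J25'`, which still needs a zero added to reach four characters.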

Rule 3. Consonant Separators

3.a. If a vowel (A, E, I, O, U) separates two consonants that have the same soundex code, the consonant to the right of the vowel is coded. Example:

  • Tymczak is coded as T522 (T, 5 for the M, 2 for the C, Z ignored (see "Side-by-Side" rule above), 2 for the K). Since the vowel "A" separates the Z and K, the K is coded.

3.b. If "H" or "W" separates two consonants that have the same soundex code, the consonant to the right is not coded. Example:

  • Ashcraft is coded A261 (A, 2 for the S, C ignored since same as S with H in between, 6 for the R, 1 for the F). It is not coded A226.
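Putting the basic rules and Rules 1 to 3 together, one way to write the whole encoder is to remember the digit of the previous letter, let vowels reset that memory, and let H and W leave it untouched. This is a sketch under assumptions, not the expected lab solution: the names `CODES` and `soundex` are invented, Y is treated like a vowel (the rules above do not say which way Y goes), non-letters are dropped, and an empty input returns an empty string.

```python
# Letter -> digit table built from the coding guide.
CODES = {
    letter: str(digit)
    for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1)
    for letter in letters
}


def soundex(surname):
    """Return the four-character soundex code for a surname."""
    letters = [ch for ch in surname.upper() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # the first letter's digit can suppress the next one
    for letter in letters[1:]:
        if letter in "HW":
            continue  # Rule 3.b: H and W do not break a run of equal digits
        digit = CODES.get(letter, "")
        if digit and digit != prev:
            digits.append(digit)  # Rules 1 and 2: equal neighbours are coded once
        prev = digit  # Rule 3.a: a vowel sets prev to '', so the next consonant is coded
    # Pad with zeroes, then disregard anything past four characters.
    return (first + "".join(digits) + "000")[:4]
```

Run against the examples in the write-up, this gives G362 for Gutierrez, P236 for Pfister, J250 for Jackson, T522 for Tymczak and A261 for Ashcraft, and SMITH and SMYTH receive the same code.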

So far this is my code:

# Python 3 version: input() and print() replace raw_input and the print statement.
surname = input("Please enter surname:").upper()  # the guide lists uppercase letters
outstring = ""

# The code always starts with the first letter of the surname.
outstring = outstring + surname[0]
for i in range(1, len(surname)):
    nextletter = surname[i]
    if nextletter in ['B','F','P','V']:
        outstring = outstring + '1'

    elif nextletter in ['C','G','J','K','Q','S','X','Z']:
        outstring = outstring + '2'

    elif nextletter in ['D','T']:
        outstring = outstring + '3'

    elif nextletter in ['L']:
        outstring = outstring + '4'

    elif nextletter in ['M','N']:
        outstring = outstring + '5'

    elif nextletter in ['R']:
        outstring = outstring + '6'

# Rules 1-3, zero padding and truncation to four characters are not handled yet.
print(outstring)

The code does what it has been asked to do so far; I am just not sure how to code the three rules. That is where I need help, so any help is appreciated.

Comments (2)

浊酒尽余欢 2024-08-14 20:38:06

Here are some small hints on general Python stuff.

0) You can use a for loop to loop over any sequence, and a string counts as a sequence. So you can write:

for nextletter in surname[1:]:
    # do stuff

This is easier to write and easier to understand than computing an index and indexing the surname.

1) You can use the += operator to append strings. Instead of

x = x + 'a'

you can write

x += 'a'

As for help with your specific problem, you will want to keep track of the previous letter. If your assignment had a rule that said "two 'z' characters in a row should be coded as 99" you could add code like this:

def rule_two_z(prevletter, curletter):
    if prevletter.lower() == 'z' and curletter.lower() == 'z':
        return 99
    else:
        return -1


prevletter = surname[0]
for curletter in surname[1:]:
    code = rule_two_z(prevletter, curletter)
    if code < 0:
        # the rule did not apply: do something else here to set code
        code = ''  # placeholder so the loop runs; appends nothing
    outstring += str(code)
    prevletter = curletter

Hmmm, you were writing your code to return string integers like '3', while I wrote my code to return an actual integer and then call str() on it before adding it to the string. Either way is probably fine.

Good luck!

兔姬 2024-08-14 20:38:06

A few hints:

  • By using an array in which each Soundex code is stored and indexed by the ASCII value of the letter it corresponds to (or by a value in a shorter numeric range derived from it), you will make the code both more efficient and more readable. This is a very common technique: understand, use and reuse ;-)

  • As you parse the input string, you need to keep track of (or compare with) the letter previously handled in order to ignore repeating letters and handle the other rules (implement each of these in a separate function, as hinted in the write-up). The idea could be to introduce a function in charge of possibly adding the soundex code for the current letter of the input being processed. This function would in turn call each of the "rules" functions, possibly quitting early based on the return values of some rules. In other words, replace the systematic...

    outstring = outstring + c    # btw could be +=
...with
    outstring += AppendCodeIfNeeded(c)
  • Beware that this multi-function structure is overkill for such trivial logic, but it is not a bad idea to do it for practice.
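The two hints above might look roughly like this in Python 3. It is only a sketch: `CODE_TABLE`, `lookup` and `append_code_if_needed` are hypothetical names (the last one is a snake_case take on the `AppendCodeIfNeeded` idea), the function takes the previous digit as an extra argument and returns the updated one, and the input is assumed to contain only the ASCII letters A to Z.

```python
# One digit per letter, indexed by the letter's offset from 'A'.
# '0' marks a letter that Soundex disregards.
#             ABCDEFGHIJKLMNOPQRSTUVWXYZ
CODE_TABLE = "01230120022455012623010202"


def lookup(letter):
    """Index the table by the letter's ASCII offset from 'A'."""
    return CODE_TABLE[ord(letter.upper()) - ord("A")]


def append_code_if_needed(letter, prev_digit):
    """Return (text_to_append, new_prev_digit) for one letter of the input."""
    if letter.upper() in "HW":
        return "", prev_digit  # H and W are skipped and keep the previous digit
    digit = lookup(letter)
    if digit == "0" or digit == prev_digit:
        return "", digit  # disregarded letter, or same digit as its neighbour
    return digit, digit


# Usage: encode one surname with the helper.
surname = "Ashcraft"
outstring = surname[0].upper()
prev = lookup(surname[0])
for c in surname[1:]:
    text, prev = append_code_if_needed(c, prev)
    outstring += text
outstring = (outstring + "000")[:4]  # pad with zeroes, keep four characters
```

A string works as the array here because Python strings can be indexed like lists; a list of integers indexed the same way would do equally well.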