How can I correctly prefix a word with "a" and "an"?
I have a .NET application where, given a noun, I want it to correctly prefix that word with "a" or "an". How would I do that?
Before you think the answer is to simply check if the first letter is a vowel, consider phrases like:
- an honest mistake
- a used car
You probably can't get much better than this - and it'll certainly beat most rule-based systems.
Edit: I've implemented this in JS/C#. You can try it in your browser, or download the small, reusable JavaScript implementation it uses. The .NET implementation is the package AvsAn on NuGet. The implementations are trivial, so it should be easy to port to any other language if necessary.
Turns out the "rules" are quite a bit more complex than I thought:
...which just goes to underline that a rule-based system would be tricky to build!
You need to use a list of exceptions. I don't think all of the exceptions are well defined, because it sometimes depends on the accent of the person saying the word.
One stupid way is to ask Google for the two possibilities (using one of the search APIs) and use the most popular:
Or:
Therefore "a europe" and "an honest" are the correct versions.
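The comparison described above could be sketched roughly like this in Python. The `count_hits` callable is a hypothetical stand-in for whichever search API you end up using; nothing here calls a real service, and breaking ties in favour of "a" is an arbitrary choice.

```python
from typing import Callable


def pick_article(noun: str, count_hits: Callable[[str], int]) -> str:
    """Return 'a' or 'an', whichever phrase is reported as more common.

    count_hits is a hypothetical callable that returns a hit count for a
    phrase, for example a thin wrapper around a search API.
    """
    hits_a = count_hits(f"a {noun}")
    hits_an = count_hits(f"an {noun}")
    # Ties fall back to "a" (an arbitrary choice for this sketch).
    return "an" if hits_an > hits_a else "a"
```

Because every lookup is a network round trip, you would probably want to cache the results rather than query per sentence.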
If you could find a source of word spellings to word pronunciations, like:
You could base your decision on the first character of the spelled pronunciation string.
For performance, perhaps you could use such a lookup to pre-generate exception sets and use those smaller lookup sets during execution instead.
Edited to add:
!!! - I think you could use this to generate your exceptions:
http://www.speech.cs.cmu.edu/cgi-bin/cmudict
Not everything will be in the dictionary, of course - meaning not every possible exception would wind up in your exceptions sets - but in that case, you could just default to an for vowels/ a for consonants or use some other heuristic with better odds.
(Looking through the CMU dictionary, I was pleased to see it includes proper nouns for countries and some other places - so it will handle examples like "a Ukrainian", "a USA Today paper", "a Urals-inspired painting".)
Editing once more to add: The CMU dictionary does not contain common acronyms, and you have to worry about those starting with s, f, l, m, n, u, and x. But there are plenty of acronym lists out there, like in Wikipedia, which you could use to add to the exceptions.
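As a rough sketch of the pronunciation-lookup idea in Python: the sample lines below are written in the "WORD  PHONEME PHONEME ..." layout of the CMU dictionary, but they are only illustrative and should be replaced by the real downloaded file. The function names and the vowel-letter fallback are assumptions made for this example.

```python
# ARPAbet vowel phonemes (stress digits such as the 1 in "AA1" are stripped
# before comparing).
VOWEL_PHONEMES = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}
VOWEL_LETTERS = set("aeiou")

# Illustrative sample entries; in practice, read the lines from the dictionary file.
SAMPLE_LINES = """\
HONEST  AA1 N AH0 S T
HOUR  AW1 ER0
USED  Y UW1 Z D
EUROPE  Y UH1 R AH0 P
"""


def load_first_phonemes(text: str) -> dict:
    """Map each lower-cased word to its first phoneme, stress digit removed."""
    first = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(";;;"):
            continue  # skip blank lines and comment lines
        word, *phonemes = line.split()
        if phonemes:
            # Keep the first pronunciation seen for a word.
            first.setdefault(word.lower(), phonemes[0].rstrip("012"))
    return first


def article_for(word: str, first_phonemes: dict) -> str:
    """Use the pronunciation when known, else fall back to the first letter."""
    phoneme = first_phonemes.get(word.lower())
    if phoneme is not None:
        return "an" if phoneme in VOWEL_PHONEMES else "a"
    return "an" if word[:1].lower() in VOWEL_LETTERS else "a"
```

Pre-generating an exception set, as suggested above, would amount to keeping only the words for which the phoneme-based answer and the letter-based fallback disagree.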
You have to implement it manually and add the exceptions you want, for example when the first letter is 'H' followed by an 'O', as in honest, hour ... and also the opposite ones like europe, university, used ...
Since "a" and "an" are determined by phonetic rules and not spelling conventions, I would probably do it like this:
You need to look at the grammatical rules for indefinite articles (there are only two indefinite articles in English grammar - "a" and "an"). You may not agree these sound correct, but the rules of English grammar are very clear:
Note this means a vowel sound, and not a vowel letter. For instance, words beginning with a silent "h", such as "honour" or "heir", are treated as starting with a vowel and so are preceded by "an" - for example, "It is an honour to meet you". Words beginning with a consonant sound are prefixed with "a" - which is why you say "a used car" rather than "an used car" - because "used" has a "yoose" sound rather than an "uhh" sound.
So, as a programmer, these are the rules to follow. You just need to work out a way of determining what sound a word begins with, rather than what letter. I've seen examples of this, such as this one in PHP by Jaimie Sirovich:
It's probably easiest to create the rule and then create a list of exceptions and use that. I don't imagine there will be that many.
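A minimal sketch of the "rule plus exception list" approach in Python. The two exception sets are deliberately tiny and purely illustrative; how many entries a real list needs is exactly the open question in this thread.

```python
VOWEL_LETTERS = set("aeiou")

# Illustrative, incomplete exception lists (assumptions for this sketch).
# Spelled with a leading consonant but spoken with a vowel sound:
AN_EXCEPTIONS = {"honest", "honour", "honor", "hour", "heir"}
# Spelled with a leading vowel but spoken with a consonant sound:
A_EXCEPTIONS = {"used", "university", "europe", "unicorn", "one"}


def indefinite_article(word: str) -> str:
    """Return 'a' or 'an' using the letter rule, overridden by the exceptions."""
    key = word.lower()
    if key in AN_EXCEPTIONS:
        return "an"
    if key in A_EXCEPTIONS:
        return "a"
    # Default rule: a vowel letter gets "an", anything else gets "a".
    return "an" if key[:1] in VOWEL_LETTERS else "a"
```

For example, `indefinite_article("honour")` gives "an" and `indefinite_article("used")` gives "a", while any word missing from both sets falls through to the plain letter rule.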
Man, I realize that this is probably a settled argument, but I think it can be settled more easily than by using ad hoc grammar rules from Wikipedia, which would derive vernacular grammar, at best.
The best solution, it seems, is to have the use of "a" or "an" trigger a phoneme-based matching of the following word, with certain phonemes always associated with "an" and the remaining ones belonging to "a".
Carnegie Mellon University has a great online tool for this kind of check - http://www.speech.cs.cmu.edu/cgi-bin/cmudict - with 125k words matched to 39 phonemes. Plugging a word in provides the entire phonemic set, of which only the first is important.
If the word does not appear in the dictionary, such as "NSA", and is all capitalized, then the system can assume the word is an acronym and use the first letter to determine which indefinite article to use, based on the same original rule set.
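The acronym fallback could look something like the following Python sketch. It assumes the acronym is read out letter by letter and that "H" is pronounced "aitch"; the set of letters and the function name are my own choices for illustration.

```python
import re

# Letters whose spoken names begin with a vowel sound ("ef", "aitch", "em", ...).
# Assumption: "H" is read as "aitch"; some speakers say "haitch".
VOWEL_SOUND_LETTERS = set("AEFHILMNORSX")

# Two or more capital letters and nothing else.
ACRONYM = re.compile(r"[A-Z]{2,}")


def article_for_acronym(word: str):
    """Return 'a' or 'an' for an all-caps acronym, or None if it is not one."""
    if not ACRONYM.fullmatch(word):
        return None  # not an acronym; let a dictionary lookup decide instead
    return "an" if word[0] in VOWEL_SOUND_LETTERS else "a"
```

Acronyms that are pronounced as words rather than spelled out (such as "NASA") would be misjudged by this fallback, so they would still need to go on an exception list.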
@Nathan Long:
Downloading Wikipedia is actually not a bad idea. None of the images, videos and other media are needed.
I wrote a (crappy) program in PHP and JavaScript(!) to read the entire Swedish Wikipedia (or at least all articles that could be reached from the article about math, which was the starting point for my spider).
I collected all words and internal links in a database, and also kept track of the frequency of every word. I now use that as a word database for various tasks:
* Finding all words that can be created from a given set of letters (including wildcard)
* Created a simple syntax file for Swedish (all words not in the database are considered incorrect).
Oh, and downloading the entire wiki took about one week, using my laptop running most of the time, with 10Mbit connection.
While you're at it, log all occurrences that are inconsistent with the English language and see if some of them are mistakes. Go fix 'em and give something back to the community.
Note that there are differences between American and British dialects, as Grammar Girl pointed out in her episode A Versus An.
Take a look at Perl's Lingua::EN::Inflect. See sub _indef_article in the source code.
I've ported a function from Python (originally from the CPAN package Lingua-EN-Inflect) to C# that correctly determines vowel sounds, and posted it as an answer to the question "Programmatically determine whether to describe an object with a or an?". You can see the code snippet here.
Could you get an English dictionary that stores the words written in our regular alphabet, and in the International Phonetic Alphabet?
Then use the phonetics to figure out the beginning sound of the word, and thus whether "a" or "an" is appropriate?
Not sure if that would actually be easier than (or as much fun as) the statistical Wikipedia approach.
I would use a rule-based algorithm to cover as many as I could, then use a list of exceptions. If you wanted to get fancy, you could try to determine some new "rules" from your exception list.
It just looks like a set of heuristics. It needs to be a bit more complicated and answer some things which I never got a good answer for, for example how do you treat abbreviations ("a RPM" or "an RPM"? I always thought the latter makes more sense).
A quick search yielded no linguistic libraries that talk about how to handle the English singular prefix, but you can probably find something if you dig deep enough. And if not - you can always write your own inflection library and gain world fame :-).
I don't suppose you can just fill in some boilerplate stuff like 'a/an' as a one-step cover-all. Otherwise you will end up with assumption errors, like all words with 'h' followed by 'o' getting 'an' instead of 'a', as in 'home' (an home?). Basically, you will end up including the logic of the English language or occasionally find rare cases that will make you look foolish.
Check whether a word starts with a vowel or a consonant. A "u" is generally a consonant and a vowel ("yu"), hence belongs in the consonant group for your purposes.
The letter "h" stands for a glottal stop (a consonant) in French and in French words used in English. You can make a list of those (in fact, including "honor", "honour", and "hour" might be sufficient) and count them as starting with vowels (since English doesn't recognise a glottal stop).
Also count "eu" as a consonant etc.
It's not too difficult.
The choice of "an" or "a" depends on the way the word is pronounced. By looking at the word you can't necessarily tell its correct pronunciation, e.g. for jargon or abbreviations.
One way is to have a dictionary with support for phonemes and use the phoneme information associated with the word to determine whether "a" or "an" should be used.
I can't be certain that it has the appropriate information in it to differentiate "a" and "an", but Princeton's WordNet database exists precisely for the purpose of similar sorts of tasks, so I think it's likely that the data is in there. It has some tens of thousands of words and hundreds of thousands of relationships between said words (IIRC; I can't find the current statistics on the site). Give it a look. It's freely downloadable.
How? How about when? Get the noun with article attached. Ask for it in a specific form.
Ask for the noun with the article. Many a MUD codebase stores items as information consisting of:
The keyword form might be "short sword rusty". The short form will be "a sword". The long form will be "a rusty short sword".
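To illustrate, such an item record might be modelled like this in Python. The class and field names are hypothetical and not taken from any particular MUD codebase; the point is only that the article is stored with the text instead of being computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """An item whose display strings already carry their article."""
    keywords: str    # words a player can use to refer to the item
    short_form: str  # brief description, article included
    long_form: str   # full description, article included


def describe(item: Item, verbose: bool = False) -> str:
    """Return the stored description; no a/an logic is needed at runtime."""
    return item.long_form if verbose else item.short_form


sword = Item(
    keywords="short sword rusty",
    short_form="a sword",
    long_form="a rusty short sword",
)
```

The cost is that whoever enters the data has to supply the article, which only works when the nouns are known ahead of time.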
Are you writing an "a vs. an" Web service? Take a step back and look at if you can attack this leak further upstream. You can build a dam, but unless you stop it from flowing, it will spill over eventually.
Determine how critical this is, and as others have suggested, go for "quick but crude", or "expensive but sturdy".
The rule is very simple. If the next word starts with a vowel sound then use 'an', if it starts with a consonant then use 'a'. The hard thing is that our school classification of vowels and consonants doesn't work. The 'h' in 'honour' is a vowel, but the 'h' in 'hospital' is a consonant.
Even worse, some words like 'honest' start with a vowel or a consonant depending on who is saying them. Even worse, some words change depending on the words around them for some speakers.
The problem is bounded only by how much time and effort you want to put into it. You can write something using 'aeiou' as vowels in a couple of minutes, or you can spend months doing linguistic analysis of your target audience. Between them are a huge number of heuristics which will be right for some speakers and wrong for others -- but because different speakers have different determinations for the same word it simply isn't possible to be right all of the time no matter how you do it.
The ideal approach would be to find someplace online that can give you the answers, dynamically query them and cache the answers. You can prime the system with a few hundred words for starters.
(I don't know of such an online source, but I wouldn't be surprised if there is one.)
So, a reasonable solution is possible without downloading all of the internet. Here's what I did:
I remembered that Google published their raw data for Google Books N-Gram frequencies here. So I downloaded the 2-gram files for "a_" and "an". It's about 26 gigs if I recall correctly. From that I produced a list of words that were overwhelmingly preceded by the opposite article from the one you'd expect (if we expect words starting with a vowel to take "an"). I was able to store that final list of words in under 7 kilobytes.
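A rough Python sketch of how such an exception list could be derived from bigram counts. The input format (an iterable of `(article, word, count)` tuples) and the `min_ratio` threshold standing in for "overwhelmingly" are assumptions for this example, not a description of the actual N-Gram file layout or of the processing I used.

```python
from collections import defaultdict

VOWEL_LETTERS = set("aeiou")


def build_exceptions(bigram_counts, min_ratio=10.0):
    """Return words that mostly follow the article the letter rule would not pick.

    bigram_counts: iterable of (article, word, count) tuples (assumed format).
    min_ratio: how many times more frequent the unexpected article must be.
    """
    totals = defaultdict(lambda: {"a": 0, "an": 0})
    for article, word, count in bigram_counts:
        article = article.lower()
        if article in ("a", "an"):
            totals[word.lower()][article] += count

    exceptions = set()
    for word, counts in totals.items():
        expected = "an" if word[:1] in VOWEL_LETTERS else "a"
        other = "a" if expected == "an" else "an"
        # max(..., 1) avoids flagging words seen only a handful of times.
        if counts[other] >= min_ratio * max(counts[expected], 1):
            exceptions.add(word)
    return exceptions
```

At lookup time, a word found in the resulting set gets the opposite of what the vowel-letter rule says, and everything else follows the rule.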
Rather than writing code that could be culture-dependent and have numerous exceptions I tend to rework the statement that includes the indefinite article. For example, rather than saying "This customer wants to live in a Single-Family Home.", you could say "This customer wants a housing type of 'Single-Family Home'." That way, the indefinite article is not dependent on the variable - e.g., "This customer wants a housing type of 'Apartment'."
I'd like to synthesize a few of the given answers, and contribute my own solutions as well.
Let's start with some basic heuristics:
- Start with the first letter of the word.
- Determine whether the word is an acronym (e.g. by matching [A-Z][A-Z]+).
Hopefully this helps. I suspect that it will be less resource intensive than any single option, given that much of it can be solved by either a simple "equals" statement (e.g. word[0] == 'a'), or by a regex (e.g. [aioAIO]), and by some simple knowledge of linguistics and the pronunciations of the English letter names. If the word doesn't fall into a simple case, then use one of the more complex solutions that the other answerers have provided.
You use "a" whenever the next word doesn't start with a vowel? And you use "an" whenever it does?
With that said, couldn't you just use a regular expression like "a\s[aeiou].*" and then replace the "a" with "an"?
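For reference, here is roughly what that regex replacement looks like in Python, together with the cases from the question where a purely letter-based rule goes wrong. The function name is made up for this sketch.

```python
import re

# A standalone "a" (or "A") followed by whitespace and then a vowel letter.
NAIVE_PATTERN = re.compile(r"\b([Aa])(\s+)(?=[aeiouAEIOU])")


def naive_fix(text: str) -> str:
    """Turn 'a' into 'an' before any word that starts with a vowel letter."""
    return NAIVE_PATTERN.sub(r"\1n\2", text)


print(naive_fix("a apple"))           # an apple
print(naive_fix("a used car"))        # an used car (wrong)
print(naive_fix("a honest mistake"))  # a honest mistake (unchanged, still wrong)
```

So the regex handles the easy majority, but "used" and "honest" show why the other answers reach for pronunciation data or exception lists.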