Determining the difficulty of English words

Posted on 2024-10-19 19:35:06

I am working on a word-based game. My word database contains around 10,000 English words (sorted alphabetically). I am planning to have 5 difficulty levels in the game. Level 1 shows the easiest words and Level 5 shows the most difficult words, relatively speaking.

I need to divide the 10,000-word list into 5 levels, from the easiest words to the most difficult ones. I am looking for a program to do this for me.

Can someone tell me if there is an algorithm or a method to quantitatively measure the difficulty of an English word?

I have some thoughts about using "word length" and "word frequency" as factors and coming up with a formula or something similar that accomplishes this.


Comments (13)

我早已燃尽 2024-10-26 19:35:06

Get a large corpus of texts (e.g. from the Gutenberg archives), do a straight frequency analysis, and eyeball the results. If they don't look satisfying, weight each text by its Flesch-Kincaid score and run the analysis again: words that show up frequently, but in "difficult" texts, will get a score boost, which is what you want.

If all you have is 10,000 words, though, it will probably be quicker to just do the frequency sorting as a first pass and then tweak the results by hand.
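One possible reading of the weighting idea, as a minimal sketch: count each word across the corpus, then count again with every occurrence multiplied by the grade level of the text it came from. Dividing the two gives the average grade of the texts a word appears in. The two sample texts and their grade levels (1.0 and 12.0) are made-up placeholders, not real Flesch-Kincaid scores, and `count_words` is a hypothetical helper name.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def count_words(texts_with_grades):
    """Return (plain, weighted) counts; weighted multiplies each count by the text's grade."""
    plain = Counter()
    weighted = Counter()
    for text, grade in texts_with_grades:
        for word, n in Counter(tokenize(text)).items():
            plain[word] += n
            weighted[word] += n * grade
    return plain, weighted

# Placeholder corpus: (text, assumed grade level). Real scores would come
# from a readability tool run over each full text.
corpus = [
    ("The cat sat on the mat. The cat slept.", 1.0),
    ("The ontological argument presupposes the existence of a necessary being.", 12.0),
]

plain, weighted = count_words(corpus)
# Average grade of the texts in which each word occurs.
mean_grade = {word: weighted[word] / plain[word] for word in plain}
for word in sorted(mean_grade, key=mean_grade.get):
    print(f"{word:15s} count={plain[word]} mean_grade={mean_grade[word]:.1f}")
```

With a real corpus, common function words should end up near the corpus-wide average grade, while words confined to harder texts drift upward.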

白日梦 2024-10-26 19:35:06

I'm not understanding how frequency is being used... if you were to scan a newspaper, I'm sure you would see the word "thoroughly" mentioned much more frequently than the word "bop" or "moo", but that doesn't mean it's an easier word; on the contrary, "thoroughly" is one of the most disgustingly absurd spelling anomalies that give grade-school children nightmares...

Try explaining to a sane human being learning English as a second language the subtle difference between "slaughter" and "laughter".

遗弃M 2024-10-26 19:35:06

I agree that frequency of use is the most likely metric; there are studies supporting a high correlation between word frequency and difficulty (correct responses on tests, etc.). Check out the English Lexicon Project at http://elexicon.wustl.edu/ for some 70k(?) frequency-rated words.

巷子口的你 2024-10-26 19:35:06

Crowd-source the answer.

  • Create an online "game" that lists 10 words at random.
  • Get the player to drag and drop them into order from easiest to hardest, and tick a box to indicate whether the player has ever heard of each word.
  • Apply a ranking algorithm (e.g. Elo) to the result of each experiment.
  • Repeat.

It might even be fun to play, and you could give the player a language proficiency score at the end.
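The ranking step could be sketched like this, assuming each player's ordering is broken into pairwise "matches" in which the word placed as harder beats the word placed as easier. The K-factor of 16 and the starting rating of 1500 are arbitrary assumptions, and `update_from_ordering` is a hypothetical name.

```python
from itertools import combinations

def expected(r_a, r_b):
    """Standard Elo expected score of a player rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_from_ordering(ratings, ordering, k=16.0, start=1500.0):
    """Update ratings in place from one ordering, listed easiest to hardest.

    Every pair in the ordering is treated as a match won by the harder word,
    so a higher rating means "judged harder more often".
    """
    for easier, harder in combinations(ordering, 2):
        r_e = ratings.get(easier, start)
        r_h = ratings.get(harder, start)
        gain = k * (1.0 - expected(r_h, r_e))
        ratings[harder] = r_h + gain
        ratings[easier] = r_e - gain
    return ratings

ratings = {}
update_from_ordering(ratings, ["cat", "house", "ubiquitous"])
update_from_ordering(ratings, ["house", "cat", "ubiquitous"])
for word, rating in sorted(ratings.items(), key=lambda item: item[1]):
    print(f"{word:12s} {rating:.1f}")
```

Once enough orderings have been collected, sorting by rating gives a difficulty ranking that can be cut into levels.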

凑诗 2024-10-26 19:35:06

Difficulty is a pretty amorphous concept. If you have no clear idea of what you want, perhaps you could take a look at the Porter Stemming Algorithm (see for example the original paper). It contains a more advanced idea of "length", defining words as being of the form [C](VC){m}[V]. Here C means a block of consonants and V a block of vowels, so the definition says a word is an optional C, followed by m VC blocks, and finally an optional V. The value m is this advanced "length".
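As a rough sketch, the measure m can be computed by mapping each letter to C or V, collapsing runs, and counting the VC pairs. This follows the consonant definition in Porter's paper (a "y" counts as a vowel when preceded by a consonant) and assumes purely alphabetic input; `porter_measure` is a hypothetical name.

```python
def porter_measure(word):
    """Return m for a word viewed as [C](VC){m}[V]."""
    blocks = []                # collapsed sequence of "C" and "V" blocks
    prev_is_consonant = False
    for ch in word.lower():
        if ch in "aeiou":
            is_consonant = False
        elif ch == "y":
            # "y" is a consonant unless the previous letter is a consonant.
            is_consonant = not prev_is_consonant
        else:
            is_consonant = True
        kind = "C" if is_consonant else "V"
        if not blocks or blocks[-1] != kind:
            blocks.append(kind)
        prev_is_consonant = is_consonant
    # Blocks alternate, so every "VC" in the collapsed string is one unit of m.
    return "".join(blocks).count("VC")

for w in ["tree", "by", "trouble", "oats", "troubles", "private", "orrery"]:
    print(w, porter_measure(w))
```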

抱猫软卧 2024-10-26 19:35:06

Depending on the type of game, the definition of "difficult" will change. If your game involves typing quickly (ztype-style...), "difficult" will have a different meaning than in a game where you need to define a word's meaning.

That said, Scrabble has a way to measure how "difficult" a word is, which is also quite easy to compute algorithmically.

You may also look into defining "difficult" in terms of your game. You could beta-test your game and classify words according to how "difficult" players find them in the context of your own game.
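If the beta test logs every attempt, one simple way to turn that into a per-word difficulty is the fraction of attempts players got wrong. A minimal sketch, assuming a hypothetical log of `(word, solved)` pairs; what counts as "solved" depends on the game.

```python
from collections import defaultdict

def failure_rates(events):
    """Map each word to the fraction of attempts that failed.

    events: iterable of (word, solved) pairs, solved being True or False.
    """
    attempts = defaultdict(int)
    failures = defaultdict(int)
    for word, solved in events:
        attempts[word] += 1
        if not solved:
            failures[word] += 1
    return {word: failures[word] / attempts[word] for word in attempts}

# Made-up play-test log.
log = [("cat", True), ("cat", True), ("rhythm", False), ("rhythm", True)]
print(failure_rates(log))
```

Words with only a handful of attempts will have noisy rates, so a minimum attempt count before trusting a value is probably worth adding.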

流殇 2024-10-26 19:35:06

There are several factors that relate to word difficulty, including age of acquisition, imageability, concreteness, abstractness, number of syllables, and frequency (spoken and written). There are also psycholinguistic databases that let you search for words by at least some of these factors (just do a search for "psycholinguistic database").

浮萍、无处依 2024-10-26 19:35:06

Word frequency is an obvious choice (though of course not perfect). The Google n-grams V2 dataset is available for download and is licensed under the Creative Commons Attribution 3.0 Unported License.

Format: ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINE

Example: [image]

Corpus used (from Lin, Yuri, et al. "Syntactic annotations for the Google Books ngram corpus." Proceedings of the ACL 2012 System Demonstrations. Association for Computational Linguistics, 2012.): [image]
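Assuming tab-separated lines in the format given above, a word's overall frequency can be approximated by summing `match_count` over all years. The sketch below reads only the first three fields, so the trailing count columns do not matter; the sample lines are made up for illustration and `total_counts` is a hypothetical name.

```python
from collections import defaultdict

def total_counts(lines):
    """Sum match_count per 1-gram across all years, case-folded."""
    totals = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        ngram, _year, match_count = fields[:3]
        totals[ngram.lower()] += int(match_count)
    return dict(totals)

# Made-up sample lines in the stated format.
sample = [
    "analysis\t1991\t120\t90\t30\n",
    "analysis\t1992\t150\t110\t35\n",
    "Analysis\t1992\t10\t9\t4\n",
    "zymurgy\t1992\t2\t2\t1\n",
]
print(total_counts(sample))
```

For the real files, the same function can be fed an open file handle, since iterating a text file yields lines.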

梅倚清风 2024-10-26 19:35:06

Word length is a good indicator. For word frequency you would need data, as an algorithm obviously cannot determine it by itself.
You could also use some sort of scoring like Scrabble does: each letter has a value, and the word's score is the sum of those values.
It would, in my opinion, be easier to find frequency data about each letter in your language.
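A Scrabble-style score is easy to sketch. The letter values below are those of the standard English-language Scrabble tile set; whether a high score actually tracks difficulty in a given game is an assumption that would need checking.

```python
# Standard English Scrabble letter values, grouped by point value.
_GROUPS = {
    1: "aeilnorstu",
    2: "dg",
    3: "bcmp",
    4: "fhvwy",
    5: "k",
    8: "jx",
    10: "qz",
}
LETTER_VALUES = {ch: pts for pts, letters in _GROUPS.items() for ch in letters}

def scrabble_score(word):
    """Sum the tile values of a word's letters; non-letters score 0."""
    return sum(LETTER_VALUES.get(ch, 0) for ch in word.lower())

for w in ["cat", "quiz", "jazz"]:
    print(w, scrabble_score(w))
```

Since the tile values roughly follow letter rarity, this is close in spirit to scoring a word by the frequency of its letters.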

今天小雨转甜 2024-10-26 19:35:06

In his article on spelling correction, Peter Norvig uses a dictionary to count the number of occurrences of each word (and thus determine their frequency).

You could use this as a stepping stone :)

Also, frequency should probably influence the difficulty more than length does... you would have to beta-test the game to check that.
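To make the frequency-plus-length idea concrete, here is a minimal sketch that scores each word by rarity (log of inverse frequency) plus a smaller length term, then cuts the sorted list into equal-sized levels. The weights 1.0 and 0.1 are arbitrary assumptions meant to be tuned by play-testing, the counts are made up, and `difficulty` and `assign_levels` are hypothetical names.

```python
import math

def difficulty(word, counts, total, freq_weight=1.0, length_weight=0.1):
    """Higher is harder: rare words and long words both raise the score."""
    rarity = math.log((total + 1) / (counts.get(word, 0) + 1))
    return freq_weight * rarity + length_weight * len(word)

def assign_levels(words, counts, levels=5):
    """Sort words by difficulty and split them into equal-sized levels 1..levels."""
    total = sum(counts.values())
    ranked = sorted(words, key=lambda w: difficulty(w, counts, total))
    n = len(ranked)
    return {w: 1 + (i * levels) // n for i, w in enumerate(ranked)}

# Made-up corpus counts for illustration.
counts = {"the": 1000, "of": 800, "and": 700, "house": 100, "cat": 80,
          "garden": 60, "window": 50, "thorough": 10, "ubiquitous": 2,
          "perspicacious": 1}
for word, level in sorted(assign_levels(list(counts), counts).items(),
                          key=lambda item: item[1]):
    print(level, word)
```

Equal-sized buckets guarantee that every level has words; cutting at fixed score thresholds instead would keep levels comparable across word lists but could leave some levels sparse.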

月寒剑心 2024-10-26 19:35:06

In addition to metrics such as Flesch-Kincaid, you could try an approach based on the Dale-Chall readability formula, using lists of words that are familiar to readers of a particular level of ability.

Implementations of many of the readability formulae contain code for estimating the number of syllables in a word, which may also be useful.
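Syllable estimators of that kind usually count vowel groups and patch a few common exceptions. The following is a crude heuristic sketch, not the code of any particular readability library, and it will miscount plenty of English words.

```python
import re

def estimate_syllables(word):
    """Rough syllable estimate: count vowel groups, discount a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # A trailing 'e' is usually silent ("make"), but not in "-le" ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

for w in ["cat", "make", "table", "banana", "difficulty"]:
    print(w, estimate_syllables(w))
```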

多孤肩上扛 2024-10-26 19:35:06

I would guess that the grade at which a word is introduced into a typical student's vocabulary is a measure of difficulty. Next would be how many violations of standard rules it has, meaning words whose spelling or pronunciation seems to violate the normal set of rules. Finally, the meaning itself can be a tough concept: for example, try explaining "abstract" to someone who has never heard the word.

萌化 2024-10-26 19:35:06

Without claiming to know anything about their algorithm, there is an API that returns word difficulty on a 1-10 scale: TwinWord API.

I have never used it myself, though.
