Theory: "Lexical Encoding"

Posted 2024-07-06 21:02:23

I am using the term "Lexical Encoding" for lack of a better one.

A Word is arguably the fundamental unit of communication, as opposed to a Letter. Unicode tries to assign a numeric value to each Letter of all known Alphabets. What is a Letter to one language is a Glyph to another. Unicode 5.1 currently assigns more than 100,000 values to these Glyphs. Of the approximately 180,000 Words in use in Modern English, it is said that with a vocabulary of about 2,000 Words you should be able to converse in general terms. A "Lexical Encoding" would encode each Word, not each Letter, and encapsulate them within a Sentence.

// A simplified example of a "Lexical Encoding"
// (QUERY is a placeholder constant standing in for the question mark)
static final int QUERY = 1000; // hypothetical value
String sentence = "How are you today?";
int[] encoded = { 93, 22, 14, 330, QUERY }; // one integer code per token

In this example, each Token in the String is encoded as an Integer. The Encoding Scheme here simply assigns an int value based on a generalised statistical ranking of word usage, and assigns a constant to the question mark.
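
Concretely, such a scheme is little more than a lookup table from words to rank values. Below is a minimal sketch of that idea; the class name, the rank values, and the QUERY constant are hypothetical stand-ins, and a real table would be derived from corpus statistics such as the wordcount.org data linked below.

import java.util.List;
import java.util.Map;

public class LexicalEncoder {
    static final int QUERY = 1000; // hypothetical code reserved for '?'

    // Hypothetical usage-ranking table: word -> statistical rank.
    static final Map<String, Integer> RANK =
            Map.of("how", 93, "are", 22, "you", 14, "today", 330);

    static int[] encode(List<String> tokens) {
        int[] codes = new int[tokens.size()];
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i).toLowerCase();
            codes[i] = t.equals("?") ? QUERY : RANK.getOrDefault(t, -1); // -1 = unknown word
        }
        return codes;
    }

    public static void main(String[] args) {
        // Prints [93, 22, 14, 330, 1000]
        System.out.println(java.util.Arrays.toString(
                encode(List.of("How", "are", "you", "today", "?"))));
    }
}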

Ultimately, though, a Word has both a Spelling and a Meaning. Any "Lexical Encoding" would preserve the meaning and intent of the Sentence as a whole, and not be language-specific. An English sentence would be encoded into "...language-neutral atomic elements of meaning..." which could then be reconstituted into any language with a structured Syntactic Form and Grammatical Structure.

What are other examples of "Lexical Encoding" techniques?


If you are interested in where the word-usage statistics come from:
http://www.wordcount.org

Comments (8)

腹黑女流氓 2024-07-13 21:03:07

Actually, you only need about 600 words for a half-decent vocabulary.

在巴黎塔顶看东京樱花 2024-07-13 21:03:00

This is an interesting little exercise, but I would urge you to consider it nothing more than an introduction to the concept of the difference in natural language between types and tokens.

A type is the single form of a word that represents all of its instances. A token is a single count for each instance of the word. Let me explain this with the following example:

"John went to the bread store. He bought the bread."

Here are some frequency counts for this example, with the counts meaning the number of tokens:

John: 1
went: 1
to: 1
the: 2
store: 1
he: 1
bought: 1
bread: 2

Note that "the" is counted twice--there are two tokens of "the". However, note that while there are ten words, there are only eight of these word-to-frequency pairs. Words being broken down to types and paired with their token count.

Types and tokens are useful in statistical NLP. "Lexical encoding", on the other hand, I would watch out for. It is a segue into much more old-fashioned approaches to NLP, where preprogramming and rationalism abound. I don't know of any statistical MT system that actually assigns a specific "address" to a word. There are too many relationships between words, for one thing, to build any kind of well-thought-out numerical ontology, and if we're just throwing numbers at words to categorize them, we should be thinking about things like memory management and allocation for speed.

I would suggest checking out NLTK, the Natural Language Toolkit, written in Python, for a more extensive introduction to NLP and its practical uses.

我也只是我 2024-07-13 21:02:54

This is an interesting question, but I suspect you are asking it for the wrong reasons. Are you thinking of this 'lexical Unicode' as something that would allow you to break down sentences into language-neutral atomic elements of meaning and then be able to reconstitute them in some other concrete language? As a means to achieve a universal translator, perhaps?

Even if you can encode and store, say, an English sentence using a 'lexical Unicode', you cannot expect to read it and magically render it in, say, Chinese with the meaning kept intact.

Your analogy to Unicode, however, is very useful.

Bear in mind that Unicode, whilst a 'universal' code, does not embody the pronunciation, meaning or usage of the character in question. Each code point refers to a specific glyph in a specific language (or rather the script used by a group of languages). It is elemental at the visual-representation level of a glyph (within the bounds of style, formatting and fonts). The Unicode code point for the Latin letter 'A' is just that. It is the Latin letter 'A'. It cannot automagically be rendered as, say, the Arabic letter Alif (ﺍ) or the Indic (Devanagari) letter 'A' (अ).

Keeping to the Unicode analogy, your Lexical Unicode would have code points for each word (word form) in each language. Unicode has ranges of code points for a specific script. Your lexical Unicode would have to have a range of codes for each language. Different words in different languages, even if they have the same meaning (synonyms), would have to have different code points. The same word having different meanings, or different pronunciations (homonyms), would have to have different code points.

In Unicode, for some languages (but not all) where the same character has a different shape depending on its position in the word - e.g. in Hebrew and Arabic, the shape of a glyph changes at the end of the word - it has a different code point. Likewise in your Lexical Unicode, if a word has a different form depending on its position in the sentence, it may warrant its own code point.

Perhaps the easiest way to come up with code points for the English language would be to base your system on, say, a particular edition of the Oxford English Dictionary and assign a unique code to each word sequentially. You will have to use a different code for each different meaning of the same word, and you will have to use a different code for different forms - e.g. if the same word can be used as a noun and as a verb, then you will need two codes.
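
A minimal sketch of that sequential assignment, with hypothetical (headword, sense) pairs standing in for real dictionary entries:

import java.util.LinkedHashMap;
import java.util.Map;

public class LexicalCodeSpace {
    private final Map<String, Integer> codePoints = new LinkedHashMap<>();
    private int next = 0;

    // Each distinct (headword, sense) pair gets its own sequential code point.
    int assign(String headword, String sense) {
        return codePoints.computeIfAbsent(headword + "#" + sense, k -> next++);
    }

    public static void main(String[] args) {
        LexicalCodeSpace dictionary = new LexicalCodeSpace();
        // The same spelling with different senses gets different codes:
        System.out.println(dictionary.assign("wind", "noun: moving air"));   // 0
        System.out.println(dictionary.assign("wind", "verb: turn a crank")); // 1
        System.out.println(dictionary.assign("wind", "noun: moving air"));   // 0 again
    }
}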

Then you will have to do the same for each other language you want to include - using the most authoritative dictionary for that language.

Chances are that this exercise is all more effort than it is worth. If you decide to include all the world's living languages, plus some historic dead ones and some fictional ones - as Unicode does - you will end up with a code space so large that your code would have to be extremely wide to accommodate it. You will not gain anything in terms of compression - it is likely that a sentence represented as a String in the original language would take up less space than the same sentence represented as codes.

P.S. For those who are saying this is an impossible task because word meanings change, I do not see that as a problem. To use the Unicode analogy, the usage of letters has changed (admittedly not as rapidly as the meanings of words), but it is of no concern to Unicode that 'th' used to be pronounced like 'y' in the Middle Ages. Unicode has code points for 't', 'h' and 'y', and they each serve their purpose.

P.P.S. Actually, it is of some concern to Unicode that 'oe' is also 'œ', or that 'ss' can be written 'ß' in German.

ㄖ落Θ余辉 2024-07-13 21:02:49

As a translation scheme, this is probably not going to work without a lot more work. You'd like to think that you could assign a number to each word, then mechanically translate that to another language. In reality, languages have the problem of multiple words that are spelled the same: "the wind blew her hair back" versus "wind your watch".

For transmitting text, where you'd presumably have an alphabet per language, it would work fine, although I wonder what you'd gain there as opposed to using a variable-length dictionary, like ZIP uses.
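
To make the homograph point concrete, here is a minimal sketch (the code value is hypothetical): a table keyed on spelling alone cannot tell the two senses of "wind" apart.

import java.util.Map;

public class HomographProblem {
    // Hypothetical spelling-keyed code table.
    static final Map<String, Integer> CODE = Map.of("wind", 512);

    public static void main(String[] args) {
        // Both senses collapse onto the same code point:
        System.out.println(CODE.get("wind")); // "the wind blew her hair back" -> 512
        System.out.println(CODE.get("wind")); // "wind your watch"             -> 512
    }
}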

小清晰的声音 2024-07-13 21:02:42

How would the system handle pluralization of nouns or conjugation of verbs? Would these each have their own "Unicode" value?

诗笺 2024-07-13 21:02:39

It's easy enough to invent one for yourself. Turn each word into a canonical bytestream (say, lower-case decomposed UCS32), then hash it down to an integer. 32 bits would probably be enough, but if not then 64 bits certainly would.

Before you ding me for giving you a snarky answer, consider that the purpose of Unicode is simply to assign each glyph a unique identifier. Not to rank or sort or group them, but just to map each one onto a unique identifier that everyone agrees on.
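
A minimal sketch of that recipe, hashing UTF-8 bytes rather than UCS32 for brevity, and using FNV-1a as one possible 64-bit hash (the answer does not prescribe a specific one):

import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Locale;

public class WordHash {
    static long encode(String word) {
        // Canonical form: lower-case, canonically decomposed (NFD).
        String canonical = Normalizer.normalize(
                word.toLowerCase(Locale.ROOT), Normalizer.Form.NFD);
        long hash = 0xcbf29ce484222325L;                  // FNV-1a offset basis
        for (byte b : canonical.getBytes(StandardCharsets.UTF_8)) {
            hash = (hash ^ (b & 0xff)) * 0x100000001b3L;  // FNV-1a prime
        }
        return hash;
    }

    public static void main(String[] args) {
        // Same canonical form, same identifier:
        System.out.printf("%016x%n", encode("Façade"));
        System.out.printf("%016x%n", encode("façade"));
    }
}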

烟燃烟灭 2024-07-13 21:02:36

There are several major problems with this idea. In most languages, the meaning of a word, and the word associated with a meaning, change very swiftly.

No sooner would you have a number assigned to a word than the meaning of the word would change. For instance, the word "gay" used to mean only "happy" or "merry", but it is now used mostly to mean homosexual. Another example is the morpheme "thank you", which originally came from German "danke", which is just one word. Yet another example is "goodbye", which is a shortening of "God be with ye".

Another problem is that even if one takes a snapshot of a word at a given point in time, the meaning and usage of the word would be under contention, even within the same region. When dictionaries are being written, it is not uncommon for the academics responsible to argue over a single word.

In short, you wouldn't be able to do it with an existing language. You would have to consider inventing a language of your own for the purpose, or using a fairly static language that has already been invented, such as Interlingua or Esperanto. However, even these would not be perfect for the purpose of defining static morphemes in a permanently standard lexicon.

Even in Chinese, where there is a rough mapping of character to meaning, it still would not work. Many characters change their meanings depending on both context and on which characters precede or follow them.

The problem is at its worst when you try to translate between languages. There may be one word in English that can be used in various cases but cannot be directly used in another language. An example of this is "free". In Spanish, either "libre", meaning "free" as in speech, or "gratis", meaning "free" as in beer, can be used (and using the wrong word in place of "free" would look very funny).

There are other words which are even more difficult to place a meaning on, such as the word "beautiful" in Korean; when calling a girl beautiful, there would be several candidates for substitution; but when calling food beautiful, unless you mean the food is good-looking, there are several other candidates which are completely different.

What it comes down to is that although we only use about 200k words in English, our vocabularies are actually larger in some respects because we assign many different meanings to the same word. The same problems apply to Esperanto and Interlingua, and to every other language meaningful for conversation. Human speech is not a well-defined, well-oiled machine. So, although you could create such a lexicon where each "word" had its own unique meaning, it would be very difficult, and nigh on impossible, for machines using current techniques to translate from any human language into your special standardised lexicon.

This is why machine translation still sucks, and will for a long time to come. If you can do better (and I hope you can), then you should probably consider doing it with some sort of scholarship and/or university/government funding, working towards a PhD; or simply make a heap of money, whatever keeps your ship steaming.

唔猫 2024-07-13 21:02:33

This question impinges on linguistics more than programming, but for languages which are highly synthetic (having words composed of multiple combined morphemes), it can be a highly complex problem to try to "number" all possible words, as opposed to languages like English, which are at least somewhat isolating, or languages like Chinese, which are highly analytic.

That is, words may not be easily broken down and counted based on their constituent glyphs in some languages.

This Wikipedia article on Isolating languages may be helpful in explaining the problem.
