Mapping words to numbers according to their definitions
As part of a larger project, I need to read in text and represent each word as a number. For example, if the program reads in "Every good boy deserves fruit", then I would get a table that converts 'every' to '1742', 'good' to '977513', etc.
Now, obviously I can just use a hashing algorithm to get these numbers. However, it would be more useful if words with similar meanings had numerical values close to each other, so that 'good' becomes '6827' and 'great' becomes '6835', etc.
As another option, instead of a simple integer representing each word, it would be even better to have a vector made up of multiple numbers, e.g. (lexical_category, tense, classification, specific_word), where lexical_category is noun/verb/adjective/etc., tense is future/past/present, classification defines a wide set of general topics, and specific_word is much the same as described in the previous paragraph.
Does any such algorithm exist? If not, can you give me any tips on how to get started on developing one myself? I code in C++.
Your idea is interesting, if a bit naive (but no worries, naive questions are useful in the area of NLP).
Leaving other practical questions aside (e.g. parsing, POS-tagging, stemming, and of course the very issue of identifying/mapping a given word; I discuss them, very briefly, below), there are several difficulties with the very principle of your suggestion [of a numeric scale where semantically close words are coded in proximity]:
In a nutshell
a) meaning (or "definition", as called in the question, or "semantics" as called by linguists) is a tricky thing which doesn't lend itself to being mapped onto a line, or even a tree. Other graphs such as networks can be used, but even then things can get a bit tricky when applied beyond relatively restricted domains.
and
b) associating words with meanings is also tricky because of polysemy, expressions etc.
Nevertheless, if you'd like to attempt the kind of mapping suggested in the question, maybe in the context of a specific domain (say, sports commentary or mechanical repairs) and/or with the understanding that some words will just have to be mapped arbitrarily, you may want to get familiar with the following NLP (Natural Language Processing) disciplines and resources before diving in:
including their annotated resource list for statistical and Corpus-based NLP software
With regard to your interest in using tools written in C++, you'll probably find several of these, for various purposes (and of varying quality!). You may also find that, although they sometimes bind to primitives written in C/C++ for performance reasons, many of the modern NLP frameworks and tools tend to use Java or even scripting languages like Python. I do not have direct experience with C++-based NLP software. If you do not find what you need in C++, I strongly discourage you from trying to implement something yourself, at least until you have extensively reviewed the prior art and have a good understanding of the underlying difficulties.
To map a word to a number, you should probably just use an index. Using hash codes is asking for trouble, since completely unrelated words can collide and end up with the same value.
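Since the question mentions C++, here is a minimal sketch of what such an index could look like. The `Vocabulary` class and its member names are made up for illustration; it simply hands out the next unused integer the first time a word is seen, so two different words can never share an ID.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not from any library): a two-way word <-> ID table.
class Vocabulary {
public:
    // Returns the ID of `word`, assigning the next free ID on first sight.
    std::size_t id_of(const std::string& word) {
        auto it = ids_.find(word);
        if (it != ids_.end()) {
            return it->second;
        }
        const std::size_t id = words_.size();
        ids_.emplace(word, id);
        words_.push_back(word);
        return id;
    }

    // Reverse lookup; throws std::out_of_range for an unknown ID.
    const std::string& word_of(std::size_t id) const { return words_.at(id); }

    // Number of distinct words seen so far.
    std::size_t size() const { return words_.size(); }

private:
    std::unordered_map<std::string, std::size_t> ids_;  // word -> ID
    std::vector<std::string> words_;                    // ID -> word
};
```

Note that the IDs carry no meaning: they reflect only the order in which words were encountered, so semantic closeness has to come from one of the measures discussed next.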
There are a number of ways to get a numerical measure of how semantically related words are, such as latent semantic analysis (LSA) or using some measure of relatedness within a lexical resource like WordNet (e.g. Lin, Resnik, or Jiang-Conrath).
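With a vector-based method such as LSA, each word ends up as a vector of real numbers, and relatedness is commonly measured as the cosine of the angle between two such vectors. The sketch below shows only that comparison step; it assumes the vectors have already been produced elsewhere, and `cosine_similarity` is a name chosen for this example.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Cosine of the angle between two equal-length word vectors.
// Close to 1 means the vectors point the same way; 0 means orthogonal.
double cosine_similarity(const std::vector<double>& a,
                         const std::vector<double>& b) {
    if (a.size() != b.size()) {
        throw std::invalid_argument("vectors must have equal length");
    }
    double dot = 0.0;
    double norm_a = 0.0;
    double norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    // Convention assumed here: an all-zero vector is similar to nothing.
    if (norm_a == 0.0 || norm_b == 0.0) {
        return 0.0;
    }
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

If 'good' and 'great' occur in similar contexts, their vectors should score noticeably higher against each other than either does against 'fruit'. This is the vector-space counterpart of the "6827 versus 6835" idea in the question.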
To get what you're calling lexical categories, you'll need to use a part-of-speech (POS) tagger. The POS tags will also give you tense information (e.g., in the Penn Treebank tag set, VBD marks a past-tense verb, while VBP marks a non-third-person singular present-tense verb).
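Once a tagger has labelled each word, turning the tag into the (lexical_category, tense) part of the vector from the question is a small lookup. Here is a rough sketch assuming Penn Treebank tags as input; the enums and `classify_tag` are hypothetical names, and the tagging itself is assumed to happen in some external tool.

```cpp
#include <string>

enum class Category { Noun, Verb, Adjective, Adverb, Other };
enum class Tense { None, Past, Present };

struct WordClass {
    Category category;
    Tense tense;
};

// True if `tag` begins with `prefix`.
static bool has_prefix(const std::string& tag, const std::string& prefix) {
    return tag.compare(0, prefix.size(), prefix) == 0;
}

// Maps a Penn Treebank POS tag to a coarse category and tense.
WordClass classify_tag(const std::string& tag) {
    if (tag == "VBD") return {Category::Verb, Tense::Past};     // past tense
    if (tag == "VBP" || tag == "VBZ") {
        return {Category::Verb, Tense::Present};                // present tense
    }
    // VB, VBG, VBN: base form and participles carry no tense by themselves.
    if (has_prefix(tag, "VB")) return {Category::Verb, Tense::None};
    if (has_prefix(tag, "NN")) return {Category::Noun, Tense::None};
    if (has_prefix(tag, "JJ")) return {Category::Adjective, Tense::None};
    if (has_prefix(tag, "RB")) return {Category::Adverb, Tense::None};
    return {Category::Other, Tense::None};
}
```

There is deliberately no future tense here: English marks the future with a modal ("will", tagged MD) followed by a base-form verb, so it cannot be read off a single word's tag.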
To assign words to topics, you could make use of hypernym information from WordNet. This will tell you, for example, that 'red' is a 'color'. Alternatively, you could use latent Dirichlet allocation (LDA) if you would like a softer assignment of words to topics, such that each word can be assigned to numerous topics to varying degrees.
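To make the hypernym idea concrete, here is a toy sketch that walks up an "is a" table from a word to progressively broader terms. The table type and `hypernym_chain` are invented for this example, and any entries you feed it are your own data; real WordNet hierarchies are larger and attach hypernyms to word senses rather than to bare strings.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a lexical resource: word -> its direct hypernym.
using HypernymTable = std::unordered_map<std::string, std::string>;

// Returns the chain of hypernyms of `word`, nearest first.
std::vector<std::string> hypernym_chain(const HypernymTable& table,
                                        const std::string& word) {
    std::vector<std::string> chain;
    std::string current = word;
    // A valid chain can never be longer than the table, so this bound
    // also stops the walk if the data accidentally contains a cycle.
    for (std::size_t steps = 0; steps < table.size(); ++steps) {
        auto it = table.find(current);
        if (it == table.end()) {
            break;
        }
        chain.push_back(it->second);
        current = it->second;
    }
    return chain;
}
```

With a table containing 'crimson' -> 'red' and 'red' -> 'color', looking up 'crimson' yields 'red' followed by 'color'. Picking a fixed depth in that chain gives one crude way to fill the classification slot of the vector described in the question.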
Natural Language Processing is a broad and complex field. There are some tools out there (see Software Tools section of linked article), with the predominant one probably being NLTK.
I don't know of an easy answer, but that's a place to start.
This is part of a more general problem called "Meaning Representation". I am interested in this problem, but the fact is that words are often too ambiguous to be represented as numbers. I think sentences might be a better candidate, because at least some context is present. Even then, representing text as numbers is more a research issue than a coding issue.
For words, as dmcer pointed out, LSA/PLSA/LDA will be your best bet if you really want to map words to numbers. In this case, though, you will get real numbers, not integers. There is a large body of work on topic models and on how semantically related words can be grouped together under a single topic (topic models are nothing but probabilistic clusterings of words). Notably, LSA representations have been used in the past to model semantic memory (search Google Scholar for "Lemaire and Denhiere" for reference). However, as mjv indicated, the domain has to be restricted/specialized so that the problem size does not get out of hand.
Finally, I personally think that there might be an underlying structure of words that you can use for representing them as numbers. Explicit representations of sentences, e.g. predicates, have their own problems related to the ordering of POS, clauses, etc. But words do not necessarily have to deal with these issues, so there might still be some hope. You might be interested in the following pointers:
1> Representation Theory
2> Universal Networking Language (language as a hypergraph of words where sentences are hyperedges)
3> Kolmogorov Complexity and Representational Distortion
4> Group Theory and Graph Theory (there are many interesting representations that might be used)
5> A review of Number Theory (to see if particular categories of numbers can be associated to particular categories of words)
Risi Kondor's thesis is also interesting.