What does a single sentence consist of, and how should its parts be named?
I'm designing the architecture of a text parser. Example sentence: Content here, content here.
The whole sentence is a... sentence, that's obvious. "Content", "here" etc. are words; "," and "." are punctuation marks. But what are words and punctuation marks all together, in general? Are they just symbols? I simply don't know how to name, in the most reasonable abstract way, what a single sentence consists of (because one may write that it consists of letters/vowels etc.).
Thanks for any help :)
Comments (3)
What you're doing is technically lexical analysis ("lexing"), which takes a sequence of input symbols and generates a series of tokens or lexemes. So words, punctuation marks and white-space are all tokens.
In (E)BNF terms, lexemes or tokens are synonymous with "terminal symbols". If you think of the set of parsing rules as a tree, the terminal symbols are the leaves of the tree.
So what's the atom of your input? Is it a word or a sentence? If it's words (and white-space) then a sentence is more akin to a parsing rule. In fact the term "sentence" can itself be misleading. It's not uncommon to refer to the entire input sequence as a sentence.
A semi-common term for a sequence of non-white-space characters is a "textrun".
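To make that concrete, here is a minimal lexer sketch in Python. It is only an illustration of the idea described above, not code from the answer; the token names and regular expressions are arbitrary choices made up for the example.

```python
# Minimal lexing sketch: cut the input into a flat stream of tokens
# (words, punctuation marks and white-space), as described above.
# Token categories and patterns are illustrative, not canonical.
import re
from typing import Iterator, NamedTuple

class Token(NamedTuple):
    kind: str    # token category: WORD, PUNCT or SPACE
    lexeme: str  # the raw matched input text

TOKEN_SPEC = [
    ("WORD",  r"[A-Za-z]+"),  # runs of letters
    ("PUNCT", r"[,.;:!?]"),   # single punctuation marks
    ("SPACE", r"\s+"),        # runs of white-space
]
LEXER_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def lex(text: str) -> Iterator[Token]:
    """Yield the tokens of `text` in input order."""
    for match in LEXER_RE.finditer(text):
        yield Token(match.lastgroup, match.group())

for token in lex("Content here, content here."):
    print(token)
# Token(kind='WORD', lexeme='Content'), Token(kind='SPACE', lexeme=' '),
# Token(kind='WORD', lexeme='here'), Token(kind='PUNCT', lexeme=','), ...
```

A parser would then consume this token stream and apply grammar rules such as "sentence", with the tokens acting as the terminal symbols (the leaves of the parse tree).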
A common term comprising the two sub-categories "words" and "punctuation", often used when talking about parsing, is "tokens".
Depending on what stage of your lexical analysis of input text you are looking at, these would be either "lexemes" or "tokens."
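As a rough sketch of that distinction (the staging and the names below are my own illustration, not from the answer): the scanning stage produces lexemes, the raw substrings cut out of the input, and a later stage turns each lexeme into a token by attaching a category to it.

```python
# Illustrative two-stage view: lexemes are the raw matched substrings,
# tokens pair each lexeme with the category the lexer assigns to it.
import re

def scan(text: str) -> list[str]:
    """Stage 1: cut the input into lexemes (raw matched substrings)."""
    return re.findall(r"[A-Za-z]+|[,.]", text)

def classify(lexemes: list[str]) -> list[tuple[str, str]]:
    """Stage 2: attach a category to each lexeme, producing tokens."""
    return [("PUNCT" if lx in ",." else "WORD", lx) for lx in lexemes]

lexemes = scan("Content here, content here.")
tokens = classify(lexemes)
print(lexemes)  # ['Content', 'here', ',', 'content', 'here', '.']
print(tokens)   # [('WORD', 'Content'), ('WORD', 'here'), ('PUNCT', ','), ...]
```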