Should you stem and lemmatize?
I am currently using Python's NLTK to preprocess text data for the Kaggle SMS Spam Classification dataset. I have completed the following preprocessing steps:
- Removed any extra spaces
- Removed punctuation and special characters
- Converted the text to lower case
- Replaced abbreviations such as lol, brb, etc. with their meaning or full form
- Removed stop words
- Tokenized the data
Now I plan to perform lemmatization and stemming separately on the tokenized data, and then compute TF-IDF separately on the lemmatized data and on the stemmed data.
My questions are as follows:
1. Is there a practical use case for lemmatizing the tokenized data and then stemming the lemmatized data, or vice versa?
2. Does the idea of stemming the lemmatized data, or vice versa, make any sense theoretically, or is it completely incorrect?
Context: I am relatively new to NLP, so I am trying to understand as much as I can about these concepts. The main idea behind this question is to understand whether applying lemmatization and stemming together makes any sense theoretically or practically, or whether they should be done separately.
Questions Referenced:
- Should I perform both lemmatization and stemming?: The answer to this question was inconclusive and not accepted; it never discussed why you should or should not do it in the first place.
- What is the difference between lemmatization vs stemming?: Explains the ideas behind stemming and lemmatization, but I could not derive answers to my questions from it.
- Stemmers vs Lemmatizers: Explains the pros and cons, as well as the contexts in which stemming and lemmatization might help.
- NLP Stemming and Lemmatization using Regular expression tokenization: Discusses the different preprocessing steps and performs stemming and lemmatization separately.
Comments (1)
1. Is there a practical use case to perform lemmatization on the tokenized data and then stem that lemmatized data or vice versa?
2. Does the idea of stemming the lemmatized data or vice versa make any sense theoretically, or is it completely incorrect?
Regarding (1): Lemmatisation and stemming do essentially the same thing: they convert an inflected word form to a canonical form, on the assumption that features expressed through morphology (such as word endings) are not important for the use case. If you are not interested in tense, number, voice, etc., then lemmatising or stemming will reduce the number of distinct word forms you have to deal with (as different variations get folded into one canonical form). So without knowing exactly what you want to do, and whether morphological information is relevant to that task, it's hard to answer.
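To make the "folding" concrete, here is a minimal sketch in plain Python. The `TOY_LEMMAS` table and the `normalise` helper are hypothetical and hand-written for illustration; they stand in for whatever real lemmatiser or stemmer you use and are not part of NLTK.

```python
# Hypothetical, hand-written lookup table standing in for a real lemmatiser.
TOY_LEMMAS = {
    "walks": "walk", "walked": "walk", "walking": "walk",
    "wins": "win", "won": "win", "winning": "win",
    "prizes": "prize",
}

def normalise(token):
    """Map an inflected form to its canonical form; unknown tokens pass through."""
    return TOY_LEMMAS.get(token, token)

tokens = "walk walks walked walking win wins won winning prize prizes".split()

raw_vocab = set(tokens)                       # distinct surface forms
norm_vocab = {normalise(t) for t in tokens}   # distinct canonical forms

print(len(raw_vocab), len(norm_vocab))  # 10 3
```

Ten surface forms collapse into three features, which is the effect you would see in the vocabulary of a TF-IDF matrix. The price is that "walked" and "walking" can no longer be told apart.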
Lemmatisation is a linguistically motivated procedure. Its output is a valid word in the target language, but with inflectional endings etc. removed. It is not without information loss, but there are not that many problematic cases. Is "does" a third-person singular auxiliary verb, or the plural of "doe" (a female deer)? Is "building" a noun referring to a structure, or a continuous form of the verb "to build"? What about "housing": a casing for an object (such as an engine), or the process of finding shelter for someone?
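These ambiguous cases are why a lemmatiser usually needs to know the part of speech. The sketch below is a toy illustration only: the `TOY_LEMMAS` table and the `lemmatise` function are hypothetical, keyed on a hand-assigned part-of-speech label. In NLTK, `WordNetLemmatizer.lemmatize` takes a comparable `pos` argument, but this code does not call it.

```python
# Hypothetical lemma table keyed on (word form, part of speech).
TOY_LEMMAS = {
    ("does", "verb"): "do",
    ("does", "noun"): "doe",
    ("building", "verb"): "build",
    ("building", "noun"): "building",
    ("housing", "verb"): "house",
    ("housing", "noun"): "housing",
}

def lemmatise(word, pos):
    """Return the lemma for this word/POS pair; unknown pairs pass through."""
    return TOY_LEMMAS.get((word, pos), word)

for word in ("does", "building", "housing"):
    # The same surface form yields a different lemma depending on its POS.
    print(word, "->", lemmatise(word, "verb"), "/", lemmatise(word, "noun"))
```

If your pipeline lemmatises without part-of-speech tags, every occurrence of such a word gets the same lemma, whichever reading was intended.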
Stemming is a less resource-intensive procedure, but as a trade-off it works with approximations only. You will get less precise results, which might not matter too much in an application such as information retrieval, but if you are at all interested in meaning, then it is probably too coarse a tool. Its output also will not be a word, but a 'stem': basically a character string roughly related to those you get when stemming similar words.
Re (2): no, it doesn't make any sense. Both procedures attempt the same task (normalising inflected words) in different ways, and once you have lemmatised, stemming is pointless. And if you stem first, you generally do not end up with valid words, so lemmatisation would not work anyway.
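A rough sketch of why neither order helps, using toy stand-ins rather than real tools: `crude_stem` is a hypothetical suffix stripper (not the Porter algorithm) and `toy_lemmatise` is a hypothetical dictionary lookup that only knows real word forms.

```python
# Hypothetical lemma dictionary: its keys are valid word forms only.
TOY_LEMMAS = {
    "houses": "house", "housed": "house",
    "studies": "study", "studied": "study",
}

def toy_lemmatise(word):
    """Dictionary lookup; strings that are not known word forms pass through."""
    return TOY_LEMMAS.get(word, word)

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in ("ies", "ied", "es", "ed", "e", "y", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["houses", "housed", "studies", "studied"]

lemma_only = [toy_lemmatise(w) for w in words]
stem_only = [crude_stem(w) for w in words]
lemma_then_stem = [crude_stem(toy_lemmatise(w)) for w in words]
stem_then_lemma = [toy_lemmatise(crude_stem(w)) for w in words]

print(lemma_only)       # ['house', 'house', 'study', 'study']
print(stem_only)        # ['hous', 'hous', 'stud', 'stud']
print(lemma_then_stem)  # ['hous', 'hous', 'stud', 'stud']
print(stem_then_lemma)  # ['hous', 'hous', 'stud', 'stud']
```

In this toy case, stemming after lemmatising just throws away the valid words and lands on the stem-only result, and lemmatising after stemming does nothing because "hous" and "stud" are not in the dictionary. Real stemmers and lemmatisers will differ in detail, but the same reasoning applies.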