Generating tags from text content

Posted on 2024-08-28 23:22:14


I am curious whether an algorithm/method exists to generate keywords/tags from a given text, using weight calculations, occurrence ratios or other tools.

Additionally, I would be grateful if you could point out any Python-based solution/library for this.

Thanks


Comments (5)

风和你 2024-09-04 23:22:14

One way to do this would be to extract words that occur more frequently in a document than you would expect them to by chance. For example, say in a larger collection of documents the term 'Markov' is almost never seen. However, in a particular document from the same collection Markov shows up very frequently. This would suggest that Markov might be a good keyword or tag to associate with the document.

To identify keywords like this, you could use the point-wise mutual information of the keyword and the document. This is given by PMI(term, doc) = log [ P(term, doc) / (P(term)*P(doc)) ]. This will roughly tell you how much less (or more) surprised you are to come across the term in the specific document as opposed to coming across it in the larger collection.

To identify the 5 best keywords to associate with a document, you would just sort the terms by their PMI score with the document and pick the 5 with the highest score.
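
For illustration, here is a minimal sketch of that scoring (not from the original answer; the toy collection and the simple count-based probability estimates are assumptions):

import math
from collections import Counter

# hypothetical toy collection: each document is a list of tokens
docs = [
    "the markov chain model converges the markov property holds".split(),
    "the weather today is sunny and warm".split(),
    "football results and weather for today".split(),
]

total_tokens = sum(len(d) for d in docs)
collection_counts = Counter(tok for d in docs for tok in d)

def pmi(term, doc):
    # PMI(term, doc) = log [ P(term, doc) / (P(term)*P(doc)) ],
    # with all probabilities estimated from raw counts
    p_term_doc = Counter(doc)[term] / total_tokens   # occurrences of the term inside this doc
    p_term = collection_counts[term] / total_tokens  # occurrences of the term in the whole collection
    p_doc = len(doc) / total_tokens                  # this doc's share of all tokens
    return math.log(p_term_doc / (p_term * p_doc))

# the 5 candidate keywords with the highest PMI for the first document
doc = docs[0]
print(sorted(set(doc), key=lambda term: pmi(term, doc), reverse=True)[:5])

Terms that are frequent in the document but rare in the rest of the collection receive the highest scores.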

If you want to extract multiword tags, see the StackOverflow question How to extract common / significant phrases from a series of text entries.

Borrowing from my answer to that question, the NLTK collocations how-to covers how to extract interesting multiword expressions using n-gram PMI in about 7 lines of code, e.g.:

import nltk
from nltk.collocations import BigramCollocationFinder

# nltk.download('genesis')  # uncomment if the sample corpus has not been downloaded yet
bigram_measures = nltk.collocations.BigramAssocMeasures()

# change this to read in your data
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))

# only keep bigrams that appear 3+ times
finder.apply_freq_filter(3)

# return the 5 bigrams with the highest PMI
print(finder.nbest(bigram_measures.pmi, 5))
爱冒险 2024-09-04 23:22:14


First, the key Python library for computational linguistics is NLTK ("Natural Language Toolkit"). This is a stable, mature library created and maintained by professional computational linguists. It also has an extensive collection of tutorials, FAQs, etc. I recommend it highly.

Below is a simple template, in Python code, for the problem raised in your question; although it's a template it runs--supply any text as a string (as I've done) and it will return a list of word frequencies as well as a ranked list of those words in order of 'importance' (or suitability as keywords) according to a very simple heuristic.

Keywords for a given document are (obviously) chosen from among important words in a document--ie, those words that are likely to distinguish it from another document. If you had no a priori knowledge of the text's subject matter, a common technique is to infer the importance or weight of a given word/term from its frequency, or importance = 1/frequency.

text = """ The intensity of the feeling makes up for the disproportion of the objects.  Things are equal to the imagination, which have the power of affecting the mind with an equal degree of terror, admiration, delight, or love.  When Lear calls upon the heavens to avenge his cause, "for they are old like him," there is nothing extravagant or impious in this sublime identification of his age with theirs; for there is no other image which could do justice to the agonising sense of his wrongs and his despair! """

BAD_CHARS = ".!?,\'\""

# transform text into a list words--removing punctuation and filtering small words
words = [ word.strip(BAD_CHARS) for word in text.strip().split() if len(word) > 4 ]

word_freq = {}

# generate a 'word histogram' for the text--ie, a list of the frequencies of each word
for word in words :
  word_freq[word] = word_freq.get(word, 0) + 1

# sort the word list by frequency 
# (just a DSU sort, there's a python built-in for this, but i can't remember it)
tx = [ (v, k) for (k, v) in word_freq.items()]
tx.sort(reverse=True)
word_freq_sorted = [ (k, v) for (v, k) in tx ]

# eg, what are the most common words in that text?
print(word_freq_sorted)
# returns: [('which', 4), ('other', 4), ('like', 4), ('what', 3), ('upon', 3)]
# obviously using a text larger than 50 or so words will give you more meaningful results

term_importance = lambda word : 1.0/word_freq[word]

# select document keywords from the words at/near the top of this list:
map(term_importance, word_freq.keys())
平定天下 2024-09-04 23:22:14


http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation tries to represent each document in a training corpus as a mixture of topics, which in turn are distributions mapping words to probabilities.

I had used it once to dissect a corpus of product reviews into the latent ideas that were being spoken about across all the documents, such as 'customer service', 'product usability', etc. The basic model does not advocate a way to convert the topic models into a single word describing what a topic is about, but people have come up with all kinds of heuristics to do that once their model is trained.
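
For instance, here is a minimal gensim-based sketch of one such heuristic (not part of the original answer; the toy review snippets are made up), describing each trained topic by its highest-probability words:

from gensim import corpora
from gensim.models import LdaModel

# toy preprocessed reviews: each entry is a list of tokens
texts = [
    ["customer", "service", "was", "slow", "but", "friendly"],
    ["the", "product", "is", "easy", "to", "use", "and", "install"],
    ["service", "team", "resolved", "my", "issue", "quickly"],
]

dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=2, passes=10)

# heuristic: label each topic with its top 3 words
for topic_id in range(lda.num_topics):
    top_words = [word for word, _ in lda.show_topic(topic_id, topn=3)]
    print(topic_id, top_words)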

I recommend you try playing with http://mallet.cs.umass.edu/ and seeing if this model fits your needs.

LDA is a completely unsupervised algorithm, meaning it doesn't require you to hand-annotate anything, which is great; but on the flip side, it might not deliver the topics you were expecting it to give.

漫漫岁月 2024-09-04 23:22:14


A very simple solution to the problem would be:

  • count the occurrences of each word in the text
  • consider the most frequent terms as the key phrases
  • have a black-list of 'stop words' to remove common words like the, and, it, is etc (a minimal sketch of this follows below)
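
A minimal sketch of that approach (the sample sentence and the tiny stop-word black-list are assumptions, not part of the original answer):

from collections import Counter

STOP_WORDS = {"the", "and", "it", "is", "a", "of", "to", "in", "that", "was"}  # tiny illustrative black-list

def simple_tags(text, num_tags=5):
    # count the occurrences of each word, ignoring case, punctuation and stop words
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOP_WORDS)
    # the most frequent remaining terms become the key phrases
    return [word for word, _ in counts.most_common(num_tags)]

print(simple_tags("The cat sat on the mat and the cat slept, and the mat was warm."))
# -> ['cat', 'mat', ...] for this toy sentence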

I'm sure there are cleverer, stats-based solutions though.

If you need a solution to use in a larger project rather than for interest's sake, Yahoo BOSS has a key term extraction method.

十秒萌定你 2024-09-04 23:22:14


Latent Dirichlet allocation or Hierarchical Dirichlet Process can be used to generate tags for individual texts within a greater corpus (body of texts) by extracting the most important words from the derived topics.

A basic example would be if we were to run LDA over a corpus and define it to have two topics, and then find that a text in the corpus is 70% one topic and 30% the other. The top 70% of the words that define the first topic and the top 30% that define the second (without duplication) could then be considered as tags for the given text. This method provides strong results where tags generally represent the broader themes of the given texts.

A general reference for the preprocessing needed for this code can be found here; with that in place, we can find tags through the following process using gensim.

A heuristic way of deriving the optimal number of topics for LDA is found in this answer. Although HDP does not require the number of topics as an input, the standard in such cases is still to use LDA with a derived topic number, as HDP can be problematic. Assume here that the corpus is found to have 10 topics, and we want 5 tags per text:

from gensim.models import LdaModel, HdpModel
from gensim import corpora
num_topics = 10
num_tags = 5

Assume further that we have a variable corpus, which is a preprocessed list of lists, with the sublist entries being word tokens. Initialize a Dirichlet dictionary and create a bag of words where texts are converted to the indexes of their component tokens (words):

dirichlet_dict = corpora.Dictionary(corpus)
bow_corpus = [dirichlet_dict.doc2bow(text) for text in corpus]

Create an LDA or HDP model:

dirichlet_model = LdaModel(corpus=bow_corpus,
                           id2word=dirichlet_dict,
                           num_topics=num_topics,
                           update_every=1,
                           chunksize=len(bow_corpus),
                           passes=20,
                           alpha='auto')

# dirichlet_model = HdpModel(corpus=bow_corpus, 
#                            id2word=dirichlet_dict,
#                            chunksize=len(bow_corpus))

The following code produces ordered lists of the most important words per topic (note that this is where num_tags defines the desired number of tags per text):

shown_topics = dirichlet_model.show_topics(num_topics=num_topics, 
                                           num_words=num_tags,
                                           formatted=False)
model_topics = [[word[0] for word in topic[1]] for topic in shown_topics]

Then find the coherence of the topics across the texts:

topic_corpus = dirichlet_model.__getitem__(bow=bow_corpus, eps=0) # cutoff probability to 0 
topics_per_text = [text for text in topic_corpus]

From here we have the percentage that each text coheres to a given topic, and the words associated with each topic, so we can combine them for tags with the following:

ignore_words = []  # fill with any words that should not be used as tags

corpus_tags = []

for doc in range(len(bow_corpus)):
    # The complexity here is to make sure that it works with HDP
    doc_topics = topics_per_text[doc]  # list of (topic_id, coherence) pairs for this text
    significant_topics = [t[0] for t in doc_topics]
    topic_indexes_by_coherence = [tup[0] for tup in sorted(enumerate(doc_topics), key=lambda x: x[1][1], reverse=True)]
    significant_topics_by_coherence = [significant_topics[t] for t in topic_indexes_by_coherence]

    ordered_topics = [model_topics[t] for t in significant_topics_by_coherence][:num_topics]  # subset for HDP
    ordered_topic_coherences = [doc_topics[t][1] for t in topic_indexes_by_coherence][:num_topics]  # subset for HDP

    text_tags = []
    for t in range(len(ordered_topics)):
        # Find the number of indexes to select, which can later be extended if the word has already been selected
        selection_indexes = list(range(int(round(num_tags * ordered_topic_coherences[t]))))
        if selection_indexes == [] and len(text_tags) < num_tags:
            # Fix potential rounding error by giving this topic one selection
            selection_indexes = [0]

        for s_i in selection_indexes:
            if s_i >= len(ordered_topics[t]):
                break  # no more candidate words left for this topic
            # ignore_words is a list of words that should not be included
            if ordered_topics[t][s_i] not in text_tags and ordered_topics[t][s_i] not in ignore_words:
                text_tags.append(ordered_topics[t][s_i])
            else:
                selection_indexes.append(selection_indexes[-1] + 1)

    # Fix for if too many words were selected
    text_tags = text_tags[:num_tags]

    corpus_tags.append(text_tags)

corpus_tags will be a list of tags for each text based on how coherent the text is to the derived topics.
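
For example, assuming the variables defined above, the resulting tags could be inspected like this:

# print the tags suggested for the first few texts
for text_id, tags in enumerate(corpus_tags[:3]):
    print(f"text {text_id}: {tags}")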

See this answer for a similar version of this that generates tags for a whole text corpus.
