How do I use NLP to split unstructured text content into distinct paragraphs?
The following unstructured text has three distinct themes -- Stallone, Philadelphia and the American Revolution. But which algorithm or technique would you use to separate this content into distinct paragraphs?
Classifiers won't work in this situation. I also tried to use Jaccard Similarity analyzer to find distance between successive sentences and tried to group successive sentences into one paragraph if the distance between them was less than a given value. Is there a better method?
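Roughly the kind of grouping I tried is sketched below (simplified; the real tokenization and threshold value differ):

```python
# Rough, simplified reconstruction of my current approach: keep adding successive
# sentences to the same paragraph while their Jaccard distance stays below a threshold.
def jaccard_distance(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - (len(wa & wb) / len(wa | wb) if wa | wb else 0.0)

def group_sentences(sentences: list[str], max_distance: float = 0.8) -> list[list[str]]:
    if not sentences:
        return []
    paragraphs = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if jaccard_distance(prev, cur) < max_distance:
            paragraphs[-1].append(cur)   # similar enough: same paragraph
        else:
            paragraphs.append([cur])     # too different: start a new paragraph
    return paragraphs
```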
This is my text sample:
Sylvester Gardenzio Stallone , nicknamed Sly Stallone, is an American actor, filmmaker and screenwriter. Stallone is known for his machismo and Hollywood action roles. Stallone's film Rocky was inducted into the National Film Registry as well as having its film props placed in the Smithsonian Museum. Stallone's use of the front entrance to the Philadelphia Museum of Art in the Rocky series led the area to be nicknamed the Rocky Steps.A commercial, educational, and cultural center, Philadelphia was once the second-largest city in the British Empire (after London), and the social and geographical center of the original 13 American colonies. It was a centerpiece of early American history, host to many of the ideas and actions that gave birth to the American Revolution and independence.The American Revolution was the political upheaval during the last half of the 18th century in which thirteen colonies in North America joined together to break free from the British Empire, combining to become the United States of America. They first rejected the authority of the Parliament of Great Britain to govern them from overseas without representation, and then expelled all royal officials. By 1774 each colony had established a Provincial Congress, or an equivalent governmental institution, to form individual self-governing states.
Comments (3)
So I've worked in NLP for a long time, and this is a really tough problem you're trying to tackle. You'll never be able to implement a solution with 100% accuracy, so you should decide up front whether it's better to make false-negative decisions (failing to find a paragraph-segmentation-point) or false-positive decisions (inserting spurious segmentation points). Once you do that, assemble a corpus of documents and annotate the true segmentation points you expect to find.
Once you've done that, you'll need a mechanism for finding EOS (end-of-sentence) points. Then, between every pair of sentences, you'll need to make a binary decision: should a paragraph boundary be inserted here?
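As one concrete (and assumed, not prescribed) way to handle the EOS step, NLTK's Punkt sentence tokenizer works for ordinary prose:

```python
# Minimal EOS (end-of-sentence) detection sketch using NLTK's Punkt tokenizer.
# Assumes NLTK is installed; newer NLTK versions may need "punkt_tab" instead of "punkt".
import nltk

nltk.download("punkt", quiet=True)  # one-time model download

def split_into_sentences(text: str) -> list[str]:
    """Return the list of sentences found in `text`."""
    return nltk.sent_tokenize(text)

print(split_into_sentences(
    "Stallone is known for his action roles. Philadelphia was once "
    "the second-largest city in the British Empire."
))
```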
You could measure the cohesion of concepts within each paragraph based on different segmentation points. For example, in a document with five sentences (ABCDE), each of the four gaps between sentences either gets a paragraph boundary or not, so there are 2^4 = 16 different ways to segment it:
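A minimal sketch that enumerates those plans explicitly (plain Python; the single-letter labels stand in for actual sentences):

```python
# Enumerate every way to segment five sentences (A..E) into paragraphs.
# Each of the 4 gaps between sentences is either a boundary ("|") or not.
from itertools import product

sentences = ["A", "B", "C", "D", "E"]

for boundaries in product([False, True], repeat=len(sentences) - 1):
    plan = sentences[0]
    for sent, is_boundary in zip(sentences[1:], boundaries):
        plan += ("|" if is_boundary else "") + sent
    print(plan)  # ABCDE, ABCD|E, ABC|DE, ..., A|B|C|D|E  (16 lines)
```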
To measure cohesion, you could use a sentence-to-sentence similarity metric (based on some collection of features extracted for each sentence). For the sake of simplicity, if two adjacent sentences have a similarity metric of 0.95, then there's a 0.05 "cost" for combining them into the same paragraph. The total cost of a document segmentation plan is the aggregate of all the sentence-joining costs. To arrive at the final segmentation, you choose the plan with the least expensive aggregate cost.
Of course, for a document with more than a few sentences, there are too many different possible segmentation permutations to brute-force evaluate all of their costs. So you'll need some heuristic to guide the process. Dynamic programming could be helpful here.
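To make the dynamic-programming idea concrete, here is a hedged sketch. Note that with the pure joining-cost objective above, splitting after every sentence is trivially cheapest, so this sketch adds a hypothetical per-paragraph penalty to make the trade-off meaningful; the similarity function is only a placeholder for your real feature-based metric:

```python
# Dynamic-programming search for the least-cost segmentation plan.
# Cost model from the answer: joining two adjacent sentences costs 1 - similarity.
# BOUNDARY_PENALTY is my own addition: without it, splitting after every sentence
# is trivially optimal, so each new paragraph pays a small fixed cost.

BOUNDARY_PENALTY = 0.3  # hypothetical value; tune on your annotated corpus

def similarity(a: str, b: str) -> float:
    """Placeholder sentence similarity in [0, 1]; plug in your feature-based metric."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def best_segmentation(sentences: list[str]) -> list[list[str]]:
    n = len(sentences)
    join_cost = [1.0 - similarity(sentences[i], sentences[i + 1]) for i in range(n - 1)]
    # best[i] = (cost, plan) covering the first i sentences
    best = [(0.0, [])] + [(float("inf"), []) for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(i):  # the last paragraph is sentences[j:i]
            para_cost = sum(join_cost[j:i - 1])  # joins inside that paragraph
            cost = best[j][0] + BOUNDARY_PENALTY + para_cost
            if cost < best[i][0]:
                best[i] = (cost, best[j][1] + [sentences[j:i]])
    return best[n][1]
```

The penalty plays roughly the same role as the distance threshold in your Jaccard-based attempt: raise it and more sentences merge into one paragraph, lower it and the plan splits more aggressively.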
As for the actual sentence feature extraction... well, that's where it gets really complicated.
You probably want to ignore highly syntactic words (connective words like prepositions, conjunctions, helping verbs, and clause markers) and base your similarity around more semantically relevant words (nouns and verbs, and to a lesser extent, adjectives and adverbs).
A naive implementation might just count up the number of instances of each word and compare the word counts in one sentence with the word counts in an adjacent sentence. If an important word (like "Philadelphia") appears in two adjacent sentences, then they might get a high similarity score.
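A hedged sketch of that naive counting approach, with a toy stopword list standing in for the "ignore highly syntactic words" filtering described above (both the list and the overlap formula are my own assumptions, not the answer's):

```python
# Naive similarity: count shared "content" words between two adjacent sentences.
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "for", "was", "is", "it"}  # toy list

def content_counts(sentence: str) -> Counter:
    words = [w.strip(".,").lower() for w in sentence.split()]
    return Counter(w for w in words if w and w not in STOPWORDS)

def overlap_similarity(s1: str, s2: str) -> float:
    c1, c2 = content_counts(s1), content_counts(s2)
    shared = sum((c1 & c2).values())  # counts of words appearing in both sentences
    total = sum((c1 | c2).values())
    return shared / total if total else 0.0

print(overlap_similarity(
    "Stallone used the front entrance to the Philadelphia Museum of Art.",
    "Philadelphia was the social center of the original 13 American colonies.",
))  # non-zero because "philadelphia" appears in both sentences
```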
But the problem with that is that two adjacent sentences might have very similar topics, even if those sentences have completely non-overlapping sets of words.
So you need to evaluate the "sense" of each word (its specific meaning, given the surrounding context) and generalize that meaning to encompass a broader domain.
For example, imagine a sentence with the word "greenish" in it. During my feature extraction process, I'd certainly include the exact lexical value ("greenish"), but I'd also apply a morphological transform, normalizing the word to its root form ("green"). Then I'd look up that word in a taxonomy and discover that it's a color, which can be further generalized as a visual descriptor. So, based on that one word, I might add four different features to my collection of sentence features ("greenish", "green", "[color]", "[visual]"). If the next sentence in the document referred to the color "green" again, then the two sentences would be very similar. If the next sentence used the word "red", then they'd still have a degree of similarity, but to a lesser extent.
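A sketch of that feature-expansion idea: it uses WordNet (via NLTK) for the taxonomy step, while the tiny ROOT_FORMS map is a toy stand-in for a real morphological analyzer, and the number of hypernym levels to add is my own choice:

```python
# Expand a word into lexical + generalized features, loosely following the
# "greenish" -> {"greenish", "green", "[color]", "[visual]"} example above.
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

ROOT_FORMS = {"greenish": "green"}  # toy stand-in for a morphological analyzer

def word_features(word: str, hypernym_levels: int = 2) -> set[str]:
    features = {word}
    root = ROOT_FORMS.get(word, word)
    features.add(root)
    synsets = wn.synsets(root)
    if synsets:
        synset = synsets[0]  # naive: take the most common sense
        for _ in range(hypernym_levels):
            parents = synset.hypernyms()
            if not parents:
                break
            synset = parents[0]
            features.add("[" + synset.name().split(".")[0] + "]")
    return features

print(word_features("greenish"))
# e.g. {'greenish', 'green', '[chromatic_color]', '[color]'} -- exact output depends on your WordNet data
```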
So, there are a few basic ideas. You could elaborate on these ad infinitum and tweak the algorithm to perform well on your specific dataset. There are a million different ways to attack this problem, but I hope some of these suggestions are helpful in getting you started.
I don't know much about this, so this answer is a stub for a better one. Nonetheless, two points
For this sample, the best method is to find full stops that aren't followed by a space!
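For what it's worth, that can be written as a one-line regex; a sketch that inserts a paragraph break wherever a full stop is immediately followed by a non-space character (it only works because this particular sample happens to be missing exactly those spaces at the theme boundaries):

```python
import re

def split_on_tight_full_stops(text: str) -> str:
    # Insert a blank line after any "." that is immediately followed by a non-space character.
    return re.sub(r"\.(?=\S)", ".\n\n", text)
```

Applied to the sample above, that inserts breaks exactly at the Stallone/Philadelphia and Philadelphia/Revolution seams, yielding the three intended paragraphs.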