当前位置：文江博客话题详情

检测文本中的预定义主题

发布于 2025-01-25 02:18:27 字数 314 浏览 2 评论 0原文

我想在文本语料库中找到有关一些预定义主题的典故（假设我对2个主题感兴趣：“报酬”和“工作条件”）。对于在我的语料库中（特定段落）中的发现，它指出了有关“报酬”的问题。

为此，我首先想到了一种确定性的方法：构建大词典，并且由于Regex可能在文本语料库中标记了这些单词。这是一个非常基本的想法，但我不知道如何有效地构建我的字典（我在报酬的词汇领域需要很多单词）。您知道一些法语网站可以帮助我构建这个词典吗？

也许您可以根据某种机器学习算法考虑一种更聪明的方法，该算法可以意识到这项任务（我知道主题建模，但这里的区别是我专注于预先确定的主题/主题，例如“报酬”）。我需要一种简单的方法:)

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

尤怨 2025-02-01 02:18:27

词典方法是一种非常基本的方法，但它可以起作用。您可以用迭代构建字典：

假设您想要与“工作条件”相关的术语字典。

从种子开始，可能与工作条件相关的少数术语。
使用此词典通过语料库运行并找到相关文档。
现在浏览相关文档并找到具有较高TFIDF值的术语（在上述文档中具有很高表示的术语，但在其余的语料库中表示较低）。可以假定这些术语也指“工作条件”的主题。
将您发现的新术语添加到字典中。
现在您可以再次通过语料库运行并找到其他相关文档。

您可以重复上述过程以进行预配置的次数，或者直到找不到更多新术语为止。

回复收藏 0 原文

a√萤火虫的光℡ 2025-02-01 02:18:27

对此类“ 主题分析“问题远远超出了堆栈溢出Q＆amp的范围。 A-关于此类的多本书和论文。

对于一个小型的入门项目：收集许多侧重于讨论您的主题和其他主题的文章。根据是否涵盖您选择的每个主题对每个文档进行评分。计算 term-term-frequency-frequency-frequency-inverse-inverse-document-document-document-frequerquency 样本文章。将它们转换为每个文档每个单词外观频率的向量。（您可能需要消除非常普遍或模棱两可的” do“ stegming 您也可以扫描两个或更多的序列。然后，这为每个定义的主题定义了一组“正”和“负”示例。

如果只有一个感兴趣的话题，则可以使用 cesine Simallity函数哪个示例文章最喜欢您的新 /输入文本样本。对于多个主题，您可能需要做 principtal组件分析示例以确定哪些单词和单词组合最能代表每个主题。

分类的质量将在很大程度上取决于您必须训练模型的示例文本数量以及它们的不同之处。

回复收藏 0 原文

魔 2025-02-01 02:18:27

如果您要谈论解决此问题的编码方式，为什么不编写（用您的语言）找到包含 Word 或 the sestusion Word的段落。

例如，我会在JavaScript中这样做

// longText is a long text that includes 4 paragraphs in total
const longText = `
In 1893, the first running, gasoline-powered American car was built and road-tested by the Duryea brothers of Springfield, Massachusetts. The first public run of the Duryea Motor Wagon took place on 21 September 1893, on Taylor Street in Metro Center Springfield.[32][33] The Studebaker Automobile Company, subsidiary of a long-established wagon and coach manufacturer, started to build cars in 1897[34]: p.66  and commenced sales of electric vehicles in 1902 and gasoline vehicles in 1904.[35]

In Britain, there had been several attempts to build steam cars with varying degrees of success, with Thomas Rickett even attempting a production run in 1860.[36] Santler from Malvern is recognized by the Veteran Car Club of Great Britain as having made the first gasoline-powered car in the country in 1894,[37] followed by Frederick William Lanchester in 1895, but these were both one-offs.[37] The first production vehicles in Great Britain came from the Daimler Company, a company founded by Harry J. Lawson in 1896, after purchasing the right to use the name of the engines. Lawson's company made its first car in 1897, and they bore the name Daimler.[37]

In 1892, German engineer Rudolf Diesel was granted a patent for a "New Rational Combustion Engine". In 1897, he built the first diesel engine.[1] Steam-, electric-, and gasoline-powered vehicles competed for decades, with gasoline internal combustion engines achieving dominance in the 1910s. Although various pistonless rotary engine designs have attempted to compete with the conventional piston and crankshaft design, only Mazda's version of the Wankel engine has had more than very limited success.

All in all, it is estimated that over 100,000 patents created the modern automobile and motorcycle.
`

document.querySelector('.searchbox').addEventListener('submit', (e)=> { e.preventDefault(); search() })
function search(){
  const allusion = document.querySelector('#searchbox').value.toLowerCase()
  const output = document.querySelector('#results-body ol')
  output.innerHTML = "" // reset the output
  const paragraphs = longText.split('\n').filter(item => item != "")
  const included = paragraphs.filter((paragraph) => paragraph.toLowerCase().includes(allusion))
  let foundIn = included.map(paragraph => `<div class="result-row"> <li>${paragraph.toLowerCase()}</li>
  </div>`)
  foundIn = foundIn.map(el => el.replaceAll(allusion, `<span class="highlight">${allusion}</span>`))
  
  output.insertAdjacentHTML('afterbegin', foundIn.join('\n'))
}

.container{
  padding : 5px;
  border: .2px solid black;
}
.searchbox{
  padding-bottom: 5px
}
.searchbox input {
  width: 90%
}
.result-row{
  padding-bottom: 5px;
}
.highlight{
  background: yellow;
}
h3 span {
font-size: 14px;
font-style: italic;
}

<div class="container">
  <form class="searchbox">
    <h3>Give me an hint: <span>ex: car, gasoline, company</span> </h3>
    <input id="searchbox" type="text" placeholder="allusion word, ex: car, gasoline, company">
    <button type"submit"> find </button>
  </form>
  <div id="results-body">
    <ol></ol>
  </div>
</div>

If you're talking about the coding way of solving this, why don't you write a code (in your language) that finds the paragraph containing the word or the allusion word.

For example, I would do it like this in JavaScript

// longText is a long text that includes 4 paragraphs in total
const longText = `
In 1893, the first running, gasoline-powered American car was built and road-tested by the Duryea brothers of Springfield, Massachusetts. The first public run of the Duryea Motor Wagon took place on 21 September 1893, on Taylor Street in Metro Center Springfield.[32][33] The Studebaker Automobile Company, subsidiary of a long-established wagon and coach manufacturer, started to build cars in 1897[34]: p.66  and commenced sales of electric vehicles in 1902 and gasoline vehicles in 1904.[35]

In Britain, there had been several attempts to build steam cars with varying degrees of success, with Thomas Rickett even attempting a production run in 1860.[36] Santler from Malvern is recognized by the Veteran Car Club of Great Britain as having made the first gasoline-powered car in the country in 1894,[37] followed by Frederick William Lanchester in 1895, but these were both one-offs.[37] The first production vehicles in Great Britain came from the Daimler Company, a company founded by Harry J. Lawson in 1896, after purchasing the right to use the name of the engines. Lawson's company made its first car in 1897, and they bore the name Daimler.[37]

In 1892, German engineer Rudolf Diesel was granted a patent for a "New Rational Combustion Engine". In 1897, he built the first diesel engine.[1] Steam-, electric-, and gasoline-powered vehicles competed for decades, with gasoline internal combustion engines achieving dominance in the 1910s. Although various pistonless rotary engine designs have attempted to compete with the conventional piston and crankshaft design, only Mazda's version of the Wankel engine has had more than very limited success.

All in all, it is estimated that over 100,000 patents created the modern automobile and motorcycle.
`

document.querySelector('.searchbox').addEventListener('submit', (e)=> { e.preventDefault(); search() })
function search(){
  const allusion = document.querySelector('#searchbox').value.toLowerCase()
  const output = document.querySelector('#results-body ol')
  output.innerHTML = "" // reset the output
  const paragraphs = longText.split('\n').filter(item => item != "")
  const included = paragraphs.filter((paragraph) => paragraph.toLowerCase().includes(allusion))
  let foundIn = included.map(paragraph => `<div class="result-row"> <li>${paragraph.toLowerCase()}</li>
  </div>`)
  foundIn = foundIn.map(el => el.replaceAll(allusion, `<span class="highlight">${allusion}</span>`))
  
  output.insertAdjacentHTML('afterbegin', foundIn.join('\n'))
}

.container{
  padding : 5px;
  border: .2px solid black;
}
.searchbox{
  padding-bottom: 5px
}
.searchbox input {
  width: 90%
}
.result-row{
  padding-bottom: 5px;
}
.highlight{
  background: yellow;
}
h3 span {
font-size: 14px;
font-style: italic;
}

<div class="container">
  <form class="searchbox">
    <h3>Give me an hint: <span>ex: car, gasoline, company</span> </h3>
    <input id="searchbox" type="text" placeholder="allusion word, ex: car, gasoline, company">
    <button type"submit"> find </button>
  </form>
  <div id="results-body">
    <ol></ol>
  </div>
</div>

回复收藏 0 原文

~没有更多了~