Detecting predefined topics in a text

Posted on 2025-01-25 02:18:27

I would like to find, in a text corpus (long, boring text), allusions to some predefined topics (let's say I am interested in two topics: "Remuneration" and "Work conditions").
For example, finding where in my corpus (the specific paragraph) the text points out problems about "remuneration".

To accomplish that, I first thought about a deterministic approach: building a big dictionary and then, using regex, flagging those words in the text corpus. It is a very basic idea, but I do not know how I could build my dictionary efficiently (I need a lot of words in the lexical field of remuneration). Do you know of a website in French that could help me build this dictionary?

Perhaps you can think of a cleverer approach based on some machine learning algorithm that could perform this task (I know about topic modelling, but the difference here is that I am focusing on predetermined subjects/topics like "Remuneration"). I need a simple approach :)

Comments (3)

尤怨 2025-02-01 02:18:27

The dictionary approach is a very basic one, but it could work. You can build the dictionary iteratively:

Suppose you want a dictionary of terms related to "work conditions".

  1. Start with a seed, a small number of terms that may be related, with high probability, to work conditions.

  2. Use this dictionary to run through the corpus and find relevant documents.

  3. Now go through the relevant documents and find terms with a high TF-IDF value (terms that are strongly represented in the above documents but weakly represented in the rest of the corpus). These terms can be assumed to refer to the subject of "work conditions" as well.

  4. Add the new terms you found to the dictionary.

  5. Now you can run again through the corpus and find additional relevant documents.

You can repeat the above process for a pre-configured number of times, or until no more new terms are found.
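
The loop above could be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the names `tokenize` and `expandDictionary`, the `rounds` and `topK` options, and the scoring (term count in the relevant documents multiplied by the log of the inverse document frequency over the whole corpus) are hypothetical choices made for the example, not something prescribed by the answer.

```javascript
// Hypothetical helper: lowercase the text and keep runs of letters (accents included).
function tokenize(text) {
  return text.toLowerCase().match(/\p{L}+/gu) || []
}

// Hypothetical sketch of the iterative dictionary expansion (steps 1 to 5).
function expandDictionary(documents, seedTerms, { rounds = 3, topK = 2 } = {}) {
  const tokenized = documents.map(tokenize)

  // Document frequency: in how many documents of the corpus each term occurs.
  const df = new Map()
  for (const tokens of tokenized) {
    for (const term of new Set(tokens)) df.set(term, (df.get(term) || 0) + 1)
  }

  // Step 1: start from the seed terms.
  const dictionary = new Set(seedTerms.map(term => term.toLowerCase()))

  for (let round = 0; round < rounds; round++) {
    // Step 2: relevant documents contain at least one dictionary term.
    const relevant = tokenized.filter(tokens => tokens.some(t => dictionary.has(t)))

    // Step 3: count terms inside the relevant documents...
    const tf = new Map()
    for (const tokens of relevant) {
      for (const term of tokens) tf.set(term, (tf.get(term) || 0) + 1)
    }
    // ...and weight them by how rare they are in the corpus as a whole.
    const candidates = [...tf.entries()]
      .filter(([term]) => !dictionary.has(term))
      .map(([term, count]) => [term, count * Math.log(tokenized.length / df.get(term))])
      .filter(([, score]) => score > 0)
      .sort((a, b) => b[1] - a[1])
      .slice(0, topK)

    // Stop early when no new term is found.
    if (candidates.length === 0) break

    // Step 4: add the new terms; step 5 is the next iteration of the loop.
    for (const [term] of candidates) dictionary.add(term)
  }
  return dictionary
}

// Usage with a made-up corpus (illustrative data only).
const corpus = [
  'Le salaire et la prime restent trop bas, la prime a même baissé.',
  'Le salaire de base est complété par une prime annuelle.',
  'Les horaires sont longs et le bureau est bruyant.',
  'La cantine propose un nouveau menu.',
]
console.log([...expandDictionary(corpus, ['salaire'], { rounds: 2, topK: 2 })])
```

On a small corpus the top-scoring terms are often function words or noise, so it is probably wise to filter stop words and to review each batch of new terms by hand before adding it, otherwise the dictionary can drift away from the topic.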

a√萤火虫的光℡ 2025-02-01 02:18:27

A full treatment of such "topic analysis" problems is well beyond the scope of a Stack Overflow Q&A; there are multiple books and papers on the subject.

For a small, starter project: collect a number of articles which focus on discussing your topic(s), and on other topics. Rate each document according to whether or not it covers each of your chosen topics. Calculate the term-frequency-inverse-document-frequency for each of the sample articles. Convert these into a vector of the frequency of appearance of each word for each document. (You'll probably want to eliminate extremely common or ambiguous "stop words" from the analysis and do "stemming" as well. You can also scan for common sequences of two or more words.) This then defines a set of "positive" and "negative" examples for each defined topic.

If there's only a single topic of interest, you can then use a cosine-similarity function to determine which sample article is most like your new / input text sample. For multiple topics, you'll probably want to do something like Principal Component Analysis on the original text samples to identify which words and word combinations are most representative of each topic.
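
As a rough illustration of the TF-IDF plus cosine-similarity idea, here is a minimal sketch in plain JavaScript. The function names (`tokenize`, `buildModel`, `vectorize`, `cosine`, `mostSimilar`), the smoothed IDF formula and the sample texts are all assumptions made for this example; stop-word removal, stemming and word sequences are left out to keep it short.

```javascript
// Hypothetical helper: lowercase the text and keep runs of letters (accents included).
function tokenize(text) {
  return text.toLowerCase().match(/\p{L}+/gu) || []
}

// Turn a token list into a sparse TF-IDF vector (Map of term -> weight).
// Terms never seen in the labelled examples are ignored.
function vectorize(tokens, idf) {
  const vector = new Map()
  for (const term of tokens) {
    if (idf.has(term)) vector.set(term, (vector.get(term) || 0) + idf.get(term))
  }
  return vector
}

// Build IDF weights and one TF-IDF vector per labelled example.
// `examples` is an array of { label, text } objects (an assumed shape).
function buildModel(examples) {
  const tokenized = examples.map(example => tokenize(example.text))
  const df = new Map()
  for (const tokens of tokenized) {
    for (const term of new Set(tokens)) df.set(term, (df.get(term) || 0) + 1)
  }
  const idf = new Map()
  for (const [term, count] of df) {
    // Smoothed IDF, one common variant among several.
    idf.set(term, Math.log((1 + examples.length) / (1 + count)) + 1)
  }
  return { examples, idf, vectors: tokenized.map(tokens => vectorize(tokens, idf)) }
}

// Cosine similarity between two sparse vectors.
function cosine(a, b) {
  let dot = 0
  let normA = 0
  let normB = 0
  for (const [term, weight] of a) {
    normA += weight * weight
    if (b.has(term)) dot += weight * b.get(term)
  }
  for (const weight of b.values()) normB += weight * weight
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB)
}

// Return the label of the example closest to the input text, with its score.
function mostSimilar(model, text) {
  const query = vectorize(tokenize(text), model.idf)
  let best = { label: null, score: 0 }
  model.vectors.forEach((vector, i) => {
    const score = cosine(query, vector)
    if (score > best.score) best = { label: model.examples[i].label, score }
  })
  return best
}

// Usage with made-up labelled examples (illustrative data only).
const model = buildModel([
  { label: 'remuneration', text: 'le salaire et la prime sont trop bas' },
  { label: 'work conditions', text: 'les horaires sont longs et le bureau est bruyant' },
  { label: 'other', text: 'la cantine propose un nouveau menu' },
])
console.log(mostSimilar(model, 'ma prime et mon salaire'))
```

A score threshold would normally be needed on top of this, so that a paragraph unrelated to every labelled example is not forced into the nearest topic.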

The quality of the classification will depend largely on the number of example texts you have to train the model and how much they differ.

2025-02-01 02:18:27

If you're talking about solving this with code, why not write some code (in your language of choice) that finds the paragraphs containing the word or the allusion word?

For example, I would do it like this in JavaScript:

// longText is a long text that includes 4 paragraphs in total
const longText = `
In 1893, the first running, gasoline-powered American car was built and road-tested by the Duryea brothers of Springfield, Massachusetts. The first public run of the Duryea Motor Wagon took place on 21 September 1893, on Taylor Street in Metro Center Springfield.[32][33] The Studebaker Automobile Company, subsidiary of a long-established wagon and coach manufacturer, started to build cars in 1897[34]: p.66  and commenced sales of electric vehicles in 1902 and gasoline vehicles in 1904.[35]

In Britain, there had been several attempts to build steam cars with varying degrees of success, with Thomas Rickett even attempting a production run in 1860.[36] Santler from Malvern is recognized by the Veteran Car Club of Great Britain as having made the first gasoline-powered car in the country in 1894,[37] followed by Frederick William Lanchester in 1895, but these were both one-offs.[37] The first production vehicles in Great Britain came from the Daimler Company, a company founded by Harry J. Lawson in 1896, after purchasing the right to use the name of the engines. Lawson's company made its first car in 1897, and they bore the name Daimler.[37]

In 1892, German engineer Rudolf Diesel was granted a patent for a "New Rational Combustion Engine". In 1897, he built the first diesel engine.[1] Steam-, electric-, and gasoline-powered vehicles competed for decades, with gasoline internal combustion engines achieving dominance in the 1910s. Although various pistonless rotary engine designs have attempted to compete with the conventional piston and crankshaft design, only Mazda's version of the Wankel engine has had more than very limited success.

All in all, it is estimated that over 100,000 patents created the modern automobile and motorcycle.
`

// Run the search when the form is submitted instead of reloading the page
document.querySelector('.searchbox').addEventListener('submit', (e) => { e.preventDefault(); search() })

function search() {
  const allusion = document.querySelector('#searchbox').value.trim().toLowerCase()
  const output = document.querySelector('#results-body ol')
  output.innerHTML = '' // reset the output
  if (allusion === '') return // an empty query would match every paragraph
  // Split on newlines and drop the blank lines between paragraphs
  const paragraphs = longText.split('\n').filter(item => item !== '')
  const included = paragraphs.filter(paragraph => paragraph.toLowerCase().includes(allusion))
  // Highlight the paragraph text before wrapping it in markup, so that a query
  // such as "div" or "class" cannot match the surrounding tags
  const foundIn = included.map(paragraph => {
    const highlighted = paragraph.toLowerCase().replaceAll(allusion, `<span class="highlight">${allusion}</span>`)
    return `<li class="result-row">${highlighted}</li>`
  })

  output.insertAdjacentHTML('afterbegin', foundIn.join('\n'))
}

/* CSS */
.container {
  padding: 5px;
  border: .2px solid black;
}
.searchbox {
  padding-bottom: 5px;
}
.searchbox input {
  width: 90%;
}
.result-row {
  padding-bottom: 5px;
}
.highlight {
  background: yellow;
}
h3 span {
  font-size: 14px;
  font-style: italic;
}

<!-- HTML -->
<div class="container">
  <form class="searchbox">
    <h3>Give me a hint: <span>ex: car, gasoline, company</span></h3>
    <input id="searchbox" type="text" placeholder="allusion word, ex: car, gasoline, company">
    <button type="submit">find</button>
  </form>
  <div id="results-body">
    <ol></ol>
  </div>
</div>
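
The same paragraph filter could be extended from a single search word to a whole dictionary of terms per topic, which is closer to the dictionary-plus-regex idea in the question. The sketch below is only an illustration: the `topics` object, its terms and the function names `escapeRegExp`, `buildTopicMatchers` and `tagParagraphs` are hypothetical, and it runs without the page markup above.

```javascript
// Hypothetical topic dictionaries; the terms are illustrative only.
const topics = {
  remuneration: ['salaire', 'prime', 'rémunération'],
  'work conditions': ['horaires', 'bureau', 'pénibilité'],
}

// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// Build one case-insensitive regex per topic from its list of terms.
function buildTopicMatchers(topicTerms) {
  const matchers = {}
  for (const [topic, terms] of Object.entries(topicTerms)) {
    const alternatives = terms.map(escapeRegExp).join('|')
    // Lookarounds on Unicode letters and digits are used instead of \b,
    // because \b does not treat accented letters as word characters.
    matchers[topic] = new RegExp(
      `(?<![\\p{L}\\p{N}])(?:${alternatives})(?![\\p{L}\\p{N}])`,
      'iu'
    )
  }
  return matchers
}

// Split the text into paragraphs (separated by blank lines) and list,
// for each paragraph, the topics whose dictionary matches it.
function tagParagraphs(text, matchers) {
  return text
    .split(/\n\s*\n/)
    .map(paragraph => paragraph.trim())
    .filter(paragraph => paragraph !== '')
    .map(paragraph => ({
      paragraph,
      topics: Object.keys(matchers).filter(topic => matchers[topic].test(paragraph)),
    }))
}

// Usage with a made-up text (illustrative data only).
const sampleText = `Le salaire est jugé trop bas.

Les horaires sont longs et la prime a été supprimée.

Rien à signaler.`
console.log(tagParagraphs(sampleText, buildTopicMatchers(topics)))
```

Because the match is on whole words, inflected forms such as plurals are not caught unless they are listed in the dictionary or the text is stemmed first.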
