Extract relevant tags/keywords from a block of text

Posted 2024-10-15 04:07:28

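I want a particular implementation in which the user provides a block of text like this:

"Requirements - Working knowledge, on LAMP Environment using Linux, Apache 2, MySQL 5 and PHP 5, - Knowledge of Web 2.0 Standards - Comfortable with JSON - Hands on Experience on working with Frameworks, Zend, OOPs - Cross Browser Javascripting, JQuery etc. - Knowledge of Version Control Software such as sub-version will be preferable."

What I want to do is automatically pick out the relevant keywords and create tags/keywords from them, so for the text above the relevant tags should be: mysql, php, json, jquery, version control, oop, web2.0, javascript

How can I go about doing this in PHP/JavaScript etc.?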

厌倦 2024-10-22 04:07:28

A very naive method is to remove common stopwords from the text, leaving you with more meaningful words like 'Standards', 'JSON', etc. You will still get a lot of noise however, so you may consider a service like OpenCalais which can do a rather sophisticated analysis of your text.

Update:

Okay, the link in my previous answer pointed to implementations, but you asked for one so a simple one is here:

function stopWords($text, $stopwords) {

  // Trim line breaks and spaces from the stop words and lowercase them
  $stopwords = array_map(function($x){return trim(strtolower($x));}, $stopwords);

  // Replace every digit and non-word character with a comma
  $pattern = '/[0-9\W]/';
  $text = preg_replace($pattern, ',', $text);

  // Create an array from $text
  $text_array = explode(",", $text);

  // Trim whitespace and lowercase each term in $text
  $text_array = array_map(function($x){return trim(strtolower($x));}, $text_array);

  // Collect every term that is not a stop word
  // (initialised first so array_filter() always receives an array)
  $keywords = array();
  foreach ($text_array as $term) {
    if (!in_array($term, $stopwords)) {
      $keywords[] = $term;
    }
  }

  // Drop the empty strings left by consecutive separators (original keys are kept)
  return array_filter($keywords);
}

$stopwords = file('stop_words.txt');
$text = "Requirements - Working knowledge, on LAMP Environment using Linux, Apache 2, MySQL 5 and PHP 5, - Knowledge of Web 2.0 Standards - Comfortable with JSON - Hands on Experience on working with Frameworks, Zend, OOPs - Cross Browser Javascripting, JQuery etc. - Knowledge of Version Control Software such as sub-version will be preferable.";

print_r(stopWords($text, $stopwords));

You can see this, and the contents of stop_words.txt, in this Gist.

Running the above on your example text produces the following array:

Array
(
    [0] => requirements
    [4] => linux
    [6] => apache
    [10] => mysql
    [13] => php
    [25] => json
    [28] => frameworks
    [30] => zend
    [34] => browser
    [35] => javascripting
    [37] => jquery
    [38] => etc
    [42] => software
    [43] => preferable
)

So, like I said, this is somewhat naive and could use more optimization (plus it's slow), but it does pull out the more relevant keywords from your text. You would need to do some fine tuning on the stop words as well. Capturing terms like Web 2.0 will be very difficult, because the pattern treats digits, spaces and punctuation as separators and so splits such terms apart before they can be matched. So again I think you would be better off using a serious service like OpenCalais, which can understand a text and return a list of entities and references. DocumentCloud relies on this very service to gather information from documents.
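
One partial workaround for terms like Web 2.0, short of a full service, is to look for a hand-maintained list of known phrases before the text is split into single words. The following is only a minimal JavaScript sketch of that idea: the function name extractPhrases and the phrase list are assumptions made for illustration, not part of the code above or of any library.

```javascript
// Hypothetical helper: pull known multi-word phrases out of the text first,
// and return the remaining text for ordinary word-level filtering.
function extractPhrases(text, phrases) {
  const found = [];
  let rest = text.toLowerCase();
  for (const phrase of phrases) {
    const needle = phrase.toLowerCase();
    if (rest.includes(needle)) {
      found.push(needle);
      // Replace every occurrence with a separator so its words are not picked up again
      rest = rest.split(needle).join(",");
    }
  }
  return { found, rest };
}

// Illustrative phrase list (an assumption; in practice you would maintain your own)
const phraseResult = extractPhrases(
  "Knowledge of Web 2.0 Standards and Version Control Software",
  ["Web 2.0", "version control"]
);
console.log(phraseResult.found); // [ 'web 2.0', 'version control' ]
```

The `rest` string can then go through the stop-word filter as usual. This only finds phrases you already know about, so it does not replace real entity extraction.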

Also, for client side implementation you could do pretty much the same thing with JavaScript, and probably much cleaner (although it could be slow for the client.)
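
For what it's worth, a rough JavaScript version of the same stop-word filter might look like the sketch below. The function name extractKeywords and the tiny inline stop-word list are assumptions for illustration only; a real list would be much longer, like the stop_words.txt file used above.

```javascript
// Sketch of the stop-word filter in JavaScript.
// A Set gives constant-time lookups instead of scanning the whole list per term.
function extractKeywords(text, stopwords) {
  const stop = new Set(stopwords.map((w) => w.trim().toLowerCase()));
  return text
    .toLowerCase()
    .split(/[0-9\W]+/) // split on runs of digits and non-word characters
    .filter((term) => term !== "" && !stop.has(term));
}

// Tiny illustrative stop-word list (an assumption, not the list from the Gist)
const sampleStopwords = ["on", "and", "of", "with", "using", "knowledge", "working"];
const sampleText =
  "Working knowledge on LAMP Environment using Linux, Apache 2, MySQL 5 and PHP 5";

console.log(extractKeywords(sampleText, sampleStopwords));
// [ 'lamp', 'environment', 'linux', 'apache', 'mysql', 'php' ]
```

Unlike the PHP version, this returns a densely indexed array, and neither version removes duplicate terms.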
