Extract relevant tags/keywords from a block of text

Posted 2024-10-15 04:07:28, 393 characters, 6 views, 0 comments

I want a particular implementation in which the user provides a block of text like:

"Requirements
- Working knowledge, on LAMP Environment using Linux, Apache 2,
MySQL 5 and PHP 5,
- Knowledge of Web 2.0 Standards
- Comfortable with JSON
- Hands on Experience on working with Frameworks, Zend, OOPs
- Cross Browser Javascripting, JQuery etc.
- Knowledge of Version Control Software such as sub-version will be
preferable."

What I want to do is automatically select relevant keywords and create tags/keywords, hence for the above piece of text, relevant tags should be: mysql, php, json, jquery, version control, oop, web2.0, javascript

How can I go about doing it in PHP/Javascript etc? A headstart would be really helpful.

Comments (4)

厌倦 2024-10-22 04:07:28

A very naive method is to remove common stopwords from the text, leaving you with more meaningful words like 'Standards', 'JSON', etc. You will still get a lot of noise however, so you may consider a service like OpenCalais which can do a rather sophisticated analysis of your text.

Update:

Okay, the link in my previous answer pointed to implementations, but you asked for one so a simple one is here:

function stopWords($text, $stopwords) {

  // Trim line breaks and spaces from the stop words and lowercase them
  $stopwords = array_map(function($x){return trim(strtolower($x));}, $stopwords);

  // Replace every digit and non-word character with a comma
  $pattern = '/[0-9\W]/';
  $text = preg_replace($pattern, ',', $text);

  // Create an array from $text
  $text_array = explode(",", $text);

  // Trim whitespace and lowercase the words in $text
  $text_array = array_map(function($x){return trim(strtolower($x));}, $text_array);

  // Initialize the result so an all-stop-word input returns an empty array
  $keywords = array();

  foreach ($text_array as $term) {
    if (!in_array($term, $stopwords)) {
      $keywords[] = $term;
    }
  }

  // Drop the empty strings left between adjacent separators (original keys are kept)
  return array_filter($keywords);
}

$stopwords = file('stop_words.txt');
$text = "Requirements - Working knowledge, on LAMP Environment using Linux, Apache 2, MySQL 5 and PHP 5, - Knowledge of Web 2.0 Standards - Comfortable with JSON - Hands on Experience on working with Frameworks, Zend, OOPs - Cross Browser Javascripting, JQuery etc. - Knowledge of Version Control Software such as sub-version will be preferable.";

print_r(stopWords($text, $stopwords));

You can see this, and the contents of stop_words.txt, in this Gist.

Running the above on your example text produces the following array:

Array
(
    [0] => requirements
    [4] => linux
    [6] => apache
    [10] => mysql
    [13] => php
    [25] => json
    [28] => frameworks
    [30] => zend
    [34] => browser
    [35] => javascripting
    [37] => jquery
    [38] => etc
    [42] => software
    [43] => preferable
)

So, like I said, this is somewhat naive and could use more optimization (plus it's slow) but it does pull out the more relevant keywords from your text. You would need to do some fine tuning on the stop words as well. Capturing terms like Web 2.0 will be very difficult, so again I think you would be better off using a serious service like OpenCalais which can understand a text and return a list of entities and references. DocumentCloud relies on this very service to gather information from documents.

Also, for client side implementation you could do pretty much the same thing with JavaScript, and probably much cleaner (although it could be slow for the client.)
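As a rough illustration of that client-side variant, here is a minimal JavaScript sketch of the same stop-word filtering idea. It is not a port of the PHP above: the `STOP_WORDS` list and the `extractKeywords` name are made up for this example, the list is far too short for real use, and unlike the PHP version it also removes duplicate words.

```javascript
// Hypothetical, deliberately short stop-word list; a real one would be much longer.
const STOP_WORDS = new Set([
  "a", "an", "and", "as", "be", "etc", "of", "on", "such", "the",
  "using", "will", "with",
]);

// Split text into lowercase words, then drop stop words and duplicates.
function extractKeywords(text, stopWords = STOP_WORDS) {
  // Anything that is not a letter (digits, punctuation, spaces) acts as a separator.
  const words = text.toLowerCase().split(/[^a-z]+/);
  const seen = new Set();
  for (const word of words) {
    if (word !== "" && !stopWords.has(word)) {
      seen.add(word);
    }
  }
  return [...seen];
}

const sample = "Comfortable with JSON - Cross Browser Javascripting, JQuery etc.";
console.log(extractKeywords(sample));
```

Using a `Set` for the stop words keeps each lookup cheap, which helps with the slowness mentioned above. Terms such as "Web 2.0" are still lost, because digits are treated as separators here just as in the PHP version.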

五里雾 2024-10-22 04:07:28

I did a quick review of these this morning and, to my surprise, the one that performed best with my test phrase was written in PHP.

What looked like the most professional one performed abysmally: viewer.opencalais.com

Others that were OK were (not sure what language they're written in)

  • www.nactem.ac.uk/software/termine/#form
  • www.alchemyapi.com/api/keyword/

昇り龍 2024-10-22 04:07:28

This is not easy to do because it requires some type of fuzzy logic. You should use the Yahoo Term extractor YQL

Check it out: link

几味少女 2024-10-22 04:07:28

It depends on whether you only want to show the keywords/tags to the client, or whether you want to extract them from the block of text and then do further computation with them.

If you only need to show them, then client-side handling is fine. If you need them for further computation, use server-side handling.

I can recommend a JavaScript client-side implementation if you can supply some more details. If you want to generically "know" the keywords, then some kind of clever solution is necessary.

If you have a list of keywords, then you can use regular expressions to extract the data.
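A minimal sketch of that regular-expression approach in JavaScript might look like the following. The `TAG_PATTERNS` table and the `matchTags` function are hypothetical names for this example, and the patterns are hand-written guesses for the tags listed in the question, not a general solution.

```javascript
// Hypothetical tag table: each entry pairs a tag with the pattern that detects it.
const TAG_PATTERNS = [
  ["mysql", /\bmysql\b/i],
  ["php", /\bphp\b/i],
  ["json", /\bjson\b/i],
  ["jquery", /\bjquery\b/i],
  ["javascript", /\bjavascript/i], // no trailing \b, so "Javascripting" also matches
  ["oop", /\boops?\b/i],
  ["web2.0", /\bweb\s*2\.0\b/i],
  ["version control", /\bversion\s+control\b/i],
];

// Return every tag whose pattern occurs somewhere in the text, in table order.
function matchTags(text, patterns = TAG_PATTERNS) {
  return patterns
    .filter(([, pattern]) => pattern.test(text))
    .map(([tag]) => tag);
}

const requirements =
  "Requirements - Working knowledge, on LAMP Environment using Linux, Apache 2, " +
  "MySQL 5 and PHP 5, - Knowledge of Web 2.0 Standards - Comfortable with JSON - " +
  "Hands on Experience on working with Frameworks, Zend, OOPs - Cross Browser " +
  "Javascripting, JQuery etc. - Knowledge of Version Control Software such as " +
  "sub-version will be preferable.";

console.log(matchTags(requirements));
```

Because the list is fixed in advance, multi-word and mixed terms such as "version control" and "Web 2.0" are easy to catch, but any technology missing from the table is silently ignored.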
