Extracting key phrases from text (n-grams of 1-4 words)

Posted on 2024-11-29 19:28:07

What's the best way to extract keyphrases from a block of text? I'm writing a tool to do keyword extraction: something like this. I've found a few libraries for Python and Perl to extract n-grams, but I'm writing this in Node so I need a JavaScript solution. If there aren't any existing JavaScript libraries, could someone explain how to do this so I can just write it myself?

Comments (3)

嗼ふ静 2024-12-06 19:28:07

I like the idea, so I've implemented it: see below (descriptive comments are included).
Preview at: https://jsfiddle.net/WsKMx

/*@author Rob W, created on 16-17 September 2011, on request for Stackoverflow (http://stackoverflow.com/q/7085454/938089)
 * Modified on 17 July 2012, fixed IE bug by replacing [,] with [null]
 * This script counts words and phrases. For simplicity and efficiency,
 * there's only one loop through a block of text.
 * 100% accuracy requires much more computing power, which is usually unnecessary
 **/


var text = "A quick brown fox jumps over the lazy old bartender who said 'Hi!' as a response to the visitor who presumably assaulted the maid's brother, because he didn't pay his debts in time. In time in time does really mean in time. Too late is too early? Nonsense! 'Too late is too early' does not make any sense.";

var atLeast = 2;       // Show results with at least .. occurrences
var numWords = 5;      // Show statistics for one to .. words
var ignoreCase = true; // Case-sensitivity
var REallowedChars = /[^a-zA-Z'\-]+/g;
 // RE pattern to select valid characters. Invalid characters are replaced with a whitespace

var i, j, k, textlen, len, s;
// Prepare key hash
var keys = [null]; //"keys[0] = null", a word boundary with length zero is empty
var results = [];
numWords++; //for human logic, we start counting at 1 instead of 0
for (i=1; i<=numWords; i++) {
    // Prototype-less object, so words such as "constructor" are counted correctly
    keys.push(Object.create(null));
}

// Remove all irrelevant characters
text = text.replace(REallowedChars, " ").replace(/^\s+/,"").replace(/\s+$/,"");

// Create a hash
if (ignoreCase) text = text.toLowerCase();
text = text.split(/\s+/);
for (i=0, textlen=text.length; i<textlen; i++) {
    s = text[i];
    keys[1][s] = (keys[1][s] || 0) + 1;
    for (j=2; j<=numWords; j++) {
        if(i+j <= textlen) {
            s += " " + text[i+j-1];
            keys[j][s] = (keys[j][s] || 0) + 1;
        } else break;
    }
}

// Prepares results for advanced analysis
for (var k=1; k<=numWords; k++) {
    results[k] = [];
    var key = keys[k];
    for (var i in key) {
        if(key[i] >= atLeast) results[k].push({"word":i, "count":key[i]});
    }
}

// Result parsing
var outputHTML = []; // Buffer data. This data is used to create a table using `.innerHTML`

var f_sortDescending = function(x,y) {return y.count - x.count;}; // highest count first
for (k=1; k<numWords; k++) {
    results[k].sort(f_sortDescending);//sorts results
    
    // Customize your output. For example:
    var words = results[k];
    if (words.length) outputHTML.push('<td colSpan="3" class="num-words-header">'+k+' word'+(k==1?"":"s")+'</td>');
    for (i=0,len=words.length; i<len; i++) {
        
        //Characters have been validated. No fear for XSS
        outputHTML.push("<td>" + words[i].word + "</td><td>" +
           words[i].count + "</td><td>" +
           Math.round(words[i].count/textlen*10000)/100 + "%</td>");
           // textlen defined at the top
           // The relative occurrence has a precision of 2 digits.
    }
}
outputHTML = '<table id="wordAnalysis"><thead><tr>' +
              '<td>Phrase</td><td>Count</td><td>Relativity</td></tr>' +
              '</thead><tbody><tr>' +outputHTML.join("</tr><tr>")+
               "</tr></tbody></table>";
document.getElementById("RobW-sample").innerHTML = outputHTML;
/*
CSS:
#wordAnalysis td{padding:1px 3px 1px 5px}
.num-words-header{font-weight:bold;border-top:1px solid #000}

HTML:
<div id="RobW-sample"></div>
*/
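
Since the question targets Node, where `document` is not available, the same single-pass idea can be sketched as a function that returns plain data instead of building HTML. This is only a minimal sketch, not the original author's code: the name `countPhrases`, its parameters, and the shape of the returned objects are assumptions made for illustration.

```javascript
// Hypothetical helper (not part of the original answer): counts every phrase of
// 1..maxWords consecutive words and returns plain objects instead of HTML.
function countPhrases(text, maxWords = 4, atLeast = 2) {
  // Same character whitelist as the script above: letters, apostrophes, hyphens.
  const words = text
    .toLowerCase()
    .replace(/[^a-z'\-]+/g, " ")
    .trim()
    .split(/\s+/)
    .filter(Boolean);

  // counts[n] maps each n-word phrase to its number of occurrences.
  // A Map avoids clashes with inherited object keys such as "constructor".
  const counts = [];
  for (let n = 1; n <= maxWords; n++) counts[n] = new Map();

  // Single pass: at each position, extend the phrase one word at a time.
  for (let i = 0; i < words.length; i++) {
    let phrase = "";
    for (let n = 1; n <= maxWords && i + n <= words.length; n++) {
      phrase = n === 1 ? words[i] : phrase + " " + words[i + n - 1];
      counts[n].set(phrase, (counts[n].get(phrase) || 0) + 1);
    }
  }

  // Keep phrases seen at least `atLeast` times, most frequent first;
  // ties are broken in favour of longer phrases.
  const results = [];
  for (let n = 1; n <= maxWords; n++) {
    for (const [phrase, count] of counts[n]) {
      if (count >= atLeast) results.push({ phrase, words: n, count });
    }
  }
  results.sort((a, b) => b.count - a.count || b.words - a.words);
  return results;
}

const sample = "In time in time does really mean in time. Too late is too early?";
console.log(countPhrases(sample, 4, 2));
```

Returning an array of `{ phrase, words, count }` objects leaves the presentation (table, JSON, CLI output) to the caller, which tends to fit a Node tool better than string-built HTML.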

成熟的代价 2024-12-06 19:28:07

I don't know of such a library in JavaScript, but the logic is:

  1. Split the text into an array.
  2. Then sort and count.

Alternatively:

  1. Split the text into an array.
  2. Create a secondary array.
  3. Traverse each item of the first array.
  4. Check whether the current item exists in the secondary array.
  5. If it does not exist, add the item as a key.
  6. Otherwise, increase the value stored under the key equal to the item sought.

HTH

Ivo Stoykov
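
Both approaches described in this answer can be sketched roughly as follows. This is an illustrative sketch only: the function names `countBySorting` and `countByLookup` are made up here, the tokenizer is a deliberately simple split on non-word characters (so "didn't" becomes "didn" and "t"), and a `Map` stands in for the "secondary array".

```javascript
// Approach 1: split, sort, then count runs of equal neighbours.
function countBySorting(text) {
  const words = text.toLowerCase().split(/\W+/).filter(Boolean).sort();
  const counts = [];
  for (const word of words) {
    const last = counts[counts.length - 1];
    if (last && last.word === word) {
      last.count++; // same word as the previous entry: extend the run
    } else {
      counts.push({ word, count: 1 }); // a new word starts a new run
    }
  }
  return counts;
}

// Approach 2: split, then tally each word in a secondary lookup structure.
function countByLookup(text) {
  const counts = new Map();
  for (const word of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    // Missing key: start at 1. Existing key: increase its value.
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

console.log(countBySorting("too late is too early"));
console.log(countByLookup("too late is too early"));
```

The lookup version avoids the sort, so it does one pass over the words; the sorting version returns its results already in alphabetical order.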

粉红×色少女 2024-12-06 19:28:07
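function ngrams(seq, n) {
  // Collect every run of n consecutive items from seq
  const to_return = []
  for (let i=0; i<seq.length-(n-1); i++) {
      let cur = []
      for (let j=i; j<seq.length && j<=i+(n-1); j++) {
          cur.push(seq[j])
      }
      // Items are joined without a separator; use join(' ') for word n-grams
      to_return.push(cur.join(''))
  }
  return to_return
}
> ngrams(['a', 'b', 'c'], 2)
['ab', 'bc']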
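
For the 1-4 word phrases asked about in the question, the same sliding-window idea can be applied to an array of words. The variant below is a hypothetical sketch, not part of the answer above: `ngramsJoined` and its `sep` parameter are names introduced here, and the input is assumed to be already split into words.

```javascript
// Hypothetical variant of ngrams() with a configurable separator.
function ngramsJoined(seq, n, sep = " ") {
  const out = [];
  // Each window covers items i .. i+n-1; stop when the window would overrun.
  for (let i = 0; i + n <= seq.length; i++) {
    out.push(seq.slice(i, i + n).join(sep));
  }
  return out;
}

// List all phrases of 1 to 4 words from a pre-split sentence.
const words = "too late is too early".split(" ");
for (let n = 1; n <= 4; n++) {
  console.log(n, ngramsJoined(words, n));
}
```

Passing an empty string as `sep` reproduces the behaviour of the original function on single characters.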