Can Web Workers be used to calculate word frequencies on a very long page?
I'm writing a browser-based (JavaScript and jQuery) linguistic analysis tool that extracts text from HTML and then extracts linguistic units such as sentences, words, and so on.
To import text, a PHP backend spiders a given URL and sanitizes the resulting HTML. That HTML is then inserted into a div#container in the interface, something like this:
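(A minimal sketch of that insertion; the endpoint name and parameters are illustrative, not from the question:)

```javascript
// Ask the PHP backend to spider and sanitize the page, then inject the
// result into the container div. '/spider.php' is a hypothetical route.
$.get('/spider.php', { url: targetUrl }, function (sanitizedHtml) {
    $('#container').html(sanitizedHtml);
});
```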
I have run into some difficulties when the source HTML page is very long. Reading and inserting such a page into the interface's DOM doesn't seem to cause problems (though it takes a while).
But running a word frequency algorithm over the spidered content is very slow if the page is long. If the page approaches 100K words, say, it will pretty much bring things to a grinding halt.
So, I see a few options:
- Change the PHP spider so that it will truncate the source document or subdivide it into multiple documents
- Change the word frequency algorithm so that it's less exact, and samples the word distribution rather than recording it completely
- Try out this new-fangled Web Worker thing to see if I can distribute the calculation across multiple background processes.
It would appear to me that (3) is just the sort of thing that Web Workers are designed to do. I'm imagining splitting the spidered content into chunks, and then assigning one Web Worker to each chunk. The word frequency profile of each chunk can be returned from the Web Worker, and then summed up and rendered to the chart.
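A rough sketch of that plan, assuming Worker support (the file name and the tokenizer regex are illustrative):

```javascript
// freq-worker.js -- tally word frequencies in one chunk of text.
self.onmessage = function (e) {
    // Null prototype so a word like "constructor" can't collide with
    // properties inherited from Object.prototype.
    var counts = Object.create(null);
    var words = e.data.toLowerCase().match(/[a-z']+/g) || [];
    for (var i = 0; i < words.length; i++) {
        counts[words[i]] = (counts[words[i]] || 0) + 1;
    }
    self.postMessage(counts);
};
```

```javascript
// Main thread -- split the text, hand each worker a chunk, merge the tallies.
function countWithWorkers(text, numWorkers, done) {
    var chunkSize = Math.ceil(text.length / numWorkers);
    var totals = Object.create(null);
    var pending = numWorkers;
    for (var i = 0; i < numWorkers; i++) {
        // Note: slicing by character count can split a word at a chunk
        // boundary; a real version would snap the cut to whitespace.
        var chunkText = text.slice(i * chunkSize, (i + 1) * chunkSize);
        var worker = new Worker('freq-worker.js');
        worker.onmessage = function (e) {
            for (var word in e.data) {
                totals[word] = (totals[word] || 0) + e.data[word];
            }
            if (--pending === 0) {
                done(totals); // combined word -> count map, ready to chart
            }
        };
        worker.postMessage(chunkText);
    }
}
```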
Before I attempt this, I was hoping I could get a sanity check from other folks here who may have worked with Web Workers before. For one thing, I'm wondering if splitting up the contents of div#container efficiently will be an issue -- I suppose it would involve some sort of traversal through the DOM tree under div#container.
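A sketch of what that traversal might look like, gathering the text in one pass before chunking it (purely illustrative):

```javascript
// Walk the text nodes under #container and join them into one string,
// which can then be split into chunks for the workers.
function collectText(root) {
    var walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT, null, false);
    var parts = [];
    while (walker.nextNode()) {
        parts.push(walker.currentNode.nodeValue);
    }
    return parts.join(' ');
}

var fullText = collectText(document.getElementById('container'));
```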
Web workers would certainly be a viable option, but the trade-off is that you can't guarantee cross-browser compatibility. It might be worth breaking the content up into chunks and making use of setTimeout, to see if that makes a difference. It will prevent the browser from locking up, and would prevent any long-running script warnings from occurring. Nicholas Zakas wrote a blog entry about this sort of thing a while ago: http://www.nczonline.net/blog/2009/01/13/speed-up-your-javascript-part-1/
The method he suggests is:
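(a sketch of the pattern, with the 100ms delay referred to below; the post has his exact code)

```javascript
// Process one item at a time, then yield back to the browser before the
// next, so a huge array never blocks the UI thread in one long run.
function chunk(array, process, context) {
    setTimeout(function next() {
        var item = array.shift();
        process.call(context, item);

        if (array.length > 0) {
            setTimeout(next, 100); // the 100ms delay discussed below
        }
    }, 100);
}

// e.g. chunk(words, tallyWord); -- where tallyWord is a hypothetical
// per-word counting function.
```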
Personally, I don't think the delay of 100ms is necessary; I've seen it stated elsewhere that you can put a delay of 0ms, as this is enough to interrupt a long running script, and prevent the browser from locking up.
If this doesn't improve things, then yes, Web Workers would be the way to go.