Can Web Workers be used to calculate word frequencies on a very long page?
I'm writing a browser-based (JavaScript and jQuery) linguistic analysis tool that extracts text from HTML and then extracts linguistic units such as sentences, words, and so on.
To import text, a PHP backend spiders a given URL and sanitizes the resulting HTML. That HTML is then inserted into a div#container in the interface, something like this:
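(A minimal sketch of that insertion; the endpoint name and parameters are illustrative, not from the question:)

```javascript
// Ask the PHP backend to spider and sanitize the page, then inject the
// result into the container div. '/spider.php' is a hypothetical route.
$.get('/spider.php', { url: targetUrl }, function (sanitizedHtml) {
    $('#container').html(sanitizedHtml);
});
```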
I have run into some difficulties when the source HTML page is very long. Reading and inserting such a page into the interface's DOM doesn't seem to cause problems (though it takes a while).
But running a word frequency algorithm over the spidered content is very slow if the page is long. If the page approaches 100K words, say, it will pretty much bring things to a grinding halt.
So, I see a few options:
- Change the PHP spider so that it will truncate the source document or subdivide it into multiple documents
- Change the word frequency algorithm so that it's less exact, and samples the word distribution rather than recording it completely
- Try out this new-fangled Web Worker thing to see if I can distribute the calculation across multiple background processes.
It would appear to me that (3) is just the sort of thing that Web Workers are designed to do. I'm imagining splitting the spidered content into chunks, and then assigning one Web Worker to each chunk. The word frequency profile of each chunk can be returned from the Web Worker, and then summed up and rendered to the chart.
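A rough sketch of that plan, assuming Worker support (the file name and the tokenizer regex are illustrative):

```javascript
// freq-worker.js -- tally word frequencies in one chunk of text.
self.onmessage = function (e) {
    // Null prototype so a word like "constructor" can't collide with
    // properties inherited from Object.prototype.
    var counts = Object.create(null);
    var words = e.data.toLowerCase().match(/[a-z']+/g) || [];
    for (var i = 0; i < words.length; i++) {
        counts[words[i]] = (counts[words[i]] || 0) + 1;
    }
    self.postMessage(counts);
};
```

```javascript
// Main thread -- split the text, hand each worker a chunk, merge the tallies.
function countWithWorkers(text, numWorkers, done) {
    var chunkSize = Math.ceil(text.length / numWorkers);
    var totals = Object.create(null);
    var pending = numWorkers;
    for (var i = 0; i < numWorkers; i++) {
        // Note: slicing by character count can split a word at a chunk
        // boundary; a real version would snap the cut to whitespace.
        var chunkText = text.slice(i * chunkSize, (i + 1) * chunkSize);
        var worker = new Worker('freq-worker.js');
        worker.onmessage = function (e) {
            for (var word in e.data) {
                totals[word] = (totals[word] || 0) + e.data[word];
            }
            if (--pending === 0) {
                done(totals); // combined word -> count map, ready to chart
            }
        };
        worker.postMessage(chunkText);
    }
}
```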
Before I attempt this, I was hoping I could get a sanity check from other folks here who may have worked with Web Workers before. For one thing, I'm wondering if splitting up the contents of div#container efficiently will be an issue -- I suppose it would involve some sort of traversal through the DOM tree under div#container.
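A sketch of what that traversal might look like, gathering the text in one pass before chunking it (purely illustrative):

```javascript
// Walk the text nodes under #container and join them into one string,
// which can then be split into chunks for the workers.
function collectText(root) {
    var walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT, null, false);
    var parts = [];
    while (walker.nextNode()) {
        parts.push(walker.currentNode.nodeValue);
    }
    return parts.join(' ');
}

var fullText = collectText(document.getElementById('container'));
```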
Web workers would certainly be a viable option, but the trade-off is that you can't guarantee cross-browser compatibility. It might be worth breaking the content up into chunks and making use of setTimeout, to see if that makes a difference. It will prevent the browser from locking up, and would prevent any long-running script warnings from occurring. Nicholas Zakas wrote a blog entry about this sort of thing a while ago: http://www.nczonline.net/blog/2009/01/13/speed-up-your-javascript-part-1/
The method he suggests is:
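(a sketch of the pattern, with the 100ms delay referred to below; the post has his exact code)

```javascript
// Process one item at a time, then yield back to the browser before the
// next, so a huge array never blocks the UI thread in one long run.
function chunk(array, process, context) {
    setTimeout(function next() {
        var item = array.shift();
        process.call(context, item);

        if (array.length > 0) {
            setTimeout(next, 100); // the 100ms delay discussed below
        }
    }, 100);
}

// e.g. chunk(words, tallyWord); -- where tallyWord is a hypothetical
// per-word counting function.
```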
Personally, I don't think the delay of 100ms is necessary; I've seen it stated elsewhere that you can put a delay of 0ms, as this is enough to interrupt a long running script, and prevent the browser from locking up.
If this doesn't improve things, then yes, Web Workers would be the way to go.