The fastest way to search a web page for thousands of keywords

Posted on 2025-01-07 20:59:27

I want to scan a web page for the existence of keywords from my dictionary.
There are already questions about highlighting keywords on a page. However, my dictionary will be huge, e.g. 50,000 words. What is the best way to do it?
I also want to search the page for variations of the terms in my library. For example, my library contains gene names such as p53. I want to search the page for "p53", "p53 protein", "induction of p53", "inhibits p53" and "phosphorylates p53". How can I do this, and what would be the fastest way?

Or suppose I have two lists:

   List1                List2
   ------              -------    
   inhibits              p21
   induces               p53 
   phosphorylates        Akt
   decreases             Braf
                         cMyc

I want to be able to search for combinations of List1 and List2, such as:

"inhibits cMyc"
"phosphorylates p21"

For this example, that means searching for 4 x 5 = 20 keywords.
With the real lists it will be something like 200 x 50,000 = 10,000,000 search terms.

Comments (2)

九厘米的零° 2025-01-14 20:59:27

Try this, it may help you:

http://www.gotoquiz.com/web-coding/programming/javascript/highlight-words-in-text-with-jquery/

You have to prepare your pattern like this (an example to give you the idea):

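using System.Text.RegularExpressions;

// Turn the comma-separated list into an alternation: (Cat|rabbit|dog|hound|fox)
string keywords = "Cat, rabbit, dog,hound, fox";
Regex r = new Regex(@", ?");
keywords = "(" + r.Replace(keywords, "|") + ")";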
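
Since the question is about two lists, here is a rough JavaScript sketch of the same alternation idea applied to both lists at once. The helper names (escapeRegExp, buildPairPattern) and the sample sentence are my own assumptions, not part of the linked article, and a single alternation over 50,000 names may well be too slow in practice, so treat it as an illustration only.

```javascript
// Hypothetical helper: escape characters that are special in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build one pattern that matches "<verb> <name>"
// for any verb in the first list and any name in the second list.
function buildPairPattern(verbs, names) {
  const verbGroup = verbs.map(escapeRegExp).join("|");
  const nameGroup = names.map(escapeRegExp).join("|");
  // "g" finds every occurrence, "i" ignores case.
  return new RegExp("\\b(" + verbGroup + ")\\s+(" + nameGroup + ")\\b", "gi");
}

const pattern = buildPairPattern(
  ["inhibits", "induces", "phosphorylates", "decreases"],
  ["p21", "p53", "Akt", "Braf", "cMyc"]
);

// Sample text (made up for this example).
const text = "Nutlin induces p53, which inhibits cMyc.";
const found = Array.from(text.matchAll(pattern), (m) => m[0]);
console.log(found); // the matched "verb name" phrases
```

The point of building one pattern from the two lists is that you never have to generate the 4 x 5 combinations explicitly; the regex engine tries them for you.
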
小清晰的声音 2025-01-14 20:59:27

Firstly, the dictionary must be indexed. Then the page content should also be indexed and matched against the dictionary. Finally, the matches in the page should be dealt with (e.g. highlighted, linked to definitions, etc.).

The above should be done on the server unless you want to run it in a browser on random web pages, say as a Greasemonkey script. I don't think a well indexed 50,000 item dictionary will faze a reasonably modern browser on an average PC, even for pages with several thousand words.

Edit

If you have two lists, then index the words on the page (e.g. a very simple index is a sorted, unique list with a key to the position of the first word starting with A, the first starting with B, etc.). Use the short list to search it for matching words.
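
A minimal sketch of such a page index might look like the following. The function name buildLetterIndex and the sample words are assumptions made up for illustration, and lower-casing everything is my choice rather than something the approach requires.

```javascript
// Hypothetical helper: build a sorted, unique word list plus a map from
// each first letter to the position of the first word starting with it.
function buildLetterIndex(words) {
  const unique = new Set(
    words.filter((w) => w.length > 0).map((w) => w.toLowerCase())
  );
  const sorted = Array.from(unique).sort(); // default sort: UTF-16 code unit order
  const firstIndex = new Map();
  sorted.forEach((word, i) => {
    const letter = word[0];
    if (!firstIndex.has(letter)) firstIndex.set(letter, i);
  });
  return { sorted, firstIndex };
}

// Sample page words (made up for this example).
const pageWords = "p53 inhibits Akt and p21 while Akt induces p53".split(" ");
const index = buildLetterIndex(pageWords);
console.log(index.sorted);
console.log(index.firstIndex.get("p")); // position of the first word starting with "p"
```

With the letter map you can jump straight to the slice of the list that can contain a given word, instead of searching the whole list.
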

Take that first set of word matches and find them in the page, then get the preceding word and see if it matches a word in the second list. Do the same with the following word. A simple binary search in a 50,000 word list never needs more than 16 lookups. Using an alphabetic index for the first lookup, then a binary search, should cut it to probably 5 or 6 lookups per original match.
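
Roughly, the neighbour check could be sketched like this. The names binaryHas and findPairs, the tokenising regex and the sample sentence are all assumptions of mine, and I use a Set for the short list purely for brevity.

```javascript
// Classic binary search over a sorted array of strings; true if target is present.
function binaryHas(sorted, target) {
  let lo = 0;
  let hi = sorted.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >>> 1;
    if (sorted[mid] === target) return true;
    if (sorted[mid] < target) lo = mid + 1;
    else hi = mid - 1;
  }
  return false;
}

// Hypothetical helper: find page words from the short list, then test the
// preceding and following word against the (sorted) long list.
function findPairs(text, shortList, longList) {
  const shortSet = new Set(shortList.map((w) => w.toLowerCase()));
  const sortedLong = longList.map((w) => w.toLowerCase()).sort();
  const tokens = text.match(/[A-Za-z0-9]+/g) || []; // crude tokeniser (assumption)
  const pairs = [];
  tokens.forEach((tok, i) => {
    if (!shortSet.has(tok.toLowerCase())) return;
    const prev = tokens[i - 1];
    const next = tokens[i + 1];
    if (prev !== undefined && binaryHas(sortedLong, prev.toLowerCase())) {
      pairs.push(prev + " " + tok);
    }
    if (next !== undefined && binaryHas(sortedLong, next.toLowerCase())) {
      pairs.push(tok + " " + next);
    }
  });
  return pairs;
}

const pairs = findPairs(
  "Nutlin induces p53, which inhibits cMyc. Akt phosphorylates Braf.",
  ["inhibits", "induces", "phosphorylates", "decreases"],
  ["p21", "p53", "Akt", "Braf", "cMyc"]
);
console.log(pairs);

// Upper bound on binary search steps for a 50,000 word list.
console.log(Math.ceil(Math.log2(50000)));
```

Only the words that already matched the short list trigger a lookup in the long list, so the cost depends on the number of matches on the page, not on the number of possible combinations.
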

You can also use an object instead of an indexed list and use if (word in wordList), which will also be very fast (remember to include a hasOwnProperty test).
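
As a small sketch of the object approach (the names wordList and isKnownWord and the sample entries are assumptions for illustration):

```javascript
// Build a lookup object keyed by the lower-cased dictionary words.
const wordList = {};
for (const name of ["p21", "p53", "Akt", "Braf", "cMyc"]) {
  wordList[name.toLowerCase()] = true;
}

// Hypothetical helper: own-property check, so inherited names such as
// "constructor" are not reported as dictionary words.
function isKnownWord(word) {
  return Object.prototype.hasOwnProperty.call(wordList, word.toLowerCase());
}

console.log(isKnownWord("P53"));          // a word from the list
console.log("constructor" in wordList);   // "in" also sees inherited properties
console.log(isKnownWord("constructor"));  // the own-property test filters it out
```

This is why the hasOwnProperty test matters: a plain `in` check would treat words like "constructor" on the page as dictionary hits.
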

If you only need to support modern browsers, check out Web Workers and Web Storage.
