协助建立倒排索引

发布于 2024-08-27 09:03:47 字数 474 浏览 13 评论 0原文

这是我为学校做的信息检索工作的一部分。该计划是使用单词的前两个字母作为键,并将具有这两个字母的任何单词保存为字符串值来创建单词的哈希图。因此,

hashmap["ba"] = "bad barley base"

一旦我完成了对一行的标记,我就会获取该哈希图,将其序列化,并将其附加到以键命名的文本文件中。

这个想法是,如果我获取数据并将其分布在数百个文件中,我将通过减少每个文件的密度来减少完成搜索所需的时间。我遇到的问题是,当我在每次运行中创建 100 多个文件时,无论出于何种原因,它都会在创建一些文件时卡住,因此这些条目是空的。 有什么办法可以提高效率吗?是否值得继续这样做,或者我应该放弃它?

我想说我正在使用 PHP。我比较熟悉的两种语言是 PHP 和 Java。我选择 PHP 是因为前端非常简单,而且我能够毫无问题地添加自动完成/建议搜索等功能。我也认为使用 Java 没有任何好处。如有任何帮助,我们将不胜感激,谢谢。

It's part of an information retrieval thing I'm doing for school. The plan is to create a hashmap of words using the the first two letters of the word as a key and any words with the two letters saved as a string value. So,

hashmap["ba"] = "bad barley base"

Once I'm done tokenizing a line I take that hashmap, serialize it, and append it to the text file named after the key.

The idea is that if I take my data and spread it over hundreds of files I'll lessen the time it takes to fulfill a search by lessening the density of each file. The problem I am running into is when I'm making 100+ files in each run it happens to choke on creating a few files for whatever reason and so those entries are empty. Is there any way to make this more efficient? Is it worth continuing this, or should I abandon it?

I'd like to mention I'm using PHP. The two languages I know relatively intimately are PHP and Java. I chose PHP because the front end will be very simple to do and I will be able to add features like autocompletion/suggested search without a problem. I also see no benefit in using Java. Any help is appreciated, thanks.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

醉南桥 2024-09-03 09:03:47

我将使用单个文件来获取和放置序列化字符串。我还会使用 json 作为序列化。

放入数据

$string = "bad barley base";
$data = explode(" ",$string);
$hashmap["ba"] = $data;

$jsonContent = json_encode($hashmap);
file_put_contents("a-z.txt",$jsonContent);

获取数据

$jsonContent = file_get_contents("a-z.txt");
$hashmap = json_decode($jsonContent);

foreach($hashmap as $firstTwoCharacters => $value) {
    if ($firstTwoCharacters == 'ba') {
        $wordCount = count($value);
    }
}

I would use a single file to get and put the serialized string. I would also use json as the serialization.

Put the data

$string = "bad barley base";
$data = explode(" ",$string);
$hashmap["ba"] = $data;

$jsonContent = json_encode($hashmap);
file_put_contents("a-z.txt",$jsonContent);

Get the data

$jsonContent = file_get_contents("a-z.txt");
$hashmap = json_decode($jsonContent);

foreach($hashmap as $firstTwoCharacters => $value) {
    if ($firstTwoCharacters == 'ba') {
        $wordCount = count($value);
    }
}
中性美 2024-09-03 09:03:47

您没有解释您要解决的问题。我猜您正在尝试创建一个全文搜索引擎,但您的哈希映射中没有文档 ID,因此我不确定您如何使用哈希映射来查找匹配的文档。

假设您想要一个全文搜索引擎,我会考虑使用 trie 作为数据结构。您应该能够将所有东西放入其中,而不会变得太大。与要索引的单词匹配的节点将包含包含该单词的文档的 ID。

You didn't explain the problem you are trying to solve. I'm guessing you are trying to make a full text search engine, but you don't have document ids in your hashmap so I'm not sure how you are using the hashmap to find matching documents.

Assuming you want a full text search engine, I would look into using a trie for the data structure. You should be able to fit everything in it without it growing too large. Nodes that match a word you want to index would contain the ids of the documents containing that word.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文