Indexing text files in PHP

Posted on 2024-08-03 22:15:08

I have been set a challenge: create an indexer that takes every word of 4 or more characters and stores it in a database, along with the number of times the word was used.

I have to run this indexer on 4,000 txt files. Currently it takes about 12-15 minutes, and I'm wondering if anyone has suggestions for speeding it up?

Currently I'm placing the words in an array as follows:

// ==============================================================
// === Create an index of all the words in the document
// ==============================================================
function index(){
    $this->index = Array();
    $this->index_frequency = Array();

    $this->original_file = str_replace("\r", " ", $this->original_file);
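    // Split on any run of whitespace: with explode(" ") a bare "\n" or "\t" stays
    // inside a token, and clean_string() then glues the two words together.
    $this->index = preg_split('/\s+/', $this->original_file, -1, PREG_SPLIT_NO_EMPTY);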

    // Build new frequency array
    foreach($this->index as $key=>$value){
        // remove everything except letters
        $value = clean_string($value);

        if($value == '' || strlen($value) < MIN_CHARS){
            continue;
        }

        if(array_key_exists($value, $this->index_frequency)){
            $this->index_frequency[$value] = $this->index_frequency[$value] + 1;
        } else{
            $this->index_frequency[$value] = 1;
        }
    }
    return $this->index_frequency;
}

I think the biggest bottleneck at the moment is the script that stores the words in the database. It adds the document to the essays table; then, for each word, if the word already exists in the index table it appends essayid(frequency of the word) to that word's essays field, and if the word doesn't exist it inserts a new row.

// ==============================================================
// === Store the word frequencies in the db
// ==============================================================
private function store(){
    $index = $this->index();

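    // Escape the filename: an apostrophe in it would otherwise break the query
    $title = mysql_real_escape_string($this->original_filename);
    mysql_query("INSERT INTO essays (checksum, title, total_words) VALUES ('{$this->checksum}', '{$title}', '{$this->get_total_words()}')") or die(mysql_error());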

    $essay_id = mysql_insert_id();

    foreach($this->index_frequency as $key=>$value){

        $check_word = mysql_result(mysql_query("SELECT COUNT(word) FROM `index` WHERE word = '$key' LIMIT 1"), 0);

        $eid_frequency = $essay_id . "(" . $value . ")";

        if($check_word == 0){
            $save = mysql_query("INSERT INTO `index` (word, essays) VALUES ('$key', '$eid_frequency')");
        } else {
            $eid_frequency = "," . $eid_frequency;
            $save = mysql_query("UPDATE `index` SET essays = CONCAT(essays, '$eid_frequency') WHERE word = '$key' LIMIT 1");
        }
    }
}

Comments (1)

熟人话多 2024-08-10 22:15:08

You might consider profiling your app to find out exactly where your bottlenecks are. That should give you a better understanding of what can be improved.
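As a rough illustration of that profiling step, here is a minimal timing sketch, assuming PHP 7 or later. The `timed()` helper and the two stand-in phases are hypothetical and not part of the original script; in the real indexer the closures would wrap the `index()` and `store()` calls for each file. A dedicated profiler would give finer detail, but something like this is often enough to confirm whether the database phase dominates.

```php
<?php
// Hypothetical helper (not from the original script): runs $fn and adds the
// elapsed wall-clock seconds to $totals[$label], so phases can be compared.
function timed(string $label, callable $fn, array &$totals)
{
    $start = microtime(true);
    $result = $fn();
    $totals[$label] = ($totals[$label] ?? 0.0) + (microtime(true) - $start);
    return $result;
}

$totals = [];

// Stand-in for the word-counting phase; the real code would call index() here.
$words = timed('index', function () {
    return array_count_values(str_word_count('alpha beta alpha gamma', 1));
}, $totals);

// Stand-in for the storage phase; usleep() is a placeholder for the
// database round trips made by store().
timed('store', function () use ($words) {
    usleep(1000);
}, $totals);

arsort($totals); // slowest phase first
foreach ($totals as $label => $seconds) {
    printf("%-6s %.4f s\n", $label, $seconds);
}
```

Calling `timed()` with the same label for every file accumulates the totals across all 4,000 documents, so the final report shows where the 12-15 minutes actually go.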

Regarding DB optimisation: check whether you have an index on the word column, then try to reduce the number of times you hit the database. INSERT ... ON DUPLICATE KEY UPDATE ..., maybe?
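To make the INSERT ... ON DUPLICATE KEY UPDATE idea concrete, here is a sketch under a few assumptions: PHP 7.1 or later, MySQL, and a UNIQUE (or PRIMARY) key on the word column of the `index` table. The function name `build_index_upsert()` is made up for illustration, and the PDO connection details in the trailing comment are placeholders. It builds one multi-row statement per essay, keeping the `essayid(frequency)` format from the question.

```php
<?php
// Hypothetical helper: builds a single multi-row upsert for one essay.
// Assumes a unique key on `index`.word, which could be added with e.g.
//   ALTER TABLE `index` ADD UNIQUE KEY word_unique (word);
function build_index_upsert(int $essayId, array $frequencies): array
{
    if (count($frequencies) === 0) {
        throw new InvalidArgumentException('No words to store');
    }

    $placeholders = [];
    $params = [];
    foreach ($frequencies as $word => $count) {
        $placeholders[] = '(?, ?)';
        $params[] = (string) $word;
        $params[] = $essayId . '(' . $count . ')'; // same "id(count)" format as the question
    }

    // New words are inserted; for existing words the new entry is appended
    // to the stored list, separated by a comma.
    $sql = 'INSERT INTO `index` (word, essays) VALUES '
         . implode(', ', $placeholders)
         . " ON DUPLICATE KEY UPDATE essays = CONCAT(essays, ',', VALUES(essays))";

    return [$sql, $params];
}

[$sql, $params] = build_index_upsert(7, ['alpha' => 2, 'gamma' => 1]);
echo $sql, "\n";
print_r($params);

// Executing it with PDO (connection details are placeholders):
// $pdo = new PDO('mysql:host=localhost;dbname=essays_db', 'user', 'pass');
// $pdo->prepare($sql)->execute($params);
```

This replaces the SELECT plus INSERT-or-UPDATE pair for every word with one round trip per essay. For very large documents it may be necessary to split the word list into chunks so that a single statement stays within the server's packet size limit.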
