How to get the word count of the returned documents to calculate TF

Posted on 2024-09-30 18:52:24

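I've been set the challenge of creating a basic text-file search engine in PHP in a very limited time. With little to no previous programming knowledge, it's quite a task!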

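Here is what we have so far. It does manage to return the document with the highest number of occurrences of a word (or several documents, if they tie on that count).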

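The problem is that the way we have done it does not (at least not easily) allow us to calculate the TF-IDF score. The IDF is done, but to calculate the TF we need the total number of words in the returned document, and that is where we are stuck. The other problem is that it only returns the highest-ranking document, and we cannot get it to return a list of all documents, each with its score. For example, if one document contains the word "airline" 3 times and two other documents contain it once, those two are dropped and only the first is returned.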
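A minimal sketch of the scoring described above, not what the script currently does. It assumes TF is the term's count divided by the document's total word count, and IDF is log10(N / df), matching the script's IDF line. The function name `scoreDocuments` and the sample file names are made up for illustration. `$indexes` is assumed to hold one word => count array per document, which is the shape `indexFile()` returns.

```php
<?php
// Hypothetical helper (not part of the original script): score every
// document for one term instead of keeping only the maximum.
function scoreDocuments(array $indexes, string $term): array
{
    $numDocs = count($indexes);

    // Document frequency: how many documents contain the term at all.
    $df = 0;
    foreach ($indexes as $index) {
        if (!empty($index[$term])) {
            $df++;
        }
    }
    if ($df === 0) {
        return array(); // the term does not occur in any document
    }

    $idf = log10($numDocs / $df);

    $scores = array();
    foreach ($indexes as $doc => $index) {
        $count = isset($index[$term]) ? $index[$term] : 0;
        if ($count === 0) {
            continue; // skip documents that do not contain the term
        }
        // Summing all per-word counts gives the document's total word count
        // (after stop-word removal, because the index was built that way).
        $total = array_sum($index);
        $scores[$doc] = ($count / $total) * $idf;
    }

    arsort($scores, SORT_NUMERIC); // highest score first, keys preserved
    return $scores;
}

// Made-up sample data in the shape indexFile() produces.
$indexes = array(
    'a.txt' => array('airline' => 3, 'pilot' => 1),
    'b.txt' => array('airline' => 1, 'africa' => 3),
    'c.txt' => array('africa' => 2, 'safari' => 2),
);
$scores = scoreDocuments($indexes, 'airline');
print_r($scores);
```

Because the result keeps one entry per matching document, the documents with a single occurrence are listed after the top one rather than dropped.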

(There were also some problems with stripping symbols, but we worked around that, albeit with a drawn-out method.)
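One possible shorter way to strip symbols, shown only as a sketch: delete everything that is not a letter, digit or whitespace, then split on runs of whitespace. The function name `tokenize` is hypothetical, and the choice to delete symbols outright (so "It's" becomes "its") is an assumption that mirrors the script's existing behaviour.

```php
<?php
// Hypothetical helper: turn raw text into a list of lower-case words.
function tokenize(string $text): array
{
    // Lower-case first, then drop every character that is not a-z, 0-9 or whitespace.
    $clean = preg_replace('/[^a-z0-9\s]/', '', strtolower($text));

    // Split on any run of spaces, tabs or newlines; PREG_SPLIT_NO_EMPTY
    // discards the empty strings that leading/trailing whitespace would leave.
    return preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY);
}

$words = tokenize("Hello, world!  It's\t2-for-1.\n");
print_r($words);
```

Splitting with `preg_split` avoids the separate trim, collapse, `explode("\n")` and `implode(" ")` steps.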

Here is what we have:

<?php
$starttime = microtime(true); // current Unix time in seconds, as a float

// Normalise the query the same way the index is built (lower case, no surrounding spaces).
$searchWord = isset($_GET['search']) ? strtolower(trim($_GET['search'])) : null;

?>
<html>
<link href="style.css" rel="stylesheet" type="text/css">
<body>
<div id="wrapper">
    <div id="searchbar">
        <h1>PHP Search</h1>
        <form name='searchform' id='searchform' action='<?php echo htmlspecialchars($_SERVER['PHP_SELF'], ENT_QUOTES); ?>' method='get'>
          <input type='text' name='search' id='search' value='<?php echo htmlspecialchars((string) $searchWord, ENT_QUOTES); ?>' />
            <input type='submit' value='Search' />
        </form>
        <br />
        <br />
    </div><!-- close searchbar -->
    <?php


//path to directory to scan
$directory = "./files/";

//get all text files with a .txt extension.
$fileList = glob($directory . "*.txt");
if ($fileList === false) {
    $fileList = array(); // glob() returns false on error
}


        function indexFile($file){
            // Read the whole file; unlike fread() with filesize(), this also copes with empty files.
            $file_contents = (string) file_get_contents($file);

            // Strip everything except letters, digits and whitespace.
            $new_contents = preg_replace('/[^A-Za-z0-9\s]/', '', $file_contents);
            // Collapse every run of whitespace (spaces, tabs, newlines) into one space and trim both ends.
            $new_contents = trim(preg_replace('/\s+/', ' ', $new_contents));

            //COMMON WORDS WERE HERE
            include "commonwords.php";

            $lines = explode("\n", $new_contents);
            $lines2 = implode(" ", $lines); //string
            $lines2 = strtolower($lines2);

            //echo $lines2 . "<br><br>";

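            $words = explode(" ", $lines2); //array of every word in the document
            //$words = $lines;
            // Drop stop words ($commonWords comes from commonwords.php) and reindex from 0.
            $useful_words = array_diff($words, $commonWords);
            $useful_words = array_values($useful_words);
            // Debug output: the number of non-stop-words in this document.
            print_r(count($useful_words));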

            //echo '<pre>';
            $index = array_count_values($useful_words);
            arsort($index, SORT_NUMERIC);
            //print_r($index);
            //echo '</pre>';

            return $index;
        }
       // $file1 = indexFile ('airlines.txt'); //array
       // $file2 = indexFile ('africa.txt');  //array

        function merge_common_keys(){
            $arr = func_get_args();
            $num = func_num_args();

            $keys = array();
            $i = 0;
            for($i=0;$i<$num;++$i){
                $keys = array_merge($keys, array_keys($arr[$i]));
            }
            $keys = array_unique($keys);

            $merged = array();

            foreach($keys as $key){
                $merged[$key] = array();
                for($i=0;$i<$num;++$i){
                    $merged[$key][] = isset($arr[$i][$key])?$arr[$i][$key]:null;
                }
            }
            return $merged;
        }


    // Index every file; $fileArray[$i] is the word => count index for $fileList[$i].
    $fileArray = array();
    foreach ($fileList as $i => $file) {
        $fileArray[$i] = indexFile($file);
    }

        $merged = call_user_func_array('merge_common_keys',$fileArray);

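        // Per-document counts for the search word; empty if there is no query or the word is not indexed.
        $searchQ = ($searchWord !== null && isset($merged[$searchWord])) ? $merged[$searchWord] : array();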
        echo '<pre>';
        print_r($searchQ);
        echo '</pre>';


        //echo "hello2";
    $maxValue = 0;
    $num_docs = 0;
    $docID = array();
    $n = count($searchQ);
    for ($i=0 ; $i < $n ; $i++) {
        if ($searchQ[$i] > $maxValue) {
            $maxValue = $searchQ[$i];
            $docID = array($i); // new maximum: discard the previous leaders
            //print_r(count($fileArray[$i]));
        }
        else if($searchQ[$i] == $maxValue){
            $docID[] = $i;
        }
        if (!empty($searchQ[$i])) {
            $num_docs++;
        }
    }
    print_r($n);
    print_r($num_docs);
      print_r($docID);
      if(is_array($docID)){
         for ($i = 0; $i < count($docID); $i++) {
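            $plural = ($maxValue == 1) ? '' : 's';
            $link = htmlspecialchars($fileList[$docID[$i]], ENT_QUOTES);
            echo '<p><b>'.htmlspecialchars($searchWord, ENT_QUOTES).'</b> found in document <a href="'.$link.'">'.$link.'</a> '.$maxValue.' time'.$plural.'.</p>';
            $TF = $maxValue; // raw count only; the document's total word count is still missing here
            $TF2 = 1 + log($TF); // log-scaled count, restored so that $TF2 is defined below
            echo "<br>$TF2<br>";
            $DF = $num_docs;
            $Non = $n / $num_docs; // total documents divided by documents containing the word
            //echo "$Non";
            $IDF = (float) log10($Non);
            $TFxIDF = $TF2 * $IDF;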
            //echo "$TFxIDF";
         }
      }


//1,2

//file_put_contents("demo2.txt", implode(" ", $useful_words));
if(isset($_GET['search']))
{
    $endtime = microtime(true); // float seconds, directly comparable with $starttime
    $totaltime = $endtime - $starttime; 
    $totaltime = round($totaltime,5);
    echo "<div id='timetaken'><p>This page loaded in $totaltime seconds.</p></div>";
}
?>
    </div><!-- close wrapper -->
</body>
</html>
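The total word count the title asks about is already computed inside `indexFile()` as `count($useful_words)`, but only the per-word index is returned. A rough sketch of returning both follows. The function name `indexWords`, the `counts`/`total` keys and the sample words are assumptions for illustration, and the input is assumed to be an already-cleaned list of lower-case words.

```php
<?php
// Hypothetical variant of the counting step in indexFile(): return the
// total word count next to the per-word counts.
function indexWords(array $words, array $stopWords): array
{
    // array_diff() leaves gaps in the keys, so reindex with array_values().
    $useful = array_values(array_diff($words, $stopWords));

    $counts = array_count_values($useful);
    arsort($counts, SORT_NUMERIC); // most frequent word first

    return array(
        'counts' => $counts,        // word => number of occurrences
        'total'  => count($useful), // denominator for TF
    );
}

// Made-up sample input.
$words = array('the', 'airline', 'flew', 'the', 'other', 'airline', 'waited');
$doc   = indexWords($words, array('the', 'other'));

// TF for "airline": occurrences divided by the document's total word count.
$tf = $doc['counts']['airline'] / $doc['total'];
echo $tf, "\n";
```

Returning an array with two keys keeps the count and the index together, so the caller does not need to read the file a second time to compute TF.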

