Extracting keywords and multi-word keywords from text - PHP

Posted on 2024-10-18 07:40:09

I was wondering if anyone knows the best way to pull out the top recurring keywords/phrases from a block of text in PHP.

I want to build my own tag cloud for an application I'm working on. The main tricky part would be pulling out multi-word keywords such as "White House" and recognising them as one phrase rather than as two separate words.

There must be a bunch of scripts out there for this purpose, but I can't seem to find any!

Appreciate your help!

Comments (1)

不忘初心 2024-10-25 07:40:09

Here's a little chunk I used. It parses a comma-delimited string of tags and sizes each tag according to how often it appears:

PHP

# Build the tag cloud data from a comma-delimited string of tags.
# Returns a list of array("tag" => ..., "count" => ...), one entry per distinct tag.
function cs_get_tag_cloud_data($data)
{
    # Note: this removes every space, including spaces inside a tag.
    $data = str_replace(' ', '', $data);
    $tagwords_arr = explode(",", $data);
    $tags_arr = array(); # an empty array, because count(null) is a TypeError in PHP 8

    foreach ($tagwords_arr as $tagword)
    {
        # Skip empty entries, e.g. the one left by a trailing comma
        if ($tagword === '') continue;

        if (in_tag_array($tags_arr, $tagword) == false)
        {
            $word_count = get_tag_count($tagwords_arr, $tagword);
            $tags_arr[] = array("tag" => $tagword, "count" => $word_count);
        }
    }

    return $tags_arr;
}

# Count how many times $word appears in $arr (case-insensitive)
function get_tag_count($arr, $word)
{
    $wordCount = 0;
    foreach ($arr as $item)
    {
        if (strcasecmp($item, $word) === 0) $wordCount++;
    }
    return $wordCount;
}

# Check whether $search is already in the tag list (case-insensitive)
function in_tag_array($arr, $search)
{
    foreach ($arr as $entry)
    {
        if (strcasecmp($entry['tag'], $search) === 0) return true;
    }
    return false;
}
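
For comparison, here is a more compact sketch of the same counting step built on PHP's `array_count_values()`. The function name `tag_counts_compact` is made up for illustration and is not part of the answer above. It also differs in two ways: it lowercases every tag instead of keeping the casing of the first occurrence, and it returns a plain `tag => count` map rather than a list of `tag`/`count` pairs.

```php
<?php
// Hypothetical alternative to cs_get_tag_cloud_data(); name and return shape are assumptions.
function tag_counts_compact($data)
{
    // Drop all spaces (as the original does), lowercase, then split on commas.
    $tags = explode(',', strtolower(str_replace(' ', '', $data)));

    // Remove empty entries such as the one left by a trailing comma.
    $tags = array_filter($tags, 'strlen');

    // array_count_values() returns array(tag => number of occurrences).
    return array_count_values($tags);
}

print_r(tag_counts_compact('php, PHP, tags,cloud,'));
```

For the input above this should give `php` a count of 2 and `tags` and `cloud` a count of 1 each.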

HTML

<p id="tag-words">
    <?php
        $tag_data = cs_get_tag_cloud_data($cloud_data);

        // Sort by tag name (case-insensitive); usort() reindexes the array.
        usort($tag_data, function ($a, $b) {
            return strcasecmp($a['tag'], $b['tag']);
        });

        foreach ($tag_data as $tag)
        {
            // Cap the count at 10 so the font size stays between 11px and 38px.
            $count = min($tag['count'], 10);
            $font_size = 8 + ($count * 3);

            // Strip HTML entities, then escape whatever is left before printing it.
            $word = preg_replace("/&#?[a-z0-9]+;/i", "", $tag['tag']);
            $word = htmlspecialchars($word, ENT_QUOTES, 'UTF-8');

            echo '<a class="tag-link" style="font-size: ' . $font_size . 'px;" href="#">' . $word . '</a> ';
        }
    ?>
</p>
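
The sizing rule in the template above is a base of 8px plus 3px per occurrence, with the count capped at 10. As a rough sketch, it can be pulled into a small helper so it is easier to tweak or test. The name `tag_font_size` and its default parameters are assumptions made for illustration.

```php
<?php
// Hypothetical helper; the name and the default values mirror the numbers in the template.
function tag_font_size($count, $base = 8, $step = 3, $cap = 10)
{
    // Counts above the cap all get the same (largest) size.
    return $base + $step * min($count, $cap);
}

echo tag_font_size(1) . "px\n";  // 11px
echo tag_font_size(25) . "px\n"; // 38px, same as a count of 10
```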

It's just something I've used, but I thought I'd share it; maybe it helps you.

Edit: For two-word tags, you could just do something like "White-House" and then remove the dash when you're echoing. Just another thought.
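
A minimal sketch of that dash idea, assuming the dash should turn into a space when the tag is displayed. The helper name `tag_display_text` is hypothetical. Keep in mind that a tag that legitimately contains a dash, such as "e-mail", would be changed too.

```php
<?php
// Hypothetical helper: convert a dash-joined tag such as "White-House" into display text.
function tag_display_text($tag)
{
    // Assumption: every dash in a stored tag stands for a space.
    return str_replace('-', ' ', $tag);
}

echo tag_display_text('White-House'); // White House
```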
