使用Sql计算TF-IDF

发布于 2024-09-12 18:43:21 字数 166 浏览 4 评论 0原文

我的数据库中有一个表,其中包含自由文本字段列。

我想知道每个单词在所有行中出现的频率,或者甚至计算所有单词的 TF-IDF,其中我的文档是该字段每行的值。

是否可以使用 Sql 查询来计算此值?如果没有或者有更简单的方法,您能指导我吗?

非常感谢,

乔恩

I have a table in my DB containning a free text field column.

I would like to know the frequency each word appears over all the rows, or maybe even calc a TF-IDF for all words, where my documents are that field's values per row.

Is it possible to calculate this using an Sql Query? if not or there's a simpler way could you please direct me to it?

Many Thanks,

Jon

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

东京女 2024-09-19 18:43:21

在 SQL Server 2008 中,根据您的需要,您可以对列应用全文索引,然后查询 sys.dm_fts_index_keywordssys.dm_fts_index_keywords_by_document 表值函数来获取出现次数。

编辑:实际上,即使不创建持久全文索引,您仍然可以利用解析器

WITH testTable AS
(
SELECT 1 AS Id, N'how now brown cow' AS txt UNION ALL
SELECT 2, N'she sells sea shells upon the sea shore' UNION ALL
SELECT 3, N'red lorry yellow lorry' UNION ALL
SELECT 4, N'the quick brown fox jumped over the lazy dog'
)

SELECT display_term, COUNT(*) As Cnt
FROM testTable
CROSS APPLY sys.dm_fts_parser('"' + REPLACE(txt,'"','""') + '"', 1033, 0,0)
WHERE TXT IS NOT NULL
GROUP BY display_term
HAVING COUNT(*) > 1
ORDER BY Cnt DESC

返回

display_term                   Cnt
------------------------------ -----------
the                            3
brown                          2
lorry                          2
sea                            2

In SQL Server 2008 depending on your needs you could apply full text indexing to the column then query the sys.dm_fts_index_keywords and sys.dm_fts_index_keywords_by_document table valued functions to get the occurrence count.

Edit: Actually even without creating a persistent full text index you can still leverage the parser

WITH testTable AS
(
SELECT 1 AS Id, N'how now brown cow' AS txt UNION ALL
SELECT 2, N'she sells sea shells upon the sea shore' UNION ALL
SELECT 3, N'red lorry yellow lorry' UNION ALL
SELECT 4, N'the quick brown fox jumped over the lazy dog'
)

SELECT display_term, COUNT(*) As Cnt
FROM testTable
CROSS APPLY sys.dm_fts_parser('"' + REPLACE(txt,'"','""') + '"', 1033, 0,0)
WHERE TXT IS NOT NULL
GROUP BY display_term
HAVING COUNT(*) > 1
ORDER BY Cnt DESC

Returns

display_term                   Cnt
------------------------------ -----------
the                            3
brown                          2
lorry                          2
sea                            2
慢慢从新开始 2024-09-19 18:43:21

SQL Server 2008 的解决方案:

这里是表:

CREATE TABLE MyTable (id INT, txt VARCHAR(MAX));

这里是 SQL 查询:

SELECT sum(case when TSplitted.txt_word = 'searched' then 1 else 0 end) as cnt_searched
     , count(*) as cnt_all
FROM MyTable MYT 
INNER JOIN Fn_Split(MYT.id,' ',MYT.txt) TSplitted on MYT.id=TSplitted.id

这里是表值函数 Fn_Split(@id int, @separator VARCHAR(32), @string VARCHAR(MAX)) (取自 此处):

CREATE FUNCTION Fn_Split (@id int, @separator VARCHAR(32), @string VARCHAR(MAX))

RETURNS @t TABLE
    (
        ret_id INT
       ,txt_word VARCHAR(MAX)
    )   
AS
    BEGIN
        DECLARE @xml XML
        SET @XML = N'<root><r>' + REPLACE(@s, @separator, '</r><r>') + '</r></root>'

        INSERT INTO @t(ret_id, val)
        SELECT @id, r.value('.','VARCHAR(5)') as Item
        FROM @xml.nodes('//root/r') AS RECORDS(r)

        RETURN
    END

Solution for SQL Server 2008:

here is the table:

CREATE TABLE MyTable (id INT, txt VARCHAR(MAX));

here is SQL query:

SELECT sum(case when TSplitted.txt_word = 'searched' then 1 else 0 end) as cnt_searched
     , count(*) as cnt_all
FROM MyTable MYT 
INNER JOIN Fn_Split(MYT.id,' ',MYT.txt) TSplitted on MYT.id=TSplitted.id

here is table valued function Fn_Split(@id int, @separator VARCHAR(32), @string VARCHAR(MAX)) (taken from here):

CREATE FUNCTION Fn_Split (@id int, @separator VARCHAR(32), @string VARCHAR(MAX))

RETURNS @t TABLE
    (
        ret_id INT
       ,txt_word VARCHAR(MAX)
    )   
AS
    BEGIN
        DECLARE @xml XML
        SET @XML = N'<root><r>' + REPLACE(@s, @separator, '</r><r>') + '</r></root>'

        INSERT INTO @t(ret_id, val)
        SELECT @id, r.value('.','VARCHAR(5)') as Item
        FROM @xml.nodes('//root/r') AS RECORDS(r)

        RETURN
    END
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文