Quanteda: remove empty documents to compute tf-idf, but keep them in the final dfm

Posted on 2025-02-03 09:10:13


I am trying to compute tf-idf on a dataset that contains many empty documents. I want to compute tf-idf without the empty documents, but still get a dfm object with the original number of documents as output.

Here's an example:

library(quanteda)

texts = c("", "Bonjour!", "Hello, how are you", "", "Good", "", "", "")
a = texts %>%
    tokens(remove_punct=TRUE) %>%
    dfm(tolower=TRUE) %>%
    dfm_wordstem() %>%
    dfm_remove(stopwords("en")) %>%
    dfm_tfidf()
print(a, max_ndoc=10)
Document-feature matrix of: 8 documents, 3 features (87.50% sparse) and 0 docvars.
       features
docs    bonjour   hello    good
  text1 0       0       0      
  text2 0.90309 0       0      
  text3 0       0.90309 0      
  text4 0       0       0      
  text5 0       0       0.90309
  text6 0       0       0      
  text7 0       0       0      
  text8 0       0       0

But the IDF is affected by the number of empty documents, which I do not want. Therefore, I compute tf-idf on the subset of non-empty documents, like so:

a2 = texts %>%
    tokens(remove_punct=TRUE) %>%
    dfm(tolower=TRUE) %>%
    dfm_subset(ntoken(.) > 0) %>%
    dfm_wordstem() %>%
    dfm_remove(stopwords("en")) %>%
    dfm_tfidf()
print(a2, max_ndoc=10)
Document-feature matrix of: 3 documents, 3 features (66.67% sparse) and 0 docvars.
       features
docs      bonjour     hello      good
  text2 0.4771213 0         0        
  text3 0         0.4771213 0        
  text5 0         0         0.4771213
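
The two printed values can be reproduced by hand. The sketch below assumes that dfm_tfidf() is used with its defaults, that is, raw counts as term frequency and log10(N / df) as inverse document frequency, which is consistent with the outputs shown above. The variable names are illustrative only.

```r
# Hand computation of the tf-idf weight for a term that occurs once
# in exactly one document: tf * log10(N / df).
tf <- 1   # the term appears once in its document
df <- 1   # the term appears in one document only

idf_all      <- log10(8 / df)  # N = 8: empty documents are counted
idf_nonempty <- log10(3 / df)  # N = 3: only non-empty documents are counted

# Named vector with the weight under each document count
weights <- c(all = tf * idf_all, nonempty = tf * idf_nonempty)
print(weights)
```

Only N changes between the two runs, so every non-zero weight shrinks by the same amount when the empty documents are dropped.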

I now want a sparse matrix with the same format as the first matrix (one row per original text), but holding the tf-idf values computed on the non-empty texts. I found this code on Stack Overflow: https://stackoverflow.com/a/65635722

add_rows_2 <- function(M,v) {
    oldind <- unique(M@i)
    ## new row indices
    newind <- oldind + as.integer(rowSums(outer(oldind,v,">=")))
    ## modify dimensions
    M@Dim <- M@Dim + c(length(v),0L)
    M@i <- newind[match(M@i,oldind)]
    M
}
empty_texts_idx = which(texts=="")
position_after_insertion = empty_texts_idx - 1:(length(empty_texts_idx))

a3 = add_rows_2(a2, position_after_insertion)
print(a3, max_ndoc=10)
Document-feature matrix of: 8 documents, 3 features (87.50% sparse) and 0 docvars.
         features
docs        bonjour     hello      good
  text2.1 0         0         0        
  text3.1 0.4771213 0         0        
  text5.1 0         0.4771213 0        
  NA.NA   0         0         0        
  NA.NA   0         0         0.4771213
  NA.NA   0         0         0        
  NA.NA   0         0         0        
  NA.NA   0         0         0

This is what I want: the empty texts have been added at the appropriate rows of the matrix.
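
For comparison, the same row insertion can be sketched without editing the @i and @Dim slots directly, by rebuilding the matrix from its (row, column, value) triplets with the Matrix package. This is only an illustration on a plain sparse matrix, not a quanteda feature: pad_rows() and its arguments are hypothetical names, and the small 3 x 3 matrix stands in for a2.

```r
library(Matrix)

# Hypothetical helper: place the rows of M at positions keep_idx of a
# taller, otherwise all-zero sparse matrix with n_total rows.
pad_rows <- function(M, keep_idx, n_total) {
  stopifnot(length(keep_idx) == nrow(M), max(keep_idx) <= n_total)
  trip <- as(M, "TsparseMatrix")   # triplet form; @i and @j are 0-based
  sparseMatrix(
    i = keep_idx[trip@i + 1L],     # map each old row to its new 1-based row
    j = trip@j + 1L,
    x = trip@x,
    dims = c(n_total, ncol(M)),
    dimnames = list(NULL, colnames(M))
  )
}

texts <- c("", "Bonjour!", "Hello, how are you", "", "Good", "", "", "")

# Stand-in for the 3-document tf-idf matrix (one term per document)
small <- sparseMatrix(i = 1:3, j = 1:3, x = rep(log10(3), 3),
                      dimnames = list(NULL, c("bonjour", "hello", "good")))

padded <- pad_rows(small, which(texts != ""), length(texts))
rownames(padded) <- paste0("text", seq_along(texts))
print(padded)
```

The result is an ordinary Matrix object rather than a dfm, so the docvars issue raised in Question 2 would still need to be handled separately.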

Question 1: I was wondering if there is a more efficient way to do this directly with the quanteda package...

Question 2: ...or at least a way that would not change the structure of the dfm object, since a3 and a do not have the same docvars attribute.

print(a3@docvars)
  docname_ docid_ segid_
1    text2  text2      1
2    text3  text3      1
3    text5  text5      1

print(docnames(a3))
[1] "text2" "text3" "text5"

print(a@docvars)
  docname_ docid_ segid_
1    text1  text1      1
2    text2  text2      1
3    text3  text3      1
4    text4  text4      1
5    text5  text5      1
6    text6  text6      1
7    text7  text7      1
8    text8  text8      1

I was able to get a "correct" format for a3 by running the following lines of code:

# necessary to print proper names in 'docs' column
new_names = paste0("text", 1:length(texts))
new_docvars = data.frame(docname_=as.factor(new_names), docid_=as.factor(new_names), segid_=rep(1,length(texts)))
a3@docvars = new_docvars

# The following line is necessary for cv.glmnet to run using a3 as covariates
docnames(a3) <- new_names
# seems equivalent to a3@Dimnames$docs <- new_names

print(a3, max_ndoc=10)
Document-feature matrix of: 8 documents, 3 features (87.50% sparse) and 0 docvars.
       features
docs      bonjour     hello      good
  text1 0         0         0        
  text2 0.4771213 0         0        
  text3 0         0.4771213 0        
  text4 0         0         0        
  text5 0         0         0.4771213
  text6 0         0         0        
  text7 0         0         0        
  text8 0         0         0

print(a3@docvars) # this is now as expected
  docname_ docid_ segid_
1    text1  text1      1
2    text2  text2      1
3    text3  text3      1
4    text4  text4      1
5    text5  text5      1
6    text6  text6      1
7    text7  text7      1
8    text8  text8      1
print(docnames(a3)) # this is now as expected
[1] "text1" "text2" "text3" "text4" "text5" "text6" "text7" "text8"

I need to change docnames(a3) because I want to use a3 as covariates for a model I want to train with cv.glmnet, and I get an error if I don't change the document names of a3. Again, is this the correct way to proceed with quanteda? I feel that manually changing docvars is not the proper way to do it, and I could not find anything online about it. Any insights would be appreciated.

Thanks!


染墨丶若流云 2025-02-10 09:10:13


I do not know whether it is a good idea to remove empty documents before computing tf-idf, but it is easy to restore the removed documents with drop_docid = FALSE and fill = TRUE, because quanteda keeps track of them.

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 66.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- c("", "Bonjour!", "Hello, how are you", "", "Good", "", "", "")
corp <- corpus(txt)
dfmt <- dfm(tokens(corp))
dfmt
#> Document-feature matrix of: 8 documents, 8 features (87.50% sparse) and 0 docvars.
#>        features
#> docs    bonjour ! hello , how are you good
#>   text1       0 0     0 0   0   0   0    0
#>   text2       1 1     0 0   0   0   0    0
#>   text3       0 0     1 1   1   1   1    0
#>   text4       0 0     0 0   0   0   0    0
#>   text5       0 0     0 0   0   0   0    1
#>   text6       0 0     0 0   0   0   0    0
#> [ reached max_ndoc ... 2 more documents ]


dfmt2 <- dfm_subset(dfmt, ntoken(dfmt) > 0, drop_docid = FALSE) %>% 
  dfm_tfidf()
dfmt2
#> Document-feature matrix of: 3 documents, 8 features (66.67% sparse) and 0 docvars.
#>        features
#> docs      bonjour         !     hello         ,       how       are       you
#>   text2 0.4771213 0.4771213 0         0         0         0         0        
#>   text3 0         0         0.4771213 0.4771213 0.4771213 0.4771213 0.4771213
#>   text5 0         0         0         0         0         0         0        
#>        features
#> docs         good
#>   text2 0        
#>   text3 0        
#>   text5 0.4771213

dfmt3 <- dfm_group(dfmt2, fill = TRUE, force = TRUE)
dfmt3
#> Document-feature matrix of: 8 documents, 8 features (87.50% sparse) and 0 docvars.
#>        features
#> docs      bonjour         !     hello         ,       how       are       you
#>   text1 0         0         0         0         0         0         0        
#>   text2 0.4771213 0.4771213 0         0         0         0         0        
#>   text3 0         0         0.4771213 0.4771213 0.4771213 0.4771213 0.4771213
#>   text4 0         0         0         0         0         0         0        
#>   text5 0         0         0         0         0         0         0        
#>   text6 0         0         0         0         0         0         0        
#>        features
#> docs         good
#>   text1 0        
#>   text2 0        
#>   text3 0        
#>   text4 0        
#>   text5 0.4771213
#>   text6 0        
#> [ reached max_ndoc ... 2 more documents ]

Created on 2022-06-16 by the reprex package (v2.0.1)
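
The three steps above could be wrapped in a small helper and combined with the preprocessing from the question. This is a minimal sketch rather than part of quanteda: tfidf_keep_empty() is a hypothetical name, and it relies only on the calls shown above (dfm_subset() with drop_docid = FALSE, dfm_tfidf(), and dfm_group() with fill = TRUE and force = TRUE).

```r
library(quanteda)

# Hypothetical helper: compute tf-idf on the non-empty documents only,
# then restore the dropped documents as all-zero rows.
tfidf_keep_empty <- function(x) {
  stopifnot(is.dfm(x))
  # drop_docid = FALSE keeps the ids of the removed documents
  nonempty <- dfm_subset(x, ntoken(x) > 0, drop_docid = FALSE)
  weighted <- dfm_tfidf(nonempty)
  # fill = TRUE adds the missing documents back; force = TRUE allows
  # grouping a dfm that has already been weighted
  dfm_group(weighted, fill = TRUE, force = TRUE)
}

txt <- c("", "Bonjour!", "Hello, how are you", "", "Good", "", "", "")
toks <- tokens(txt, remove_punct = TRUE)
dfmt <- dfm_remove(dfm_wordstem(dfm(toks)), stopwords("en"))

res <- tfidf_keep_empty(dfmt)
print(res, max_ndoc = 10)
```

Here the subset is taken after stopword removal, so a document made up only of stopwords would also be treated as empty. Whether that is desirable depends on the application.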
