为什么在 R 中创建子矩阵后 str() 显示的因子水平信息不正确?

发布于 2024-12-09 06:50:16 字数 2684 浏览 0 评论 0原文

我在 R 中有以下数据框,有 274569 行和 15 列:

> str(x2)
'data.frame':   274569 obs. of  15 variables:
 $ ykod : int  99 99 99 99 99 99 99 99 99 99 ...
 $ yad  : Factor w/ 43 levels "BAKUGAN","BARBIE",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ per  : Factor w/ 3 levels "2 AYLIK","3 AYLIK",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ donem: int  201106 201106 201106 201106 201106 201106 201106 201106 201106 201106 ...
 $ sayi : int  201106 201106 201106 201106 201106 201106 201106 201106 201106 201106 ...
 $ mkod : int  359 361 362 363 366 847 849 850 1505 1506 ...
 $ mad  : Factor w/ 11045 levels "    Hilal Gida           ",..: 5163 3833 10840 8284 10839 2633 10758 10293 6986 6984 ...
 $ mtip : Factor w/ 30 levels "Abone Bürosu                                      ",..: 20 20 20 20 20 2 2 2 11 11 ...
 $ kanal: Factor w/ 2 levels "OB","SS": 2 2 2 2 2 2 2 2 1 1 ...
 $ bkod : int  110006 110006 110006 110006 110006 110006 110006 110006 110006 110006 ...
 $ bad  : Factor w/ 213 levels "4. Levent","500 Evler",..: 25 25 25 25 25 25 25 25 25 25 ...
 $ bolge: Factor w/ 12 levels "Adana Şehiriçi",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ sevk : int  5 2 2 2 10 0 4 3 13 32 ...
 $ iade : int  0 2 1 2 4 0 3 2 0 8 ...
 $ satis: int  5 0 1 0 6 0 1 1 13 24 ...

我创建一个子矩阵并显示其结构:

> msub <- x2[x2$ykod == 99,]
> str(msub)
'data.frame':   14367 obs. of  15 variables:
 $ ykod : int  99 99 99 99 99 99 99 99 99 99 ...
 $ yad  : Factor w/ 43 levels "BAKUGAN","BARBIE",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ per  : Factor w/ 3 levels "2 AYLIK","3 AYLIK",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ donem: int  201106 201106 201106 201106 201106 201106 201106 201106 201106 201106 ...
 $ sayi : int  201106 201106 201106 201106 201106 201106 201106 201106 201106 201106 ...
 $ mkod : int  359 361 362 363 366 847 849 850 1505 1506 ...
 $ mad  : Factor w/ 11045 levels "    Hilal Gida           ",..: 5163 3833 10840 8284 10839 2633 10758 10293 6986 6984 ...
 $ mtip : Factor w/ 30 levels "Abone Bürosu                                      ",..: 20 20 20 20 20 2 2 2 11 11 ...
 $ kanal: Factor w/ 2 levels "OB","SS": 2 2 2 2 2 2 2 2 1 1 ...
 $ bkod : int  110006 110006 110006 110006 110006 110006 110006 110006 110006 110006 ...
 $ bad  : Factor w/ 213 levels "4. Levent","500 Evler",..: 25 25 25 25 25 25 25 25 25 25 ...
 $ bolge: Factor w/ 12 levels "Adana Şehiriçi",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ sevk : int  5 2 2 2 10 0 4 3 13 32 ...
 $ iade : int  0 2 1 2 4 0 3 2 0 8 ...
 $ satis: int  5 0 1 0 6 0 1 1 13 24 ...

现在我有一个有 14367 行和 15 列的子矩阵,但因子水平仍然存在。他们应该减少。例如,对于yad,应该只有一个因素。

如何轻松使 str() 显示因子水平的正确信息,以便当我输入 str(msub) 时它会给出正确的值?

I have the following data frame in R with 274569 rows and 15 columns:

> str(x2)
'data.frame':   274569 obs. of  15 variables:
 $ ykod : int  99 99 99 99 99 99 99 99 99 99 ...
 $ yad  : Factor w/ 43 levels "BAKUGAN","BARBIE",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ per  : Factor w/ 3 levels "2 AYLIK","3 AYLIK",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ donem: int  201106 201106 201106 201106 201106 201106 201106 201106 201106 201106 ...
 $ sayi : int  201106 201106 201106 201106 201106 201106 201106 201106 201106 201106 ...
 $ mkod : int  359 361 362 363 366 847 849 850 1505 1506 ...
 $ mad  : Factor w/ 11045 levels "    Hilal Gida           ",..: 5163 3833 10840 8284 10839 2633 10758 10293 6986 6984 ...
 $ mtip : Factor w/ 30 levels "Abone Bürosu                                      ",..: 20 20 20 20 20 2 2 2 11 11 ...
 $ kanal: Factor w/ 2 levels "OB","SS": 2 2 2 2 2 2 2 2 1 1 ...
 $ bkod : int  110006 110006 110006 110006 110006 110006 110006 110006 110006 110006 ...
 $ bad  : Factor w/ 213 levels "4. Levent","500 Evler",..: 25 25 25 25 25 25 25 25 25 25 ...
 $ bolge: Factor w/ 12 levels "Adana Şehiriçi",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ sevk : int  5 2 2 2 10 0 4 3 13 32 ...
 $ iade : int  0 2 1 2 4 0 3 2 0 8 ...
 $ satis: int  5 0 1 0 6 0 1 1 13 24 ...

I create a sub-matrix and display its structure:

> msub <- x2[x2$ykod == 99,]
> str(msub)
'data.frame':   14367 obs. of  15 variables:
 $ ykod : int  99 99 99 99 99 99 99 99 99 99 ...
 $ yad  : Factor w/ 43 levels "BAKUGAN","BARBIE",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ per  : Factor w/ 3 levels "2 AYLIK","3 AYLIK",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ donem: int  201106 201106 201106 201106 201106 201106 201106 201106 201106 201106 ...
 $ sayi : int  201106 201106 201106 201106 201106 201106 201106 201106 201106 201106 ...
 $ mkod : int  359 361 362 363 366 847 849 850 1505 1506 ...
 $ mad  : Factor w/ 11045 levels "    Hilal Gida           ",..: 5163 3833 10840 8284 10839 2633 10758 10293 6986 6984 ...
 $ mtip : Factor w/ 30 levels "Abone Bürosu                                      ",..: 20 20 20 20 20 2 2 2 11 11 ...
 $ kanal: Factor w/ 2 levels "OB","SS": 2 2 2 2 2 2 2 2 1 1 ...
 $ bkod : int  110006 110006 110006 110006 110006 110006 110006 110006 110006 110006 ...
 $ bad  : Factor w/ 213 levels "4. Levent","500 Evler",..: 25 25 25 25 25 25 25 25 25 25 ...
 $ bolge: Factor w/ 12 levels "Adana Şehiriçi",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ sevk : int  5 2 2 2 10 0 4 3 13 32 ...
 $ iade : int  0 2 1 2 4 0 3 2 0 8 ...
 $ satis: int  5 0 1 0 6 0 1 1 13 24 ...

Now I have a sub-matrix with 14367 rows and 15 columns, but the levels of factors are still there. They should have been decreased. For example, for yad, there should be only one factor.

How can I easily make str() to show correct info for factor levels so that when I type str(msub) it gives me correct values?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

鱼忆七猫命九 2024-12-16 06:50:16

这是预期的行为。在您的子集中没有代表性的因子水平不会“消失”,除非您告诉它们。截至最近,您可以使用 droplevels()

This is expected behavior. Factor levels that have no representation in your subset do not "disappear" until you tell them to. As of recently, you can use droplevels().

忆梦 2024-12-16 06:50:16

事实上,str 正在向您显示正确的结构信息:该因子有能力显示级别。想象一下连接两个子矩阵,其中一个包含一些级别,另一个包含另一组:合并这个会有点麻烦!这就是 R 中因子的工作原理。

如果您想知道数据中“存在”哪些因子,其中一个选项是使用 table 来计算出现次数。

如果您希望减少因子,使其仅包含实际存在的级别,则可以对其重新应用因子:

myfact<-factor(rep(1:2,5), levels=1:3, labels=letters[1:3])
myfact
# [1] a b a b a b a b a b
#Levels: a b c
factor(myfact)
# [1] a b a b a b a b a b
#Levels: a b

您可以简单地将其应用于 data.frame 的所有因子列以获得您想要的内容。

In fact str is showing you the correct structural information: the factor has the ability to have the levels shown. Imagine concatenating two of your submatrices where one contained some of the levels and the other another set: it would be somewhat of a hassle to merge this! This is simply how factors work in R.

If you want to know which factors are 'present' in your data, one of the options is using table to count the occurrences.

If you want your factor reduced, so it only contains the levels that are actually present, you can reapply factor to it:

myfact<-factor(rep(1:2,5), levels=1:3, labels=letters[1:3])
myfact
# [1] a b a b a b a b a b
#Levels: a b c
factor(myfact)
# [1] a b a b a b a b a b
#Levels: a b

You can simply apply this to all the factor columns of your data.frame to get what you say you want.

吃颗糖壮壮胆 2024-12-16 06:50:16

因子的水平是该列的一部分,并不依赖于实际存在的水平:

> x <- factor(LETTERS[1:10])
> x
 [1] A B C D E F G H I J
Levels: A B C D E F G H I J
> y <- x[1]
> y
[1] A
Levels: A B C D E F G H I J
> factor(y)
[1] A
Levels: A
> 

我确信,还有另一种方法,但这应该可行。

The levels of the factor are part of the column, and not dependant on the levels actually present:

> x <- factor(LETTERS[1:10])
> x
 [1] A B C D E F G H I J
Levels: A B C D E F G H I J
> y <- x[1]
> y
[1] A
Levels: A B C D E F G H I J
> factor(y)
[1] A
Levels: A
> 

I am sure, there is another way, but this should work.

尸血腥色 2024-12-16 06:50:16
x <- factor(LETTERS[1:10])
y <- x[1, drop=TRUE]
y
x <- factor(LETTERS[1:10])
y <- x[1, drop=TRUE]
y
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文