Sorting a data frame based on a month-year time format

Posted 2024-10-10 09:28:06

I'm struggling with something very basic: sorting a data frame based on a time format (month-year, or, “%B-%y” in this case). My goal is to calculate various monthly statistics, starting with sum.

The relevant part of the data frame looks like this (this step works and matches my goal; I'm including it to show where the problem could originate):

> tmp09
   Instrument AccountValue   monthYear   ExitTime
1         JPM         6997    april-07 2007-04-10
2         JPM         7261      mei-07 2007-05-29
3         JPM         7545     juli-07 2007-07-18
4         JPM         7614     juli-07 2007-07-19
5         JPM         7897 augustus-07 2007-08-22
10        JPM         7423 november-07 2007-11-02
11        KFT         6992      mei-07 2007-05-14
12        KFT         6944      mei-07 2007-05-21
13        KFT         7069     juli-07 2007-07-09
14        KFT         6919     juli-07 2007-07-16
# Order on the exit time, which corresponds with 'monthYear'
> tmp09.sorted <- tmp09[order(tmp09$ExitTime),]
> tmp09.sorted
   Instrument AccountValue   monthYear   ExitTime
1         JPM         6997    april-07 2007-04-10
11        KFT         6992      mei-07 2007-05-14
12        KFT         6944      mei-07 2007-05-21
2         JPM         7261      mei-07 2007-05-29
13        KFT         7069     juli-07 2007-07-09
14        KFT         6919     juli-07 2007-07-16
3         JPM         7545     juli-07 2007-07-18
4         JPM         7614     juli-07 2007-07-19
5         JPM         7897 augustus-07 2007-08-22
10        JPM         7423 november-07 2007-11-02

So far, so good, and sorting based on ExitTime works. The trouble starts when I try to calculate the totals per month, followed by an attempt to sort this output:

# Calculate the total results per month
> Tmp09Totals <- tapply(tmp09.sorted$AccountValue, tmp09.sorted$monthYear, sum)
> Tmp09Totals <- data.frame(Tmp09Totals)
> Tmp09Totals
            Tmp09Totals
april-07           6997
augustus-07        7897
juli-07           29147
mei-07            21197
november-07        7423

How can I sort this output chronologically?

I've already tried (besides various attempts to convert monthYear to another date format): order, sort, sort.list, sort_df, reshape, and calculating the sum with tapply, lapply, sapply, and aggregate. Even rewriting the row names (by giving them a number from 1 to length(tmp09.sorted2$AccountValue)) didn't work. I also tried to give each month-year a different ID based on what I learned in another question, but R had trouble discriminating between the various month-year values.

The correct order of this output would be april-07, mei-07, juli-07, augustus-07, november-07:

apr-07  6997
mei-07  21197
jul-07  29147
aug-07  7897
nov-07  7423
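
For reference, a minimal sketch of why the totals come back alphabetically: tapply converts the grouping vector to a factor, and factor levels default to sorted (alphabetical) order, regardless of the row order of the data frame. The data frame below is re-typed from the question with fewer columns, and the names by_time and chrono_levels are illustrative only. Supplying the levels explicitly, in order of first appearance after sorting by ExitTime, is one possible way to get chronological totals.

```r
# Miniature of the question's data (values copied from the question)
tmp09 <- data.frame(
  AccountValue = c(6997, 7261, 7545, 7614, 7897, 7423, 6992, 6944, 7069, 6919),
  monthYear = c("april-07", "mei-07", "juli-07", "juli-07", "augustus-07",
                "november-07", "mei-07", "mei-07", "juli-07", "juli-07"),
  ExitTime = as.Date(c("2007-04-10", "2007-05-29", "2007-07-18", "2007-07-19",
                       "2007-08-22", "2007-11-02", "2007-05-14", "2007-05-21",
                       "2007-07-09", "2007-07-16")),
  stringsAsFactors = FALSE
)

# Default factor levels are sorted alphabetically, which is what tapply() uses
alphabetical <- levels(factor(tmp09$monthYear))
print(alphabetical)

# Sort rows by ExitTime, then take the levels in order of first appearance
by_time <- tmp09[order(tmp09$ExitTime), ]
chrono_levels <- unique(by_time$monthYear)

# With explicit levels, tapply() reports the groups in that order
totals <- tapply(by_time$AccountValue,
                 factor(by_time$monthYear, levels = chrono_levels), sum)
print(totals)
```

This does not depend on the session locale, because the month-year strings are never parsed as dates; only ExitTime is used for ordering.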

Comments (6)

驱逐舰岛风号 2024-10-17 09:28:06

It would be easier to have separate Month and Year factors, each with its levels in chronological order, and to use tapply on the combination of both factors, e.g.:

## The Month factor
tmp09 <- within(tmp09,
                Month <- droplevels(factor(strftime(ExitTime, format = "%B"),
                                                    levels = month.name)))
## For @Jura25's locale we can't use the built-in English constant month.name;
## instead, build the month names for the current locale, as suggested in ?month.name:
## format(ISOdate(2000, 1:12, 1), "%B")
tmp09 <- within(tmp09,
                Month <- droplevels(factor(strftime(ExitTime, format = "%B"),
                                                    levels = format(ISOdate(2000, 1:12, 1), "%B"))))
##
## And the Year factor
tmp09 <- within(tmp09, Year <- factor(strftime(ExitTime, format = "%Y")))

Which gives us (in my locale):

> head(tmp09)
   Instrument AccountValue   monthYear   ExitTime    Month Year
1         JPM         6997    april-07 2007-04-10    April 2007
2         JPM         7261      mei-07 2007-05-29      May 2007
3         JPM         7545     juli-07 2007-07-18     July 2007
4         JPM         7614     juli-07 2007-07-19     July 2007
5         JPM         7897 augustus-07 2007-08-22   August 2007
10        JPM         7423 november-07 2007-11-02 November 2007

Then use tapply with both factors:

> with(tmp09, tapply(AccountValue, list(Month, Year), sum))
          2007
April     6997
May      21197
July     29147
August    7897
November  7423

or via aggregate:

> with(tmp09, aggregate(AccountValue, list(Month = Month, Year = Year), sum))
     Month Year     x
1    April 2007  6997
2      May 2007 21197
3     July 2007 29147
4   August 2007  7897
5 November 2007  7423
維他命╮ 2024-10-17 09:28:06

Try using the "yearmon" class in zoo, as it sorts chronologically. Below we create the sample DF data frame and then add a YearMonth column of class "yearmon". Finally we perform the aggregation. The actual processing is just the last two lines (the rest only creates the sample data frame).

Lines <-   "Instrument AccountValue   monthYear   ExitTime
JPM         6997    april-07 2007-04-10
JPM         7261      mei-07 2007-05-29
JPM         7545     juli-07 2007-07-18
JPM         7614     juli-07 2007-07-19
JPM         7897 augustus-07 2007-08-22
JPM         7423 november-07 2007-11-02
KFT         6992      mei-07 2007-05-14
KFT         6944      mei-07 2007-05-21
KFT         7069     juli-07 2007-07-09
KFT         6919     juli-07 2007-07-16"
library(zoo)
DF <- read.table(textConnection(Lines), header = TRUE)

DF$YearMonth <- as.yearmon(DF$ExitTime)
aggregate(AccountValue ~ YearMonth + Instrument, DF, sum)

This gives the following:

> aggregate(AccountValue ~ YearMonth + Instrument, DF, sum)
  YearMonth Instrument AccountValue
1  Apr 2007        JPM         6997
2  May 2007        JPM         7261
3  Jul 2007        JPM        15159
4  Aug 2007        JPM         7897
5  Nov 2007        JPM         7423
6  May 2007        KFT        13936
7  Jul 2007        KFT        13988

A slightly different approach and output uses read.zoo directly. It produces one column per instrument and one row per year/month. We read in the columns assigning them appropriate classes using "NULL" for the monthYear column since we won't use that one. We also specify that the time index is the 3rd column of the remaining columns and that we want the input split into columns by the 1st column. FUN=as.yearmon indicates that we want the time index to be converted from "Date" class to "yearmon" class and we aggregate everything using sum.

z <- read.zoo(textConnection(Lines),  header = TRUE, index = 3, 
     split = 1, colClasses = c("character", "numeric", "NULL", "Date"),
     FUN = as.yearmon, aggregate = sum)

The resulting zoo object looks like this:

> z
           JPM   KFT
Apr 2007  6997    NA
May 2007  7261 13936
Jul 2007 15159 13988
Aug 2007  7897    NA
Nov 2007  7423    NA

We may prefer to keep it as a zoo object to take advantage of other functionality in zoo, or we can convert it to a data frame like this: data.frame(Time = time(z), coredata(z)), which makes the time a separate column, or as.data.frame(z), which uses row names for the time. fortify.zoo(z) also works.
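
If installing zoo is not an option, a rough base-R analogue (not the yearmon class itself) is to build a "%Y-%m" key from ExitTime: such strings sort chronologically as plain text and do not depend on the locale. The sketch below re-types the sample data from above and assumes ExitTime can be converted with as.Date.

```r
# Sample data re-typed from the answer above
DF <- data.frame(
  Instrument = c("JPM", "JPM", "JPM", "JPM", "JPM", "JPM",
                 "KFT", "KFT", "KFT", "KFT"),
  AccountValue = c(6997, 7261, 7545, 7614, 7897, 7423, 6992, 6944, 7069, 6919),
  ExitTime = as.Date(c("2007-04-10", "2007-05-29", "2007-07-18", "2007-07-19",
                       "2007-08-22", "2007-11-02", "2007-05-14", "2007-05-21",
                       "2007-07-09", "2007-07-16")),
  stringsAsFactors = FALSE
)

# Keys such as "2007-04" sort chronologically when sorted as strings
DF$YearMonth <- format(DF$ExitTime, "%Y-%m")

# Same aggregation as in the zoo version, grouped by the string key
res <- aggregate(AccountValue ~ YearMonth + Instrument, DF, sum)
print(res)
```

The trade-off is that the key is only a character string, so it lacks the date arithmetic and formatting that "yearmon" provides.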

海夕 2024-10-17 09:28:06

You can reorder the factor levels with the reorder function.

tmp09$monthYear <- reorder(tmp09$monthYear, as.numeric(as.Date(tmp09$ExitTime)))

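The trick is to use the numeric representation of the date, i.e. the number of days since 1970-01-01 (see ?Date). reorder computes the mean of these values within each level (its default summary function) and sorts the levels by that mean.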
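
As a sketch of how this fits into the original workflow, the snippet below applies reorder to a re-typed copy of the question's data and then runs the same tapply call as in the question. The explicit factor() call is an addition for clarity, assuming monthYear may have been read in as character.

```r
# Miniature of the question's data (values copied from the question)
tmp09 <- data.frame(
  AccountValue = c(6997, 7261, 7545, 7614, 7897, 7423, 6992, 6944, 7069, 6919),
  monthYear = c("april-07", "mei-07", "juli-07", "juli-07", "augustus-07",
                "november-07", "mei-07", "mei-07", "juli-07", "juli-07"),
  ExitTime = c("2007-04-10", "2007-05-29", "2007-07-18", "2007-07-19",
               "2007-08-22", "2007-11-02", "2007-05-14", "2007-05-21",
               "2007-07-09", "2007-07-16"),
  stringsAsFactors = FALSE
)

# Order the levels of monthYear by the mean exit date within each level
tmp09$monthYear <- reorder(factor(tmp09$monthYear),
                           as.numeric(as.Date(tmp09$ExitTime)))

# tapply() now reports the groups in level order, i.e. chronologically
totals <- tapply(tmp09$AccountValue, tmp09$monthYear, sum)
print(totals)
```

Because the ordering comes from ExitTime rather than from parsing the month names, this works without changing the locale.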

清风疏影 2024-10-17 09:28:06

Edit: I misunderstood the question at first. Copy the data given in the question to the clipboard first, then run:

> tmp09 <- read.table(file="clipboard", header=TRUE)
> Sys.setlocale(category="LC_TIME", locale="Dutch_Belgium.1252")
[1] "Dutch_Belgium.1252"

# create POSIXlt variable from monthYear
# (note: this hard-codes 2007 as the year and parses the "-07" suffix as the day of the month)
> tmp09$d <- strptime(paste("2007", tmp09$monthYear, sep="-"), "%Y-%B-%d")

# create ordered factor
> tmp09$dFac <- droplevels(cut(tmp09$d, breaks="month", ordered=TRUE))
> tmp09[order(tmp09$d), ]
   Instrument AccountValue   monthYear   ExitTime          d       dFac
1         JPM         6997    april-07 2007-04-10 2007-04-07 2007-04-01
2         JPM         7261      mei-07 2007-05-29 2007-05-07 2007-05-01
11        KFT         6992      mei-07 2007-05-14 2007-05-07 2007-05-01
12        KFT         6944      mei-07 2007-05-21 2007-05-07 2007-05-01
3         JPM         7545     juli-07 2007-07-18 2007-07-07 2007-07-01
4         JPM         7614     juli-07 2007-07-19 2007-07-07 2007-07-01
13        KFT         7069     juli-07 2007-07-09 2007-07-07 2007-07-01
14        KFT         6919     juli-07 2007-07-16 2007-07-07 2007-07-01
5         JPM         7897 augustus-07 2007-08-22 2007-08-07 2007-08-01
10        JPM         7423 november-07 2007-11-02 2007-11-07 2007-11-01

> Tmp09Totals <- tapply(tmp09$AccountValue, tmp09$dFac, sum)
> Tmp09Totals
2007-04-01 2007-05-01 2007-07-01 2007-08-01 2007-11-01 
      6997      21197      29147       7897       7423
¢蛋碎的人ぎ生 2024-10-17 09:28:06

It looks like the main problem is how to sort a sequence of month-year strings chronologically. The easiest way is to prepend "01-" to each month-year string and sort the results as regular dates. This relies on the LC_TIME locale matching the language of the month names, because %B parses month names in the current locale. So take your final data frame Tmp09Totals, and do this:

monYear <- rownames(Tmp09Totals)
sortedMonYear <- format(sort( as.Date( paste('01-', monYear, sep = ''),
                                       '%d-%B-%y')), 
                       '%B-%y')
Tmp09Totals[ sortedMonYear, , drop = FALSE]
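
If changing the locale is not possible, a locale-independent variant of the same idea is to look the month names up in an explicit table instead of parsing them with %B. This is only a sketch: dutch_months, month_num, and year_num are illustrative names, the table assumes the lowercase Dutch month names seen in the question, and the totals data frame is re-typed from the question's output.

```r
# Assumption: month names are the lowercase Dutch names used in the question
dutch_months <- c("januari", "februari", "maart", "april", "mei", "juni",
                  "juli", "augustus", "september", "oktober", "november",
                  "december")

# Totals as returned by tapply() in the question (alphabetical row order)
monYear <- c("april-07", "augustus-07", "juli-07", "mei-07", "november-07")
totals <- data.frame(Tmp09Totals = c(6997, 7897, 29147, 21197, 7423),
                     row.names = monYear)

# Split "mei-07" into month name and two-digit year, then look up the month
parts <- strsplit(monYear, "-", fixed = TRUE)
month_num <- match(vapply(parts, `[`, character(1), 1), dutch_months)
year_num <- as.integer(vapply(parts, `[`, character(1), 2))

# Order by year first, then by month number within the year
sorted_totals <- totals[order(year_num, month_num), , drop = FALSE]
print(sorted_totals)
```

Note that ordering by the two-digit year is only safe while all dates fall in the same century.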
风筝有风,海豚有海 2024-10-17 09:28:06

An old post but worthy of a data.table approach:

Read in the data and set the locale as described by @caracal:

> Sys.setlocale(category="LC_TIME", locale="Dutch_Belgium.1252")
[1] "Dutch_Belgium.1252"
> tmp09 <- read.table(file="clipboard", header=TRUE)
> tmp09$ExitTime <- as.Date(tmp09$ExitTime)

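Summarise the data as requested (by keeps the groups in order of first appearance, which is already chronological for this data; sort by ExitTime first if that is not the case):

> require(data.table)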
> data.table(tmp09)[, 
+                   .(Tmp09Total = sum(AccountValue)),
+                   by = .(Date = format(ExitTime, "%B-%y"))]
          Date Tmp09Total
1:    april-07       6997
2:      mei-07      21197
3:     juli-07      29147
4: augustus-07       7897
5: november-07       7423