Merging two files together based on a common value, but with a different number of variables

Posted on 2025-02-13 13:01:12

I am trying to merge two files, but I keep getting the following error:

Error: memory exhausted (limit reached?)
Error during wrapup: memory exhausted (limit reached?)
Error: no more error handlers available (recursive errors?); invoking 'abort' restart

I am using the following code:

FinalTweets <- merge(tweets2U, tweets2, by="author_id")

I also realised that my files have a different number of observations and variables.

For tweets2U:

'data.frame':   325256 obs. of  2 variables:
 $ created_at: chr  "2015-02-18T02:56:55.000Z" "2016-05-23T02:14:36.000Z" "2013-04-22T02:52:16.000Z" "2015-03-06T02:40:55.000Z" ...
 $ author_id : chr  "3024607164" "734568179457007617" "1371107096" "3063885536" ...

And for tweets2:

'data.frame':   338037 obs. of  4 variables:
 $ author_id                   : chr  "3024607164" "734568179457007617" "1371107096" "3063885536" ...
 $ created_at                  : chr  "2021-01-01T02:24:18.000Z" "2021-01-01T02:22:48.000Z" "2021-01-01T02:22:14.000Z" "2021-01-01T02:21:01.000Z" ...
 $ text                        : chr  "Super Game Talk Video Alpha!  The #1 Indie Video Game Review Show hosted by puppets! No"| __truncated__ "I was testing my game, and caught a fish that was a gold star (top 3 percentile in size) and I though \"Oh I be"| __truncated__ "Hey hey everyone just got home from work. Time to finish artwork before 12 am to show.super excited. I can fina"| __truncated__ "Congratulation to Pretumos who won our Dec 2020 $100 Zeegift! nMore members giveaways "| __truncated__ ...
 $ public_metrics.retweet_count: int  3 3 6 6 16 1 10 5 2 3 ...

Any recommendation on how to fix this?
Maybe a different function could work?
I also understand that the left_join function could be useful.

Edit: I have updated my code, but I still run into the same issue:

jointdataset <- merge(tweets2U, tweets2, by = 'author_id', all.x= TRUE)
View(jointdataset)

FinalTweets <- merge(tweets2U, tweets2, by=c("author_id","created_at"))
View(FinalTweets)

Error: no more error handlers available (recursive errors?); invoking 'abort' restart

I will try again in a few minutes, with no other program running on the computer.
I have 16 GB of RAM, so I am confused as to why that is not enough.

Here is the data for a minimal reproducible example:

> dput(head(tweets2U))
structure(list(created_at = c("2015-02-18T02:56:55.000Z", "2016-05-23T02:14:36.000Z", 
"2013-04-22T02:52:16.000Z", "2015-03-06T02:40:55.000Z", "2016-03-31T10:53:21.000Z", 
"2016-10-04T03:38:25.000Z"), author_id = c("3024607164", "734568179457007617", 
"1371107096", "3063885536", "715492170392846336", "783149246178553856"
)), row.names = c(NA, 6L), class = "data.frame")

> dput(head(tweets2))
structure(list(author_id = c("3024607164", "734568179457007617", 
"1371107096", "3063885536", "1274386153035173891", "750763458719653888"
), created_at = c("2021-01-01T02:24:18.000Z", "2021-01-01T02:22:48.000Z", 
"2021-01-01T02:22:14.000Z", "2021-01-01T02:21:01.000Z", "2021-01-01T02:20:03.000Z", 
"2021-01-01T02:19:46.000Z"), text = c("Super Game Talk Video Alpha! ! The #1 Indie Video Game Review Show hosted by puppets! Now on Roku: Smiley Crew TV. #indiegame #gamedev #indiedev #indievideogames #marketing #indiegames #videogames #apple #ios #steam #roku ", 
"I was testing my game, and caught a fish that was a gold star (top 3 percentile in size) and I though \"Oh I better save the game\" Then I realized it's not done, and saving means nothing for me - but I really hope to evoke this feeling in others <U+0001F605>\n\n#gamedev #indiegame #pixelart ", 
"Hey hey everyone just got home from work. Time to finish artwork before 12 am to show.super excited. I can finally make more music with this midi keyboard for my game. #indiegame #rpg #artsoon #solodev #indiedev #madewithunity #indiegamedev ", 
"Congratulation to Pretumos who won our Dec 2020 $100 Zeegift!  \nMore members giveaways here: <U+0001F381>\nGuest Giveaways here:  \n@BlazedRTs @GamerGalsRT @SGH_RTs  #gaming #gamingcommunity #indiegame @GamingRTweeters ", 
"The lighting on my new level is really starting to come together<U+0001F60D>\n\nWhat do you think?\n\n#scifi #indiegame #art #gaming #game #indiegamedev #unity3d #games #twitchtv #shader #madewithunity #3d #3dart #twitch #stream #cyberpunk #artistsontwitter #indie #3dart ", 
"Super Game Talk Video Alpha!  The #1 Indie Video Game Review Show hosted by puppets! #indiegame #gamedev #indiedev #indievideogames #marketing #indiegames #videogames #apple #ios #steam "
), public_metrics.retweet_count = c(3L, 3L, 6L, 6L, 16L, 1L)), row.names = c(NA, 
6L), c

The expected output would be a single file with created_at, author_id, and the text and public_metrics.retweet_count that match this author_id.


Comments (1)

睡美人的小仙女 2025-02-20 13:01:13

It is not clear what kind of output you want. Do you want only the rows that match in both data frames (base::merge, dplyr::inner_join), all rows of x preserved (dplyr::left_join), or all rows of y (dplyr::right_join)? Furthermore, your error is not reproducible. As suggested by @Limey, you should post a fully reproducible example, or your data directly via dput(tweets2U) and dput(tweets2), or at least part of your data (dput(head(your_data))). However, it would be very strange for this to be a memory problem given the size of the data.
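To make the difference between these join types concrete, here is a minimal sketch on made-up toy data (the ids and texts are invented for illustration, not taken from the question). It uses base::merge only, where `all.x = TRUE` behaves like a left join and `all.y = TRUE` like a right join:

```r
# Toy data loosely modelled on the question's data frames (values are made up)
x <- data.frame(author_id  = c("a", "b", "c", "d"),
                created_at = c("2015-02-18", "2016-05-23", "2013-04-22", "2015-03-06"),
                stringsAsFactors = FALSE)
y <- data.frame(author_id = c("a", "b", "e"),
                text      = c("tweet 1", "tweet 2", "tweet 3"),
                stringsAsFactors = FALSE)

# inner join: only author_ids present in both x and y
inner <- merge(x, y, by = "author_id")
# left join: every row of x, with NA in `text` where y has no match
left  <- merge(x, y, by = "author_id", all.x = TRUE)
# right join: every row of y, with NA in `created_at` where x has no match
right <- merge(x, y, by = "author_id", all.y = TRUE)

nrow(inner)  # 2
nrow(left)   # 4
nrow(right)  # 3
```

Which of the three is appropriate depends on whether unmatched authors should be kept in the result.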

On my PC this code works well. An example with dplyr::left_join:

set.seed(4)
# simulate some data with the same number of rows as in the question
tweets2U <- data.frame(created_at = sample(seq(as.Date('2010/01/01'), as.Date('2020/01/01'), by = "day"), 325256, replace = TRUE),
                       author_id = as.character(sample(1:325256, replace = TRUE)))


tweets2 <- data.frame(created_at = sample(seq(as.Date('2010/01/01'), as.Date('2020/01/01'), by = "day"), 338037, replace = TRUE),
                      author_id = as.character(sample(1:338037, replace = TRUE)),
                      text = sample(letters, 338037, replace = TRUE),
                      public_metrics.retweet_count = sample(1:10, 338037, replace = TRUE))


# no `by` is given, so dplyr joins on all shared columns (created_at and author_id)
FinalTweets <- dplyr::left_join(tweets2U, tweets2)

summary(FinalTweets)

The output of summary:

    created_at          author_id             text           public_metrics.retweet_count
 Min.   :2010-01-01   Length:325256      Length:325256      Min.   : 1.0                
 1st Qu.:2012-07-02   Class :character   Class :character   1st Qu.: 2.8                
 Median :2015-01-04   Mode  :character   Mode  :character   Median : 5.0                
 Mean   :2015-01-01                                         Mean   : 5.3                
 3rd Qu.:2017-07-02                                         3rd Qu.: 7.2                
 Max.   :2020-01-01                                         Max.   :10.0                
                                                            NA's   :325160     
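One thing that may be worth checking, offered as an assumption rather than a diagnosis of the original error: when joining on author_id alone, every author that appears several times in both data frames contributes (count in x) × (count in y) rows, so the result can be far larger than either input. The sketch below estimates the size of such a join without performing it. `expected_join_rows` is a hypothetical helper written for this illustration, and the toy data is made up:

```r
# Hypothetical helper (not from the original post): number of rows an inner
# join of x and y on `key` would produce, computed without doing the join.
expected_join_rows <- function(x, y, key) {
  nx <- table(x[[key]])                      # occurrences of each key in x
  ny <- table(y[[key]])                      # occurrences of each key in y
  shared <- intersect(names(nx), names(ny))  # keys present in both
  # each shared key contributes (count in x) * (count in y) rows;
  # as.numeric avoids integer overflow for very large products
  sum(as.numeric(nx[shared]) * as.numeric(ny[shared]))
}

# Made-up toy data with repeated keys
x <- data.frame(author_id = c("a", "a", "a", "b"), stringsAsFactors = FALSE)
y <- data.frame(author_id = c("a", "a", "b", "c"), n = 1:4,
                stringsAsFactors = FALSE)

expected_join_rows(x, y, "author_id")   # 3 * 2 + 1 * 1 = 7
nrow(merge(x, y, by = "author_id"))     # same count, obtained by actually merging
```

If the estimate for the real data runs into the hundreds of millions of rows, reducing one side to one row per author_id before joining would be one way to keep the result manageable.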