Plotting a word cloud of Twitter search results by date? (using R)



I wish to search Twitter for a word (let's say #google), and then be able to generate a tag cloud of the words used in the tweets, but according to dates (for example, having a moving window of an hour that moves by 10 minutes each time, showing me how different words come to be used more often throughout the day).

I would appreciate any help on how to go about doing this regarding: resources for the information, code for the programming (R is the only language I am apt at using) and ideas for visualization. Questions:

  1. How do I get the information?

    In R, I found that the twitteR package has the searchTwitter command, but I don't know how big an "n" I can get from it. Also, it doesn't return the date on which each tweet originated.

    I see here that I can get up to 1500 tweets, but this requires me to do the parsing manually (which leads me to step 2). Also, for my purposes I would need tens of thousands of tweets. Is it even possible to get them retrospectively? (for example, by requesting older posts each time through the API URL?) If not, there is the more general question of how to create a personal archive of tweets on a home computer (a question which might be better left to another SO thread, although any insights from people here would be very interesting for me to read); see the first sketch after this list.

  2. How to parse the information (in R)? I know that R has functions from the RCurl and twitteR packages that could help, but I don't know which, or how to use them. Any suggestions would be of help.

  3. How to analyse? How do I remove all the "not interesting" words? I found that the "tm" package in R has this example:

    reuters <- tm_map(reuters, removeWords, stopwords("english"))

    Would this do the trick? Should I do something else/more?

    Also, I imagine I would like to do that after cutting my dataset according to time, which will require some POSIX-like date functions (I am not exactly sure which would be needed here, or how to use them); see the second sketch after this list.

  4. And lastly, there is the question of visualization. How do I create a tag cloud of the words? I found a solution for this here (r-bloggers.com/creating-tag-cloud-using-r-and-flash-javascript-swfobject/); any other suggestions/recommendations?
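Regarding questions 1 and 2, here is a minimal sketch of retrospective grabbing and local storage, assuming the twitteR package is authenticated and that its searchTwitter() supports the maxID paging argument (it does in later versions); grabTweets, the batch sizes, and the file name are hypothetical. Note that twListToDF() flattens the results into a data frame with a created timestamp column, which also answers the missing-dates concern:

require(twitteR)

grabTweets <- function(term, total = 10000, batch = 1500) {
  statuses <- list()
  max.id <- NULL
  while (length(statuses) < total) {
    got <- searchTwitter(term, n = batch, maxID = max.id)
    if (length(got) == 0) break            # nothing older left to fetch
    statuses <- c(statuses, got)
    max.id <- got[[length(got)]]$id        # oldest id seen; maxID is inclusive,
  }                                        # so the boundary tweet may repeat
  tw.df <- twListToDF(statuses)            # one row per tweet, incl. 'created'
  tw.df <- tw.df[!duplicated(tw.df$id), ]  # drop the repeated boundary tweets
  saveRDS(tw.df, "tweets_archive.rds")     # a simple personal archive on disk
  tw.df
}
#For example: tw.df <- grabTweets("#google", total = 20000)

Whether the Search API will actually serve that many older tweets is a separate limitation of the API itself; persisting each day's grab to disk, as above, is the usual workaround.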
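And for the time slicing in question 3, a minimal sketch, assuming tw.df is the data frame from the sketch above (its created column is POSIXct, so plain arithmetic works in seconds); windowTexts and its defaults are hypothetical names:

windowTexts <- function(tw.df, width.min = 60, step.min = 10) {
  starts <- seq(min(tw.df$created),
                max(tw.df$created) - width.min * 60,
                by = step.min * 60)        # one window start every step.min minutes
  lapply(starts, function(s)
    tw.df$text[tw.df$created >= s & tw.df$created < s + width.min * 60])
}
#Each element is the text of one hour-long window; feed each element to your
#corpus/wordcloud pipeline to watch the vocabulary drift through the day.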

I believe I am asking a huge question here, but I have tried to break it into as many straightforward questions as possible. Any help will be welcomed!

Best,

Tal


Comments (4)

尹雨沫 2024-09-10 18:57:44
夜空下最亮的亮点 2024-09-10 18:57:44


As for the plotting piece: I did a word cloud here: http://trends.techcrunch.com/2009/09/25/describe-yourself-in-3-or-4-words/ using the snippets package; my code is in there. I manually pulled out certain words. Check it out and let me know if you have more specific questions.

毁我热情 2024-09-10 18:57:44


I note that this is an old question, and there are several solutions available via web search, but here's one answer (via http://blog.ouseful.info/2012/02/15/generating-twitter-wordclouds-in-r-prompted-by-an-open-learning-blogpost/):

require(twitteR)
searchTerm='#dev8d'
#Grab the tweets
rdmTweets <- searchTwitter(searchTerm, n=500)
#Use a handy helper function to put the tweets into a dataframe
tw.df=twListToDF(rdmTweets)

##Note: there are some handy, basic Twitter related functions here:
##https://github.com/matteoredaelli/twitter-r-utils
#For example:
RemoveAtPeople <- function(tweet) {
  gsub("@\\w+", "", tweet)
}
#Then for example, remove @d names
tweets <- as.vector(sapply(tw.df$text, RemoveAtPeople))

##Wordcloud - scripts available from various sources; I used:
#http://rdatamining.wordpress.com/2011/11/09/using-text-mining-to-find-out-what-rdatamining-tweets-are-about/
#Call with eg: tw.c=generateCorpus(tw.df$text)
generateCorpus= function(df,my.stopwords=c()){
  #Install the textmining library
  require(tm)
  #The following is cribbed and seems to do what it says on the can
  tw.corpus= Corpus(VectorSource(df))
  # remove punctuation
  tw.corpus = tm_map(tw.corpus, removePunctuation)
  #normalise case
  tw.corpus = tm_map(tw.corpus, content_transformer(tolower)) # content_transformer() is needed in newer versions of tm
  # remove stopwords
  tw.corpus = tm_map(tw.corpus, removeWords, stopwords('english'))
  tw.corpus = tm_map(tw.corpus, removeWords, my.stopwords)

  tw.corpus
}

wordcloud.generate=function(corpus,min.freq=3){
  require(wordcloud)
  doc.m = TermDocumentMatrix(corpus, control = list(minWordLength = 1))
  dm = as.matrix(doc.m)
  # calculate the frequency of words
  v = sort(rowSums(dm), decreasing=TRUE)
  d = data.frame(word=names(v), freq=v)
  #Generate the wordcloud
  wc=wordcloud(d$word, d$freq, min.freq=min.freq)
  wc
}

print(wordcloud.generate(generateCorpus(tweets,'dev8d'),7))

##Generate an image file of the wordcloud
png('test.png', width=600,height=600)
wordcloud.generate(generateCorpus(tweets,'dev8d'),7)
dev.off()

#We could make it even easier if we hide away the tweet grabbing code. eg:
tweets.grabber=function(searchTerm,num=500){
  require(twitteR)
  rdmTweets = searchTwitter(searchTerm, n=num)
  tw.df=twListToDF(rdmTweets)
  as.vector(sapply(tw.df$text, RemoveAtPeople))
}
#Then we could do something like:
tweets=tweets.grabber('ukgc12')
wordcloud.generate(generateCorpus(tweets),3)
画卷フ 2024-09-10 18:57:44


I would like to answer your question about making a big word cloud.
What I did:

  1. Use s0.tweet <- searchTwitter(KEYWORD, n=1500) once a day, for 7 days or more.

  2. Combine them with this command:

rdmTweets = c(s0.tweet,s1.tweet,s2.tweet,s3.tweet,s4.tweet,s5.tweet,s6.tweet,s7.tweet)

The result:

[word cloud image: Lynas Square Cloud]

This Square Cloud consists of about 9000 tweets.

Source: People voice about Lynas Malaysia through Twitter Analysis with R CloudStat

Hope it helps!
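One small caveat to add (my note, not part of the original answer): daily grabs of the same search term will usually overlap, so it is worth de-duplicating by tweet id after combining, along these lines:

rdmTweets <- c(s0.tweet, s1.tweet, s2.tweet, s3.tweet,
               s4.tweet, s5.tweet, s6.tweet, s7.tweet)
tw.df <- twListToDF(rdmTweets)             # flatten to a data frame
tw.df <- tw.df[!duplicated(tw.df$id), ]    # keep each tweet only once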
