Most underused data visualization

Posted on 2024-08-17 16:33:20

Histograms and scatterplots are great methods of visualizing data and the relationship between variables, but recently I have been wondering about what visualization techniques I am missing. What do you think is the most underused type of plot?

Answers should:

  1. Not be very commonly used in practice.
  2. Be understandable without a great deal of background discussion.
  3. Be applicable in many common situations.
  4. Include reproducible code to create an example (preferably in R). A linked image would be nice.

Comments (15)

菩提树下叶撕阳。 2024-08-24 16:33:20

I really agree with the other posters: Tufte's books are fantastic and well worth reading.

First, I would point you to a very nice tutorial on ggplot2 and ggobi from "Looking at Data" earlier this year. Beyond that I would just highlight one visualization from R, and two graphics packages (which are not as widely used as base graphics, lattice, or ggplot):

Heat Maps

I really like visualizations that can handle multivariate data, especially time series data. Heat maps can be useful for this. One really neat one was featured by David Smith on the Revolutions blog. Here is the ggplot code courtesy of Hadley:

# Build the download URL for daily quotes between start.date and end.date.
# Note: Yahoo has since retired this ichart CSV endpoint, so read.csv() will
# fail unless you substitute a source with Date and Adj.Close columns.
stock <- "MSFT"
start.date <- "2006-01-12"
end.date <- Sys.Date()
quote <- paste("http://ichart.finance.yahoo.com/table.csv?s=",
                stock, "&a=", substr(start.date,6,7),
                "&b=", substr(start.date, 9, 10),
                "&c=", substr(start.date, 1,4), 
                "&d=", substr(end.date,6,7),
                "&e=", substr(end.date, 9, 10),
                "&f=", substr(end.date, 1,4),
                "&g=d&ignore=.csv", sep="")    
stock.data <- read.csv(quote, as.is=TRUE)
# Derive week of year, day of week, and year from each trading date.
stock.data <- transform(stock.data,
  week = as.POSIXlt(Date)$yday %/% 7 + 1,
  wday = as.POSIXlt(Date)$wday,
  year = as.POSIXlt(Date)$year + 1900)

# One tile per trading day, coloured by adjusted close, one panel per year.
library(ggplot2)
ggplot(stock.data, aes(week, wday, fill = Adj.Close)) + 
  geom_tile(colour = "white") + 
  scale_fill_gradientn(colours = c("#D61818","#FFAE63","#FFFFBD","#B5E384")) + 
  facet_wrap(~ year, ncol = 1)

Which ends up looking somewhat like this:

[Image: calendar heat map of adjusted closing price, one panel per year]
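
The code above depends on a remote download. As a self-contained sketch of the same calendar-style idea, the following uses the built-in airquality data and base graphics (image() rather than ggplot2's geom_tile()). The choice of dataset, palette, and layout are assumptions made for illustration, not part of the original example.

```r
# Mean temperature for each (day of month, month) cell; airquality ships with R.
temp <- with(airquality, tapply(Temp, list(Day = Day, Month = Month), mean))

# Days run along the x axis and months (May to September) along the y axis.
# Cells for days that do not exist (e.g. June 31) are NA and are left blank.
image(x = 1:31, y = 1:5, z = temp,
      col = rev(heat.colors(12)),
      axes = FALSE, xlab = "Day of month", ylab = "Month",
      main = "Daily temperature, New York, 1973")
axis(1, at = c(1, 5, 10, 15, 20, 25, 31))
axis(2, at = 1:5, labels = month.abb[5:9], las = 1)
box()
```

Any data frame with a date and a value can be reshaped into a matrix the same way before calling image().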

RGL: Interactive 3D Graphics

Another package that is well worth the effort to learn is RGL, which easily provides the ability to create interactive 3D graphics. There are many examples online for this (including in the rgl documentation).

The R-Wiki has a nice example of how to plot 3D scatter plots using rgl.

GGobi

Another package that is worth knowing is rggobi. There is a Springer book on the subject, and lots of great documentation/examples online, including at the "Looking at Data" course.

你在我安 2024-08-24 16:33:20

I really like dotplots and find when I recommend them to others for appropriate data problems they are invariably surprised and delighted. They don't seem to get much use, and I can't figure out why.

Here's an example from Quick-R:
[Image: dotplot on car data]

I believe Cleveland is most responsible for the development and promulgation of these, and the example in his book (in which faulty data was easily detected with a dotplot) is a powerful argument for their use. Note that the example above only puts one dot per line, whereas their real power comes when you have multiple dots on each line, with a legend explaining which is which. For instance, you could use different symbols or colors for three different time points, and thereby easily get a sense of time patterns in different categories.

In the following example (done in Excel of all things!), you can clearly see which category might have suffered from a label swap.

[Image: dotplot with 2 groups]
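
The multiple-dots-per-line idea can also be sketched in base R. The example below is only an illustration: it uses the built-in VADeaths matrix (an assumption of this sketch, not the data behind the figures) and overlays one symbol per population group on each age-group line.

```r
# VADeaths: death rates per 1000 in Virginia (1940), age groups x population groups.
rates <- VADeaths
n_rows <- nrow(rates)

# Draw the first group with dotchart(), which places row i at height i ...
dotchart(rates[, 1], labels = rownames(rates), xlim = range(rates), pch = 1,
         xlab = "Deaths per 1000", main = "Death rates by age group")

# ... then overlay the remaining groups on the same lines with other symbols.
for (j in 2:ncol(rates)) {
  points(rates[, j], seq_len(n_rows), pch = j)
}
legend("bottomright", legend = colnames(rates),
       pch = seq_len(ncol(rates)), bty = "n")
```

Reading along a single line compares the groups within one category; reading down one symbol compares categories within a group.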

感性 2024-08-24 16:33:20

Plots using polar coordinates are certainly underused--some would say with good reason. I think the situations which justify their use are not common; I also think that when those situations arise, polar plots can reveal patterns in data that linear plots cannot.

I think that's because sometimes your data is inherently polar rather than linear--e.g., it is cyclical (x-coordinates representing times during a 24-hour day over multiple days), or the data were previously mapped onto a polar feature space.

Here's an example. This plot shows a website's mean traffic volume by hour. Notice the two spikes at 10 pm and at 1 am. For the site's network engineers, those are significant; it's also significant that they occur near each other (just three hours apart). But if you plot the same data on a traditional coordinate system, this pattern would be completely concealed--plotted linearly, these two spikes would be 21 hours apart, which they are, though they are also just three hours apart on consecutive days. The polar chart below shows this in a parsimonious and intuitive way (a legend isn't necessary).

[Image: polar chart showing site traffic, with peaks at hours 1 and 22]

There are two ways (that I'm aware of) to create plots like this using R (I created the plot above with R). One is to code your own function in either the base or grid graphics systems. The other way, which is easier, is to use the circular package. The function you would use is 'rose.diag':

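library(circular)      # provides circular() and rose.diag()
library(RColorBrewer)  # provides brewer.pal()
data = c(35, 78, 34, 25, 21, 17, 22, 19, 25, 18, 25, 21, 16, 20, 26, 
                 19, 24, 18, 23, 25, 24, 25, 71, 27)
three_palettes = c(brewer.pal(12, "Set3"), brewer.pal(8, "Accent"), 
                   brewer.pal(9, "Set1"))
# rose.diag() bins observations (angles), so expand the hourly counts into one mid-hour observation per hit
hours = circular(rep(0:23 + 0.5, times=data), units="hours", template="clock24")
rose.diag(hours, bins=24, main="Daily Site Traffic by Hour", col=three_palettes)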
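
For the first route (coding it yourself in base graphics), a minimal sketch might look like the following. The helper name hour_to_angle, the square-root radius scaling, and the colours are choices made for this illustration; this is not the code used to produce the figure.

```r
# Hourly traffic values from the example; element i is hour i - 1.
traffic <- c(35, 78, 34, 25, 21, 17, 22, 19, 25, 18, 25, 21, 16, 20, 26,
             19, 24, 18, 23, 25, 24, 25, 71, 27)

# Hypothetical helper: map an hour (0-24) to an angle with midnight at the
# top and time running clockwise, like a 24-hour clock face.
hour_to_angle <- function(h) pi / 2 - 2 * pi * h / 24

# Square-root scaling makes wedge area, not radius, proportional to traffic.
radius <- sqrt(traffic / max(traffic))

plot(0, 0, type = "n", xlim = c(-1.2, 1.2), ylim = c(-1.2, 1.2), asp = 1,
     axes = FALSE, xlab = "", ylab = "", main = "Daily Site Traffic by Hour")
for (i in seq_along(traffic)) {
  # Trace the arc of the wedge covering hour i - 1 to hour i.
  a <- seq(hour_to_angle(i - 1), hour_to_angle(i), length.out = 20)
  polygon(c(0, radius[i] * cos(a)), c(0, radius[i] * sin(a)),
          col = "steelblue", border = "white")
}
# Label every third hour just outside the unit circle.
ticks <- seq(0, 21, by = 3)
text(1.1 * cos(hour_to_angle(ticks)), 1.1 * sin(hour_to_angle(ticks)),
     labels = ticks)
```

Because hour 23 wraps around to sit next to hour 0, the late-evening and early-morning peaks appear as near neighbours.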

被你宠の有点坏 2024-08-24 16:33:20

If your scatter plot has so many points that it becomes a complete mess, try a smoothed scatter plot. Here is an example:

library(mlbench) ## this package has a smiley function
n <- 1e5 ## number of points
p <- mlbench.smiley(n,sd1 = 0.4, sd2 = 0.4) ## make a smiley :-)
x <- p$x[,1]; y <- p$x[,2]
par(mfrow = c(1,2)) ## plot side by side
plot(x,y) ## left plot, regular scatter plot
smoothScatter(x,y) ## right plot, smoothed scatter plot

The hexbin package (suggested by @Dirk Eddelbuettel) is used for the same purpose, but smoothScatter() has the advantage that it belongs to the graphics package, and is thus part of the standard R installation.

[Image: smiley as a regular or smoothed scatter plot]

爱情眠于流年 2024-08-24 16:33:20

Regarding sparklines and other Tufte ideas, the YaleToolkit package on CRAN provides the functions sparkline and sparklines.

Another package that is useful for larger datasets is hexbin as it cleverly 'bins' data into buckets to deal with datasets that may be too large for naive scatterplots.
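
To illustrate the binning idea without extra packages, here is a rough sketch in base R. It uses rectangular bins rather than hexagons, so it is not what hexbin does internally; the simulated data and the choice of about 40 bins per axis are assumptions made for the example.

```r
set.seed(42)                      # reproducible simulated data
n <- 1e5
x <- rnorm(n)
y <- x + rnorm(n)

# pretty() returns round break points that cover the full range of the data.
x_breaks <- pretty(x, n = 40)
y_breaks <- pretty(y, n = 40)

# Count the points falling in each rectangular cell.
counts <- table(cut(x, x_breaks, include.lowest = TRUE),
                cut(y, y_breaks, include.lowest = TRUE))

# Plot the counts on a log scale so sparse cells stay visible.
image(x_breaks, y_breaks, log1p(unclass(counts)),
      col = rev(heat.colors(20)), xlab = "x", ylab = "y",
      main = "Binned scatter plot (rectangular bins)")
```

The plot now draws a fixed grid of cells instead of 100,000 individual points, so drawing time no longer grows with the size of the dataset.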

凉栀 2024-08-24 16:33:20

Violin plots (which combine box plots with kernel density) are relatively exotic and pretty cool. The vioplot package in R allows you to make them pretty easily.

Here's an example (the Wikipedia link also shows an example):

[Image: violin plot example]
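
To show what a violin plot is made of, here is a hand-rolled sketch in base R using the built-in InsectSprays data. It does not use the vioplot package or its API; the half_width value and the drawing details are assumptions chosen for illustration.

```r
# Split the insect counts by spray type and estimate a kernel density for each.
groups <- split(InsectSprays$count, InsectSprays$spray)
dens <- lapply(groups, density)

half_width <- 0.4  # maximum half-width of each violin, in x-axis units
y_range <- range(unlist(lapply(dens, function(d) range(d$x))))

plot(0, 0, type = "n", xlim = c(0.5, length(groups) + 0.5), ylim = y_range,
     xaxt = "n", xlab = "Spray", ylab = "Insect count")
axis(1, at = seq_along(groups), labels = names(groups))

for (i in seq_along(groups)) {
  d <- dens[[i]]
  w <- half_width * d$y / max(d$y)          # scale density to the violin width
  # Mirror the density curve around x = i to form the violin outline.
  polygon(c(i - w, rev(i + w)), c(d$x, rev(d$x)), col = "lightgray")
  # Box-plot summary inside the violin: interquartile bar and median point.
  q <- quantile(groups[[i]], c(0.25, 0.5, 0.75))
  segments(i, q[1], i, q[3], lwd = 4)
  points(i, q[2], pch = 21, bg = "white")
}
```

Note that the default kernel density extends slightly beyond the observed data, so the violins can dip below zero even though counts cannot.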

阿楠 2024-08-24 16:33:20

Another nice time series visualization that I was just reviewing is the "bump chart" (as featured in this post on the "Learning R" blog). This is very useful for visualizing changes in position over time.

You can read about how to create it on http://learnr.wordpress.com/, but this is what it ends up looking like:

[Image: bump chart]
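
As a rough sketch of the idea (not the code from the "Learning R" post), a bump chart can be approximated in base R by ranking categories within each time period and plotting the ranks as lines. The built-in WorldPhones data is an assumption chosen only to keep the example self-contained.

```r
# WorldPhones: telephones (thousands) by region for seven years; ships with R.
# Rank the regions within each year, with 1 = most telephones.
ranks <- t(apply(-WorldPhones, 1, rank))
years <- as.numeric(rownames(WorldPhones))
region_col <- seq_len(ncol(ranks))

# One line per region; the y axis is reversed so rank 1 sits at the top.
# Extra room on the right of the x axis leaves space for the legend.
matplot(years, ranks, type = "b", pch = 16, lty = 1, col = region_col,
        xlim = c(min(years), max(years) + 4), ylim = rev(range(ranks)),
        xlab = "Year", ylab = "Rank",
        main = "Regions ranked by number of telephones")
legend("right", legend = colnames(ranks), col = region_col,
       lty = 1, pch = 16, bty = "n", cex = 0.7)
```

Lines that cross mark a change in relative position, which is the feature a bump chart is meant to highlight.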

潦草背影 2024-08-24 16:33:20

I also like Tufte's modifications of boxplots which let you do small multiples comparison much more easily because they are very "thin" horizontally and don't clutter up the plot with redundant ink. However, it works best with a fairly large number of categories; if you've only got a few on a plot the regular (Tukey) boxplots look better since they have a bit more heft to them.

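library(lattice)
library(taRifx)
# Note: `cw` is not defined in this snippet; it is presumably a data frame
# derived from the built-in ChickWeight data.
compareplot(~weight | Diet * Time * Chick, 
  data.frame=cw , 
  main = "Chick Weights",
  box.show.mean=FALSE,
  box.show.whiskers=FALSE,
  box.show.box=FALSE
  )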

[Image: compareplot output]

Other ways of making these (including the other kind of Tufte boxplot) are discussed in this question.
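
One such alternative is to draw the reduced boxplot by hand in base graphics. The sketch below is an assumption-laden illustration, not the taRifx implementation: it takes the five-number summary from boxplot.stats() for the built-in ChickWeight data and draws only the whiskers and a median dot, leaving the box itself as empty space.

```r
# Five-number summary of chick weight at each measurement day.
groups <- split(ChickWeight$weight, ChickWeight$Time)
# Rows: lower whisker, lower hinge, median, upper hinge, upper whisker.
stats <- sapply(groups, function(v) boxplot.stats(v)$stats)
k <- ncol(stats)
pos <- seq_len(k)

plot(0, 0, type = "n", xlim = c(1, k), ylim = range(stats), xaxt = "n",
     bty = "n", xlab = "Time (days)", ylab = "Weight (g)",
     main = "Chick weights, Tufte-style boxplots")
axis(1, at = pos, labels = names(groups))

segments(pos, stats[1, ], pos, stats[2, ])   # lower whisker
segments(pos, stats[4, ], pos, stats[5, ])   # upper whisker
points(pos, stats[3, ], pch = 16, cex = 0.6) # median
```

The gap between each pair of whiskers is the interquartile range, so the plot carries the same summary as a Tukey boxplot with far less ink. Outliers beyond the whiskers are omitted in this sketch.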

¢蛋碎的人ぎ生 2024-08-24 16:33:20

We shouldn't forget about the cute and (historically) important stem-and-leaf plot (which Tufte loves too!). You get a direct numerical overview of your data's density and shape (provided, of course, that your data set is not larger than about 200 points). In R, the function stem prints a stem-and-leaf display to the console. I prefer to use the gstem function from the fmsb package to draw it directly in a graphics device. Below are the beaver body temperatures (the beaver1 data ships with R's datasets package) in a stem-and-leaf display:

  require(fmsb)
  gstem(beaver1$temp)

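
For comparison, the base stem function mentioned above needs no extra packages. A minimal sketch (the scale value is just an example setting):

```r
# Print a text stem-and-leaf display of the beaver body temperatures.
stem(beaver1$temp)

# scale = 2 makes the display roughly twice as long, splitting the stems further.
stem(beaver1$temp, scale = 2)
```

The output is plain text, so it can be pasted directly into an email or a log.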

温柔一刀 2024-08-24 16:33:20

In addition to Tufte's excellent work, I recommend the books by William S. Cleveland: Visualizing Data and The Elements of Graphing Data. Not only are they excellent, but they were all done in R, and I believe the code is publicly available.

难忘№最初的完美 2024-08-24 16:33:20

Horizon graphs (pdf), for visualising many time series at once.

Parallel coordinates plots (pdf), for multivariate analysis.

Association and mosaic plots, for visualising contingency tables (see the vcd package)
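
As a small illustration of one of these, a parallel coordinates plot can be drawn with parcoord from the MASS package, which is a recommended package included in standard R distributions. The iris data and the colours are assumptions chosen for the sketch.

```r
library(MASS)  # recommended package; provides parcoord()

# One colour per species so group structure is visible across the axes.
species_col <- c("steelblue", "darkorange", "forestgreen")[as.integer(iris$Species)]

# One vertical axis per measurement; each flower is a line crossing all four.
parcoord(iris[, 1:4], col = species_col, var.label = TRUE)
```

Each variable is rescaled to its own range, so the plot compares patterns across variables rather than raw magnitudes.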

浮光之海 2024-08-24 16:33:20

Boxplots! Example from the R help:

boxplot(count ~ spray, data = InsectSprays, col = "lightgray")

In my opinion it is the most handy way to take a quick look at the data or to compare distributions.
For more complex distributions there is an extension called vioplot.

十雾 2024-08-24 16:33:20

Mosaic plots seem to me to meet all four criteria mentioned. There are examples in R's help, under mosaicplot.
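
A minimal sketch with the built-in Titanic table (the choice of the two variables is just for illustration):

```r
# Titanic is a 4-way contingency table (Class x Sex x Age x Survived) that ships with R.
# Collapse it to a 2-way table of passenger class by survival.
class_by_survival <- margin.table(Titanic, c(1, 4))

# Each tile's area is proportional to the count in that cell.
mosaicplot(class_by_survival, color = TRUE,
           main = "Survival on the Titanic by class")
```

Column widths show how many people were in each class, and the split within each column shows the proportion who survived.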

千年*琉璃梦 2024-08-24 16:33:20

Check out Edward Tufte's work, especially this book.

You can also try to catch his travelling presentation. It's quite good and includes a bundle of four of his books. (I swear I don't own his publisher's stock!)

By the way, I like his sparkline data visualization technique. Surprise! Google has already written it and put it out on Google Code.

童话 2024-08-24 16:33:20

Summary plots? As mentioned on this page:

Visualizing Summary Statistics and Uncertainty
