Data visualization: bubble charts, Venn diagrams, and tag clouds (oh my!)
Suppose I have a large list of objects (thousands or tens of thousands), each of which is tagged with a handful of tags.
There are dozens or hundreds of possible tags and their usage follows a typical power law:
some tags are used extremely often but most are rare.
All but the most frequent couple dozen tags could typically be ignored, in fact.
Now the problem is how to visualize the relationship between these tags.
A tag cloud is a nice visualization of just their frequencies but it ignores which tags occur with which other tags.
Suppose tag :bar only occurs on objects also tagged :foo.
That should be visually apparent.
Similarly for three tags that tend to occur together.
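Before drawing anything, the ":bar only occurs with :foo" relationship can be checked directly. This is a minimal sketch, assuming each object is represented as a set of tag strings; the sample data and the function name `implied_tags` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample data: each object is represented by its set of tags.
objects = [
    {"foo", "bar"},
    {"foo"},
    {"foo", "bar", "baz"},
    {"baz"},
]

def implied_tags(objects):
    """Return sorted pairs (a, b) where every object tagged a is also tagged b."""
    # Map each tag to the set of object indices that carry it.
    carriers = defaultdict(set)
    for index, tags in enumerate(objects):
        for tag in tags:
            carriers[tag].add(index)
    # a implies b when the carriers of a are a subset of the carriers of b.
    return sorted(
        (a, b)
        for a in carriers
        for b in carriers
        if a != b and carriers[a] <= carriers[b]
    )

print(implied_tags(objects))  # [('bar', 'foo')]
```

Pairs found this way are the ones a good visualization should show as one bubble sitting entirely inside another.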
You could make each tag a bubble and let them partially overlap with each other.
Technically that's a Venn diagram but treating it that way might be unwieldy.
For example, Google Charts can create Venn diagrams, but only for 3 or fewer sets (tags):
http://code.google.com/apis/chart/docs/gallery/venn_charts.html
The reason they limit it to 3 sets is that any more and it looks horrendous.
See "Extensions to higher numbers of sets" on the Wikipedia page: http://en.wikipedia.org/wiki/Venn_diagrams
But that's only if every possible intersection is non-empty.
If no more than 3 tags ever co-occur (maybe after throwing out the rare tags) then a collection of Venn diagrams could work (with the sizes of the bubbles representing tag frequency).
Or perhaps a graph (as in vertices and edges) with visually thicker or thinner edges to represent frequency of co-occurrence.
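The graph idea needs two inputs: a size for each vertex (tag frequency) and a weight for each edge (co-occurrence count). This is a rough sketch of how those could be computed, assuming each object is a collection of tag strings; the function name `cooccurrence_graph` and the `top_n` cutoff for dropping rare tags are illustrative choices.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(objects, top_n=24):
    """Return (nodes, edges) for the top_n most frequent tags.

    nodes maps tag -> number of objects carrying it (vertex size).
    edges maps (tag_a, tag_b) -> number of objects carrying both (edge thickness).
    """
    frequency = Counter(tag for tags in objects for tag in set(tags))
    # Keep only the most frequent tags; the rare tail is ignored.
    keep = {tag for tag, _ in frequency.most_common(top_n)}
    edges = Counter()
    for tags in objects:
        kept = sorted(set(tags) & keep)
        # Every unordered pair of kept tags on the same object adds 1 to that edge.
        for pair in combinations(kept, 2):
            edges[pair] += 1
    nodes = {tag: frequency[tag] for tag in keep}
    return nodes, dict(edges)

nodes, edges = cooccurrence_graph(
    [["foo", "bar"], ["foo", "bar"], ["foo", "baz"], ["qux"]], top_n=2
)
print(nodes)
print(edges)
```

The two dictionaries can then be handed to whatever graph-drawing tool is used, with edge weight mapped to line thickness.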
Do you have any ideas, or pointers to tools or libraries?
Ideally I'd do this with JavaScript but I'm open to things like R and Mathematica or really anything else.
I'm happy to share some actual data (you'll laugh if I tell you what it represents) if anyone is curious.
Addendum: The application I originally had in mind was TagTime, but it occurs to me that this also maps well to the problem of visualizing one's Delicious bookmarks.
Comments (4)
If I understand your question correctly, an image matrix should work nicely here. The implementation I have in mind would be an n x m matrix in which the tagged items are rows and each tag is a separate column. The matrix would consist entirely of 1s and 0s, i.e., a particular item either has a given tag or it doesn't.
In the matrix I plotted (which I rotated 90 degrees so it would fit better in this window, so columns actually represent tagged items, and each row shows the presence or absence of a given tag across all items), I simulated the scenario in which there are 8 tags and 200 tagged items. In the plot, a "0" is blue and a "1" is light yellow.
All values in this matrix were randomly selected (each tagged item is eight draws from a box containing two tokens, one blue and one yellow, meaning no tag and tag, respectively). So, not surprisingly, there's no visual evidence of a pattern here, but if there is one in your data, this technique, which is dead simple to implement, can help you find it.
I used R to generate and plot the simulated data, using only base graphics (no external packages or libraries).
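The R code from this answer is not reproduced here. As a stand-in, here is a minimal Python sketch of the same simulation: 8 tags by 200 items, each cell a fair coin flip. The function names, the fixed seed, and the text rendering ("#" for tag present, "." for absent, in place of the yellow and blue cells) are assumptions made for illustration.

```python
import random

def simulate_tag_matrix(n_tags=8, n_items=200, seed=0):
    """Rows are tags, columns are items; each cell is 1 (tagged) or 0 (not tagged)."""
    rng = random.Random(seed)  # fixed seed so the picture is reproducible
    return [[rng.randint(0, 1) for _ in range(n_items)] for _ in range(n_tags)]

def render(matrix, present="#", absent="."):
    """Draw the matrix as text, one line per tag."""
    return "\n".join(
        "".join(present if cell else absent for cell in row) for row in matrix
    )

matrix = simulate_tag_matrix()
print(render(matrix))
```

With real data, sorting the columns (for example by which tags each item carries) tends to make co-occurring tags show up as aligned blocks.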
I would create something like this if you are targeting the web. Edges connecting the nodes could be thicker or darker in color, or a stronger force could connect them so that they sit closer together. I would also add the tag name inside the circle.
Some libraries that would be very good for this include:
Some other fun JavaScript libraries worth looking into are:
Although this is an old thread, I just came across it today.
You may also want to consider using a Self-Organizing Map.
Here is an example of a self-organizing map for world poverty. It used 39 of what you call your "tags" to arrange what you call your "objects".
http://www.cis.hut.fi/research/som-research/povertymap.gif
Not sure it would work as I did not test that, but here is how I would start:
You can create a matrix as doug suggests in his answer, but instead of having documents as rows and tags as columns, you take a square matrix where tags are both rows and columns. The value of the cell [T1;T2] is the number of documents tagged with both T1 and T2 (note that this gives you a symmetric matrix, because [T1;T2] has the same value as [T2;T1]).
Once you have done that, each row (or column) is a vector locating the tag in a space with T dimensions. Tags near each other in this space often occur together. To visualize co-occurrence you can then use a dimensionality-reduction method or any clustering method. For example, you can use a Kohonen self-organizing map to project your T-dimensional space onto a 2D space. You then get a 2D matrix where each cell represents an abstract vector in the tag space (meaning the vector won't necessarily exist in your data set). This vector reflects a topological constraint of your source space, and can be seen as a "model" vector reflecting a significant co-occurrence of some tags. Moreover, cells near each other on this map represent vectors close to each other in the source space, thus allowing you to map the tag space onto a 2D matrix.
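The square tag-by-tag matrix could be built along these lines. This is a minimal sketch, assuming each document is a collection of tag strings; the function name `cooccurrence_matrix` is hypothetical. As a side effect, the diagonal cell [T1;T1] holds the number of documents tagged T1.

```python
def cooccurrence_matrix(documents, tags):
    """Square matrix where cell [i][j] counts documents tagged with both tags[i] and tags[j]."""
    index = {tag: position for position, tag in enumerate(tags)}
    size = len(tags)
    matrix = [[0] * size for _ in range(size)]
    for doc_tags in documents:
        # Positions of the tracked tags present on this document (duplicates ignored).
        present = [index[tag] for tag in set(doc_tags) if tag in index]
        for i in present:
            for j in present:
                matrix[i][j] += 1
    return matrix

documents = [{"foo", "bar"}, {"foo"}, {"foo", "bar", "baz"}]
print(cooccurrence_matrix(documents, ["foo", "bar", "baz"]))
# [[3, 2, 1], [2, 2, 1], [1, 1, 1]]
```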
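To make the self-organizing map step concrete, here is a deliberately small, untuned sketch in plain Python. It assumes the input vectors are rows of a tag co-occurrence matrix; the grid size, learning-rate schedule, neighbourhood radius, and function names are all arbitrary illustrative choices, and a real project would more likely use an existing SOM package.

```python
import math
import random

def best_matching_unit(grid, vector):
    """Return the grid cell whose weight vector is closest (squared Euclidean) to vector."""
    return min(
        grid,
        key=lambda cell: sum((w - x) ** 2 for w, x in zip(grid[cell], vector)),
    )

def train_som(vectors, rows=4, cols=4, epochs=200, seed=0):
    """Train a rows x cols map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    high = max(max(v) for v in vectors) or 1
    # Start with random weights in the same range as the data.
    grid = {
        (r, c): [rng.uniform(0, high) for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    start_radius = max(rows, cols) / 2
    for epoch in range(epochs):
        progress = epoch / epochs
        rate = 0.5 * (1 - progress)                          # learning rate decays linearly
        radius = max(start_radius * (1 - progress), 0.5)     # neighbourhood shrinks over time
        for vector in rng.sample(vectors, len(vectors)):     # visit vectors in shuffled order
            bmu = best_matching_unit(grid, vector)
            for cell, weights in grid.items():
                dist2 = (cell[0] - bmu[0]) ** 2 + (cell[1] - bmu[1]) ** 2
                influence = math.exp(-dist2 / (2 * radius ** 2))
                # Pull this cell's weights toward the vector, more strongly near the BMU.
                for k in range(dim):
                    weights[k] += rate * influence * (vector[k] - weights[k])
    return grid

# Hypothetical tag vectors (e.g. rows of a co-occurrence matrix).
tag_vectors = {"foo": [3, 2, 0], "bar": [2, 2, 0], "baz": [0, 0, 4]}
som = train_som(list(tag_vectors.values()), rows=3, cols=3, epochs=50)
for tag, vector in tag_vectors.items():
    print(tag, best_matching_unit(som, vector))
```

Each tag is then placed at the grid cell of its best matching unit. With raw counts, very frequent tags dominate the distances, so normalizing each row first may be worth trying.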
Final visualization of the matrix can be done in many ways but I cannot give you advice on that without first seeing the results of the previous processing.