命名图中的簇

发布于 2024-12-27 08:21:39 字数 571 浏览 1 评论 0原文

我们有社交图谱，后来被分解为高内聚力的集群。乔纳森·科恩（Jonathan Cohen）称为 Truss 的东西 [1]。

现在我有了这些簇，我想为它们起一个名字。集群名称应允许在不更改名称的情况下对集群大小进行微小的更改。

例如：假设我们有集群 M:

M : {A, B, C, D, E, F}

并假设“命名算法”为其生成了名称“ m ”。

一段时间后，顶点 A 离开了簇，而顶点 J 加入了：

M : {B, C, D, E, F, J}

新生成的名称是“ m' ”。

所需功能：

m' == m     for insignificant cluster changes

[1] http://www.cslu.ogi。 edu/~zak/cs506-pslc/trusses.pdf

原文

We have social graph that is later broken to clusters of high cohesion. Something called Truss by Jonathan Cohen [1].

Now that I have those clusters, I would like to come up with names for them.
Cluster name should allow insignificant changes to the cluster size without changing the name.

For example:
Let's assume we have cluster M:

M : {A, B, C, D, E, F}

and let's assume that "naming algorithm" generated name " m " for it.

After some time, vertex A has left the cluster, while vertex J has joined:

M : {B, C, D, E, F, J}

Newly generated name is " m' ".

Desired feature:

m' == m     for insignificant cluster changes

[1] http://www.cslu.ogi.edu/~zak/cs506-pslc/trusses.pdf

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

浴红衣 2025-01-03 08:21:39

根据您的示例，我假设您的意思是“集群组成的微小变化”，而不是“集群大小”。

如果您的命名函数 f() 无法使用有关给定集群的现有名称的信息，则您必须允许有时它确实会重命名，尽管更改很小。事实上，假设f()从不在集群发生一点变化时重命名它。从集群 A 开始，您可以通过一次仅添加或删除一个元素来访问任何其他集群 B。通过构造，该函数将为 A 和 B 返回相同的名称。由于 A、B 是任意的，f() 将为所有可能的簇返回相同的名称 - 显然没有用。

因此，您有两种选择：

(1) 命名函数依赖于集群的现有名称，或者

(2) 命名函数有时（很少）在非常微小的更改后重命名集群。

如果您选择替代方案（1），那就很简单了。您可以简单地随机分配名称，然后在集群更新时保持它们不变，只要它不是太不同（但是您定义不同）。考虑到它的简单性，我想这不是您想要的。

如果您选择替代方案 (2)，则需要使用有关集群中底层对象的一些信息。如果您拥有的只是指向没有内部结构的各种对象的链接，则无法完成此操作，因为除了簇大小之外，该函数没有任何可处理的内容。

假设您有一些有关对象的信息。例如，您可能有他们的名字。将每个对象名称的前 k 个字母称为对象的前缀。计算集群中所有不同前缀，并找到 n 个最常见的前缀。按字母顺序对这些 n 前缀进行排序，并将它们按该顺序相互附加。对于 k、n 的合理选择（这应该取决于簇的数量和典型的对象名称长度），您将得到您想要的结果 - 只要每个簇中有足够的对象。

例如，如果对象有人名，请尝试 k = 2；如果您有数百个集群，也许可以尝试 n = 2。

当然，可以通过重新映射名称以实现更均匀的分布、处理两个前缀具有相似频率的情况等来大大改善这一点。

Based on your example, I assume you mean "insignificant changes to the cluster composition", not to the "cluster size".

If your naming function f() cannot use the information about the existing name for the given cluster, you would have to allow that sometimes it does rename despite the change being small. Indeed, suppose that f() never renames a cluster when it changes just a little. Starting with cluster A, you can get to any other cluster B by adding or removing only one element at a time. By construction, the function will return the same name for A and B. Since A, B were arbitrary, f() will return the same name for all possible clusters - clearly useless.

So, you have two alternatives:

(1) the naming function relies on the existing name of a cluster, or

(2) the naming function sometimes (rarely) renames a cluster after a very tiny change.

If you go with alternative (1), it's trivial. You can simply assign names randomly, and then keep them unchanged whenever the cluster is updated as long as it's not too different (however you define different). Given how simple it is, I suppose that's not what you want.

If you go with alternative (2), you'll need to use some information about the underlying objects in the cluster. If all you have are links to various objects with no internal structure, it can't be done, since the function wouldn't have anything to work with apart from cluster size.

So let's say you have some information about the objects. For example, you may have their names. Call the first k letters of each object's name the object's prefix. Count all the different prefixes in your cluster, and find the n most common ones. Order these n prefixes alphabetically, and append them to each other in that order. For a reasonable choice of k, n (which should depend on the number of your clusters and typical object name lengths), you would get the result you seek - as long as you have enough objects in each cluster.

For instance, if objects have human names, try k = 2; and if you have hundreds of clusters, perhaps try n = 2.

This of course, can be greatly improved by remapping names to achieve a more uniform distribution, handling the cases where two prefixes have similar frequencies, etc.

回复收藏 0 原文

~没有更多了~