How should I change my graph structure (insertion is very slow)?

Posted on 2024-08-28 13:47:17

The program I'm writing models a social network, so there are users and their profiles. The profile structure is UserProfile.

Now, there are various possible Graph implementations and I don't think I'm using the best one. I have a Graph structure that holds a pointer to a linked list of type Vertex. Each Vertex element has a value, a pointer to the next Vertex and a pointer to a linked list of type Edge. Each Edge element has a value (so I can define weights and whatever else is needed), a pointer to the next Edge and a pointer to the owner Vertex.
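For reference, the layout described above could be sketched in C roughly as follows. This is only an illustration: the field names, the `double` weight, and the helpers `graph_add_vertex` and `vertex_add_edge` are assumptions, and the Edge's Vertex pointer is assumed here to refer to the vertex at the other end of the relation.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct UserProfile UserProfile; /* opaque here; defined elsewhere in the program */
typedef struct Vertex Vertex;
typedef struct Edge Edge;

struct Edge {
    double  weight;  /* the edge "value"; a numeric weight is an assumption */
    Edge   *next;    /* next edge in the same vertex's edge list */
    Vertex *vertex;  /* assumed: the vertex at the other end of the relation */
};

struct Vertex {
    UserProfile *value;  /* shared pointer to the user's profile */
    Vertex      *next;   /* next vertex in the graph's vertex list */
    Edge        *edges;  /* head of this vertex's edge list */
};

typedef struct {
    Vertex *head;        /* head of the vertex list */
} Graph;

/* O(1): pushes a new vertex at the head of the list. Returns NULL on allocation failure. */
Vertex *graph_add_vertex(Graph *g, UserProfile *profile) {
    Vertex *v;
    assert(g != NULL);
    v = malloc(sizeof *v);
    if (v == NULL) return NULL;
    v->value = profile;
    v->edges = NULL;
    v->next = g->head;
    g->head = v;
    return v;
}

/* O(1) once both vertices are already known: pushes an edge at the head of from's list. */
Edge *vertex_add_edge(Vertex *from, Vertex *to, double weight) {
    Edge *e;
    assert(from != NULL && to != NULL);
    e = malloc(sizeof *e);
    if (e == NULL) return NULL;
    e->weight = weight;
    e->vertex = to;
    e->next = from->edges;
    from->edges = e;
    return e;
}
```

Both insertions are constant time on their own; what this sketch leaves out is how the two Vertex pointers passed to `vertex_add_edge` are found in the first place.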

I have 2 sample files with data to process (in CSV style) and insert into the Graph. The first one is the user data (one user per line); the second one is the user relations (for the graph). The first file is inserted into the graph quickly because I always insert at the head and there are only ~18000 users. The second file takes ages, even though I also insert the edges at the head. It has ~520000 lines of user relations and takes 13-15 minutes to insert into the Graph. I made a quick test and reading the data is pretty quick, practically instantaneous. The problem is in the insertion.

This problem exists because my Graph is implemented with a linked list for the vertices. Every time I need to insert a relation, I have to look up 2 vertices so I can link them together. This is the problem... Doing this for ~520000 relations takes a while.
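To make the cost concrete, here is a minimal sketch of the kind of linear lookup that a vertex linked list forces. The `user_id` key and the `find_vertex` helper are hypothetical, since the post does not show the real lookup. If the looked-up vertices were spread evenly over the list, each lookup would visit about half of the ~18000 nodes, so ~520000 relations with 2 lookups each would come to roughly 520000 × 2 × 9000 ≈ 9.4 × 10^9 node visits.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Vertex {
    int            user_id;  /* hypothetical key; the real program may key on something else */
    struct Vertex *next;
} Vertex;

/* Linear scan over the vertex list.
 * Stores the number of nodes inspected in *visited and returns the match, or NULL. */
Vertex *find_vertex(Vertex *head, int user_id, size_t *visited) {
    size_t n = 0;
    Vertex *v;
    assert(visited != NULL);
    for (v = head; v != NULL; v = v->next) {
        n++;
        if (v->user_id == user_id) break;
    }
    *visited = n;
    return v;
}
```

The scan length grows with the position of the vertex in the list, which is why edge insertion itself can be O(1) while loading the relations file is still slow.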

How should I solve this?

Solution 1) Some people recommended that I implement the Graph (the vertices part) as an array instead of a linked list. This way I have direct access to every vertex and the insertion time is probably going to drop considerably. But I don't like the idea of allocating an array with [18000] elements. How practical is this? My sample data has ~18000 users, but what if I need far fewer or far more? The linked list approach has that flexibility: I can have whatever size I want as long as there's memory for it. The array doesn't, so how am I going to handle such a situation? What are your suggestions?

Using linked lists is good for space complexity but bad for time complexity. And using an array is good for time complexity but bad for space complexity.
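One common way to soften that trade-off is an array that grows on demand, so the size does not have to be fixed at 18000 up front. The sketch below is only an illustration: `VertexArray`, `vertex_array_push`, the starting capacity of 16 and the 50% growth step are all assumptions, not something from the project.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct UserProfile UserProfile;  /* opaque here */
typedef struct Edge Edge;                /* opaque here */

typedef struct {
    UserProfile *value;
    Edge        *edges;
} Vertex;

typedef struct {
    Vertex *items;     /* contiguous storage, indexable in O(1) */
    size_t  count;     /* vertices in use */
    size_t  capacity;  /* vertices allocated */
} VertexArray;

/* Appends a vertex and returns its index, or (size_t)-1 on allocation failure.
 * The array starts empty (items == NULL) and grows by 50% whenever it is full. */
size_t vertex_array_push(VertexArray *a, UserProfile *profile) {
    assert(a != NULL);
    if (a->count == a->capacity) {
        size_t new_cap = a->capacity < 16 ? 16 : a->capacity + a->capacity / 2;
        Vertex *grown = realloc(a->items, new_cap * sizeof *grown);
        if (grown == NULL) return (size_t)-1;
        a->items = grown;
        a->capacity = new_cap;
    }
    a->items[a->count].value = profile;
    a->items[a->count].edges = NULL;
    return a->count++;
}

/* Releases the storage and resets the array to the empty state. */
void vertex_array_free(VertexArray *a) {
    assert(a != NULL);
    free(a->items);
    a->items = NULL;
    a->count = 0;
    a->capacity = 0;
}
```

One caveat with this design: `realloc` may move the whole array, so anything that stored a pointer to a Vertex before a growth step would be left dangling. Storing the index returned by `vertex_array_push` instead of a pointer avoids that.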

Any thoughts about this solution?

Solution 2) This project also demands that I have some sort of data structure that allows quick lookup based on a name index and an ID index. For this I decided to use Hash Tables. My tables are implemented with separate chaining as collision resolution, and when a load factor of 0.70 is reached, I recreate the table. I base the next table size on this Link.
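As an illustration of the table just described, here is a minimal separate-chaining sketch keyed by name that rebuilds itself before the load factor would exceed 0.70. Everything in it is an assumption: the names `NameTable`, `table_put` and `table_get`, the djb2-style string hash, and in particular the "2n + 1" rule for the next size, because the linked sizing policy is not available here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct Entry {
    const char   *name;   /* key; not copied, the caller keeps it alive */
    void         *value;  /* e.g. a UserProfile pointer */
    struct Entry *next;   /* next entry in the same bucket (separate chaining) */
} Entry;

typedef struct {
    Entry **buckets;
    size_t  bucket_count;
    size_t  size;         /* number of stored entries */
} NameTable;

static size_t hash_name(const char *s) {
    size_t h = 5381;      /* djb2-style string hash */
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Relinks every existing entry into a fresh bucket array of new_count slots. */
static int table_rehash(NameTable *t, size_t new_count) {
    Entry **fresh = calloc(new_count, sizeof *fresh);
    if (fresh == NULL) return 0;
    for (size_t i = 0; i < t->bucket_count; i++) {
        Entry *e = t->buckets[i];
        while (e != NULL) {
            Entry *next = e->next;
            size_t slot = hash_name(e->name) % new_count;
            e->next = fresh[slot];
            fresh[slot] = e;
            e = next;
        }
    }
    free(t->buckets);
    t->buckets = fresh;
    t->bucket_count = new_count;
    return 1;
}

/* Inserts without checking for duplicates; a newer entry shadows an older one.
 * Returns 1 on success, 0 on allocation failure. */
int table_put(NameTable *t, const char *name, void *value) {
    Entry *e;
    size_t slot;
    assert(t != NULL && name != NULL);
    /* Grow first if this insert would push the load factor past 0.70. */
    if (t->bucket_count == 0 || (t->size + 1) * 10 > t->bucket_count * 7) {
        size_t next = t->bucket_count == 0 ? 11 : t->bucket_count * 2 + 1;
        if (!table_rehash(t, next)) return 0;
    }
    e = malloc(sizeof *e);
    if (e == NULL) return 0;
    slot = hash_name(name) % t->bucket_count;
    e->name = name;
    e->value = value;
    e->next = t->buckets[slot];
    t->buckets[slot] = e;
    t->size++;
    return 1;
}

/* Returns the stored value for name, or NULL if it is not present. */
void *table_get(const NameTable *t, const char *name) {
    assert(t != NULL && name != NULL);
    if (t->bucket_count == 0) return NULL;
    for (Entry *e = t->buckets[hash_name(name) % t->bucket_count]; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0) return e->value;
    return NULL;
}
```

A table is expected to start zero-initialized (`NameTable t = { NULL, 0, 0 };`). Because the value is a plain pointer, the same table works whether it points at a UserProfile or at something else.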

Currently, both Hash Tables hold a pointer to the UserProfile instead of duplicating the user profile itself. Duplicating would be stupid: changing data would require 3 changes, and it's really dumb to do it that way. So I just save the pointer to the UserProfile. The same user profile pointer is also saved as the value in each Graph Vertex.

So, I have 3 data structures, one Graph and two Hash Tables, and every single one of them points to the exact same UserProfile. The Graph structure will serve the purpose of finding the shortest path and stuff like that, while the Hash Tables serve as quick indexes by name and ID.

What I'm thinking of doing to solve my Graph problem is this: instead of having the Hash Table values point to the UserProfile, I point them to the corresponding Vertex. It's still a pointer, so no more and no less space is used; I just change what I point to.

Like this, I can easily and quickly look up each Vertex I need and link them together. This should insert the ~520000 relations pretty quickly.

I thought of this solution because I already have the Hash Tables and I need to have them anyway, so why not take advantage of them for indexing the Graph vertices instead of the user profiles? It's basically the same thing: I can still access the UserProfile pretty quickly, I just go to the Vertex and then to the UserProfile.
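A minimal sketch of that idea follows, with the ID index pointing at vertices so that a relation is inserted with two hash lookups instead of two list scans. All names (`IdIndex`, `index_put`, `index_get`, `add_relation`, `ID_BUCKETS`) are hypothetical, IDs are assumed to be integers, relations are assumed to be undirected, and the bucket count is fixed only to keep the example short.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct UserProfile { int id; } UserProfile;  /* hypothetical minimal profile */
typedef struct Vertex Vertex;
typedef struct Edge { Vertex *to; struct Edge *next; } Edge;
struct Vertex { UserProfile *profile; Edge *edges; Vertex *next; };

#define ID_BUCKETS 4096  /* fixed for brevity; a real table would resize on load factor */

typedef struct IdEntry {
    int             id;
    Vertex         *vertex;  /* the index points at the Vertex, not the UserProfile */
    struct IdEntry *next;
} IdEntry;

typedef struct { IdEntry *buckets[ID_BUCKETS]; } IdIndex;

/* Registers a vertex under an ID. Returns 1 on success, 0 on allocation failure. */
int index_put(IdIndex *ix, int id, Vertex *v) {
    IdEntry *e;
    size_t slot = (unsigned)id % ID_BUCKETS;
    assert(ix != NULL);
    e = malloc(sizeof *e);
    if (e == NULL) return 0;
    e->id = id;
    e->vertex = v;
    e->next = ix->buckets[slot];
    ix->buckets[slot] = e;
    return 1;
}

/* Expected O(1) lookup of the vertex registered under id, or NULL. */
Vertex *index_get(const IdIndex *ix, int id) {
    assert(ix != NULL);
    for (IdEntry *e = ix->buckets[(unsigned)id % ID_BUCKETS]; e != NULL; e = e->next)
        if (e->id == id) return e->vertex;
    return NULL;
}

/* Links two users: two hash lookups plus two head insertions.
 * Returns 0 if either ID is unknown or an allocation fails. */
int add_relation(IdIndex *ix, int id_a, int id_b) {
    Vertex *a = index_get(ix, id_a);
    Vertex *b = index_get(ix, id_b);
    Edge *ab, *ba;
    if (a == NULL || b == NULL) return 0;
    ab = malloc(sizeof *ab);
    ba = malloc(sizeof *ba);
    if (ab == NULL || ba == NULL) { free(ab); free(ba); return 0; }
    ab->to = b; ab->next = a->edges; a->edges = ab;
    ba->to = a; ba->next = b->edges; b->edges = ba;
    return 1;
}
```

The profile stays one hop away (`index_get(ix, id)->profile`), and the vertex list itself can remain a linked list because nothing scans it during loading any more.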

But do you see any cons to this second solution compared with the first one? Or only pros that outweigh the pros and cons of the first solution?

Other Solution) If you have any other solution, I'm all ears. But please explain the pros and cons of that solution over the previous 2. I really don't have much time to spend on this right now, I need to move on with this project, so if I'm going to make such a change, I need to understand exactly what to change and whether that's really the way to go.

Hopefully no one fell asleep reading this and closed the browser; sorry for the long post. But I really need to decide what to do about this and I really need to make a change.

P.S.: When answering my proposed solutions, please number them as I did so I know exactly what you are talking about and don't confuse myself more than I already am.

Comments (1)

回忆追雨的时光 2024-09-04 13:47:17

Since the main issue here is speed, I would prefer the first approach (the array).

You should, of course, maintain the hash table for the name-index lookup.

If I understood correctly, you only process the data once, so there is no dynamic data insertion.

To deal with the space allocation problem, I would recommend:

1 - Read the file once to get the number of vertices.

2 - Allocate that space.

If your data is dynamic, you could implement some simple method to increase the array size in steps of 50%.

3 - For the Edges, replace your linked list with an array. This array should grow dynamically in steps of 50%.
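Steps 1 and 2 could look something like the sketch below, which counts the records in a first pass so the vertex array can be allocated at exactly the right size. The helper name `count_lines` is hypothetical, and it assumes one user per line with no header row.

```c
#include <assert.h>
#include <stdio.h>

/* Counts the lines in an open stream, then rewinds it for the real parsing pass.
 * A final line without a trailing newline still counts as a line. */
size_t count_lines(FILE *fp) {
    size_t lines = 0;
    int c;
    int last = '\n';
    assert(fp != NULL);
    while ((c = fgetc(fp)) != EOF) {
        if (c == '\n') lines++;
        last = c;
    }
    if (last != '\n') lines++;
    rewind(fp);
    return lines;
}
```

The result can then be passed straight to an allocation such as `calloc(count_lines(fp), sizeof(Vertex))`. The question reports that reading the data is practically instantaneous, so the extra pass should cost little.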

Even with the "extra" space allocated when you grow in steps of 50%, the total size used by the array should be only marginally larger than the size of the linked list.
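A rough way to check that claim is to compare the bytes per edge in both layouts. The sketch below is only an estimate under assumptions: the struct layouts (`EdgeNode`, `EdgeSlot`), the `double` weight and the starting capacity of 4 are hypothetical, and allocator overhead per linked-list node is ignored.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Vertex Vertex;  /* opaque here */

/* Linked-list edge: payload plus a next pointer. */
typedef struct EdgeNode {
    double           weight;
    Vertex          *to;
    struct EdgeNode *next;
} EdgeNode;

/* Array edge: payload only. */
typedef struct {
    double  weight;
    Vertex *to;
} EdgeSlot;

/* Capacity an array ends up with after holding n items,
 * starting at 4 slots and growing by 50% whenever it is full. */
size_t grown_capacity(size_t n) {
    size_t cap = 4;
    while (cap < n) cap += cap / 2;
    assert(cap >= n);
    return cap;
}

/* Bytes used by n linked-list nodes (per-allocation overhead not counted). */
size_t list_bytes(size_t n) { return n * sizeof(EdgeNode); }

/* Bytes used by a 50%-growth array holding n edges, unused slots included. */
size_t array_bytes(size_t n) { return grown_capacity(n) * sizeof(EdgeSlot); }
```

On a typical 64-bit platform `EdgeNode` is likely 24 bytes and `EdgeSlot` 16, so even an array with close to 50% unused slots takes about the same space as the list, before counting the allocator's bookkeeping for each separately allocated node.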

I hope this helps.
