Best way to scale data, decrease loading time, and make my webhost happy
For a Facebook application, I have to store a user's list of friends in my MySQL database. This list is requested from my database, compared with other data, and so on.
Currently, I store this list of friends in my user table: the UIDs of the friends are put together in one 'text' field, with '|' as the separator. For example:
ID - UID - NAME - FRIENDS => 1 - 123456789 - John Doe - 987654321|123456|765432
My PHP file requests this row and extracts the list of friends by exploding that field on '|'. This all works fine; every 1,000 users take about 5 MB of disk space.
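For reference, the current approach amounts to the following. This is only a sketch in Python rather than PHP, and the function name `parse_friends` and the `row` dictionary are made up for illustration:

```python
def parse_friends(field: str) -> list[int]:
    """Split a '|'-separated friends field into a list of integer UIDs."""
    # An empty field means no friends; splitting it would yield [''].
    if not field:
        return []
    return [int(uid) for uid in field.split("|")]


# Hypothetical row, mirroring the example above.
row = {"id": 1, "uid": 123456789, "name": "John Doe",
       "friends": "987654321|123456|765432"}
friends = parse_friends(row["friends"])
```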
Now the problem:
For an extra feature, I also need to save the names of the friends of the user. I can do this in different ways:
1) Save this data in an extra table. For example:
ID - UID - NAME => 1 - 1234321 - Jane Doe
If I need the name of the friend with UID 1234321, I can request the name from this table. However, the problem is that this table will keep growing until all users on Facebook are indexed (more than 500 million rows). My webhost is not going to like this! Such a table will take about 25 GB of disk space.
2) Another solution is to extend the data saved in the user table, by adding the name to the UID in the friends field (with an extra separator, let's use ','). For example:
ID - UID - NAME - FRIENDS => 1 - 123456789 - John Doe - 987654321,Mike Jones|123456,Tom Bright|765432,Rick Smith
For this solution I have to alter the script to add another explode on ',', and so on. I'm not sure how much extra disk space this is going to take... But the data doesn't get any easier to handle this way!
3) A third solution gives a good overview of all the data, but will cause the database to be huge. In this solution we create a table of friends, with a row for every friendship. For example:
ID - UID - FRIENDUID => 1 - 123456789 - 54321
ID - UID - FRIENDUID => 3 - 123456789 - 65432
ID - UID - FRIENDUID => 2 - 987654321 - 54321
ID - UID - FRIENDUID => 4 - 987654321 - 65432
As you can see in this example, it gives a very good overview of all the friendships. However, with about 500 million users, and let's say an average of 300 friendships per user, this will create a table with 150 billion rows. My host is definitely not going to like that... AND I think this kind of table will take a lot of disk space...
So... how do I solve this problem? What do you think is the best way to store the UIDs and names of a user's Facebook friends? How do I scale this kind of data? Or do you have another (better) solution than the three possibilities mentioned above?
Hope you can help me!
Comments (3)
If storing the names of the users you need really takes 25GB, then it takes 25GB. You can't move data around and expect it to get smaller - and the overhead of a table is not that much. Instead, you need to focus on only storing the data you actually need. It is unlikely that everyone on Facebook uses your application (if it were the case, you shouldn't be using a host where 25GB of space is a worry).
So instead of indexing the entirety of Facebook (which would be difficult regardless), just store the data relevant for the people who actually use your application and their immediate friends, which is a much smaller dataset.
Your first proposed solution is the proper way to do it; it eliminates any potential redundancy in name storage.
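A minimal sketch of that first solution, written in Python with an in-memory SQLite database rather than PHP and MySQL so that it runs on its own. The table name `names` and the helpers `remember_names` and `lookup_name` are assumptions for illustration, not part of the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE names (
        uid  INTEGER PRIMARY KEY,   -- Facebook UID, one row per person
        name TEXT NOT NULL
    )
""")


def remember_names(conn, people):
    """Insert or refresh (uid, name) pairs; each UID is stored only once."""
    conn.executemany(
        "INSERT OR REPLACE INTO names (uid, name) VALUES (?, ?)", people
    )


def lookup_name(conn, uid):
    """Return the stored name for a UID, or None if it was never seen."""
    row = conn.execute(
        "SELECT name FROM names WHERE uid = ?", (uid,)
    ).fetchone()
    return row[0] if row else None


# Two app users share a friend; the shared friend is stored a single time.
remember_names(conn, [(123456789, "John Doe"), (987654321, "Mike Jones")])
remember_names(conn, [(987654321, "Mike Jones"), (123456, "Tom Bright")])
count = conn.execute("SELECT COUNT(*) FROM names").fetchone()[0]
```

Only people who use the application, plus their direct friends, ever reach `remember_names`, so the table grows with the application's audience rather than with all of Facebook. In MySQL the same upsert would typically be written with `INSERT ... ON DUPLICATE KEY UPDATE`.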
I agree with Amber, solution 1 is going to be the most efficient way to store this data. If you want to stick with your current approach (similar to solution 2), you may want to consider storing the friendship data as a JSON string. It won't produce the shortest possible string, but it will be very easy to parse.
To save the data:
To get the data back:
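A minimal sketch of that round trip, shown in Python; in PHP the corresponding built-ins are `json_encode` and `json_decode`. The field layout (a list of objects with `uid` and `name` keys) and the helper names are assumptions, not something the answer specifies:

```python
import json

# Hypothetical friends list for one user, using the names from the question.
friends = [
    {"uid": 987654321, "name": "Mike Jones"},
    {"uid": 123456, "name": "Tom Bright"},
    {"uid": 765432, "name": "Rick Smith"},
]


def encode_friends(friends):
    """Serialize the list to a compact JSON string for the text column."""
    return json.dumps(friends, separators=(",", ":"))


def decode_friends(field):
    """Parse the stored JSON string back into a list of dictionaries."""
    return json.loads(field)


stored = encode_friends(friends)     # value to write to the friends field
restored = decode_friends(stored)    # value after reading the field back
names_by_uid = {f["uid"]: f["name"] for f in restored}
```

Unlike the '|' and ',' scheme, this keeps working when a friend's name itself contains a separator character.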
I really think you should go with the third option. For scalability you would want to do this.
With the first method you have a LOT of redundant data because if 1 is friends with 2, 2 is also friends with 1. But you are storing both relations.
This also makes the 150 billion row count impossible. More likely it will be at most half that, because each row in the relations table can represent the friendship in both directions!
So the first user will generate 300 rows in the table, but the second user (if he is friends with 1) will generate just 299. Continue this way and the last user won't generate any relation rows at all, because they are all already present!
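As a rough check of that halving, taking the question's figures (500 million users, 300 friends each on average) and assuming every friendship is mutual and stored exactly once:

```latex
\begin{align*}
\text{rows, one per direction} &= 5 \times 10^{8} \times 300 = 1.5 \times 10^{11} \\
\text{rows, one per friendship} &= \frac{5 \times 10^{8} \times 300}{2} = 7.5 \times 10^{10}
\end{align*}
```

That is 75 billion rows instead of 150 billion: half as many, but still far more than a shared host will accept.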
Also, when you want to start searching for certain relations, the third option will be much faster, since you'll have an int index instead of a fulltext index, which probably saves another 50% in both storage and processing speed. If your application reaches 500 million users, you will just have to get a better hosting service.
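One way to store each friendship only once is to always put the smaller UID first. The sketch below uses Python with an in-memory SQLite database so it runs on its own; the table and column names (`friendships`, `uid_low`, `uid_high`) and both helpers are assumptions for illustration, not something the answer prescribes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE friendships (
        uid_low  INTEGER NOT NULL,
        uid_high INTEGER NOT NULL,
        PRIMARY KEY (uid_low, uid_high),
        CHECK (uid_low < uid_high)
    )
""")
# Second integer index so lookups by the larger UID are also fast.
conn.execute("CREATE INDEX idx_friendships_high ON friendships (uid_high)")


def add_friendship(conn, a, b):
    """Store the pair in canonical order; a repeated pair is ignored."""
    low, high = min(a, b), max(a, b)
    conn.execute(
        "INSERT OR IGNORE INTO friendships (uid_low, uid_high) VALUES (?, ?)",
        (low, high),
    )


def friends_of(conn, uid):
    """Return all friends of uid, whichever column uid was stored in."""
    rows = conn.execute(
        "SELECT uid_high FROM friendships WHERE uid_low = ? "
        "UNION "
        "SELECT uid_low FROM friendships WHERE uid_high = ?",
        (uid, uid),
    ).fetchall()
    return sorted(r[0] for r in rows)


add_friendship(conn, 123456789, 54321)
add_friendship(conn, 54321, 123456789)   # same friendship, reversed order
add_friendship(conn, 123456789, 65432)
add_friendship(conn, 987654321, 54321)
row_count = conn.execute("SELECT COUNT(*) FROM friendships").fetchone()[0]
```

The trade-off is that every lookup has to check both columns, which is why the sketch adds the second index.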