Advice on a scalable architecture for a big-data problem
I am in the process of building/architecting a business social network web application that has a component that I think will lead to major scalability issues and I'd like to get some feedback/thoughts on the best way forward.
The application has a User object. The idea is, that every time a new user joins the system he ranks everyone else's "usefulness" to him based on a set of factors. Similarly, every other user on the system ranks him/her.
However, I'm worried about the scalability implications of this approach. For example, if 10,000 users join the system, we are talking about roughly 10,000^2 rankings to compute and store in the database. That is 100 million records, which clearly becomes problematic both in the time taken to calculate the rankings and in storing them.
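To put rough numbers on the all-pairs approach, here is a small back-of-the-envelope sketch. The 12 bytes per row is an assumption (two 4-byte user ids plus a 4-byte score) and ignores index and storage-engine overhead, so real usage would be higher. Note that excluding self-rankings gives n(n-1) rows, slightly under n^2.

```python
def ranking_rows(n_users: int) -> int:
    """Number of directed (rater, rated) pairs, excluding self-rankings."""
    return n_users * (n_users - 1)


def raw_size_bytes(n_users: int, bytes_per_row: int = 12) -> int:
    # Assumption: 4-byte rater id + 4-byte rated id + 4-byte score.
    # Index and storage-engine overhead are deliberately ignored.
    return ranking_rows(n_users) * bytes_per_row


for n in (1_000, 10_000, 100_000):
    rows = ranking_rows(n)
    gigabytes = raw_size_bytes(n) / 1e9
    print(f"{n:>7} users -> {rows:>14,} rows, ~{gigabytes:.2f} GB raw")
```

The quadratic growth is the real concern: ten times the users means about a hundred times the rows.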
Thus, I'm looking for help/inspiration :)
My background is in Java, and I've been looking at Hadoop/MapReduce as a possible way to run the calculations in parallel. However, I'm really not sure whether this problem suits MapReduce, or what the best approach is in general.
So, I suppose there are two specific parts to my query:
1) To do the actual calculations, should I run them in parallel, i.e. is MapReduce a good approach for this problem?
2) To store the rankings, what should I be using? Is a standard relational database such as MySQL a bad fit, and should I look at something like Cassandra, HBase or some other NoSQL solution?
Any help/ideas are greatly appreciated.
cheers,
Brian
Comments (3)
Before throwing the brute force of MapReduce at the problem, I'd try to reduce the search space. Even in a social network of 10K users, most other users are unknown to any particular user, and thus not useful.
I would therefore try to limit the set of users to evaluate based on criteria that fit your social network. For instance, limiting the search to local users might be applicable (or limit it to them initially and do a more exhaustive search later). What "local" means in practice depends on your users; the idea is to use optimizations based on the real world.
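As a rough illustration of restricting the search space, the sketch below only generates (rater, rated) pairs for users in the same bucket. The `region` field and the sample data are hypothetical stand-ins for whatever notion of "local" fits the network.

```python
from collections import defaultdict


def candidate_pairs(users):
    """Yield (rater, rated) pairs only for users who share a region.

    `users` is a list of (user_id, region) tuples. "region" is an
    assumed attribute; any bucketing criterion could be used instead.
    """
    by_region = defaultdict(list)
    for user_id, region in users:
        by_region[region].append(user_id)
    for members in by_region.values():
        for rater in members:
            for rated in members:
                if rater != rated:
                    yield rater, rated


# Hypothetical sample data: two users in one region, three in another.
users = [(1, "dublin"), (2, "dublin"), (3, "cork"), (4, "cork"), (5, "cork")]
pairs = list(candidate_pairs(users))
print(len(pairs), "candidate pairs instead of", len(users) * (len(users) - 1))
```

With k roughly equal buckets the pair count drops by about a factor of k, and the remaining pairs could still be scored lazily or in a later, more exhaustive pass.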
I'd suggest storing only "real" values (those entered by a user). That way, users rank the other users who have value to them, and all the rest are assumed to be "useless" ;). You'll therefore store maybe a couple of hundred values for each user. I'm assuming you're not really going to make each new user go through the entire list of other users and rank them individually, right?
You could also cut your space requirements by making bidirectional associations that store both users' evaluations (one record links user A with user F and notes that A ranks F as a 5, and F ranks A as a 3). That cuts your row count roughly in half, but it's still a lot of records. Plus you'll want indexes on both user keys, since you'll have to search both to find all records for a single user.
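One way the bidirectional record could look, sketched with Python's built-in `sqlite3` as a stand-in for whatever database is actually used. The table and column names (`pair_rank`, `user_lo`, `user_hi`, and so on) are made up for illustration; the pair is stored with the smaller id first so each pair maps to exactly one row, and a missing score stays NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pair_rank (
    user_lo     INTEGER NOT NULL,  -- smaller user id of the pair
    user_hi     INTEGER NOT NULL,  -- larger user id of the pair
    lo_ranks_hi INTEGER,           -- NULL until user_lo rates user_hi
    hi_ranks_lo INTEGER,           -- NULL until user_hi rates user_lo
    PRIMARY KEY (user_lo, user_hi),
    CHECK (user_lo < user_hi)
);
-- The primary key already covers lookups by user_lo;
-- this second index covers lookups by user_hi.
CREATE INDEX idx_pair_rank_hi ON pair_rank (user_hi);
""")


def set_rank(conn, rater, rated, score):
    """Record that `rater` gives `rated` the given score."""
    if rater == rated:
        raise ValueError("a user cannot rank themselves")
    lo, hi = min(rater, rated), max(rater, rated)
    # The column name comes from a fixed pair of literals, never user input.
    column = "lo_ranks_hi" if rater == lo else "hi_ranks_lo"
    conn.execute(
        "INSERT OR IGNORE INTO pair_rank (user_lo, user_hi) VALUES (?, ?)",
        (lo, hi))
    conn.execute(
        f"UPDATE pair_rank SET {column} = ? WHERE user_lo = ? AND user_hi = ?",
        (score, lo, hi))


def ranks_given_by(conn, user):
    """Return {other_user: score} for every rating `user` has entered."""
    rows = conn.execute(
        """SELECT user_hi, lo_ranks_hi FROM pair_rank
           WHERE user_lo = ? AND lo_ranks_hi IS NOT NULL
           UNION ALL
           SELECT user_lo, hi_ranks_lo FROM pair_rank
           WHERE user_hi = ? AND hi_ranks_lo IS NOT NULL""",
        (user, user))
    return dict(rows)


# User A is id 1, user F is id 6: A ranks F as 5, F ranks A as 3.
set_rank(conn, 1, 6, 5)
set_rank(conn, 6, 1, 3)
```

The lookup has to check both key columns, which is why both need an index; that is the price of halving the row count.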
While 100m rows is surely big, it may not be as big as you think. I deal with a MySQL DB that has a table with more than 10m rows that joins to other tables with more than 100k rows without too many problems. The important point is to get your indexes right and make your queries efficient. Perhaps before spending too much time thinking about a super-architecture, fill a play table with the rows you think might be in it, write some of the queries you think you'll be running, and see if it's manageable.
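A minimal sketch of that "play table" experiment, using Python's built-in `sqlite3` in memory rather than MySQL so it runs anywhere. The schema, the synthetic scores, and the table size are all assumptions; the absolute timings will not carry over to MySQL, so the same experiment is worth repeating on the real engine with realistic row counts.

```python
import sqlite3
import time


def build_play_table(n_users):
    """Fill a table with one synthetic ranking per ordered user pair."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ranking (rater INTEGER, rated INTEGER, score INTEGER)")
    # Scores are arbitrary deterministic filler in the range 0..9.
    rows = ((a, b, (a * 31 + b * 17) % 10)
            for a in range(n_users) for b in range(n_users) if a != b)
    conn.executemany("INSERT INTO ranking VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def time_top_ranked(conn, rater, limit=10):
    """Fetch the rater's highest-scored users and return (rows, seconds)."""
    start = time.perf_counter()
    rows = conn.execute(
        "SELECT rated, score FROM ranking WHERE rater = ? "
        "ORDER BY score DESC LIMIT ?", (rater, limit)).fetchall()
    return rows, time.perf_counter() - start


conn = build_play_table(300)  # 300 * 299 = 89,700 rows
_, before = time_top_ranked(conn, 42)
conn.execute("CREATE INDEX idx_ranking_rater ON ranking (rater, score)")
rows, after = time_top_ranked(conn, 42)
print(f"without index: {before * 1e3:.2f} ms, with index: {after * 1e3:.2f} ms")
```

Raising `n_users` and trying the queries the application would really issue gives a first feel for whether a plain relational table is workable before reaching for anything more elaborate.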