Best practices for building a scalable database
I know this is a very generic and subjective question, so feel free to vote to close it if it does not meet the StackOverflow netiquette.. but for me, it's worth trying ;)
I've never built a high-traffic application until now, so I'm not aware (apart from some reading on the web) of scaling practices.
How can I design a database so that, when scaling is needed, I don't have to refactor the database structure or the application code?
I know that development (and optimization) should come step by step, optimizing bottlenecks as they happen, and that it's nearly impossible to design the perfect structure when you don't know how many users you'll have or how they will use the database (e.g. the read/write ratio). I'm just looking for a good base to start from.
What are the best practices for making a structure almost ready to be scaled with partitioning and sharding, and what hacks must be absolutely avoided?
Edit: some details about my application:
- The application will run as a multisite service
- I'll have a database for each application version (db_0_0_1, db_0_0_2, etc..)*
- Every 'site' will have a schema inside a database* and a role that can access only its own schema
- Application code will be mostly PHP, with a few things (daemons and maintenance tasks) in Python
- The web server will probably be Nginx, with lighttpd or node.js as support for long-polling tasks (e.g. chat)
- Caching will be done with memcached (as it can be used outside PHP), plus APC for things strictly related to the PHP code
The question is really generic, but here are a few tips:
Do not use any session variables (pg_backend_pid(), inet_client_addr()) or per-session control (SET ROLE, SET SESSION) in application code.
Do not use explicit transaction control (BEGIN/COMMIT/SET TRANSACTION) in application code. All such logic should be wrapped in UDFs. This enables stateless, statement-mode pooling, which allows the fastest possible DB pooling. (See the pgbouncer docs and the pg wiki for more info.)
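To illustrate why this matters, here is a hypothetical helper (not from the answer itself) that screens the statements an application issues for the session and transaction state that breaks statement-mode pooling, where each statement may land on a different backend connection:

```python
import re

# Statements that carry session or transaction state. Under statement-mode
# pooling each statement may run on a different backend connection, so
# these would silently misbehave.
POOL_UNSAFE = re.compile(
    r"^\s*(BEGIN|COMMIT|ROLLBACK|SET\s+(ROLE|SESSION|TRANSACTION))\b",
    re.IGNORECASE,
)

def pool_safe(statement: str) -> bool:
    """Return True if the statement is safe to route through a
    statement-mode pooler such as pgbouncer."""
    return POOL_UNSAFE.match(statement) is None

# Transaction logic belongs inside a UDF, issued as one statement:
assert pool_safe("SELECT create_user('Joe')")
assert not pool_safe("BEGIN")
assert not pool_safe("set session authorization 'joe'")
```

The regex list is illustrative, not exhaustive; advisory locks and prepared statements are other examples of per-session state that a statement-mode pool cannot preserve.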
Encapsulate all App<->Db communication in a well-defined DB API of UDFs - this will let you use PL/Proxy. If doing this for all SELECTs is too hard, do it at least for all data writes (INSERT/UPDATE/DELETE). Example: instead of INSERT INTO users(name) VALUES('Joe'), you need SELECT create_user('Joe').
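On the application side, the write-path encapsulation can be sketched as a thin wrapper that knows only UDF names, never table names (a minimal Python sketch; the helper name and the %s placeholder style of drivers like psycopg2 are assumptions, not part of the answer):

```python
def udf_call(name: str, *args):
    """Render a UDF invocation as (sql, params) for a DB driver that
    uses %s placeholders. The application never writes
    INSERT/UPDATE/DELETE directly; every write goes through a UDF, so
    the whole API can later be routed through PL/Proxy unchanged."""
    placeholders = ", ".join(["%s"] * len(args))
    return f"SELECT {name}({placeholders})", args

# Instead of: INSERT INTO users(name) VALUES ('Joe')
sql, params = udf_call("create_user", "Joe")
# sql == "SELECT create_user(%s)", params == ("Joe",)
```

Because every call site goes through one function, swapping a local database for a PL/Proxy front end is a connection-string change, not a code change.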
Check your DB schema - is it easy to separate all data belonging to a given user? (Most probably this will be the partitioning key.) All that's left is common, shared data, which will need to be replicated to all nodes.
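If the user id does turn out to be the partitioning key, shard selection can be as simple as a stable hash modulo the shard count (a hypothetical sketch; PL/Proxy performs an equivalent hash-based lookup on the server side and requires the cluster size to be a power of two):

```python
import zlib

NUM_SHARDS = 4  # assumed cluster size; PL/Proxy needs a power of two

def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a partitioning key to a shard number with a stable hash.
    zlib.crc32 is deterministic across runs and machines, unlike
    Python's built-in hash()."""
    return zlib.crc32(user_id.encode("utf-8")) % num_shards

# All of one user's data lands on the same shard:
assert shard_for("joe") == shard_for("joe")
assert 0 <= shard_for("alice") < NUM_SHARDS
```

The shared, replicated data is exactly what does not fit this function: anything you cannot address by the partitioning key must live on every node.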
Think of caching before you need it. What will be the cache key? What will be the cache timeout? Will you use memcached?
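A cache-aside pattern forces those questions up front, since the key scheme and timeout become explicit parameters. A minimal sketch, with a dict standing in for memcached (the key format and timeout values are assumptions for illustration):

```python
import time

class CacheAside:
    """Dict-backed stand-in for memcached illustrating cache-aside:
    look up by key; on a miss or expiry, compute and store with a timeout."""

    def __init__(self):
        self._store = {}  # key -> (expires_at, value)

    def get_or_load(self, key: str, loader, timeout: float = 60.0):
        entry = self._store.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry[1]          # fresh hit
        value = loader()             # miss or expired: recompute
        self._store[key] = (time.monotonic() + timeout, value)
        return value

cache = CacheAside()
# Namespaced, versioned keys ("user:<id>:v1") make invalidation a
# matter of bumping the version:
profile = cache.get_or_load("user:42:v1", lambda: {"name": "Joe"}, timeout=300)
```

With real memcached the same shape applies; the timeout maps to the item's expiration time, and the key scheme is what you must settle before the first cache line is written.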