A Django table with millions of rows
I have a project with two applications (books and reader).
The books application has a table with 4 million rows and these fields:
book_title = models.CharField(max_length=40)
book_description = models.CharField(max_length=400)
To avoid querying a table with 4 million rows, I am thinking of dividing it by subject (20 models with 20 tables of 200,000 rows each: book_horror, book_drammatic, etc.).
In the "reader" application, I am thinking of adding these fields:
reader_name = models.CharField(max_length=20, blank=True)
book_subject = models.IntegerField()
book_id = models.IntegerField()
So instead of a ForeignKey, I am thinking of using an integer "book_subject" (which selects the appropriate table) and "book_id" (which identifies the book in the table specified by "book_subject").
Is this a good solution to avoid querying a table with 4 million rows?
Is there an alternative solution?
Comments (7)
Like many have said, it's a bit premature to split your table up into smaller tables (horizontal partitioning or even sharding). Databases are made to handle tables of this size, so your performance problem is probably somewhere else.
Indexes are the first step, it sounds like you've done this though. 4 million rows should be ok for the db to handle with an index.
Second, check the number of queries you're running. You can do this with something like the django debug toolbar, and you'll often be surprised how many unnecessary queries are being made.
Caching is the next step, use memcached for pages or parts of pages that are unchanged for most users. This is where you will see your biggest performance boost for the little effort required.
If you really, really need to split up the tables, the latest version of django (1.2 alpha) can handle sharding (e.g. multi-db), and you should be able to hand-write a horizontal partitioning solution (postgres offers an in-db way to do this). Please don't use genre to split the tables! Pick something that you won't ever change and that you'll always know when making a query. Like author, dividing by the first letter of the surname or something. This is a lot of effort and has a number of drawbacks for a database which isn't particularly big --- this is why most people here are advising against it!
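The "divide by first letter of the surname" idea could be sketched as a Django-style database router. Everything here is hypothetical: the database alias names and the pick_shard() helper are invented for illustration, and a real router would get the surname from model instances or hints:

```python
def pick_shard(surname):
    # Hypothetical split: authors A-M on one database, N-Z on another.
    return "books_a_m" if surname[:1].upper() <= "M" else "books_n_z"


class SurnameRouter:
    """Skeleton with the methods Django's DATABASE_ROUTERS setting expects."""

    def db_for_read(self, model, **hints):
        instance = hints.get("instance")
        if instance is not None and hasattr(instance, "surname"):
            return pick_shard(instance.surname)
        return None  # fall through to the default database

    db_for_write = db_for_read
```

The routing key (first letter of the surname) never changes and is always known at query time, which is exactly why it shards better than genre.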
[edit]
I left out denormalisation! Put common counts, sums etc. in, e.g., the author table to prevent joins on common queries. The downside is that you have to maintain it yourself (until django adds a DenormalizedField). I would look at this during development for clear, straightforward cases, or after caching has failed you --- but well before sharding or horizontal partitioning.
ForeignKey is implemented as an IntegerField in the database, so you save little to nothing at the cost of crippling your model.

Edit: And for Pete's sake, keep it in one table and use indexes as appropriate.
You haven't mentioned which database you're using. Some databases - like MySQL and PostgreSQL - have extremely conservative settings out-of-the-box, which are basically unusable for anything except tiny databases on tiny servers.
If you tell us which database you're using, and what hardware it's running on, and whether that hardware is shared with other applications (is it also serving the web application, for example) then we may be able to give you some specific tuning advice.
For example, with MySQL, you will probably need to tune the InnoDB settings; for PostgreSQL, you'll need to alter shared_buffers and a number of other settings.
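For instance, a few of the settings that answer alludes to; the values below are only illustrative starting points and should be sized to the server's actual RAM:

```ini
# postgresql.conf -- illustrative values only, not recommendations
shared_buffers = 256MB          # the shipped default is far lower
effective_cache_size = 1GB      # hint about available OS file cache
work_mem = 16MB                 # per-sort/hash-join memory
```

```ini
# my.cnf (MySQL/InnoDB) -- likewise illustrative
innodb_buffer_pool_size = 512M
```

On a dedicated database server these can be much larger; on a box shared with the web application they must be far more conservative.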
I'm not familiar with Django, but I have a general understanding of DB.
When you have large databases, it's pretty normal to index your database. That way, retrieving data, should be pretty quick.
When it comes to associating a book with a reader, you should create another table that links readers to books.
It's not a bad idea to divide the books into subjects. But I'm not sure what you mean by having 20 applications.
Are you having performance problems? If so, you might need to add a few indexes.
One way to get an idea where an index would help is by looking at your db server's query log (instructions here if you're on MySQL).
If you're not having performance problems, then just go with it. Databases are made to handle millions of records, and django is pretty good at generating sensible queries.
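On MySQL, for example, the slow query log can be switched on at runtime to find queries that would benefit from an index (the threshold below is illustrative):

```sql
-- Log statements slower than one second, then look for missing indexes
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;   -- seconds
-- Summarise the resulting log afterwards with: mysqldumpslow <slow-log-file>
```

Any query that shows up repeatedly in the log is a candidate for a db_index on the filtered column.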
A common approach to this type of problem is Sharding. Unfortunately it's mostly up to the ORM to implement it (Hibernate does it wonderfully) and Django does not support this. However, I'm not sure 4 million rows is really all that bad. Your queries should still be entirely manageable.
Perhaps you should look into caching with something like memcached. Django supports this quite well.
You can use a server-side datatable: with server-side processing, each request fetches only the currently visible page of rows (via LIMIT/OFFSET queries), so a table with more than 4 million records can still respond in well under a second.