如何避免多次数据库往返和大量不相关的数据?
我曾使用过各种应用程序并多次遇到这种情况。直到现在我还没有弄清楚什么是最好的方法。
场景如下:
- 我有一个桌面或网络应用程序,
- 我需要从数据库中检索简单的文档。该文档具有一般详细信息和项目详细信息,因此数据库:
GeneralDetails
表:
| DocumentID | DateCreated | Owner |
| 1 | 07/07/07 | Naruto |
| 2 | 08/08/08 | Goku |
| 3 | 09/09/09 | Taguro |
ItemDetails
表
| DocumentID | Item | Quantity |
| 1 | Marbles | 20 |
| 1 | Cards | 56 |
| 2 | Yo-yo | 1 |
| 2 | Chess board | 3 |
| 2 | GI Joe | 12 |
| 3 | Rubber Duck | 1 |
如您所见,这些表具有一对多关系。现在,为了检索所有文档及其各自的项目,我总是执行以下两种操作之一:
方法 1 - 多次往返(伪代码):
Documents = GetFromDB("select DocumentID, Owner " +
"from GeneralDetails")
For Each Document in Documents
{
Display(Document["CreatedBy"])
DocumentItems = GetFromDB("select Item, Quantity " +
"from ItemDetails " +
"where DocumentID = " + Document["DocumentID"] + "")
For Each DocumentItem in DocumentItems
{
Display(DocumentItem["Item"] + " " + DocumentItem["Quantity"])
}
}
方法 2 - 大量不相关的数据(伪代码):
DocumentsAndItems = GetFromDB("select g.DocumentID, g.Owner, i.Item, i.Quantity " +
"from GeneralDetails as g " +
"inner join ItemDetails as i " +
"on g.DocumentID = i.DocumentID")
//Display...
我使用第一种我在大学时使用桌面应用程序时的方法,性能还不错,所以我意识到还可以。
直到有一天,我看到一篇文章“让网络更快”,它说多次往返数据库是不好的;所以从那时起我就使用了第二种方法。
在第二种方法中,我通过使用内部联接同时检索第一个和第二个表来避免往返,但它会产生不必要或冗余的数据。查看结果集。
| DocumentID | Owner | Item | Quantity |
| 1 | Naruto | Marbles | 20 |
| 1 | Naruto | Cards | 56 |
| 2 | Goku | Yo-yo | 1 |
| 2 | Goku | Chess board | 3 |
| 2 | Goku | GI Joe | 12 |
| 3 | Taguro | Rubber Duck | 1 |
结果集具有冗余的DocumentID
和Owner
。它看起来像一个非规范化的数据库。
现在的问题是,如何避免往返,同时避免冗余数据?
I have worked with various applications and encountered this situation many times. Until now I have not figured out what is the best approach.
Here's the scenario:
- I have an application either desktop or web
- I need to retrieve simple documents from the database. The document has a general details and item details so the database:
GeneralDetails
table:
| DocumentID | DateCreated | Owner |
| 1 | 07/07/07 | Naruto |
| 2 | 08/08/08 | Goku |
| 3 | 09/09/09 | Taguro |
ItemDetails
table
| DocumentID | Item | Quantity |
| 1 | Marbles | 20 |
| 1 | Cards | 56 |
| 2 | Yo-yo | 1 |
| 2 | Chess board | 3 |
| 2 | GI Joe | 12 |
| 3 | Rubber Duck | 1 |
As you can see, the tables have a one-to-many relationship. Now, in order to retrieve all the documents and their respective items, I always do either of the two:
Method 1 - Many round trips (pseudo-code):
Documents = GetFromDB("select DocumentID, Owner " +
"from GeneralDetails")
For Each Document in Documents
{
Display(Document["CreatedBy"])
DocumentItems = GetFromDB("select Item, Quantity " +
"from ItemDetails " +
"where DocumentID = " + Document["DocumentID"] + "")
For Each DocumentItem in DocumentItems
{
Display(DocumentItem["Item"] + " " + DocumentItem["Quantity"])
}
}
Method 2 - Much irrelevant data (pseudo-code):
DocumentsAndItems = GetFromDB("select g.DocumentID, g.Owner, i.Item, i.Quantity " +
"from GeneralDetails as g " +
"inner join ItemDetails as i " +
"on g.DocumentID = i.DocumentID")
//Display...
I used the first method when I was in college for desktop applications, the performance was not bad so I realized it was okay.
Until one day, I saw an article "Make the web faster", it says that many round trips to the database is bad; so ever since then I have used the second method.
On the second method, I avoided round trips by using inner join to retrieve the first and the second table at once, but it produces unecessary or redundant data. See the result set.
| DocumentID | Owner | Item | Quantity |
| 1 | Naruto | Marbles | 20 |
| 1 | Naruto | Cards | 56 |
| 2 | Goku | Yo-yo | 1 |
| 2 | Goku | Chess board | 3 |
| 2 | Goku | GI Joe | 12 |
| 3 | Taguro | Rubber Duck | 1 |
The result set has redundant DocumentID
and Owner
. It looks like an unnormalized database.
Now, the question is, how do I avoid round trips and at the same time avoid redundant data?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(10)
ActiveRecord 和其他 ORM 使用的方法是选择第一个表,将 ID 批处理在一起,然后在 IN 子句中使用这些 ID 进行第二次选择。
优点:
缺点:
一般来说,第一种方法称为“N+1 查询问题”,解决方案参考作为“渴望加载”。我倾向于认为您的“方法 2”更可取,因为数据库的延迟通常胜过数据传输速率上的冗余数据的大小,但 YRMV。与软件中的几乎所有事物一样,这是一种权衡。
The method used by ActiveRecord and other ORMs is to select the first table, batch together the IDs and then use those IDs in an IN clause for the second select.
Advantages:
Disadvantages:
Generally speaking, the first method is referred to as the "N+1 query problem" and the solutions are referred to as "eager loading". I tend to see your "Method 2" as preferable as the latency to the database generally trumps the size of the redundant data over the data transfer rate, but YRMV. As with almost everything in software, it's a tradeoff.
内连接更好,因为数据库有更多优化的可能性。
一般来说,您不能创建这样的查询,它不会产生冗余结果。为此,关系模型的限制太多。我会接受它:数据库负责优化这些情况。
如果您确实遇到性能问题(主要是由于网络瓶颈),您可以编写一个存储过程,该存储过程进行查询并将其非规范化。在您的示例中,您创建了一个如下结果:
但这当然不符合第一个范式 - 因此您需要在客户端上解析它。
如果您使用支持 XML 的数据库(例如 Oracle 或 MS SQL Server),您甚至可以在服务器上创建 XML 文件并将其发送到客户端。
但无论你做什么,请记住:过早的优化是万恶之源。在你不能100%确定你确实面临着一个可以这样解决的问题之前,不要做这种事情。
The inner join is better because the database has more possibilities to optimize.
In general you can not create a query like this which does not produce redundant results. For that, the relational model is too restrictive. I would just live with it: the database is responsible to optimize these cases.
If you really encounter performance problems (mainly because of a network bottleneck) you could write a stored procedure, that makes the query and denormalizes it. In your example you create create a result like:
But this of course does not conform to the first normal form - so you will need to parse it on the client.
If you use a database with XML support (like Oracle or MS SQL Server) you could even create an XML file on the server and send this to the client.
But whatever you do, remember: premature optimization is the root of all evil. Don't do this kind of stuff before you are not 100% sure, that you are really facing a problem you can solve like this.
您可以读取第一个表,从第二个表中提取所需行的键,然后通过第二个选择检索它们。
像这样的东西
You can read first table, extract keys of rows you want from second table and retrieve them via second select.
Something like
你的第二种方法绝对是一个可行的方法。
但您不必选择不打算使用的列。
因此,如果您只需要
Item
和Quantity
,请执行以下操作:(我想您还有其他条件可以放入查询的
where
部分,否则不需要加入。)Your second method is definitely a way to go.
But you do not have to select columns you are not going to use.
So if you only need
Item
andQuantity
, do this:(I suppose you have other conditions that you would put in
where
part of the query, otherwise the join is not necessary.)如果您使用 .NET 和 MS SQL Server,这里的简单解决方案是考虑使用 MARS(多个活动结果集)。下面是直接从 MARS 演示的 Visual Studio 2015 帮助中提取的示例代码块:
希望这能让您走上理解之路。关于往返的主题有许多不同的理念,其中很大程度上取决于您正在编写的应用程序的类型以及您要连接的数据存储。如果这是一个 Intranet 项目并且没有大量的并发用户,那么大量的数据库往返并不是您认为的问题或担忧,除非您的声誉看起来没有这样做更精简的代码! (咧嘴笑)
如果这是一个 Web 应用程序,那么情况就不同了,您应该尽量确保不会过于频繁地返回井(如果可以避免的话)。 MARS 是解决此问题的一个很好的答案,因为所有内容都会一次性从服务器返回,然后由您来迭代返回的数据。
希望这对您有用!
If you are using .NET and MS SQL Server, the simple solution here would be to look into using MARS (Multiple Active Resultsets). Here's a sample code block pulled straight from the Visual Studio 2015 Help on a MARS demo:
Hopefully this puts you on a path to understanding. There are many different philosophies on the subject of round-tripping, and much of it depends on the type of application you're writing and the data store you're connecting to. If this is an intranet project and there aren't a huge number of simultaneous users then a large number of round-trips to the database is not the issue or concern you think it is, except for how it looks to your reputation not to have more streamlined code! (grin)
If this is a web application then that's a different story, and you should try to ensure you aren't going back to the well too frequently if at all avoidable. MARS is a good answer for resolving this, since everything comes back from the server in one shot, and it's then up to you to iterate through the returned data.
Hope this is useful to you!
答案取决于您的任务。
1. 如果您想生成列表/报告,那么您需要具有冗余数据的方法2。您可以通过网络传输更多数据,但可以节省生成内容的时间。
2. 如果您想先显示常规列表,然后通过用户点击显示详细信息,那么最好使用方法1。生成和发送有限的数据集将会非常快。
3. 如果您想将所有数据预加载到应用程序中,则可以使用 XML。它将提供所有非冗余数据。但是,还需要进行额外的编程,即在 SQL 中进行 XML 编码并在客户端进行解码。
我会做这样的事情来在 SQL 端生成 XML:
The answer depends on your task.
1. If you want to generate List/Report then you need Method-2 with redundant data. You transfer more data over the network, but save time on generating the content.
2. If you want to display General list first and then display details by user's click then it is better to use Method-1. To generate and send limited data set will be very fast.
3. If you want to pre-load all data into the app then you can use XML. It will provide ALL not redundant data. However, there is an additional programming with XML coding in SQL and decoding on the client.
I'd do something like this to generate XML on SQL side:
据我所知,您有很多选项
As far as I see it you have a number of options
不知何故,在我的应用程序中,大约有 200 个表单/屏幕和一个大约有 300 个表的数据库,我从来不需要第一种方法或第二种方法。
在我的应用程序中,用户经常在屏幕上看到两个网格(表格),彼此相邻:
main
GeneralDetails
表,其中包含文档列表(通常有搜索功能,使用一堆各种过滤器)。来自选定文档的
ItemDetails
表的数据。不适用于所有文档。仅针对一份当前文档。当用户在第一个网格中选择不同的文档时,我(重新)运行查询以检索所选文档的详细信息。仅适用于一个选定的文档。因此,主表和详细表之间没有连接。
并且,没有循环来检索所有主文档的详细信息。
为什么您需要了解客户端所有文档的详细信息?
我想说,最佳实践可以归结为常识:
通过网络仅传输您需要的数据而不存在冗余总是好的。保持查询/请求的数量尽可能低总是好的。不要循环发送多个请求,而是发送一个将返回所有必需行的请求。如果确实需要的话,然后在客户身上切片和切块。
如果需要以某种方式处理一批文档及其详细信息,那就是另一回事了,到目前为止,我总是设法在服务器端完成此操作,而无需将所有这些数据传输到客户端。
如果由于某种原因需要向客户端获取所有主文档的列表以及所有文档的详细信息,我将进行两个查询而无需任何循环:
这两个查询将返回两个数据数组,如果需要,我将在客户端内存中的内部结构中组合主从数据。
Somehow in my application with ~200 forms/screens and a database with ~300 tables I never had a need for neither first nor second method.
In my application quite often a user sees on screen two grids (tables), next to each other:
main
GeneralDetails
table with the list of documents (usually there is search function that limits results using a bunch various filters).data from the
ItemDetails
table for the selected document. Not for all documents. Just for one current document. When a user selects a different document in the first grid, I (re)run a query to retrieve details of the selected document. Just for one selected document.So, there is no join between master and details table.
And, there is no loop to retrieve details for all master documents.
Why would you need to have details for all documents on the client?
I would say, that best practices boil down to common sense:
It is always good to transmit over the network only the data that you need, without redundancy. And it is always good to keep the number of queries/requests as low as possible. Instead of sending many requests in a loop, send one request that would return all necessary rows. Then slice and dice it on the client if really needed.
If there is a need to process somehow a batch of documents together with their details, that's a different story and so far I've always managed to do it on the server side, without transferring all this data to the client.
If for some reason there is a need to get a list of all master documents together with details for all documents to the client, I'd make two queries without any loops:
These two queries would return two arrays of data and if needed, I would combine the master-detail data in the internal structures in memory on the client.
您可以通过分别从两个表中检索所需的数据来进一步优化此过程。然后,您可以循环记录或连接表以生成与来自 SQL Server 的结果集相同的结果集。
使用 ORM,您可以在两次往返中分别检索实体 - 一次检索 GeneralDetails,另一次在检查
GeneralDetails.DocumentId
后检索ItemDetails
>。尽管有两次到数据库的往返,但它比其他两种方法都要优化。这是一个 NHibernate 示例:
我相信(没有实际示例)使用 ADO.NET 数据集实际上可以保存到数据库的第二次往返。您甚至不需要加入结果;这是编码风格和工作流程的问题,但通常您可以通过同时处理两个结果集来更新您的 UI,
You can further optimize this process by retrieve the data you need from the two tables separately. Afterwards you can either loop through the records or join the tables to produce the same result set as it was coming from the SQL server.
With an ORM you can retrieve the entities separately in two round trips - one to retrieve
GeneralDetails
and another one to retrieveItemDetails
after examination ofGeneralDetails.DocumentId
. Altough there are two round trips to the DB it's way optimized than any of the other two methods.Here is an NHibernate example:
I believe (don't have live example) with an ADO.NET data set you can actually save the second round trip to the DB. You don't even need to join the results; it's a matter of coding style and a workflow, but generally you could update your UI by working with the two result sets simultaneously,
自从我提出这个问题以来,我意识到在检索数据方面我还可以在其他方面优化我的应用程序。在这种情况下,我将执行以下操作:
问问自己,我真的需要检索许多文档及其子项吗?通常在 UI 中,我会在列表中显示记录,只有当用户需要子项时(如果用户单击记录)我才会检索它们。
如果确实需要显示很多带有子项的记录,例如帖子/评论,我只会提供一些帖子,考虑分页,或者提供“加载更多”功能。
总而言之,我可能最终会进行延迟加载,仅在用户需要时才检索数据。
一种避免往返数据库服务器的解决方案,虽然不能保证性能提升,因为它需要在数据库服务器和应用程序中进行更多处理,但它是检索多个记录集,一个结果到父文档,一个结果到子项,请参阅伪代码:
我可能需要一个临时表来存储父文档,这样我就可以将它用于第二个查询中的条件,因为我只需要检索所选父文档的子项。
我还应该指出,进行基准测试比立即应用原则更好,因为它确实是个案的基础。
Since the time I asked this question, I realized there are other areas I can optimize my application in retrieving data. In such a case, I will do the following:
Ask myself, do I really need to retrieve many documents together with their subitems? Usually in a UI, I display records in a list, only when the user needs the subitems (if the user clicks the record) I will retrieve them.
If it is really necessary to display many records with subitems, post/comments for example, I will only provide some posts, think of pagination, or provide a "load more" function.
To summarize, I may end up doing lazy loading, retrieve data only when the user needs it.
A solution to avoid roundtrips to database server, although does not guarantee a performance boost since it requires more processing in the database server and in the application, is to retrieve multiple recordsets, one results to parent documents, and one results to subitems, see pseudocode:
I might need a temporary table here for parent documents so I can use it for the condition in the second query, since I only need to retrieve subitems of the selected parent documents.
I should also point out that doing a benchmark is better than just applying principles right away since it's really a case to case basis.