Are there performance issues when querying non-varbinary(max) fields in a table that contains varbinary(max) data?
I created a table to store all the documents of my application. It is a simple table (let's call it DOC_DATA) with 3 fields: DOC_ID, FileSize, Data. Data is varbinary(max).
I then have many tables (CUSTOMERS_DOCUMENTS, EMPLOYEES_DOCUMENTS, ...) that contain other data (like "Document description", "Created by", "Customer ID", ...). My real case is not exactly like this, but this example lets me explain the problem more clearly. All of these tables have a FK to DOC_DATA.DOC_ID.
When a user searches for a customer document, he runs a query similar to this:
select CD.*, DD.FileSize
from DOC_DATA DD
join CUSTOMERS_DOCUMENTS CD ON CD.DOC_ID = DD.DOC_ID
My question is: will the performance of this query be bad because we are also reading a field from a table that is potentially huge (the DOC_DATA table can contain many GB of data), or is this not a problem?
The alternative solution is to put the FileSize field in all the main tables (CUSTOMERS_DOCUMENTS, EMPLOYEES_DOCUMENTS, ...). Of course a join has a small impact on performance. I am not asking whether to join or not in general, but whether to join a HUGE table when I am not interested in its HUGE fields.
Please note: I am not designing a new system, I am maintaining a legacy system. So I am not asking which design is best in general, only which option is best in this case.
Comments (1)
I see no reason why the performance of your query would suffer due to the presence of those large columns. Performance issues would come up when you read that data, specifically when you require the database engine to return the document itself, which you are (of course) not doing in this query.
Internally, for the various (max) data types (varchar(max), nvarchar(max), varbinary(max)), SQL Server stores a 16 or so byte pointer (or reference marker, or whatever they call it) in the row when the value is too large to fit in the row, and the actual data is stored in a separate set of pages. Thus, if you're not reading that column, those pages do not need to be accessed, and you don't incur the disk I/O hit.
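If you want to check this on your own data rather than take it on trust, something like the following rough sketch may help. It assumes the DOC_DATA and CUSTOMERS_DOCUMENTS tables from the question live in the dbo schema (an assumption, adjust as needed), and the TOP (100) cap is only there so the second query does not pull every document. With SET STATISTICS IO ON, the Messages output reports "lob logical reads" per table; for the first query that counter should stay at or near zero for DOC_DATA, while the second query, which returns Data, should show LOB reads.

```sql
-- Sketch only: schema name dbo and the TOP (100) cap are assumptions for illustration.
SET STATISTICS IO ON;

-- Query 1: reads only in-row columns of DOC_DATA (the query from the question).
SELECT TOP (100) CD.*, DD.FileSize
FROM dbo.DOC_DATA AS DD
JOIN dbo.CUSTOMERS_DOCUMENTS AS CD ON CD.DOC_ID = DD.DOC_ID;

-- Query 2: also returns the document body, so the LOB pages have to be read.
SELECT TOP (100) CD.*, DD.FileSize, DD.Data
FROM dbo.DOC_DATA AS DD
JOIN dbo.CUSTOMERS_DOCUMENTS AS CD ON CD.DOC_ID = DD.DOC_ID;

SET STATISTICS IO OFF;

-- How many pages hold in-row data versus LOB data for DOC_DATA.
SELECT index_id,
       in_row_data_page_count,
       lob_used_page_count
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID(N'dbo.DOC_DATA');
```

If lob_used_page_count is far larger than in_row_data_page_count, that is consistent with the bulk of the table's size sitting in pages the first query never touches.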