Using ETags correctly
I’ve been reading a book and I have a particular question about the ETag chapter. The author says that ETags might harm performance and that you must tune them finely or disable them completely.
I already know what ETags are and understand the risks, but is it that hard to get ETags right?
I’ve just made an application that sends an ETag whose value is the MD5 hash of the response body. This is a simple solution, easy to achieve in many languages.
Is using an MD5 hash of the response body as the ETag wrong? If so, why?
Why does the author (who obviously outsmarts me by many orders of magnitude) not propose such a simple solution?
This last question is hard to answer unless you are the author :), so I’m trying to find the weak points of using an MD5 hash as an ETag.
Comments (4)
ETag is similar to the Last-Modified header. It's a mechanism that lets the client determine whether a resource has changed.
An ETag needs to be a unique value representing the state and specific format of a resource (a resource could have multiple formats that each need their own ETag). Not unique across the entire domain of resources, simply within the resource.
Now, technically, an ETag has "infinite" resolution compared to a Last-Modified header. Last-Modified only changes at a granularity of 1 second, whereas an ETag can reflect sub-second changes.
You can implement both ETag and Last-Modified, or simply one or the other (or neither, of course). If Last-Modified is not sufficient, then consider an ETag.
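To make the resolution point concrete, here is a minimal Python sketch. The two versions and their timestamps are made up for illustration, and `last_modified` and `md5_etag` are hypothetical helper names. It shows two edits landing within the same second: the HTTP date cannot tell them apart, while a content hash can.

```python
import hashlib
from email.utils import formatdate


def last_modified(ts: float) -> str:
    """Format a Unix timestamp as an HTTP date, which carries whole seconds only."""
    return formatdate(ts, usegmt=True)


def md5_etag(body: bytes) -> str:
    """Build a quoted ETag value from the MD5 digest of the body."""
    return '"' + hashlib.md5(body).hexdigest() + '"'


# Two hypothetical versions of one resource, written 0.8 s apart
# but inside the same wall-clock second.
v1, t1 = b"version 1", 1700000000.10
v2, t2 = b"version 2", 1700000000.90

same_date = last_modified(t1) == last_modified(t2)  # True: the change is invisible
same_etag = md5_etag(v1) == md5_etag(v2)            # False: the change is visible
```

A client revalidating with only If-Modified-Since could be told "not modified" for the second version here; one revalidating with the ETag would not.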
Mind, I would not set ETag for "every" resource. Basically, I wouldn't set it for anything that has no expectation of being cached (dynamic content notably). There's no point in that case, just wasted work.
Edit: I've seen your edit, so to clarify:
MD5 is fine. The only downside is calculating the MD5 all the time. Running MD5 on, say, a 200K PDF file on every request is expensive. Running MD5 on a resource that has no expectation of being cached (e.g. dynamic content) is simply wasteful.
The trick is simply that whatever mechanism you use, it should be as cheap as Last-Modified typically is. Last-Modified is, again, typically, a property of the resource, and usually very cheap to access.
ETags should be similarly cheap. If you are using MD5, and you can cache/store the association between the resource and the MD5 hash, then that's a fine solution. However, recalculating the MD5 each time the ETag is needed basically runs counter to the idea of using ETags to improve overall server performance.
We're using ETags for our dynamic content at instela.
Our strategy is to generate the MD5 hash of the content at the end of output generation, just before sending. If the If-None-Match header exists, we compare it with the generated hash. If the two values are the same, we send a 304 status code and end the request without returning any content.
It's true that we consume a bit of CPU to hash the content, but in the end we save a lot of bandwidth.
We have a Facebook-newsfeed-style main page which has different content for every user. As the newsfeed content changes only 3-4 times per hour, main page refreshes are very efficient for the client. In the mobile era I think it's better to spend a bit more CPU time than to spend bandwidth. Bandwidth is still more expensive than CPU, and saving it gives the client a better experience.
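The strategy described above can be sketched as a small framework-independent Python function. This is not instela's actual code: `conditional_response` and its parameters are hypothetical, and the header parsing is simplified (it splits on commas, which is safe for hex digests but not for arbitrary ETag values).

```python
import hashlib
from typing import Dict, Optional, Tuple


def conditional_response(
    body: bytes, if_none_match: Optional[str]
) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload) for an already rendered body."""
    etag = '"' + hashlib.md5(body).hexdigest() + '"'
    headers = {"ETag": etag}

    if if_none_match is not None:
        # The header may list several ETags; a W/ prefix marks a weak one.
        candidates = [c.strip() for c in if_none_match.split(",")]
        candidates = [c[2:] if c.startswith("W/") else c for c in candidates]
        if "*" in candidates or etag in candidates:
            # The client already has this exact content: send no body.
            return 304, headers, b""

    return 200, headers, body


# First visit: no validator yet, so the full page is sent.
status, headers, payload = conditional_response(b"<html>feed</html>", None)

# Refresh with unchanged content: the client echoes the ETag back.
status2, _, payload2 = conditional_response(b"<html>feed</html>", headers["ETag"])
```

Note that the page is still rendered and hashed on every request; what this saves is the transfer of the body, which is the trade-off the answer describes.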
Having not read the book, I can't speak on the author's precise concerns.
However, ETags should be generated only once each time a page changes, not once per request. Generating an MD5 hash of a web page costs processing power and time on the server; if you have many clients connecting, it could start to cause performance problems.
Thus, you need a good technique for generating ETags only when necessary and caching them on the server until the related page changes.
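One possible technique, sketched here in Python under assumptions, is to compute the ETag when a page is written and store it next to the content, so that reads never hash. `PageStore` is a hypothetical in-memory class invented for this illustration, and `hash_count` exists only to show how often hashing happens.

```python
import hashlib
from typing import Dict, Tuple


class PageStore:
    """Hypothetical store that hashes a page on write, never on read."""

    def __init__(self) -> None:
        self._pages: Dict[str, Tuple[bytes, str]] = {}
        self.hash_count = 0  # for illustration: number of MD5 runs

    def put(self, key: str, body: bytes) -> None:
        """Save a new version of the page and refresh its ETag."""
        self.hash_count += 1
        etag = '"' + hashlib.md5(body).hexdigest() + '"'
        self._pages[key] = (body, etag)

    def get(self, key: str) -> Tuple[bytes, str]:
        """Return (body, etag) without recomputing anything."""
        return self._pages[key]


store = PageStore()
store.put("/home", b"<html>v1</html>")

# Many clients read the page; the hash is not recomputed.
for _ in range(1000):
    body, etag = store.get("/home")

store.put("/home", b"<html>v2</html>")  # page changed: one more hash
```

After this run `store.hash_count` is 2: one hash per version of the page, regardless of how many requests were served.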
I think the perceived problem with ETags is probably that your browser has to issue a (simple and small) request and parse the response for every resource on your page to check whether the ETag value has changed server side. I personally find these extra small roundtrips to the server acceptable for often-changing images, CSS and JavaScript (the server does not need to resend the content if the browser's ETag is current), since the mechanism makes it quite easy to mark 'updated' content.