How are different visitor metrics related?

Posted 2024-08-07 18:30:57

Hypothetically, let's say someone tells you to expect X (like 100,000 or something) unique visitors per day as a result of a successful marketing campaign.

How does that translate to peak requests/second? Peak simultaneous requests?

Obviously this depends on many factors, such as typical number of pages requested per user session or load time of a typical page, or many other things. These are other variables Y, Z, V, etc.

I'm looking for some sort of function, or even just a ratio, to estimate these metrics. Obviously for planning out the production environment scalability strategy.

This might happen on a production site I'm working on really soon. Any kind of help estimating these is useful.

Comments (3)

小猫一只 2024-08-14 18:30:57

Edit: (following indication that we have virtually NO prior statistics on the traffic)
We can therefore forget about the bulk of the plan laid out below and get directly into the "run some estimates" part. The problem is that we'll need to fill in the model's parameters using educated guesses (or plain wild guesses). The following is a simple model whose parameters you can tweak based on your understanding of the situation.

Model

Assumptions:
a) The distribution of page requests follows the normal distribution curve.
b) Considering a short period during peak traffic, say 30 minutes, the number of requests can be considered to be evenly distributed.
This could be [somewhat] incorrect: for example we could have a double curve if the ad campaign targets multiple geographic regions, say the US and the Asian markets. Also the curve could follow a different distribution. These assumptions are however good ones for the following reasons:

  • it would err, if at all, on the "pessimistic side", i.e. over-estimating peak traffic values. This "pessimistic" outlook can be pushed further by using a slightly smaller std dev value. (We suggest using 2 to 3 hours, which would put 68% and 95% of the traffic over a period of 4 and 8 hours (2h std dev) and 6 and 12 hours (3h std dev), respectively; a quick check of these spreads is sketched after this list.)
  • it makes for easy calculations ;-)
  • it is expected to generally match reality.
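
As a quick check on those spreads (this snippet is only an illustration, not part of the original answer), the 68%/95% figures follow directly from the normal CDF, here via the error function in the Python standard library:

    from math import erf, sqrt

    def fraction_within(k_sigma):
        # Fraction of a normal distribution lying within +/- k_sigma of the mean.
        return erf(k_sigma / sqrt(2))

    for sig_hours in (2, 3):
        for k in (1, 2):
            print(f"sigma = {sig_hours}h: {fraction_within(k):.0%} of traffic "
                  f"within {2 * k * sig_hours} hours around the peak")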

Parameters:

  • V = expected number of distinct visitors per 24 hour period
  • Ppv = average number of page requests associated with a given visitor session. (You may consider using the formula twice, once for "static" responses and once for dynamic responses, i.e. when the application spends time crafting a response for a given user/context.)
  • sig = std deviation in minutes
  • R = peak-time number of requests per minute.

Formula:

   R = (V * Ppv * 0.0796)/(2 * sig / 10)

That is because, with a normal distribution, and as per the z-score table, roughly 3.98% of the samples fall within 1/10 of a std dev on either side of the mean (the very peak), so almost 8% of the samples fall within one tenth of a std dev of the peak, counting both sides. Assuming a relatively even distribution during this short period, we then simply divide by the number of minutes in that window (2 * sig / 10).

Example: V=75,000, Ppv=12 and sig = 150 minutes (i.e. 68% of traffic assumed to come over 5 hours, 95% over 10 hours, and 5% over the other 14 hours of the day).
R = 2,388 requests per minute, i.e. 40 requests per second. Rather heavy, but "doable" (unless the application takes 15 seconds per request...)
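
The formula is simple enough to put straight into code. Below is a minimal sketch (the function name and the way parameters are passed are my own choices, not part of the original answer); it reproduces the example figures above:

    def peak_requests_per_minute(V, Ppv, sig_minutes):
        # V:   distinct visitors per 24 hours
        # Ppv: average page requests per visitor session
        # sig: std deviation of the traffic curve, in minutes
        # ~7.96% of all requests fall within +/- 0.1 std dev of the peak,
        # spread over a window of 2 * sig / 10 minutes.
        return (V * Ppv * 0.0796) / (2 * sig_minutes / 10)

    r = peak_requests_per_minute(V=75_000, Ppv=12, sig_minutes=150)
    print(f"{r:,.0f} requests/minute, ~{r / 60:.0f} requests/second")
    # -> 2,388 requests/minute, ~40 requests/second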

Edit (Dec 2012):
I'm adding here an "executive summary" of the model proposed above, as provided in comments by full.stack.ex.
In this model, we assume most people visit our system, say, at noon. That's the peak time. Others jump ahead or lag behind, the farther the fewer; nobody at midnight. We chose a bell curve that reasonably covers all requests within the 24h period: with about 4 sigma on the left and 4 on the right, "nothing" significant remains in the long tail. To simulate the very peak, we cut out a narrow stripe around noon and count the requests there.

It is noteworthy that this model, in practice, tends to overestimate the peak traffic, and may prove more useful for estimating "worst-case" scenarios than more plausible traffic patterns. Tentative suggestions to improve the estimate are:

  • to extend the sig parameter (to acknowledge that the period of relatively high traffic is effectively longer),
  • to reduce the overall number of visits for the period considered, i.e. to reduce the V parameter by, say, 20% (to acknowledge that roughly that many visits happen outside of any peak time),
  • to use a different distribution such as say the Poisson or some binomial distribution.
  • to consider that there are a number of peaks each day, and that the traffic curve is actually the sum of several normal (or other) distributions with similar spread but distinct peaks. Assuming that such peaks are sufficiently far apart, we can use the original formula with the V factor divided by the number of peaks considered, as sketched below.
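
As a rough sketch of the second and last adjustments (the numbers are the same illustrative ones as above, not measurements), the formula is simply re-run with a trimmed or split V:

    def peak_rpm(V, Ppv, sig_minutes):
        return (V * Ppv * 0.0796) / (2 * sig_minutes / 10)

    V, Ppv, sig = 75_000, 12, 150
    print("single peak:          ", round(peak_rpm(V, Ppv, sig)), "req/min")
    # assume ~20% of visits happen outside any peak
    print("20% off-peak removed: ", round(peak_rpm(0.8 * V, Ppv, sig)), "req/min")
    # assume two comparable, well-separated peaks (e.g. US and Asian audiences)
    print("one of two peaks:     ", round(peak_rpm(V / 2, Ppv, sig)), "req/min")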

[original response]
It appears that your immediate concern is how the server(s) may handle the extra load... A very worthy concern ;-). Without distracting you from this operational concern, consider that the process of estimating the scale of the upcoming surge also provides an opportunity to prepare yourself to gather more and better intelligence about the site's traffic, during and beyond the ad campaign. Such information will in time prove useful for making better estimates of surges etc., but also for guiding some of the site's design (for commercial efficiency as well as for improving scalability).

A tentative plan

Assume qualitative similarity with existing traffic.
The ad campaign will expose the site to a population (type of users) distinct from its current visitor/user population: different situations select different subjects. For example, the "ad campaign" visitors may be more impatient, focused on a particular feature, or concerned about price, as compared to the "self-selected?" visitors. Nevertheless, for lack of any other supporting model and measurement, and for the sake of estimating load, the general principle could be to assume that the surge users will, on the whole, behave similarly to the self-selected crowd. A common approach is to "run the numbers" on this basis and to use educated guesses to slightly bend the coefficients of the model to accommodate a few distinctive qualitative differences.

Gather statistics about existing traffic
Unless you readily have better information for this (e.g. Tealeaf, Google Analytics...), your source for such information may simply be the webserver's logs... You can then build some simple tools to parse these logs and extract the following statistics; a minimal parsing sketch follows the list below. Note that these tools will be reusable for future analysis (e.g. of the campaign itself), and also look for opportunities to log more/different data, without significantly changing the application!

  • Average, min, max, and std dev for
    • number of pages visited per session
    • duration of a session
  • percentage of 24-hour traffic for each hour of a work day (exclude weekends and such, unless of course this is a site which receives much traffic during those periods). These percentages should be calculated over at least several weeks to remove noise.
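
A minimal parsing sketch along those lines, assuming the common Apache/nginx combined log format, a hypothetical access.log file name, and a crude 30-minute-gap definition of a session (none of these specifics come from the original answer):

    import re
    from collections import defaultdict
    from datetime import datetime, timedelta

    LINE_RE = re.compile(r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)"')
    TS_FMT = "%d/%b/%Y:%H:%M:%S %z"
    SESSION_GAP = timedelta(minutes=30)          # assumed session timeout

    hits_by_ip = defaultdict(list)               # client IP -> list of hit timestamps
    hourly = [0] * 24                            # hits per hour of day

    with open("access.log") as f:                # hypothetical log file
        for line in f:
            m = LINE_RE.match(line)
            if not m:
                continue
            ts = datetime.strptime(m.group("ts"), TS_FMT)
            hits_by_ip[m.group("ip")].append(ts)
            hourly[ts.hour] += 1

    pages_per_session, session_minutes = [], []
    for times in hits_by_ip.values():            # crude: one client IP = one visitor
        times.sort()
        start = prev = times[0]
        count = 1
        for t in times[1:]:
            if t - prev > SESSION_GAP:           # long gap: close the session
                pages_per_session.append(count)
                session_minutes.append((prev - start).total_seconds() / 60)
                start, count = t, 0
            count += 1
            prev = t
        pages_per_session.append(count)
        session_minutes.append((prev - start).total_seconds() / 60)

    if pages_per_session:
        print("avg pages/session:", sum(pages_per_session) / len(pages_per_session))
        print("avg session length (min):", sum(session_minutes) / len(session_minutes))
    total = sum(hourly) or 1
    print("hourly share of daily traffic (%):",
          [round(100 * h / total, 1) for h in hourly])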

"Run" some estimates:
For example, start with a peak-use estimate, using the peak hour(s) percentage, the average daily session count, the average number of page hits per session, etc. This estimate should take into account the stochastic nature of traffic. Note that you don't have to worry, in this phase, about the impact of queuing effects; instead, assume that the service time is low enough relative to the request period. Therefore just use a realistic estimate (or rather a value informed by the log analysis, for these very high usage periods) of how the probability of a request is distributed over short periods (say, 15 minutes).
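
For instance, a crude peak-window estimate from those measured quantities could look like the following (all values here are placeholders, not measurements):

    # Illustrative placeholder values, to be replaced by figures from the log analysis.
    daily_sessions = 6_000       # average sessions per day
    pages_per_session = 12       # average page hits per session
    peak_hour_share = 0.12       # fraction of daily traffic in the busiest hour

    peak_hour_requests = daily_sessions * pages_per_session * peak_hour_share
    print(f"~{peak_hour_requests / 3600:.1f} requests/second averaged over the peak hour")
    print(f"~{peak_hour_requests / 4:.0f} requests in a peak 15-minute window "
          "(assuming a roughly even spread within that hour)")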

Finally, based on the numbers you obtained in this fashion, you can get a feel for the kind of sustained load this would represent on the server(s), and plan to add resources or to refactor part of the application. Also - very important! - if the outlook is for sustained at-capacity load, start running the Pollaczek-Khinchine formula, as suggested by ChrisW, to get a better estimate of the effective load.
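
The Pollaczek-Khinchine mean-value formula for an M/G/1 queue is easy to run once you have an arrival rate and the mean and variance of the service time; a small sketch (the 40 req/s and 20 ms figures are only illustrative):

    def pk_mean_wait(arrival_rate, mean_service, service_var):
        # Pollaczek-Khinchine mean time spent queueing (excluding service) in an M/G/1 queue.
        # arrival_rate: lambda, requests per second
        # mean_service: E[S], seconds of work per request
        # service_var:  Var(S), variance of the service time
        rho = arrival_rate * mean_service                # server utilization
        if rho >= 1:
            raise ValueError("utilization >= 1: the queue grows without bound")
        second_moment = service_var + mean_service ** 2  # E[S^2]
        return arrival_rate * second_moment / (2 * (1 - rho))

    # 40 req/s peak (from the model above), 20 ms mean service time, 20 ms std dev.
    wq = pk_mean_wait(40.0, 0.020, 0.020 ** 2)
    print(f"utilization: {40.0 * 0.020:.0%}, mean queueing delay: {wq * 1000:.0f} ms")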

For extra credit ;-) consider running some experiments during the campaign, for example by randomly providing a distinct look or behavior for some of the pages visited, and by measuring the impact this may have (if any) on particular metrics (registration for more info, orders placed, number of pages visited...). The effort associated with this type of experiment may be significant, but the return can be significant as well, and if nothing else it may keep your "usability expert/consultant" on his/her toes ;-) You'll obviously want to define such experiments with the proper marketing/business authorities, and you may need to calculate ahead of time the minimum percentage of users to whom the alternate site would be shown, to keep the experiment statistically representative. It is indeed important to know that the experiment doesn't need to be applied to 50% of the visitors; one can start small, just not so small that any variations observed might be due to randomness...
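
On sizing such an experiment, one common approximation is the two-proportion sample-size formula; a sketch with made-up conversion rates (the baseline and target rates, significance level and power are all assumptions, not figures from this answer):

    from statistics import NormalDist

    def sample_size_per_arm(p_base, p_alt, alpha=0.05, power=0.80):
        # Approximate visitors needed in each arm to detect a change in a
        # conversion rate from p_base to p_alt (two-sided test).
        z_a = NormalDist().inv_cdf(1 - alpha / 2)
        z_b = NormalDist().inv_cdf(power)
        variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
        return (z_a + z_b) ** 2 * variance / (p_base - p_alt) ** 2

    n = sample_size_per_arm(0.020, 0.025)   # e.g. a 2.0% -> 2.5% conversion lift
    print(f"~{n:,.0f} visitors per arm, "
          f"i.e. ~{2 * n / 75_000:.0%} of one day's visitors at V = 75,000")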

浅唱々樱花落 2024-08-14 18:30:57

I'd start by assuming that "per day" means "during the 8-hour business day", because that's a worse-case scenario without perhaps being unnecessarily worst-case.

So if you're getting an average of 100,000 in 8 hours, and if the time at which each one arrives is random (independent of the others) then in some seconds you're getting more and in some seconds you're getting less. The details are a branch of knowledge called "queueing theory".

Assuming that the Pollaczek-Khinchine formula is applicable, then because your service time (i.e. CPU time per request) is quite small (less than a second, probably), you can afford to have quite a high (i.e. greater than 50%) server utilization.

In summary, assuming that the time per request is small, you need a capacity that's higher (but here's the good news: not much higher) than whatever's required to service the average demand.
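
To see why fairly high utilization remains tolerable when the service time is small, here is a sketch using the M/M/1 special case of the Pollaczek-Khinchine result (mean response time E[S] / (1 - rho)); the 100,000-over-8-hours arrival rate is from this answer, while the 50 ms service time is just an assumption:

    # 100,000 requests spread over an 8-hour day is a modest average rate.
    avg_rate = 100_000 / (8 * 3600)
    print(f"average arrival rate: {avg_rate:.1f} requests/second")

    # M/M/1 mean response time E[S] / (1 - rho), for an assumed 50 ms of work per request.
    service_time = 0.050
    for rho in (0.5, 0.7, 0.8, 0.9, 0.95):
        print(f"utilization {rho:.0%}: ~{service_time / (1 - rho) * 1000:.0f} ms mean response")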

The bad news is that if your capacity is less than the average demand, then the average queueing delay is infinite (or more realistically, some requests will be abandoned before they're serviced).

The other bad news is that when your service time is small, you're sensitive to temporary fluctuations in the average demand, for example ...

  • If the demand peaks during the lunch hour (i.e. isn't the same average demand as during other hours), or even if for some reason it peaks during a 5-minute period (during a TV commercial break, for example)

  • And if you can't afford to have customers queueing for that period (e.g. queueing for the whole lunch hour, or e.g. the whole five-minute commercial break)

... then your capacity needs to be enough to meet those short-term peak demands. OTOH you might decide that you can afford to lose the surplus: that it's not worth engineering for the peak capacity (e.g. hiring extra call centre staff during the lunch hour) and that you can afford some percentage of abandoned calls.

迷荒 2024-08-14 18:30:57

That will depend on the marketing campaign. For instance, a TV ad will bring a lot of traffic at once, while a newspaper ad will spread it out more over the day.

My experience with marketing types has been that they just pull a number from where the sun doesn't shine, typically higher than reality by at least an order of magnitude.
