Combine three tables into one, or is that too many columns?

Posted 2024-09-10 09:50:57

I am tracking clicks over three time periods: the past day, past week and past month.

To do this, I have three tables:

  • An hourly table, with columns link_id, two other attributes, and hour_1 to hour_24, together with a computed column giving the sum

  • A weekday table, with columns link_id, two other attributes, and day_1 to day_7, together with a computed column giving the sum

  • A monthday table, as above, with columns day_1 to day_31
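For concreteness, here is a minimal T-SQL sketch of the weekday table as described (attr_1/attr_2 are placeholder names for the two unnamed attributes; the hourly and monthday tables follow the same pattern with hour_1..hour_24 and day_1..day_31):

CREATE TABLE weekday_stats (
    link_id INT NOT NULL,
    attr_1  INT NOT NULL,  -- e.g. user type (hypothetical name)
    attr_2  INT NOT NULL,  -- e.g. user location (hypothetical name)
    day_1 INT NOT NULL DEFAULT 0,
    day_2 INT NOT NULL DEFAULT 0,
    day_3 INT NOT NULL DEFAULT 0,
    day_4 INT NOT NULL DEFAULT 0,
    day_5 INT NOT NULL DEFAULT 0,
    day_6 INT NOT NULL DEFAULT 0,
    day_7 INT NOT NULL DEFAULT 0,
    -- the computed column giving the sum, as described
    total AS (day_1 + day_2 + day_3 + day_4 + day_5 + day_6 + day_7)
)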

When a click comes in, I store its key attributes like href, description, etc, in other tables, and insert or update the row(s) corresponding to the link_id in each of the above tables.

Each link can have several entries in each of the above hourly/weekday/monthday tables, depending on the two other attributes (e.g. where the user is sitting).

So if a user is Type A and sitting in X, three rows are created or added to in the above tables -- the first row records all clicks on that link over the time period, the second row records all clicks by "Type A people", and the third "All clicks by people in X".

I have designed it this way as I didn't want to have to move data around each hour/day/week/month. I just maintain pointers for "current hour" (1-24), "current day" (1-31) and "current weekday" (1-7), and write to the corresponding cells in the tables. When we enter a new period (e.g. "3pm-4pm"), I can just blank out that current column (e.g. hour_15), then start incrementing it for links as they come in. Every so often I can delete old rows which have fallen down to "all zero".
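Under that scheme, the per-click write might look like the following sketch, using the hypothetical weekday_stats table above with the "current weekday" pointer shown hard-coded to day_3 (the application would substitute the column name from its pointer):

DECLARE @link_id INT = 42, @attr_1 INT = 1, @attr_2 INT = 2  -- sample values

UPDATE weekday_stats
SET day_3 = day_3 + 1
WHERE link_id = @link_id AND attr_1 = @attr_1 AND attr_2 = @attr_2

IF @@ROWCOUNT = 0
    INSERT INTO weekday_stats (link_id, attr_1, attr_2, day_3)
    VALUES (@link_id, @attr_1, @attr_2, 1)

-- Entering a new period: blank that column before reusing it
UPDATE weekday_stats SET day_3 = 0

-- Occasional cleanup of rows that have decayed to all zeroes
DELETE FROM weekday_stats WHERE total = 0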

This way I shouldn't ever have to move around column data, which would likely be very expensive for what will potentially be tens of thousands of rows.

I will only be SELECTing either the current day/weekday/hour rows (prior to inserting/updating) or the TOP 20 values from the computed columns based on the attributes (and will likely cache these results for an hour or so).
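The "TOP 20" read could then be as simple as this sketch, again against the hypothetical weekday_stats table, with sample attribute values in the filter:

SELECT TOP 20 link_id, total
FROM weekday_stats
WHERE attr_1 = 1 AND attr_2 = 2  -- sample attribute filter
ORDER BY total DESC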

After the tables populate, UPDATES will far exceed INSERTs as there aren't that many unique hrefs.

Three questions:

  • Is it OK to combine the three big tables into one big table of monthdays/weekdays/hours? This would give a table with 64 columns, and I'm not sure whether that is overkill. On the other hand, keeping them separate as they are now triples the number of INSERT/UPDATE statements needed. I don't know enough about SQL Server to know which is best.

  • Is this approach sensible? Most data sets I've worked with of course have a separate row per item and you would then sort by date -- but when tracking clicks from thousands of users this would give me many hundreds of thousands of rows, which I would have to cull very often, and ordering and summing them would be hideous. Once the tracker is proven, I have plans to roll the click listener out over hundreds of pages, so it needs to scale.

  • In terms of design, clearly there is some redundancy in having both weekdays and monthdays. However, this was the only way I could think of to maintain a pointer to a column and quickly update it, and use a computed column. If I eliminated the weekdays table, I would need to get an additional computed column on the "monthdays" that summed the previous 7 days -- (e.g. if today is the 21st, then sum day_14, day_15, day_16... day_20). The calculation would have to update every day, which I imagine would be expensive. Hence the additional "weekday" table for a simple static calculation. I value simple and fast calculations more highly than small data storage.
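For reference, the daily-shifting computation being avoided would look something like the following (using a hypothetical monthday_stats table and the question's example date of the 21st):

SELECT link_id,
       day_14 + day_15 + day_16 + day_17 + day_18 + day_19 + day_20
       AS last_7_days
FROM monthday_stats

The expression itself would have to change every day as the window moves, which is exactly the maintenance cost the weekday table avoids.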

Thanks in advance!


Comments (3)

梦里南柯 2024-09-17 09:50:57

Anytime you see columns with numbers in their names, such as column_1, column_2, column_3..., your 'horrible database design' flag should go up. (FYI, here you are breaking 1NF; specifically, you are repeating groups across columns.)

Now, it is possible that such an implementation can be acceptable (or even necessary) in production, but conceptually it is definitely wrong.

As Geert says, conceptually two tables will suffice. If performance is an issue you could denormalize the data for weekly/monthly stats, but I still would not model them as above; I would keep the base data as

CREATE TABLE base_stats ( link_id INT, click_time DATETIME )
CREATE TABLE daily_stats ( link_id INT, period DATETIME, clicks INT )

You can always aggregate with

SELECT link_id, COUNT(*) AS clicks, CAST(click_time AS DATE) AS day
FROM base_stats
GROUP BY link_id, CAST(click_time AS DATE)

which can be run periodically to fill the daily_stats. If you want to keep it up to date you can implement it in triggers (or if you really must, do it on the application side). You can also denormalize the data on different levels if necessary (by creating more aggregate tables, or by introducing another column in the aggregated data table), but that might be premature optimization.
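A trigger-based version might look like the following sketch (T-SQL, assuming the base_stats/daily_stats definitions above; the trigger name is hypothetical, and MERGE handles both the first click of a day and subsequent ones):

CREATE TRIGGER trg_base_stats_ai ON base_stats
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON
    MERGE daily_stats AS d
    USING (SELECT link_id, CAST(click_time AS DATE) AS period, COUNT(*) AS clicks
           FROM inserted
           GROUP BY link_id, CAST(click_time AS DATE)) AS i
    ON d.link_id = i.link_id AND d.period = i.period
    WHEN MATCHED THEN UPDATE SET d.clicks = d.clicks + i.clicks
    WHEN NOT MATCHED THEN INSERT (link_id, period, clicks)
                          VALUES (i.link_id, i.period, i.clicks);
END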

The above design is much cleaner for future ad-hoc analysis (which will happen with stats). For other benefits, see Wikipedia on repeating groups.

EDIT:
Even though the solution with the two tables base_stats and daily_stats is accepted, with the following strategy:

  • insert each click in base_stats
  • periodically aggregate the data from base_stats into daily_stats and purge the full detail

it might not be the optimal solution.
Based on discussions and clarification of requirements, it seems that the base_stats table is not necessary. The following approach should also be investigated:

CREATE TABLE period_stats ( link_id INT, period DATETIME, ...)

Updates are easy with

UPDATE period_stats 
SET clicks = clicks + 1 
WHERE period = @dateTime AND link_id = @url AND ...

The cost of updating this table, properly indexed, is as efficient as inserting rows into base_stats, and it is also easy to use for analysis

SELECT link_id, SUM(clicks)
FROM period_stats
WHERE period between @dateTime1 AND @dateTime2
GROUP BY ...
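For the "properly indexed" part, a minimal sketch could be the following (the elided "..." attribute columns would also belong in the key; a second index leading on period would help the range-scan SELECT):

CREATE INDEX ix_period_stats ON period_stats (link_id, period)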
迷途知返 2024-09-17 09:50:57

Denormalization as you have done in your database can be a good solution for some problems. In your case, however, I would not choose the above solution, mainly because you lose information that you might need in the future; maybe you will want to report on half-hour intervals later.
So, looking at your description, you could do with only 2 tables: links (hrefs and descriptions) and clicks on the links (containing the date and time of the click and maybe some other data). The drawback, of course, is that you have to store hundreds of thousands of records, and querying that amount of data can take a lot of time. If this is the case, you might consider storing aggregate data for these 2 tables in separate tables and updating those tables on a regular basis.
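A minimal sketch of those two tables (names and column sizes are assumptions):

CREATE TABLE links (
    link_id     INT IDENTITY(1,1) PRIMARY KEY,
    href        NVARCHAR(2000) NOT NULL,
    description NVARCHAR(400)  NULL
)

CREATE TABLE link_clicks (
    link_id    INT      NOT NULL REFERENCES links (link_id),
    click_time DATETIME NOT NULL
    -- plus whatever other per-click data is worth keeping
)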

你与昨日 2024-09-17 09:50:57

That design is really bad. Unreason's proposal is better.
If you want to make it nice and easy, you could as well have a single table with four fields:

   timeSlice  
   clickCount  
   location
   userType 

with timeSlice holding the date and time rounded to the hour.
All the rest can be deduced from that, and you would have only 24 * 365 * #locations * #userTypes records per year.
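A sketch of that single table, plus the usual T-SQL idiom for rounding a timestamp down to the hour (field names as listed above; the types are assumptions):

CREATE TABLE click_stats (
    timeSlice  DATETIME NOT NULL,  -- date and time rounded to the hour
    clickCount INT      NOT NULL DEFAULT 0,
    location   INT      NOT NULL,
    userType   INT      NOT NULL
)

-- Rounding the current time down to the hour:
SELECT DATEADD(HOUR, DATEDIFF(HOUR, 0, GETDATE()), 0)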

Always depending on the configuration and feasibility, with this table design you could eventually accumulate values in memory and update the table only once every 10 seconds, or at any interval up to an hour, depending on the acceptable risk.
