随机选择,同时按多个组的百分比进行细分

发布于 2024-08-17 09:05:49 字数 1968 浏览 7 评论 0原文

我正在尝试为用户构建一个简单的系统,以生成将向其发送调查的用户列表。列表生成可以取决于各种约束。例如,“我们只想要来自美国和加拿大的人”或“我们只想要拥有 2 级或 3 级会员资格的人”。

这部分非常简单,我已经设置了表格来捕获选择标准。不过,另一个标准是他们可能希望获得每件物品的一定比例。例如,“给我 70% 的美国用户和 30% 的加拿大用户”。再说一次,我认为我可以毫不费力地做到这一点。他们会给出他们想要的用户数量,所以我只需乘以百分比,然后确保四舍五入后数字仍然相加,我就可以开始了。

但考虑到未来,如果他们希望按照两组标准进行一定的百分比细分,该怎么办?例如,“给我 70% 的美国用户、30% 的加拿大用户,同时 50% 的 2 级用户和 50% 的 3 级用户。”因为这不是当前的要求,所以我不打算让自己为此头疼,但如果有人有一个相当简单的算法(或 SQL 代码)来完成这样的事情,那么我会很高兴看到它。

尽管我更喜欢与数据库无关的解决方案,但我使用的是 MS SQL 2005,因此特定于该 RDBMS 的解决方案也很好。

我当前使用的表结构与此类似:

CREATE TABLE Selection_Templates
(
     template_code     VARCHAR(20)     NOT NULL,
     template_name     VARCHAR(100)    NOT NULL,
     CONSTRAINT PK_Selection_Templates PRIMARY KEY CLUSTERED (template_code),
     CONSTRAINT UI_Selection_Templates UNIQUE (template_name)
)
GO
CREATE TABLE Selection_Template_Countries
(
     template_code            VARCHAR(20)       NOT NULL,
     country_code             CHAR(3)           NOT NULL,
     selection_percentage     DECIMAL(2, 2)     NULL,
     CONSTRAINT PK_Selection_Template_Countries PRIMARY KEY CLUSTERED (template_code, country_code),
     CONSTRAINT CK_Selection_Template_Countries_selection_percentage CHECK (selection_percentage > 0),
     CONSTRAINT FK_Selection_Template_Countries_Selection_Template FOREIGN KEY (template_code) REFERENCES Selection_Templates (template_code)
)
GO
CREATE TABLE Selection_Template_User_Levels
(
     template_code            VARCHAR(20)       NOT NULL,
     user_level               SMALLINT          NOT NULL,
     selection_percentage     DECIMAL(2, 2)     NULL,
     CONSTRAINT PK_Selection_Template_User_Levels PRIMARY KEY CLUSTERED (template_code, user_level),
     CONSTRAINT CK_Selection_Template_User_Levels_selection_percentage CHECK (selection_percentage > 0),
     CONSTRAINT FK_Selection_Template_User_Levels_Selection_Template FOREIGN KEY (template_code) REFERENCES Selection_Templates (template_code)
)

I'm trying to put together a simple system for a user to generate a list of users to whom surveys will be sent. The list generation may depend on various constraints. For example, "we only want people from the U.S. and Canada" or "we only want people who have a level 2 or level 3 membership."

This part is pretty easy and I've set up the tables to capture the selection criteria. One additional criteria though, is that they may want to get a certain percentage of each item. For example, "give me 70% U.S. users and 30% Canada users." Again, I think that I can do this without too much trouble. They will give the number of users that they want, so I can just multiple by the percentages then make sure that the numbers still add up after rounding and I'm good to go.

Thinking to the future though, what if they wanted certain percentage breakdowns by two sets of criteria. For example, "Give me 70% U.S., 30% Canada and at the same time, 50% level 2 users and 50% level 3 users." Since it's not a current requirement I'm not planning to give myself a headache over it, but if anyone has a reasonably simple algorithm (or SQL code) for accomplishing something like this then I'd be happy to see it.

Although I would prefer a DB-agnostic solution, I'm on MS SQL 2005, so solutions specific to that RDBMS are fine too.

The table structure which I'm currently using is similar to this:

CREATE TABLE Selection_Templates
(
     template_code     VARCHAR(20)     NOT NULL,
     template_name     VARCHAR(100)    NOT NULL,
     CONSTRAINT PK_Selection_Templates PRIMARY KEY CLUSTERED (template_code),
     CONSTRAINT UI_Selection_Templates UNIQUE (template_name)
)
GO
CREATE TABLE Selection_Template_Countries
(
     template_code            VARCHAR(20)       NOT NULL,
     country_code             CHAR(3)           NOT NULL,
     selection_percentage     DECIMAL(2, 2)     NULL,
     CONSTRAINT PK_Selection_Template_Countries PRIMARY KEY CLUSTERED (template_code, country_code),
     CONSTRAINT CK_Selection_Template_Countries_selection_percentage CHECK (selection_percentage > 0),
     CONSTRAINT FK_Selection_Template_Countries_Selection_Template FOREIGN KEY (template_code) REFERENCES Selection_Templates (template_code)
)
GO
CREATE TABLE Selection_Template_User_Levels
(
     template_code            VARCHAR(20)       NOT NULL,
     user_level               SMALLINT          NOT NULL,
     selection_percentage     DECIMAL(2, 2)     NULL,
     CONSTRAINT PK_Selection_Template_User_Levels PRIMARY KEY CLUSTERED (template_code, user_level),
     CONSTRAINT CK_Selection_Template_User_Levels_selection_percentage CHECK (selection_percentage > 0),
     CONSTRAINT FK_Selection_Template_User_Levels_Selection_Template FOREIGN KEY (template_code) REFERENCES Selection_Templates (template_code)
)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

转身以后 2024-08-24 09:05:49

您可以将问题分解为四组随机用户:

  • 美国用户,级别 2,选择所需样本总数的 35%
  • 加拿大用户,级别 2,选择所需样本总数的 15%
  • 美国用户,级别 3,选择样本总数的 35%所需样本
  • 加拿大用户,级别 3,选择所需总样本的 15%

如果有第三个标准,则将问题分为八组。等等。

在美国和加拿大的两组用户中准确获得50%的2级和50%的3级似乎有些人为。由于它应该是随机的,因此您可能会期望它会有更多变化。另外,如果来自加拿大的 3 级用户数量不多,无法达到总数的 15%,该怎么办?

随着标准变得越来越有选择性,您自然会消除总样本的随机性。最终,您可能会拥有一长串标准,这样只有一个用户子集可以满足它,然后就根本没有随机性了。


回复您的评论:是的,SQL 并不是解决所有类型问题的最佳解决方案。使用迭代算法而不是单个基于集合的 SQL 查询来处理问题可能会更好。例如:

  1. 随机选择一行。
  2. 如果该行已在先前的迭代中被选择,则丢弃它。
  3. 如果该行有助于保持选择 70% 美国、30% 加拿大、50% 2 级、50% 3 级总样本的速度,则保留它。否则,丢弃它。
  4. 如果达到所需的样本数量,请停止。
  5. 循环回到步骤 1。

当然,如果您选择有助于平衡 70/30% 国家比例的行,但不平衡 50/50% 水平比例,那就会很棘手。你扔还是不扔?而且,当您只选择前几行时,您可能想忽略比率。

正如@Hogan 评论的那样,这可能是一个无法解决的 NP 完全问题。但许多此类问题都有一个解决方案,可以为您提供“足够好”的结果,尽管不是可证明的最佳结果。

You could break down the problem into four sets of random users:

  • US users, level 2, choose 35% of total sample desired
  • Canada users, level 2, choose 15% of total sample desired
  • US users, level 3, choose 35% of total sample desired
  • Canada users, level 3, choose 15% of total sample desired

If there's a third criterion, split the problem down into eight sets. And so on.

It may seem artificial to get exactly 50% level 2 and 50% level 3 in both sets of users, US and Canada. Since it's supposed to be random, you might expect it to vary a bit more. Plus what if there aren't very many level 3 users from Canada to make up 15% of the total?

As the criteria get more and more selective, you're naturally taking away from the randomness of the total sample. Eventually you could have a long list of criteria such that only one subset of your users could satisfy it, and then there'd be no randomness at all.


Re your comment: Right, SQL isn't the best solution for every type of problem. You may be better off handling the problem with an iterative algorithm instead of a single set-based SQL query. For example:

  1. Pick one random row.
  2. If the row has been chosen already in a previous iteration, discard it.
  3. If the row helps keep the pace of choosing a total sample that is 70% US, 30% Canada, 50% level 2, 50% level 3, keep it. Otherwise, discard it.
  4. If you reach the desired number of samples, stop.
  5. Loop back to step 1.

Of course, it gets tricky if you pick a row that helps to balance the 70/30% ratio of nations, but imbalances the 50/50% ratio of levels. Do you discard it or not? And also you may want to ignore the ratios when you've only picked the first few rows.

As @Hogan commented, this might be an unsolvable NP-Complete problem. But many such problems have a solution that gives you a "good enough" result, though not a provably optimal result.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文