How do I write a function to compare and rank many sets of boolean (true/false) answers?

Posted on 2024-09-11 07:39:28

I've embarked on a project that is proving considerably more complicated than I'd first imagined. I'm trying to plan a system that is based around boolean (true/false) questions and answers. Users on the system can answer any questions from a large set of boolean (true/false) questions and be presented with a list showing the most similar users (in order of similarity) based on their answers.

I've Googled far and wide but still not come up with much, so I was hoping somebody could point me in the right direction. I'd like to know:

What is the best data structure and method to store this kind of data? I'd originally assumed I could create two tables, "questions" and "answers", in an SQL database. However, I'm now wondering if it would be simpler to compare two sets of answers if they were both stored as a numerical string, i.e. 0 = not answered, 1 = true, 2 = false. When comparing the strings, weights could be added for "not answered" = 0, "same answer" = 1 and "opposite answer" = -1, producing a similarity score.
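As a rough illustration of the scoring idea described above, here is a minimal Python sketch. It assumes each user's answers are stored as a string of digits in a fixed question order (0 = not answered, 1 = true, 2 = false); the names `similarity` and `rank_users` are made up for this example.

```python
from typing import Dict, List, Tuple

NOT_ANSWERED = "0"  # "1" = true, "2" = false


def similarity(a: str, b: str) -> int:
    """Score two answer strings: +1 same answer, -1 opposite answer, 0 if either is unanswered."""
    if len(a) != len(b):
        raise ValueError("answer strings must cover the same questions")
    score = 0
    for x, y in zip(a, b):
        if x == NOT_ANSWERED or y == NOT_ANSWERED:
            continue  # an unanswered question contributes nothing
        score += 1 if x == y else -1
    return score


def rank_users(target: str, others: Dict[str, str]) -> List[Tuple[str, int]]:
    """Return (user, score) pairs, most similar first."""
    scored = [(user, similarity(target, answers)) for user, answers in others.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_users("1120", {"ann": "1122", "bob": "2210", "cy": "1000"})
print(ranking)
```

This compares the target against every other user in turn, so it is the straightforward pairwise approach rather than anything that addresses scaling.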

How would I go about comparing two sets of answers? To be able to work out the "similarity" between these sets of answers I'm going to have to write a comparison function. Does anyone know what kind of comparison would best suit this problem? I've looked into sequence alignment and I think this could be the correct way to go, but I'm unsure, as this requires the data to be in a long sequence, and the questions aren't related so they aren't naturally a sequence.

How do I apply this comparison function to a large set of data? Once I've written the comparison function I could just compare each user's answers to every other user's answers, but this doesn't seem very efficient and probably wouldn't scale very well. I've been looking into cluster analysis methods to automatically group users according to similar answers. Do you think this could work, or does anyone know a better method I could look into?

I'd really appreciate any helpful pointers. Thanks!

Comments (4)

不顾 2024-09-18 07:39:28

If you were to set it up in SQL with tables for Users, Questions, and Answers then I believe that the following SQL could be used to get other users with similar responses. Simply add a TOP clause to get the number that you want.

I don't know how performance will be, but that would depend a lot on the size of your data.

SELECT
    U2.userid,
    SUM(CASE
            WHEN A1.answer = A2.answer THEN 1
            WHEN A1.answer <> A2.answer THEN -1
            WHEN A1.answer IS NULL OR A2.answer IS NULL THEN 0  -- A bit redundant, but I like to make it clear
            ELSE 0
        END) AS similarity_score
FROM
    Questions Q
-- A1: the answers given by the user we are matching against (@userid)
LEFT OUTER JOIN Answers A1 ON
    A1.question_id = Q.question_id AND
    A1.userid = @userid
-- A2: every other user's answer to the same question
LEFT OUTER JOIN Answers A2 ON
    A2.question_id = A1.question_id AND
    A2.userid <> A1.userid
LEFT OUTER JOIN Users U2 ON
    U2.userid = A2.userid
-- Questions with no matching answer from another user would otherwise form a NULL userid group
WHERE
    U2.userid IS NOT NULL
GROUP BY
    U2.userid
ORDER BY
    similarity_score DESC

美男兮 2024-09-18 07:39:28

Data Storage:
I would say a database is a good idea (sounds like the potential for a rather large data set). I don't know how many questions you plan on having but to help with simplifying the analysis (including your SQL queries) a bit you may want to group answers to similar questions in separate tables. And I would agree using a numerical value (byte 0-2) would be a good route to take instead of a boolean or something else. You are computing a similarity score so might as well start with numbers.

Comparison:
As far as the comparison itself, I would suggest creating a class SimilarQuestionAnswers that contains a list of bytes and a class UserAnswers that contains a list of these SimilarQuestionAnswers. This sets up your clusters for the cluster analysis method you mentioned, and it lets you add weight to certain clusters. (Cluster a is an important cluster so its score is multiplied by 20, whereas cluster b is not as important so its score is only multiplied by 10.) This also allows you to apply different comparisons for each cluster if that is needed.
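To make the class layout above a little more concrete, here is a hedged Python sketch. The class names and the 20/10 weights come from the description; the `weight` field, the method names and the +1/-1/0 scoring inside each cluster are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimilarQuestionAnswers:
    """One cluster of related answers: 0 = not answered, 1 = true, 2 = false."""
    answers: List[int]
    weight: float = 1.0  # assumed: importance multiplier for this cluster

    def score_against(self, other: "SimilarQuestionAnswers") -> float:
        raw = 0
        for x, y in zip(self.answers, other.answers):
            if x == 0 or y == 0:
                continue  # unanswered questions contribute nothing
            raw += 1 if x == y else -1
        return raw * self.weight


@dataclass
class UserAnswers:
    """All of one user's clusters, in the same cluster order for every user."""
    clusters: List[SimilarQuestionAnswers] = field(default_factory=list)

    def similarity(self, other: "UserAnswers") -> float:
        return sum(mine.score_against(theirs)
                   for mine, theirs in zip(self.clusters, other.clusters))


alice = UserAnswers([SimilarQuestionAnswers([1, 2, 1], weight=20),
                     SimilarQuestionAnswers([1, 1], weight=10)])
bob = UserAnswers([SimilarQuestionAnswers([1, 2, 2], weight=20),
                   SimilarQuestionAnswers([2, 0], weight=10)])
print(alice.similarity(bob))
```

Because each cluster scores itself, swapping in a different comparison for one cluster would only mean overriding `score_against` for that cluster.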

I know you said the questions aren't related, but you can still at least group questions by their importance. I think the sequence analysis will still work; granted, your similarity matrix will be all 1's, which simplifies the problem a bit, but the rest of the math associated with it should be sufficient.

Comparison Applied:
This is where having the database back end comes in handy. Use SQL queries to minimize the dataset you are dealing with. If you are comparing one person with everyone else, you can use the SQL sum method on their answers to get a quick and dirty comparison within each cluster and pull only those within a certain threshold. This may result in some overlap but that can be eliminated easily.

Another thought is to also have a table with a row for each user and a column for each cluster, holding a comparison to a fake user that has answered true to every question. Then you could just query that table for a range around the current user's scores for each cluster. This may be faster but less accurate.
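The fake-user idea might look roughly like the following Python sketch. It assumes the +1/-1/0 scoring from the question, so a cluster's score against an all-true user is simply (number of true answers) minus (number of false answers); `baseline_scores`, `candidates` and `tolerance` are hypothetical names for this example.

```python
from typing import Dict, List


def baseline_scores(clusters: List[List[int]]) -> List[int]:
    """Per-cluster score against a fake user who answered true (1) to everything."""
    return [sum(1 if a == 1 else -1 if a == 2 else 0 for a in cluster)
            for cluster in clusters]


def candidates(target: List[int], table: Dict[str, List[int]], tolerance: int) -> List[str]:
    """Users whose stored baseline score is within `tolerance` of the target in every cluster."""
    return [user for user, scores in table.items()
            if all(abs(s - t) <= tolerance for s, t in zip(scores, target))]


me = baseline_scores([[1, 1, 2], [0, 2]])
table = {"u1": [1, -1], "u2": [3, -1], "u3": [0, 0]}
print(me, candidates(me, table, tolerance=1))
```

Two users can land on the same baseline score with quite different answers, which is where the loss of accuracy comes from; the filter only narrows down who gets the full comparison.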

Either way, in the end you will still need to do the comparison against each of the users you get from that query, so the faster you can make that comparison the better. Try to stick to a formula that involves only +, -, * and /; most of the Math.Whatever() methods can add a lot of time over a large number of calls.

Sorry this was so long, most of the questions were pretty open ended and I had to assume a few details. I hope this helps.

无需解释 2024-09-18 07:39:28

I would think you might want a per-question weight based on how all users responded. As an extreme case, if 1,000 people answered questions A and B, and the results were A (2Y, 998N) and B (500Y, 500N), the two Ys for A count much more than any given pair of Ys from B. And any similar pair from B is somewhat more similar than any pair of Ns from A.
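One possible way to get weights that behave like this is to weight a shared answer by how rare it is, for example with the negative log of the fraction of users who gave it. This is an assumption chosen for illustration, not necessarily the weighting the answer has in mind; `agreement_weight` is a made-up name.

```python
import math


def agreement_weight(count_same: int, total: int) -> float:
    """Weight for two users sharing an answer that `count_same` of `total` users gave.

    Rare shared answers get a large weight, common ones a weight near zero.
    """
    if not 0 < count_same <= total:
        raise ValueError("count_same must be between 1 and total")
    return -math.log(count_same / total)


# The example from the text: A (2Y, 998N) and B (500Y, 500N) out of 1,000 users.
a_yes = agreement_weight(2, 1000)
b_any = agreement_weight(500, 1000)
a_no = agreement_weight(998, 1000)
print(a_yes, b_any, a_no)
```

With these numbers the ordering matches the description: a shared Y on A outweighs any shared answer on B, which in turn outweighs a shared N on A.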

Check out Bayesian Probability

旧伤慢歌 2024-09-18 07:39:28

除了对用户进行聚类之外,您还可以考虑对问题进行聚类(例如 OkCupid)。然后,您不再比较用户的所有答案,而是比较它们的类别。

Rather than cluster the users, you might also consider clustering the questions (e.g. OkCupid). Then instead of comparing users on all answers, you compare them on the categories.
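A minimal Python sketch of comparing users by category instead of by individual answer follows. It assumes the questions have already been assigned to categories somehow (the `category_of` mapping is hypothetical) and summarises each user as the fraction of true answers per category. It is not a description of how OkCupid actually does it.

```python
from collections import defaultdict
from typing import Dict, Optional


def category_profile(answers: Dict[str, bool], category_of: Dict[str, str]) -> Dict[str, float]:
    """Fraction of true answers per category, over the questions the user answered."""
    totals: Dict[str, int] = defaultdict(int)
    trues: Dict[str, int] = defaultdict(int)
    for question_id, answer in answers.items():
        category = category_of[question_id]
        totals[category] += 1
        trues[category] += int(answer)
    return {category: trues[category] / totals[category] for category in totals}


def category_similarity(p: Dict[str, float], q: Dict[str, float]) -> Optional[float]:
    """1.0 = identical profiles, 0.0 = opposite; None if no category is shared."""
    shared = p.keys() & q.keys()
    if not shared:
        return None
    return 1.0 - sum(abs(p[c] - q[c]) for c in shared) / len(shared)


category_of = {"q1": "music", "q2": "music", "q3": "food"}
alice = category_profile({"q1": True, "q2": False, "q3": True}, category_of)
bob = category_profile({"q1": True, "q2": True}, category_of)
print(category_similarity(alice, bob))
```

Each user is reduced to one number per category, so the comparison cost depends on the number of categories rather than the number of questions.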
