Getting a running frequency distribution from the previous N rows of a MySQL database
I have a MySQL database where one column contains status codes. The column is of type int and the values will only ever be 100, 200, 300 or 400. It looks like this; other columns removed for clarity.
id | status
----------------
1 300
2 100
3 100
4 200
5 200
6 300
7 100
8 400
9 300
10 300
11 100
12 400
13 400
14 400
15 300
16 300
The id field is auto-generated and will always be sequential. I want to have a third column displaying a comma-separated string of the frequency distribution of the status codes of the previous 10 rows. It should look like this.
id | status | freq
-----------------------------------
1 300
2 100
3 100
4 200
5 200
6 300
7 100
8 400
9 300
10 300
11 100 300,100,200,400 -- from rows 1-10
12 400 100,300,200,400 -- from rows 2-11
13 400 100,300,200,400 -- from rows 3-12
14 400 300,400,100,200 -- from rows 4-13
15 300 400,300,100,200 -- from rows 5-14
16 300 300,400,100 -- from rows 6-15
I want the most frequent code listed first. Where two status codes have the same frequency, it doesn't matter to me which is listed first, but in the example I listed the smaller code before the larger. Lastly, where a code doesn't appear at all in the previous ten rows, it shouldn't be listed in the freq column either.
To be very clear, the frequency string on a given row does NOT take that row's own status code into account; it covers only the previous rows.
So what have I done? I'm pretty green with SQL. I'm a programmer and I find this SQL language a tad odd to get used to. I managed the following self-join select statement.
select *, avg(b.status) freq
from sample a
join sample b
on (b.id < a.id) and (b.id > a.id - 11)
where a.id > 10
group by a.id;
Using the aggregate function avg, I can at least demonstrate the concept. The joined table b provides the correct rows to the avg function, but I just can't figure out the multi-step process of counting and grouping rows from b to get a frequency distribution and then collapsing the frequency rows into a single string value.
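One way to express the count-then-collapse steps described above is to nest the self-join in a derived table and let GROUP_CONCAT build the string. This is a minimal sketch rather than a tested solution: it assumes the sample table from this question, relies on the ids being gap-free, and the aliases cur, prev, counts and dist are made up for illustration.

```sql
-- Sketch only: assumes the `sample` table from this question and gap-free ids.
-- Aliases (cur, prev, counts, dist) are illustrative, not from the original post.
SELECT s.id, s.status, dist.freq
FROM sample AS s
LEFT JOIN (
    -- Step 2: collapse the per-status counts into one string per row,
    -- most frequent first, smaller code first on ties.
    SELECT counts.id,
           GROUP_CONCAT(counts.status
                        ORDER BY counts.cnt DESC, counts.status ASC) AS freq
    FROM (
        -- Step 1: for every row `cur`, count each status among the
        -- 10 rows that precede it (the row itself is excluded).
        SELECT cur.id, prev.status, COUNT(*) AS cnt
        FROM sample AS cur
        JOIN sample AS prev
          ON prev.id < cur.id
         AND prev.id > cur.id - 11
        WHERE cur.id > 10
        GROUP BY cur.id, prev.status
    ) AS counts
    GROUP BY counts.id
) AS dist ON dist.id = s.id
ORDER BY s.id;
```

The LEFT JOIN keeps rows 1 to 10 in the result with a NULL freq. A status that does not occur in the window produces no row in the inner query, so it never reaches the string. On the sample data this should give the freq values listed in the expected output for rows 11 to 16.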
I've also tried using standard stored functions and procedures in place of the built-in aggregate functions, but it seems the b table is out of scope or something; I can't seem to access it. And from what I understand, writing a custom aggregate function is not possible for me, as it seems to require developing in C, something I'm not trained for.
Here's the SQL to load up the sample.
create table sample (
id int NOT NULL AUTO_INCREMENT,
PRIMARY KEY(id),
status int
);
insert into sample(status) values(300),(100),(100),(200),(200),(300)
,(100),(400),(300),(300),(100),(400),(400),(400),(300),(300),(300)
,(100),(400),(100),(100),(200),(500),(300),(100),(400),(200),(100)
,(500),(300);
The sample has 30 rows of data to work with. I know it's a long question, but I just wanted to be as detailed as I could be. I've worked on this for a few days now and would really like to get it done.
Thanks for your help.
Comments (2)
The only way I know of to do what you're asking is to use a BEFORE INSERT trigger. It has to be BEFORE INSERT because you want to update a value in the row being inserted, which can only be done in a BEFORE trigger. Unfortunately, that also means the row won't have been assigned an ID yet, so hopefully it's safe to assume that at the time a new record is inserted, the last 10 records in the table are the ones you're interested in. Your trigger will need to get the values of the last 10 IDs and use the GROUP_CONCAT function to join them into a single string, ordered by the COUNT. I've been using SQL Server mostly and I don't have access to a MySQL server at the moment to test this, but hopefully my syntax will be close enough to at least get you moving in the right direction:
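The code that followed this answer is not reproduced here. As a rough sketch of the trigger approach it describes, and equally untested, something like the following might work. The freq column, its VARCHAR(32) size and the trigger name sample_before_insert are assumptions made for illustration and are not part of the original schema.

```sql
-- Sketch only. Assumption: a freq column is added to hold the string.
ALTER TABLE sample ADD COLUMN freq VARCHAR(32) NULL;

-- DELIMITER is a mysql command-line client directive, not server-side SQL.
DELIMITER //

-- Hypothetical trigger name; fires once for every inserted row.
CREATE TRIGGER sample_before_insert
BEFORE INSERT ON sample
FOR EACH ROW
BEGIN
    -- Count each status among the 10 most recent existing rows, then
    -- join the codes: most frequent first, smaller code first on ties.
    SET NEW.freq = (
        SELECT GROUP_CONCAT(t.status ORDER BY t.cnt DESC, t.status ASC)
        FROM (
            SELECT last10.status, COUNT(*) AS cnt
            FROM (
                SELECT status
                FROM sample
                ORDER BY id DESC
                LIMIT 10
            ) AS last10
            GROUP BY last10.status
        ) AS t
    );
END//

DELIMITER ;
```

As written, the trigger also fills freq when fewer than 10 rows exist, and it leaves rows inserted before the trigger was created untouched. With a multi-row INSERT, it is worth checking on your own server whether each row sees the rows inserted just before it in the same statement.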