获取直方图数据

发布于 2024-08-11 13:18:51 字数 1575 浏览 9 评论 0原文

有没有办法在 MySQL 中指定 bin 大小？现在，我正在尝试以下 SQL 查询：

select total, count(total) from faults GROUP BY total;

正在生成的数据足够好，但行数太多。我需要的是一种将数据分组到预定义的容器中的方法。我可以通过脚本语言来完成此操作，但是有没有办法直接在 SQL 中完成此操作？

示例：

+-------+--------------+
| total | count(total) |
+-------+--------------+
|    30 |            1 | 
|    31 |            2 | 
|    33 |            1 | 
|    34 |            3 | 
|    35 |            2 | 
|    36 |            6 | 
|    37 |            3 | 
|    38 |            2 | 
|    41 |            1 | 
|    42 |            5 | 
|    43 |            1 | 
|    44 |            7 | 
|    45 |            4 | 
|    46 |            3 | 
|    47 |            2 | 
|    49 |            3 | 
|    50 |            2 | 
|    51 |            3 | 
|    52 |            4 | 
|    53 |            2 | 
|    54 |            1 | 
|    55 |            3 | 
|    56 |            4 | 
|    57 |            4 | 
|    58 |            2 | 
|    59 |            2 | 
|    60 |            4 | 
|    61 |            1 | 
|    63 |            2 | 
|    64 |            5 | 
|    65 |            2 | 
|    66 |            3 | 
|    67 |            5 | 
|    68 |            5 | 
------------------------

我正在寻找什么：

+------------+---------------+
| total      | count(total)  |
+------------+---------------+
|    30 - 40 |            23 | 
|    40 - 50 |            15 | 
|    50 - 60 |            51 | 
|    60 - 70 |            45 | 
------------------------------

我想这不能以直接的方式实现，但对任何相关存储过程的引用也可以。

原文

Is there a way to specify bin sizes in MySQL? Right now, I am trying the following SQL query:

select total, count(total) from faults GROUP BY total;

The data that is being generated is good enough but there are just too many rows. What I need is a way to group the data into predefined bins. I can do this from a scripting language, but is there a way to do it directly in SQL?

Example:

+-------+--------------+
| total | count(total) |
+-------+--------------+
|    30 |            1 | 
|    31 |            2 | 
|    33 |            1 | 
|    34 |            3 | 
|    35 |            2 | 
|    36 |            6 | 
|    37 |            3 | 
|    38 |            2 | 
|    41 |            1 | 
|    42 |            5 | 
|    43 |            1 | 
|    44 |            7 | 
|    45 |            4 | 
|    46 |            3 | 
|    47 |            2 | 
|    49 |            3 | 
|    50 |            2 | 
|    51 |            3 | 
|    52 |            4 | 
|    53 |            2 | 
|    54 |            1 | 
|    55 |            3 | 
|    56 |            4 | 
|    57 |            4 | 
|    58 |            2 | 
|    59 |            2 | 
|    60 |            4 | 
|    61 |            1 | 
|    63 |            2 | 
|    64 |            5 | 
|    65 |            2 | 
|    66 |            3 | 
|    67 |            5 | 
|    68 |            5 | 
------------------------

What I am looking for:

+------------+---------------+
| total      | count(total)  |
+------------+---------------+
|    30 - 40 |            23 | 
|    40 - 50 |            15 | 
|    50 - 60 |            51 | 
|    60 - 70 |            45 | 
------------------------------

I guess this cannot be achieved in a straight forward manner but a reference to any related stored procedure would be fine as well.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

金橙橙 2024-08-18 13:18:51

这是一篇关于创建直方图的超级快速和肮脏的方法的文章
在 MySQL 中获取数值。
还有多种其他方法可以创建更好的直方图
更灵活，使用CASE语句和其他类型的复杂逻辑。
这种方法一次又一次地赢得了我的青睐，因为它是如此简单
针对每个用例进行修改，因此简短明了。这就是你如何
这样做：
SELECT ROUND(numeric_value, -2) AS 存储桶，
       计数（*）作为计数，
       RPAD('', LN(COUNT(*)), '*') AS 栏
来自我的表
GROUP BY 存储桶；
只需将 numeric_value 更改为您的列的值，然后更改
舍入增量，就是这样。我已经把酒吧弄进去了
对数刻度，这样当你有时它们就不会增长太多
大值。
numeric_value 应在舍入操作中根据舍入增量进行偏移，以确保第一个存储桶包含与后续存储桶一样多的元素。
例如使用 ROUND(numeric_value,-1)，[0,4] 范围内的 numeric_value（5 个元素）将被放置在第一个存储桶中，而 [5,14]（10 个元素）将被放置在第二个存储桶中，[15,24]第三，除非 numeric_value 通过 ROUND(numeric_value - 5, -1) 适当偏移。
这是对一些看起来很漂亮的随机数据进行此类查询的示例
甜的。足以快速评估数据。
<前><代码>+--------+----------+-----------------+
|桶 |计数|酒吧|
+--------+----------+-----------------+
| -500| 1 | |
| -400 | -400 2 | * |
| -300| 2 | * |
| -200 | -200 9 | ** |
| -100 | -100 52 | 52 **** |
| 0 | 5310766 | ************** |
| 100 | 100 20779 | 20779 ********** |
| 200 | 200 1865 | 1865 ****** |
| 300 | 300 527 | 527 ****** |
| 400 | 170 | 170 ****** |
| 500 | 79 | 79 **** |
| 600 | 63 | 63 **** |
| 700 | 700 35 | 35 **** |
| 800 | 14 | 14 *** |
| 900 | 900 15 | 15 *** |
| 1000 | 1000 6 | ** |
| 1100 | 1100 7 | ** |
| 1200 | 1200 8 | ** |
| 1300 | 1300 5 | ** |
| 1400 | 1400 2 | * |
| 1500 | 1500 4 | * |
+--------+----------+-----------------+
一些注意事项：没有匹配的范围不会出现在计数中 -
计数列中不会有零。另外，我正在使用
ROUND 函数在这里。您可以轻松地将其替换为 TRUNCATE
如果您觉得这对您更有意义。

我在这里找到它 http://blog.shlomoid.com/2011/ 08/如何快速创建直方图-in.html

This is a post about a super quick-and-dirty way to create a histogram
in MySQL for numeric values.
There are multiple other ways to create histograms that are better and
more flexible, using CASE statements and other types of complex logic.
This method wins me over time and time again since it's just so easy
to modify for each use case, and so short and concise. This is how you
do it:
SELECT ROUND(numeric_value, -2)    AS bucket,
       COUNT(*)                    AS COUNT,
       RPAD('', LN(COUNT(*)), '*') AS bar
FROM   my_table
GROUP  BY bucket;
Just change numeric_value to whatever your column is, change the
rounding increment, and that's it. I've made the bars to be in
logarithmic scale, so that they don't grow too much when you have
large values.
numeric_value should be offset in the ROUNDing operation, based on the rounding increment, in order to ensure the first bucket contains as many elements as the following buckets.
e.g. with ROUND(numeric_value,-1), numeric_value in range [0,4] (5 elements) will be placed in first bucket, while [5,14] (10 elements) in second, [15,24] in third, unless numeric_value is offset appropriately via ROUND(numeric_value - 5, -1).
This is an example of such query on some random data that looks pretty
sweet. Good enough for a quick evaluation of the data.
+--------+----------+-----------------+
| bucket | count    | bar             |
+--------+----------+-----------------+
|   -500 |        1 |                 |
|   -400 |        2 | *               |
|   -300 |        2 | *               |
|   -200 |        9 | **              |
|   -100 |       52 | ****            |
|      0 |  5310766 | *************** |
|    100 |    20779 | **********      |
|    200 |     1865 | ********        |
|    300 |      527 | ******          |
|    400 |      170 | *****           |
|    500 |       79 | ****            |
|    600 |       63 | ****            |
|    700 |       35 | ****            |
|    800 |       14 | ***             |
|    900 |       15 | ***             |
|   1000 |        6 | **              |
|   1100 |        7 | **              |
|   1200 |        8 | **              |
|   1300 |        5 | **              |
|   1400 |        2 | *               |
|   1500 |        4 | *               |
+--------+----------+-----------------+
Some notes: Ranges that have no match will not appear in the count -
you will not have a zero in the count column. Also, I'm using the
ROUND function here. You can just as easily replace it with TRUNCATE
if you feel it makes more sense to you.

I found it here http://blog.shlomoid.com/2011/08/how-to-quickly-create-histogram-in.html

回复收藏 0 原文

回忆追雨的时光 2024-08-18 13:18:51

Mike DelGaudio 的回答是我的做法，但略有改变：

select floor(mycol/10)*10 as bin_floor, count(*)
from mytable
group by 1
order by 1

优势？您可以根据需要将垃圾箱设置得尽可能大或尽可能小。大小为 100 的垃圾箱？ 下限(mycol/100)*100。尺寸为 5 的垃圾箱？ 地板(mycol/5)*5。

贝尔纳多.

Mike DelGaudio's answer is the way I do it, but with a slight change:

select floor(mycol/10)*10 as bin_floor, count(*)
from mytable
group by 1
order by 1

The advantage? You can make the bins as large or as small as you want. Bins of size 100? floor(mycol/100)*100. Bins of size 5? floor(mycol/5)*5.

Bernardo.

回复收藏 0 原文

暖心男生 2024-08-18 13:18:51

SELECT b.*,count(*) as total FROM bins b 
left outer join table1 a on a.value between b.min_value and b.max_value 
group by b.min_value

表 bin 包含定义 bin 的 min_value 和 max_value 列。
请注意，运算符“join... on x BETWEEN y and z”是包含在内的。

table1是数据表的名称

SELECT b.*,count(*) as total FROM bins b 
left outer join table1 a on a.value between b.min_value and b.max_value 
group by b.min_value

The table bins contains columns min_value and max_value which define the bins.
note that the operator "join... on x BETWEEN y and z" is inclusive.

table1 is the name of the data table

回复收藏 0 原文

挖鼻大婶 2024-08-18 13:18:51

奥弗里·拉维夫的答案非常接近但不正确。即使直方图间隔中的结果为零，count(*) 也将为 1。需要修改查询以使用条件 sum：

SELECT b.*, SUM(a.value IS NOT NULL) AS total FROM bins b
  LEFT JOIN a ON a.value BETWEEN b.min_value AND b.max_value
GROUP BY b.min_value;

Ofri Raviv's answer is very close but incorrect. The count(*) will be 1 even if there are zero results in a histogram interval. The query needs to be modified to use a conditional sum:

SELECT b.*, SUM(a.value IS NOT NULL) AS total FROM bins b
  LEFT JOIN a ON a.value BETWEEN b.min_value AND b.max_value
GROUP BY b.min_value;

回复收藏 0 原文

翻身的咸鱼 2024-08-18 13:18:51

select "30-34" as TotalRange,count(total) as Count from table_name
   where total between 30 and 34
union (
select "35-39" as TotalRange,count(total) as Count from table_name 
   where total between 35 and 39)
union (
select "40-44" as TotalRange,count(total) as Count from table_name
   where total between 40 and 44)
union (
select "45-49" as TotalRange,count(total) as Count from table_name
   where total between 45 and 49)
etc ....

只要间隔不是太多，这就是一个很好的解决方案。

select "30-34" as TotalRange,count(total) as Count from table_name
   where total between 30 and 34
union (
select "35-39" as TotalRange,count(total) as Count from table_name 
   where total between 35 and 39)
union (
select "40-44" as TotalRange,count(total) as Count from table_name
   where total between 40 and 44)
union (
select "45-49" as TotalRange,count(total) as Count from table_name
   where total between 45 and 49)
etc ....

As long as there are not too many intervals, this is a pretty good solution.

回复收藏 0 原文

若无相欠,怎会相见 2024-08-18 13:18:51

我制作了一个程序，可用于根据指定的数量或大小自动生成 bin 的临时表，以便稍后与 Ofri Raviv 的解决方案一起使用。

CREATE PROCEDURE makebins(numbins INT, binsize FLOAT) # binsize may be NULL for auto-size
BEGIN
 SELECT FLOOR(MIN(colval)) INTO @binmin FROM yourtable;
 SELECT CEIL(MAX(colval)) INTO @binmax FROM yourtable;
 IF binsize IS NULL 
  THEN SET binsize = CEIL((@binmax-@binmin)/numbins); # CEIL here may prevent the potential creation a very small extra bin due to rounding errors, but no good where floats are needed.
 END IF;
 SET @currlim = @binmin;
 WHILE @currlim + binsize < @binmax DO
  INSERT INTO bins VALUES (@currlim, @currlim+binsize);
  SET @currlim = @currlim + binsize;
 END WHILE;
 INSERT INTO bins VALUES (@currlim, @maxbin);
END;

DROP TABLE IF EXISTS bins; # be careful if you have a bins table of your own.
CREATE TEMPORARY TABLE bins (
minval INT, maxval INT, # or FLOAT, if needed
KEY (minval), KEY (maxval) );# keys could perhaps help if using a lot of bins; normally negligible

CALL makebins(20, NULL);  # Using 20 bins of automatic size here. 

SELECT bins.*, count(*) AS total FROM bins
LEFT JOIN yourtable ON yourtable.value BETWEEN bins.minval AND bins.maxval
GROUP BY bins.minval

这将仅生成填充的容器的直方图计数。 David West 的纠正应该是正确的，但由于某种原因，未填充的垃圾箱不会出现在我的结果中（尽管使用了 LEFT JOIN - 我不明白为什么）。

I made a procedure that can be used to automatically generate a temporary table for bins according to a specified number or size, for later use with Ofri Raviv's solution.

CREATE PROCEDURE makebins(numbins INT, binsize FLOAT) # binsize may be NULL for auto-size
BEGIN
 SELECT FLOOR(MIN(colval)) INTO @binmin FROM yourtable;
 SELECT CEIL(MAX(colval)) INTO @binmax FROM yourtable;
 IF binsize IS NULL 
  THEN SET binsize = CEIL((@binmax-@binmin)/numbins); # CEIL here may prevent the potential creation a very small extra bin due to rounding errors, but no good where floats are needed.
 END IF;
 SET @currlim = @binmin;
 WHILE @currlim + binsize < @binmax DO
  INSERT INTO bins VALUES (@currlim, @currlim+binsize);
  SET @currlim = @currlim + binsize;
 END WHILE;
 INSERT INTO bins VALUES (@currlim, @maxbin);
END;

DROP TABLE IF EXISTS bins; # be careful if you have a bins table of your own.
CREATE TEMPORARY TABLE bins (
minval INT, maxval INT, # or FLOAT, if needed
KEY (minval), KEY (maxval) );# keys could perhaps help if using a lot of bins; normally negligible

CALL makebins(20, NULL);  # Using 20 bins of automatic size here. 

SELECT bins.*, count(*) AS total FROM bins
LEFT JOIN yourtable ON yourtable.value BETWEEN bins.minval AND bins.maxval
GROUP BY bins.minval

This will generate the histogram count only for the bins that are populated. David West ought to be right in his correction, but for some reason, unpopulated bins do not appear in the result for me (despite the use of a LEFT JOIN — I do not understand why).

回复收藏 0 原文

乖乖 2024-08-18 13:18:51

那应该有效。不是那么优雅，但仍然：

select count(mycol - (mycol mod 10)) as freq, mycol - (mycol mod 10) as label
from mytable
group by mycol - (mycol mod 10)
order by mycol - (mycol mod 10) ASC

通过 Mike DelGaudio

That should work. Not so elegant but still:

select count(mycol - (mycol mod 10)) as freq, mycol - (mycol mod 10) as label
from mytable
group by mycol - (mycol mod 10)
order by mycol - (mycol mod 10) ASC

via Mike DelGaudio

回复收藏 0 原文

那一片橙海， 2024-08-18 13:18:51

SELECT
    CASE
        WHEN total <= 30 THEN "0-30"
        WHEN total <= 40 THEN "31-40"       
        WHEN total <= 50 THEN "41-50"
        ELSE "50-"
    END as Total,
    count(*) as count
GROUP BY Total 
ORDER BY Total;

SELECT
    CASE
        WHEN total <= 30 THEN "0-30"
        WHEN total <= 40 THEN "31-40"       
        WHEN total <= 50 THEN "41-50"
        ELSE "50-"
    END as Total,
    count(*) as count
GROUP BY Total 
ORDER BY Total;

回复收藏 0 原文

十秒萌定你 2024-08-18 13:18:51

等宽分箱到给定数量的箱中：

WITH bins AS(
   SELECT min(col) AS min_value
        , ((max(col)-min(col)) / 10.0) + 0.0000001 AS bin_width
   FROM cars
)
SELECT tab.*,
   floor((col-bins.min_value) / bins.bin_width ) AS bin
FROM tab, bins;

请注意，0.0000001 的作用是确保值等于 max(col) 的记录不会单独创建自己的箱。此外，附加常量可确保当列中的所有值都相同时查询不会因除零而失败。

另请注意，bin 的计数（示例中为 10）应使用小数标记，以避免整数除法（未调整的 bin_width 可以是小数）。

Equal width binning into a given count of bins:

WITH bins AS(
   SELECT min(col) AS min_value
        , ((max(col)-min(col)) / 10.0) + 0.0000001 AS bin_width
   FROM cars
)
SELECT tab.*,
   floor((col-bins.min_value) / bins.bin_width ) AS bin
FROM tab, bins;

Note that the 0.0000001 is there to make sure that the records with the value equal to max(col) do not make it's own bin just by itself. Also, the additive constant is there to make sure the query does not fail on division by zero when all the values in the column are identical.

Also note that the count of bins (10 in the example) should be written with a decimal mark to avoid integer division (the unadjusted bin_width can be decimal).

回复收藏 0 原文