在 SQL 中计算 2.3 四分位数平均值

发布于 2024-11-28 12:00:12 字数 717 浏览 2 评论 0原文

我想在 SQL (MySQL) 中编写一个存储过程来计算第二个和第三个四分位数的平均值。

换句话说，我有测量 URL 加载所需时间的记录。记录是（id，url，时间），它们是每个 URL 的许多测量值。我想要做的是对于每个 URL 删除最低和最高的 25%（即下四分位数和上四分位数）并计算剩余 25%-75% 的加载时间的平均值。并将其存储到另一个表中。

我看到了 MS SQL 的一些示例，似乎相对简单。但我必须使用 MySQL where :

LIMIT 子句不支持百分比（没有类似的选择前 25% ）
LIMIT 子句不支持其参数为变量（仅常量）
函数不支持动态 SQL（例如 PREPARE 和EXECUTE ）

我到目前为止：

create procedure G(
  IN val VARCHAR(10)
)
Begin
  select @cnt:=count(*) from test where a=val;
  select  @of:= @cnt /4;
  SELECT @len:= @cnt/2; 
  Prepare stmt from 'select * from test where a="a" LIMIT ?,?';
  execute stmt using @of, @len;
END;

我可以用 PHP 编写它，但认为用 SQL 它会有更好的整体性能。我将非常感谢一些帮助。

原文

I want to write a stored proc in SQL (MySQL) to compute the average of second and third quartiles.

In other words I have records for measurements for how long it takes for an URL to load. Records are (id,url,time) and they are many measurements for each URL. What I'm trying to do is for each URL remove the lowest and top 25% (i.e. lower and upper quartiles) and compute average of the remaining 25%-75% of loading times. And store this into another table.

I saw some examples for this for MS SQL and seemed to be relatively easy. But I have to use MySQL where :

LIMIT clause doesn't support percents ( no analogue to select top 25% )
LIMIT clause doesn't support its arguments to be variables (only constants)
functions don't support dynamic SQL (e.g. PREPARE and EXECUTE )

And I got as far as here :

create procedure G(
  IN val VARCHAR(10)
)
Begin
  select @cnt:=count(*) from test where a=val;
  select  @of:= @cnt /4;
  SELECT @len:= @cnt/2; 
  Prepare stmt from 'select * from test where a="a" LIMIT ?,?';
  execute stmt using @of, @len;
END;

I can write it in PHP but think in SQL it would have much better overall performance. I will appreciate some help very much.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

荭秂 2024-12-05 12:00:12

查看@Richard aka cyberkiwi 在这个问题：

Select *
from
(
    SELECT tbl.*, @counter := @counter +1 counter
    FROM (select @counter:=0) initvar, tbl
    ORDER BY ordcolumn
) X
where counter >= (25/100 * @counter) and counter <= (75/100 * @counter);
ORDER BY ordcolumn

Look at answer and comment by @Richard aka cyberkiwi in this question:

Select *
from
(
    SELECT tbl.*, @counter := @counter +1 counter
    FROM (select @counter:=0) initvar, tbl
    ORDER BY ordcolumn
) X
where counter >= (25/100 * @counter) and counter <= (75/100 * @counter);
ORDER BY ordcolumn

回复收藏 0 原文

Hello爱情风 2024-12-05 12:00:12

如果四分位值错误，您可以使用 IF 将四分位值设置为零来创建四分位值：

假设原始数据表是由创建的

DROP TABLE IF EXISTS `rawdata`;
CREATE TABLE `rawdata` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `url` varchar(250) NOT NULL DEFAULT '',
  `time` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `time` (`time`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

（当然也是填充的）。

我们还假设四分位表数据是由创建的

DROP TABLE IF EXISTS `quartiles`;
CREATE TABLE `quartiles` (
  `url` varchar(250) NOT NULL,
  `Q1` float DEFAULT '0',
  `Q2` float DEFAULT '0',
  `Q3` float DEFAULT '0',
  `Q4` float DEFAULT '0',
  PRIMARY KEY (`url`),
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

（并留空）。

然后，从原始数据填充四分位数的过程将如下所示

DELIMITER ;;

CREATE PROCEDURE `ComputeQuartiles`()
    READS SQL DATA
BEGIN
    DECLARE numrows int DEFAULT 0;
    DECLARE qrows int DEFAULT 0;
    DECLARE rownum int DEFAULT 0;
    DECLARE done int DEFAULT 0;
    DECLARE currenturl VARCHAR(250) CHARACTER SET utf8;
    DECLARE Q1,Q2,Q3,Q4 float DEFAULT 0.0;
    DECLARE allurls CURSOR FOR SELECT DISTINCT url FROM rawdata;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET currenturl='';

    OPEN allurls;
    FETCH allurls INTO currenturl;
    WHILE currenturl<>'' DO
        SELECT COUNT(*) INTO numrows FROM rawdata WHERE url=currenturl;
        SET qrows=FLOOR(numrows/4);
        if qrows>0 THEN
            -- Only session parameters can be recalculated inside a query,
            -- so @rownum:=@rownum+1 will work, but rownum:=rownum+1 will not.
            SET @rownum=0;
            SELECT
                SUM(IFNULL(QA,0))/qrows, 
                SUM(IFNULL(QB,0))/qrows, 
                SUM(IFNULL(QC,0))/qrows, 
                SUM(IFNULL(QD,0))/qrows
            FROM (
                SELECT 
                    if(@rownum<qrows,time,0) AS QA,
                    if(@rownum>=qrows AND @rownum<2*qrows,time,0) AS QB,
                    -- the middle 0-3 rows are left out 
                    if(@rownum>=(numrows-2*qrows) AND @rownum<(numrows-qrows),time,0) AS QC,
                    if(@rownum>=(numrows-qrows),time,0) AS QD,
                    @rownum:=@rownum+1 AS dummy
                FROM rawdata
                WHERE url=currenturl ORDER BY time
            ) AS baseview
            INTO Q1,Q2,Q3,Q4
            ;
            REPLACE INTO quartiles values (currenturl,Q1,Q2,Q3,Q4);
        END IF;

        FETCH allurls INTO currenturl;
    END WHILE;
    CLOSE allurls;

END ;;

DELIMITER ;

要点是：

使用游标循环 URL（或调整示例以接受 URL 作为参数）
对于每个 URL 查找总行数
做一些简单的数学运算省略中间行，如果 (rowcount % 4) != 0
选择 URL 的所有原始行，将 time 的值分配给 QA-QD 之一，具体取决于在行上数字，为另一个 Qx 分配值 0
使用此查询作为另一个查询的子查询，对值进行求和并标准化
使用此超级查询的结果来更新四分位数表

我使用 18432 个原始行对此进行了测试，url=concat ('http://.../',floor(rand()*10)), time=round(rand()*10000) 在 8x1.9GHz 机器上始终在 0.50-0.54 秒内完成

You can create the quartile values by using IF to set them to zero if in the wrong quartile:

Let's assume, the raw data table is created by

DROP TABLE IF EXISTS `rawdata`;
CREATE TABLE `rawdata` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `url` varchar(250) NOT NULL DEFAULT '',
  `time` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `time` (`time`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

(and ofcourse populated).

Let's also assume the quartile table data is created by

DROP TABLE IF EXISTS `quartiles`;
CREATE TABLE `quartiles` (
  `url` varchar(250) NOT NULL,
  `Q1` float DEFAULT '0',
  `Q2` float DEFAULT '0',
  `Q3` float DEFAULT '0',
  `Q4` float DEFAULT '0',
  PRIMARY KEY (`url`),
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

(and left empty).

Then a procedure to populate quartiles from rawdata would look like

DELIMITER ;;

CREATE PROCEDURE `ComputeQuartiles`()
    READS SQL DATA
BEGIN
    DECLARE numrows int DEFAULT 0;
    DECLARE qrows int DEFAULT 0;
    DECLARE rownum int DEFAULT 0;
    DECLARE done int DEFAULT 0;
    DECLARE currenturl VARCHAR(250) CHARACTER SET utf8;
    DECLARE Q1,Q2,Q3,Q4 float DEFAULT 0.0;
    DECLARE allurls CURSOR FOR SELECT DISTINCT url FROM rawdata;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET currenturl='';

    OPEN allurls;
    FETCH allurls INTO currenturl;
    WHILE currenturl<>'' DO
        SELECT COUNT(*) INTO numrows FROM rawdata WHERE url=currenturl;
        SET qrows=FLOOR(numrows/4);
        if qrows>0 THEN
            -- Only session parameters can be recalculated inside a query,
            -- so @rownum:=@rownum+1 will work, but rownum:=rownum+1 will not.
            SET @rownum=0;
            SELECT
                SUM(IFNULL(QA,0))/qrows, 
                SUM(IFNULL(QB,0))/qrows, 
                SUM(IFNULL(QC,0))/qrows, 
                SUM(IFNULL(QD,0))/qrows
            FROM (
                SELECT 
                    if(@rownum<qrows,time,0) AS QA,
                    if(@rownum>=qrows AND @rownum<2*qrows,time,0) AS QB,
                    -- the middle 0-3 rows are left out 
                    if(@rownum>=(numrows-2*qrows) AND @rownum<(numrows-qrows),time,0) AS QC,
                    if(@rownum>=(numrows-qrows),time,0) AS QD,
                    @rownum:=@rownum+1 AS dummy
                FROM rawdata
                WHERE url=currenturl ORDER BY time
            ) AS baseview
            INTO Q1,Q2,Q3,Q4
            ;
            REPLACE INTO quartiles values (currenturl,Q1,Q2,Q3,Q4);
        END IF;

        FETCH allurls INTO currenturl;
    END WHILE;
    CLOSE allurls;

END ;;

DELIMITER ;

The main points being:

Use a cursor to cycle the URLs (or adapt the sample to accept the URL as a parameter)
For every URL find the total number of rows
Do some trivial math to leave out the middle rows, if (rowcount % 4) != 0
select all raw rows for the URL, assigning the value of time to one of QA-QD, depending on the row number, assigning the other Qx the value 0
Use this query as a subquery to another one, which sums up and normalizes the values
Use the results of this superquery to update quartiles table

I tested this with 18432 raw rows, url=concat('http://.../',floor(rand()*10)), time=round(rand()*10000) on a 8x1.9GHz machine and it finished consistently in 0.50-0.54sec

回复收藏 0 原文

秋意浓 2024-12-05 12:00:12

这个怎么样？

prepare stmt from select concat('select * from test where a="a" LIMIT ',@of,@len);
execute stmt;

how about this ?

prepare stmt from select concat('select * from test where a="a" LIMIT ',@of,@len);
execute stmt;

回复收藏 0 原文

永言不败 2024-12-05 12:00:12

看一下这个使用 MySQL 计算百分位数的优秀示例。我在一些相当大的数据集上使用了它并取得了巨大成功。

http://planet.mysql.com/entry/?id=13588

注意与group_concat_max_len相关的部分 - 这非常重要。将此值设置为最大允许值 - 这是最大数据包大小的设置，将确保如果它构建的字符串太大，您将得到一个正确的错误，而不仅仅是“截断字段”警告。

SET @@group_concat_max_len := @@max_allowed_packet;

我要做的是使用此函数计算第 25 个和第 75 个百分位数（可以在单个查询中完成），然后通过对数据运行第二个查询来计算剩余数据的平均值。

<?php
$lowVal = /* result of query getting the 25%ile value */;
$highVal = /* result of query getting the 75%ile value */;

$strSQL = "SELECT AVG(`field`) AS myAvg 
             FROM `table` 
             WHERE { your_existing_criteria_goes_here }
                AND `filter_field` BETWEEN '{$lowVal}' AND '{$highVal}';"
/* Run the query and extract your data */
?>

希望一切都有意义，并帮助解决您的问题:)

Take a look at this excellent example of calculating percentiles with MySQL. I have used this with great success on some pretty large datasets.

http://planet.mysql.com/entry/?id=13588

Take note of the section relating to group_concat_max_len - this is really important. Setting this value to the maximum allowable value - which is your setting for max packet size, will ensure that if the string it builds gets too big, you'll get a proper error rather than just a 'truncated field' warning.

SET @@group_concat_max_len := @@max_allowed_packet;

What I would do is use this function to calculate the 25th and 75th percentiles (which can be done in a single query), and then calculate the averages of your remaining data by running a second query against the data.

<?php
$lowVal = /* result of query getting the 25%ile value */;
$highVal = /* result of query getting the 75%ile value */;

$strSQL = "SELECT AVG(`field`) AS myAvg 
             FROM `table` 
             WHERE { your_existing_criteria_goes_here }
                AND `filter_field` BETWEEN '{$lowVal}' AND '{$highVal}';"
/* Run the query and extract your data */
?>

Hope that all makes sense, and help with your problem :)

回复收藏 0 原文

池予 2024-12-05 12:00:12

为什么不这样使用一个查询：

select url, avg(time)
from mytable A
where time >
       (select min(B.time) + ((max(B.time)-min(B.time))/100*25)
          from mytable B where B.url = A.url)
and time <
       (select max(B.time) - ((max(B.time)-min(B.time))/100*25)
          from mytable B where B.url = A.url)
group by url;

Why don't you just use one query this way:

select url, avg(time)
from mytable A
where time >
       (select min(B.time) + ((max(B.time)-min(B.time))/100*25)
          from mytable B where B.url = A.url)
and time <
       (select max(B.time) - ((max(B.time)-min(B.time))/100*25)
          from mytable B where B.url = A.url)
group by url;

回复收藏 0 原文

~没有更多了~