PHP 中的日期匹配和插值
序言:我有一个大型数据集,其中存储有数十万条记录,存储在 MySQL 数据库中。大大简化后,每一行都有一个日期时间字段来存储拨打电话的日期和时间,以及一个整数字段来存储通话时长。
场景:我正忙于在 PHP 中编写一个插值函数,该函数生成一系列由预先计算的间隔分隔的日期。每个生成的日期都存储在一个关联数组中,并以日期为键,每个值都初始化为 0。然后,该脚本在数据库中查询记录列表,并尝试将日期时间记录与预置中最接近的日期进行匹配。生成的关联数组。当找到最接近的匹配时,它只是将调用持续时间添加到该索引处的数组的现有值中。
生成关联数组的示例:
$array = array( "2011-01-01 09:00:00" => 0,
"2011-01-01 09:30:00" => 0,
"2011-01-01 10:00:00" => 0,
"2011-01-01 10:30:00" => 0,
"2011-01-01 11:00:00" => 0,
"2011-01-01 11:30:00" => 0,
"2011-01-01 12:00:00" => 0
)
在上面的示例中,使用 30 分钟的间隔生成日期范围。
来自 MySQL 数据库的记录示例:
+---------------------+----------+
| datetime | duration |
+---------------------+----------+
| 2011-01-01 09:02:26 | 1 |
| 2011-01-01 09:14:51 | 1 |
| 2011-01-01 10:40:33 | 549 |
| 2011-01-01 11:10:27 | 38 |
| 2011-01-01 11:31:50 | 82 |
+---------------------+----------+
现在,每条记录都需要与上面给出的预生成数组中最接近的日期时间键相匹配,并将 duration
值添加到匹配的现有值。
问题: 构造两个嵌套的 for 循环来交互数据库中的记录,然后线性地运行关联数组来查找匹配项是很容易的,但是这是非常低效的,并且对于大型数据集来说会出现问题(想想冒泡排序,这大致相当于)。稍微好一点的方法是线性循环数据库中的记录,然后将数组作为二叉树进行迭代,这当然更有效,并且是可能的,因为两个数组都是按时间顺序排序的。
问题: 有没有比我在上述问题中描述的方式更有效的处理此日期匹配的方法?
Prelude: I have a large dataset with several hundred thousand records stored in a MySQL database. Greatly simplified, each row has a datetime field to store the date and time at which a telephone call was made and an integer field to store the length of the call.
Scenario: I'm busy writing an interpolation function in PHP which generates a range of dates separated by a pre-calculated interval. Each generated date is stored in an associative array with the date used as the key, and each value initialized to 0. The script then queries the database for a list of records and attempts to match the datetime record to the closest date in the pre-generated associative array. When the closest match is found, it simply adds the call duration to the existing value of the array at that index.
Example of the generated associative array:
$array = array( "2011-01-01 09:00:00" => 0,
"2011-01-01 09:30:00" => 0,
"2011-01-01 10:00:00" => 0,
"2011-01-01 10:30:00" => 0,
"2011-01-01 11:00:00" => 0,
"2011-01-01 11:30:00" => 0,
"2011-01-01 12:00:00" => 0
)
In the above example, a range of dates is generated using an interval of 30 minutes.
Example of records from the MySQL database:
+---------------------+----------+
| datetime | duration |
+---------------------+----------+
| 2011-01-01 09:02:26 | 1 |
| 2011-01-01 09:14:51 | 1 |
| 2011-01-01 10:40:33 | 549 |
| 2011-01-01 11:10:27 | 38 |
| 2011-01-01 11:31:50 | 82 |
+---------------------+----------+
Each of these records now needs to be matched to the closest datetime key from the pre-generated array given above and the duration
value added to the existing value of the match.
The Problem:
It's easy enough constructing two nested for
loops to interate through the records from the database and then linearly run through the associative array to find a match, but this is hugely inefficient, and becomes problematic for large datasets (think bubblesort, that's what this would be roughly equivalent to). A slightly better approach is to linearly loop over the records from the database, and then iterate over the array as a binary tree, which is certainly a lot more efficient, and is possible since both arrays are sorted chronologically.
The Question:
Is there a more efficient way of handling this date matching than how I've described in the above problem?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
将日期的
UNIX_TIMESTAMP()
除以 30 * 60 秒(30 分钟)并使用ROUND()
作为整数怎么样?然后,使用GROUP BY
对它们进行分组,最后使用SUM()
对持续时间求和。What about dividing the
UNIX_TIMESTAMP()
of the date by 30 * 60 seconds (30 minutes) and usingROUND()
for the integer. Then, useGROUP BY
to group them, and finally,SUM()
to sum the durations.您当前的算法(如果我理解正确的话)如下所示:
更有效的方法可能是:
这允许 MySQL 提供优化的数学逻辑并减少脚本所需的数组迭代次数。您将增加数据库查询的数量,但基准测试会告诉您这种权衡是否值得。
Your current algorithm (if I understand you correctly) looks like this:
A more efficient method may be:
This allows MySQL to provide optimized math logic and reduces the number of array iterations required by your script. You will increase the number of database queries, but benchmarking will tell you if the tradeoff is worthwhile.