在应用时,其中'对于时间戳,直接应用'提取'显示出不同的结果(附加的代码)
根据任务 5 的 Kaggle 练习(编写查询): https://www.kaggle.com/code/setthawutkulsrisuwan/exercise-as-与 我回答了2种方式:
- 使用 WHERE EXTRACT() 查询以获取年份和月份,答案为 不正确:
WITH RelevantRides AS
(
SELECT EXTRACT(HOUR from trip_start_timestamp) as hour_of_day, trip_seconds, trip_miles
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE EXTRACT(YEAR from trip_start_timestamp) = 2017 AND
EXTRACT(MONTH from trip_start_timestamp) BETWEEN 1 and 6 AND
trip_seconds > 0 AND
trip_miles > 0
)
SELECT hour_of_day,
COUNT(1) as num_trips,
3600 * SUM(trip_miles) / SUM(trip_seconds) as avg_mph
FROM RelevantRides
GROUP BY hour_of_day
ORDER BY hour_of_day
- 使用直接列名查询以获取年份和月份,答案是正确。:
WITH RelevantRides AS
(
SELECT EXTRACT(HOUR from trip_start_timestamp) AS hour_of_day, trip_seconds, trip_miles
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE trip_start_timestamp > '2017-01-01' AND
trip_start_timestamp < '2017-07-01' AND
trip_seconds > 0 AND
trip_miles > 0
)
SELECT hour_of_day,
COUNT(1) as num_trips,
3600 * SUM(trip_miles) / SUM(trip_seconds) as avg_mph
FROM RelevantRides
GROUP BY hour_of_day
ORDER BY hour_of_day
主要区别是第一个是
WHERE EXTRACT(YEAR from trip_start_timestamp) = 2017
AND EXTRACT(MONTH from trip_start_timestamp) BETWEEN 1 and 6
,第二个是
WHERE trip_start_timestamp > '2017-01-01' AND
trip_start_timestamp < '2017-07-01'
。
在我看来,它们的结果应该与使用 EXTRACT() 查询相同,显示 2017 年和 1 到 6 月份,与使用直接列名称查询相同;然而,结果并不相同。
请解释这些背后的原因。 谢谢。
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
您正在将恒定日期与时间戳进行比较。恒定日期实际上是类似于
2022-04-07 00:00:00
的时间戳。因此,当您想要获取一月到六月日期范围内的所有记录时,您需要:
换句话说,您希望该范围第一天午夜或之后的所有记录以及最后一天后一天午夜(但不包括午夜)的所有记录。在数学符号中,您希望日期在 [2017-01-01, 2017-07-01) 范围内。范围的开头是封闭的,结尾是开放的。
像这样的代码给出了正确的结果。
但它无法利用
trip_start_timestamp
列上的索引,因此在生产中效率不高。You're comparing constant dates to timestamps. Constant dates are actually timestamps looking like
2022-04-07 00:00:00
.So when you want to get all records in a date range January to June you need:
In other words you want everything on or after midnight on the first day of the range and everything up to but not including midnight on the day after the last day. In mathematical notation you want the dates in the range [2017-01-01, 2017-07-01). The beginning of the range is closed and the end is open.
Your code like this gives a correct result.
But it can't exploit an index on your
trip_start_timestamp
column, so it won't be efficient in production.