如何以编程方式快速构建即席查询?

发布于 2024-07-05 14:52:32 字数 247 浏览 10 评论 0原文

我使用 Excel 数据透视表来分析数据库中的数据,因为它允许我非常快速地“切片和切块”。 正如我们知道数据库表中的内容一样,我们都可以编写 SQL 查询来执行数据透视表的操作。

但我想知道为什么数据透视表可以如此快地构造查询,而它对数据以及我们提供的数据字段之间的含义/关系一无所知?

换句话说,我们如何以如此快速有效的方式构建即席 SQL 查询? (“当然使用数据透视表!”,是的,但我想要的是一种编程方式)。

I've used Excel PivotTable to analyze data from my database because it allows me to "slice and dice" very quickly. As we know what is in our database tables, we all can write SQL queries that do what PivotTable does.

But I am wondering why PivotTable can construct the queries so fast while it knows nothing about the data and the meanings/relationship between the data fields we give it?

Put the question in another way, how can we build ad-hoc SQL queries in such a fast and efficient way? ("Use PivotTable, of course!", yep, but what I want is a programmatic way).

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

自控 2024-07-12 14:52:32

只需根据需要操纵您的订单和组条款即可。

Excel 速度很快,因为所有数据都在内存中,并且可以快速高效地排序。

Just manipulate your order and group clauses as necessary.

Excel is fast because all the data is in memory, and it can be sorted fast and efficiently.

忆伤 2024-07-12 14:52:32

@Mark Ransom 绝对支持 Excel 将数据保存在内存中的概念,从而使其计算速度更快。 Excel 也可能以某种方式对数据集进行预索引,使其比数据库的响应速度更快。

它之所以更快,有一个重要的、非算法的可能性:Excel 在数据透视表的使用中没有联接的概念。 当您从数据库中临时获取数据时,表之间的任何联接或关联都将导致进一步的查找、扫描、索引加载等。由于 Excel 将所有数据存储在单个位置(RAM 或没有),因此它可以无需预先形成数据集即可执行查找。 如果您要将数据库数据加载到临时表中,那么了解针对该表的即席查询在性能方面与 Excel 相比如何叠加会很有趣。

不过,有一点是肯定的:尽管数据库是生成准确报告的出色工具,但传统规范化的数据库对于即席查询来说远达不到最佳效果。 因为规范化数据结构首先注重完整性(如果我可以冒昧的话),所以它们牺牲了临时优化,但以保持所有数据的合理性为代价。 虽然这是一个很糟糕的例子,但请考虑这个标准化模式:

+--------+     +---------+
|tblUsers|     |luGenders|
+--------+     +---------+
|userID  |     |genderID |
|genderID||gender   |
+--------+     +---------+

SELECT * FROM luGenders;
> 1 Female
> 2 Male

在这个例子中,如果我们希望知道系统中女性/男性用户的数量,数据库将需要处理连接并采取相应的行为(同样,这是一个由于连接数量较少和可能值数量较少,因此是一个不好的例子,这通常会带来一些数据库引擎优化)。 但是,如果您要将这些数据转储到 Excel,您仍然会因提取数据而遭受一些数据库损失,但实际上在 Excel 中转换数据会相当快。 您认为 Excel 比直接临时查询更快的想法可能忽略了这种预先固定成本惩罚的概念,但我没有数据可以发表评论。

然而,最切题的一点是,虽然通用数据库有利于准确性,但它们通常在临时报告方面表现不佳。 为了生成临时报告,通常需要以更可查询的结构对数据进行反规范化(“仓库”)。 查找有关数据仓库的信息将提供很多关于该主题的良好结果。

这个故事的寓意是:拥有一个完全算法化的、快速的即席查询系统是一个很棒的理想,但由于空间和时间的限制(内存和工时),它不太实用。 为了有效地生成临时系统,您确实需要了解数据的用例,然后有效地对其进行非规范化。

我强烈推荐数据仓库工具包。 郑重声明,我不是 DBA,我只是一名低级分析师,每周花费 80 个小时研究 Excel 和 Oracle。 我知道你的痛苦。

@Mark Ransom is definitely onto something with the notion of Excel keeping the data in memory, making it faster computationally. It's also possible that Excel pre-indexes datasets in such a way that makes it more responsive than your database.

There's one significant, non-algorithmic possibility for why it's faster: Excel, in Pivot Table usage, has no concept of a join. When you're fetching the data ad hoc from your database, any joins or correlations between tables will result in further lookups, scans, index loads, etc. Since Excel has all the data in a single location (RAM or no), it can perform lookups without having to pre-form datasets. If you were to load your database data into a temp table, it would be interesting to see how ad hoc queries against that table stacked up, performance-wise, against Excel.

One thing's certain, though: although databases are excellent tools for producing accurate reports, a traditionally-normalized database will be far less than optimal for ad hoc queries. Because normalized data structures focus on integrity above all else (if I may take that liberty), they sacrifice ad hoc optimization at the expense of keeping all the data sensible. Although this is a poor example, consider this normalized schema:

+--------+     +---------+
|tblUsers|     |luGenders|
+--------+     +---------+
|userID  |     |genderID |
|genderID||gender   |
+--------+     +---------+

SELECT * FROM luGenders;
> 1 Female
> 2 Male

If, in this example, we wished to know the number of female/male users in our system, the database would need to process the join and behave accordingly (again, this is a bad example due to the low number of joins and low number of possible values, which generally should bring about some database-engine optimisation). However, if you were to dump this data to Excel, you'd still incur some database penalty to pull the data, but actually pivoting the data in Excel would be fairly speedy. It could be that this notion of up-front, fixed-cost penalty is being missed by your idea of Excel being quicker than straight ad hoc queries, but I don't have the data to comment.

The most tangential point, though, is that while general databases are good for accuracy, they often suck at ad hoc reports. To produce ad hoc reports, it's often necessary to de-normalize ("warehouse") the data in a more queryable structure. Looking up info on data warehousing will provide a lot of good results on the subject.

Moral of the story: having a fully algorithmic, fast ad hoc query system is an awesome ideal, but is less than practical given space and time constraints (memory and people-hours). To effectively generate an ad hoc system, you really need to understand the use cases of your data, and then denormalize it effectively.

I'd highly recommend The Data Warehouse Toolkit. For the record, I'm no DBA, I'm just a lowly analyst who spends 80 hours per week munging Excel and Oracle. I know your pain.

耳钉梦 2024-07-12 14:52:32

我的直觉告诉我,答案与数据透视表大纲有关,该大纲具有固定数量的区域,即:

- the Page Fields zone  
- the Column Fields zone  
- the Row Fields zone and
- the Data zone

在我的疯狂猜测中:

- The Page zone builds the WHERE part of the ad-hoc query.  
- The Column zone will put whichever fields drag-dropped to it in the GROUP BY clause.  
- The Row zone will build a SELECT DISTINCT <field names>
- The Data zone will apply an AGGREGATE function to the field drag-dropped to it. 

您认为“在数据透视表背后”会发生什么?当我们将字段拖到这些区域时的场景”?

My intuitive feeling tells me that the answer would have something to do with a Pivot Table outline, which has a fixed number of zones, namely:

- the Page Fields zone  
- the Column Fields zone  
- the Row Fields zone and
- the Data zone

In my wild guess:

- The Page zone builds the WHERE part of the ad-hoc query.  
- The Column zone will put whichever fields drag-dropped to it in the GROUP BY clause.  
- The Row zone will build a SELECT DISTINCT <field names>
- The Data zone will apply an AGGREGATE function to the field drag-dropped to it. 

What do you think would happen "behind the scene" when we drag fields to those zones?

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文