Performing data analysis in Sql Server or .Net?
I have some data analysis that needs to be performed. On average, it will involve somewhere between 50K and 150K rows. From these rows I need to extract Sum(X) and Count(X) based on five different criteria. There are two ways of going about it:
- Write 10 different queries, each one designed to aggregate the data from column X using Sum() or Count(). Run each one and retrieve the result using SqlCommand.ExecuteScalar().
- Create a custom object to hold all of the parameters needed to evaluate the different conditions. Run one query, using SqlCommand.ExecuteDataReader(), that returns all of the data making up the superset of the different conditional subsets. Read each row from the DataReader into a new object, adding each one to a List collection. Once all of the data is retrieved, use LINQ to Objects to compute the Sum() and Count() values for each condition (see the sketch after this list).
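To make the data-shape difference concrete, here is a rough sketch of what each option sends to the server. The Sales table, the X value column, and the Region predicate are hypothetical stand-ins for the real schema and one of the five criteria:

```sql
-- Option 1: one of the ten scalar queries. The server does the
-- aggregation and only a single value crosses the network.
SELECT SUM(X) FROM Sales WHERE Region = 'North';

-- Option 2: the single superset query. Every row that any of the
-- criteria might need crosses the network, and .Net aggregates it.
SELECT X, Region
FROM Sales
WHERE Region IN ('North', 'South', 'East', 'West', 'Central');
```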
I know that I could try each one out to see which is fastest, but I am interested in the community's advice on which one is likely to be faster. Assume Sql Server and the web server are each running on their own machine, each with sufficient memory.
Right now I am leaning towards option 1. Even though there are many more queries to the DB, the DB itself does all of the aggregation work and very little data passes between Sql Server and the web server. With option 2 there is only one query, but it will pass a very large amount of data to .Net, which then has to do all of the heavy lifting for the aggregate functions (and though I don't have anything to base it on, I suspect that Sql Server is more efficient at running these kinds of big aggregations).
Any thoughts on which way to go (or a third option that I am missing)?
Comments (2)
The database is generally the best option for analysing data like this, and I'd go for #1.
However, I'd also consider a third option, in which you create a single query that you pass to SQL (instead of ten). This could involve putting CASE statements in the aggregate functions, so that you can do all the work in a single pass of the data.
Consider the old way of doing PIVOT:
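A minimal sketch of that pattern, again assuming the hypothetical Sales table with a value column X and a Region column standing in for the criteria:

```sql
-- Conditional aggregation ("the old way of doing PIVOT"): one pass
-- over the data computes every Sum/Count pair at once. Table and
-- column names are hypothetical.
SELECT
    SUM(CASE WHEN Region = 'North' THEN X ELSE 0 END) AS SumNorth,
    COUNT(CASE WHEN Region = 'North' THEN 1 END)      AS CountNorth,
    SUM(CASE WHEN Region = 'South' THEN X ELSE 0 END) AS SumSouth,
    COUNT(CASE WHEN Region = 'South' THEN 1 END)      AS CountSouth
    -- ...one SUM/COUNT pair for each remaining criterion
FROM Sales;
```

Because every aggregate comes from the same scan, the 50K-150K rows are read once instead of ten times, and only ten scalar values are returned to the client.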
You know, I'd go with option 1. 50-150k rows is not really that many rows IMO, especially if you haven't got too many columns.
Once you start talking millions of rows, I'd start thinking about optimising.
Another thing: as always with databases, make sure your query hits your indexes correctly. That matters much more than the difference between your two ideas.
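For example (a sketch, again using the hypothetical Sales schema from above), a covering index lets SQL Server answer the aggregate queries from the index alone:

```sql
-- The criterion column is the index key and X is carried along in the
-- leaf pages, so Sum(X)/Count(X) never have to touch the base table.
CREATE NONCLUSTERED INDEX IX_Sales_Region_X
    ON Sales (Region)
    INCLUDE (X);
```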