C# IEnumerator/yield structure potentially bad?
Background: I've got a bunch of strings that I'm getting from a database, and I want to return them. Traditionally, it would be something like this:
public List<string> GetStuff(string connectionString)
{
    List<string> categoryList = new List<string>();
    using (SqlConnection sqlConnection = new SqlConnection(connectionString))
    {
        string commandText = "GetStuff";
        using (SqlCommand sqlCommand = new SqlCommand(commandText, sqlConnection))
        {
            sqlCommand.CommandType = CommandType.StoredProcedure;
            sqlConnection.Open();
            SqlDataReader sqlDataReader = sqlCommand.ExecuteReader();
            while (sqlDataReader.Read())
            {
                categoryList.Add(sqlDataReader["myImportantColumn"].ToString());
            }
        }
    }
    return categoryList;
}
But then I figure the consumer is going to want to iterate through the items and doesn't care about much else, and I'd like to not box myself in to a List, per se, so if I return an IEnumerable everything is good/flexible. So I was thinking I could use a "yield return" type design to handle this...something like this:
public IEnumerable<string> GetStuff(string connectionString)
{
    using (SqlConnection sqlConnection = new SqlConnection(connectionString))
    {
        string commandText = "GetStuff";
        using (SqlCommand sqlCommand = new SqlCommand(commandText, sqlConnection))
        {
            sqlCommand.CommandType = CommandType.StoredProcedure;
            sqlConnection.Open();
            SqlDataReader sqlDataReader = sqlCommand.ExecuteReader();
            while (sqlDataReader.Read())
            {
                yield return sqlDataReader["myImportantColumn"].ToString();
            }
        }
    }
}
But now that I'm reading a bit more about yield (on sites like this...msdn didn't seem to mention this), it's apparently a lazy evaluator, that keeps the state of the populator around, in anticipation of someone asking for the next value, and then only running it until it returns the next value.
This seems fine in most cases, but with a DB call, this sounds a bit dicey. As a somewhat contrived example, if someone asks for an IEnumerable that I'm populating from a DB call, gets through half of it, and then gets stuck in a loop...as far as I can see my DB connection is going to stay open forever.
Sounds like asking for trouble in some cases if the iterator doesn't finish...am I missing something?
Comments (11)
It's a balancing act: do you want to force all the data into memory immediately so you can free up the connection, or do you want to benefit from streaming the data, at the cost of tying up the connection for all that time?
The way I look at it, that decision should potentially be up to the caller, who knows more about what they want to do. If you write the code using an iterator block, the caller can very easily turn that streaming form into a fully-buffered form:
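The code for this step did not survive on this page, so the following is only a minimal sketch of the idea. It assumes a stand-in GetStuff that yields fixed strings instead of reading from a database; the helper name GetStuffBuffered is hypothetical. Passing the streamed sequence to the List<T> constructor runs the iterator to completion.

```csharp
using System;
using System.Collections.Generic;

public static class BufferingDemo
{
    // Stand-in for the streaming GetStuff from the question.
    // Assumption: fixed sample values replace the SqlDataReader loop.
    public static IEnumerable<string> GetStuff()
    {
        yield return "alpha";
        yield return "beta";
        yield return "gamma";
    }

    // Hypothetical helper: calls the streamed version and buffers it fully.
    // The List<T> constructor enumerates the whole sequence and then
    // disposes the enumerator, so the iterator's cleanup has run by return.
    public static List<string> GetStuffBuffered()
    {
        return new List<string>(GetStuff());
    }

    public static void Main()
    {
        List<string> buffered = GetStuffBuffered();
        Console.WriteLine(buffered.Count);
    }
}
```

A caller who wants streaming keeps calling GetStuff directly; a caller who wants the connection released quickly uses the buffered form.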
If, on the other hand, you do the buffering yourself, there's no way the caller can go back to a streaming model.
So I'd probably use the streaming model and say explicitly in the documentation what it does, and advise the caller to decide appropriately. You might even want to provide a helper method to basically call the streamed version and convert it into a list.
Of course, if you don't trust your callers to make the appropriate decision, and you have good reason to believe that they'll never really want to stream the data (e.g. it's never going to return much anyway) then go for the list approach. Either way, document it - it could very well affect how the return value is used.
Another option for dealing with large amounts of data is to use batches, of course - that's thinking somewhat away from the original question, but it's a different approach to consider in the situation where streaming would normally be attractive.
You're not always unsafe with IEnumerable. If you leave it to the framework to call GetEnumerator (which is what most people will do), then you're safe. Basically, you're only as safe as the code using your method is careful.
Whether you can afford to leave the database connection open or not depends on your architecture as well. If the caller participates in a transaction (and your connection is auto-enlisted), then the connection will be kept open by the framework anyway.
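To illustrate the point about careful callers, here is a small sketch that contrasts foreach with a hand-rolled GetEnumerator call. FakeConnection is a hypothetical stand-in for SqlConnection that only counts open instances; nothing here talks to a real database.

```csharp
using System;
using System.Collections.Generic;

public static class ForeachSafetyDemo
{
    // Hypothetical stand-in for SqlConnection: counts instances not yet disposed.
    public sealed class FakeConnection : IDisposable
    {
        public static int OpenCount;
        public FakeConnection() { OpenCount++; }
        public void Dispose() { OpenCount--; }
    }

    public static IEnumerable<int> GetRows()
    {
        using (new FakeConnection())
        {
            for (int i = 1; i <= 1000; i++)
                yield return i;
        }
    }

    // Careful caller: foreach disposes the enumerator even on an early break,
    // which runs the pending using block inside GetRows.
    public static void ReadFirstTen()
    {
        int read = 0;
        foreach (int row in GetRows())
        {
            read++;
            if (read == 10) break;
        }
    }

    // Careless caller: obtains the enumerator by hand and never disposes it,
    // so the fake connection stays open.
    public static void ReadOneCarelessly()
    {
        IEnumerator<int> e = GetRows().GetEnumerator();
        e.MoveNext();
    }

    public static void Main()
    {
        ReadFirstTen();
        Console.WriteLine(FakeConnection.OpenCount); // 0
        ReadOneCarelessly();
        Console.WriteLine(FakeConnection.OpenCount); // 1
    }
}
```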
Another advantage of yield (when using a server-side cursor) is that your code doesn't have to read all the data (example: 1,000 items) from the database if your consumer wants to get out of the loop earlier (example: after the 10th item). This can speed up querying data, especially in an Oracle environment, where server-side cursors are the common way to retrieve data.
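The early-exit effect can be seen without a database. In this sketch, StreamRows is a hypothetical stand-in for a cursor-backed reader and RowsFetched counts how many rows were actually produced; only the rows the consumer asks for are ever read.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EarlyExitDemo
{
    // Counts how many rows the producer actually generated.
    public static int RowsFetched;

    // Hypothetical stand-in for reading rows through a server-side cursor.
    public static IEnumerable<int> StreamRows(int total)
    {
        for (int i = 1; i <= total; i++)
        {
            RowsFetched++;
            yield return i;
        }
    }

    public static void Main()
    {
        // The consumer only wants the first 10 of 1,000 rows.
        List<int> firstTen = StreamRows(1000).Take(10).ToList();
        Console.WriteLine(firstTen.Count);
        Console.WriteLine(RowsFetched);
    }
}
```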
You are not missing anything. Your sample shows how NOT to use yield return. Add the items to a list, close the connection, and return the list. Your method signature can still return IEnumerable.
Edit: That said, Jon has a point (so surprised!): there are rare occasions where streaming is actually the best thing to do from a performance perspective. After all, if it's 100,000 (1,000,000? 10,000,000?) rows we're talking about here, you don't want to be loading that all into memory first.
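A minimal sketch of the buffered variant with an IEnumerable signature follows. FakeReader is a hypothetical stand-in for the connection and reader pair, used so the example runs without a database.

```csharp
using System;
using System.Collections.Generic;

public static class BufferedEnumerableDemo
{
    // Hypothetical stand-in for SqlConnection + SqlDataReader.
    public sealed class FakeReader : IDisposable
    {
        public static bool IsOpen;
        private readonly string[] rows = { "a", "b", "c" };
        private int index = -1;
        public FakeReader() { IsOpen = true; }
        public bool Read() { return ++index < rows.Length; }
        public string Current { get { return rows[index]; } }
        public void Dispose() { IsOpen = false; }
    }

    // Returns IEnumerable<string>, but everything is read and the reader
    // is closed before the method returns. No yield is involved.
    public static IEnumerable<string> GetStuff()
    {
        var categoryList = new List<string>();
        using (var reader = new FakeReader())
        {
            while (reader.Read())
                categoryList.Add(reader.Current);
        }
        return categoryList;
    }

    public static void Main()
    {
        IEnumerable<string> stuff = GetStuff();
        Console.WriteLine(FakeReader.IsOpen); // False: closed before iteration starts
        foreach (string s in stuff)
            Console.WriteLine(s);
    }
}
```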
As an aside - note that the IEnumerable<T> approach is essentially what the LINQ providers (LINQ-to-SQL, LINQ-to-Entities) do for a living. The approach has advantages, as Jon says. However, there are definite problems too - in particular (for me) in terms of (the combination of) separation | abstraction.
What I mean here is that:
.ToList() etc)
This ties in a bit with my thoughts here: Pragmatic LINQ.
But I should stress - there are definitely times when the streaming is highly desirable. It isn't a simple "always vs never" thing...
Slightly more concise way to force evaluation of iterator:
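The snippet for this answer is missing from the page. A likely candidate, offered here as an assumption rather than the answerer's exact code, is the LINQ ToList extension method. GetStuff below is a stand-in that yields fixed values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ToListDemo
{
    // Stand-in for the streaming GetStuff from the question.
    public static IEnumerable<string> GetStuff()
    {
        yield return "one";
        yield return "two";
    }

    public static void Main()
    {
        // ToList() enumerates the iterator to the end and buffers the results.
        List<string> buffer = GetStuff().ToList();
        Console.WriteLine(buffer.Count);
    }
}
```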
No, you are on the right path... the yield will lock the reader... you can test it by making another database call while iterating the IEnumerable.
The only way this would cause problems is if the caller abuses the protocol of IEnumerable<T>. The correct way to use it is to call Dispose on its enumerator when it is no longer needed.
The implementation generated by yield return takes the Dispose call as a signal to execute any open finally blocks, which in your example will call Dispose on the objects you've created in the using statements.
There are a number of language features (in particular foreach) which make it very easy to use IEnumerable<T> correctly.
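The Dispose behaviour described above can be observed directly. This sketch uses a try/finally inside the iterator in place of the using statements, and a Log list (a hypothetical helper) to record the order of events.

```csharp
using System;
using System.Collections.Generic;

public static class DisposeProtocolDemo
{
    public static readonly List<string> Log = new List<string>();

    public static IEnumerable<string> GetStuff()
    {
        Log.Add("open");
        try
        {
            yield return "first";
            yield return "second";
        }
        finally
        {
            // Plays the role of the Dispose call a using block would make.
            Log.Add("closed");
        }
    }

    // Reads one item and then disposes the enumerator, as foreach would.
    public static void ReadOneThenDispose()
    {
        IEnumerator<string> e = GetStuff().GetEnumerator();
        try
        {
            e.MoveNext();
            Log.Add("read " + e.Current);
        }
        finally
        {
            e.Dispose(); // runs the iterator's pending finally block
        }
    }

    public static void Main()
    {
        ReadOneThenDispose();
        Console.WriteLine(string.Join(", ", Log)); // open, read first, closed
    }
}
```

The iterator never reaches "second", yet its cleanup still runs because the enumerator was disposed.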
You could always use a separate thread to buffer the data (perhaps to a queue) while also doing a yield to return the data. When the user requests data (returned via a yield), an item is removed from the queue. Data is also being continuously added to the queue via the separate thread. That way, if the user requests the data fast enough, the queue is never very full and you do not have to worry about memory issues. If they don't, then the queue will fill up, which may not be so bad. If there is some sort of limitation you would like to impose on memory, you could enforce a maximum queue size (at which point the other thread would wait for items to be removed before adding more to the queue). Naturally, you will want to make sure you handle resources (i.e., the queue) correctly between the two threads.
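One way this could look, as a sketch only: a bounded BlockingCollection<T> filled by a background task, with the consumer side exposed through yield. The helper name Buffer and its parameters are hypothetical, and the cancellation handling is one possible choice for the case where the consumer stops early.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class BufferedStream
{
    // Hypothetical helper: reads 'source' on a background task into a queue
    // holding at most 'maxQueueSize' items, and yields items to the caller.
    public static IEnumerable<T> Buffer<T>(IEnumerable<T> source, int maxQueueSize)
    {
        using (var cts = new CancellationTokenSource())
        using (var queue = new BlockingCollection<T>(maxQueueSize))
        {
            CancellationToken token = cts.Token;
            Task producer = Task.Run(() =>
            {
                try
                {
                    foreach (T item in source)
                        queue.Add(item, token); // blocks while the queue is full
                }
                catch (OperationCanceledException)
                {
                    // The consumer stopped early; stop reading the source.
                }
                finally
                {
                    queue.CompleteAdding(); // lets the consumer loop end
                }
            });

            try
            {
                foreach (T item in queue.GetConsumingEnumerable())
                    yield return item;
            }
            finally
            {
                // Runs on normal completion and when the caller disposes early.
                cts.Cancel();
                producer.Wait();
            }
        }
    }

    public static void Main()
    {
        foreach (int n in Buffer(new[] { 1, 2, 3, 4, 5 }, 2))
            Console.WriteLine(n);
    }
}
```

The source is enumerated on the background task, so a source that holds a database connection releases it as soon as the producer finishes or is cancelled, not when the consumer finishes.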
As an alternative, you could force the user to pass in a boolean to indicate whether or not the data should be buffered. If true, the data is buffered and the connection is closed as soon as possible. If false, the data is not buffered and the database connection stays open as long as the user needs it to be. Having a boolean parameter forces the user to make the choice, which ensures they know about the issue.
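A sketch of the boolean-parameter variant, under the assumption that a private streaming iterator already exists. StreamStuff and the OpenConnections counter are hypothetical stand-ins for the real database access.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BufferFlagDemo
{
    // Stand-in for the number of open database connections.
    public static int OpenConnections;

    // Hypothetical streaming iterator; the finally block mimics closing the connection.
    private static IEnumerable<string> StreamStuff()
    {
        OpenConnections++;
        try
        {
            yield return "a";
            yield return "b";
        }
        finally
        {
            OpenConnections--;
        }
    }

    // buffered == true:  read everything now, so the connection closes before return.
    // buffered == false: return the lazy sequence; the connection is open only
    //                    while the caller is iterating.
    public static IEnumerable<string> GetStuff(bool buffered)
    {
        IEnumerable<string> stream = StreamStuff();
        if (buffered)
            return stream.ToList();
        return stream;
    }

    public static void Main()
    {
        IEnumerable<string> eager = GetStuff(true);
        Console.WriteLine(OpenConnections); // 0
        Console.WriteLine(eager.Count());
    }
}
```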
I've bumped into this wall a few times. SQL database queries are not easily streamable like files. Instead, query only as much as you think you'll need and return it as whatever container you want (IList<>, DataTable, etc.). IEnumerable won't help you here.
What you can do is use a SqlDataAdapter instead and fill a DataTable. Something like this:
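The answer's own code is missing from this page, so what follows is a reconstruction under stated assumptions rather than the original. It assumes the System.Data.SqlClient types are available (built into .NET Framework, a separate package on newer runtimes), reuses the stored procedure and column names from the question, and splits the work into two hypothetical helpers, LoadTable and ReadColumn.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

public static class DataTableApproach
{
    public static IEnumerable<string> GetStuff(string connectionString)
    {
        // Eager: the query runs and the connection closes here.
        DataTable table = LoadTable(connectionString);
        // Lazy: rows are converted to strings only as the caller iterates.
        return ReadColumn(table, "myImportantColumn");
    }

    private static DataTable LoadTable(string connectionString)
    {
        var table = new DataTable();
        using (var sqlConnection = new SqlConnection(connectionString))
        using (var sqlCommand = new SqlCommand("GetStuff", sqlConnection))
        using (var adapter = new SqlDataAdapter(sqlCommand))
        {
            sqlCommand.CommandType = CommandType.StoredProcedure;
            // Fill opens the connection if it is closed and closes it again afterwards.
            adapter.Fill(table);
        }
        return table;
    }

    public static IEnumerable<string> ReadColumn(DataTable table, string columnName)
    {
        foreach (DataRow row in table.Rows)
            yield return row[columnName].ToString();
    }
}
```

Keeping the query outside the iterator block matters: if the whole method were an iterator, the query itself would be deferred until the first item was requested.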
This way, you're querying everything in one shot, and closing the connection immediately, yet you're still lazily iterating the result. Furthermore, the caller of this method can't cast the result to a List and do something they shouldn't be doing.
Don't use yield here. Your sample is fine.