C# 中的 DataTable.Select 和性能问题

发布于 2024-09-10 17:50:39 字数 2502 浏览 5 评论 0原文

我从数据表中的三个制表符分隔文件导入数据,之后我需要遍历主表的每一行并查找两个子表中的所有行。针对我从子表中找到的每个 DataRow[] 数组,我必须再次单独遍历每一行并根据不同的参数检查值,最后我需要创建一个最终记录,该记录将合并主表和两个子表表列。 现在我已经做到了这一点并且它可以工作,但问题是它的性能。我正在使用 DataTable.Select 从子表中查找所有子行,我认为这会使其非常慢。 请记住,所有表都没有主键,因为重复行是可以接受的。 目前,我的主表中有 1200 行,子表中有大约 8000 行,完成此操作所需的总时间是 8 分钟。

知道如何提高性能。 提前致谢

代码如下****************

 DataTable rawMasterdt = importMasterFile();
 DataTable rawDespdt = importDescriptionFile();

        dsHelper = new DataSetHelper();
        DataTable distinctdt = new DataTable();
        distinctdt = dsHelper.SelectDistinct("DistinctOffers", rawMasterdt, "C1");

        if (distinctdt.Rows.Count > 0)
        {
            int count = 0;
                foreach (DataRow offer in distinctdt.Rows)
                {
                    string exp = "C1 = " + "'" + offer[0].ToString() + "'" + "";
                    DataRow masterRow = rawMasterdt.Select(exp)[0];

                    count++;
                    txtBlock1.Text = "Importing Offer " + count.ToString() + " of " + distinctdt.Rows.Count.ToString(); 
                    if (masterRow != null )
                        {
                            Product newProduct = new Product();

                            newProduct.Code = masterRow["C4"].ToString();
                            newProduct.Name = masterRow["C5"].ToString();
                          //  -----
                            newProduct.Description = getProductDescription(offer[0].ToString(), rawDespdt);
                            newProduct.Weight = getProductWeight(offer[0].ToString(), rawDespdt);
                            newProduct.Price = getProductRetailPrice(offer[0].ToString(), rawDespdt);
                            newProduct.UnitPrice = getProductUnitPrice(offer[0].ToString(), rawDespdt);
                          //  ------- more functions similar to above here

                            productList.Add(newProduct);
                        }
                }
                txtBlock1.Text = "Import Completed";
 public string getProductDescription(string offercode, DataTable dsp)
    {
        string exp = "((C1 = " + "'" + offercode + "')" + " AND ( C6 = 'c' ))";
        DataRow[] dRows = dsp.Select( exp);
        string descrip = "";
        if (dRows.Length > 0)
        { 
            for (int i = 0; i < dRows.Length - 1; i++)
            {
              descrip = descrip + " " + dRows[i]["C12"];
            }
        }
        return descrip;

    }

I'm importing the data from three Tab delimited files in the DataTables and after that I need to go thru every row of master table and find all the rows in two child tables. Against each DataRow[] array I found from the child tables, I have to again go thru individually each row and check the values based upon different paramenters and at the end I need to create a final record which will be merger of master and two child table columns.
Now I have done that and its working but the problem is its Performance. I'm using the DataTable.Select to find all child rows from child table which I believe making it very slow.
Please remember the None of the table has any Primary key as the duplicate rows are acceptable.
At the moment I have 1200 rows in master table and aroun 8000 rows in child table and the total time it takes to do that is 8 minutes.

Any idea how can I increase the Performance.
Thanks in advance

The code is below ***************

 DataTable rawMasterdt = importMasterFile();
 DataTable rawDespdt = importDescriptionFile();

        dsHelper = new DataSetHelper();
        DataTable distinctdt = new DataTable();
        distinctdt = dsHelper.SelectDistinct("DistinctOffers", rawMasterdt, "C1");

        if (distinctdt.Rows.Count > 0)
        {
            int count = 0;
                foreach (DataRow offer in distinctdt.Rows)
                {
                    string exp = "C1 = " + "'" + offer[0].ToString() + "'" + "";
                    DataRow masterRow = rawMasterdt.Select(exp)[0];

                    count++;
                    txtBlock1.Text = "Importing Offer " + count.ToString() + " of " + distinctdt.Rows.Count.ToString(); 
                    if (masterRow != null )
                        {
                            Product newProduct = new Product();

                            newProduct.Code = masterRow["C4"].ToString();
                            newProduct.Name = masterRow["C5"].ToString();
                          //  -----
                            newProduct.Description = getProductDescription(offer[0].ToString(), rawDespdt);
                            newProduct.Weight = getProductWeight(offer[0].ToString(), rawDespdt);
                            newProduct.Price = getProductRetailPrice(offer[0].ToString(), rawDespdt);
                            newProduct.UnitPrice = getProductUnitPrice(offer[0].ToString(), rawDespdt);
                          //  ------- more functions similar to above here

                            productList.Add(newProduct);
                        }
                }
                txtBlock1.Text = "Import Completed";
 public string getProductDescription(string offercode, DataTable dsp)
    {
        string exp = "((C1 = " + "'" + offercode + "')" + " AND ( C6 = 'c' ))";
        DataRow[] dRows = dsp.Select( exp);
        string descrip = "";
        if (dRows.Length > 0)
        { 
            for (int i = 0; i < dRows.Length - 1; i++)
            {
              descrip = descrip + " " + dRows[i]["C12"];
            }
        }
        return descrip;

    }

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

疯了 2024-09-17 17:50:39

.Net 4.5,问题仍然存在。

以下是一个简单基准测试的结果,其中比较了 DataTable.Select 和不同字典实现的 CPU 时间(结果以毫秒为单位)

    #Rows Table.Select  Hashtable[] SortedList[] Dictionary[]
     1000        43,31         0,01         0,06         0,00
     6000       291,73         0,07         0,13         0,01
    11000       604,79         0,04         0,16         0,02
    16000       914,04         0,05         0,19         0,02
    21000      1279,67         0,05         0,19         0,02
    26000      1501,90         0,05         0,17         0,02
    31000      1738,31         0,07         0,20         0,03

问题:

DataTable.Select 方法创建了一个“System.Data.Select”类实例在内部,这个“Select”类根据查询中指定的字段(列)创建索引。 Select 类重复使用它创建的索引,但 DataTable 实现不会重复使用 Select 类实例,因此每次调用 DataTable.Select 时都会重新创建索引。 (可以通过反编译 System.Data 来观察此行为)

解决方案:

假设以下查询

DataRow[] rows = data.Select("COL1 = 'VAL1' AND (COL2 = 'VAL2' OR COL2 IS NULL)");

相反,创建并填充一个字典,其中键对应于用作过滤器的列值的不同值组合。 (这个相对昂贵的操作必须只执行一次,然后必须重新使用字典实例)

Dictionary<string, List<DataRow>> di = new Dictionary<string, List<DataRow>>();

foreach (DataRow dr in data.Rows)
{
    string key = (dr["COL1"] == DBNull.Value ? "<NULL>" : dr["COL1"]) + "//" + (dr["COL2"] == DBNull.Value ? "<NULL>" : dr["COL2"]);
    if (di.ContainsKey(key))
    {
        di[key].Add(dr);
    }
    else
    {
        di.Add(key, new List<DataRow>());
        di[key].Add(dr);
    }
}

查询字典(可能需要多个查询)以过滤行并将结果组合到列表中

string key1 = "VAL1//VAL2";
string key2 = "VAL1//<NULL>";
List<DataRow>() results = new List<DataRow>();
if (di.ContainsKey(key1))
{
    results.AddRange(di[key1]);
}
if (di.ContainsKey(key2))
{
    results.AddRange(di[key2]);
}

.Net 4.5 and the issue is still there.

Here are the results of a simple benchmark where DataTable.Select and different dictionary implementations are compared for CPU time (results are in milliseconds)

    #Rows Table.Select  Hashtable[] SortedList[] Dictionary[]
     1000        43,31         0,01         0,06         0,00
     6000       291,73         0,07         0,13         0,01
    11000       604,79         0,04         0,16         0,02
    16000       914,04         0,05         0,19         0,02
    21000      1279,67         0,05         0,19         0,02
    26000      1501,90         0,05         0,17         0,02
    31000      1738,31         0,07         0,20         0,03

Problem:

The DataTable.Select method creates a "System.Data.Select" class instance internally, and this "Select" class creates indexes based on the fields (columns) specified in the query. The Select class makes re-use of the indexes it had created but the DataTable implementation does not re-use the Select class instance hence the indexes are re-created every time DataTable.Select is invoked. (This behaviour can be observed by decompiling System.Data)

Solution:

Assume the following query

DataRow[] rows = data.Select("COL1 = 'VAL1' AND (COL2 = 'VAL2' OR COL2 IS NULL)");

Instead, create and fill a Dictionary with keys corresponding to the different value combinations of the values of the columns used as the filter. (This relatively expensive operation must be done only once and the dictionary instance must then be re-used)

Dictionary<string, List<DataRow>> di = new Dictionary<string, List<DataRow>>();

foreach (DataRow dr in data.Rows)
{
    string key = (dr["COL1"] == DBNull.Value ? "<NULL>" : dr["COL1"]) + "//" + (dr["COL2"] == DBNull.Value ? "<NULL>" : dr["COL2"]);
    if (di.ContainsKey(key))
    {
        di[key].Add(dr);
    }
    else
    {
        di.Add(key, new List<DataRow>());
        di[key].Add(dr);
    }
}

Query the Dictionary (multiple queries may be required) to filter the rows and combine the results into a List

string key1 = "VAL1//VAL2";
string key2 = "VAL1//<NULL>";
List<DataRow>() results = new List<DataRow>();
if (di.ContainsKey(key1))
{
    results.AddRange(di[key1]);
}
if (di.ContainsKey(key2))
{
    results.AddRange(di[key2]);
}
始终不够 2024-09-17 17:50:39

我知道这是一个老问题,支撑这个问题的代码可能已经改变,但我最近遇到了(并深入了解)这个问题。

对于以后来的任何人......这就是我发现的。

DataTable.Select(condition) 的性能对您提供的“条件”的性质和结构非常敏感。对我来说,这看起来像是一个错误(我应该在哪里向 Microsoft 报告它?),但它可能只是一个怪癖。

我编写了一组测试来演示该问题,其结构如下:

  1. 定义一个包含几个简单列的数据表,如下所示:

    <块引用>

    var dataTable = new DataTable();
    var idCol = dataTable.Columns.Add("Id", typeof(Int32));
    dataTable.Columns.Add("代码", typeof(string));
    dataTable.Columns.Add("名称", typeof(string));
    dataTable.Columns.Add("FormationDate", typeof(DateTime));
    dataTable.Columns.Add("收入", typeof(Decimal));
    dataTable.Columns.Add("ChildCount", typeof(Int32));
    dataTable.Columns.Add("国外", typeof(Boolean));
    dataTable.PrimaryKey = new DataColumn[1] { idCol };

  2. 用 40000 条记录填充表,每条记录都有一个唯一的“代码”字段。

  3. 使用两种类似但格式不同的查询对数据表执行一批“选择”(每个选择具有不同的参数),并记录和比较两种格式所花费的总时间。

您将获得显着的成果。例如,并排测试以下两个条件:

Q1: [Code] = 'XX'

Q2: ([Code] = 'XX')

[ 我使用上述两个查询执行多个 Select 调用,每次迭代我将 XX 替换为数据表中存在的有效代码]
结果呢?

针对 40000 条记录进行 320 次查找的时间比较:不带括号的总搜索时间为 180 毫秒,带括号的搜索总搜索时间为 6871 毫秒

是 - 如果条件周围有额外的括号,速度会慢 38 倍。
还有其他一些反应不同的场景。

例如,
[Code] = '{searchCode}' OR 1=0([Code] = '{searchCode}' OR 1=0) 花费相似(缓慢)的时间执行,但是:

[Code] = '{searchCode}' AND 1=1([Code] = '{searchCode}' AND 1=1) 再次显示不带括号的版本快了近 40 倍。

我没有调查所有场景,但似乎括号的引入 - 要么冗余地围绕简单的比较检查,要么根据需要指定子表达式优先级 - 或者“OR”的存在会大大减慢查询速度。

我可以推测这个问题是由数据表如何解析您使用的条件以及它如何创建和使用内部索引引起的......但我不会。

I know this is an old question, and code underpinning this issue may have changed, but I've recently encountered (and gain some insight into) this very issue.

For anyone coming along at a later date ... here's what I found.

Performance of the DataTable.Select(condition) is quite sensitive to the nature and structure of the 'condition' you provide. This looks like a bug to me (where would I report it to Microsoft?) but it may merely be a quirk.

I've written a set of tests to demonstrate the issue that are structured as follows:

  1. Define a datatable with a few simple columns,like this:

    var dataTable = new DataTable();
    var idCol = dataTable.Columns.Add("Id", typeof(Int32));
    dataTable.Columns.Add("Code", typeof(string));
    dataTable.Columns.Add("Name", typeof(string));
    dataTable.Columns.Add("FormationDate", typeof(DateTime));
    dataTable.Columns.Add("Income", typeof(Decimal));
    dataTable.Columns.Add("ChildCount", typeof(Int32));
    dataTable.Columns.Add("Foreign", typeof(Boolean));
    dataTable.PrimaryKey = new DataColumn[1] { idCol };

  2. Populate the table with 40000 records, each with a unique 'Code' field.

  3. Perform a batch of 'selects' (each with different parameters) against the datatable using two similar, but differently formatted, queries and record and compare the total time taken by each of the two formats.

You get remarkable results. Testing, for example, the below two conditions side-by-side:

Q1: [Code] = 'XX'

Q2: ([Code] = 'XX')

[ I do multiple Select calls using the above two queries, each iteration I replace the XX with a valid code that exists in the datatable ]
The result?

Time comparison for 320 lookups against 40000 records: 180 msec total search time with no brackets, 6871 msec total search time for search WITH brackets

Yes - 38 times slower if you just have the extra brackets surrounding the condition.
There are other scenarios which react differently.

For example,
[Code] = '{searchCode}' OR 1=0 vs ([Code] = '{searchCode}' OR 1=0) take similar (slow) times to execute, but:

[Code] = '{searchCode}' AND 1=1 vs ([Code] = '{searchCode}' AND 1=1) again shows the non-bracketed version to be close to 40 times faster.

I've not investigated all scenarios, but it seems that the introduction of brackets - either redundantly around a simple comparison check, or as required to specify sub-expression precedence - or the presence of an 'OR' slows the query down considerably.

I could speculate that the issue is caused by how the datatable parses the condition you use and how it creates and uses internal indexes ... but I won't.

三月梨花 2024-09-17 17:50:39

使用字典可以大大加快速度。例如:

if (distinctdt.Rows.Count > 0)
{
    // build index of C1 values to speed inner loop
    Dictionary<string, DataRow> masterIndex = new Dictionary<string, DataRow>();
    foreach (DataRow row in rawMasterdt.Rows)
        masterIndex[row["C1"].ToString()] = row;

    int count = 0;
    foreach (DataRow offer in distinctdt.Rows)
    {

那么代替

    string exp = "C1 = " + "'" + offer[0].ToString() + "'" + "";
    DataRow masterRow = rawMasterdt.Select(exp)[0];

You 会这样做

DataRow masterRow;
if (masterIndex.ContainsKey(offer[0].ToString())
    masterRow = masterIndex[offer[0].ToString()];
else
    masterRow = null;

You can speed it up a lot by using a dictionary. For example:

if (distinctdt.Rows.Count > 0)
{
    // build index of C1 values to speed inner loop
    Dictionary<string, DataRow> masterIndex = new Dictionary<string, DataRow>();
    foreach (DataRow row in rawMasterdt.Rows)
        masterIndex[row["C1"].ToString()] = row;

    int count = 0;
    foreach (DataRow offer in distinctdt.Rows)
    {

Then in place of

    string exp = "C1 = " + "'" + offer[0].ToString() + "'" + "";
    DataRow masterRow = rawMasterdt.Select(exp)[0];

You would do this

DataRow masterRow;
if (masterIndex.ContainsKey(offer[0].ToString())
    masterRow = masterIndex[offer[0].ToString()];
else
    masterRow = null;
℡Ms空城旧梦 2024-09-17 17:50:39

如果在父 DataTable 和子 DataTable 之间创建 DataRelation,则可以通过在父行上调用 DataRow.GetChildRows(DataRelation) 来查找子行(如果是类型化 DataSet,则调用 DataRow.GetChildRelName)。搜索将应用 TreeMap 查找,即使有很多子行,性能也应该很好。

如果您必须根据 DataRelation 的外键以外的其他条件搜索行,我建议您使用 DataView.Sort / DataView.FindRows() 而不是 DataTable.Select(),只要您需要查询更多数据比一次。 DataView.FindRows() 将基于 TreeMap 查找 (O(log(N)),而 DataTable.Select() 必须扫描所有行 (O(N))。本文包含更多详细信息: http://arnosoftwaredev.blogspot.com/2011/02/when-datatableselect- is-slow-use.html

If you create a DataRelation between your parent and child DataTables, you can look up child rows by invoking DataRow.GetChildRows(DataRelation) on the parent row (resp. DataRow.GetChildRelName in case of typed DataSets). The search will apply a TreeMap lookup, and performance should be fine even with a lot of child rows.

In case you have to search for rows based on other criteria than on a DataRelation's foreign keys, I recommend to use DataView.Sort / DataView.FindRows() instead of DataTable.Select(), as soon as you have to query the data more than once. DataView.FindRows() will be based on TreeMap lookup (O(log(N)), where as DataTable.Select() has to scan all rows (O(N)). This article contains more details: http://arnosoftwaredev.blogspot.com/2011/02/when-datatableselect-is-slow-use.html

好听的两个字的网名 2024-09-17 17:50:39

可以使数据表与数据集中的其他数据表具有关系。请参阅 http://msdn.microsoft.com/en -us/library/ay82azad%28VS.71%29.aspx 进行一些讨论并作为浏览的起点。我没有太多使用它们的经验,但据我了解,它们会做你想做的事情(假设你的表格采用合适的格式)。我认为这些比手动过程具有更高的效率,但我可能是错的。可能值得看看它们是否适合您,并进行基准测试以查看它们是否有所改进......

DataTables can be made to have Relationships with other DataTables in a DataSet. See http://msdn.microsoft.com/en-us/library/ay82azad%28VS.71%29.aspx for a bit of discussion and as a start point to browsing. I've not much experience of using them but as I understand it they will do what you want (assuming your tables are in a suitable format). I would assume that these have greater efficiency than a manual process of doing the same but I may be wrong. Might be worth seeing if they work for you and benchmarking to see if they are an improvement or not...

攒眉千度 2024-09-17 17:50:39

您是否通过分析器运行过它?这应该是第一步。无论如何,这可能会有所帮助:

  • 将主文本文件逐行读入内存。将主记录作为键放入字典中。将其添加到数据集(1 次通过 master)。

  • 逐行读取子文本文件,将其添加为上面创建的字典中相应主记录的值

  • 现在,字典中的所有内容都已存储在内存中,只需对每个文件执行 1 次遍历。
    对字典/子项进行最后一次遍历,处理每一列并执行最终计算。

Have you ran it through a profiler? That should be the first step. Anyhow, this might help:

  • Read the master text file into memory line by line. Put the master record into a dictionary as the key. Add it to the dataset (1 pass through master).

  • Read child text file line by line, add this as a value for the appropriate master record in the dictionary created above

  • Now you have everything in the dictionary in memory, only doing 1 pass through each file.
    Do a final pass through the dictionary/children and process each column and perform final calcs.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文