Text reformatter gradually slows with each iteration

Posted 2024-10-16 18:22:43

EDIT 2

Okay, I've posted a copy of my source at gist.github and I only have one lingering problem that I can't resolve.

FindLine() always returns -1. I've narrowed the cause down to the if statement, but I can't figure out why. I know that symbol and symbolList are both getting passed good data.

/EDIT 2

I have a fairly simple C# program that looks for a .csv file, reads the text in that file, reformats it (and includes some information from a SQL query loaded into a DataTable), and saves it to a .tsv file for later use by another program.

My problem is that sometimes the source .csv file is over 10,000 lines and the program slows gradually as it iterates through the lines. If the .csv file is ~500 lines it takes about 45 seconds to complete, and this time gets exponentially worse as the .csv file gets larger.

The SQL query returns 37,000+ lines, but is only requested once and is sorted the same way the .csv file is, so normally I won't notice it running through that file unless it can't find the corresponding data, in which case it makes it all the way through and returns the appropriate error text. I'm 99% sure it's not the cause of the slowdown.

The y and z for loops need to be exactly as long as they are.

If it's absolutely necessary I can scrub some data from the initial .csv file and post an example, but I'm really hoping I'm just missing something really obvious.

Thanks in advance guys!

Here's my source:

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.IO;
using System.Linq;
using System.Text;

namespace MoxySectorFormatter
{
    class Program
    {
        static void Main(string[] args)
        {
            DataTable resultTable = new DataTable();
            double curLine = 1;
            double numLines = 0;
            string ExportPath = @"***PATH***\***OUTFILE***.tsv";
            string ImportPath = @"***PATH***\***INFILE***.csv";
            string NewText = "SECURITY\r\n";
            string OrigText = "";
            string QueryString = "SELECT DISTINCT UPPER(MP.Symbol) AS Symbol, LOWER(MP.SecType) AS SecType, MBI.Status FROM MoxySecMaster AS MP LEFT JOIN MoxyBondInfo AS MBI ON MP.Symbol = MBI.Symbol AND MP.SecType = MBI.SecType WHERE MP.SecType <> 'caus' AND MP.SecType IS NOT NULL AND MP.Symbol IS NOT NULL ORDER BY Symbol ASC;";
            SqlConnection MoxyConn = new SqlConnection("server=***;database=***;user id=***;password=***");
            SqlDataAdapter adapter = new SqlDataAdapter(QueryString, MoxyConn);

            MoxyConn.Open();
            Console.Write("Importing source file from \"{0}\".", ImportPath);
            OrigText = File.ReadAllText(ImportPath);
            OrigText = OrigText.Substring(OrigText.IndexOf("\r\n", 0) + 2);
            Console.WriteLine("\rImporting source file from \"{0}\".  Done!", ImportPath);
            Console.Write("Scanning source report.");
            for (int loop = 0; loop < OrigText.Length; loop++)
            {
                if (OrigText[loop] == '\r')
                    numLines++;
            }
            Console.WriteLine("\rScanning source report.  Done!");
            Console.Write("Downloading SecType information.");
            resultTable = new DataTable();
            adapter.Fill(resultTable);
            MoxyConn.Close();
            Console.WriteLine("\rDownloading SecType information.  Done!");

            for (int lcv = 0; lcv < numLines; lcv++)
            {
                int foundSpot = -1;
                int nextStart = 0;
                Console.Write("\rGenerating new file... {0} / {1} ({2}%)  ", curLine, numLines, System.Math.Round(((curLine / numLines) * 100), 2));
                for (int vcl = 0; vcl < resultTable.Rows.Count; vcl++)
                {
                    if (resultTable.Rows[vcl][0].ToString() == OrigText.Substring(0, OrigText.IndexOf(",", 0)).ToUpper() && resultTable.Rows[vcl][1].ToString().Length > 0)
                    {
                        foundSpot = vcl;
                        break;
                    }
                }
                if (foundSpot != -1 && foundSpot < resultTable.Rows.Count)
                {
                    NewText += resultTable.Rows[foundSpot][1].ToString();
                    NewText += "\t";
                    NewText += OrigText.Substring(nextStart, (OrigText.IndexOf(",", nextStart) - nextStart));
                    NewText += "\t";
                    nextStart = OrigText.IndexOf(",", nextStart) + 1;
                    for (int y = 0; y < 142; y++)
                        NewText += "\t";
                    if(resultTable.Rows[foundSpot][2].ToString() == "r")
                        NewText += @"PRE/ETM";
                    else if (OrigText.Substring(nextStart, (OrigText.IndexOf(",", nextStart) - nextStart)) == "Municipals")
                    {
                        NewText += "Muni - ";
                        nextStart = OrigText.IndexOf(",", nextStart) + 1;
                        if (OrigText.Substring(nextStart, (OrigText.IndexOf(",", nextStart) - nextStart)).Length > 0)
                            NewText += OrigText.Substring(nextStart, (OrigText.IndexOf(",", nextStart) - nextStart));
                        else
                            NewText += "(Orphan)";
                    }
                    else if (OrigText.Substring(nextStart, (OrigText.IndexOf(",", nextStart) - nextStart)) == "Corporates")
                    {
                        NewText += "Corporate - ";
                        nextStart = OrigText.IndexOf(",", nextStart) + 1;
                        nextStart = OrigText.IndexOf(",", nextStart) + 1;
                        if (OrigText.Substring(nextStart, (OrigText.IndexOf("\r\n", nextStart) - nextStart)).Length > 0)
                            NewText += OrigText.Substring(nextStart, (OrigText.IndexOf("\r\n", nextStart) - nextStart));
                        else
                            NewText += "(Unknown)";
                    }
                    else
                        NewText += OrigText.Substring(nextStart, (OrigText.IndexOf(",", nextStart) - nextStart));
                    for (int z = 0; z < 17; z++)
                        NewText += "\t";
                    NewText += "\r\n";
                    resultTable.Rows.RemoveAt(foundSpot);
                }
                else
                    Console.WriteLine("\r  Omitting {0}: Missing Symbol or SecType.", OrigText.Substring(nextStart, (OrigText.IndexOf(",", nextStart) - nextStart)));
                OrigText = OrigText.Substring(OrigText.IndexOf("\r\n", 0) + 2);
                curLine++;
            }
            Console.Write("Exporting file to \"{0}\".", ExportPath);
            File.WriteAllText(ExportPath, NewText);
            Console.WriteLine("\rExporting file to \"{0}\".  Done!\nPress any key to exit.", ExportPath);
            Console.ReadLine();
        }
    }
}

Comments (4)

季末如歌 2024-10-23 18:22:43

Instead of using the += operator for concatenation, use a System.Text.StringBuilder object and its Append() and AppendLine() methods.

Strings are immutable in C#, so every time you use += in your loop, a new string is created in memory, likely causing the eventual slowdown.
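
As a rough illustration of the StringBuilder suggestion, here is a minimal sketch of how one output row could be assembled. The class name TabRowBuilder, the method AppendRow, and its parameters are hypothetical and do not appear in the question's source; the tab counts 142 and 17 are taken from the y and z loops.

```csharp
using System.Text;

public static class TabRowBuilder
{
    // Hypothetical helper (not part of the original program): appends one
    // output row to a shared StringBuilder instead of growing a string with +=.
    public static void AppendRow(StringBuilder output, string secType, string symbol, string sector)
    {
        output.Append(secType).Append('\t');
        output.Append(symbol).Append('\t');
        output.Append('\t', 142);   // same 142 tabs the y loop emitted
        output.Append(sector);
        output.Append('\t', 17);    // same 17 tabs the z loop emitted
        output.Append("\r\n");
    }
}
```

The caller would create one StringBuilder seeded with "SECURITY\r\n", call AppendRow once per matched line, and pass output.ToString() to File.WriteAllText at the end.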

不一样的天空 2024-10-23 18:22:43

While this cries out for StringBuilder, I don't believe that's the primary culprit here. Rather, we have a piece of code with a quadratic runtime.

The culprit I have in mind is the code that calculates foundSpot. If I'm reading the code correctly this is O(n^2) while everything else is O(n).

Three pieces of advice:

1) Refactor! This routine is WAY too long. I shouldn't have to refer to "the code that calculates foundSpot", that should be a routine with a name. I see a minimum of 4 routines here, maybe more.

2) StringBuilder.

3) That search routine has to be cleaned up. You're doing a lot of repeated calculations each time around the loop and unless there's some reason against it (I'm not going to try to figure out the tests you are applying) it needs to be done with something with search performance better than O(n).
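
One way to get lookups that are better than O(n) is to index the DataTable once by symbol, sketched below. SymbolIndex and Build are hypothetical names. The sketch assumes that column 0 holds the symbol and column 1 the SecType, as in the question's query, and that keeping only the first row per symbol is acceptable. The original instead removes each matched row, so a symbol that repeats in the .csv file would behave differently here.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class SymbolIndex
{
    // Hypothetical helper: builds a one-time index of rows keyed by Symbol
    // (column 0), skipping rows whose SecType (column 1) is empty, which is
    // the same condition the original linear search applies.
    public static Dictionary<string, DataRow> Build(DataTable table)
    {
        // Case-insensitive keys stand in for the ToUpper() call in the original.
        var index = new Dictionary<string, DataRow>(StringComparer.OrdinalIgnoreCase);
        foreach (DataRow row in table.Rows)
        {
            string symbol = row[0].ToString();
            if (row[1].ToString().Length > 0 && !index.ContainsKey(symbol))
                index.Add(symbol, row);   // keep the first match, like the break in the loop
        }
        return index;
    }
}
```

Inside the main loop, index.TryGetValue(symbol, out row) would then replace the inner for loop, making each lookup roughly constant time instead of a scan over up to 37,000 rows.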

智商已欠费 2024-10-23 18:22:43

You should write each line to the output file as you create it instead of appending all lines to the end of your output string (NewText) and writing them out at the end.

Every time that the program appends something to the end of the output string, C# creates a new string that's big enough for the old string plus the appended text, copies the old contents into the target string, then appends the new text to the end.

Assuming 40 characters per line and 500 lines, the total string size will be ~ 20K, at which point the overhead of all of those 20K copies is slowing the program WAY down.
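
To put a number on that under the same assumptions: appending line k copies about 40 × k characters, so 500 lines copy roughly 40 × (500 × 501 / 2) ≈ 5 million characters in total, and the figure grows with the square of the line count. The sketch below shows one possible way to write rows as they are produced. StreamingExport, WriteRows, and WriteFile are hypothetical names, and the rows are assumed to be already formatted without their line terminator.

```csharp
using System.Collections.Generic;
using System.IO;

public static class StreamingExport
{
    // Writes the header and then each finished row directly to the writer,
    // so no ever-growing string has to be held and re-copied in memory.
    public static int WriteRows(TextWriter writer, IEnumerable<string> rows)
    {
        int written = 0;
        writer.Write("SECURITY\r\n");
        foreach (string row in rows)
        {
            writer.Write(row);
            writer.Write("\r\n");
            written++;
        }
        return written;
    }

    // File-based wrapper; the using statement closes the file even if an
    // exception is thrown while rows are being produced.
    public static int WriteFile(string exportPath, IEnumerable<string> rows)
    {
        using (var writer = new StreamWriter(exportPath))
        {
            return WriteRows(writer, rows);
        }
    }
}
```

Taking a TextWriter in WriteRows keeps the formatting logic independent of the file system, so it can be exercised with a StringWriter before being pointed at the real export path.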

隔岸观火 2024-10-23 18:22:43

NewText is only appended to, right? So why not just write out to the file stream? Also don't forget to add a try catch around it, so if your app blows up, you can close the file stream.

Also, the second loop would probably be faster if you pulled out the Substring call. There is no reason to be doing that over and over again.

string txt = OrigText.Substring(0, OrigText.IndexOf(",", 0)).ToUpper();
for (int vcl = 0; vcl < resultTable.Rows.Count; vcl++)
{
  if (resultTable.Rows[vcl][0].ToString() == txt && resultTable.Rows[vcl][1].ToString().Length > 0)
  {
      foundSpot = vcl;
      break;
  }
}

These tab loops are ridiculous. They are essentially constant strings that get built each time. Replace them with formatting vars that are declared at the start of your app.

string tab17 = new string('\t', 17);
string tab142 = new string('\t', 142);

//bad
for (int z = 0; z < 17; z++)
  NewText += "\t";
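
The idea of pulling the Substring call out of the loop can be taken a step further by splitting each line into fields once, rather than calling IndexOf and Substring repeatedly against the whole remaining text. The following is only a sketch: CsvLine, Field, and Sector are hypothetical names, the field layout (symbol, category, sub-category, remainder) is inferred from the question's code, and it assumes no field before the last one contains a quoted comma.

```csharp
using System;

public static class CsvLine
{
    // Returns the field at index i, or an empty string if the line is short.
    private static string Field(string[] fields, int i)
    {
        return i < fields.Length ? fields[i] : "";
    }

    // Hypothetical helper: derives the sector text for one .csv line,
    // following the same branches as the original if/else chain.
    public static string Sector(string line, string status)
    {
        // Split into at most 4 parts so the last part keeps any remaining
        // commas, as the original reads to the end of the line for Corporates.
        string[] f = line.Split(new[] { ',' }, 4);
        if (status == "r") return "PRE/ETM";

        string category = Field(f, 1);
        if (category == "Municipals")
        {
            string sub = Field(f, 2);
            return "Muni - " + (sub.Length > 0 ? sub : "(Orphan)");
        }
        if (category == "Corporates")
        {
            string rest = Field(f, 3);
            return "Corporate - " + (rest.Length > 0 ? rest : "(Unknown)");
        }
        return category;
    }
}
```

Combined with File.ReadLines(ImportPath), this would also remove the need to re-copy OrigText with Substring after every line, which is itself a cost that grows with file size.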