Fastest way to get variable-length substrings from a file (C#)

Posted 2025-01-12 20:30:02

I have a text file containing values that need to be extracted, and each value has a variable length. The length of each value is stored in a List<int>; this can change if there is a more efficient way.

The Problem: What is the fastest way to get the variable length substrings into a DataTable given a List<int> of lengths?

Example text file contents:

Field1ValueField2ValueIsLongerField3Field4IsExtremelyLongField5IsProbablyTheLongestFieldOfThemAll
A1201605172B160349150816431572C16584D31601346427946121346E674306102966595346438476174959205395664

Example List<int>:

11, 19, 6, 21, 40

Example output DataTable:

Field 1     | Field 2             | Field 3 | Field 4               | Field 5
Field1Value | Field2ValueIsLonger | Field3  | Field4IsExtremelyLong | Field5IsProbablyTheLongestFieldOfThemAll
A1201605172 | B160349150816431572 | C16584  | D31601346427946121346 | E674306102966595346438476174959205395664

There is no pattern to the field values: they can be any alphanumeric value, and the only way to extract them is via the length list.

My approach was as follows:

using System.Collections.Generic;
using System.Data;
using System.IO;

List<int> lengths = new() { 11, 19, 6, 21, 40 };

DataTable dataTable = new();

// Add one column per field
for (int col = 0; col < lengths.Count; col++)
{
    dataTable.Columns.Add();
}

// Read the file and split each line into fields
using (StreamReader streamReader = new(fileName))
{
    string line;
    while ((line = streamReader.ReadLine()) != null)
    {
        // Create a new row for each line in the text file
        DataRow dataRow = dataTable.NewRow();

        // Starting index of the current field within the line
        int start = 0;

        // Index columns by position. lengths.IndexOf(i) returns the first match,
        // so two fields of equal length would both be written to the same column.
        for (int col = 0; col < lengths.Count; col++)
        {
            dataRow[col] = line.Substring(start, lengths[col]);

            // Advance past the current field
            start += lengths[col];
        }

        // Add the row to the DataTable
        dataTable.Rows.Add(dataRow);
    }
}

Is there a more efficient (time and/or memory) way of completing this task?

Comments (1)

一身软味 2025-01-19 20:30:02

Are you producing that input string or that length array?

If yes:

  • Save the index of every Nth field's starting character (if you already have the length array, you can build a start array from it too).
  • Then, when decoding, use multiple threads to parse from multiple index points at once and join the results into a target list or array (IMO an array should be faster, since you know the total number of fields).
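
The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the answerer's code: it assumes every line holds all fields back to back and that the lines already fit in memory. The names `FixedWidthParser`, `BuildStarts`, and `ParseLines` are made up for this example. The start array is built once from the lengths, and `Parallel.For` lets each line be sliced by whichever thread picks it up, writing into a preallocated array slot.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not from the original post.
public static class FixedWidthParser
{
    // Turn field lengths into starting offsets, e.g. {11, 19, 6} -> {0, 11, 30}.
    public static int[] BuildStarts(IReadOnlyList<int> lengths)
    {
        var starts = new int[lengths.Count];
        int offset = 0;
        for (int i = 0; i < lengths.Count; i++)
        {
            starts[i] = offset;
            offset += lengths[i];
        }
        return starts;
    }

    // Slice every line into its fields in parallel.
    // Each iteration writes only to its own slot of the result array,
    // so no locking is needed.
    public static string[][] ParseLines(IReadOnlyList<string> lines, IReadOnlyList<int> lengths)
    {
        int[] starts = BuildStarts(lengths);
        var rows = new string[lines.Count][];

        Parallel.For(0, lines.Count, r =>
        {
            var fields = new string[lengths.Count];
            for (int f = 0; f < fields.Length; f++)
            {
                fields[f] = lines[r].Substring(starts[f], lengths[f]);
            }
            rows[r] = fields;
        });

        return rows;
    }
}
```

The resulting rows would then be added to the DataTable from a single thread, because DataTable write operations must be synchronized. Whether the parallel version actually beats a plain loop depends on the number and length of the lines, so it is worth measuring on real data.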

If no:

  • Push every field start you encounter into a queue (along with its field index) and jump directly to the next field.
  • Have other threads asynchronously pop elements from the queue and place them into the list according to their index (an array could be better if the total length is known).
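
A rough sketch of this queue-based variant is shown below. It assumes .NET's `BlockingCollection<T>` as the queue and uses whole lines, rather than individual fields, as the unit of work to keep synchronization overhead low. `QueuedParser`, `Parse`, and the `workerCount` parameter are hypothetical names chosen for illustration, and the producer loop stands in for reading the file.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not from the original post.
public static class QueuedParser
{
    // One thread enqueues (row index, line) pairs; worker tasks dequeue them,
    // slice the fields, and store the result at the row's index.
    public static string[][] Parse(IReadOnlyList<string> lines, IReadOnlyList<int> lengths, int workerCount)
    {
        if (workerCount < 1) throw new ArgumentOutOfRangeException(nameof(workerCount));

        var rows = new string[lines.Count][];
        // Bounded so the producer cannot run arbitrarily far ahead of the workers.
        using var queue = new BlockingCollection<(int Row, string Line)>(1024);

        // Consumers: run until the producer calls CompleteAdding and the queue drains.
        var workers = new Task[workerCount];
        for (int w = 0; w < workerCount; w++)
        {
            workers[w] = Task.Run(() =>
            {
                foreach (var (row, line) in queue.GetConsumingEnumerable())
                {
                    var fields = new string[lengths.Count];
                    int start = 0;
                    for (int f = 0; f < fields.Length; f++)
                    {
                        fields[f] = line.Substring(start, lengths[f]);
                        start += lengths[f];
                    }
                    rows[row] = fields; // each row index is written exactly once
                }
            });
        }

        // Producer: in a real program this loop would read lines from the file.
        for (int r = 0; r < lines.Count; r++)
        {
            queue.Add((r, lines[r]));
        }
        queue.CompleteAdding();

        Task.WaitAll(workers);
        return rows;
    }
}
```

Because each result is stored at its own row index, the output order matches the input order even though workers finish in an arbitrary order.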

This matters because when you do both extracting and parsing in the same loop, the extraction throughput drops. So you should offload the parsing to other threads, perhaps N fields at a time, to tolerate the multithreading synchronization latency.

If single-threaded extraction is too slow compared to multithreaded parsing, you can try to vectorize the extraction: launch 128 character samplers at once, check whether each one finds a prefix code, and do a reduction across them to find the first prefix (if more than one is found).
