读取大型开放式 xml 电子表格

发布于 2024-08-17 23:45:53 字数 240 浏览 11 评论 0原文

我需要使用 openxml 库读取(并解析)大型电子表格文件(20-50MB),并且似乎没有一种方法可以一次一个地流式传输行进行解析。

我一直遇到内存不足异常,因为似乎只要我尝试访问一行(或迭代),就会加载整个行内容(100K+ 行)。

每次调用,是否 Elements.Where( 和 query ) 或 Descendants() 似乎加载整个行集

是否有办法进行流式传输或一次只读取一行?

谢谢

i need to read (and parse) large spreadsheet files (20-50MB) using the openxml libraries and there doesn't seem to be a way to stream the rows one at a time for parsing.

i'm consistently getting Out Of Memory exceptions as it seems as soon as i attempt to access a row (or iterate) the entire row contents are loaded (100K+ rows).

each of the calls, whether Elements.Where( with query )
or Descendants ( ) seem to load the entire rowset

is there a way to stream or just read a row at a time ?

thx

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

扭转时空 2024-08-24 23:45:54

我找到了答案。如果您在工作表部分使用 OpenXmlReader,您可以迭代并有效地延迟加载您遇到的元素。

OpenXmlReader oxr = OpenXmlReader.Create(worksheetPart); 

查找

ElementType == typeof(SheetData) 

并加载行(惰性)

Row row = (Row)oxr.LoadCurrentElement();

i found an answer. if you use the OpenXmlReader on the worksheet part you can iterate through and effectively lazy load the elements you come across.

OpenXmlReader oxr = OpenXmlReader.Create(worksheetPart); 

look for

ElementType == typeof(SheetData) 

and load the row (lazy)

Row row = (Row)oxr.LoadCurrentElement();
表情可笑 2024-08-24 23:45:54

openxml 库使用 dom 还是 sax 模型?使用 dom,您通常必须立即将整个文档保存在内存中,但使用 sax,您可以在事件发生时对其进行流式传输。

do the openxml libraries use dom or sax models? with dom you usually have to hold the entire document in memory at once, but with sax you can stream the events as they come.

邮友 2024-08-24 23:45:54

以下是使用 SAX 方法读取具有多个工作表的大型 excel 文件的代码:

public static DataTable ReadIntoDatatableFromExcel(string newFilePath)
        {
            /*Creating a table with 20 columns*/
            var dt = CreateProviderRvenueSharingTable();

            try
            {
                /*using stream so that if excel file is in another process then it can read without error*/
                using (Stream stream = new FileStream(newFilePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
                {
                    using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(stream, false))
                    {
                        var workbookPart = spreadsheetDocument.WorkbookPart;
                        var workbook = workbookPart.Workbook;

                        /*get only unhide tabs*/
                        var sheets = workbook.Descendants<Sheet>().Where(e => e.State == null);

                        foreach (var sheet in sheets)
                        {
                            var worksheetPart = (WorksheetPart)workbookPart.GetPartById(sheet.Id);

                            /*Remove empty sheets*/
                            List<Row> rows = worksheetPart.Worksheet.Elements<SheetData>().First().Elements<Row>()
                                .Where(r => r.InnerText != string.Empty).ToList();

                            if (rows.Count > 1)
                            {
                                OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);

                                int i = 0;
                                int BTR = 0;/*Break the reader while empty rows are found*/

                                while (reader.Read())
                                {
                                    if (reader.ElementType == typeof(Row))
                                    {
                                        /*ignoring first row with headers and check if data is there after header*/
                                        if (i < 2)
                                        {
                                            i++;
                                            continue;
                                        }

                                        reader.ReadFirstChild();

                                        DataRow row = dt.NewRow();

                                        int CN = 0;

                                        if (reader.ElementType == typeof(Cell))
                                        {
                                            do
                                            {
                                                Cell c = (Cell)reader.LoadCurrentElement();

                                                /*reader skipping blank cells so data is getting worng in datatable's rows according to header*/
                                                if (CN != 0)
                                                {
                                                    int cellColumnIndex =
                                                        ExcelHelper.GetColumnIndexFromName(
                                                            ExcelHelper.GetColumnName(c.CellReference));

                                                    if (cellColumnIndex < 20 && CN < cellColumnIndex - 1)
                                                    {
                                                        do
                                                        {
                                                            row[CN] = string.Empty;
                                                            CN++;
                                                        } while (CN < cellColumnIndex - 1);
                                                    }
                                                }

                                                /*stopping execution if first cell does not have any value which means empty row*/
                                                if (CN == 0 && c.DataType == null && c.CellValue == null)
                                                {
                                                    BTR++;
                                                    break;
                                                }

                                                string cellValue = GetCellValue(c, workbookPart);
                                                row[CN] = cellValue;
                                                CN++;

                                                /*if any text exists after T column (index 20) then skip the reader*/
                                                if (CN == 20)
                                                {
                                                    break;
                                                }
                                            } while (reader.ReadNextSibling());
                                        }

                                        /*reader skipping blank cells so fill the array upto 19 index*/
                                        while (CN != 0 && CN < 20)
                                        {
                                            row[CN] = string.Empty;
                                            CN++;
                                        }

                                        if (CN == 20)
                                        {
                                            dt.Rows.Add(row);
                                        }
                                    }
                                    /*escaping empty rows below data filled rows after checking 5 times */
                                    if (BTR > 5)
                                        break;
                                }
                                reader.Close();
                            }                            
                        }
                    }
                }
            }
            catch (Exception ex)
            {
                throw ex;
            }
            return dt;
        }

  private static string GetCellValue(Cell c, WorkbookPart workbookPart)
        {
            string cellValue = string.Empty;
            if (c.DataType != null && c.DataType == CellValues.SharedString)
            {
                SharedStringItem ssi =
                    workbookPart.SharedStringTablePart.SharedStringTable
                        .Elements<SharedStringItem>()
                        .ElementAt(int.Parse(c.CellValue.InnerText));
                if (ssi.Text != null)
                {
                    cellValue = ssi.Text.Text;
                }
            }
            else
            {
                if (c.CellValue != null)
                {
                    cellValue = c.CellValue.InnerText;
                }
            }
            return cellValue;
        }

public static int GetColumnIndexFromName(string columnNameOrCellReference)
        {
            int columnIndex = 0;
            int factor = 1;
            for (int pos = columnNameOrCellReference.Length - 1; pos >= 0; pos--)   // R to L
            {
                if (Char.IsLetter(columnNameOrCellReference[pos]))  // for letters (columnName)
                {
                    columnIndex += factor * ((columnNameOrCellReference[pos] - 'A') + 1);
                    factor *= 26;
                }
            }
            return columnIndex;
        }

        public static string GetColumnName(string cellReference)
        {
            /* Advance from L to R until a number, then return 0 through previous position*/
            for (int lastCharPos = 0; lastCharPos <= 3; lastCharPos++)
                if (Char.IsNumber(cellReference[lastCharPos]))
                    return cellReference.Substring(0, lastCharPos);

            throw new ArgumentOutOfRangeException("cellReference");
        }

private static DataTable CreateProviderRvenueSharingTable()
    {
        DataTable dt = new DataTable("RevenueSharingTransaction");

        // Create fields
        dt.Columns.Add("IMId", typeof(string));
        dt.Columns.Add("InternalPlanId", typeof(string));
        dt.Columns.Add("PaymentReceivedDate", typeof(string));
        dt.Columns.Add("PaymentAmount", typeof(string));
        dt.Columns.Add("BPS", typeof(string));
        dt.Columns.Add("Asset", typeof(string));
        dt.Columns.Add("PaymentType", typeof(string));
        dt.Columns.Add("InvestmentManager", typeof(string));
        dt.Columns.Add("Frequency", typeof(string));
        dt.Columns.Add("StartDateForPayment", typeof(string));
        dt.Columns.Add("EndDateForPayment", typeof(string));
        dt.Columns.Add("Participant", typeof(string));
        dt.Columns.Add("SSN", typeof(string));
        dt.Columns.Add("JEDate", typeof(string));
        dt.Columns.Add("GL", typeof(string));
        dt.Columns.Add("JEDescription", typeof(string));
        dt.Columns.Add("CRAccount", typeof(string));
        dt.Columns.Add("ReportName", typeof(string));
        dt.Columns.Add("ReportLocation", typeof(string));
        dt.Columns.Add("Division", typeof(string));
        return dt;
    }

代码适用于:
1. 从第一张开始按升序阅读
2. 如果 excel 文件正在被另一个进程使用,OpenXML 仍然会读取该文件。
3.此代码读取空白单元格
4. 读取完成后跳过空行。
5. 4秒内读取5000行。

Here is the code to read large excel file with multiple sheets using SAX approach:

public static DataTable ReadIntoDatatableFromExcel(string newFilePath)
        {
            /*Creating a table with 20 columns*/
            var dt = CreateProviderRvenueSharingTable();

            try
            {
                /*using stream so that if excel file is in another process then it can read without error*/
                using (Stream stream = new FileStream(newFilePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
                {
                    using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(stream, false))
                    {
                        var workbookPart = spreadsheetDocument.WorkbookPart;
                        var workbook = workbookPart.Workbook;

                        /*get only unhide tabs*/
                        var sheets = workbook.Descendants<Sheet>().Where(e => e.State == null);

                        foreach (var sheet in sheets)
                        {
                            var worksheetPart = (WorksheetPart)workbookPart.GetPartById(sheet.Id);

                            /*Remove empty sheets*/
                            List<Row> rows = worksheetPart.Worksheet.Elements<SheetData>().First().Elements<Row>()
                                .Where(r => r.InnerText != string.Empty).ToList();

                            if (rows.Count > 1)
                            {
                                OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);

                                int i = 0;
                                int BTR = 0;/*Break the reader while empty rows are found*/

                                while (reader.Read())
                                {
                                    if (reader.ElementType == typeof(Row))
                                    {
                                        /*ignoring first row with headers and check if data is there after header*/
                                        if (i < 2)
                                        {
                                            i++;
                                            continue;
                                        }

                                        reader.ReadFirstChild();

                                        DataRow row = dt.NewRow();

                                        int CN = 0;

                                        if (reader.ElementType == typeof(Cell))
                                        {
                                            do
                                            {
                                                Cell c = (Cell)reader.LoadCurrentElement();

                                                /*reader skipping blank cells so data is getting worng in datatable's rows according to header*/
                                                if (CN != 0)
                                                {
                                                    int cellColumnIndex =
                                                        ExcelHelper.GetColumnIndexFromName(
                                                            ExcelHelper.GetColumnName(c.CellReference));

                                                    if (cellColumnIndex < 20 && CN < cellColumnIndex - 1)
                                                    {
                                                        do
                                                        {
                                                            row[CN] = string.Empty;
                                                            CN++;
                                                        } while (CN < cellColumnIndex - 1);
                                                    }
                                                }

                                                /*stopping execution if first cell does not have any value which means empty row*/
                                                if (CN == 0 && c.DataType == null && c.CellValue == null)
                                                {
                                                    BTR++;
                                                    break;
                                                }

                                                string cellValue = GetCellValue(c, workbookPart);
                                                row[CN] = cellValue;
                                                CN++;

                                                /*if any text exists after T column (index 20) then skip the reader*/
                                                if (CN == 20)
                                                {
                                                    break;
                                                }
                                            } while (reader.ReadNextSibling());
                                        }

                                        /*reader skipping blank cells so fill the array upto 19 index*/
                                        while (CN != 0 && CN < 20)
                                        {
                                            row[CN] = string.Empty;
                                            CN++;
                                        }

                                        if (CN == 20)
                                        {
                                            dt.Rows.Add(row);
                                        }
                                    }
                                    /*escaping empty rows below data filled rows after checking 5 times */
                                    if (BTR > 5)
                                        break;
                                }
                                reader.Close();
                            }                            
                        }
                    }
                }
            }
            catch (Exception ex)
            {
                throw ex;
            }
            return dt;
        }

  private static string GetCellValue(Cell c, WorkbookPart workbookPart)
        {
            string cellValue = string.Empty;
            if (c.DataType != null && c.DataType == CellValues.SharedString)
            {
                SharedStringItem ssi =
                    workbookPart.SharedStringTablePart.SharedStringTable
                        .Elements<SharedStringItem>()
                        .ElementAt(int.Parse(c.CellValue.InnerText));
                if (ssi.Text != null)
                {
                    cellValue = ssi.Text.Text;
                }
            }
            else
            {
                if (c.CellValue != null)
                {
                    cellValue = c.CellValue.InnerText;
                }
            }
            return cellValue;
        }

public static int GetColumnIndexFromName(string columnNameOrCellReference)
        {
            int columnIndex = 0;
            int factor = 1;
            for (int pos = columnNameOrCellReference.Length - 1; pos >= 0; pos--)   // R to L
            {
                if (Char.IsLetter(columnNameOrCellReference[pos]))  // for letters (columnName)
                {
                    columnIndex += factor * ((columnNameOrCellReference[pos] - 'A') + 1);
                    factor *= 26;
                }
            }
            return columnIndex;
        }

        public static string GetColumnName(string cellReference)
        {
            /* Advance from L to R until a number, then return 0 through previous position*/
            for (int lastCharPos = 0; lastCharPos <= 3; lastCharPos++)
                if (Char.IsNumber(cellReference[lastCharPos]))
                    return cellReference.Substring(0, lastCharPos);

            throw new ArgumentOutOfRangeException("cellReference");
        }

private static DataTable CreateProviderRvenueSharingTable()
    {
        DataTable dt = new DataTable("RevenueSharingTransaction");

        // Create fields
        dt.Columns.Add("IMId", typeof(string));
        dt.Columns.Add("InternalPlanId", typeof(string));
        dt.Columns.Add("PaymentReceivedDate", typeof(string));
        dt.Columns.Add("PaymentAmount", typeof(string));
        dt.Columns.Add("BPS", typeof(string));
        dt.Columns.Add("Asset", typeof(string));
        dt.Columns.Add("PaymentType", typeof(string));
        dt.Columns.Add("InvestmentManager", typeof(string));
        dt.Columns.Add("Frequency", typeof(string));
        dt.Columns.Add("StartDateForPayment", typeof(string));
        dt.Columns.Add("EndDateForPayment", typeof(string));
        dt.Columns.Add("Participant", typeof(string));
        dt.Columns.Add("SSN", typeof(string));
        dt.Columns.Add("JEDate", typeof(string));
        dt.Columns.Add("GL", typeof(string));
        dt.Columns.Add("JEDescription", typeof(string));
        dt.Columns.Add("CRAccount", typeof(string));
        dt.Columns.Add("ReportName", typeof(string));
        dt.Columns.Add("ReportLocation", typeof(string));
        dt.Columns.Add("Division", typeof(string));
        return dt;
    }

Code works for:
1. read the sheets from first in ascending order
2. if excel file is being used by another process, OpenXML still reads that.
3. This code reads blank cells
4. skip empty rows after reading complete.
5. it reads 5000 rows within 4 seconds.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文