Reading Excel Open XML is ignoring blank cells
I am using the accepted solution here to convert an Excel sheet into a DataTable. This works fine if I have "perfect" data, but if I have a blank cell in the middle of my data it seems to put the wrong data in each column.
I think this is because, in the code below:
row.Descendants<Cell>().Count()
is the number of populated cells (not the number of columns), and:
GetCellValue(spreadSheetDocument, row.Descendants<Cell>().ElementAt(i));
seems to find the next populated cell (not necessarily the one at that index), so if the first column is empty and I call ElementAt(0), it returns the value in the second column.
Here is the full parsing code.
DataRow tempRow = dt.NewRow();
for (int i = 0; i < row.Descendants<Cell>().Count(); i++)
{
    tempRow[i] = GetCellValue(spreadSheetDocument, row.Descendants<Cell>().ElementAt(i));
    if (tempRow[i].ToString().IndexOf("Latency issues in") > -1)
    {
        Console.Write(tempRow[i].ToString());
    }
}
Comments (17)
It ran successfully with this code:
I struggled to load an .xlsx file with 50,000 items using the methods shared above, so I thought I would share what I eventually managed to do.
It now loads the 50,000 items into a workable list of string[] in less than 5 seconds, so this could possibly help someone. Take the sample below and tweak it where needed, for example by converting it to a DataTable method.
This makes sense, since Excel does not store a value for a cell that is null. If you open your file using the Open XML SDK 2.0 Productivity Tool and traverse the XML down to the cell level, you will see that only the cells that have data are in the file.
Your options are to insert blank data in the range of cells you are going to traverse, or to programmatically figure out that a cell was skipped and adjust your index appropriately.
I made an example Excel document with a string in cells A1 and C1. I then opened the document in the Open XML SDK Productivity Tool, and here is the XML that was stored:
Here you can see that the data corresponds to the first row and that only two cells' worth of data are saved for that row. The saved data corresponds to A1 and C1; no cells with null values are saved.
To get the functionality you need, you can traverse the cells as you are doing above, but you will need to check which column each cell's reference points to and determine whether any cells have been skipped. To do that you will need two utility functions: one to get the column name from the cell reference, and one to translate that column name into a zero-based index:
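The two helpers described above could look roughly like this. It is a minimal sketch, not the answer's original code: the class and method names (CellReferenceHelper, GetColumnName, GetColumnIndex) are made up for illustration, and the input is assumed to be the string form of a cell reference such as "C1" (with the Open XML SDK that would typically come from cell.CellReference.Value).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; the names are illustrative only.
public static class CellReferenceHelper
{
    // "AB12" -> "AB": keep only the leading letters of a cell reference.
    public static string GetColumnName(string cellReference)
    {
        Match match = Regex.Match(cellReference, "^[A-Za-z]+");
        if (!match.Success)
            throw new ArgumentException("Not a cell reference: " + cellReference);
        return match.Value.ToUpperInvariant();
    }

    // "A" -> 0, "Z" -> 25, "AA" -> 26.
    // Column letters form a bijective base-26 number (there is no zero digit).
    public static int GetColumnIndex(string columnName)
    {
        if (string.IsNullOrEmpty(columnName))
            throw new ArgumentException("Column name is empty.");

        int number = 0;
        foreach (char raw in columnName)
        {
            char c = char.ToUpperInvariant(raw);
            if (c < 'A' || c > 'Z')
                throw new ArgumentException("Not a column name: " + columnName);
            number = number * 26 + (c - 'A' + 1);
        }
        return number - 1; // convert the one-based column number to a zero-based index
    }
}
```

With these definitions, CellReferenceHelper.GetColumnIndex(CellReferenceHelper.GetColumnName("C1")) evaluates to 2, the zero-based index of column C.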
Then you can iterate over the cells and compare each cell's column index with the current columnIndex. While columnIndex is less than the cell's column index, add blank data to your tempRow; otherwise just read the value contained in the cell. (Note: I did not test the code below, but the general idea should help):
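A minimal sketch of that loop, not the answer's original code. To keep it runnable without the Open XML SDK, each stored cell is modelled as a (ColumnIndex, Value) tuple; the zero-based column index is assumed to have been derived from the cell reference already (A is 0, C is 2), and the cells are assumed to arrive in ascending column order, as they are stored in the sheet XML. RowGapFiller and FillGaps are hypothetical names.

```csharp
using System.Collections.Generic;

// Hypothetical helper; models the "add blanks for skipped cells" idea.
public static class RowGapFiller
{
    // storedCells: only the cells that actually exist in the row,
    // each with its zero-based column index, in ascending column order.
    public static List<string> FillGaps(IEnumerable<(int ColumnIndex, string Value)> storedCells)
    {
        var values = new List<string>();
        foreach (var cell in storedCells)
        {
            // values.Count plays the role of the running columnIndex:
            // while it is behind the cell's column, a cell was skipped.
            while (values.Count < cell.ColumnIndex)
                values.Add(string.Empty);

            values.Add(cell.Value);
        }
        return values;
    }
}
```

For a row that stores only A1 = "x" and C1 = "y", FillGaps(new[] { (0, "x"), (2, "y") }) returns the three values "x", "" and "y", so the value from column C lands at index 2 instead of index 1. Empty cells after the last stored cell are not added by this sketch.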
Here's an implementation of IEnumerable that should do what you want, compiled and unit tested.
Here are the functions it relies on:
Throw it in a class and give it a try.
Here's a slightly modified version of Waylon's answer, which also relied on other answers. It encapsulates his method in a class.
I changed
to
Here's the class. You don't need to instantiate it; it just serves as a utility class:
Now you're able to get all rows' cells in this way:
It will contain all cells, even if they are empty.
See my implementation:
HeaderLetters are collected using this class:
And the helper methods are:
The solution deals with shared cell items (SST indexed cells).
All good examples. Here is the one I am using, since I need to keep track of all rows, cells, values, and titles for correlation and analysis.
The method ReadSpreadsheet opens an xlsx file and goes through each worksheet, row, and column. Since the values are stored in a referenced string table, I also explicitly use that per worksheet. Two other classes are used: DSFunction and StaticVariables. The latter holds frequently used parameter values, such as the referenced 'quotdouble' ( quotdouble = "\u0022"; ) and 'crlf' (crlf = "\u000D" + "\u000A"; ).
The relevant DSFunction method GetIntColIndexForLetter is included below. It returns an integer value for the column index corresponding to letter names such as A, B, AA, ADE, etc. This is used along with the parameter 'ncellcolref' to determine whether any columns have been skipped and to enter an empty string value for each one that is missing.
I also do some cleaning of the values (using the Replace method) before storing them temporarily in a List object.
Subsequently, I use the hash table (Dictionary) of column names to extract values across different worksheets, correlate them, create normalized values, and then create an object used in our product, which is then stored as an XML file. None of this is shown, but it is why this approach is used.
The letter code is a base-26 encoding, so this should work to convert it into an offset.
I can't resist optimizing the subroutines from Amurra's answer to remove the need for Regex.
The first function isn't actually needed, since the second one accepts either a cell reference (C3) or a column name (C) (but it is still a nice helper function). The indices are also one-based (only because our implementation used one-based rows to match Excel visually).
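A regex-free conversion along those lines might look like the following. This is a sketch rather than the answer's original code: ColumnMath and GetColumnNumber are hypothetical names, and the input is assumed to be either a bare column name or a plain cell reference without "$" markers.

```csharp
using System;

// Hypothetical helper; names are illustrative only.
public static class ColumnMath
{
    // Accepts a column name ("C") or a cell reference ("C3").
    // Returns a one-based column number: "A" -> 1, "Z" -> 26, "AA" -> 27.
    public static int GetColumnNumber(string columnOrCellReference)
    {
        int number = 0;
        foreach (char raw in columnOrCellReference)
        {
            char c = char.ToUpperInvariant(raw);
            if (c < 'A' || c > 'Z')
                break; // first non-letter: the row digits start here

            number = number * 26 + (c - 'A' + 1);
        }

        if (number == 0)
            throw new ArgumentException("No column letters in: " + columnOrCellReference);
        return number;
    }
}
```

Because the loop stops at the first non-letter, GetColumnNumber("C3") and GetColumnNumber("C") both return 3, which is why a separate function for stripping the row number is optional.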
You can use this function to extract a cell from a row by passing the header index:
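One way such a lookup could work is to turn the header index back into a column name, build the expected cell reference, and search the row for it. The sketch below is an assumption about the approach, not the answer's code: RowLookup, ColumnNameFromIndex and GetCellValue are hypothetical names, the header index is assumed to be zero-based, and the row's stored cells are modelled as (Reference, Value) tuples so the example runs without the Open XML SDK.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; models a row as the cells that are actually stored.
public static class RowLookup
{
    // 0 -> "A", 25 -> "Z", 26 -> "AA" (inverse of the letters-to-index conversion).
    public static string ColumnNameFromIndex(int zeroBasedIndex)
    {
        if (zeroBasedIndex < 0)
            throw new ArgumentOutOfRangeException(nameof(zeroBasedIndex));

        string name = string.Empty;
        int n = zeroBasedIndex + 1;
        while (n > 0)
        {
            int remainder = (n - 1) % 26;
            name = (char)('A' + remainder) + name;
            n = (n - 1) / 26;
        }
        return name;
    }

    // Returns the value of the cell under the given header, or "" if the row has no such cell.
    public static string GetCellValue(
        IEnumerable<(string Reference, string Value)> storedCells,
        int headerIndex,
        uint rowIndex)
    {
        string wanted = ColumnNameFromIndex(headerIndex) + rowIndex; // e.g. "C" + 1 = "C1"
        foreach (var cell in storedCells)
        {
            if (string.Equals(cell.Reference, wanted, StringComparison.OrdinalIgnoreCase))
                return cell.Value;
        }
        return string.Empty; // blank cells are simply absent from the row
    }
}
```

Looking cells up by reference sidesteps the index-shifting problem entirely, at the cost of a scan over the row's stored cells for each lookup.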
Okay, I'm not exactly an expert on this, but the other answers do seem like overkill to me, so here's my solution:
Hope someone finds this useful!
With apologies for posting yet another answer to this question, here's the code I used.
I was having problems with OpenXML not working properly if a worksheet had a blank row at the top. It would sometimes just return a DataTable with 0 rows and 0 columns in it. The code below copes with this, and all other worksheets.
Here's how you would call my code. Just pass in a filename and the name of the Worksheet to read in:
And here's the code itself:
Using ClosedXML.Excel instead of OpenXML:
I wanted to add this as a comment, but there is not enough space for explanations. The solution provided by Waylon Flinn works well, except when your row ends with empty cells that you still have to take into consideration.
This is because the spreadsheet data in row.Descendants<Cell> only contains the cells that have data. The GetEnumerator() method examines each pair of "data containing" cells and then injects empty cells between them if needed (based on the column index). However, the GetEnumerator() method will miss empty cells at the end of the row if the last column does not contain data (that is, if it is not part of a "data containing" pair).
To make sure that the GetEnumerator() method adds the required empty cells to the end of the row, you first have to calculate the requiredNumberOfColumns in the row (perhaps by counting the number of "data containing" cells in the header row), and then pass that requiredNumberOfColumns to the GetEnumerator() method.
Then you have to add the following for loop as the very last statement in the GetEnumerator() method:
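The effect of that trailing loop can be sketched on its own as follows. This is an illustration under assumptions, not the code from either answer: RowPadding and PadToWidth are hypothetical names, and the row is represented as a sequence of string values in which gaps between populated cells have already been filled.

```csharp
using System.Collections.Generic;

// Hypothetical helper; shows only the "pad the end of the row" step.
public static class RowPadding
{
    // Yields the row's values, then empty strings until
    // requiredNumberOfColumns values have been produced.
    public static IEnumerable<string> PadToWidth(
        IEnumerable<string> rowValues,
        int requiredNumberOfColumns)
    {
        int produced = 0;
        foreach (string value in rowValues)
        {
            produced++;
            yield return value;
        }

        // Last statement of the iterator: fill the tail of the row.
        for (int i = produced; i < requiredNumberOfColumns; i++)
            yield return string.Empty;
    }
}
```

For example, string.Join("|", RowPadding.PadToWidth(new[] { "x", "", "y" }, 5)) produces "x||y||". Rows that are already wider than requiredNumberOfColumns pass through unchanged.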
Added yet another implementation, this time where the number of columns is known in advance:
Call using:
Where 38 is the required number of columns.
To read blank cells, I use a variable named "CN", declared outside the row reader. In the while loop, I check whether the cell's column index is greater than my variable, which is incremented after each cell is read. If they do not match, I fill the column with the value I want. This is the trick I used to keep blank cells aligned with their respective columns. Here is the code:
The code works for:
Here is my solution. I found that the above didn't seem to work well when the missing fields were at the end of a row.
Assuming the first row in the Excel sheet has ALL the columns (via headers), grab the number of columns expected per row (row == 1). Then loop through the data rows (row > 1). The key to processing the missing cells is in the method getRowCells, which is passed the known number of column cells as well as the current row to process.
The method getRowCells currently has a limitation: it can only support a sheet (row) that has fewer than 26 columns. A loop based on the known column count is used to find missing columns (cells). If one is found, a new Cell value is inserted into the cells collection, with the new Cell having a default value of "" instead of 'null'. The modified Cell collection is then returned.
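The same idea can be sketched without the 26-column limit by addressing columns numerically instead of by a single letter. This is not the answer's getRowCells: FixedWidthRow and GetRowValues are hypothetical names, and each stored cell is assumed to carry a zero-based column index already derived from its cell reference.

```csharp
using System.Collections.Generic;

// Hypothetical helper; returns one value per expected column.
public static class FixedWidthRow
{
    public static string[] GetRowValues(
        IEnumerable<(int ColumnIndex, string Value)> storedCells,
        int columnCount)
    {
        // Start with every column defaulted to "" rather than null.
        var row = new string[columnCount];
        for (int i = 0; i < columnCount; i++)
            row[i] = string.Empty;

        // Drop each stored cell into its own column; missing cells keep the default.
        foreach (var cell in storedCells)
        {
            if (cell.ColumnIndex >= 0 && cell.ColumnIndex < columnCount)
                row[cell.ColumnIndex] = cell.Value ?? string.Empty;
        }
        return row;
    }
}
```

Because the array is sized from the header row's column count, gaps in the middle and at the end of a row are handled the same way, and a column such as AB (index 27) needs no special treatment. Cells beyond columnCount are ignored in this sketch.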