从 Excel 导入中查找包含的边框区域
我正在从具有各种表格布局的 Excel 导入大量数据。我有足够好的表格检测例程和合并单元格处理,但是在处理边框时遇到了问题。即性能。其中一些文件中的边框区域具有含义。
数据设置:
我使用 VB6 和 MSXML 直接从 Office Open XML 导入。数据从 XML 解析为单元格数据字典。这非常神奇,与在 Access 中使用 docmd.transferspreadsheet 一样快,但返回的结果要好得多。每个单元格都包含一个指向样式元素的指针,该样式元素包含一个指向定义每个边框的可见性和粗细的边框元素的指针(这也是 OpenXML 内部数据的结构方式)。
挑战:
我想做的是找到边界内封闭的每个区域,并创建该区域内的单元格列表。
我所做的:
我最初创建了一个 BFS(广度优先搜索)填充例程来查找这些区域。对于“正常”大小的电子表格来说,这非常有效且快速,但对于导入数千行来说就太慢了。一个问题是 Excel 中的边框可能存储在您正在检查的单元格中或相邻单元格中的相对边框中。没关系,我可以在导入时合并这些数据,以减少所需的检查数量。
我想做的一件事是创建一个单独的图形,使用边界作为边缘来概述单元格,并使用图形算法以这种方式查找区域,但我在弄清楚如何实现该算法时遇到了麻烦。我过去使用过 Dijkstra,并认为我可以用它做类似的事情。因此,我可以不使用端点来搜索整个图,如果我遇到一个封闭的节点,我知道我刚刚找到了一个封闭的区域,但我如何知道我找到的路线是否是最佳路线?我想我可以标记为对找到的闭合节点与前一个节点运行单独的检查,忽略该一条边。
这可以工作,但在密集图上不会有更好的性能。其他人可以建议更好的方法吗?感谢您花时间阅读本文。
I am importing massive amounts of data from Excel that have various table layouts. I have good enough table detection routines and merge cell handling, but I am running into a problem when it comes to dealing with borders. Namely performance. The bordered regions in some of these files have meaning.
Data Setup:
I am importing directly from Office Open XML using VB6 and MSXML. The data is parsed from the XML into a dictionary of cell data. This wonks wonderfully and is just as fast as using docmd.transferspreadsheet in Access, but returns much better results. Each cell contains a pointer to a style element which contains a pointer to a border element that defines the visibility and weight of each border (this is how the data is structured inside OpenXML, also).
Challenge:
What I'm trying to do is find every region that is enclosed inside borders, and create a list of cells that are inside that region.
What I have done:
I initially created a BFS(breadth first search) fill routine to find these areas. This works wonderfully and fast for "normal" sized spreadsheets, but gets way too slow for imports into the thousands of rows. One problem is that a border in Excel could be stored in the cell you are checking or the opposing border in the adjacent cell. That's ok, I can consolidate that data on import to reduce the number of checks needed.
One thing I thought about doing is to create a separate graph that outlines the cells using the borders as my edges and using a graph algorithm to find regions that way, but I'm having trouble figuring out how to implement the algorithm. I've used Dijkstra in the past and thought I could do similar with this. So I can span out using no endpoint to search the entire graph, and if I encounter a closed node I know that I just found an enclosed region, but how can I know if the route I've found is the optimal one? I guess I could flag that to run a separate check for the found closed node to the previous node ignoring that one edge.
This could work, but wouldn't be much better performance wise on dense graphs. Can anyone else suggest a better method? Thanks for taking the time to read this.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
你的问题非常复杂,但听起来好像你需要一种算法来找到图的连通分量(连通分量=全部相互连接但没有其他节点的节点集),这可以通过以下方式在线性时间内完成反复遍历。伪代码:
Traverse
的迭代变体:Your question is pretty complicated, but it sounds as though you need an algorithm to find the connected components of a graph (connected component = set of nodes all connected to one another but to no other nodes), which can be accomplished in linear time by repeated traversals. Pseudocode:
Iterative variant of
Traverse
: