The right data structure for an Excel clone

Posted on 2024-07-15 20:14:41

Let's say I'm working on an Excel clone in C#.
My grid is represented as follows:

private struct CellValue
{
    private int column;
    private int row;
    private string text;
}
private List<CellValue> cellValues = new List<CellValue>();

Each time the user adds text, I just package it as a CellValue and add it to cellValues. Given a CellValue, I can determine its row and column in O(1) time, which is great. However, given a column and a row, I need to loop through all of cellValues to find the text in that cell, which is terribly slow. Likewise, given a text, I need to loop through the entire list. Is there any data structure that lets me do all three tasks in O(1) time?

Update:
Looking through some of the answers, I don't think I have found one that I like. Can I also have the following?

  1. No more than 2 copies of each CellValue, so that I don't have to keep them in sync. In the C world I would have made nice use of pointers.
  2. Rows and columns that can be added dynamically (unlike Excel).

Comments (9)

迎风吟唱 2024-07-22 20:14:42

I think you should use one of the indexed collections to make it work reasonably fast; the perfect one is KeyedCollection.

You need to create your own collection by extending this class. This way your object will still contain the row and column (so you will not lose anything), but you will be able to search by them. You will probably have to create a class encapsulating (row, column) and make it the key (so make it immutable and override Equals and GetHashCode).
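
A minimal sketch of this suggestion, under some assumptions: the names CellAddress, CellCollection and Address are made up for illustration, the key is written as a readonly struct rather than a class, and HashCode.Combine assumes .NET Core 2.1 or later. KeyedCollection and its GetKeyForItem override are the real base-class members.

```csharp
using System;
using System.Collections.ObjectModel;

// Hypothetical immutable (row, column) key with value-based equality and hashing.
public readonly struct CellAddress : IEquatable<CellAddress>
{
    public int Row { get; }
    public int Column { get; }

    public CellAddress(int row, int column)
    {
        Row = row;
        Column = column;
    }

    public bool Equals(CellAddress other) => Row == other.Row && Column == other.Column;
    public override bool Equals(object obj) => obj is CellAddress other && Equals(other);
    public override int GetHashCode() => HashCode.Combine(Row, Column);
}

// The cell still carries its own row and column, as in the question.
public sealed class CellValue
{
    public CellAddress Address { get; }
    public string Text { get; set; }

    public CellValue(int row, int column, string text)
    {
        Address = new CellAddress(row, column);
        Text = text;
    }
}

// KeyedCollection extracts the key from each item and keeps an internal
// dictionary, so lookup by address is a hash lookup rather than a scan.
public sealed class CellCollection : KeyedCollection<CellAddress, CellValue>
{
    protected override CellAddress GetKeyForItem(CellValue item) => item.Address;
}
```

With this in place, `cells[new CellAddress(3, 4)]` returns the cell at row 3, column 4, and `cells.Contains(...)` checks whether a cell exists, both in O(1) on average. Because the key is derived from the item, a cell's address must not change while it is in the collection.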

可爱暴击 2024-07-22 20:14:42

I'd create

Collection<Collection<CellValue>> rowCellValues = new Collection<Collection<CellValue>>();

and

Collection<Collection<CellValue>> columnCellValues = new Collection<Collection<CellValue>>();

The outer collection has one entry for each row or column, indexed by the row or column number; the inner collection holds all the cells in that row or column. These collections should be populated as part of the process that creates new CellValue objects.

rowCellValues[newCellValue.Row].Add(newCellValue);
columnCellValues[newCellValue.Column].Add(newCellValue);

暗藏城府 2024-07-22 20:14:42

This smells of premature optimization.

That said, there are a few features of Excel that are important in choosing a good structure.

First, Excel uses the cells in a moderately non-linear fashion. The process of resolving formulas involves traversing the spreadsheet in effectively random order. The structure will need a cheap way to look up the values of arbitrary keys and to mark them dirty, resolved, or unresolvable due to a circular reference. It will also need some way to know when there are no more unresolved cells left, so that it can stop working. Any solution that involves a linked list is probably sub-optimal for this, since it would require a linear scan to reach those cells.
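
To make the dirty/resolved/circular bookkeeping concrete, here is a rough sketch. Everything in it is an assumption for illustration: the Sheet, Cell and CellState names, and especially the toy formula model in which a cell's value is a constant plus the sum of the cells it references. It is not how Excel resolves formulas; it only shows hash lookup by (row, column) combined with per-cell state.

```csharp
using System;
using System.Collections.Generic;

public enum CellState { Dirty, Resolving, Resolved, Circular }

public sealed class Cell
{
    public double Constant;   // toy formula: Value = Constant + sum of referenced cells
    public List<(int Row, int Col)> References = new List<(int Row, int Col)>();
    public CellState State = CellState.Dirty;
    public double Value;
}

public sealed class Sheet
{
    // Sparse storage: only cells that were touched exist in the dictionary.
    private readonly Dictionary<(int Row, int Col), Cell> cells =
        new Dictionary<(int Row, int Col), Cell>();

    // Returns the cell at (row, col), creating it on first access.
    public Cell this[int row, int col]
    {
        get
        {
            if (!cells.TryGetValue((row, col), out var cell))
            {
                cell = new Cell();
                cells[(row, col)] = cell;
            }
            return cell;
        }
    }

    // Marks everything dirty, then resolves each cell; when the loop ends,
    // every cell is either Resolved or Circular.
    public void Recalculate()
    {
        foreach (var cell in cells.Values) cell.State = CellState.Dirty;
        foreach (var key in cells.Keys) Resolve(key, out _);
    }

    private CellState Resolve((int Row, int Col) key, out double value)
    {
        value = 0;
        if (!cells.TryGetValue(key, out var cell)) return CellState.Resolved; // empty cell counts as 0

        // Reaching a cell that is still being resolved means we followed a cycle.
        if (cell.State == CellState.Resolving) cell.State = CellState.Circular;

        if (cell.State == CellState.Dirty)
        {
            cell.State = CellState.Resolving;
            double sum = cell.Constant;
            bool circular = false;
            foreach (var reference in cell.References)
            {
                if (Resolve(reference, out var referenced) == CellState.Circular) circular = true;
                sum += referenced;
            }
            cell.Value = circular ? 0 : sum;
            cell.State = circular ? CellState.Circular : CellState.Resolved;
        }

        value = cell.Value;
        return cell.State;
    }
}
```

The recursion keeps the sketch short; a long chain of references could overflow the stack, so a real implementation would more likely use an explicit work list.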

Another issue is that Excel displays a range of cells at one time. This may seem trivial, and to a large extent it is, but it would certainly be ideal if the app could pull all of the data needed to draw a range of cells in one shot. Part of this may be keeping track of the display height and width of the rows and columns, so that the display system can iterate over the range until the desired width and height of cells has been collected. The need to iterate in this manner may preclude the use of a hashing strategy for sparse storage of cells.

On top of that, there are some weaknesses of the representational model of spreadsheets that could be addressed much more effectively by taking a slightly different approach.

For example, column aggregates are sort of clunky. A column total is easy enough to implement in Excel, but it has a sort of magic behavior that works most of the time but not all of the time. For instance, if you add a row into the aggregated area, further calculations on that aggregate may continue to work, or not, depending on how you added it. If you copy and insert a row (and replace the values) everything works fine, but if you cut and paste the cells one row down, things don't work out so well.

冬天旳寂寞 2024-07-22 20:14:42

Given that the data is 2-dimensional, I would have a 2D array to hold it in.

我最亲爱的 2024-07-22 20:14:42

Well, you could store them in three Dictionaries: two Dictionary<int,CellValue> objects for rows and columns, and one Dictionary<string,CellValue> for text. You'd have to keep all three carefully in sync though.

I'm not sure that I wouldn't just go with a big two-dimensional array though...
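
One way the keep-the-indexes-in-sync idea could look is sketched below. It is a variant, not exactly the three dictionaries described above: it assumes a single dictionary keyed by the (row, column) pair plus a text index that maps each text to the set of cells containing it, since the same text can appear in several cells. The Grid class and its method names are hypothetical, and text is assumed to be non-null.

```csharp
using System;
using System.Collections.Generic;

public sealed class CellValue
{
    public int Row { get; }
    public int Column { get; }
    public string Text { get; }

    public CellValue(int row, int column, string text)
    {
        Row = row;
        Column = column;
        Text = text;
    }
}

// Hypothetical wrapper that owns both indexes, so callers cannot
// update one and forget the other.
public sealed class Grid
{
    private readonly Dictionary<(int Row, int Column), CellValue> byPosition =
        new Dictionary<(int Row, int Column), CellValue>();
    private readonly Dictionary<string, HashSet<CellValue>> byText =
        new Dictionary<string, HashSet<CellValue>>();

    public void SetText(int row, int column, string text)
    {
        Remove(row, column); // drop any old cell from both indexes first
        var cell = new CellValue(row, column, text);
        byPosition[(row, column)] = cell;
        if (!byText.TryGetValue(text, out var sameText))
        {
            sameText = new HashSet<CellValue>();
            byText[text] = sameText;
        }
        sameText.Add(cell);
    }

    public bool Remove(int row, int column)
    {
        if (!byPosition.TryGetValue((row, column), out var cell)) return false;
        byPosition.Remove((row, column));
        var sameText = byText[cell.Text];
        sameText.Remove(cell);
        if (sameText.Count == 0) byText.Remove(cell.Text);
        return true;
    }

    // Row/column to cell: one hash lookup; null when the cell is empty.
    public CellValue Find(int row, int column) =>
        byPosition.TryGetValue((row, column), out var cell) ? cell : null;

    // Text to cells: one hash lookup, returning every cell with that text.
    public IReadOnlyCollection<CellValue> FindByText(string text) =>
        byText.TryGetValue(text, out var found)
            ? found
            : (IReadOnlyCollection<CellValue>)Array.Empty<CellValue>();
}
```

Because CellValue is a class here, both indexes hold references to the same object rather than copies, which is roughly what the question's remark about pointers in C is after.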

離人涙 2024-07-22 20:14:42

If it's an exact clone, then an array-backed list of CellValue[256] arrays. Excel 2003 and earlier have 256 columns per sheet (Excel 2007 raised the limit to 16,384), and the list lets the number of rows grow.

荒芜了季节 2024-07-22 20:14:42

If rows and columns can be added "dynamically", then you shouldn't store the row/column as a numeric attribute of the cell, but rather as a reference to a row or column object.

Example:

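// A class rather than a struct: the row and column lists must share
// references to the same cell object, not hold separate copies of it.
private class CellValue
{
  private List<CellValue> _column;
  private List<CellValue> _row;
  private string text;

  // Assigning a column detaches the cell from its old column (if any)
  // and attaches it to the new one.
  public List<CellValue> column {
     get { return _column; }
     set {
         if (_column != null) { _column.Remove(this); }
         _column = value;
         if (_column != null) { _column.Add(this); }
     }
  }

  // Same bookkeeping for the row.
  public List<CellValue> row {
     get { return _row; }
     set {
         if (_row != null) { _row.Remove(this); }
         _row = value;
         if (_row != null) { _row.Add(this); }
     }
  }
}

private List<List<CellValue>> MyRows    = new List<List<CellValue>>();
private List<List<CellValue>> MyColumns = new List<List<CellValue>>();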

Each Row and Column object is implemented as a List of the CellValue objects. These are unordered--the order of the cells in a particular Row does not correspond to the Column index, and vice-versa.

Each sheet has a List of Rows and a list of Columns, in order of the sheet (shown above as MyRows and MyColumns).

This will allow you to rearrange and insert new rows and columns without looping through and updating any cells.

Deleting a row should loop through the cells on the row and delete them from their respective columns before deleting the row itself. And vice-versa for columns.

To find the cell at a particular row and column, find the appropriate Row and Column objects, then find the CellValue that they contain in common.

Example:

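// Needs "using System.Linq;". Intersect returns an IEnumerable, which
// cannot be indexed, so take the first match (or the default value if
// the row and column share no cell).
public CellValue GetCell(int rowIndex, int colIndex) {
  List<CellValue> row = MyRows[rowIndex];
  List<CellValue> col = MyColumns[colIndex];
  return row.Intersect(col).FirstOrDefault();
}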

(I'm a little fuzzy on these Extension methods in .NET 3.5, but this should be in the ballpark.)

淡写薰衣草的香 2024-07-22 20:14:42

If I recall correctly, there was an article about how VisiCalc did it, maybe in Byte magazine in the early 80s. I believe it was a sparse array of some sort. But I think there were links both up-and-down and left-and-right, so that any given cell had a pointer to the cell above it (however many cells away that may be), below it, to the left of it, and to the right of it.

许你一世情深 2024-07-22 20:14:41

I would opt for a sparse array (a linked list of linked lists) to give maximum flexibility with minimum storage.

In this example, you have a linked list of rows with each element pointing to a linked list of cells in that row (you could reverse the cells and rows depending on your needs).

 |
 V
+-+    +---+             +---+
|1| -> |1.1| ----------> |1.3| -:
+-+    +---+             +---+
 |
 V
+-+             +---+
|7| ----------> |7.2| -:
+-+             +---+
 |
 =

Each row element has the row number in it and each cell element has a pointer to its row element, so that getting the row number from a cell is O(1).

Similarly, each cell element has its column number, making that O(1) as well.

There's no easy way to get O(1) for immediately finding the cell at a given row/column, but a sparse array is as fast as it's going to get unless you pre-allocate information for every possible cell so that you can do index lookups on an array. That would be very wasteful in terms of storage.

One thing you could do is make one dimension non-sparse, such as making the columns the primary array (rather than linked list) and limiting them to 1,000 - this would make the column lookup indexed (fast), then a search on the sparse rows.

I don't think you can ever get O(1) for a text lookup simply because text can be duplicated in multiple cells (unlike row/column). I still believe the sparse array will be the fastest way to search for text, unless you maintain a sorted index of all text values in another array (again, that can make it faster but at the expense of copious amounts of memory).
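
For illustration, the linked list of linked lists in the diagram could be sketched roughly as follows. The SparseSheet, RowNode and CellNode names and the choice to keep both lists sorted are assumptions of this sketch, not part of the answer above.

```csharp
using System;

public sealed class SparseSheet
{
    private sealed class CellNode
    {
        public int Column;      // column number stored in the cell: O(1)
        public string Text;
        public RowNode Owner;   // back-pointer to the row node: row number in O(1)
        public CellNode Next;
    }

    private sealed class RowNode
    {
        public int Row;
        public CellNode FirstCell;
        public RowNode Next;
    }

    private RowNode firstRow;

    public void Set(int row, int column, string text)
    {
        // Find or insert the row node, keeping rows sorted by row number.
        RowNode prevRow = null, r = firstRow;
        while (r != null && r.Row < row) { prevRow = r; r = r.Next; }
        if (r == null || r.Row != row)
        {
            r = new RowNode { Row = row, Next = r };
            if (prevRow == null) firstRow = r; else prevRow.Next = r;
        }

        // Find or insert the cell node within that row, sorted by column.
        CellNode prevCell = null, c = r.FirstCell;
        while (c != null && c.Column < column) { prevCell = c; c = c.Next; }
        if (c == null || c.Column != column)
        {
            c = new CellNode { Column = column, Owner = r, Next = c };
            if (prevCell == null) r.FirstCell = c; else prevCell.Next = c;
        }
        c.Text = text;
    }

    // Walks the row list, then the cell list of that row: linear in the
    // number of stored rows plus the cells in the matching row, not O(1).
    public string Get(int row, int column)
    {
        RowNode r = firstRow;
        while (r != null && r.Row < row) r = r.Next;
        if (r == null || r.Row != row) return null;

        CellNode c = r.FirstCell;
        while (c != null && c.Column < column) c = c.Next;
        return (c != null && c.Column == column) ? c.Text : null;
    }
}
```

Only populated rows and cells take up memory, which is the storage advantage; the price is the walk in Get, matching the point that row/column lookup is not O(1) with this layout.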
