How to parse a text-based table in C++

Posted on 2024-10-01 09:19:40

I am trying to parse a table stored in a text file using ifstream, and to evaluate/manipulate each entry. However, I'm having trouble figuring out how to approach this because some items are omitted. Consider the following table:

NEW  VER  ID   NAME
1    2a   4    "ITEM ONE" (2001)
     1    7    "2 ITEM" (2002) {OCT}
     1.1  10   "SOME ITEM 3" (2003)
1         12   "DIFFERENT ITEM 4" (2004)
1    a4   16   "ITEM5" (2005) {DEC}

As you can see, sometimes the "NEW" column has nothing in it. What I want to do is take note of the ID, the name, the year (in parentheses), and note whether there are braces afterwards or not.

When I started doing this, I looked for a "split" function, but I realized that it would be a bit more complicated because of the aforementioned missing items and the titles becoming separated.

The one thing I can think of is reading each line word by word, keeping track of the latest number I saw. Once I hit a quotation mark, make note that the latest number I saw was an ID (if I used something like a split, the array position right before the quotation mark), then keep record of everything until the next quote (the title), then finally, start looking for brackets and braces for the other information. However, this seems really primitive and I'm looking for a better way to do this.

I'm doing this to sharpen my C++ skills and work with larger, existing datasets, so I'd like to use C++ if possible, but if another language (I'm looking at Perl or Python) makes this trivially easy, I could just learn how to interface a different language with C++. All I'm doing right now is sifting through data that will eventually become C++ objects anyway, so I'll still have chances to improve my C++ skills.

EDIT: I also realize that this is possible to complete using only regex, but I'd like to try using different methods of file/string manipulation if possible.

女中豪杰 2024-10-08 09:19:40

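If the column offsets are truly fixed (no tabs, just real space characters, i.e. 0x20), I would read the file one line at a time (std::getline) and use the fixed offsets to break each line into a set of four strings (std::string::substr).

Then post-process each 4-tuple of strings as required.

I would not hard-code the offsets. Instead, I would store them in a separate input file that describes the format of the input, like a table description in SQL Server or another database.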
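The fixed-offset idea might look roughly like the sketch below. The helper names `trim` and `split_fixed` are made up for illustration, and the code assumes the caller supplies the starting column of each field (for the sample table, the header suggests 0, 5, 10 and 15). Where those offsets come from, for example a separate format-description file as suggested above, is left open.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: strip leading/trailing spaces, tabs and '\r'.
std::string trim(const std::string& s) {
    const char* ws = " \t\r";
    std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: cut one line into fields at fixed column offsets.
// offsets[i] is the starting column of field i; the last field runs to the
// end of the line. A blank or missing cell yields an empty string.
std::vector<std::string> split_fixed(const std::string& line,
                                     const std::vector<std::size_t>& offsets) {
    std::vector<std::string> fields;
    for (std::size_t i = 0; i < offsets.size(); ++i) {
        std::size_t begin = offsets[i];
        if (begin >= line.size()) {  // line is too short: field is absent
            fields.emplace_back();
            continue;
        }
        std::size_t len = (i + 1 < offsets.size())
                              ? offsets[i + 1] - begin
                              : std::string::npos;
        fields.push_back(trim(line.substr(begin, len)));
    }
    return fields;
}
```

Calling `split_fixed(line, {0, 5, 10, 15})` on each line obtained with std::getline would give the four strings to post-process. An empty first or second string then simply means that the NEW or VER cell was blank in that row.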

拥抱没勇气 2024-10-08 09:19:40

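Something like this:

Read the first line, find "ID", and store the index.
Read each data line using std::getline().
Create a substring from the data line, starting at the index where you found "ID" in the header line. Use it to initialize a std::istringstream.
Read the ID using iss >> an_int.
Search for the first ". Search for the second ". Search for the ( and remember its index. Search for the ) and remember that index, too. Create a substring from the characters between those indexes and use it to initialize another std::istringstream. Read the number from this stream.
Search for the braces.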
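The steps above can be sketched as follows. This is only one possible reading of them: the `Entry` struct and the functions `parse_row` and `parse_table` are hypothetical names, and rows that lack an ID, a quoted name or a parenthesized year are simply skipped, which is an assumption rather than something the question specifies.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type; the field names are illustrative only.
struct Entry {
    int id = 0;
    std::string name;
    int year = 0;
    bool has_braces = false;
    std::string brace_text;  // text inside {...}, if present
};

// Parse one data row. id_col is the index of "ID" in the header line.
// Returns false if the row does not have the expected shape.
bool parse_row(const std::string& line, std::size_t id_col, Entry& e) {
    const std::size_t npos = std::string::npos;
    if (id_col >= line.size()) return false;

    // Read the ID from the substring that starts at the ID column.
    std::istringstream id_stream(line.substr(id_col));
    if (!(id_stream >> e.id)) return false;

    // The name is whatever sits between the first and second quote.
    std::size_t q1 = line.find('"', id_col);
    std::size_t q2 = (q1 == npos) ? npos : line.find('"', q1 + 1);
    if (q2 == npos) return false;
    e.name = line.substr(q1 + 1, q2 - q1 - 1);

    // The year is the number between '(' and ')' after the name.
    std::size_t p1 = line.find('(', q2 + 1);
    std::size_t p2 = (p1 == npos) ? npos : line.find(')', p1 + 1);
    if (p2 == npos) return false;
    std::istringstream year_stream(line.substr(p1 + 1, p2 - p1 - 1));
    if (!(year_stream >> e.year)) return false;

    // The braces are optional; record whether they are there.
    std::size_t b1 = line.find('{', p2 + 1);
    std::size_t b2 = (b1 == npos) ? npos : line.find('}', b1 + 1);
    e.has_braces = (b2 != npos);
    if (e.has_braces) e.brace_text = line.substr(b1 + 1, b2 - b1 - 1);
    return true;
}

// Read the header, locate the "ID" column, then parse every data row.
std::vector<Entry> parse_table(std::istream& in) {
    std::vector<Entry> entries;
    std::string line;
    if (!std::getline(in, line)) return entries;
    std::size_t id_col = line.find("ID");
    if (id_col == std::string::npos) return entries;
    while (std::getline(in, line)) {
        Entry e;
        if (parse_row(line, id_col, e)) entries.push_back(e);
    }
    return entries;
}
```

An std::ifstream opened on the table file can be passed straight to `parse_table`, since it is an std::istream. Searching for the parentheses only after the closing quote keeps a "(" inside a name from being mistaken for the start of the year.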