Parsing a string delimited by tabs and newlines in C++
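I'm writing a program for a personal project that takes a list of words and their occurrence counts from Google Books and puts them into a vector, with the counts attached, so I can whittle the list down. Each line of the list has a word, a \t character, a number, and a newline (\n), and the pattern repeats. I don't have much experience with this type of programming, and I was wondering how one might parse a file formatted this way. Here's what I have so far: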
#include <iostream>
#include <string>
#include <fstream>
#include <vector>
#define FILE_NAME  // path to the word list (left blank in the original post)
using namespace std;

// structure denoting a word occurrence
// contains the string of the word and an integer representing its frequency
struct word_occ {
    string word;  // std::string is lowercase; "String" does not compile
    int occurence;
};

vector<word_occ> words_vector;

int main() {
    /*
    File is a .txt file that has the following format:
    word1 #####
    word2 #####
    where word is the word from the English 1-grams from Google Books
    and ##### is the number of occurrences.
    The word is separated from its occurrences by a tab (\t) and from other words by a newline (\n).
    All words are entirely lowercase, and all numbers are integers lower than 20,000,000
    */
    ifstream all_words_list(FILE_NAME);
    string line;
    string line_word;
    int line_occurence;
    word_occ this_line;
    while (getline(all_words_list, line)) {
        // ... <-- what goes here?
        this_line.word = line_word;
        this_line.occurence = line_occurence;
        words_vector.push_back(this_line);
    }
}
A string stream would likely work:
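The answer's own code did not survive on this page, so what follows is only a minimal sketch of the string stream idea, not the answerer's original. It wraps each line in a `std::istringstream`, reads everything up to the tab as the word, then extracts the count with `>>`. The function name `parse_word_list` is hypothetical, and silently skipping malformed lines is an assumption rather than something the question asks for.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Same layout as the struct in the question (identifier spelling kept as-is).
struct word_occ {
    std::string word;
    int occurence;
};

// Hypothetical helper: reads "word<TAB>count" lines from any input stream.
// Taking std::istream& lets it accept an std::ifstream or an in-memory stream.
std::vector<word_occ> parse_word_list(std::istream& in) {
    std::vector<word_occ> words;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        word_occ entry;
        // Everything before the tab is the word; the integer after it is the count.
        if (std::getline(fields, entry.word, '\t') && (fields >> entry.occurence)) {
            words.push_back(entry);
        }
        // Assumption: lines that do not match the format are skipped.
    }
    return words;
}
```

In the question's `main`, this could probably replace the whole loop: pass the opened `ifstream` to `parse_word_list` and assign the result to `words_vector`. Since the counts are stated to be below 20,000,000, they fit in a 32-bit `int`.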