Reading a large TXT file (2 GB) into a string takes too long
I have a big text file (2 GB) that contains a couple of books. I want to create a char** that contains each word of the whole text file. But first I read all of the file's data into one huge string, and then build the char** variable.
The problem is that it takes too long (hours) for the getline() loop to end. I ran it for 30 minutes and the program had read 500,000 lines. The whole file is 43,000,000 lines.
int main (){
    ifstream book;
    string sbook,str;
    book.open("gutenberg.txt"); // the huge file
    cout<<"Reading the file ....."<<endl;
    while(!book.eof()){
        getline(book,sbook); // read one line into sbook
        if(str.empty()){
            str= sbook;
        }
        else
            str= str + " " + sbook; // append sbook to str until the file ends
    } // I never managed to get out of this loop
    cout<<"Done reading the file."<<endl;
    cout<<"Removal....."<<endl;
    removal(str); // removes all punctuation and converts each uppercase letter to lowercase
    cout<<"done removal"<<endl;
    cout<<"Removing doublewhitespaces...."<<endl;
    int whitespaces=removedoublewhitespace(str); // removes excess whitespace, leaving a single space between words,
                                                 // and returns the number of whitespace characters left
    cout<<"doublewhitespaces removed."<<endl;
    cout<<"initiating leksis....."<<endl;
    char **leksis=new char*[whitespaces+1]; // whitespaces+1 is how many words are left in the file
    for(int i=0;i<whitespaces+1;i++){
        leksis[i]= new char[30];
    }
    cout<<"done initiating leksis."<<endl;
    int y=0,j=0;
    cout<<"constructing leksis,finding plithos...."<<endl;
    for(int i=0;i<str.length();i++){
        if(isspace(str[i])){
            y++;
            j=0;
            leksis[y][j]=' ';
            j++;
        }
        else{
            leksis[y][j]=str[i];
            j++;
        }
    }
    cout<<"Done constructing leksis,finding plithos...."<<endl;
The removal() function:
void removal(string &s) {
    for (int i = 0, len = s.size(); i < len; i++)
    {
        char c=s[i];
        if(isupper(s[i])){
            s[i]=tolower(s[i]);
        }
        int flag=ispunct(s[i]);
        if (flag){
            s.erase(i--, 1);
            len = s.size();
        }
    }
}
The removedoublewhitespace() function:
int removedoublewhitespace(string &str){
    int wcnt=0;
    for(int i=str.size()-1; i >= 0; i-- )
    {
        if(str[i]==' '&&str[i]==str[i-1]) //added equal sign
        {
            str.erase( str.begin() + i );
        }
    }
    for(int i=0;i<str.size();i++){
        if(isspace(str[i])){
            wcnt++;
        }
    }
    return wcnt;
}
Comments (1)
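The getline() loop is hugely inefficient. Concatenating onto a huge string like that is terrible: str = str + " " + sbook builds a brand-new copy of everything read so far on every iteration, so the total work grows quadratically with the size of the file. If you must have the whole file in memory at once, then put it in a linked list of strings, one for each line. Or use a vector of strings; that is also a huge chunk of memory, but it will be allocated more efficiently.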