Reading a large TXT file (2GB) into a string takes too long

Posted on 2025-02-06 06:22:44 | Length: 2550 | Views: 3 | Comments: 0


I have a big text file (2GB) that contains a couple of books. I want to create a char** that contains each word of the whole text file. But first I read all of the file's data into one huge string, and then I build the char** variable from it.

The problem is that the getline() loop takes too long (hours) to finish. I ran it for 30 minutes and the program had read 500,000 lines. The whole file is 43,000,000 lines.

int main (){
ifstream book;
string sbook,str;
book.open("gutenberg.txt"); // the huge file
cout<<"Reading the file ....."<<endl;
while(!book.eof()){
    getline(book,sbook);//passing the line as a string to sbook
    if(str.empty()){
        str= sbook;
    }
    else
        str= str + " " + sbook;//append sbook to another string until the file ends

}//I never managed to get out of this loop
cout<<"Done reading the file."<<endl;
cout<<"Removal....."<<endl;
removal(str);//removes all punctuation and converts each uppercase letter to lowercase
cout<<"done removal"<<endl;
cout<<"Removing doublewhitespaces...."<<endl;
int whitespaces=removedoublewhitespace(str);//removes excess whitespace, leaving a single space between words,
                                            //and returns the number of remaining whitespace characters
cout<<"doublewhitespaces removed."<<endl;
cout<<"initiating leksis....."<<endl;
char **leksis=new char*[whitespaces+1];//whitespaces+1 is how many words are left in the file
for(int i=0;i<whitespaces+1;i++){
    leksis[i]= new char[30];
}
cout<<"done initiating leksis."<<endl;
int y=0,j=0;
cout<<"constructing leksis,finding plithos...."<<endl;
for(int i=0;i<str.length();i++){
    if(isspace(str[i])){
        y++;
        j=0;
        leksis[y][j]=' ';
        j++;
    }
    else{
        leksis[y][j]=str[i];
        j++;
    } 
}
cout<<"Done constructing leksis,finding plithos...."<<endl;

removal() function

void removal(string &s) {
for (int i = 0, len = s.size(); i < len; i++)
{
    char c=s[i];
    if(isupper(s[i])){
        s[i]=tolower(s[i]);
    }
    int flag=ispunct(s[i]);
    if (flag){
        s.erase(i--, 1);
        len = s.size();
    }
}

}

removedoublewhitespace() function:

int removedoublewhitespace(string &str){
int wcnt=0;
for(int i=str.size()-1; i >= 0; i-- )
{
    if(str[i]==' '&&str[i]==str[i-1]) //added equal sign
    {
        str.erase( str.begin() + i );
    }
}
for(int i=0;i<str.size();i++){
    if(isspace(str[i])){
        wcnt++;
    }
}
return wcnt;

}


诗笺 2025-02-13 06:22:44

This loop

while(!book.eof()){
    getline(book,sbook);//passing the line as a string to sbook
    if(str.empty()){
        str= sbook;
    }
    else
        str= str + " " + sbook;

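is hugely inefficient. Concatenating a huge string like that is terrible: every `str = str + " " + sbook` builds a fresh copy of everything accumulated so far, so the total work grows roughly with the square of the file size. If you must have the whole file in memory at once, put it in a linked list of strings, one per line, or in a vector of strings. That is also a huge chunk of memory, but it will be allocated more efficiently.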
