Reading and writing Cyrillic files in C++

Posted on 2024-12-06 13:35:02

I first have to read a file in Cyrillic, then pick a random number of lines at random and write the modified text to a different file. There is no problem with Latin letters, but with Cyrillic text I get garbage. This is how I tried to do it.

Say, file input.txt is

ааааааа
ббббббб
ввввввв

I have to read it, and put every line into a vector:

vector<wstring> inputVector;
wstring inputString, result;
wifstream inputStream;
inputStream.open("input.txt");
while(!inputStream.eof())
{
    getline(inputStream, inputString);              
    inputVector.push_back(inputString);
}
inputStream.close();    

srand(time(NULL));
int numLines = rand() % inputVector.size();
for(int i = 0; i < numLines; i++)
{
    int randomLine = rand() % inputVector.size();
    result += inputVector[randomLine];
}

wofstream resultStream;
resultStream.open("result.txt");
resultStream << result;
resultStream.close();

So how can I work with Cyrillic text so that the program produces readable output, not just symbols?

Comments (2)

梦年海沫深 2024-12-13 13:35:02

Because you saw something like ■a a a a a a a 1♦1♦1♦1♦1♦1♦1♦ 2♦2♦2♦2♦2♦2♦2♦ printed to the console, it appears that input.txt is encoded in UTF-16, probably UTF-16 LE with a BOM. Your original code can work if you change the encoding of the file to UTF-8.
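One way to check that guess is to look at the first bytes of the file for a byte order mark. The following is a minimal sketch, not part of the original answer; the names Bom, detect_bom and detect_file_bom are hypothetical helpers introduced only for illustration.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper type: which byte order mark (if any) starts the data.
enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Inspect the leading bytes of some file contents.
// Note: a UTF-32 LE BOM (FF FE 00 00) also starts with FF FE and would be
// reported as Utf16LE here; this sketch does not try to tell them apart.
Bom detect_bom(const std::string& bytes)
{
    auto at = [&](std::size_t i) { return static_cast<unsigned char>(bytes[i]); };
    if (bytes.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return Bom::Utf8;
    if (bytes.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return Bom::Utf16LE;
    if (bytes.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;
}

// Read up to three bytes from the start of a file and classify them.
// A file that cannot be opened yields Bom::None.
Bom detect_file_bom(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    char head[3] = {};
    in.read(head, sizeof head);
    return detect_bom(std::string(head, static_cast<std::size_t>(in.gcount())));
}
```

If detect_file_bom("input.txt") reports Utf16LE, that would be consistent with the symptoms above. Bom::None does not prove the file is UTF-8, since a BOM is optional.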

The reason for using UTF-8 is that, regardless of the character type of the file stream, basic_fstream's underlying basic_filebuf uses a codecvt object to convert between a stream of char objects and a stream of objects of the stream's character type. When reading, the char stream read from the file is converted to a wchar_t stream; when writing, a wchar_t stream is converted to a char stream that is then written to the file. In the case of std::wifstream, the codecvt object is an instance of the standard std::codecvt<wchar_t, char, mbstate_t>, which converts between the locale's narrow multibyte encoding (UTF-8 on many systems) and the platform's wide encoding (typically UTF-16 or UTF-32, depending on the size of wchar_t).

As explained on the MSDN documentation page for basic_filebuf:

Objects of type basic_filebuf are created with an internal buffer of type char * regardless of the char_type specified by the type parameter Elem. This means that a Unicode string (containing wchar_t characters) will be converted to an ANSI string (containing char characters) before it is written to the internal buffer.

Similarly, when reading a Unicode string (containing wchar_t characters), the basic_filebuf converts the ANSI string read from the file to the wchar_t string returned to getline and other read operations.
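The conversion described above depends on the locale imbued into the stream, so it can also be requested explicitly instead of relying on the default. Below is a hedged sketch that installs a UTF-8 facet on a std::wifstream. It assumes std::codecvt_utf8 from <codecvt> is available (it is deprecated since C++17 and removed in C++26), and read_utf8_lines is a hypothetical helper name, not something from the original answer.

```cpp
#include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: read a UTF-8 encoded file into wide strings,
// one element per line.
std::vector<std::wstring> read_utf8_lines(const std::string& path)
{
    std::wifstream in;
    // Install a facet that decodes the file's bytes as UTF-8.
    // The std::locale object takes ownership of the facet pointer.
    in.imbue(std::locale(in.getloc(), new std::codecvt_utf8<wchar_t>));
    in.open(path);

    std::vector<std::wstring> lines;
    std::wstring line;
    // getline's return value is the loop condition, so a failed read
    // never adds a spurious empty line.
    while (std::getline(in, line))
        lines.push_back(line);
    return lines;
}
```

Imbuing the locale before opening the file keeps the sketch on the safe side, because the facet is in place before any byte is converted. A std::wofstream can be set up the same way for writing.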

If you change the encoding of input.txt to UTF-8, your original program should work correctly.

For reference, this works for me:

#include <cstdlib>
#include <ctime>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    using namespace std;

    vector<wstring> inputVector;
    wstring inputString, result;
    wifstream inputStream;
    inputStream.open("input.txt");
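    // Use getline's result as the loop condition: testing eof() first
    // pushes one extra, empty line when the file ends with a newline.
    while(getline(inputStream, inputString))
    {
        inputVector.push_back(inputString);
    }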
    inputStream.close();

    srand(time(NULL));
    int numLines = rand() % inputVector.size();
    for(int i = 0; i < numLines; i++)
    {
        int randomLine = rand() % inputVector.size();
        result += inputVector[randomLine];
    }

    wofstream resultStream;
    resultStream.open("result.txt");
    resultStream << result;
    resultStream.close();

    return EXIT_SUCCESS;
}

Note that the encoding of result.txt will also be UTF-8 (generally).

时光沙漏 2024-12-13 13:35:02

Why would you use wifstream? Are you confident that your file consists of a sequence of (system-dependent) wide characters? Almost certainly that is not the case, most notably because the system's wide character set is not well defined outside the scope of a C++ program.

Instead, just read the input byte stream as it is and echo it accordingly:

std::ifstream infile(thefile);
std::string line;
std::vector<std::string> input;

while (std::getline(infile, line))   // like this!!
{
  input.push_back(line);
}

// etc.