C++ text files and Chinese characters

Posted on 2024-10-10 08:41:47 | Word count: 1032 | Views: 7 | Comments: 0

I have a C++ project that is supposed to add <item> to the beginning of every line and </item> to the end of every line. This works fine with normal English text, but it does not work on a Chinese text file I want to process. I normally use .txt files, but for this I have to use .rtf to save the Chinese text. After I run my code, the output is gibberish. Here's an example.

{\rtf1\adeflang1025\ansi\ansicpg1252\uc1\adeff31507\deff0\stshfdbch31506\stshfloch31506\stshfhich31506\stshfbi31507\deflang1033\deflangfe1033\themelang1033\themelangfe0\themelangcs0{\fonttbl{\f2\fbidi \fmodern\fcharset0\fprq1{*\panose 02070309020205020404}Courier New;}

Code:

#include <cstdlib>   // exit
#include <fstream>   // ifstream, ofstream
#include <string>    // string, getline

using namespace std;

int main()
{
    ifstream in;
    ofstream out;
    string lineT, newlineT;

    in.open("rawquote.rtf");
    if(in.fail())
       exit(1);
    out.open("itemisedQuote.rtf");
    do
    {
        getline(in,lineT,'\n');
        newlineT += "<item>";
        newlineT += lineT;
        newlineT += "</item>";
        if (lineT.length() >5)
        {
            out<<newlineT<<'\n';
        }
        newlineT = "";
        lineT = "";
    } while(!in.eof());
    return 0;
}

Comments (5)

白云不回头 2024-10-17 08:41:47

That looks like RTF, which makes sense, since you say this is an .rtf file.

Basically, if you dump the raw contents of that file right after opening it, you'll see it looks just like that...

Also, you should revisit your loop:

std::string line;
while(std::getline(in, line, '\n'))
{
  // do stuff here; the loop condition above confirms that a line was actually read
  out << "<item>" << line << "</item>" << std::endl;
}
水晶透心 2024-10-17 08:41:47

You can't process RTF code the same way as plain text: your code ignores the format tags and so on, and wrapping them in <item> will most likely break the document.

Try saving your Chinese text as a plain text file using UTF-8 (without BOM) and your code should work. Splitting on '\n' is safe there, because UTF-8 never uses the line-break byte (0x0A) as part of a multi-byte character. If you later need to work on individual characters rather than whole lines, you would have to do real UTF-8 decoding and read the file using wide chars instead of regular chars (as Chan suggested), which is a little bit tricky in C++.

你在我安 2024-10-17 08:41:47

It's kind of a miracle that this works for non-Chinese text. "\n" is not the line separator in RTF; "\par" is. With Chinese text, the odds that even more damage is done to the RTF header are certainly greater.
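To see the mismatch between physical lines and RTF paragraphs in C++, a rough sketch like the one below counts "\par" control words in the raw file contents. `count_par` is a hypothetical helper, not part of any library, and it is deliberately simplified: it skips escaped characters such as `\\` and does not mistake `\pard` for `\par`, but it ignores embedded binary data and is in no way a real RTF parser.

```cpp
#include <cstddef>
#include <string>

// True for ASCII letters only; RTF control words consist of ASCII letters.
static bool is_ascii_letter(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

// Counts occurrences of the \par control word in raw RTF text.
std::size_t count_par(const std::string& rtf) {
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < rtf.size()) {
        if (rtf[i] != '\\') { ++i; continue; }
        ++i;  // move past the backslash
        if (i >= rtf.size()) break;
        if (!is_ascii_letter(rtf[i])) {
            ++i;  // control symbol such as \\ or \{ : skip the escaped character
            continue;
        }
        std::size_t start = i;
        while (i < rtf.size() && is_ascii_letter(rtf[i])) ++i;
        // Compare the whole control word, so \pard is not counted as \par.
        if (rtf.compare(start, i - start, "par") == 0) ++count;
    }
    return count;
}
```

Comparing this count with the number of '\n' characters in the same file usually shows two unrelated numbers, which is why wrapping physical lines in <item> tags damages the document.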

C++ is not the best language to tackle this. It is a trivial 5-minute program in C#, as long as the file doesn't get too large:

using System;
using System.Windows.Forms;   // Add reference

class Program {
    static void Main(string[] args) {
        var rtb = new RichTextBox();
        rtb.LoadFile(args[0], RichTextBoxStreamType.RichText);
        var lines = rtb.Lines;
        for (int ix = 0; ix < lines.Length; ++ix) {
            lines[ix] = "<item>" + lines[ix] + "</item>";
        }
        rtb.Lines = lines;
        rtb.SaveFile(args[0], RichTextBoxStreamType.RichText);
    }
}

If C++ is a hard requirement then you'll have to find an RTF parser.

慢慢从新开始 2024-10-17 08:41:47

I think you should use wide characters (wchar_t / std::wstring) for the strings instead of regular char.
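Whether wide characters are actually needed depends on what the program does with the text. As a hedged illustration, assuming the input is valid UTF-8: a regular `std::string` carries the bytes through unchanged, but `length()` counts bytes rather than characters, which affects a check like `lineT.length() > 5` in the question. The helper name `count_code_points` is made up for this sketch.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a string assumed to hold valid UTF-8.
// Every code point starts with a byte that is NOT of the form 10xxxxxx,
// so counting the non-continuation bytes gives the number of code points.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For a two-character Chinese string stored as UTF-8, `size()` reports 6 bytes while `count_code_points` reports 2, so a byte-length threshold treats Chinese and English lines differently.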

梦一生花开无言 2024-10-17 08:41:47

If I'm understanding the objective of this code, your solution is not going to work. A line break in an RTF document does not correspond to a line break in the visible text.

If you can't just use plain text (Chinese characters are not a problem with a valid encoding), take a look at the RTF spec. You'll discover that it is a nightmare. So your best bet is probably a third-party library that can parse RTF and read it "line" by "line." I have never looked for such a library, so I don't have any suggestions off the top of my head, but I'm sure they are out there.
