C++ text files, Chinese characters
I have a C++ project that is supposed to add <item> to the beginning of every line and </item> to the end of every line. This works fine with normal English text, but it does not work on a Chinese text file I would like to process. I normally use .txt files, but for this I have to use .rtf to save the Chinese text. After I run my code, the output is gibberish. Here's an example:
{\rtf1\adeflang1025\ansi\ansicpg1252\uc1\adeff31507\deff0\stshfdbch31506\stshfloch31506\stshfhich31506\stshfbi31507\deflang1033\deflangfe1033\themelang1033\themelangfe0\themelangcs0{\fonttbl{\f2\fbidi \fmodern\fcharset0\fprq1{*\panose 02070309020205020404}Courier New;}
Code:
#include <cstdlib>
#include <fstream>
#include <string>
using namespace std;

int main()
{
    ifstream in;
    ofstream out;
    string lineT, newlineT;
    in.open("rawquote.rtf");
    if (in.fail())
        exit(1);
    out.open("itemisedQuote.rtf");
    do
    {
        getline(in, lineT, '\n');
        newlineT += "<item>";
        newlineT += lineT;
        newlineT += "</item>";
        if (lineT.length() > 5)   // only write lines longer than 5 bytes
        {
            out << newlineT << '\n';
        }
        newlineT = "";
        lineT = "";
    } while (!in.eof());
    return 0;
}
Comments (5)
That looks like RTF, which makes sense, since you say this is an .rtf file.
Basically, if you dump that file right after opening it, you'll see it looks like that...
Also, you should revisit your loop.
You can't read RTF code the same way as plain text: you would be ignoring the format tags and so on, and could easily break the markup.
Try saving your Chinese text as a plain text file using UTF-8 (without BOM), and your code should work. However, this might fail if some other UTF-8 encoded character essentially contains a line break (not sure about this part right now), so you should try to do a real UTF-8 conversion and read the file using wide chars instead of regular chars (as Chan suggested), which is a little tricky in C++.
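On the line-break worry: in UTF-8, every byte of a multi-byte character has its high bit set (0x80 or above), so the byte 0x0A can only ever be a real newline. Byte-oriented `std::getline` on a `std::string` should therefore split valid UTF-8 text correctly without wide characters. A minimal sketch of that idea follows; the helper name `wrapUtf8Lines` is made up for illustration, and the input is assumed to be valid UTF-8 without a BOM.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits UTF-8 text on the '\n' byte and wraps each line.
// The bytes of each line are copied unchanged, so multi-byte characters
// (for example Chinese text) pass through intact.
std::vector<std::string> wrapUtf8Lines(const std::string& utf8Text)
{
    std::vector<std::string> items;
    std::istringstream in(utf8Text);
    std::string line;
    while (std::getline(in, line)) {
        items.push_back("<item>" + line + "</item>");
    }
    return items;
}
```

Note that `std::string::length()` counts bytes here, not characters, so a length filter like the one in the question would treat two Chinese characters (6 bytes) as longer than 5.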
It's kind of a miracle that this works for non-Chinese text. "\n" is not the line separator in RTF; "\par" is. The odds of doing more damage to the RTF header are certainly greater with Chinese text.
C++ is not the best language to tackle this. It is a trivial 5 minute program in C# as long as the file doesn't get too large:
If C++ is a hard requirement then you'll have to find an RTF parser.
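To illustrate the point about "\par" in C++: the sketch below counts `\par` control words in raw RTF source, which generally will not match the number of '\n' characters in the file. It is deliberately not an RTF parser. It ignores groups, destinations, and binary data, and the function name `countParControlWords` is invented for this example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: counts \par control words in raw RTF source.
// A control word is a backslash followed by a run of ASCII letters,
// so \pard is a different control word and is not counted.
std::size_t countParControlWords(const std::string& rtf)
{
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < rtf.size()) {
        if (rtf[i] != '\\') {
            ++i;
            continue;
        }
        ++i;  // skip the backslash
        if (i >= rtf.size()) {
            break;
        }
        if (!std::isalpha(static_cast<unsigned char>(rtf[i]))) {
            ++i;  // control symbol such as an escaped backslash or brace
            continue;
        }
        // Read the control word name: a run of letters.
        const std::size_t start = i;
        while (i < rtf.size() &&
               std::isalpha(static_cast<unsigned char>(rtf[i]))) {
            ++i;
        }
        if (rtf.compare(start, i - start, "par") == 0) {
            ++count;
        }
    }
    return count;
}
```

Counting is about as far as a scan like this can safely go; wrapping each paragraph's visible text in tags would still need a real RTF parser, as the answer says.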
I think you should use wide characters (wchar_t) for the string instead of regular char.
If I'm understanding the objective of this code, your solution is not going to work. A line break in an RTF document does not correspond to a line break in the visible text.
If you can't just use plain text (Chinese characters are not a problem with a valid encoding), take a look at the RTF spec. You'll discover that it is a nightmare. So your best bet is probably a third-party library that can parse RTF and read it "line" by "line". I have never looked for such a library, so I don't have any suggestions off the top of my head, but I'm sure they are out there.