Fixed-size character encoding
I am developing, in VB.Net, an application that reads from text files using a FileStream object. I do not use a StreamReader, since the buffering it does makes it impossible to use Seek.
Those text files form a database, with both index and data files. In index files, all fields are fixed-length, which is not the case in data files.
I've recently run into a problem. Since some of my files contain accents, the corresponding characters take more than 1 byte. Therefore, when I seek in the index file, an offset appears and the rest of my index file is not read correctly.
I'm searching for an encoding that supports accents, special characters and so on, where every character is stored using the same number of bytes. This way, I could still seek in my files. Does this exist?
Thank you,
CFP.
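The offset problem described in the question can be reproduced with a few lines. This is a minimal sketch in Python rather than VB.Net, assuming the files are encoded as UTF-8 and using a hypothetical field width of 6 characters; the same byte arithmetic applies to any reader that seeks by byte position.

```python
# Hypothetical layout: every field is padded to 6 characters.
FIELD_CHARS = 6

def pad(text):
    """Pad a value with spaces to the fixed character width."""
    return text.ljust(FIELD_CHARS)

records = [pad("cafe"), pad("café"), pad("zebra")]
data = "".join(records).encode("utf-8")

# Every field has the same number of characters...
char_lengths = [len(r) for r in records]
# ...but not the same number of bytes once encoded as UTF-8.
byte_lengths = [len(r.encode("utf-8")) for r in records]

def read_field_naive(buffer, index):
    """Read a field assuming 1 byte per character (the faulty assumption)."""
    start = index * FIELD_CHARS
    return buffer[start:start + FIELD_CHARS].decode("utf-8", errors="replace")

# The accented character in the second field shifts everything after it.
third_field = read_field_naive(data, 2)
```

Here the third field comes back shifted by one byte, because "é" occupies two bytes in UTF-8.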
Comments (2)
UTF-32 is the only (non-lossy) encoding that is guaranteed to be fixed length. At 4 bytes per character, this causes a lot of overhead though.
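As a rough illustration of seeking in UTF-32, here is a sketch in Python (not VB.Net) that uses an in-memory stream in place of a real file. The field width of 6 characters and the helper names `write_records` and `read_record` are assumptions made for the example. Note that the fixed width applies per Unicode code point, so a letter followed by a separate combining accent still counts as two.

```python
import io

ENCODING = "utf-32-le"   # explicit byte order, so no byte-order mark is written
FIELD_CHARS = 6          # hypothetical field width, in code points
BYTES_PER_CHAR = 4       # UTF-32 always uses 4 bytes per code point
RECORD_BYTES = FIELD_CHARS * BYTES_PER_CHAR

def write_records(stream, values):
    """Write each value as a space-padded, fixed-width UTF-32 record."""
    for value in values:
        if len(value) > FIELD_CHARS:
            raise ValueError(f"value too long for field: {value!r}")
        stream.write(value.ljust(FIELD_CHARS).encode(ENCODING))

def read_record(stream, index):
    """Seek straight to record `index` and decode it."""
    stream.seek(index * RECORD_BYTES)
    return stream.read(RECORD_BYTES).decode(ENCODING).rstrip()

index_file = io.BytesIO()
write_records(index_file, ["cafe", "café", "zebra"])
```

With this layout, `read_record(index_file, 2)` lands exactly on the third record, at the cost of 24 bytes for every 6-character field.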
What I don't understand is that you state that your index file contains fixed length fields. This means that you shouldn't have a problem. You can seek in the index file using these specific fixed lengths. And then seek in the data file using the given address in the index file. You will always end up at the start of text. What am I missing?
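The arrangement described above can be sketched as follows. This is an illustration in Python under stated assumptions, not the asker's actual file format: index entries are pure ASCII digits (a 10-digit byte offset plus a 6-digit byte length, both hypothetical widths), while the data file holds variable-width UTF-8 text.

```python
import io

OFFSET_DIGITS = 10   # hypothetical width of the byte-offset field
LENGTH_DIGITS = 6    # hypothetical width of the byte-length field
ENTRY_BYTES = OFFSET_DIGITS + LENGTH_DIGITS

def build(texts):
    """Write UTF-8 data plus an ASCII index of (offset, length) entries."""
    data = io.BytesIO()
    index = io.BytesIO()
    for text in texts:
        encoded = text.encode("utf-8")
        offset = data.tell()           # byte position, not character position
        data.write(encoded)
        entry = f"{offset:0{OFFSET_DIGITS}d}{len(encoded):0{LENGTH_DIGITS}d}"
        index.write(entry.encode("ascii"))
    return index, data

def lookup(index, data, n):
    """Seek to index entry n, then to the data it points at."""
    index.seek(n * ENTRY_BYTES)
    entry = index.read(ENTRY_BYTES).decode("ascii")
    offset = int(entry[:OFFSET_DIGITS])
    length = int(entry[OFFSET_DIGITS:])
    data.seek(offset)
    return data.read(length).decode("utf-8")

index, data = build(["déjà vu", "naïve", "plain"])
```

Because the index stores byte offsets and contains only ASCII, every entry is exactly `ENTRY_BYTES` long regardless of which accents appear in the data.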
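I believe UTF-16 covers all the accented characters, and each of them takes the same number of bytes (2). Note that this only holds for characters in the Basic Multilingual Plane; anything outside it is stored as a 4-byte surrogate pair.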
If you know the text is in a specific language, you may be able to use an encoding specific to that language.
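To compare these options, the sketch below (Python, for illustration only) measures how many bytes a character takes in UTF-16 and in two single-byte Western European encodings, Windows-1252 and ISO-8859-1. The helper names `encoded_width` and `can_encode` are made up for the example.

```python
def encoded_width(char, encoding):
    """Number of bytes one character occupies in the given encoding."""
    return len(char.encode(encoding))

def can_encode(text, encoding):
    """True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

e_acute = "\u00e9"       # é, inside the Basic Multilingual Plane
emoji = "\U0001F600"     # an emoji, outside the Basic Multilingual Plane

utf16_accent = encoded_width(e_acute, "utf-16-le")   # 2 bytes
utf16_emoji = encoded_width(emoji, "utf-16-le")      # 4 bytes (surrogate pair)
cp1252_accent = encoded_width(e_acute, "cp1252")     # 1 byte
latin1_accent = encoded_width(e_acute, "latin-1")    # 1 byte

# Single-byte code pages are fixed width but cover only a limited repertoire.
euro_in_cp1252 = can_encode("\u20ac", "cp1252")
euro_in_latin1 = can_encode("\u20ac", "latin-1")
emoji_in_cp1252 = can_encode(emoji, "cp1252")
```

A single-byte code page keeps one byte per character, so existing seek logic keeps working, but any character outside its repertoire cannot be stored at all.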