Quotes in tab-delimited file
I've got a simple application that opens a tab-delimited text file, and inserts that data into a database.
I'm using this CSV reader to read the data: http://www.codeproject.com/KB/database/CsvReader.aspx
And it is all working just fine!
Now my client has added a new field to the end of the file, which is "ClaimDescription", and in some of these claim descriptions, the data has quotes in it, example:
"SUMISEI MARU NO 2" - sea of Japan
This seems to be causing a major headache for my app. I get an exception which looks like this:
The CSV appears to be corrupt near record '1470' field '26 at position '181'. Current raw data : ...
And in that "raw data", sure enough the claim description field shows data with quotes in it.
I want to know if anyone has ever had this problem before, and got round it?
Obviously I can ask the client to change the data they originally send to me, but this is an automated process that they use to generate the tab-delimited file; and I'd rather use that as a last resort.
I was thinking I could maybe open the file using a standard TextReader beforehand, escape any quotes, write the content back into a new file, then feed that file into the CSV reader. It is probably worth mentioning that the average size of these tab-delimited files is around 40MB.
Any help is greatly appreciated!
Cheers, Sean
Comments (7)
Check the comment on the codeproject article about quotes:
http://www.codeproject.com/Messages/3382857/Re-Quotes-inside-of-the-Field.aspx
You need to specify in the constructor that you want another character besides " to be used as quotes.
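To illustrate the idea of telling the parser that " is not a quote character, here is a minimal sketch using Python's standard csv module rather than the CodeProject CsvReader (whose constructor arguments are not shown in this thread). The sample data and the function name read_tsv_literal_quotes are made up for illustration.

```python
import csv
import io

# Hypothetical sample: a header row and one record whose last field
# contains literal double quotes, as in the question.
sample = 'id\tClaimDescription\n1470\t"SUMISEI MARU NO 2" - sea of Japan\n'

def read_tsv_literal_quotes(stream):
    """Parse tab-delimited text, treating double quotes as ordinary characters."""
    # QUOTE_NONE tells the reader to do no special processing of quote characters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader]

rows = read_tsv_literal_quotes(io.StringIO(sample))
print(rows[1][1])
```

The same principle should apply to any reader that lets you choose the quote character: pick one that never occurs in the data, or disable quoting entirely if the library allows it.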
Use the FileHelpers library instead. It is widely used and will cope with quoted fields, or fields that contain quotes.
I recently solved a similar issue, and although CsvReader was working properly on all but a few lines of my TSV file, what solved my problem in the end was setting a customDelimiter in the constructor of CsvReader.
Use OleDbConnection:
http://social.msdn.microsoft.com/Forums/en/winformsdatacontrols/thread/98fce7d7-b02d-4027-ad2e-2df3a28bd439
Maybe you can open the file with your application and replace each quote with another character and then process it.
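A rough sketch of that preprocessing step, written in Python for brevity. It streams line by line, so a file of around 40MB never has to sit in memory all at once. The function name replace_quotes and the choice of a single quote as the replacement are assumptions for illustration only.

```python
import io

def replace_quotes(src, dst, replacement="'"):
    """Copy src to dst one line at a time, replacing each double quote."""
    for line in src:
        dst.write(line.replace('"', replacement))

# In-memory streams stand in for real files here; with real files you would
# pass two handles from open(...), one for reading and one for writing.
src = io.StringIO('1470\t"SUMISEI MARU NO 2" - sea of Japan\n')
dst = io.StringIO()
replace_quotes(src, dst)
print(dst.getvalue(), end="")
```

Note that this changes the data that ends up in the database, so it only fits if losing the original double quotes is acceptable.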
I did some searching, and there is an RFC for CSV files (RFC 4180), and that does explicitly prohibit what they are doing:
Basically, if they want to do that, they need to enclose that whole field in quotes, like so:
So if you want you can throw this problem back at them and insist they send you a "proper" RFC 4180 CSV file.
Since you have access to the source files for that CSV reader, another option would be to modify it to handle the kind of quoted strings they are feeding you.
This kind of situation is exactly why it is vital to have source code access to your toolset.
If instead you'd like to preprocess (hack) their files before feeding them to your tool, the correct method would be to look for fields with a quote not immediately in front of or behind a separator, and enclose the whole field in another set of quotes.
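As a rough illustration of that preprocessing idea, the Python sketch below wraps any field containing a double quote in quotes and doubles the embedded quotes, which is the escaping RFC 4180 describes. It makes a simplifying assumption: the client's file never quotes fields properly, so every quote found is treated as literal data. It also assumes no field contains a tab or a line break. The helper names quote_field and fix_line are hypothetical.

```python
import csv

def quote_field(field):
    """Wrap a field in double quotes and double any embedded quotes,
    but only if the field actually contains a double quote."""
    if '"' not in field:
        return field
    return '"' + field.replace('"', '""') + '"'

def fix_line(line, delimiter="\t"):
    """Apply quote_field to every field of one line, keeping its line ending."""
    body = line.rstrip("\r\n")
    ending = line[len(body):]
    return delimiter.join(quote_field(f) for f in body.split(delimiter)) + ending

fixed = fix_line('1470\t"SUMISEI MARU NO 2" - sea of Japan\n')
print(fixed, end="")

# Round trip: a quote-aware parser should now recover the original text.
parsed = next(csv.reader([fixed], delimiter="\t"))
print(parsed[1])
```

If the file mixes properly quoted fields with stray quotes, this simple rule would double-escape the good ones, so the check would need to be narrowed to quotes that are not adjacent to a separator, as described above.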
Right - after a late night of Red Bull and scratching my head, I eventually found the problem: it was commas in the "Claim_Description" field. I didn't even think about that because I was using a tab-delimited file, but as soon as I did a find and replace on all commas in the file, it worked absolutely fine!
The next step is to find out how to replace those commas before processing.
Again, thanks for all the suggestions.
Cheers, Sean