How can I tell whether a file is a CSV file?
I have a scenario in which the user uploads a file to the system. The only file type the system understands is CSV, but the user can upload any type of file, e.g. jpeg, doc, html. I need to throw an exception if the user uploads anything other than a CSV file.
Can anybody tell me how I can find out whether the uploaded file is a CSV file or not?
I can think of several methods.
One way is to try to decode the file using UTF-8. (This is built into Java and is probably built into .NET too.) If the file decodes properly, then you at least know that it's a text file of some kind.
Once you know it's a text file, parse out the individual fields from each line and check that you get the number of fields that you expect. If the number of fields per line is inconsistent then you might just have a file that contains text but is not organized into lines and fields.
Otherwise you have a CSV. Then you can validate the fields.
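The two checks above (decode as UTF-8, then compare field counts per line) can be sketched roughly as follows. This is only an illustration in Python using the standard csv module; the function name looks_like_csv and the expected_fields parameter are made up for this example and do not come from the answer.

```python
import csv
import io
from typing import Optional


def looks_like_csv(data: bytes, expected_fields: Optional[int] = None) -> bool:
    """Return True if data decodes as UTF-8 and every row has the same field count."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False  # not valid UTF-8, so most likely a binary file

    # newline="" lets the csv module handle line breaks inside quoted fields.
    rows = [row for row in csv.reader(io.StringIO(text, newline="")) if row]
    if not rows:
        return False  # nothing to parse

    # If the caller does not know the field count, use the first row as reference.
    width = expected_fields if expected_fields is not None else len(rows[0])
    return all(len(row) == width for row in rows)
```

Note that plain text without any commas still passes as a one-column CSV, so passing the expected number of fields, when it is known, makes the check much stricter.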
If it's a web application, you might want to check the content-type HTTP header the browser sends when uploading/posting a file through a form.
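A rough sketch of such a header check, in Python. The set of accepted media types and the function name are assumptions for illustration (application/vnd.ms-excel is included because some browsers report it for .csv files); the header is supplied by the client, so it is best treated as a hint rather than proof.

```python
# Assumption: the media types this hypothetical application accepts as CSV hints.
CSV_MEDIA_TYPES = {"text/csv", "application/csv", "application/vnd.ms-excel"}


def content_type_suggests_csv(header_value: str) -> bool:
    """Check the media type part of a Content-Type header value."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in CSV_MEDIA_TYPES
```

For example, content_type_suggests_csv("text/csv; charset=utf-8") accepts the upload, while "image/jpeg" is rejected before any parsing is attempted.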
If there's a binding for the language you're using, you might also try libmagic, which is pretty good at recognizing file types. For example, the UNIX tool file uses it. http://sourceforge.net/projects/libmagic/
I don't know if you can tell for 100% certain in any way, but I'd suggest that the first validations should be:
Try this one:
I solved it like this: read the file with UTF-16 encoding. If no comma is found in the file, UTF-16 decoding didn't work, which means that this CSV file is in Excel format (not plain text).
CSV files vary a lot, and they all could be called, legitimately, CSV files.
I guess your approach is not the best one. The correct approach would be to tell whether the uploaded file is a text file the application can parse, rather than whether it's a CSV or not.
You would report errors whenever you can't parse the file, be it a JPG, MP3 or CSV in a format you cannot parse.
To do that, I would try to find a library that parses various CSV file formats; otherwise you have a long road ahead writing code to parse the many possible types of CSV files (or restricting the application's flexibility by supporting only a few CSV formats).
One such library for Java is opencsv
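As an illustration of letting a library cope with CSV variants, here is a small sketch using Python's standard csv module rather than opencsv. csv.Sniffer guesses the delimiter from a sample; the helper name detect_dialect and the list of candidate delimiters are assumptions made for this example.

```python
import csv


def detect_dialect(sample: str, candidates: str = ",;\t|"):
    """Return the dialect guessed by csv.Sniffer, or None if no candidate delimiter fits."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates)
    except csv.Error:
        return None  # the sniffer could not determine a delimiter


# A semicolon-separated file, as some spreadsheet exports produce.
sample = "name;age;city\nAda;36;London\nBob;41;Paris\n"
dialect = detect_dialect(sample)
rows = list(csv.reader(sample.splitlines(), dialect)) if dialect else []
```

The sniffer is a heuristic, so a None result is a reasonable place to report "unparseable file" to the user instead of guessing further.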
If you're using a library CSV parser, all you have to do is catch any errors it throws.
If the CSV parser you're using is remotely robust, it will throw some useful errors in the event that it doesn't understand the file format.
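A minimal sketch of that idea in Python, assuming UTF-8 uploads. NotCsvError and parse_upload are hypothetical names for this example; the parser is the standard csv module with strict=True, which makes it raise csv.Error on malformed quoting.

```python
import csv
import io


class NotCsvError(Exception):
    """Hypothetical application-level error for uploads that cannot be parsed as CSV."""


def parse_upload(data: bytes) -> list:
    """Parse uploaded bytes as CSV, translating parser failures into NotCsvError."""
    try:
        text = data.decode("utf-8")
        # strict=True turns malformed quoting into csv.Error instead of a silent guess.
        return list(csv.reader(io.StringIO(text, newline=""), strict=True))
    except (UnicodeDecodeError, csv.Error) as exc:
        raise NotCsvError(f"upload rejected: {exc}") from exc
```

Python's csv module is fairly lenient (plain text without commas parses as single-column rows), so in practice this would probably be combined with a check on the number of fields per row.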