java.net.URLConnection.guessContentTypeFromStream and text/plain

Posted 2024-10-07 18:40:41

All,

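I am trying to identify plain text files with Mac line endings and, inside an InputStream, silently convert them to Windows or Linux line endings (really, the important part is the LF character). Specifically, I am working with several APIs that take InputStreams and are hard-wired to look for \n as the newline.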

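Sometimes I get binary files. A file that is not text-like should not have this substitution applied, because a byte that merely happens to equal \r cannot be silently followed by a \n without corrupting the data.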

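I am attempting to use java.net.URLConnection.guessContentTypeFromStream and to perform the line-ending conversion only when the type is text/plain. Unfortunately, "text/plain" does not seem to be among its return values; all I get for my flat text files is null, and it is probably not safe to assume that every unidentifiable file can be modified.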
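To make the behavior described above concrete, here is a minimal sketch that feeds two in-memory samples to guessContentTypeFromStream. The class name GuessDemo, the helper guess, and the sample bytes are made up for illustration. As far as I can tell the method only recognizes a fixed set of magic-byte signatures, so the plain-text sample should come back as null while the GIF header is identified.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class GuessDemo {
    // Hypothetical helper: returns whatever guessContentTypeFromStream
    // reports for the given bytes.
    static String guess(byte[] data) throws IOException {
        // ByteArrayInputStream supports mark/reset, which the method needs
        // in order to peek at the leading bytes and rewind.
        try (InputStream in = new ByteArrayInputStream(data)) {
            return URLConnection.guessContentTypeFromStream(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // Illustrative samples: CR-terminated text and a GIF signature.
        byte[] text = "first line\rsecond line\r".getBytes(StandardCharsets.US_ASCII);
        byte[] gif = "GIF89a".getBytes(StandardCharsets.US_ASCII);
        System.out.println("text sample: " + guess(text));
        System.out.println("gif sample:  " + guess(gif));
    }
}
```

A null result therefore says only "no known signature matched", not "this is text", which is why a second check is needed before rewriting line endings.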

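What better library (preferably open source and available in a public Maven repository) can I use to do this? Alternatively, how can I make guessContentTypeFromStream work for me? I know I am describing an inherently dangerous application and that no solution will be perfect, but should I treat null as probably "text/plain" and simply write more code of my own to look for evidence that it isn't?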


半世蒼涼 2024-10-14 18:40:41

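It seems to me that what you're asking is how to determine whether a file is textual or not. Given that, there is a solution here that seems right:

Granted, he is talking about Unix, Bash, and Perl, but the concept is the same: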

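Unless you inspect every byte of the
file, you are not going to get this
100%. And there is a big performance
hit with inspecting every byte. But
after some experiments, I settled on
an algorithm that works for me. I
examine the first line and declare the
file to be binary if I encounter even
one non-text byte. It seems a little
slack, I know, but I seem to get away
with it.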
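The first-line heuristic quoted above could be sketched in Java roughly as follows. This is not the quoted author's code (which was Perl); the class name FirstLineSniffer, the MAX_PEEK cap, and the definition of a "non-text byte" are all assumptions made for illustration. The stream is marked and rewound so the caller can still read it from the start.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class FirstLineSniffer {
    // Hypothetical cap on how far to look for the end of the first line.
    static final int MAX_PEEK = 4096;

    // Assumption: a "non-text" byte is an ASCII control byte other than
    // tab, LF, FF, or CR, or DEL. Bytes >= 0x80 are let through because
    // they may belong to UTF-8 or another extended encoding.
    static boolean isTextByte(int b) {
        if (b == '\t' || b == '\n' || b == '\f' || b == '\r') return true;
        return b >= 0x20 && b != 0x7F;
    }

    // Peeks at the first line (at most MAX_PEEK bytes) and rewinds the stream.
    // The stream must support mark/reset, e.g. a BufferedInputStream.
    static boolean firstLineLooksLikeText(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(MAX_PEEK);
        try {
            for (int i = 0; i < MAX_PEEK; i++) {
                int b = in.read();
                // End of stream or end of the first line: nothing suspicious seen.
                if (b == -1 || b == '\n' || b == '\r') return true;
                // A single non-text byte is enough to call the file binary.
                if (!isTextByte(b)) return false;
            }
            return true; // very long first line without a non-text byte
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] text = "first line\rsecond line".getBytes(StandardCharsets.US_ASCII);
        byte[] binary = {0x00, 0x01, 0x02, 'a', '\n'};
        System.out.println(firstLineLooksLikeText(new BufferedInputStream(new ByteArrayInputStream(text))));
        System.out.println(firstLineLooksLikeText(new ByteArrayInputStream(binary)));
    }
}
```

The slack the quote admits to is real under these assumptions: a PNG file starts with 0x89 'P' 'N' 'G' followed by CR, so its "first line" would pass. Running guessContentTypeFromStream first and treating any non-null, non-text result as binary would catch that particular case.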

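EDIT #1:
Expanding on this type of solution, a reasonable approach would be to ensure the file contains no non-ASCII characters (unless you're dealing with files that are non-English; that needs a different solution). This could be done by checking that the file contents, read as a String, consist only of ASCII characters: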

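import java.io.File;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.FileUtils; // -- uses commons-io
// ISO-8859-1 maps each byte to one char, so bytes >= 0x80 become non-ASCII chars
String fileAsString = FileUtils.readFileToString( new File( "file-name-here" ), StandardCharsets.ISO_8859_1 );
boolean isTextualFile = fileAsString.matches( "\\p{ASCII}*" ); // whole content must be ASCII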

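EDIT #2:
You may want to try this as your regex, or something close to it. Unlike \p{ASCII}, it also rejects control characters such as NUL. Though, I'll admit it could likely use some refining.

"(?:\\p{Print}|\\p{Space})*"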
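Once a stream has been judged textual by one of the checks above, the conversion the question asks for could look something like the sketch below. It is only an illustration: the class name CrToLfInputStream is made up, and it assumes that normalizing both a lone CR and a CRLF pair to a single LF is acceptable, since the question says the LF is the part that matters.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Objects;

// Hypothetical wrapper: rewrites CR and CRLF to LF while the stream is read.
class CrToLfInputStream extends FilterInputStream {
    CrToLfInputStream(InputStream in) {
        // One byte of pushback is enough to peek past a CR.
        super(new PushbackInputStream(in, 1));
    }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b != '\r') return b;
        // Swallow the LF of a CRLF pair so the line break is not doubled.
        int next = in.read();
        if (next != '\n' && next != -1) {
            ((PushbackInputStream) in).unread(next);
        }
        return '\n';
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        Objects.checkFromIndexSize(off, len, buf.length);
        if (len == 0) return 0;
        // Simple but slow: route every byte through read() above.
        // This may block until len bytes are available on a slow source.
        int n = 0;
        while (n < len) {
            int b = read();
            if (b == -1) break;
            buf[off + n++] = (byte) b;
        }
        return n == 0 ? -1 : n;
    }

    public static void main(String[] args) throws IOException {
        byte[] mixed = "a\rb\r\nc\nd\r".getBytes(StandardCharsets.US_ASCII);
        try (InputStream in = new CrToLfInputStream(new ByteArrayInputStream(mixed))) {
            String out = new String(in.readAllBytes(), StandardCharsets.US_ASCII);
            System.out.println(out.replace("\n", "\\n"));
        }
    }
}
```

The wrapper does not support mark/reset and does not override skip, so any content sniffing has to happen on the underlying stream before wrapping it.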
