java.net.URLConnection.guessContentTypeFromStream and text/plain

Posted on 2024-10-07 18:40:41

All,

I am trying to identify plain text files with Mac line endings and, inside an InputStream, silently convert them to Windows or Linux line endings (the important part is the LF character, really). Specifically, I'm working with several APIs that take InputStreams and are hard-locked to looking for \n as newlines.

Sometimes, I get binary files. A file that isn't text-like shouldn't have this substitution done, because a byte that merely happens to equal \r can't silently be followed by a \n without mangling the data.

I am attempting to use java.net.URLConnection.guessContentTypeFromStream and to perform end-of-line conversions only if the type is text/plain. Unfortunately, "text/plain" doesn't seem to be in its gamut of return values; all I get is null for my flat text files, and it's probably not safe to assume that every unidentifiable file can be modified.

What better library (preferably open-source and in a public Maven repository) can I use to do this? Alternatively, how can I make guessContentTypeFromStream work for me? I know I'm describing an inherently hazardous application and no solution can be perfect. Should I just treat null as likely to be "text/plain" and write more code myself to look for evidence that it isn't?

半世蒼涼 2024-10-14 18:40:41

It seems to me that what you're asking is to determine if a file is textual or not. Given that, there is a solution here that seems right:

Granted, he is talking about Unix, Bash, and Perl, but the concept is the same:

Unless you inspect every byte of the
file, you are not going to get this
100%. And there is a big performance
hit with inspecting every byte. But
after some experiments, I settled on
an algorithm that works for me. I
examine the first line and declare the
file to be binary if I encounter even
one non-text byte. It seems a little
slack, I know, but I seem to get away
with it.
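
The first-line heuristic quoted above could look roughly like this in Java. This is a minimal sketch, not the quoted author's code: the class name FirstLineSniffer, the MAX_PEEK cap, and the definition of a "text byte" (printable ASCII plus tab) are all assumptions made for illustration. It relies on mark/reset so the caller still sees the stream from the start.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class FirstLineSniffer {
    // Hypothetical cap on how many bytes to inspect if no line ending shows up.
    static final int MAX_PEEK = 4096;

    /**
     * Returns true if every byte up to the first CR or LF (or MAX_PEEK bytes,
     * or end of stream) is printable ASCII or a tab. The stream is reset afterwards.
     */
    static boolean looksLikeText(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(MAX_PEEK);
        try {
            for (int i = 0; i < MAX_PEEK; i++) {
                int b = in.read();
                if (b == -1 || b == '\n' || b == '\r') {
                    return true;   // end of the first line, nothing suspicious seen
                }
                boolean printable = b >= 0x20 && b < 0x7F;
                if (!printable && b != '\t') {
                    return false;  // one non-text byte is enough to call it binary
                }
            }
            return true;           // very long first line, still all text bytes
        } finally {
            in.reset();            // rewind so the caller reads from the beginning
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream text = new ByteArrayInputStream(
                "plain text\rwith Mac line endings".getBytes(StandardCharsets.US_ASCII));
        InputStream binary = new ByteArrayInputStream(new byte[] {0x00, 0x01, 0x02});
        System.out.println(looksLikeText(text));
        System.out.println(looksLikeText(binary));
    }
}
```

A stream that does not support mark/reset can be wrapped in a java.io.BufferedInputStream first. Note that an empty stream is reported as text here, which may or may not suit a given application.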

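EDIT #1:
Expanding on this type of solution, a reasonable approach would be to ensure the file contains no non-ASCII characters (unless you're dealing with non-English files, which calls for a different solution). This could be done by checking whether the file contents, read as a String, consist entirely of ASCII characters. String.matches tests the whole string, so the pattern only needs to be \p{ASCII}*:

// -- uses commons-io; ISO-8859-1 maps every byte to exactly one char
import java.io.File;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.FileUtils;
String fileAsString = FileUtils.readFileToString( new File( "file-name-here" ), StandardCharsets.ISO_8859_1 );
boolean isTextualFile = fileAsString.matches( "\\p{ASCII}*" );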
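
An ASCII check of this kind could gate the conversion the question asks about without reading the whole file or leaving the InputStream world. The following is only a sketch under stated assumptions: the class names CrToLfInputStream and LineEndingNormalizer and the SAMPLE_SIZE constant are hypothetical, only the first SAMPLE_SIZE bytes are sampled, and the converter assumes classic Mac input (a bare CR per line). A Windows file with CRLF endings would come out with doubled line feeds, so real code would need to handle that case separately.

```java
import java.io.BufferedInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Replaces every CR byte with LF as the data is read (assumes CR-only line endings). */
class CrToLfInputStream extends FilterInputStream {
    CrToLfInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        return b == '\r' ? '\n' : b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // n is -1 at end of stream, in which case the loop body never runs
        for (int i = off; i < off + n; i++) {
            if (buf[i] == '\r') {
                buf[i] = '\n';
            }
        }
        return n;
    }
}

class LineEndingNormalizer {
    // Hypothetical sample size; larger values catch more binary files but cost more.
    static final int SAMPLE_SIZE = 1024;

    /** Wraps the stream in a CR-to-LF converter only if its first bytes are all ASCII. */
    static InputStream normalize(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(SAMPLE_SIZE);
        byte[] sample = new byte[SAMPLE_SIZE];
        int n = in.read(sample, 0, SAMPLE_SIZE);
        in.reset();  // the sample is peeked, not consumed

        // ISO-8859-1 maps each byte to one char, so \p{ASCII} tests bytes 0x00-0x7F.
        String text = new String(sample, 0, Math.max(n, 0), StandardCharsets.ISO_8859_1);
        boolean textual = text.matches("\\p{ASCII}*");
        return textual ? new CrToLfInputStream(in) : in;
    }
}
```

The stream returned by normalize can be handed to the LF-only APIs in place of the original one. Keep in mind that \p{ASCII} still accepts control characters such as NUL, so this check alone is a fairly permissive definition of "text".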

EDIT #2:
You may want to try this as your regex, or something close to it, so that control characters other than whitespace also mark the file as binary. Though, I'll admit it could likely use some refining.

"[\\p{Print}\\p{Space}]*"