Is there a cross-platform Java method to remove filename special chars?
I'm making a cross-platform application that renames files based on data retrieved online. I'd like to sanitize the Strings I took from a web API for the current platform.
I know that different platforms have different file-name requirements, so I was wondering if there's a cross-platform way to do this?
Edit: On Windows platforms you cannot have a question mark '?' in a file name, whereas in Linux, you can. The file names may contain such characters and I would like for the platforms that support those characters to keep them, but otherwise, strip them out.
Also, I would prefer a standard Java solution that doesn't require third-party libraries.
Comments (8)
As suggested elsewhere, this is not usually what you want to do. It is usually best to create a temporary file using a secure method such as File.createTempFile().
You should not do this with a whitelist and only keep 'good' characters. If the file is made up of only Chinese characters then you will strip everything out of it. We can't use an include list for this reason, we have to use an exclude list.
Linux allows pretty much anything, which can be a real pain. I would just limit Linux to the same list that you use for Windows, so you save yourself headaches in the future.
Using this C# snippet on Windows I produced a list of characters that are not valid on Windows. There are quite a few more characters in this list than you may think (41) so I wouldn't recommend trying to create your own list.
Here is a simple Java class which 'cleans' a file name.
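The class itself did not survive in this copy of the answer. The following is only a minimal sketch of the exclude-list approach described above, not the author's code. It assumes the invalid set is the control characters 0 through 31 plus the nine punctuation characters " * / : < > ? \ | (41 in total, which is consistent with the count mentioned above but is still an assumption). The names FileNameCleaner and cleanFileName are illustrative.

```java
import java.util.Arrays;

final class FileNameCleaner {
    // Assumed exclude list: control characters 0-31 plus " * / : < > ? \ |
    private static final char[] ILLEGAL = buildIllegal();

    private static char[] buildIllegal() {
        char[] punctuation = {'"', '*', '/', ':', '<', '>', '?', '\\', '|'};
        char[] all = new char[32 + punctuation.length];
        for (int i = 0; i < 32; i++) {
            all[i] = (char) i;
        }
        System.arraycopy(punctuation, 0, all, 32, punctuation.length);
        Arrays.sort(all); // binarySearch requires a sorted array
        return all;
    }

    // Drops every character found in the exclude list and keeps everything else.
    static String cleanFileName(String badFileName) {
        StringBuilder clean = new StringBuilder(badFileName.length());
        for (int i = 0; i < badFileName.length(); i++) {
            char c = badFileName.charAt(i);
            if (Arrays.binarySearch(ILLEGAL, c) < 0) {
                clean.append(c);
            }
        }
        return clean.toString();
    }
}
```

Because this is an exclude list, a name made up only of Chinese characters passes through unchanged, while a call such as FileNameCleaner.cleanFileName("what?.txt") drops the question mark.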
EDIT:
As Stephen suggested you probably also should verify that these file accesses only occur within the directory you allow.
The following answer has sample code for establishing a custom security context in Java and then executing code in that 'sandbox'.
How do you create a secure JEXL (scripting) sandbox?
or just do this:
Result:
A20_B22b_A_BC_ld_ma.la.xps
Explanation:
[a-zA-Z0-9\\._] matches a letter from a-z in lower or upper case, numbers, dots and underscores.
[^a-zA-Z0-9\\._] is the inverse, i.e. all characters which do not match the first expression.
[^a-zA-Z0-9\\._]+ is a sequence of characters which do not match the first expression.
So every sequence of characters which does not consist of characters from a-z, 0-9 or . _ will be replaced.
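The one-liner this answer refers to is missing from this copy. A minimal sketch of how the expression above could be applied with String.replaceAll follows. The underscore replacement is inferred from the result shown earlier in the answer. The class name and the sample input in the usage note are assumptions.

```java
final class RegexSanitizer {
    // Replaces every run of characters outside a-z, A-Z, 0-9, '.' and '_'
    // with a single underscore.
    static String sanitize(String filename) {
        return filename.replaceAll("[^a-zA-Z0-9\\._]+", "_");
    }
}
```

With a hypothetical input such as "A20/B22b#öA\BC#Ä$%ld_ma.la.xps", each run of disallowed characters collapses into one underscore. Note that this is an include list, so non-ASCII letters are replaced as well.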
This is based on the accepted answer by Sarel Botha, which works fine as long as you don't encounter any characters outside of the Basic Multilingual Plane. If you need full Unicode support (and who doesn't?), use this code instead, which is Unicode safe:
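Key changes here:
- Use codePointCount together with length, instead of just length
- Use codePointAt instead of charAt
- Use appendCodePoint instead of append
- No need to cast chars to ints. In fact, you should never deal with chars, as they are basically broken for anything outside the BMP.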
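The Unicode-safe code is missing from this copy. Below is a minimal sketch of a code-point-based variant, under the same assumed exclude list as before (control characters 0 through 31 plus " * / : < > ? \ |). It is not the author's code. In particular, it advances through the string with Character.charCount rather than counting code points up front, and the class name is illustrative.

```java
import java.util.Arrays;

final class UnicodeFileNameCleaner {
    // Assumed exclude list, stored as code points (ints) rather than chars.
    private static final int[] ILLEGAL = buildIllegal();

    private static int[] buildIllegal() {
        int[] punctuation = {'"', '*', '/', ':', '<', '>', '?', '\\', '|'};
        int[] all = new int[32 + punctuation.length];
        for (int i = 0; i < 32; i++) {
            all[i] = i;
        }
        System.arraycopy(punctuation, 0, all, 32, punctuation.length);
        Arrays.sort(all); // binarySearch requires a sorted array
        return all;
    }

    static String cleanFileName(String badFileName) {
        StringBuilder clean = new StringBuilder(badFileName.length());
        int i = 0;
        while (i < badFileName.length()) {
            int codePoint = badFileName.codePointAt(i);
            if (Arrays.binarySearch(ILLEGAL, codePoint) < 0) {
                clean.appendCodePoint(codePoint);
            }
            // A supplementary code point occupies two chars, so step accordingly.
            i += Character.charCount(codePoint);
        }
        return clean.toString();
    }
}
```

Stepping by Character.charCount keeps the index in char units, which is what codePointAt expects, so surrogate pairs are read and written as whole code points.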
Here is the code I use:
SystemUtils is from Apache commons-lang3.
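The code for this answer is missing from this copy. As a rough stand-in that avoids the third-party dependency, the sketch below detects the platform with System.getProperty("os.name") instead of SystemUtils. The two character sets are assumptions: NUL and '/' for non-Windows systems, and the control characters plus < > : " / \ | ? * for Windows. The class and method names are illustrative.

```java
import java.util.Locale;

final class PlatformSanitizer {
    // Assumed illegal sets, written as regex character classes.
    private static final String WINDOWS_ILLEGAL = "[\\x00-\\x1F<>:\"/\\\\|?*]+";
    private static final String OTHER_ILLEGAL = "[\\x00/]+";

    // Strips the characters assumed to be illegal on the given kind of platform.
    static String sanitize(String name, boolean windows) {
        if (name == null) {
            return "";
        }
        return name.replaceAll(windows ? WINDOWS_ILLEGAL : OTHER_ILLEGAL, "").trim();
    }

    // Convenience overload that guesses the platform from the os.name property.
    static String sanitize(String name) {
        String os = System.getProperty("os.name", "").toLowerCase(Locale.ROOT);
        return sanitize(name, os.startsWith("windows"));
    }
}
```

This matches the behavior the question asks for: PlatformSanitizer.sanitize("what?.txt", false) keeps the question mark, while the Windows branch removes it.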
There's a pretty good built-in Java solution - Character.isXxx().
Try Character.isJavaIdentifierPart(c):
Result is "name.é$_".
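The snippet that produced this result is missing from this copy. A minimal sketch of the idea follows. It assumes the dot is allowed explicitly, since Character.isJavaIdentifierPart does not accept '.' and the result above contains one. The class name and the sample input in the usage note are assumptions.

```java
final class IdentifierFilter {
    // Keeps dots and anything Java would accept inside an identifier:
    // letters, digits, currency symbols such as '$', and connectors such as '_'.
    static String keepIdentifierChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            if (c == '.' || Character.isJavaIdentifierPart(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

For a hypothetical input like "name.é+!@#$%^&*(){}[]/=?-_\|;:<>", only the letters, the dot, '$' and '_' survive. Unlike the ASCII-only regex approach, accented and non-Latin letters are kept.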
It is not clear from your question, but since you are planning to accept pathnames from a web form (?), you probably ought to block attempts to rename certain things, e.g. "C:\Program Files". This implies that you need to canonicalize the pathnames to eliminate "." and ".." before you make your access checks.
Given that, I wouldn't attempt to remove illegal characters. Instead, I'd use "new File(str).getCanonicalFile()" to produce the canonical paths, next check that they satisfy your sandboxing restrictions, and finally use "File.exists()", "File.isFile()", etc to check that the source and destination are kosher, and are not the same file system object. I'd deal with illegal characters by attempting to do the operations and catching the exceptions.
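The canonicalize-then-check step can be sketched as below. This is a minimal sketch, assuming a single allowed base directory. The names SandboxCheck and isInside are illustrative, and treating an IOException as "outside" is a design choice of this sketch rather than something the answer prescribes.

```java
import java.io.File;
import java.io.IOException;

final class SandboxCheck {
    // Returns true only if the user-supplied name, resolved against baseDir,
    // still lies inside baseDir once "." and ".." have been eliminated.
    static boolean isInside(File baseDir, String userSuppliedName) {
        try {
            File base = baseDir.getCanonicalFile();
            File target = new File(base, userSuppliedName).getCanonicalFile();
            // Path.startsWith compares whole name elements, so a sibling such as
            // "sandbox-other" is not mistaken for a child of "sandbox".
            return target.toPath().startsWith(base.toPath());
        } catch (IOException e) {
            return false; // fail closed if the path cannot be canonicalized
        }
    }
}
```

A name such as "../escape.txt" resolves to a location outside the base directory and is rejected, while "report.txt" is accepted. The exists() and isFile() checks mentioned above would follow this step.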
Paths.get(...) throws a detailed exception with the position of the illegal character.
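A minimal sketch of probing a name this way is shown below. The exception type is java.nio.file.InvalidPathException. Its getIndex() returns -1 when the position is not known, and which characters are rejected depends on the platform's file system provider. The class and method names are illustrative.

```java
import java.nio.file.InvalidPathException;
import java.nio.file.Paths;

final class PathProbe {
    // Returns null if the current platform accepts the name as a path,
    // otherwise a short description of what was rejected.
    static String describeProblem(String candidate) {
        try {
            Paths.get(candidate);
            return null;
        } catch (InvalidPathException e) {
            // getIndex() is -1 when the position is not known.
            return e.getReason() + " (index " + e.getIndex() + ")";
        }
    }
}
```

This only reports the problem and does not clean the name, so it fits the "try the operation and handle the exception" approach better than the stripping approach.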
If you want to allow more than just [A-Za-z0-9], check Microsoft's naming conventions, and don't forget to filter out "...Characters whose integer representations are in the range from 1 through 31,...", as Aaron Digulla's example does. The code from David Carboni, for example, would not be sufficient for these characters.
Excerpt containing the list of reserved characters: