How to make a Ruby string safe for a filesystem?

Posted on 2024-08-15 12:45:14

I have user entries as filenames. Of course this is not a good idea, so I want to drop everything except [a-z], [A-Z], [0-9], _ and -.

For instance:

my§document$is°°   very&interesting___thisIs%nice445.doc.pdf

should become

my_document_is_____very_interesting___thisIs_nice445_doc.pdf

and then ideally

my_document_is_very_interesting_thisIs_nice445_doc.pdf

Is there a nice and elegant way to do this?

Comments (7)

冰火雁神 2024-08-22 12:45:14

I'd like to suggest a solution that differs from the old one. Note that the old one uses the deprecated returning method. It is also specific to Rails, and you didn't explicitly mention Rails in your question (only as a tag). In addition, the existing solution fails to encode .doc.pdf as _doc.pdf, as you requested, and it doesn't collapse the underscores into one.

Here's my solution:

def sanitize_filename(filename)
  # Split the name when finding a period which is preceded by some
  # character, and is followed by some character other than a period,
  # if there is no following period that is followed by something
  # other than a period (yeah, confusing, I know)
  fn = filename.split(/(?<=.)\.(?=[^.])(?!.*\.[^.])/m)

  # We now have one or two parts (depending on whether we could find
  # a suitable period). For each of these parts, replace any unwanted
  # sequence of characters with an underscore
  fn.map! { |s| s.gsub(/[^a-z0-9\-]+/i, '_') }

  # Finally, join the parts with a period and return the result
  fn.join('.')
end

You haven't specified all the details about the conversion. Thus, I'm making the following assumptions:

  • There should be at most one filename extension, which means that there should be at most one period in the filename
  • Trailing periods do not mark the start of an extension
  • Leading periods do not mark the start of an extension
  • Any sequence of characters other than A-Z, a-z, 0-9 and - should be collapsed into a single _ (i.e. the underscore is itself regarded as a disallowed character, so the string '$%__°#' would become '_' rather than '___' from the parts '$%', '__' and '°#')

The complicated part of this is where I split the filename into the main part and the extension. With the help of a regular expression, I search for the last period that is followed by something other than a period, so that no later period in the string matches the same criteria. It must, however, be preceded by some character, to make sure it's not the first character in the string.
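To see how that split behaves on its own, here is a small sketch that isolates the same regular expression and runs it against the edge cases listed in the assumptions. The helper name split_extension is hypothetical and only used for illustration:

```ruby
# Hypothetical helper: applies only the split step of sanitize_filename.
def split_extension(filename)
  # Same pattern as above: a period that is not the first character,
  # is followed by a non-period, and has no later period that is
  # itself followed by a non-period.
  filename.split(/(?<=.)\.(?=[^.])(?!.*\.[^.])/m)
end

p split_extension("my.doc.pdf")  # => ["my.doc", "pdf"]
p split_extension(".bashrc")     # => [".bashrc"]
p split_extension("archive.")    # => ["archive."]
```

Only the first input is split; a leading or trailing period leaves the name in one piece, so those periods are later replaced by underscores like any other disallowed character.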

My results from testing the function:

1.9.3p125 :006 > sanitize_filename 'my§document$is°°   very&interesting___thisIs%nice445.doc.pdf'
 => "my_document_is_very_interesting_thisIs_nice445_doc.pdf"

which I think is what you requested. I hope this is nice and elegant enough.

月下伊人醉 2024-08-22 12:45:14

From http://web.archive.org/web/20110529023841/http://devblog.muziboo.com/2008/06/17/attachment-fu-sanitize-filename-regex-and-unicode-gotcha/:

def sanitize_filename(filename)
  filename.strip.tap do |name|
   # NOTE: File.basename doesn't work right with Windows paths on Unix
   # get only the filename, not the whole path
   name.gsub!(/^.*(\\|\/)/, '')

   # Strip out the non-ascii character
   name.gsub!(/[^0-9A-Za-z.\-]/, '_')
  end
end
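
The snippet above replaces each disallowed character one by one, so runs of underscores survive, and periods are kept as they are. A minimal sketch of one way to also collapse the runs, assuming String#squeeze is acceptable for this purpose; sanitize_and_squeeze is a hypothetical name and not part of the linked post:

```ruby
# Hypothetical variant of the function above that also collapses
# repeated underscores. It returns a new string instead of mutating.
def sanitize_and_squeeze(filename)
  name = filename.strip
  # Drop any leading path, for both Windows and Unix separators.
  name = name.gsub(/^.*(\\|\/)/, '')
  # Replace every character outside 0-9, A-Z, a-z, '.' and '-'.
  name = name.gsub(/[^0-9A-Za-z.\-]/, '_')
  # Collapse runs of underscores into a single underscore.
  name.squeeze('_')
end

p sanitize_and_squeeze('C:\\dir\\my$doc  x.pdf')  # => "my_doc_x.pdf"
```

Note that this still leaves inner periods such as .doc.pdf untouched, which differs from what the question asks for.
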
魂牵梦绕锁你心扉 2024-08-22 12:45:14

In Rails you might also be able to use ActiveStorage::Filename#sanitized:

ActiveStorage::Filename.new("foo:bar.jpg").sanitized # => "foo-bar.jpg"
ActiveStorage::Filename.new("foo/bar.jpg").sanitized # => "foo-bar.jpg"
梦中的蝴蝶 2024-08-22 12:45:14

If you use Rails you can also use String#parameterize. It is not specifically intended for this purpose, but it gives a satisfactory result.

"my§document$is°°   very&interesting___thisIs%nice445.doc.pdf".parameterize
那些过往 2024-08-22 12:45:14

For Rails I found myself wanting to keep any file extensions but using parameterize for the remainder of the characters:

filename = "my§doc$is°° very&itng___thsIs%nie445.doc.pdf"
cleaned = filename.split(".").map(&:parameterize).join(".")

For implementation details and ideas, see the source: https://github.com/rails/rails/blob/master/activesupport/lib/active_support/inflector/transliterate.rb

def parameterize(string, separator: "-", preserve_case: false)
  # Turn unwanted chars into the separator.
  parameterized_string.gsub!(/[^a-z0-9\-_]+/i, separator)
  #... some more stuff
end
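
Outside Rails, the same split-and-join idea can be approximated in plain Ruby. The sketch below is only a rough stand-in: rough_parameterize and clean_filename are hypothetical names, and unlike the real String#parameterize this version does no transliteration of accented characters.

```ruby
# Hypothetical, Rails-free approximation of parameterize for one segment.
def rough_parameterize(segment)
  segment
    .gsub(/[^a-z0-9\-_]+/i, "-")  # turn runs of unwanted chars into "-"
    .gsub(/-{2,}/, "-")           # collapse repeated separators
    .gsub(/\A-|-\z/, "")          # trim a leading or trailing separator
    .downcase
end

# Clean each period-separated segment on its own, then rejoin,
# so every extension-like suffix is preserved.
def clean_filename(filename)
  filename.split(".").map { |part| rough_parameterize(part) }.join(".")
end

p clean_filename("My Doc$$final.doc.pdf")  # => "my-doc-final.doc.pdf"
```
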
单调的奢华 2024-08-22 12:45:14

If your goal is just to generate a filename that is "safe" to use on all operating systems (and not to remove any and all non-ASCII characters), then I would recommend the zaru gem. It doesn't do everything the original question specifies, but the filename produced should be safe to use (and still keep any filename-safe unicode characters untouched):

Zaru.sanitize! "  what\ēver//wëird:user:înput:"
# => "whatēverwëirduserînput"
Zaru.sanitize! "my§docu*ment$is°°   very&interes:ting___thisIs%nice445.doc.pdf" 
# => "my§document$is°° very&interesting___thisIs%nice445.doc.pdf"
财迷小姐 2024-08-22 12:45:14

There is a library that may be helpful, especially if you're interested in replacing weird Unicode characters with ASCII: unidecode.

irb(main):001:0> require 'unidecoder'
=> true
irb(main):004:0> "Grzegżółka".to_ascii
=> "Grzegzolka"
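
If adding a gem is not an option, a partial substitute using only the standard library is to decompose characters with String#unicode_normalize and drop the combining marks. This is a different technique from what the library does, and strip_accents is a hypothetical name. It only handles letters that have a canonical decomposition, so a character such as ł is left as it is:

```ruby
# Hypothetical helper: removes accents by decomposing to NFD and
# deleting the combining marks (Unicode category Mn).
def strip_accents(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

p strip_accents("café")  # => "cafe"
```

Characters without a decomposition would still need to be replaced separately, for example with one of the regular expressions from the other answers.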