如何从 blob/文件流中识别文件类型?
我们很久以前购买了一个“现成的”应用程序,它能够在 SQL Server 中将文件存储为 blob。 我们注意到,由于一个部门更频繁地使用此 blob 字段,数据库的大小在过去六个月内增加了一倍多。 结果,应用程序变得非常缓慢。
我的任务是从数据库中删除 blob 字段并将文件保存到实际的文件系统上。 不幸的是,应用程序不存储数据库中的文件类型。 虽然我可以读取数据库中存在的文件,但我不知道将该文件保存为什么扩展名。 该应用程序的支持台不再支持该版本的软件,并且不会与我们讨论提取数据的问题。 不幸的是,我们无法访问他们的源代码。
任何建议将不胜感激! 提前致谢!
We bought an "off the shelf" application a lonnng time ago that is capable of storing files as a blob within SQL Server. We've noticed that the database has more than doubled in size within the past six months due to more frequent usage of this blob field by one department. As a result, the application has become painfully slow.
I've been tasked with removing the blob field from the database and saving the file onto the actual file system. Unfortunately, the application does not store what the file type is within the database. Although I can read the file as it exists in the database, I don't know what extension to save the file as. The application's support desk no longer supports this version of the software and will not talk to us about extracting the data. Unfortunately, we do not have access to their source code.
Any suggestions would be greatly appreciated! Thanks in advance!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
您可以查看前几个字节并找出最常见的文件类型
http://www .garykessler.net/library/file_sigs.html
you can look at the first few bytes and figure it out for the most common file types
http://www.garykessler.net/library/file_sigs.html
不要将其保存为任何类型。 将其另存为不带扩展名的文件。 如果您不知道它是什么,请不要伪造它。 如果保存它的应用程序请求它,则从文件系统返回它,就像从数据库返回它一样; 作为二进制数据。 数据库不关心二进制对象是什么类型的数据; 你也不应该。
Don't save it as any type. Save it as a file with no extension. If you don't know what it is, don't fake it. If the app that saved it requests it, return it from the filesystem the same way it would be returned from the database; as binary data. The database doesn't care what type of data the Binary Object is; neither should you.
您可以尝试使用 TriD http://mark0.net/soft-trid-e.html
它将扫描文件并尝试确定扩展名。
You might try using TriD http://mark0.net/soft-trid-e.html
It will scan the files and try to determine the extension.
您可以使用 UrlMon.dll 中的 FindMimeFromData() 函数(使用 pinvoke)。
请参阅此页面获取示例和此该函数的文档的 MSDN 页面。
You could use the FindMimeFromData() function in UrlMon.dll (using pinvoke).
See this page for an example and this MSDN page for the documentation of the function.
您如何知道这些信息是人类可读的?
文件扩展名的目的是关联用于打开文件的特定软件,以便当您双击 .txt 文件时,notepad.exe 会处理它。
如果您只是为了卸载数据库而将它们写入文件系统,则不需要任何特定的扩展名。 如果您想将软件与其关联,请使用适当的扩展名
How do you know that this information is human readable?
files extensions purposes are to associate a particular software for opening the file so that when you double click on a .txt file, notepad.exe handles it.
If you're only writing them to the filesystem for the sake of offloading the database, then there is no need for any specific extension. If you want to associate a software with it, then use the appropriate extension
另一种选择是对数据库进行物理分区。 您可以将包含 blob 字段的表存储在不同的数据库文件中,该数据库文件可以存储在不同的硬盘/主轴上。 包含 blob 表的磁盘可以通过将其设置为 RAID 0 或 RAID 10 来进一步优化。
这可以加快速度。 此外,这种方法无需更改应用程序逻辑。
Another option would be to physically partition your database. You could store the table that contains the blob field in a different database file that could be stored on a different hard disk/spindle. The disk that contains the blob table could be further optimized by making it RAID 0 or RAID 10.
This could speed things up. Also, this approach eliminates the need to change the application logic.
用你的话来说“......我的任务是......” - 你确定这不是一个可以通过对数据库进行一些优化来解决的问题吗? 您要尽可能避免对没有源代码且没有支持的遗留应用程序进行黑客攻击。
所以,看看这里的问题——应用程序的缓慢。 那么为什么会慢呢? 数据库索引无法解决的问题?
如果必须将 BLOB 提取到文件系统,那么为什么文件的格式是相关的? 当然,应用程序的设计目的是对数据库进行查询,以便无论如何都能获取文件——将 BLOB 提取到文件系统对您有何帮助?
顺便说一句 - 我通常使用 Unix
file
命令通过幻数来识别文件。In your words "... I've been tasked with ..." - are you sure this is not a problem that can be solved by performing some optimisation on the database? Going down the path of hacking a legacy application for which you don't have the source code and for which there is no support is something that you want to avoid as much as possible.
So, look at the problem here - slowness of the application. So why is it slow? Something a database index couldn't solve?
If you have to extract the BLOBs to the filesystem, then why is the format of the file relevant? Surely the application is designed to do a query on the database in order to get the file anyway--how would extracting the BLOBs to the filesystem help you?
As an aside though - I usually use the Unix
file
command to identify files through magic numbers.