Fastest .Net and SQL data types
I hope this question isn't too far out in left field, and I'll be upfront in saying I'm a newb compared to many people on Stack Overflow...
I want to compare object representations of images, audio and text for an AI project I am working on. I'd like to convert all three inputs into a single data type and use a central comparison algorithm to determine statistically probable matches.
What are the “fastest” native .Net and SQL data types for making comparisons like this? In .Net what data type requires the least amount of conversions in the CLR? For SQL, what type can be “CRUD-ed” the fastest?
I was thinking bytes for .Net and integers for SQL, but integers pose the problem of being a one-dimensional concept. Do you think the images and audio should be handled within the file system rather than SQL? I'm guessing so...
FWIW I'm building a robot from parts I bought at TrossenRobotics.com
5 Answers
Personally, if you need to do frequent comparisons between large binary objects, I would hash the objects and compare the hashes.
If the hashes don't match, then you can be sure the objects don't match (which should be the majority of the cases).
If the hashes do match, you can then start a more lengthy routine to compare the actual objects.
This method alone should boost your performance quite a bit if you're comparing these objects frequently.
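The hash-first approach above can be sketched as follows in C#. This is a minimal illustration, not part of the original answer; the class and method names are assumptions, and SHA-256 is just one reasonable choice of hash.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;

// Sketch of the hash-first comparison: compute the hash once when an object
// is stored, then compare hashes before ever touching the full blobs.
static class BlobComparer
{
    public static byte[] Hash(byte[] data)
    {
        using (var sha = SHA256.Create())
            return sha.ComputeHash(data);
    }

    // Cheap check: different hashes guarantee different objects.
    public static bool HashesMatch(byte[] hashA, byte[] hashB)
        => hashA.SequenceEqual(hashB);

    // Expensive check: only run when the hashes match, to rule out collisions.
    public static bool DefinitelyEqual(byte[] a, byte[] b)
        => a.Length == b.Length && a.SequenceEqual(b);
}
```

In practice you would persist the hash alongside the object so the cheap check never requires re-reading the blob.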
Speed of data types is a bit hard to measure. It makes a big difference whether you're using a 32-bit operating system or a 64-bit one. Why? Because it determines the speed at which this data can be processed.
In general, on a 32-bit system, all data types that fit inside 32 bits (int16, int32, char, byte, pointers) will be processed at the same speed. If you need lots of data to be processed, it's best to divide it into blocks of four bytes each for your CPU to process.
However, when you're writing data to disk, data speed tends to depend on a lot more factors. If your disk device is on some USB port, all data gets serialized, thus it would be byte after byte. In that case, size doesn't matter much, although the smallest datablocks would leave the smallest gaps. (In languages like Pascal you'd use a packed record for this kind of data to optimize streaming performance, while having your fields in your records aligned at multiples of 4 bytes for CPU performance.)
Regular disks store data in bigger blocks. To increase reading/writing speed, you'd prefer to make your data structures as compact as possible. But for processing performance, having them aligned on 4-byte boundaries is more effective.
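The packed-versus-aligned trade-off described above has a direct equivalent in .NET struct layout. A small sketch (the struct names are illustrative; exact sizes can vary by platform, though these are the typical values):

```csharp
using System;
using System.Runtime.InteropServices;

// With default sequential layout, the runtime pads fields to natural
// boundaries (good for CPU access); Pack = 1 removes the padding
// (good for compact storage/streaming), like a Pascal packed record.

[StructLayout(LayoutKind.Sequential)]
struct Aligned { public byte Tag; public int Value; }  // padded, typically 8 bytes

[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct Packed { public byte Tag; public int Value; }   // packed, 5 bytes

class Program
{
    static void Main()
    {
        Console.WriteLine(Marshal.SizeOf<Aligned>()); // typically 8
        Console.WriteLine(Marshal.SizeOf<Packed>());  // 5
    }
}
```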
Which reminds me that I once had a discussion with someone about using compression on an NTFS disk. I managed to prove that compressing an NTFS partition could actually improve the performance of a computer since it had to read a lot less data blocks, even though it meant it had to do more processing to decompress the same data blocks.
To improve performance, you just have to find the weakest (slowest) link and start there. Once it's optimized, there will be another weak link...
Personally, I'd say you're best off using a byte array. You can easily read the file into a buffer... and from the buffer into the byte array where you can do the comparison.
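As a minimal sketch of that suggestion (the file paths are placeholders):

```csharp
using System;
using System.IO;
using System.Linq;

class Program
{
    static void Main()
    {
        // Read each file fully into a byte array.
        byte[] a = File.ReadAllBytes("first.bin");
        byte[] b = File.ReadAllBytes("second.bin");

        // Cheap length check first, then the element-wise comparison.
        bool same = a.Length == b.Length && a.SequenceEqual(b);
        Console.WriteLine(same);
    }
}
```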
As far as I recall, in terms of sheer performance, the Int32 type is among the faster data types of .NET. Can't say whether it is the most suitable in your application though.
Before pulling anything into .NET, you should check the length of the data in SQL Server using the LEN function. If the length is different, you know already that the two objects are different. This should save bringing down lots of unnecessary data from SQL Server to your client application.
I would also recommend storing a hash code (in a separate column from the binary data) using the CHECKSUM function (http://msdn.microsoft.com/en-us/library/aa258245(SQL.80).aspx). This will only work if you are using SQL Server 2005 and above and you are storing your data as varbinary(MAX). Once again, if the hash codes are different, the binary data is definitely different.
If you are using SQL Server 2000, you are stuck with the 'image' data type.
Both image or varbinary(MAX) will map nicely to byte[] objects on the client, however if you are using SQL Server 2008, you have the option of storing your data as a FILESTREAM data type (http://blogs.msdn.com/manisblog/archive/2007/10/21/filestream-data-type-sql-server-2008.aspx).