I'm trying to replace a big text file that we keep in memory with an SQLite DB.
We have this large file, which consists of common passwords that we refuse when a user creates a new password. It's about 4M lines, and 41MB on disk (uncompressed; compressed it's 11MB). We historically loaded it at startup into a Java Set and queried it. The issue is that it takes a whopping 400MB of RAM once loaded.
So I figured I would put these words in an SQLite DB and query it at runtime. To my surprise, the database is humongous, even after a VACUUM:
-rw-r--r-- 1 guillaume admin 137M 11 mar 18:47 commonPasswords.db
I'm using a very simple schema:
CREATE TABLE common_passwords (
    id INTEGER PRIMARY KEY,
    password VARCHAR(50) NOT NULL UNIQUE
);
I figured that SQLite would be able to compress that textual list to a size somewhere between 40MB and the GZipped size (11MB). GZipping the DB shrinks it to 63MB, but at runtime I'd still have to unzip it (and so have it in memory...), so it's useless.
Do you have any idea how to store that list in a more efficient way so that it is queryable from Java? Or how to tune SQLite to optimize for size? I'm not interested in performance, since this DB is only queried a few times per hour.
Don't use SQLite or any database for your purpose. Creating a table with just 2 columns for the sake of compressing a long list of data defeats the purpose of structure in SQL, and you also have to deal with the extra communication overhead.
High Level Overview
In your case I am assuming you have an unencrypted text file of 4 million lines where each password is stored on a new line.
3 simple steps for processing any large amount of sequential data
Pagify: Divide your file up into chunks or pages. Out of the 4 million lines, process every 100, 1,000 or N lines and write them as separate chunks inside your file.
Process: Processing means taking the above chunks and compressing/deflating them.
The Java Deflater class has a setDictionary method where you pass the most frequently occurring byte sequence inside your target chunk, which improves overall compression quality.
Finding what value to pass to this parameter is a little complicated, as we will see in the coming sections.
Lazy Read: The problem with your previous approach is that you treated your text file as a single unit which you loaded fully into memory. But now, since your file is in chunks, we read, check, and free one chunk at a time. There are 2 advantages to this approach:
After reading a chunk you can free it from memory if it does not match your criteria.
If the chunk contains the password you are looking for, then you can stop processing right there and don't have to process further chunks, thus saving computation time. Chunks towards the end of the file can be skipped if what you are looking for is at the beginning of the file.
Test Case
I use this code to generate 4 million lines of text for testing purposes.
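A minimal sketch of such a generator (the output file name passwords.txt and the 8 to 16 character lowercase format are illustrative assumptions, not necessarily the exact original code):

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ThreadLocalRandom;

public class TestDataGenerator {
    public static void main(String[] args) throws IOException {
        // Write 4 million random lowercase "passwords", one per line.
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("passwords.txt"))) {
            for (int i = 0; i < 4_000_000; i++) {
                int len = ThreadLocalRandom.current().nextInt(8, 17);
                StringBuilder sb = new StringBuilder(len);
                for (int j = 0; j < len; j++) {
                    sb.append((char) ('a' + ThreadLocalRandom.current().nextInt(26)));
                }
                out.write(sb.toString());
                out.newLine();
            }
        }
    }
}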
So we have our test file; now let's get started.
Finding the longest & most frequently occurring string in your list of passwords [Optional operation]
As mentioned, setDictionary accepts the bytes described above for proper compression. In most cases you don't need to use this, as the compressor is usually smart enough to figure it out, but if you pass it the exact pattern occurring in your list of passwords, then compression speed and quality are improved. Now comes the complicated part:
Figuring out the sequence
Let's say we have an example string.
The most frequently occurring substring is /ABC/, including the slashes. How we find this string is as follows.
Remember, the longer the string we find, the better for compression. For example, let's say we have 2 passwords.
For each of the substrings generated, we find the number of occurrences in the password list and store the one with the highest occurrence count.
But if we encounter a substring that is longer than the previous one, we override the previous longest word irrespective of its occurrences, as it is always favoured more in compression.
The Dictionary class does this
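A brute-force sketch of such a class, consistent with the Dictionary.getLongestOccurence(lines, minLen, maxLen) call quoted further down (an illustration of the idea, not the exact original implementation):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Counts every substring of length minLen..maxLen across the given lines and
// keeps the best candidate, preferring longer substrings as described above.
public class Dictionary {
    public static String getLongestOccurence(List<String> lines, int minLen, int maxLen) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (int len = minLen; len <= maxLen; len++) {
                for (int i = 0; i + len <= line.length(); i++) {
                    counts.merge(line.substring(i, i + len), 1, Integer::sum);
                }
            }
        }
        String best = "";
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String candidate = e.getKey();
            int count = e.getValue();
            if (count < 2) {
                continue; // a dictionary entry is only useful if it repeats
            }
            // Longer candidates win outright; equal lengths are decided by count.
            if (candidate.length() > best.length()
                    || (candidate.length() == best.length() && count > bestCount)) {
                best = candidate;
                bestCount = count;
            }
        }
        return best;
    }
}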
Here we are looking for a frequently occurring word with a minimum length of 2 and a maximum length of 4.
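A hypothetical call for an input like the example above (the input string here is made up):

// Finds a frequent substring of length 2..4, e.g. "/ABC" or "ABC/" with the sketch above.
String dict = Dictionary.getLongestOccurence(List.of("xx/ABC/yy/ABC/zz"), 2, 4);
System.out.println(dict);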
We get
Difference in compression?
With all that complex logic, what compression do we get? Let's see.
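A small comparison sketch (the sample text is made up; the setDictionary line is the one quoted below):

import java.util.List;
import java.util.zip.Deflater;

public class CompressionTest {
    public static void main(String[] args) {
        String text = "pass/ABC/word/ABC/1234/ABC/";   // assumed sample input
        byte[] input = text.getBytes();

        Deflater def = new Deflater(Deflater.BEST_COMPRESSION);
        // Comment this line out to compare compression without the dictionary.
        def.setDictionary(Dictionary.getLongestOccurence(List.of(text), 2, 6).getBytes());
        def.setInput(input);
        def.finish();

        byte[] out = new byte[input.length + 64];
        int compressedLength = def.deflate(out);
        def.end();

        System.out.println("uncompressed: " + input.length
                + " bytes, compressed: " + compressedLength + " bytes");
    }
}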
Output
So, using the dictionary, the ratio of uncompressed to compressed data is 28/24.
If you comment out the line
//def.setDictionary(Dictionary.getLongestOccurence(List.of(text),2,6).getBytes());
We get
Not so great, but we saved 2 whole bytes of data by using the dictionary. Of course this is a small test case, but if you sort your passwords so that very similar passwords are close to each other, the savings become much greater.
Pagifying your data [Main Core]
This is the most important part of the operation where all your savings come into play.
It's pretty simple, as the code will explain it all.
Basically, we divide your file into pages of N lines, compress each page, and write it as a chunk into a new file. This method also accepts the min and max word length to use in your dictionary. You can pass -1 for either value if you wish to skip using the dictionary. The performance benefit of the dictionary largely depends on your password data and is up to your preference.
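A simplified sketch of this pagify-and-compress step (the chunk layout, file handling, and method signature are illustrative assumptions, not the exact original Compressor.java):

import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

// Chunk layout: [dictLen][dict][uncompressedLen][compressedLen][compressed bytes]
// e.g. Compressor.compress("passwords.txt", "passwords.chunks", 1000, 2, 6);
public class Compressor {

    public static void compress(String inFile, String outFile, int linesPerPage,
                                int minWordLen, int maxWordLen) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get(inFile));
             DataOutputStream out = new DataOutputStream(new FileOutputStream(outFile))) {
            List<String> page = new ArrayList<>(linesPerPage);
            String line;
            while ((line = in.readLine()) != null) {
                page.add(line);
                if (page.size() == linesPerPage) {
                    writeChunk(out, page, minWordLen, maxWordLen);
                    page.clear();
                }
            }
            if (!page.isEmpty()) {
                writeChunk(out, page, minWordLen, maxWordLen);
            }
        }
    }

    private static void writeChunk(DataOutputStream out, List<String> page,
                                   int minWordLen, int maxWordLen) throws IOException {
        byte[] data = (String.join("\n", page) + "\n").getBytes();
        byte[] dict = new byte[0];
        Deflater def = new Deflater(Deflater.BEST_COMPRESSION);
        if (minWordLen > 0 && maxWordLen > 0) {
            dict = Dictionary.getLongestOccurence(page, minWordLen, maxWordLen).getBytes();
            if (dict.length > 0) {
                def.setDictionary(dict);
            }
        }
        def.setInput(data);
        def.finish();
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!def.finished()) {
            int n = def.deflate(buf);
            compressed.write(buf, 0, n);
        }
        def.end();

        out.writeInt(dict.length);
        out.write(dict);
        out.writeInt(data.length);
        out.writeInt(compressed.size());
        compressed.writeTo(out);
    }
}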
Lazy Reading [Final]
Finally, we read each chunk, and if the chunk contains the password we are looking for, we return from the method right there and don't proceed further.
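A matching sketch of the lazy reader, assuming the chunk layout written by the Compressor sketch above (an illustration, not the exact original Validater.java):

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// e.g. Validater.contains("passwords.chunks", "hunter2")
public class Validater {

    public static boolean contains(String chunkFile, String password) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(chunkFile))) {
            while (true) {
                int dictLen;
                try {
                    dictLen = in.readInt();
                } catch (EOFException end) {
                    return false;              // no more chunks, password not found
                }
                byte[] dict = new byte[dictLen];
                in.readFully(dict);
                int uncompressedLen = in.readInt();
                byte[] compressed = new byte[in.readInt()];
                in.readFully(compressed);

                String page = inflate(compressed, dict, uncompressedLen);
                for (String candidate : page.split("\n")) {
                    if (candidate.equals(password)) {
                        return true;           // stop here; remaining chunks are never read
                    }
                }
                // page goes out of scope here, so the chunk can be garbage collected
            }
        }
    }

    private static String inflate(byte[] compressed, byte[] dict, int uncompressedLen)
            throws IOException {
        Inflater inf = new Inflater();
        inf.setInput(compressed);
        byte[] out = new byte[uncompressedLen];
        int total = 0;
        try {
            while (total < out.length && !inf.finished()) {
                int n = inf.inflate(out, total, out.length - total);
                if (n == 0 && inf.needsDictionary()) {
                    inf.setDictionary(dict);   // same preset dictionary used on compression
                    continue;
                }
                if (n == 0 && inf.needsInput()) {
                    break;                     // truncated chunk
                }
                total += n;
            }
        } catch (DataFormatException e) {
            throw new IOException(e);
        } finally {
            inf.end();
        }
        return new String(out, 0, total);
    }
}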
Performance
The above test case uses some random passwords I copied from different parts of the passwords file (opened in Notepad).
Before executing, you must compress your file using:
Here are the results
As you can see, the worst case took just 1554 milliseconds to execute over 4 million lines of passwords, and the memory usage didn't even exceed 1 MB on my potato system.
The Full Code
Dictionary.java
Compressor.java
Validater.java
All test cases (rename any method to main and run it to test)
Conclusion
Well, this was very insightful. All the classes used here are present in Java 7, except streams, which are Java 8 and supported by most Android versions, so you should be OK on that end. If you can't be bothered to read my poorly documented explanations, you can go right ahead and copy the full code from the Full Code section. This took pretty much the whole day. One downside is that while your compressed file was 11 MB before, the final compressed file on my system came to 26.9 MB for my randomly generated passwords, but as you have seen from the outputs, memory and speed are overall very good. You can play around with the settings to get your preferred sizes. Also, if you have a better approach for finding the dictionary, please do edit this answer. Don't bother using SQLite.
You have a good day now :)
EDIT
The Dictionary class is now much faster and improved. Hope it helps.
If performance is really no problem, what about a sorted array that you apply a binary search on? This takes memory space closest to the raw size of the data, and once sorted, access to the entries of the array is relatively fast.
Admittedly, updating the password list would be a nightmare, but I understood that this is not your issue here.
And of course, this is not the answer to your question on how to optimise your SQLite database.
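For illustration, a minimal sketch of that idea (the file name is an assumption):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class CommonPasswords {
    private final String[] sorted;

    public CommonPasswords(String file) throws IOException {
        String[] lines = Files.readAllLines(Paths.get(file)).toArray(new String[0]);
        Arrays.sort(lines);          // sort once at startup
        this.sorted = lines;
    }

    public boolean isCommon(String password) {
        // Arrays.binarySearch returns a non-negative index only on an exact match.
        return Arrays.binarySearch(sorted, password) >= 0;
    }
}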