优化SQLITE存储大量单词

发布于 2025-01-14 01:15:00 字数 753 浏览 1 评论 0 原文

我正在尝试用 SQLite DB 替换我们在内存中保存的大文本文件。

我们有这个大文件,其中包含当用户创建新密码时我们拒绝的常用密码。大约有 4M 行,磁盘上有 41MB(未压缩,压缩后为 11MB)。过去,我们在启动时将其加载到 Java Set 中并对其进行查询。问题是加载后需要 400MB 的内存。

所以我想,我会将这些单词放入 sqlite 数据库中并在运行时查询它。令我惊讶的是,即使在 VACUUM; 之后,数据库也是巨大的:

-rw-r--r--  1 guillaume  admin   137M 11 mar 18:47 commonPasswords.db

我使用的是一个非常简单的模式:

CREATE TABLE common_passwords (
    id INTEGER PRIMARY KEY ,
    password VARCHAR(50) NOT NULL UNIQUE
);

我认为 SQLite 能够将该文本列表压缩到 40MB 和 GZIPed 之间的某个位置大小(11MB)。 GZiping 数据库将其缩小到 63MB,但在运行时我仍然需要解压缩它(所以将它保存在内存中..),所以它毫无用处。

您是否知道如何以更有效的方式存储该列表,以便可以在 Java 中对其进行查询?或者如何调整 SQLite 以优化大小?我对性能不感兴趣,因为这个数据库每小时只查询几次。

I'm trying to replace a big text file that we keed in memory by an SQLite DB.

We have this large file, which consists of common passwords that we refuse when a user creates a new password. It's about 4M lines, and 41MB on disk (uncompressed, compressed is 11MB). We historically loaded it at startup in a Java Set and would query it. The issue is that it takes a whopping 400MB of ram once loaded.

So I figured, I will put these words in a sqlite DB and query it at runtime. To my surprise, the database is humongous, even after a VACUUM;:

-rw-r--r--  1 guillaume  admin   137M 11 mar 18:47 commonPasswords.db

I'm using a very simple schema:

CREATE TABLE common_passwords (
    id INTEGER PRIMARY KEY ,
    password VARCHAR(50) NOT NULL UNIQUE
);

I figured that SQLite would be able to compress that textual list to somewhere between 40MB and the GZIPed size (11MB). GZiping the DB shrinks it to 63MB but at runtime I'll still have to unzip it (so have it in memory..) so it's useless.

Do you have any idea on how to store that list in a more efficient way so that it is queryable in Java? Or how to tune SQLite to optimize for size? I'm not interested in perf, since this DB is only queried a few times per hour.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

沫尐诺 2025-01-21 01:15:00

不要将 sqlite 或任何数据库用于您的目的。为了压缩一长串数据而创建一个只有 2 列的表违背了 SQL 中结构的目的,并且您还必须处理额外的通信开销。

高级概述

在您的情况下,我假设您有一个 400 万行的未加密文本文件,其中每个密码都存储在一个新行中。

处理任何大量连续数据的 3 个简单步骤

Pagify:将文件划分为块或页面。在 400 万行中尝试处理每 100、1000 或 N 行,并将它们作为单独的块写入文件中

处理:处理意味着获取上述块并压缩/放气它们。

java Deflater 类有一个 setDictionary 方法,您可以在目标块中传递最常出现的字节序列,从而提高整体压缩质量。

找到要传递给此参数的值有点复杂,我们将在接下来的部分中看到。

延迟读取:您以前的方法的问题是您将文本文件视为一个完整的单元,并将其完全加载到内存中。但现在由于您的文件是分块的,我们一次读取、检查、释放每个块。这种方法有 2 个优点

  1. 读取一个块后,如果它不符合您的条件,您可以将其从内存中释放

  2. 如果该块包含您正在查找的密码,那么您可以停止处理本身,而不必处理更多块,从而节省计算时间。如果您要查找的内容位于文件开头,则可以跳过文件末尾的块。

测试用例

我使用此代码生成 400 万行文本用于测试目的

public static void main(String[] args) throws Exception
 {
  try(PrintWriter writer=new PrintWriter(new File("E:/Passwords.txt")))
  {
   //for random password generation
   Random random=new Random();
   int 
   ch, 
   passlength;
   String password;
   
   //4 million lines[comes upto 40MB]
   for(int i=0;i<4 * 1000 * 1000;i++)
   {
    password="";
    
    //generate any password whose length is between 5 and 15 characters
    passlength=random.nextInt(5,15);
    for(int j=0;j<passlength;j++)
    {
     //65->122 is the range of all alphabetic characters in US ASCII
     ch=random.nextInt(65,122);
     
     //exclude characters like [ ( and so on 
     if(ch>=91 && ch<=96){continue;}
     
     password+=(char)(ch);
    }
    
    //write password to file
    writer.println(password);
   }
  }
 }

所以我们现在有了测试文件,让我们开始

查找最长的和最长的文本。密码列表中最常出现的字符串[可选操作]

如前所述 setDictionary 接受上述所需字节以进行正确压缩。在大多数情况下,您不需要使用它,因为压缩器大多数时候都足够聪明,可以解决这个问题,但如果您将密码列表中出现的确切模式传递给它,那么压缩速度和性能都会得到提高。现在是复杂的部分

弄清楚顺序
假设我们有一个示例字符串

  ABC/123/ABC/456/ABC/789

最常出现的字符串是 /ABC/ 包括斜杠 我们如何找到该字符串如下

  1. 找到列表中最长密码的长度,这也是最长的单词,以检查它是否在列表中出现[长度称为 N]

  2. 创建一个 for 循环 n,它从 2->N 开始,其中 n 是从 i->i+n 开始的子字符串的长度,我们应该创建该子字符串,然后检查它的值
    出现在密码列表中

字符串越长,压缩效果越好,例如,假设我们有 2 个密码,

ABC
ABCD

Longest Password=length 3[2nd line]

for i=0->length of full text
   for n=2->N
    Generate  substring from i->i+n
     check for occurences

对于生成的每个字符串,我们查找密码列表中出现的次数,并存储出现次数最多的一个。

但是,如果我们遇到一个比前一个长度更长的子字符串,我们会覆盖前一个最长的单词,无论它是否出现,因为它在压缩中总是更受青睐。

Dictionary 类会执行此操作

import java.util.List;
import java.util.stream.Collectors;

final class Dictionary
{
 private static final class Info
 {
  private final int longestLine;
  
  private int highestOccurence;
  private String longestWord="";
  
  public Info(int longestLine) {this.longestLine = longestLine;}
 } 
 
 private static Info getInfo(List<String> lines)
 {
  int 
  currentLine, 
  longestLine=0;
  
  for(String line:lines)
  {
   //find longest line in the password list
   if((currentLine=line.length())>longestLine)
   {
    longestLine=currentLine;
   }
  }
  
  return new Info(longestLine);
 }
 
 static String getLongestOccurence(List<String> lines,int minLength,int maxLength)
 {
  //user dosent want dictionary hence return empty string
  if(minLength<0){return "";}
  
  Info info=getInfo(lines);
 
  minLength=Math.max(2,minLength);
  maxLength=maxLength<=0?info.longestLine:Math.min(maxLength,info.longestLine);
  
  String fullText=lines.stream().collect(Collectors.joining("\n"));
  int 
  startIndex=0, 
  count;
  
  int length=fullText.length();
  outer : for(int i=0;i<length;i++)
  {
   for(int n=minLength;n<maxLength;n++)
   {
    //if substring is longer than text length break
    if((i+n+1)>length){break outer;}
    //if combination length is lesser than previous longestWord length then continue as we dont care about shorter words
    if(n<info.longestWord.length()){continue;}   
    //get combination
    String combination=fullText.substring(i,i+n+1);  
    
    //count occurences
    count=0;
    do
    {
     startIndex=fullText.indexOf(combination,startIndex);
     if(startIndex!=-1)
     {
      startIndex+=combination.length();
      
      count++;
     }
    } 
    while(startIndex!=-1);
    
    //we only care about more than one occurences
    if(count>1)
    {   
     if(length>info.longestWord.length()) //if new combination is longer than previous one give it more precedence
     {
      info.longestWord=combination;
      info.highestOccurence=count;
     }
     else if(count>info.highestOccurence)  //else if both are equal length then check counts
     {
      info.longestWord=combination;       
      info.highestOccurence=count;      
     }
    }
   }
  }
  
  return info.longestWord;
 }
}



 

这里我们正在寻找一个频繁出现的最小长度的单词2 和最大长度 4

 public static void main(String[] args)
 {
  System.out.println
  (
   "Dictionary="+Dictionary.getLongestOccurence
   (
    List.of
    (
      "Pass A"
     ,"Pass B"
     ,"Pass C"
     ,"Pass D"
    )
    ,2,4
   )
  );
 }

我们得到

Dictionary=Pass

压缩差异?

有了如此复杂的逻辑,我们能得到什么压缩呢?让我们看看

public static void main(String[] args)throws Exception
 {
  String text="Pass A\nPass B\n Pass C\nPass D";
  
  Deflater def=new Deflater(9);
  def.setDictionary(Dictionary.getLongestOccurence(List.of(text),2,6).getBytes());
  
  byte[] original=text.getBytes();
  try(ByteArrayOutputStream baos=new ByteArrayOutputStream())
  {
   try(DeflaterOutputStream dos=new DeflaterOutputStream(baos,def))
   {
    dos.write(original);
    dos.flush();
   }
   System.out.println(original.length+"/"+baos.toByteArray().length);
  }    
 }

输出

28/24

所以使用字典,未压缩数据与压缩数据的比率是 28/24

如果注释掉该行
//def.setDictionary(Dictionary.getLongestOccurence(List.of(text),2,6).getBytes());

我们得到的结果

    28/26

不太好,但我们通过使用保存了 2 个完整字节的数据字典。当然,这是一个小测试用例,但如果您对密码进行排序,使非常相似的密码彼此接近,那么节省的费用就会更大。

分页数据[Main Core]

这是操作中最重要的部分,您的所有积蓄都会发挥作用。

这非常简单,因为代码将解释一切

import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.stream.Collectors;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;


final class Compressor
{
 public static void compress(File textFile,int linesPerChunk,int minWord,int maxWord,File output)throws Exception
 {
  try(BufferedReader reader=new BufferedReader(new FileReader(textFile));
      FileOutputStream fos=new FileOutputStream(output))
  {
   ArrayList<String> lines=new ArrayList();
   Deflater compresser=new Deflater(9);
   ByteArrayOutputStream baos=new ByteArrayOutputStream();
   DataOutputStream dos=new DataOutputStream(fos);
   int linesDone=0;
   
   String line;   
   while(true)
   {
    line=reader.readLine();
    
    //add only if more input is available
    if(line!=null){lines.add(line);}

    //parse if we have linePerChunk number of lines or if this was the last input
    if(line==null || lines.size()>=linesPerChunk)
    {
     //occurs if this was the last input and there are no more left over lines to parse
     if(lines.isEmpty()){break;}
     
     byte[] dictionary=Dictionary.getLongestOccurence(lines,minWord,maxWord).getBytes();
     
     //Compress chunk
     baos.reset();      
     compresser.reset();
     if(dictionary.length!=0){compresser.setDictionary(dictionary);}    
     try(DeflaterOutputStream deos=new DeflaterOutputStream(baos,compresser))
     {
      deos.write(lines.stream().collect(Collectors.joining("\n")).getBytes());
      deos.flush();
     }
     byte[] compressed=baos.toByteArray();
     
     //write chunk
     dos.writeInt(dictionary.length);
     dos.write(dictionary);
     dos.writeInt(compressed.length);
     dos.write(compressed);
     dos.flush();
     
     System.out.println("Compressed Lines="+(linesDone+=lines.size()));
     
     //clear to read next chunk
     lines.clear();
    }
    
   
    if(line==null){break;}
   }
   
   //used while reading to check for end point
   dos.writeInt(-1);
  }
 }
}

基本上我们将文件分为 N 行页面并压缩每个页面并将其作为块写入新文件中。此方法还接受字典中使用的最小和最大单词长度。如果您想跳过使用字典,您可以为任一值传递 -1。字典的性能优势很大程度上取决于您的密码数据,并且是您的偏好

Lazy Reading[Final]

最后,我们读取每个块,如果该块包含我们正在查找的密码,我们仅返回那里的方法并且不这样做不再继续

public class Validater 
{
 public static boolean isAllowed(String password,File compressed)throws Exception
 {
  Inflater inf=new Inflater();
  ByteArrayOutputStream baos=new ByteArrayOutputStream();
  
  try(DataInputStream dis=new DataInputStream(new FileInputStream(compressed));)
  {
   int size;
   while((size=dis.readInt())!=-1)
   {
     byte[] dictionary=new byte[size];
     dis.read(dictionary);
     byte[] data=new byte[dis.readInt()];
     dis.read(data);

     inf.reset();
     try(InflaterInputStream iis=new InflaterInputStream(new ByteArrayInputStream(data),inf))
     {
      baos.reset();
      
      int length;
      byte[] output=new byte[1000];   
      while(true)
      {
       length=iis.read(output);
       if(inf.needsDictionary()){inf.setDictionary(dictionary);}
       else
       {
        if(length==-1){break;}
        baos.write(output,0,length);
       }
      }
     }
     
     //the actual validation
     if(new String(baos.toByteArray()).contains(password)){return false;}  
   }
  }
  return true;
 }
}

性能

 public static void main(String[] args)throws Exception
 {
  //beginning of the file
  long start=System.currentTimeMillis();
  System.out.println(Validater.isAllowed("ELIdjajPKsmeZ",new File("E:/Compressed.bin")));
  System.out.println(System.currentTimeMillis()-start);
  
  //middle of the file
  start=System.currentTimeMillis();
  System.out.println(Validater.isAllowed("kSIvmu",new File("E:/Compressed.bin")));
  System.out.println(System.currentTimeMillis()-start);
  
  //end of the file
  start=System.currentTimeMillis();
  System.out.println(Validater.isAllowed("CcFylRChZDyd",new File("E:/Compressed.bin")));
  System.out.println(System.currentTimeMillis()-start);
 }

上面的测试用例是我从记事本密码文件的每个部分复制的一些随机密码

在执行之前,您必须使用以下命令压缩文件

public static void main4(String[] args)throws Exception
 {
  //i used -1 to skip dictionary but you can specify whatever value and i create each chunk of length 100 lines
  Compressor.compress(new File("E:/Passwords.txt"),100,-1,4,new File("E:/Compressed.bin"));
 }

这是结果

false
5
false
713
false
1554 

正如您所看到的最坏情况只需 1554 毫秒在我的马铃薯系统中执行超过 400 万行密码,内存使用量甚至不超过 1 MB

完整代码

Dictionary.java

import java.util.List;
import java.util.stream.Collectors;

final class Dictionary
{
 private static final class Info
 {
  private final int longestLine;
  
  private int highestOccurence;
  private String longestWord="";
  
  public Info(int longestLine) {this.longestLine = longestLine;}
 } 
 
 private static Info getInfo(List<String> lines)
 {
  int 
  currentLine, 
  longestLine=0;
  
  for(String line:lines)
  {
   //find longest line in the password list
   if((currentLine=line.length())>longestLine)
   {
    longestLine=currentLine;
   }
  }
  
  return new Info(longestLine);
 }
 
 static String getLongestOccurence(List<String> lines,int minLength,int maxLength)
 {
  //user dosent want dictionary hence return empty string
  if(minLength<0){return "";}
  
  Info info=getInfo(lines);
 
  minLength=Math.max(2,minLength);
  maxLength=maxLength<=0?info.longestLine:Math.min(maxLength,info.longestLine);
  
  String fullText=lines.stream().collect(Collectors.joining("\n"));
  int 
  startIndex=0, 
  count;
  
  int length=fullText.length();
  outer : for(int i=0;i<length;i++)
  {
   for(int n=minLength;n<maxLength;n++)
   {
    //if substring is longer than text length break
    if((i+n+1)>length){break outer;}
    //if combination length is lesser than previous longestWord length then continue as we dont care about shorter words
    if(n<info.longestWord.length()){continue;}   
    //get combination
    String combination=fullText.substring(i,i+n+1);  
    
    //count occurences
    count=0;
    do
    {
     startIndex=fullText.indexOf(combination,startIndex);
     if(startIndex!=-1)
     {
      startIndex+=combination.length();
      
      count++;
     }
    } 
    while(startIndex!=-1);
    
    //we only care about more than one occurences
    if(count>1)
    {   
     if(length>info.longestWord.length()) //if new combination is longer than previous one give it more precedence
     {
      info.longestWord=combination;
      info.highestOccurence=count;
     }
     else if(count>info.highestOccurence)  //else if both are equal length then check counts
     {
      info.longestWord=combination;       
      info.highestOccurence=count;      
     }
    }
   }
  }
  
  return info.longestWord;
 }
}
 

Compressor.java

import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.stream.Collectors;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;


final class Compressor
{
 public static void compress(File textFile,int linesPerChunk,int minWord,int maxWord,File output)throws Exception
 {
  try(BufferedReader reader=new BufferedReader(new FileReader(textFile));
      FileOutputStream fos=new FileOutputStream(output))
  {
   ArrayList<String> lines=new ArrayList();
   Deflater compresser=new Deflater(9);
   ByteArrayOutputStream baos=new ByteArrayOutputStream();
   DataOutputStream dos=new DataOutputStream(fos);
   int linesDone=0;
   
   String line;   
   while(true)
   {
    line=reader.readLine();
    
    //add only if more input is available
    if(line!=null){lines.add(line);}

    //parse if we have linePerChunk number of lines or if this was the last input
    if(line==null || lines.size()>=linesPerChunk)
    {
     //occurs if this was the last input and there are no more left over lines to parse
     if(lines.isEmpty()){break;}
     
     byte[] dictionary=Dictionary.getLongestOccurence(lines,minWord,maxWord).getBytes();
     
     //Compress chunk
     baos.reset();      
     compresser.reset();
     if(dictionary.length!=0){compresser.setDictionary(dictionary);}    
     try(DeflaterOutputStream deos=new DeflaterOutputStream(baos,compresser))
     {
      deos.write(lines.stream().collect(Collectors.joining("\n")).getBytes());
      deos.flush();
     }
     byte[] compressed=baos.toByteArray();
     
     //write chunk
     dos.writeInt(dictionary.length);
     dos.write(dictionary);
     dos.writeInt(compressed.length);
     dos.write(compressed);
     dos.flush();
     
     System.out.println("Compressed Lines="+(linesDone+=lines.size()));
     
     lines.clear();
    }
    //clear to read next chunk
   
    if(line==null){break;}
   }
   
   //used while reading to check for end point
   dos.writeInt(-1);
  }
 }
}

Validater.java

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;


public class Validater 
{
 public static boolean isAllowed(String password,File compressed)throws Exception
 {
  Inflater inf=new Inflater();
  ByteArrayOutputStream baos=new ByteArrayOutputStream();
  
  try(DataInputStream dis=new DataInputStream(new FileInputStream(compressed));)
  {
   int size;
   while((size=dis.readInt())!=-1)
   {
     byte[] dictionary=new byte[size];
     dis.read(dictionary);
     byte[] data=new byte[dis.readInt()];
     dis.read(data);

     inf.reset();
     try(InflaterInputStream iis=new InflaterInputStream(new ByteArrayInputStream(data),inf))
     {
      baos.reset();
      
      int length;
      byte[] output=new byte[1000];   
      while(true)
      {
       length=iis.read(output);
       if(inf.needsDictionary()){inf.setDictionary(dictionary);}
       else
       {
        if(length==-1){break;}
        baos.write(output,0,length);
       }
      }
     }
     
     if(new String(baos.toByteArray()).contains(password)){return false;}  
   }
  }
  return true;
 }
}

所有测试用例将任何方法重命名为 main 和测试

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.PrintWriter;
import java.util.List;
import java.util.Random;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

final class Test 
{
 public static void main1(String[] args)throws Exception
 {
  String text="Pass A\nPass B\n Pass C\nPass D";
  
  Deflater def=new Deflater(9);
  //def.setDictionary(Dictionary.getLongestOccurence(List.of(text),2,6).getBytes());
  
  byte[] original=text.getBytes();
  try(ByteArrayOutputStream baos=new ByteArrayOutputStream())
  {
   try(DeflaterOutputStream dos=new DeflaterOutputStream(baos,def))
   {
    dos.write(original);
    dos.flush();
   }
   System.out.println(original.length+"/"+baos.toByteArray().length);
  }    
 }
 
 public static void main2(String[] args) throws Exception
 {
  try(PrintWriter writer=new PrintWriter(new File("E:/Passwords.txt")))
  {
   Random random=new Random();
   int 
   ch, 
   passlength;
   String password;
   
   for(int i=0;i<4 * 1000 * 1000;i++)
   {
    password="";
    
    passlength=random.nextInt(5,15);
    for(int j=0;j<passlength;j++)
    {
     ch=random.nextInt(65,122);
     
     if(ch>=91 && ch<=96){continue;}
     
     password+=(char)(ch);
    }
    if(password.isEmpty()){continue;}
    
    writer.println(password);
   }
  }
 }
 


 public static void main(String[] args)
 {
  System.out.println
  (
   "Dictionary="+Dictionary.getLongestOccurence
   (
    List.of
    (
      "ABC"
     ,"ABCD"
    )
    ,2,3
   )
  );
  
  System.out.println
  (
   "Dictionary="+Dictionary.getLongestOccurence
   (
    List.of
    (
      "Pass A1"
     ,"Pass B2"
     ,"Pass C3"
     ,"Pass D4"
    )
    ,2,4
   )
  );
  
  System.out.println
  (
   "Dictionary="+Dictionary.getLongestOccurence
   (
    List.of
    (
      "123/ABC/345/ABC/678/ABC/908"
    )
    ,2,5
   )
  );
  
  System.out.println
  (
   "Dictionary="+Dictionary.getLongestOccurence
   (
    List.of
    (
      "HHH123HHH456HHH789HHH092"
    )
    ,2,4
   )
  );
 }
 
 public static void main4(String[] args)throws Exception
 {
  Compressor.compress(new File("E:/Passwords2.txt"),2,2,4,new File("E:/Compressed2.bin"));
 }
 
 public static void main5(String[] args)throws Exception
 {
  //beginning of the file
  long start=System.currentTimeMillis();
  System.out.println(Validater.isAllowed("ELIdjajPKsmeZ",new File("E:/Compressed.bin")));
  System.out.println(System.currentTimeMillis()-start);
  
  //middle of the file
  start=System.currentTimeMillis();
  System.out.println(Validater.isAllowed("kSIvmu",new File("E:/Compressed.bin")));
  System.out.println(System.currentTimeMillis()-start);
  
  //end of the file
  start=System.currentTimeMillis();
  System.out.println(Validater.isAllowed("CcFylRChZDyd",new File("E:/Compressed.bin")));
  System.out.println(System.currentTimeMillis()-start);
 }
}

结论

嗯,这是非常有见地的。这里使用的所有类都存在于java 7中,除了stream,它是大多数android版本支持的java 8,所以你应该在这方面没问题。如果您懒得阅读我的记录不完善的解释,那么您可以直接在“完整代码”部分之后复制完整的代码。这几乎花了一整天的时间。一个缺点是,在您的压缩文件大小为 11 MB 之前,我系统中的最终压缩文件大小为 26.9 MB。对于我随机生成的密码,但正如您从输出中看到的那样,内存和速度总体上非常好。您可以尝试设置以获得您喜欢的尺寸此外,如果您有更好的方法来查找词典,请编辑此答案。不要费心使用 SqLite。

你今天过得很愉快:)

编辑
字典类速度更快并且得到了改进。希望有帮助

Dont use sqlite or any database for your purpose. Creating a table with just 2 columns for the sake of compressing an long list of data defeats the purpose of Structure in SQL and you also have to deal with the extra communication overhead.

High Level Overview

In your case i am assuming you have an unencrypted text file of 4 million lines where each password is stored in an new line.

3 simple steps for processing any large amount of sequential data

Pagify : Divide your file up into chunks or pages. Out of 4 million lines try to process every 100,1000 or N number of lines and write them as seperate chunks inside your file

Process : Processing means taking the above chunks and Compress/Deflate them.

The java Deflater class has an setDictionary method where you pass the most frequently occuring byte sequence inside your target chunk which improves overall compression quality.

Finding what value to pass to this parameter is a little complicated as we will see in the coming sections.

Lazy Read : The problem with your previous approach is you treated your text file as an single unit which you loaded fully into your memory. But now since your file is in chunks we read,check,free each chunk at a time. There are 2 advantages to this approach

  1. After reading an chunk you can free it from memory if it does not match your criteria

  2. If the chunk contains the password you are looking for then you can stop processing there itself and don't have to process further chunks thus saving computation time . Chunks towards the end of the file can be skipped if what you are looking for is in the beginning of the file.

Test Case

I use this code to generate your 4 million lines of text for testing purposes

public static void main(String[] args) throws Exception
 {
  try(PrintWriter writer=new PrintWriter(new File("E:/Passwords.txt")))
  {
   //for random password generation
   Random random=new Random();
   int 
   ch, 
   passlength;
   String password;
   
   //4 million lines[comes upto 40MB]
   for(int i=0;i<4 * 1000 * 1000;i++)
   {
    password="";
    
    //generate any password whose length is between 5 and 15 characters
    passlength=random.nextInt(5,15);
    for(int j=0;j<passlength;j++)
    {
     //65->122 is the range of all alphabetic characters in US ASCII
     ch=random.nextInt(65,122);
     
     //exclude characters like [ ( and so on 
     if(ch>=91 && ch<=96){continue;}
     
     password+=(char)(ch);
    }
    
    //write password to file
    writer.println(password);
   }
  }
 }

So we have our test file now lets get started

Finding the longest & most frequently occuring string in your list of passwords[Optional operation]

As mentioned the setDictionary accepts the above required bytes for proper compression. For most of the cases you don't need to use this as the Compressor most of the time is smart enough to figure this out but if you pass it the exact pattern occurring in your list of passwords then compression speed and performance is improved. Now comes the complicated part

Figuring out the sequence
Lets say we have an example string

  ABC/123/ABC/456/ABC/789

The most frequently occuring string is /ABC/ including the slashes how we find this string is as follows

  1. Find the length of the longest password in your list which will also be longest word to check for it's occurence in the list[length call it N]

  2. Create an for loop n which goes from 2->N where n is the length of the substring from i->i+n which we should create and then check for its
    occurence in the password list

Remember the longer the string we find the better for compression for example lets say we have 2 passwords

ABC
ABCD

Longest Password=length 3[2nd line]

for i=0->length of full text
   for n=2->N
    Generate  substring from i->i+n
     check for occurences

For each of these strings generated we find the number of occurences in the password list and we store the one with the highest occurence.

But if we encounter an substring with an longer length than the previous one we override the previous longest word irrespective of its occurences as it is always favoured more in compression

The Dictionary class does this

import java.util.List;
import java.util.stream.Collectors;

final class Dictionary
{
 private static final class Info
 {
  private final int longestLine;
  
  private int highestOccurence;
  private String longestWord="";
  
  public Info(int longestLine) {this.longestLine = longestLine;}
 } 
 
 private static Info getInfo(List<String> lines)
 {
  int 
  currentLine, 
  longestLine=0;
  
  for(String line:lines)
  {
   //find longest line in the password list
   if((currentLine=line.length())>longestLine)
   {
    longestLine=currentLine;
   }
  }
  
  return new Info(longestLine);
 }
 
 static String getLongestOccurence(List<String> lines,int minLength,int maxLength)
 {
  //user dosent want dictionary hence return empty string
  if(minLength<0){return "";}
  
  Info info=getInfo(lines);
 
  minLength=Math.max(2,minLength);
  maxLength=maxLength<=0?info.longestLine:Math.min(maxLength,info.longestLine);
  
  String fullText=lines.stream().collect(Collectors.joining("\n"));
  int 
  startIndex=0, 
  count;
  
  int length=fullText.length();
  outer : for(int i=0;i<length;i++)
  {
   for(int n=minLength;n<maxLength;n++)
   {
    //if substring is longer than text length break
    if((i+n+1)>length){break outer;}
    //if combination length is lesser than previous longestWord length then continue as we dont care about shorter words
    if(n<info.longestWord.length()){continue;}   
    //get combination
    String combination=fullText.substring(i,i+n+1);  
    
    //count occurences
    count=0;
    do
    {
     startIndex=fullText.indexOf(combination,startIndex);
     if(startIndex!=-1)
     {
      startIndex+=combination.length();
      
      count++;
     }
    } 
    while(startIndex!=-1);
    
    //we only care about more than one occurences
    if(count>1)
    {   
     if(length>info.longestWord.length()) //if new combination is longer than previous one give it more precedence
     {
      info.longestWord=combination;
      info.highestOccurence=count;
     }
     else if(count>info.highestOccurence)  //else if both are equal length then check counts
     {
      info.longestWord=combination;       
      info.highestOccurence=count;      
     }
    }
   }
  }
  
  return info.longestWord;
 }
}



 

Here we are looking for an frequently occuring word which is minimum length 2 and maximum length 4

 public static void main(String[] args)
 {
  System.out.println
  (
   "Dictionary="+Dictionary.getLongestOccurence
   (
    List.of
    (
      "Pass A"
     ,"Pass B"
     ,"Pass C"
     ,"Pass D"
    )
    ,2,4
   )
  );
 }

We get

Dictionary=Pass

Difference in compression?

With all that complex logic what compression do we get? lets see

public static void main(String[] args)throws Exception
 {
  String text="Pass A\nPass B\n Pass C\nPass D";
  
  Deflater def=new Deflater(9);
  def.setDictionary(Dictionary.getLongestOccurence(List.of(text),2,6).getBytes());
  
  byte[] original=text.getBytes();
  try(ByteArrayOutputStream baos=new ByteArrayOutputStream())
  {
   try(DeflaterOutputStream dos=new DeflaterOutputStream(baos,def))
   {
    dos.write(original);
    dos.flush();
   }
   System.out.println(original.length+"/"+baos.toByteArray().length);
  }    
 }

Output

28/24

So using dictionary the ratio of the uncompressed to the compressed data is 28/24

If you comment out the line
//def.setDictionary(Dictionary.getLongestOccurence(List.of(text),2,6).getBytes());

We get

    28/26

Not so great but we saved 2 whole bytes of data by using the dictionary. Off course this is an small test case but if you have sorted your passwords such that very similar passwords are close to each other then savings become much greater.

Pagifying your data[Main Core]

This is the most important part of the operation where all your savings come into play.

It's pretty simple as the code will explain it all

import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.stream.Collectors;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;


final class Compressor
{
 public static void compress(File textFile,int linesPerChunk,int minWord,int maxWord,File output)throws Exception
 {
  try(BufferedReader reader=new BufferedReader(new FileReader(textFile));
      FileOutputStream fos=new FileOutputStream(output))
  {
   ArrayList<String> lines=new ArrayList();
   Deflater compresser=new Deflater(9);
   ByteArrayOutputStream baos=new ByteArrayOutputStream();
   DataOutputStream dos=new DataOutputStream(fos);
   int linesDone=0;
   
   String line;   
   while(true)
   {
    line=reader.readLine();
    
    //add only if more input is available
    if(line!=null){lines.add(line);}

    //parse if we have linePerChunk number of lines or if this was the last input
    if(line==null || lines.size()>=linesPerChunk)
    {
     //occurs if this was the last input and there are no more left over lines to parse
     if(lines.isEmpty()){break;}
     
     byte[] dictionary=Dictionary.getLongestOccurence(lines,minWord,maxWord).getBytes();
     
     //Compress chunk
     baos.reset();      
     compresser.reset();
     if(dictionary.length!=0){compresser.setDictionary(dictionary);}    
     try(DeflaterOutputStream deos=new DeflaterOutputStream(baos,compresser))
     {
      deos.write(lines.stream().collect(Collectors.joining("\n")).getBytes());
      deos.flush();
     }
     byte[] compressed=baos.toByteArray();
     
     //write chunk
     dos.writeInt(dictionary.length);
     dos.write(dictionary);
     dos.writeInt(compressed.length);
     dos.write(compressed);
     dos.flush();
     
     System.out.println("Compressed Lines="+(linesDone+=lines.size()));
     
     //clear to read next chunk
     lines.clear();
    }
    
   
    if(line==null){break;}
   }
   
   //used while reading to check for end point
   dos.writeInt(-1);
  }
 }
}

Basically we divide your file into pages of N lines and compress each page and write it as a chunk into an new file. This method also accepts the min and max word length to use in your dictionary . You can pass -1 for either value if you wish to skip using dictionary. Performance benefits of dictionary largely depends on your password data and is your preference

Lazy Reading[Final]

Finally we read each chunk and if the chunk contains the password we are looking for we return the method there only and don't proceed further

public class Validater 
{
 public static boolean isAllowed(String password,File compressed)throws Exception
 {
  Inflater inf=new Inflater();
  ByteArrayOutputStream baos=new ByteArrayOutputStream();
  
  try(DataInputStream dis=new DataInputStream(new FileInputStream(compressed));)
  {
   int size;
   while((size=dis.readInt())!=-1)
   {
     byte[] dictionary=new byte[size];
     dis.read(dictionary);
     byte[] data=new byte[dis.readInt()];
     dis.read(data);

     inf.reset();
     try(InflaterInputStream iis=new InflaterInputStream(new ByteArrayInputStream(data),inf))
     {
      baos.reset();
      
      int length;
      byte[] output=new byte[1000];   
      while(true)
      {
       length=iis.read(output);
       if(inf.needsDictionary()){inf.setDictionary(dictionary);}
       else
       {
        if(length==-1){break;}
        baos.write(output,0,length);
       }
      }
     }
     
     //the actual validation
     if(new String(baos.toByteArray()).contains(password)){return false;}  
   }
  }
  return true;
 }
}

Performance

 public static void main(String[] args)throws Exception
 {
  //beginning of the file
  long start=System.currentTimeMillis();
  System.out.println(Validater.isAllowed("ELIdjajPKsmeZ",new File("E:/Compressed.bin")));
  System.out.println(System.currentTimeMillis()-start);
  
  //middle of the file
  start=System.currentTimeMillis();
  System.out.println(Validater.isAllowed("kSIvmu",new File("E:/Compressed.bin")));
  System.out.println(System.currentTimeMillis()-start);
  
  //end of the file
  start=System.currentTimeMillis();
  System.out.println(Validater.isAllowed("CcFylRChZDyd",new File("E:/Compressed.bin")));
  System.out.println(System.currentTimeMillis()-start);
 }

The above test case is some random passwords i copied from each part of the notepad passwords file

Before executing you must compress your file using

public static void main4(String[] args)throws Exception
 {
  //i used -1 to skip dictionary but you can specify whatever value and i create each chunk of length 100 lines
  Compressor.compress(new File("E:/Passwords.txt"),100,-1,4,new File("E:/Compressed.bin"));
 }

Here are the results

false
5
false
713
false
1554 

As you can see the worst case took just 1554 milliseconds to execute over 4 million lines of passwords and the memory usage didn't even exceed 1 MB in my potato system

The Full Code

Dictionary.java

import java.util.List;
import java.util.stream.Collectors;

final class Dictionary
{
 private static final class Info
 {
  private final int longestLine;
  
  private int highestOccurence;
  private String longestWord="";
  
  public Info(int longestLine) {this.longestLine = longestLine;}
 } 
 
 private static Info getInfo(List<String> lines)
 {
  int 
  currentLine, 
  longestLine=0;
  
  for(String line:lines)
  {
   //find longest line in the password list
   if((currentLine=line.length())>longestLine)
   {
    longestLine=currentLine;
   }
  }
  
  return new Info(longestLine);
 }
 
 static String getLongestOccurence(List<String> lines,int minLength,int maxLength)
 {
  //user dosent want dictionary hence return empty string
  if(minLength<0){return "";}
  
  Info info=getInfo(lines);
 
  minLength=Math.max(2,minLength);
  maxLength=maxLength<=0?info.longestLine:Math.min(maxLength,info.longestLine);
  
  String fullText=lines.stream().collect(Collectors.joining("\n"));
  int 
  startIndex=0, 
  count;
  
  int length=fullText.length();
  outer : for(int i=0;i<length;i++)
  {
   for(int n=minLength;n<maxLength;n++)
   {
    //if substring is longer than text length break
    if((i+n+1)>length){break outer;}
    //if combination length is lesser than previous longestWord length then continue as we dont care about shorter words
    if(n<info.longestWord.length()){continue;}   
    //get combination
    String combination=fullText.substring(i,i+n+1);  
    
    //count occurences
    count=0;
    do
    {
     startIndex=fullText.indexOf(combination,startIndex);
     if(startIndex!=-1)
     {
      startIndex+=combination.length();
      
      count++;
     }
    } 
    while(startIndex!=-1);
    
    //we only care about more than one occurences
    if(count>1)
    {   
     if(length>info.longestWord.length()) //if new combination is longer than previous one give it more precedence
     {
      info.longestWord=combination;
      info.highestOccurence=count;
     }
     else if(count>info.highestOccurence)  //else if both are equal length then check counts
     {
      info.longestWord=combination;       
      info.highestOccurence=count;      
     }
    }
   }
  }
  
  return info.longestWord;
 }
}
 

Compressor.java

import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.stream.Collectors;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;


final class Compressor
{
 public static void compress(File textFile,int linesPerChunk,int minWord,int maxWord,File output)throws Exception
 {
  try(BufferedReader reader=new BufferedReader(new FileReader(textFile));
      FileOutputStream fos=new FileOutputStream(output))
  {
   ArrayList<String> lines=new ArrayList();
   Deflater compresser=new Deflater(9);
   ByteArrayOutputStream baos=new ByteArrayOutputStream();
   DataOutputStream dos=new DataOutputStream(fos);
   int linesDone=0;
   
   String line;   
   while(true)
   {
    line=reader.readLine();
    
    //add only if more input is available
    if(line!=null){lines.add(line);}

    //parse if we have linePerChunk number of lines or if this was the last input
    if(line==null || lines.size()>=linesPerChunk)
    {
     //occurs if this was the last input and there are no more left over lines to parse
     if(lines.isEmpty()){break;}
     
     byte[] dictionary=Dictionary.getLongestOccurence(lines,minWord,maxWord).getBytes();
     
     //Compress chunk
     baos.reset();      
     compresser.reset();
     if(dictionary.length!=0){compresser.setDictionary(dictionary);}    
     try(DeflaterOutputStream deos=new DeflaterOutputStream(baos,compresser))
     {
      deos.write(lines.stream().collect(Collectors.joining("\n")).getBytes());
      deos.flush();
     }
     byte[] compressed=baos.toByteArray();
     
     //write chunk
     dos.writeInt(dictionary.length);
     dos.write(dictionary);
     dos.writeInt(compressed.length);
     dos.write(compressed);
     dos.flush();
     
     System.out.println("Compressed Lines="+(linesDone+=lines.size()));
     
     lines.clear();
    }
    //clear to read next chunk
   
    if(line==null){break;}
   }
   
   //used while reading to check for end point
   dos.writeInt(-1);
  }
 }
}

Validater.java

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;


public class Validater 
{
 public static boolean isAllowed(String password,File compressed)throws Exception
 {
  Inflater inf=new Inflater();
  ByteArrayOutputStream baos=new ByteArrayOutputStream();
  
  try(DataInputStream dis=new DataInputStream(new FileInputStream(compressed));)
  {
   int size;
   while((size=dis.readInt())!=-1)
   {
     byte[] dictionary=new byte[size];
     dis.read(dictionary);
     byte[] data=new byte[dis.readInt()];
     dis.read(data);

     inf.reset();
     try(InflaterInputStream iis=new InflaterInputStream(new ByteArrayInputStream(data),inf))
     {
      baos.reset();
      
      int length;
      byte[] output=new byte[1000];   
      while(true)
      {
       length=iis.read(output);
       if(inf.needsDictionary()){inf.setDictionary(dictionary);}
       else
       {
        if(length==-1){break;}
        baos.write(output,0,length);
       }
      }
     }
     
     if(new String(baos.toByteArray()).contains(password)){return false;}  
   }
  }
  return true;
 }
}

All Test cases rename any method to main and test

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.PrintWriter;
import java.util.List;
import java.util.Random;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

final class Test 
{
 public static void main1(String[] args)throws Exception
 {
  String text="Pass A\nPass B\n Pass C\nPass D";
  
  Deflater def=new Deflater(9);
  //def.setDictionary(Dictionary.getLongestOccurence(List.of(text),2,6).getBytes());
  
  byte[] original=text.getBytes();
  try(ByteArrayOutputStream baos=new ByteArrayOutputStream())
  {
   try(DeflaterOutputStream dos=new DeflaterOutputStream(baos,def))
   {
    dos.write(original);
    dos.flush();
   }
   System.out.println(original.length+"/"+baos.toByteArray().length);
  }    
 }
 
 public static void main2(String[] args) throws Exception
 {
  try(PrintWriter writer=new PrintWriter(new File("E:/Passwords.txt")))
  {
   Random random=new Random();
   int 
   ch, 
   passlength;
   String password;
   
   for(int i=0;i<4 * 1000 * 1000;i++)
   {
    password="";
    
    passlength=random.nextInt(5,15);
    for(int j=0;j<passlength;j++)
    {
     ch=random.nextInt(65,122);
     
     if(ch>=91 && ch<=96){continue;}
     
     password+=(char)(ch);
    }
    if(password.isEmpty()){continue;}
    
    writer.println(password);
   }
  }
 }
 


 public static void main(String[] args)
 {
  System.out.println
  (
   "Dictionary="+Dictionary.getLongestOccurence
   (
    List.of
    (
      "ABC"
     ,"ABCD"
    )
    ,2,3
   )
  );
  
  System.out.println
  (
   "Dictionary="+Dictionary.getLongestOccurence
   (
    List.of
    (
      "Pass A1"
     ,"Pass B2"
     ,"Pass C3"
     ,"Pass D4"
    )
    ,2,4
   )
  );
  
  System.out.println
  (
   "Dictionary="+Dictionary.getLongestOccurence
   (
    List.of
    (
      "123/ABC/345/ABC/678/ABC/908"
    )
    ,2,5
   )
  );
  
  System.out.println
  (
   "Dictionary="+Dictionary.getLongestOccurence
   (
    List.of
    (
      "HHH123HHH456HHH789HHH092"
    )
    ,2,4
   )
  );
 }
 
 public static void main4(String[] args)throws Exception
 {
  Compressor.compress(new File("E:/Passwords2.txt"),2,2,4,new File("E:/Compressed2.bin"));
 }
 
 public static void main5(String[] args)throws Exception
 {
  //beginning of the file
  long start=System.currentTimeMillis();
  System.out.println(Validater.isAllowed("ELIdjajPKsmeZ",new File("E:/Compressed.bin")));
  System.out.println(System.currentTimeMillis()-start);
  
  //middle of the file
  start=System.currentTimeMillis();
  System.out.println(Validater.isAllowed("kSIvmu",new File("E:/Compressed.bin")));
  System.out.println(System.currentTimeMillis()-start);
  
  //end of the file
  start=System.currentTimeMillis();
  System.out.println(Validater.isAllowed("CcFylRChZDyd",new File("E:/Compressed.bin")));
  System.out.println(System.currentTimeMillis()-start);
 }
}

Conclusion

Well this was very insightful . All the classes used here are all present in java 7 except streams which is java 8 which most android versions support so you should be ok on that end. If you cant be bothered to read my poorly documented explanations then you can go right ahead and copy the full code after the FullCode Section. This took pretty much the whole day. One down side is while before your compressed file was 11 MB the final compressed file in my system came up to 26.9 MB. for my randomly generated passwords but as you have seen from the outputs memory and speed is overall very good. You can play around with the settings to get your preferred sizes Also if you have a better approach for finding the dictionary please do edit this answer. Don't bother using SqLite.

You have a good day now :)

EDIT
Dictionary class is much faster and improved. Hope it helps

浅沫记忆 2025-01-21 01:15:00

如果性能确实没有问题,那么应用二分搜索的排序数组又如何呢?这占用了最接近数据原始大小的内存空间,并且一旦排序,对数组条目的访问相对较快。

final String [] passwords = … // Array is already sorted!
final var index = Arrays.binarySearch( passwords, password );
if( index < 0 )
{
  out.printf( "Password '%1$s' is discouraged!%n", password );
}
else
{
  out.printf( "Password '%1$s' seems to be ok!%n", password );
} 

承认,更新密码列表将是一场噩梦,但我知道这不是你的问题。

当然,这并不是您关于如何优化 SQLite 数据库问题的答案。

If performance is really no problem, what about a sorted array where you apply a binary search on? This is taking memory space closest to the raw size of the data, and once sorted, the access to the entries of the array is relatively fast.

final String [] passwords = … // Array is already sorted!
final var index = Arrays.binarySearch( passwords, password );
if( index < 0 )
{
  out.printf( "Password '%1$s' is discouraged!%n", password );
}
else
{
  out.printf( "Password '%1$s' seems to be ok!%n", password );
} 

Confessed, updating the password list would be a nightmare, but I understood that this is not your issue here.

And of course, this is not the answer on your question on how to optimise your SQLite database.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文