What is the best way to count and sort a string array?

Posted on 2025-01-04 22:40:42

I am trying to find out whether there is a good way to search (count the number of occurrences) and then sort a String array efficiently, that is, in a way that will work well on embedded systems (32Mb).

Example: I have to count the number of times each of the characters A, B, C, etc. is used, and save that result for later sorting.

I can count using a public int count(String searchDomain, char searchValue) method, but each string should have a count for every letter of the alphabet, for instance:

"This is a test string"
A:1,B:0,C:0,D:0,E:1,I:3,F:0,...
"ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCC"
A:7,B:0,C:22,G:18

My sorting method needs to be able to answer things like: sort by the number of As and Bs, i.e. sort first by As and then sort each group of ties by Bs.

This is not for homework; it's for an application that needs to run on mobile phones. I need this to be efficient: my current implementation is too slow and uses too much memory.


Comments (8)

橘虞初梦 2025-01-11 22:40:42

I'd take advantage of Java's (very efficient) built in sorting capabilities. To start with, define a simple class to contain your string and its metadata:

class Item
{
    // Your string. It's public, so you can get it if you want,
    // but also final, so you can't accidentally change it.
    public final String string;

    // An array of counts, where the offset is the alphabetical position
    // of the letter it's counting. (A = 0, B = 1, C = 2...)
    private final short[] instanceCounts = new short[32];

    public Item(String string)
    {
        this.string = string;
        for (int i = 0; i < string.length(); i++)
        {
            // Upper-case the character so 'a' and 'A' share a slot, and
            // skip anything outside A-Z (spaces, digits, punctuation)
            // instead of indexing outside the array.
            int index = Character.toUpperCase(string.charAt(i)) - 'A';
            if (index >= 0 && index < 26)
            {
                instanceCounts[index]++;
            }
        }
    }

    public int getCount(char c)
    {
        int index = Character.toUpperCase(c) - 'A';
        return (index >= 0 && index < 26) ? instanceCounts[index] : 0;
    }
}

This will hold your String (for searching and display), and set up an array of shorts with the count of the matching characters. (If you're really low on memory and you know your strings never have more than 127 of any one character, you can even change this to an array of bytes; Java bytes are signed, so counting up to 255 means masking with & 0xFF when you read the value back.) A short is only 16 bits (2 bytes), so the array data takes just 64 bytes all together, plus a small fixed array header, regardless of how long or complex your string is. If you'd rather pay the performance hit of calculating the counts every time, you can get rid of the array and rewrite the getCount() method, but you'll probably end up trading a one-off allocation for frequently garbage-collected memory, which is a big performance hit. :)

Now, define the rule you want to sort on using a Comparator. For example, to sort by the number of A's in your string:

import java.util.Comparator;

class CompareByNumberOfA implements Comparator<Item>
{
    // Descending order: the string with the most A's comes first.
    public int compare(Item arg0, Item arg1)
    {
        return Integer.compare(arg1.getCount('A'), arg0.getCount('A'));
    }
}

Finally, stick all of your items in an array, and use the built in (and highly memory efficient) Arrays methods to sort. For example:

import java.util.Arrays;

class Main
{
    public static void main(String[] args)
    {
        Item[] items = new Item[5];
        items[0] = new Item("ABC");
        items[1] = new Item("ABCAA");
        items[2] = new Item("ABCAAC");
        items[3] = new Item("ABCAAA");
        items[4] = new Item("ABBABZ");

        // THIS IS THE IMPORTANT PART!
        Arrays.sort(items, new CompareByNumberOfA());

        // Print the strings in their new order, most A's first.
        for (Item item : items)
        {
            System.out.println(item.string);
        }
    }
}

You can define a whole bunch of comparators, and use them how you like.
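
For the "sort first by As, then by Bs" case from the question, one comparator can take the letters in priority order instead of writing one class per letter. This is only a sketch: CountedString and LetterCountComparator are hypothetical names, CountedString is a self-contained stand-in for the Item class, and it assumes only the letters A-Z matter.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical stand-in for Item: a string plus its A-Z letter counts.
final class CountedString {
    final String text;
    private final short[] counts = new short[26];

    CountedString(String text) {
        this.text = text;
        for (int i = 0; i < text.length(); i++) {
            char c = Character.toUpperCase(text.charAt(i));
            if (c >= 'A' && c <= 'Z') {
                counts[c - 'A']++;
            }
        }
    }

    // Assumes letter is in A-Z or a-z.
    int count(char letter) {
        return counts[Character.toUpperCase(letter) - 'A'];
    }
}

// Compares by the count of each letter in turn; later letters only break ties.
final class LetterCountComparator implements Comparator<CountedString> {
    private final char[] letters;

    LetterCountComparator(String letters) {
        this.letters = letters.toCharArray();
    }

    @Override
    public int compare(CountedString a, CountedString b) {
        for (char letter : letters) {
            // Descending: the higher count sorts first.
            int byCount = Integer.compare(b.count(letter), a.count(letter));
            if (byCount != 0) {
                return byCount;
            }
        }
        return 0;
    }
}
```

Calling Arrays.sort(items, new LetterCountComparator("AB")) on an array of CountedString orders it by the number of As and uses the number of Bs only among strings with the same number of As.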

One of the things to remember about coding with Java is not to get too clever. Compilers do a damn fine job of optimizing for their platform, as long as you take advantage of things they can optimize (like built-in APIs including Arrays.sort).

Often, if you try to get too clever, you'll just optimize yourself right out of an efficient solution. :)

痴梦一场 2025-01-11 22:40:42

I believe that what you're after is a tree structure, and that in fact the question would be better rewritten talking about a tree structure to index a long continuous string rather than "count" or "sort".

I'm not sure if this is a solution or a restatement of the question. Do you want a data-structure which is a tree, where the root has e.g. 26 sub-trees, one for strings starting with 'A', the next child for 'B', and so on; then the 'A' child has e.g. 20 children representing "AB", "AC", "AT" etc.; and so on down to children representing e.g. "ABALXYZQ", where each child contains an integer field representing the count, i.e. the number of times that sub-string occurs?

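import java.util.List;

class AdamTree {
    char ch;                  // the character this node stands for
    List<AdamTree> children;  // one child per character that can follow
    int count;                // how often the sub-string ending here occurs
}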

If this uses too much memory then you'd be looking at ways of trading off memory for CPU time, but that might be difficult to do...nothing comes to mind.
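
A rough sketch of how such a tree could be built and queried, in case it helps to see it. SubstringTrie, build and occurrences are hypothetical names, and capping the indexed sub-string length with maxLen is my assumption to keep the tree from growing without bound; it is not something the description above requires.

```java
import java.util.HashMap;
import java.util.Map;

// Each node counts how often the sub-string spelled by the path from the root occurs.
final class SubstringTrie {
    private final Map<Character, SubstringTrie> children = new HashMap<>();
    private int count;

    // Indexes every sub-string of text whose length is at most maxLen.
    static SubstringTrie build(String text, int maxLen) {
        SubstringTrie root = new SubstringTrie();
        for (int start = 0; start < text.length(); start++) {
            SubstringTrie node = root;
            int end = Math.min(text.length(), start + maxLen);
            for (int i = start; i < end; i++) {
                node = node.children.computeIfAbsent(text.charAt(i), k -> new SubstringTrie());
                node.count++;
            }
        }
        return root;
    }

    // Returns 0 for sub-strings that never occur or are longer than maxLen.
    int occurrences(String s) {
        SubstringTrie node = this;
        for (int i = 0; i < s.length(); i++) {
            node = node.children.get(s.charAt(i));
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }
}
```

A HashMap per node is convenient but heavy; on a memory-constrained device a fixed 26-slot child array, or a sorted array of children, would likely be cheaper.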

依 靠 2025-01-11 22:40:42

Sorry I don't have time to write this up in a better way. To minimize space, I would make two m x n (dense) arrays, one of bytes and one of shorts, where:

  • m is the number of input strings
  • n is the number of characters in each string; this dimension varies from row to row
  • the byte array contains the character
  • the short array contains the count for that character

If counts are guaranteed < 256, you could just use one m x n x 2 byte array.

If the set of characters you are using is dense, i.e., the set of ALL characters used in ANY string is not much larger than the set of characters used in EACH string, you could get rid of the byte array and just use a fixed "n" (above) with a function that maps from character to index. This would be much faster.

This would require 2Q traversals of this array for any query with Q clauses. Hopefully this will be fast enough.
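
Here is a minimal sketch of the fixed-n variant, assuming the alphabet is just A-Z so that n = 26. CountTable, indexOf and order are hypothetical names. Note that order() uses a plain comparison sort over row numbers, not the 2Q-traversal scheme mentioned above, so treat it as one possible way to query the table.

```java
import java.util.Arrays;

// One flat short[] holding an m x 26 table of letter counts, row-major.
final class CountTable {
    private static final int N = 26;
    private final short[] counts;

    CountTable(String[] strings) {
        counts = new short[strings.length * N];
        for (int row = 0; row < strings.length; row++) {
            String s = strings[row];
            for (int i = 0; i < s.length(); i++) {
                int col = indexOf(s.charAt(i));
                if (col >= 0) {
                    counts[row * N + col]++;
                }
            }
        }
    }

    // The character-to-index function; returns -1 for anything outside A-Z / a-z.
    static int indexOf(char c) {
        if (c >= 'A' && c <= 'Z') return c - 'A';
        if (c >= 'a' && c <= 'z') return c - 'a';
        return -1;
    }

    // Assumes letter is in A-Z or a-z.
    int count(int row, char letter) {
        return counts[row * N + indexOf(letter)];
    }

    // Row numbers ordered by descending count of each letter in turn.
    Integer[] order(String letters) {
        Integer[] rows = new Integer[counts.length / N];
        for (int i = 0; i < rows.length; i++) {
            rows[i] = i;
        }
        Arrays.sort(rows, (a, b) -> {
            for (int k = 0; k < letters.length(); k++) {
                char letter = letters.charAt(k);
                int cmp = Integer.compare(count(b, letter), count(a, letter));
                if (cmp != 0) {
                    return cmp;
                }
            }
            return 0;
        });
        return rows;
    }
}
```

The table itself costs 52 bytes per string. The Integer[] of row numbers is temporary, and the original strings are never moved or copied.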

哽咽笑 2025-01-11 22:40:42

I can assist with php/pseudo code and hashmaps or associative arrays.

$hash = array();  // maps each character to its number of occurrences

$string = "ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCC";
foreach (str_split($string) as $char) {
    if (isset($hash[$char])) {
        $hash[$char] = $hash[$char] + 1;
    } else {
        $hash[$char] = 1;
    }
}

At the end you'll have an associative array with one key per character found, and the value stored under each key is the count of its occurrences.

The snippet is PHP, but the principle applies in any language that has hash maps.
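
Since the question is about Java, the same idea might look roughly like this there. CharCounter and countChars are hypothetical names, and a TreeMap is used only so the keys come back in alphabetical order.

```java
import java.util.Map;
import java.util.TreeMap;

final class CharCounter {
    // Returns a map with one entry per distinct character in s.
    static Map<Character, Integer> countChars(String s) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (int i = 0; i < s.length(); i++) {
            // Insert 1 for a new character, otherwise add 1 to the stored count.
            counts.merge(s.charAt(i), 1, Integer::sum);
        }
        return counts;
    }
}
```

Boxed Character and Integer objects cost a lot more memory than a plain array of counters, so on a small device this is probably better suited to large or unknown alphabets than to 26 letters.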

夜访吸血鬼 2025-01-11 22:40:42

http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
Have a look at the KMP algorithm. This is a rather common programming problem. Above you will find one of the fastest solutions possible. Easy to understand and implement.

Count the occurrences with KMP, then either go with a merge sort after insertion, or, if you know that the array/etc. is already sorted, go with binary search and direct insertion.
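
For reference, a compact sketch of counting (possibly overlapping) occurrences of a pattern with KMP. Kmp, failureTable and countOccurrences are hypothetical names. For a single-character pattern a plain loop does the same job, so this mainly pays off for longer patterns.

```java
final class Kmp {
    // fail[i] = length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of it.
    static int[] failureTable(String pattern) {
        int[] fail = new int[pattern.length()];
        int k = 0;
        for (int i = 1; i < pattern.length(); i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1];
            }
            if (pattern.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    // Counts occurrences of pattern in text, including overlapping ones.
    static int countOccurrences(String text, String pattern) {
        if (pattern.isEmpty()) {
            return 0;
        }
        int[] fail = failureTable(pattern);
        int count = 0;
        int k = 0; // number of pattern characters currently matched
        for (int i = 0; i < text.length(); i++) {
            while (k > 0 && text.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1];
            }
            if (text.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            if (k == pattern.length()) {
                count++;
                k = fail[k - 1]; // keep going so overlapping matches are found
            }
        }
        return count;
    }
}
```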

层林尽染 2025-01-11 22:40:42

Maybe you could use a kind of tree structure, where the depth corresponds to a given letter. Each node in the tree thus corresponds to a letter + a count of occurrences of that letter. If only one string matches this node (and its parent nodes), then it is stored in the node. Otherwise, the node has child nodes for the next letters and the letter count.

This would thus give something like this:

A:     0                  1                   3           ...
       |               /     \              /    \
B:     0             0        1           1        3
      / \          heaven   /   \     barracuda    ababab
C:   0   1                 0     1
   foo   cow             bar     bac

Not sure this would cost less than the array count solution but at least you wouldn't have to store the count for all letters for all strings (the tree stops when the letter count uniquely identifies a string)

You could probably optimize it by cutting long branches without siblings
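
A simplified sketch of this tree, assuming the letters to branch on are given up front. CountTreeNode, insert and collect are hypothetical names. It deliberately leaves out the "stop as soon as one string is left" optimization, so every string travels the full depth; adding that back is where the memory saving would come from.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Level d of the tree branches on the count of the d-th letter.
final class CountTreeNode {
    private final TreeMap<Integer, CountTreeNode> children = new TreeMap<>();
    private final List<String> strings = new ArrayList<>(); // filled at the last level only

    private static int count(String s, char letter) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == letter) {
                n++;
            }
        }
        return n;
    }

    // Call on the root with depth 0; letters lists the letters in priority order.
    void insert(String s, String letters, int depth) {
        if (depth == letters.length()) {
            strings.add(s);
            return;
        }
        int n = count(s, letters.charAt(depth));
        children.computeIfAbsent(n, k -> new CountTreeNode()).insert(s, letters, depth + 1);
    }

    // In-order walk: ascending by the first letter's count, then the second, and so on.
    void collect(List<String> out) {
        out.addAll(strings);
        for (CountTreeNode child : children.values()) {
            child.collect(out);
        }
    }
}
```

Inserting the strings from the diagram with letters "abc" and then calling collect on the root returns them in the diagram's left-to-right order.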

幸福还没到 2025-01-11 22:40:42

You could try the Java code below:

int[] data = new int[256]; // one counter per character code 0-255

void processData(String mString) {
    for (int i = 0; i < mString.length(); i++) {
        char c = mString.charAt(i);
        if (c < data.length) { // skip characters outside the table
            data[c]++;
        }
    }
}

int getCountOfChar(char c) {
    return c < data.length ? data[c] : 0;
}

二智少女 2025-01-11 22:40:42

It seems there's some confusion on what your requirements and goals are.

If your search results take up too much space, why not "lossily compress" (like music compression) the results? Kind of like a hash function. Then, when you need to retrieve results, your hash points to a much smaller subset of strings that need to be searched properly with a slower, more thorough search algorithm.

If you actually store the String objects, and your strings are actually human-readable text, you could try deflating them with java.util.zip after you're done searching, indexing and all that. If you really want to keep them tiny and you don't receive actual String objects, and you said you only have 26 different letters, you can pack each letter into 5 bits and store them like that. Use the CharSequence interface for this.
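
A sketch of what the 5-bit idea could look like, assuming the input contains only upper-case A-Z. PackedLetters is a hypothetical name, and packing 12 letters into each 64-bit long (60 of the 64 bits used) is just one layout choice among several.

```java
// Stores upper-case letters at 5 bits each instead of 16 bits per char.
final class PackedLetters implements CharSequence {
    private static final int PER_WORD = 12; // 12 letters * 5 bits = 60 bits per long

    private final long[] words;
    private final int length;

    PackedLetters(CharSequence letters) {
        length = letters.length();
        words = new long[(length + PER_WORD - 1) / PER_WORD];
        for (int i = 0; i < length; i++) {
            char c = letters.charAt(i);
            if (c < 'A' || c > 'Z') {
                throw new IllegalArgumentException("only A-Z supported: " + c);
            }
            words[i / PER_WORD] |= (long) (c - 'A') << (5 * (i % PER_WORD));
        }
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        long code = (words[index / PER_WORD] >>> (5 * (index % PER_WORD))) & 0x1F;
        return (char) ('A' + code);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        // Simple but allocating; fine for a sketch.
        return toString().substring(start, end);
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(charAt(i));
        }
        return sb.toString();
    }
}
```

Because it implements CharSequence, counting code that only calls length() and charAt() can run on the packed form directly without unpacking to a String first.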
