以数学方式查找数字子字符串，无需字符串比较

发布于 2024-07-07 10:13:16 字数 1340 浏览 7 评论 0原文

这原本是我在工作中遇到的问题，但现在我只是出于自己的好奇心而试图解决这个问题。

我想以最有效的方式找出 int 'a' 是否包含 int 'b' 。我写了一些代码，但似乎无论我写什么，将其解析为字符串然后使用indexOf比用数学方法快两倍。

内存不是问题（在合理范围内），只是纯粹的处理速度。

这是我编写的数学方法代码：

private static int[] exponents = {10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000 };

private static boolean findMatch(int a, int b) {
    if (b > a) return false;

    if (a == b) return true;

    int needleLength = getLength(b);

    int exponent = exponents[needleLength];
    int subNum;
    while (a >= 1) {
        subNum = a % exponent;

        if (subNum == b)
            return true;

        a /= 10;
    }
    return false;
}

private static int getLength(int b) {

    int len = 0;

    while (b >= 1) {
        len++;
        b /= 10;
    }

    return len;
}

这是我正在使用的字符串方法，它似乎胜过了上面的数学方法：

private static boolean findStringMatch(int a, int b) {      
    return String.valueOf(a).indexOf(String.valueOf(b)) != -1;      
}

因此，虽然这并不是我完成工作所必需的，但我只是想知道是否任何人都可以想出任何方法来进一步优化我的数学方法，或者完全是一种全新的方法。再说一遍，内存不是问题，我只是追求纯粹的速度。

我真的很想看到或听到任何人对此提供的任何信息。

编辑：当我说包含时，我的意思是可以在任何地方，例如， findMatch(1234, 23) == true

编辑： 对于每个说这个废话不可读并且不必要：你没有抓住重点。重点是要研究一个有趣的问题，而不是想出一个在生产代码中使用的答案。

原文

This originally was a problem I ran into at work, but is now something I'm just trying to solve for my own curiosity.

I want to find out if int 'a' contains the int 'b' in the most efficient way possible. I wrote some code, but it seems no matter what I write, parsing it into a string and then using indexOf is twice as fast as doing it mathematically.

Memory is not an issue (within reason), just sheer processing speed.

This is the code I have written to do it mathematically:

private static int[] exponents = {10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000 };

private static boolean findMatch(int a, int b) {
    if (b > a) return false;

    if (a == b) return true;

    int needleLength = getLength(b);

    int exponent = exponents[needleLength];
    int subNum;
    while (a >= 1) {
        subNum = a % exponent;

        if (subNum == b)
            return true;

        a /= 10;
    }
    return false;
}

private static int getLength(int b) {

    int len = 0;

    while (b >= 1) {
        len++;
        b /= 10;
    }

    return len;
}

Here's the string method I'm using, which seems to trump the mathematical method above:

private static boolean findStringMatch(int a, int b) {      
    return String.valueOf(a).indexOf(String.valueOf(b)) != -1;      
}

So although this isn't really required for me to complete my work, I was just wondering if anyone could think of any way to further optimize my way of doing it mathematically, or an entirely new approach altogether. Again memory is no problem, I am just shooting for sheer speed.

I'm really interested to see or hear anything anyone has to offer on this.

EDIT: When I say contains I mean can be anywhere, so for example, findMatch(1234, 23) == true

EDIT: For everyone saying that this crap is unreadable and unnecessary: you're missing the point. The point was to get to geek out on an interesting problem, not come up with an answer to be used in production code.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

雨夜星沙 2024-07-14 10:13:16

它应该更快的字符串方式，因为你的问题是文本的，而不是数学的。请注意，您的“包含”关系没有说明数字，它只说明了它们的十进制表示形式。

还要注意，您想要编写的函数将是不可读的 - 其他开发人员永远不会理解您在做什么。（看看你在这里遇到了什么麻烦。）另一方面，字符串版本非常清晰。

回复收藏 0 原文

南笙 2024-07-14 10:13:16

这是沿着 Kibbee 的路线，但在他发帖之前我对此有点感兴趣并解决了这个问题：

long mask ( long n ) { 
    long m   = n % 10;
    long n_d = n;
    long div = 10;
    int  shl = 0;
    while ( n_d >= 10 ) { 
        n_d /= 10;
        long t = n_d % 10;
        m |= ( t << ( shl += 4 ));
    }
    return m;
}

boolean findMatch( int a, int b ) { 
    if ( b < a  ) return false;
    if ( a == b ) return true;

    long m_a = mask( a );    // set up mask O(n)
    long m_b = mask( b );    // set up mask O(m)

    while ( m_a < m_b ) {
        if (( m_a & m_b ) == m_a ) return true;
        m_a <<= 4;  // shift - fast!
        if ( m_a == m_b ) return true;
    }  // O(p)
    return false;
}       

void testContains( int a, int b ) { 
    print( "findMatch( " + a + ", " + b + " )=" + findMatch( a, b ));
}

testContains( 12, 120 );
testContains( 12, 125 );
testContains( 123, 551241238 );
testContains( 131, 1214124 );
testContains( 131, 1314124 );

由于 300 个字符太少而无法进行争论，我正在编辑这篇主要帖子来回应 Pyrolistical。

与 OP 不同，我对本机编译的 indexOf 比带有原语的 Java 代码更快并不感到惊讶。所以我的目标不是找到比 Java 代码中调用无数次的本机方法更快的东西。

OP 明确表示这不是生产问题，更多的是出于闲置的好奇心，所以我的回答解决了这种好奇心。我的猜测是，当他试图在生产中解决这个问题时，速度是一个问题，但出于一种闲置的好奇心，“这种方法将被调用数百万次”不再适用。正如他必须向一位发帖人解释的那样，它不再被视为生产代码，因此复杂性不再重要。

另外，它提供了页面上唯一能够在“551241238”中找到“123”的实现，因此除非正确性是一个无关紧要的问题，否则它会提供这一点。此外，“使用 Java 原语以数学方式解决问题但击败优化的本机代码的算法”的解决方案空间可能是空。

另外，从您的评论中不清楚您是否将苹果与苹果进行比较。功能规范为 f( int, int )-> 布尔值，而不是 f( String, String )-> boolean （这是 indexOf 的域）。因此，除非您测试了类似的东西（它仍然可以击败我的，我不会感到非常惊讶。）额外的开销可能会消耗掉多余的 40% 的一部分。

boolean findMatch( int a, int b ) { 
    String s_a = "" + a;
    String s_b = "" + b;
    return s_a.indexOf( s_b ) > -1;
}

它执行相同的基本步骤。 log₁₀( a ) 编码 + log₁₀( b ) 编码 + 实际找到匹配项，这也是 O(n) 其中 < em>n 是最大对数。

This is along Kibbee's line, but I got a little intrigued by this before he posted and worked this out:

long mask ( long n ) { 
    long m   = n % 10;
    long n_d = n;
    long div = 10;
    int  shl = 0;
    while ( n_d >= 10 ) { 
        n_d /= 10;
        long t = n_d % 10;
        m |= ( t << ( shl += 4 ));
    }
    return m;
}

boolean findMatch( int a, int b ) { 
    if ( b < a  ) return false;
    if ( a == b ) return true;

    long m_a = mask( a );    // set up mask O(n)
    long m_b = mask( b );    // set up mask O(m)

    while ( m_a < m_b ) {
        if (( m_a & m_b ) == m_a ) return true;
        m_a <<= 4;  // shift - fast!
        if ( m_a == m_b ) return true;
    }  // O(p)
    return false;
}       

void testContains( int a, int b ) { 
    print( "findMatch( " + a + ", " + b + " )=" + findMatch( a, b ));
}

testContains( 12, 120 );
testContains( 12, 125 );
testContains( 123, 551241238 );
testContains( 131, 1214124 );
testContains( 131, 1314124 );

Since 300 characters is far too little to make an argument in, I'm editing this main post to respond to Pyrolistical.

Unlike the OP, I wasn't that surprised that a native compiled indexOf was faster than Java code with primitives. So my goal was not to find something I thought was faster than a native method called zillions of times all over Java code.

The OP made it clear that this was not a production problem and more along the lines of an idle curiosity, so my answer solves that curiosity. My guess was that speed was an issue, when he was trying to solve it in production, but as an idle curiosity, "This method will be called millions and millions of times" no longer applies. As he had to explain to one poster, it's no longer pursued as production code, so the complexity no longer matters.

Plus it provides the only implementation on the page that manages to find the "123" in "551241238", so unless correctness is an extraneous concern, it provides that. Also the solution space of "an algorithm that solves the problem mathematically using Java primitives but beats optimized native code" might be EMPTY.

Plus, it's not clear from your comment whether or not you compared apples to apples. The functional spec is f( int, int )-> boolean, not f( String, String )-> boolean (which is kind of the domain of indexOf) . So unless you tested something like this (which could still beat mine, and I wouldn't be awfully surprised.) the additional overhead might eat up some of that excess 40%.

boolean findMatch( int a, int b ) { 
    String s_a = "" + a;
    String s_b = "" + b;
    return s_a.indexOf( s_b ) > -1;
}

It does the same basic steps. log₁₀( a ) encoding + log₁₀( b ) encoding + actually finding the match, which is as well O(n) where n is the largest logarithm.

回复收藏 0 原文

暮倦 2024-07-14 10:13:16

我能想到的唯一优化是自己进行字符串转换，并在转换时比较数字（从右到左）。首先转换b的所有数字，然后从a的右侧开始转换，直到找到b的第一个数字（从右侧）匹配。进行比较，直到所有 b 都匹配或者出现不匹配。如果遇到不匹配，请回溯到开始匹配 b 的第一个数字的点，前进到 a 并重新开始。

IndexOf 必须执行基本相同的回溯算法，除了从左侧开始。根据实际数字，这可能会更快。我认为如果数字是随机的，那应该是随机的，因为应该有很多时候不需要转换所有的 a。

回复收藏 0 原文

司马昭之心 2024-07-14 10:13:16

看起来你的函数实际上做得很好，但有一个小小的改进：

private static boolean findMatch(int a, int b) {
        if (b > a) return false;

        if (a == b) return true;

        int needleLength = getLength(b);

        int exponent = exponents[needleLength];
        int subNum;
        while (a > b) {
                subNum = a % exponent;

                if (subNum == b)
                        return true;

                a /= 10;
        }
        return false;
}

仅仅因为一旦 a 小于 b，就不值得继续寻找，不是吗？
祝你好运，如果你找到解决方案，请发布！

Looks like your function is actually doing pretty well, but an small improvement:

private static boolean findMatch(int a, int b) {
        if (b > a) return false;

        if (a == b) return true;

        int needleLength = getLength(b);

        int exponent = exponents[needleLength];
        int subNum;
        while (a > b) {
                subNum = a % exponent;

                if (subNum == b)
                        return true;

                a /= 10;
        }
        return false;
}

Just because once that a is smaller than b, is not worthy keeps looking, isnt it?
Good luck and post if you find the solution!

回复收藏 0 原文

拔了角的鹿 2024-07-14 10:13:16

这是一个有趣的问题。 String.class 的许多函数实际上是本机的，这使得击败 String 成为一个困难的命题。但这里有一些帮助：

提示 1：不同的简单整数运算具有不同的速度。

通过示例程序中的快速计算表明：

% ~ T
* ~ 4T
/ ~ 7T

因此，您希望使用尽可能少的除法，以支持乘法或取模。未显示减法、加法和比较运算符，因为它们将所有这些都从水中剔除。另外，尽可能使用“final”可以让 JVM 进行某些优化。加快“getLength”函数的速度：

private static int getLength(final int b) {        
   int len = 0;
   while (b > exponents[len]) {
       len++;
   }
   return len + 1
}

该函数大约提高了 7 倍。如果 b > ，您会得到一个 indexOutOfBounds 异常。你的最大指数。为了解决这个问题，你可以这样做：

private static int getLength(final int b) {        
   int len = 0;
   final int maxLen = exponents.length;
   while (len < maxLen && b > exponents[len]) {
       len++;
   }
   return len + 1;
}

这会稍微慢一些，并且如果 b 太大的话，会给你一个不正确的长度，但它不会抛出异常。

提示 2： 不必要的对象/基元创建和方法调用会增加运行时间。

我猜测“getLength”在其他任何地方都没有被调用，所以虽然拥有一个单独的函数可能会很好，但从优化的角度来看，它是不必要的方法调用和对象“len”的创建。我们可以将该代码放在我们使用它的地方。

private static boolean findMatch(int a, final int b) {
        if (b > a) return false;
        if (a == b) return true;
        int needleLength = 0;
        while (b > exponents[len]) {
            needleLength ++;
        }
        needleLength++;

        final int exponent = exponents[needleLength];
        int subNum;
        while (a >= 1 && a <= b) {
                subNum = a % exponent;
                if (subNum == b)
                        return true;
                a /= 10;
        }
        return false;
}

另外，请注意我更改了底部 while 循环以包含“a <= b”。我还没有对此进行测试，并且不确定每次迭代的惩罚是否胜过您不浪费任何迭代的事实。我确信有一种方法可以使用巧妙的数学来消除除法，但我现在想不到。

This is an interesting problem. Many of String.class's functions are actually native making beating String a difficult proposition. But here's some helpers:

TIP 1: Different simple integer operations have different speeds.

By quick calculations in sample programs showed:

% ~ T
* ~ 4T
/ ~ 7T

So you want to use as little division as possible in favor of multiplication or modulo. Not shown are subtraction, addition, and comparison operators cause they blow all of these out of the water. Also, using "final" as much as possible allows the JVM to do certain optimizations. Speeding up you "getLength" function:

private static int getLength(final int b) {        
   int len = 0;
   while (b > exponents[len]) {
       len++;
   }
   return len + 1
}

That gives about a 7x improvement in the function. You get an indexOutOfBounds exception if b > your max in exponents. To solve that, you can have:

private static int getLength(final int b) {        
   int len = 0;
   final int maxLen = exponents.length;
   while (len < maxLen && b > exponents[len]) {
       len++;
   }
   return len + 1;
}

That's slightly slower and gives you an incorrect length if b is too big, but it doesn't throw an exception.

TIP 2: Unnecessary object/primitive creation and method calls add to run time.

I'm guessing that "getLength" isn't called anywhere else, so while it might be nice to have a separate function, from a optimization standpoint its an unnecessary method call and creation of the object "len". We can put that code right where we use it.

private static boolean findMatch(int a, final int b) {
        if (b > a) return false;
        if (a == b) return true;
        int needleLength = 0;
        while (b > exponents[len]) {
            needleLength ++;
        }
        needleLength++;

        final int exponent = exponents[needleLength];
        int subNum;
        while (a >= 1 && a <= b) {
                subNum = a % exponent;
                if (subNum == b)
                        return true;
                a /= 10;
        }
        return false;
}

Also, note I changed the bottom while loop to also include "a <= b". I haven't tested that and not sure if the per-iteration penalty beats the fact that you don't waste any iterations. I'm sure there's a way to get rid of the division using clever math, but I can't think of it right now.

回复收藏 0 原文

那请放手 2024-07-14 10:13:16

嗯，我可能完全误解了这个问题，但是......

// Check if A is inside B lol
bool Contains (int a, int b)
{
    return (a <= b);
}

除非你想知道一个特定的数字序列是否在另一个数字序列中。

在这种情况下，将其转换为字符串会比进行数学计算更快。

Umm, I'm probably totally misunderstanding the question, but.....

// Check if A is inside B lol
bool Contains (int a, int b)
{
    return (a <= b);
}

Unless you want to know if a particular sequence of numbers is within another sequence of numbers.

In that case, converting it to a string WILL be faster than doing the math to figure it out.

回复收藏 0 原文

悲喜皆因你 2024-07-14 10:13:16

无论如何，这无论如何都不能回答您的问题，但无论如何，这都是建议:-)

方法名称 findMatch 的描述性不是很强。在本例中，我有一个静态方法 ContainerBuilder.number(int)，它返回一个 ContainerBuilder，其中包含方法 contains它。这样你的代码就变成了：

boolean b = number(12345).contains(234);

Juts some suggest for the long run!

哦，是的，我还想说，你应该定义“包含”的含义

This in no way answers your question, whatsoever, but it's advice anyway :-)

The method name findMatch is not very descriptive. In this case, I'd have a static method ContainerBuilder.number(int), which returned a ContainerBuilder, which has the method contains on it. In this way your code becomes:

boolean b = number(12345).contains(234);

Juts some advice for the long run!

Oh yes, I meant to say also, you should define what you mean by "contains"

回复收藏 0 原文

失退 2024-07-14 10:13:16

有什么办法可以用二进制计算这个吗？显然，包含另一个字符的二进制整数的整数的二进制值并不意味着十进制也有同样的作用。然而，是否有某种可以使用的二进制技巧？也许将像 12345 这样的数字转换为 0001 0010 0011 0100 0101，然后进行一些位移以确定其中是否包含 23 (0010 0011)。由于您的字符集只有 10 个字符，因此您可以通过在单个字节中存储 2 个字符值来减少计算时间。

编辑

稍微扩展一下这个想法。如果你有 2 个整数 A 和 B，并且想知道 A 是否包含 B，你首先检查 2 件事。如果A小于B，则A不能包含B。如果A = B，则A包含B。此时您可以将它们转换为字符串*。如果A包含与B相同数量的字符数，那么A不包含B，除非它们相等，但如果它们相等，我们就不会在这里，所以如果两个字符串长度相同，a不包含b 。此时，A 的长度将比 B 长。因此，现在您可以将字符串转换为其打包的二进制值，正如我在本文第一部分中指出的那样。将这些值存储在整数数组中。现在，您对数组中的整数值进行按位与操作，如果结果是 A，则 A 包含 B。现在，您将 B 的整数数组向左移动 4 位，然后再次进行比较。执行此操作，直到您开始从 B 左侧弹出位。

*上一段中的 * 意味着您可以跳过此步骤。可能有一种方法可以完全不使用字符串来做到这一点。您可能可以使用一些奇特的二进制技巧来获取我在第一段中讨论的打包二进制表示。应该有一些您可以使用的二进制技巧，或者一些快速数学，可以将整数转换为我之前讨论过的十进制值。

回复收藏 0 原文