dalex's blog

By dalex, 4 years ago, In English

Recently one my friend got a TL verdict in some problem, and another my friend tried to help him to overcome it. Then I joined, got shocked by the results they got and we together stabilized the approach they came to.

By "deep recursion" we will mean a recursion with depth >= 100000. If it's less than 1000, this approach doesn't work at all, if it's between 1000 and 100000, most likely the speedup won't be so high so that will be decisive.

Let's consider this simple DFS function (it just counts the number of vertices):

private int dfs(int x, int p, List<Integer>[] g) {
    int result = 1;
    for (int y : g[x]) {
        if (y == p) {
            continue;
        }
        result += dfs(y, x, g);
    }
    return result;
}

generate a chain tree with 500000 vertices:

int n = 500000;
List<Integer>[] g = new List[n];
for (int i = 0; i < n; i++) {
    g[i] = new ArrayList<>();
}
for (int i = 1; i < n; i++) {
    int p = i - 1;
    g[i].add(p);
    g[p].add(i);
}

and run it:

long t1 = System.nanoTime();
int result = dfs(0, -1, g);
long t2 = System.nanoTime();
System.out.println("time = " + (t2 - t1) / 1.0e9 + " sec, result = " + result);

On my computer it runs 1.8 sec. Too slow for such a simple function, isn't it?

Let's speedup it. First, add two dummy parameters curDepth and maxDepth to DFS:

private int dfs(int x, int p, List<Integer>[] g, int curDepth, int maxDepth) {
    if (curDepth > maxDepth) {
        return 0;
    }
    int result = 1;
    for (int y : g[x]) {
        if (y == p) {
            continue;
        }
        result += dfs(y, x, g, curDepth + 1, maxDepth);
    }
    return result;
}

and then... OMG WTF

long t1 = System.nanoTime();
for (int i = 0; i < 100000; i++) {
    dfs(0, -1, g, 0, 0); // JIT please
}
int result = dfs(0, -1, g, 0, Integer.MAX_VALUE);
long t2 = System.nanoTime();
System.out.println("time = " + (t2 - t1) / 1.0e9 + " sec, result = " + result);

Now it works for 0.1 sec :)

When some method in Java is called too frequently, JIT optimizes it. It gets recompiled in runtime and on every new call the recompiled version will be called. However, the method cannot be recompiled while it is being executed. See this StackOverflow thread for more info. Maybe some Java expert can add something?

So in the first example, we enter into not optimized version of DFS and use it all the time. In the second example we do a lot of very short DFS calls, it for sure gets optimized, and finally we use optimized version to solve the actual problem.

Make sure the warming up does many calls but doesn't do many operations. In this example it's wise to do these fake DFS calls from a leaf and stop after the first recursive call.

Make sure all branches in your recursive function are actually run. It would be wrong to call dfs(0, -1, g, 0, -1) because the execution finishes quickly on if (curDepth > maxDepth), and the remaining code with iteration over adjacency list is not run and therefore is not optimized.

It doesn't work very well on Codeforces:

But it works just fine on Ideone:

and also works on:

  • Atcoder
  • CSES
  • Yandex.Contest in Java 8 compiler
  • ...

Maybe it's Windows/Linux or 32-bit/64-bit or OpenJDK/Oracle JDK differences, I haven't tested it yet.

Once more, it works only when recursion has depth >= 100000, and only on specific sites. But maybe it will help someone, and it is funny anyway.

  • Vote: I like it
  • +110
  • Vote: I do not like it

»
4 years ago, # |
  Vote: I like it 0 Vote: I do not like it

More then TLEs in recursion, I face StackOverflow exceptions, mostly with dp with memoization with ~1000000 calls, any suggestions for that?

»
4 years ago, # |
  Vote: I like it 0 Vote: I do not like it

Thank you so much for the blog! Really appreciate this information. Optimisations really interest me now and knowing that they can help speed up submissions on CSES is definitely motivating.

»
4 years ago, # |
  Vote: I like it +41 Vote: I do not like it

How to speed up deep recursion in Java — use C++

»
4 years ago, # |
  Vote: I like it +1 Vote: I do not like it

"only on specific sites"... that sucks especially since you have no guarantee it won't break or start working on specific sites

  • »
    »
    4 years ago, # ^ |
    Rev. 2   Vote: I like it +1 Vote: I do not like it

    You can just do a runtime check to see if it's openjdk or oracle jdk. Same problem exists with C++ code if you really wanted to make sure it can run across different compilers/platforms.

    It's kind of weird that cf runs openjdk though. I thought openjdk had a reputation of being less performant and you kind of need things to be fast in CP. (it happened to beat oracle in this case but it loses to oracle again once their jit kicks in).

    • »
      »
      »
      4 years ago, # ^ |
        Vote: I like it -10 Vote: I do not like it

      C++ has much less variability. A code written for gcc will normally work with clang, msvc, etc. I know because I needed to deal with that. A problem like this, with a code that's slow by design of the language, and with some compilers finding workarounds through complex background analysis in runtime, can't occur in C++.

»
4 years ago, # |
  Vote: I like it 0 Vote: I do not like it

Correct me if I'm wrong — but aren't you are running 10^5 identical calls of 1-depth recursion with this two pieces:

private int dfs(int x, int p, List<Integer>[] g, int curDepth, int maxDepth) {
    if (curDepth > maxDepth) {
        return 0;
    [loop] result += dfs(y, x, g, curDepth + 1, maxDepth);
for (int i = 0; i < 100000; i++) {
    dfs(0, -1, g, 0, 0);

This doesn't seem to solve deep-recursion problem. Answer seams right only because every node has the same weight as zero-node. Or should there be another code?

  • »
    »
    4 years ago, # ^ |
      Vote: I like it +17 Vote: I do not like it

    The code you quote is the trick that makes JIT optimize that function. (even though it doesn't actually do anything)

    Actual logic is int result = dfs(0, -1, g, 0, Integer.MAX_VALUE);

»
4 years ago, # |
Rev. 2   Vote: I like it 0 Vote: I do not like it

Interesting article thanks. I've also noticed significant speed improvements by unpacking graphs into primitive integer arrays (if possible avoiding boxed types at all costs)

List[] g -> int[][] should give some significant boost too

»
4 years ago, # |
Rev. 3   Vote: I like it 0 Vote: I do not like it

More tests showed that the depth limitation may not always be correcct.

You have to be careful when the first node in the graph has many edges (e.g. first node in the tree is connected to any other node). Then you will get a TLE since you're basically traverse whole tree every time, the whole traversing does not exceed depth 1.

So, instead of only depth limitation you have to add some other limitation for the iterations.

UPD: Nevermind, just realized that the author already suggested to call it from a leaf

»
2 years ago, # |
  Vote: I like it 0 Vote: I do not like it

Tree Coloring A simple DFS problem where we need to do a single DFS on 10^6 nodes.

I am still getting TLE, even after this optimization. I got AC with BFS But can this be solved using this trick?