Blog entries - Codeforces

#	User	Rating
1	tourist	3856
2	jiangly	3747
3	orzdevinwang	3706
4	jqdai0815	3682
5	ksun48	3591
6	gamegame	3477
7	Benq	3468
8	Radewoosh	3462
9	ecnerwala	3451
10	heuristica	3431

#	User	Contrib.
1	cry	167
2	-is-this-fft-	162
3	Dominater069	160
4	Um_nik	158
5	atcoder_official	156
6	Qingyu	153
7	djm03178	152
7	adamant	152
9	luogu_official	150
10	awoo	147

drdilyor's blog

LAZY solution to problem D in Codeton round 8 + PYTHON solution

By drdilyor, history, 11 months ago, In English

Heyo! If you tried to solve problem D you have probably ended up trying to optimize this algorithm:

C++ brute force

    // input
    int n, k;
    cin >> n >> k;
    vector arr(n, vector<int>(n));
    for (int i = 0; i < n;i++)
        for (int j = i; j < n; j++)
            cin >> arr[i][j];

    vector dp(n+2, vector<int>());
    dp[n] = dp[n+1] = {0};

    for (int i = n-1; i >= 0; i--) {
        vector<int> cur;
        for (int j = i-1; j < n; j++) {
            // prepare to merge by adding a[i][j]
            vector<int> tomerge = dp[j+2];
            if (j >= i) {
                for (int& x : tomerge)
                    x += arr[i][j];
            }

            vector<int> merged;
            merge(cur.begin(), cur.end(),
                  tomerge.begin(), tomerge.end(),
                  back_inserter(merged), greater());
            cur = merged;
        }
        dp[i] = cur;
    }

    // output top k values
    for (int i = 0; i < k && i < dp[0].size(); i++)
        cout << dp[0][i] << ' ';
    cout << '\n';

But of course this won't work! The number of values in dp[i] will grow exponentially. But we only need to output first K values of the final dp[0]. Can we do better?

We only need to store first K values for each dp[i]. So we can rewrite: dp[i] = cur; while (dp[i].size() > k) dp[i].pop_back();

This leaves us with O(n^2 k) solution, this time the bottleneck is the merging part. The OFFICIAL solution is to use priority queue to select first K values after merging, without going through all values. But we are not interested in that!

LAZY solution

If you have heard about Haskell, you might know that it's a lazy language, it only computes things when it absolutely needs them -- e.g. when you print them. Could it be the case that just switching our programming language to Haskell make the solution pass? (Spoiler: yes, kindof)

Skeleton of Haskell code

code

{-# LANGUAGE BangPatterns #-}
{-# OPTIONS_GHC -O3 -optc-O3 -funbox-strict-fields -funfolding-use-threshold=16 #-}

import Control.Monad
import qualified Data.ByteString.Char8 as B
import Data.Maybe

import Data.Array.Unboxed
import Data.Array.IO.Safe
import Data.Array.IO.Internals

readInt = fst . fromJust . B.readInt
readInts = map readInt . B.words
putInts xs = do
  forM_ xs $ \x -> do
    B.putStr $$$ B.pack $$$ show x
    B.putStr $ B.pack " "
  B.putStr $ B.pack "\n"

main = do
  tests <- readInt <$> B.getLine
  replicateM_ tests solve

solve = do
  [n, k] <- readInts <$> B.getLine
  arrm <- newArray ((0, 0), (n-1, n-1)) 0 :: IO (IOUArray (Int, Int) Int)
  forM_ [0..n-1] $ \i -> do
    arrList <- readInts <$> B.getLine
    forM_ (zip arrList [i..n-1]) $ \(x, j) -> do
      writeArray arrm (i, j) x

  arr <- unsafeFreezeIOUArray arrm
  -- to be continued

Note: Codeforces is sometimes showing single dollar sign as 3 dollar signs, I don't know how to prevent that.

The code above reads the input into 2D array arr such that we can access element as arr ! (i, j).

Now we need the merge function, it takes 2 sorted lists and returns merged one:

merge :: [Int] -> [Int] -> [Int]
merge (x : xs) (y : ys) =
  if x > y
    then x : merge xs (y : ys)
    else y : merge (x : xs) ys
merge [] xs = xs
merge xs [] = xs

Pretty simple if you know the syntax, but it's not very important.

Now to rewrite the core of the algorithm above (it's possible to do this without recursion but it's less readable):

  let
    calc dp (-1) = head dp
    calc dp i = calc (curdp : dp) (i-1)
      where
        prepareMerge j dpval =
          if j < i then dpval
          else map (+ (arr ! (i, j))) dpval -- add arr[i][j] to every element
        tofold = zipWith prepareMerge [(i-1)..] dp
        curdp = ...
  putInts $$$ take k $$$ calc [[0], [0]] (n-1)

The base case is when i=-1, we just return dp[0]. dp stores the values of dp[i+1..n+1] for each iteration, instead of storing the full dp array. This better fits recursive implementation.

Where clause declares variables: tofold stores all the lists that we want to merge. Now consider:

curdp = foldr merge [] tofold

foldr function applies merge to every element from right to left:

merge(tofold[0], merge(tofold[1], ... merge(tofold[m-1], [])))

Now our submission 254204408 confidently passes in 1300ms

How easy was that?

We are not even truncating dp[i] every iteration, we take first K values only at the end!

The catch

I have no idea how the above works. It should've worked in O(n*n*k) but it looks like working in O(n*n + k), it's either GHC magic or the tests are weak.

Anyways, foldr version leads to this computation tree:

Spoiler

To get the first element of the array, Haskell might have to go through each merge node to find the smallest element. So the worst case for merging NxK arrays is O(nk), but if we use divide and conquer to merge the arrays:

mergeAll [] = []
mergeAll [arr] = arr
mergeAll arrs = merge (mergeAll (take mid arrs)) (mergeAll (drop mid arrs))
  where mid = length arrs `div` 2

This leads to this computation tree:

Spoiler

This way haskell will only have to go compare and go down O(log n) merge nodes to find the next smallest element. Now our submission 254204435 passes in 2500ms.

Python solution

If you look at Accepted python submissions to problem D you will see there are only 5 (!!) submissions. This is because python doesn't have builtin priority queue and writing PQ in python is very slow.

Adapting the lazy solution to python is possible by using half baked python feature called generators.

This is the submission 254208539 (...yeahhh.. merge function is very long in python). The foldr version is TLEing for some reason, I really have no idea how that passes in Haskell. The divide and conquer version passes in 4100ms. Note that we are using lazy only to merge the lists. We are using normal lists when storing them in dp[i]. This is because generators can be only consumed once in python, while Haskell lists act as both lazy lists and generators, depends on your view.

Conclusion

This was one of the coolest things I've done with Haskell (including outside of CP) and I wanted to share it :)

Full text and comments »

+142

drdilyor
11 months ago
7

[C++] Use basic_string for graphs to speedup code and reduce memory usage

By drdilyor, history, 14 months ago, In English

TL;DR: Simply use vector<basic_string<int>> adj(n); instead of vector<vector<int>> adj(n); for graphs. Don't use it anywhere else.

Hey Codeforces! I learned from our template lord nor about basic_string. Contrary to vector, it has memory optimization. When you use a vector, it will allocate on heap and the structure of vector is similar to this:

struct vector {
    size_t size;
    size_t capacity;
    T*     data;
}
/*
64 bits - size
64 bits - capacity
64 bits - data
*/

(Not to mention, the overhead of malloc)

Which means if you are using vector<int> with size 1, it will be using at least 28 bytes and accessing it requires indirection to a completely different part of memory, which is bad for cache. On 32-bit systems, memory usage is lower since they are 32-bits each, but the cache unfriendliness still holds. Also, as nor mentioned, this isn't exactly how the struct vector is implemented in libstdc++.

basic_string has Short string optimization, where it uses 1 bit flag for short or long mode. In long mode it behaves like vector, but in short mode it has different memory layout. For example, basic_string<int> in short form is similar to this (Note that capacity isn't needed as there is no allocation):

/*
 1 bit  - the flag
 7 bits - size
32 bits - int

32 bits - int
32 bits - int

32 bits - int
32 bits - int

24 bits - unused
*/

(I'm simplifying and not including memory alignment)

So in short form, basic_string stores elements directly inside the struct when the size is small! With int, it stores up to 5 ints without allocating on heap, which makes DFS much more cache friendly and uses less memory.

How much would you ask?

Cycle detection, 1905B

Problem 1 vector<edge>       121 ms  25.2 Mib
Problem 1 basic_string<edge>  83 ms  24.3 Mib
Problem 2 vector<int>         62 ms   4.5  MB
Problem 2 basic_string<int>   62 ms   2.9  MB

edge is a struct of 2 ints, so it's bigger and makes the optimization little less useful.

(Problem 2 doesn't require storing adjacency list but I guessed it's a good enough benchmark)

Bear in mind that basic_string requires the element types to be "simple" types, don't use it for vectors, sets and maps, only for ints, simple structs of ints or pairs of these (pair is a bit nuanced but it works).

More about it here

Full text and comments »

drdilyor
14 months ago
1

Initializer list vs tuple with std::tie()

By drdilyor, history, 21 month(s) ago, In English

Take a look at this innocent piece of code:

    vector<vector<int>> arr{
        {1, 2, 3, 4},
        {5, 6, 7, 8},
        {9, 10, 11, 12},
        {13, 14, 15, 16},
    };
    int n = arr.size();

    for (int i = 0; i < n/2; i++) {
        for (int j = 0; j < n/2; j++) {
            tie(arr[i][j], arr[j][n-i-1], arr[n-i-1][n-j-1], arr[n-j-1][i]) =
                {arr[j][n-i-1], arr[n-i-1][n-j-1], arr[n-j-1][i], arr[i][j]};
            //tie(arr[i][j], arr[j][n-i-1], arr[n-i-1][n-j-1], arr[n-j-1][i]) =
            //    tuple{arr[j][n-i-1], arr[n-i-1][n-j-1], arr[n-j-1][i], arr[i][j]};
        }
    }
    for (auto& i : arr) {
        for (int j : i) 
            cout << j << ' ';
        cout << '\n';
    }

Outputs (with optimizations disabled):

which is totally gibberish.

Using tuple{} instead of initializer list ({}) for tie() gives the correct answer. This behavior is observed both on GCC and clang.

In fact, this is present with code as simple as this:

int a = 1, b = 2;
tie(a, b) = {b, a};

This once again shows one of the dark sides of C++. Hope no one spends 30 mins searching for such bug after this blog :):

Full text and comments »

c++

drdilyor
21 month(s) ago
6

Divide array into K segments with max sum (N,K <= 2e5)

By drdilyor, history, 2 years ago, In English

I haven't been able to solve this problem for a long time (10 months), so I would greatly appreciate anyone can solve it or can provide hints.

For $$$ N, K \le 2*10^3 $$$ subtask we can use a straightforward DP, but I have no idea for DP optimization or Greedy algorithm for $$$ N, K \le 2*10^5 $$$ full task.

Examples:

6 2
2 -1 3 -4 -2 6

Answer: segments [1, 3] and [6, 6] with sum 10. No need to output which segments are used, just the sum itself.

Full text and comments »

dp, problem

drdilyor
2 years ago
2