#	User	Rating
1	tourist	3856
2	jiangly	3747
3	orzdevinwang	3706
4	jqdai0815	3682
5	ksun48	3591
6	gamegame	3477
7	Benq	3468
8	Radewoosh	3462
9	ecnerwala	3451
10	heuristica	3431

#	User	Contrib.
1	cry	167
2	-is-this-fft-	162
3	Dominater069	160
4	Um_nik	158
5	atcoder_official	157
6	Qingyu	156
7	adamant	151
7	djm03178	151
7	luogu_official	151
10	awoo	146

Problem: 1864F. Exotic Queries

The time limit is 4s. The input size is $$$n,q \le 10^6$$$. The standard solution uses a segment tree and the input/output size is very large. The sync of cin,cout with stdio is turned off in all implementations if cin,cout is used.

Cin,cout vs Fast IO, Scanf on C++20 (64)

220904724 Array implementation with cin,cout on C++20 (64) -> TLE
220952057 Array implementation with Fast IO on C++20 (64) -> 1762 ms
220950571 Array implementation with Scanf on C++20 (64) -> 3181 ms
220950753 Pointer implementation with Scanf on C++20 (64) -> 3868 ms
220947347 Pointer implementation with cin,cout on C++20 (64) -> 1872ms
220954316 Pointer implementation with Fast IO on C++20 (64) -> 2277ms

C++17 (32) vs C++20 (64)

220904724 Array implementation with cin,cout on C++20 (64) -> TLE
220904724 Array implementation with cin,cout on C++17 (32) -> 1621ms
220947347 Pointer implementation with cin,cout on C++20 (64) -> 1872ms
220953695 Pointer implementation with cin,cout on C++17 (32) -> 1934ms

Pointer implementation vs Array implementation

220904724 Array implementation with cin,cout on C++20 (64) -> TLE
220947347 Pointer implementation with cin,cout on C++20 (64) -> 1872ms

Some guessed conclusions

These facts makes no sense to me. Based on the facts, I guess

It's better to use Pointer implementation if you use cin,cout on C++20 (64)
If you prefer array implementation, use Fast IO or scanf on C++20 (64)
The 32/64 bit compiler has some influence on cin,cout performance

Do you guys have some idea why such cases happen? Or did I implement something improper so that it reduced the efficiency?

UPD: Reason of TLE

Thanks areke for pointing out the reason, and related comment and blog.

The thing is, when using cin in a 64-bit compiler, do cin in a separate loop.

Specifically, do

vector<int> vec(n+1);
for (int i = 1;i <= n; ++i) {
    cin >> vec[i];
}
for (int i = 1;i <= n; ++i) {
    pos[vec[i]].push_back(i);
}

instead of

for (int i = 1;i <= n; ++i) {
    int x; cin >> x;
    pos[x].push_back(i);
}

This modification speeds up the code from TLE to 1216ms !!! 221383761

define inf 1e18

template using ordered_set = tree<T, null_type, less, rb_tree_tag, tree_order_statistics_node_update> ; template<class key, class value, class cmp = std::less> using ordered_map = tree<key, value, cmp, rb_tree_tag, tree_order_statistics_node_update>; // x.find_by_order(1) -> return pointer to element present at index 1 // x.order_of_key(2) -> return index of element 2

endif

template void _print(T &x) {cerr << x;}

template <class T, class V> void _print(pair <T, V> p); template void _print(vector v); template void _print(set v); template <class T, class V> void _print(map <T, V> v); template void _print(multiset v); template <class T, class V> void _print(pair <T, V> p) {cerr << "{"; _print(p.ff); cerr << ","; _print(p.ss); cerr << "}";} template void _print(vector v) {cerr << "[ "; for (T i : v) {_print(i); cerr << " ";} cerr << "]";} template void _print(set v) {cerr << "[ "; for (T i : v) {_print(i); cerr << " ";} cerr << "]";} template void _print(multiset v) {cerr << "[ "; for (T i : v) {_print(i); cerr << " ";} cerr << "]";} template <class T, class V> void _print(map <T, V> v) {cerr << "[ "; for (auto i : v) {_print(i); cerr << " ";} cerr << "]";}

//To make unordered_map unhackable // use it as unordered_map<int,int,custom_hash> mapp; struct custom_hash { static uint64_t splitmix64(uint64_t x) { /* http://xorshift.di.unimi.it/splitmix64.c */ x += 0x9e3779b97f4a7c15; x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9; x = (x ^ (x >> 27)) * 0x94d049bb133111eb; return x ^ (x >> 31); }

size_t operator()(uint64_t x) const { static const uint64_t FIXED_RANDOM = chrono::steady_clock::now().time_since_epoch().count(); return splitmix64(x + FIXED_RANDOM); }

};

// Predefined values const ll mod = 1e9+7; const int N = 4010; int s[4010][4010]; int v[N*N];

void solve(){ int n,q; // cin>>n>>q; scanf("%d%d",&n,&q); for(int i=1; i<=n*n; i++){ // cin>>v[i]; scanf("%d",&v[i]); s[i%(2*n)][(i+2*n)/(2*n)] = s[i%(2*n)][(i+2*n)/(2*n)-1] + v[i]; }

int t,x; int p1, p2; int ans=0; for(int i=0; i<q; i++){ // int t,x; // cin>>t>>x; scanf("%d%d",&t,&x); p1 = t-x+1, p2 = t-2*n+x; ans=0; if(p1>0) ans += s[p1%(2*n)][(p1+2*n)/(2*n)]; if(p2>0) ans += s[p2%(2*n)][(p2+2*n)/(2*n)]; // cout<<ans<<endl; printf("%d\n",ans); }

}

int main(){

// Deepu Yadav // ios_base :: sync_with_stdio(false); // cin.tie(NULL); // cout.tie(NULL); #ifndef ONLINE_JUDGE // freopen("Error.txt", "w", stderr); #endif int t=1; // cin>>t; while(t--){ solve(); } return 0;

}

Comments (19)

Show archived | Write comment?

DPUYDV

18 months ago, # |

← Rev. 3 →

Today, I noticed same thing.

In this problem on using cin/cout I am getting TLE. But if I use fast I/O i.e. scanf/printf my solution gets accepted.

Accepted code

include<bits/stdc++.h>

include

include<ext/pb_ds/assoc_container.hpp>

include<ext/pb_ds/tree_policy.hpp>

using namespace std; using namespace chrono; using namespace __gnu_pbds;

// Macros definition

define ll long long

define lld long double

ifndef ONLINE_JUDGE

define debug(x) cerr << #x <<" "; _print(x); cerr << endl;

else

define debug(x)

Getting TLE

Accepted Solution

220982107 TLE 220982253 AC

TLE Code Accepted

→ Reply

yeminghan

18 months ago, # ^ |

You are not allowed to view the requested page

oversolver

Your rmq is recursive

Fyind

Do you mean the segment tree part? I think most people would like to use the recursive style of segment tree, so I only tested this style. Indeed, using a bottom-up segment tree (in author's solution) will have lower constant and will be accepted in this case.

← Rev. 2 →

... most people would like to use the recursive style ...

hard to believe it because this approach is faster and shorter

Yes, this approach is faster and shorter. Emm, I mean, the recursive approach is known to most people. Btw, although I know the bottom-up approach, I still use the recursive one. My reason is that the bottom-up approach is not applicable to lazy propagation in range updates. (Not sure, is it true?)

there is bottom-up approach, see atcoder library

I see, thanks for sharing the link

Dominater069

+10

I prefer a approach which is easier to understand than faster/shorter code. Ill happily code the slower longer code if i can understand it easily, and not so easily the other one

Can I then assume that you doesn't use BIT?

Actually, yes, well i didnt use to before 2 months ago.

In our countrys training camp for ioi, one of the people there made me understand how bit works and why the short implementation we write is correct. I have only started using it thereafter.

rgnerdplayer

Array (actually vectors) and cin worked fine for me.

Hi, thank you for your segment tree implementation and your submission. I only replaced your segment tree implementation in my old submission. Unfortunately, it got TLE. 221057129. Maybe you got accepted because your implementation in other parts is different from mine and somehow has low constant?

lis05

Strange, shouldn't array segtree version be faster than pointers segtree one?

balalaika

It is, and this is main question — why array version on C++20 x64 is TLE

My guess, that int_fast32_t would work

int_fast32_t is just a type synonym, and that fast in it doesn't mean it is fast

Sure I know it (for C++20 11.2 64x it is just int64_t)

I mean that in x64 arch massive arithmetic operations on int32_t are not optimal while operations on int64_t are. So for performance one should use it (yes with memory increase, but oh well...)

221387109

Update this is difference: 221387456 (int64_t vs int32_t in pos array makes difference)

areke

See this comment.

It's related to https://codeforces.net/blog/entry/99038.

If you read in the input separately from pushing back to the pos array, the implementation only takes 1216ms: https://codeforces.net/contest/1864/submission/221383761.

Amazing! This tiny modification can make such a huge difference. Learned something today!

Fyind's blog

Cin,cout vs Fast IO, Scanf on C++20 (64)

C++17 (32) vs C++20 (64)

Pointer implementation vs Array implementation

Some guessed conclusions

UPD: Reason of TLE

include<bits/stdc++.h>

include

include

include<ext/pb_ds/assoc_container.hpp>

include<ext/pb_ds/tree_policy.hpp>

define ll long long

define lld long double

define inf 1e18

ifndef ONLINE_JUDGE

define debug(x) cerr << #x <<" "; _print(x); cerr << endl;

else

define debug(x)

endif