s-lissov's blog

By s-lissov, history, 2 months ago, In English

...for fun and absolutely no profit.

Part 0. Introduction

It so happened that I hadn't written any Codeforces contest in a few months. When I was logging into my account to leave a comment under some post (I ended up not doing that), I noticed something weird:

Please wait. Your browser is being checked. It may take a few seconds...

After about half a second, this screen disappeared, and the normal login page showed up. But since I maintain a CLI client for Codeforces, this immediately raised an intrusive question: did Mike break my client once again? After firing up a shell and trying to log in, the answer was clear: yes, he did.

Part 1. "What the heck?"

At first glance, this screen looked suspiciously like the Cloudflare "Checking your browser" screen. This would have been a dead end, as even "professional" scrapers prefer using headless browsers and the like instead of relying on "Cloudflare bypasses". The idea of the CLI tool in question, however, was to be a "Swiss army knife" for dealing with various judging systems, and as part of that, it was meant to have no external dependencies, so that, for example, it could be easily installed during an onsite contest. Adding heavy "headless browser" libraries such as puppeteer was a no-go from the start, so I decided to take a closer look at the suspect.

But when I tried to download the login page, instead of this "Please wait" page, I got this:

[Spoiler: the Cloudflare "You've been blocked" page]

So, I was getting the normal Cloudflare "You've been blocked" page, instead of the supposedly-custom check page. After checking several clients and options, the following picture emerged:

  • Firefox, with session => normal login page
  • Firefox, private tab => "please wait" page (which then automatically refreshes to the login page)
  • Firefox, private tab, JS disabled => "please wait" page
  • Firefox, private tab, JS disabled, HTTP/2 and QUIC disabled => "please wait" page
  • Netsurf (graphical browser without JS support) => "please wait" page
  • curl => "Just a moment..." page (Cloudflare's interactive "verification" with CAPTCHA)
  • Python's urllib.request => "Just a moment..." page
  • w3m (text-mode browser without JS support) => "You've been blocked"
  • Python, custom HTTP client => "You've been blocked"

So, the pattern was clear:

  • For "very suspicious" connections, Cloudflare sends a blocked page immediately
  • For "less suspicious" connections, Cloudflare sends a "Just a moment..." page with a JS blob
  • For "not suspicious" connections without a valid session, the "Please wait" check is sent
  • For "not suspicious" connections that already have a valid session, access is allowed unconditionally

Part 2. "Just a moment"? No, I don't have time for you

I opened Wireshark and captured decrypted SSL traffic from the web browser visiting the login page. After copying the HTTP request verbatim, however, all I got was a block page, which meant that Cloudflare was likely employing some sort of TLS fingerprinting. After checking Python's urllib code, which receives a "Just a moment..." page instead of a permission error, I learned that it sets the expected ALPN protocol to "http/1.1". Doing the same caused my HTTP client to receive the "Just a moment..." JS blob instead of a permission error. Comparing the TLS handshake with what the browser does, I saw that the browser announces a lot of TLS extensions that Python's ssl module doesn't. Unfortunately, a Google search told me that Python does not provide a way to reliably forge a specific TLS fingerprint. Forcing Python to use TLS 1.2 worked, though, so I settled for that. If things change on this front, I may consider writing a TLS/SSL client in pure Python, but for now it is not needed.
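
For illustration, here is a minimal sketch (not the code from my client) of the two tweaks described above, using only Python's standard socket and ssl modules; the request path and headers are just placeholders:

import socket
import ssl

# Mimic what urllib does (announce "http/1.1" via ALPN) and additionally
# cap the connection at TLS 1.2, which was enough to avoid the block page.
ctx = ssl.create_default_context()
ctx.set_alpn_protocols(["http/1.1"])
ctx.maximum_version = ssl.TLSVersion.TLSv1_2

host = "codeforces.com"
with socket.create_connection((host, 443)) as raw:
    with ctx.wrap_socket(raw, server_hostname=host) as tls:
        tls.sendall(b"GET /enter HTTP/1.1\r\n"
                    b"Host: codeforces.com\r\n"
                    b"User-Agent: Mozilla/5.0\r\n"
                    b"Connection: close\r\n\r\n")
        print(tls.recv(4096).decode("latin-1", "replace"))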

At this point, most pages were loading just fine without any checks; however, the login page still returned the "Please wait" check page. To be able to log in, we need, at the very least, to grab a CSRF token, which means that we have to be able to fetch that specific page. So, we have to pass this challenge somehow. Challenge accepted.

Part 3. "Please wait"? No, I'm impatient!

Now that we've finally fetched the challenge page, let's look at it more closely. The page contains two embedded scripts: the first one is obfuscated (we'll look at it later), the second one is not obfuscated and looks like this. We can see that it waits for a DOMContentLoaded event, then creates an empty invisible iframe and injects into it a script from a string literal. That script then sets up some global object and loads a script from the path /cdn-cgi/challenge-platform/scripts/jsd/main.js. After googling this path, I found this forum thread, which claims that this is the so-called Cloudflare "bot fight mode". So is this also a Cloudflare check, just some lite version of it? Spoiler: yes it is, but not quite.

Now that I knew that my enemy was actually Cloudflare, I decided that the only way to defeat them would be to actually run the code. But I'm still constrained by the "no external dependencies" paradigm, so the only way to go was to write a JS interpreter from scratch. Challenge accepted!

Part 4. Anatomy of a JS interpreter

Logically, an interpreter for a (source-based) programming language consists of 3 parts: a lexer, a parser, and a runtime. The lexer turns a string containing the source code into a list of tokens (for example, abc == defg would get converted to ["abc", "==", "defg"]), the parser then turns that list of tokens into an abstract syntax tree (AST), which is then interpreted by the runtime. Actual modern JS engines also contain things like object structures, several tiers of JIT, and other technical buzzwords, but for our purpose of running non-computational code it should not matter, so we'll just do a DFS walk over the AST to run the code.
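
To make the overall shape concrete, here is a toy illustration of the "DFS walk over the AST" part (the node types and names are invented for the example; the real engine has far more node kinds):

from dataclasses import dataclass

# Toy AST: just enough to show the idea of a recursive (DFS) walk.
@dataclass
class Num:
    value: float

@dataclass
class BinOp:
    op: str
    left: object
    right: object

def evaluate(node):
    # Each node type knows how to compute itself from its children.
    if isinstance(node, Num):
        return node.value
    if isinstance(node, BinOp):
        a, b = evaluate(node.left), evaluate(node.right)
        return {"+": a + b, "-": a - b, "*": a * b, "/": a / b}[node.op]
    raise TypeError(f"unknown node {node!r}")

# (1 + 2) * 4  evaluates to  12.0
print(evaluate(BinOp("*", BinOp("+", Num(1), Num(2)), Num(4))))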

Logically, a JS lexer would split the string into these types of tokens:

  • Identifiers and number literals (consisting of letters, digits, underscores, and the dollar sign). We can (later) tell them apart by checking whether the first character is a digit.
  • Operators and punctuation. We can hardcode a list of accepted tokens (like +, +=, ==, ===, parentheses, brackets, curly braces, and so on) and find the longest one that matches the string at the current position. This is a greedy algorithm, but it's what actual JS parsers do, so we should do the same to stay compatible.
  • String literals. If a ' or " character is found, the corresponding closing quote must be located (mind the backslash escapes!), and the whole string literal is considered a single token.
  • Whitespace, when encountered, splits what would otherwise be a single token into two, but is otherwise ignored.

It quickly turned out that three things do not follow this pattern: regex literals, number literals with a leading dot, and automatic semicolon insertion.

e /f/g means "e divided by f, then divided by g", while /f/g on its own is a regular expression literal; whether / starts a regex or acts as division depends on the context.

a.b is an attribute access, with the dot as a separate token, while .01 is a number literal denoting 1/100.

This returns 3:

return 1+2;

This returns undefined, as a semicolon is automagically inserted after the return because of the newline:

return
1+2;

Note: in the last example, the only difference is a single space character being replaced with a newline. This means that JS is not really whitespace-agnostic.

For these reasons, instead of implementing the lexer as a separate pass, I decided to implement it as an iterator. It provides the methods peek(), which returns the next token without "eating" it, and get(), which returns the next token and consumes it. It can be seen as a Python collections.deque where x.peek() is x[0] and x.get() is x.popleft(), except that the tokens are lexed lazily rather than ahead of time. These methods accept two extra arguments, newline_is_ws and allow_regex, both defaulting to True. The parser then overrides them with False whenever it knows from context that ., /, or \n should be treated as separate tokens. Problem solved.
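
A stripped-down sketch of what such a lazy token stream might look like (names are made up; the real one also handles string and regex literals, numbers with a leading dot, and the newline_is_ws / allow_regex flags):

import re

# Greedy longest-match punctuation: sorted by length so "===" wins over "==" and "=".
PUNCT = sorted(["===", "==", "=", "+=", "++", "+", "(", ")", "{", "}", "[", "]",
                ";", ",", ".", "?", ":"], key=len, reverse=True)
IDENT = re.compile(r"[A-Za-z0-9_$]+")

class TokenStream:
    def __init__(self, src):
        self.src = src
        self.pos = 0

    def _lex_one(self):
        while self.pos < len(self.src) and self.src[self.pos].isspace():
            self.pos += 1                     # skip whitespace
        if self.pos >= len(self.src):
            return None
        m = IDENT.match(self.src, self.pos)
        if m:                                 # identifier or number literal
            self.pos = m.end()
            return m.group()
        for p in PUNCT:                       # longest matching operator/punctuation
            if self.src.startswith(p, self.pos):
                self.pos += len(p)
                return p
        raise SyntaxError(f"unexpected character {self.src[self.pos]!r}")

    def peek(self):
        saved = self.pos                      # look at the next token without eating it
        token = self._lex_one()
        self.pos = saved
        return token

    def get(self):
        return self._lex_one()                # eat and return the next token

ts = TokenStream("abc === defg")
print(ts.peek(), ts.get(), ts.get(), ts.get())   # abc abc === defg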

Part 5. Actually parsing JS

For parsing, I decided to implement a simple recursive descent parser. For each possible context (statement, several levels of expressions) there is a function that attempts to parse the requested thing from the token stream. Most of the time, it calls peek() to see the next token, determines what comes next, and calls the parse method of the corresponding class (for example, For.parse, TryCatch.parse, or VarDecl.parse).

For parsing expressions containing binary operations (which have operator precedence), the function parse_expression accepts a priority argument, which specifies that operators of priority lower than the requested one should not be part of the expression. This way, when parsing operators of priority X, their operands are parsed recursively using the same function, but with priority X+1. If the currently parsed operator is right-associative, its second operand is parsed with priority X instead.
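
Here is a rough sketch of that precedence-climbing idea on a toy arithmetic grammar (made-up names, a tiny operator table, and tuples instead of real AST node classes):

# Lower numbers bind weaker; "**" stands in for a right-associative operator.
PRIORITY = {"+": 1, "-": 1, "*": 2, "/": 2, "**": 3}
RIGHT_ASSOC = {"**"}

def parse_simple(tokens, pos):
    # A number, or an arbitrary expression in parentheses.
    if tokens[pos] == "(":
        inner, pos = parse_expression(tokens, pos + 1)
        return inner, pos + 1                 # skip the closing ")"
    return float(tokens[pos]), pos + 1

def parse_expression(tokens, pos, priority=0):
    left, pos = parse_simple(tokens, pos)
    # Only consume operators whose priority is at least the requested one.
    while pos < len(tokens) and tokens[pos] in PRIORITY and PRIORITY[tokens[pos]] >= priority:
        op = tokens[pos]
        # Left-associative: parse the right side with priority X+1; right-associative: with X.
        next_priority = PRIORITY[op] + (0 if op in RIGHT_ASSOC else 1)
        right, pos = parse_expression(tokens, pos + 1, next_priority)
        left = (op, left, right)
    return left, pos

# "1 + 2 * 3 ** 2 ** 2" parses as 1 + (2 * (3 ** (2 ** 2)))
tree, _ = parse_expression(["1", "+", "2", "*", "3", "**", "2", "**", "2"], 0)
print(tree)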

Another problem is postfix operators (++, --, calls, subscripts, attribute accesses). To handle these, the function parse_postfix_expression first tries to parse a "simple" expression (a variable, a literal, or an arbitrary expression enclosed in parentheses), then repeatedly looks at the next token and keeps parsing postfixes for as long as possible.

After all of this, we end up with a tree-like structure in memory describing the program. Now it's time to try running it!

Part 6. Actually running JS

As said above, the "runtime" is essentially a DFS-like walk of the AST, doing whatever actions the AST describes. At this point, the main problem becomes the object model. How do we represent JS values? How do we implement operations? What do we do with JS and DOM APIs?

I settled on the following principles for the design of the runtime:

  • No JS exception handling. Correct programs do not cause exceptions; if they do, it's most probably a bug in the runtime. It's better to report the bug to me (the runtime developer) than to let the JS code handle the error and detect that it's being emulated. Anything that normally causes a JS exception is a fatal error.
  • The runtime is deterministic. Calls to things like Date.now() are emulated. setTimeout is emulated using a priority queue, with the fake "current" time being incremented to account for the "delay". The caller is expected to provide the value for the "current" time; it's not automatically substituted with anything. Random numbers, if they were to be implemented, would be generated from a fixed caller-provided seed. This makes reproduction trivial, as there's no nondeterminism in the first place.
  • There is a special "Limbo" type, which is not present in the normal JS object model. The only thing it contains internally is a textual representation of the expression it stands for (like "Date.now()"). When any operation is performed on a limbo value, the result is also a limbo value, containing an approximation of the expression that was used to "compute" it. All JS operators, including built-in arithmetic and even the boolean not, are allowed to return limbo values. At the same time, trying to use a limbo value as the condition of an if, while, for, switch, or a ternary operator is a fatal error (a minimal sketch of this type is shown after this list).
  • Types whose fields are not fully known (to me as a developer), such as DOM objects, return their unknown fields as limbo values. All function calls and property assignments involving limbos on the left are printed. This way I can see the general pattern of what the obfuscated JS code is actually doing, before having implemented any APIs at all!
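
A tiny illustration of the limbo idea (heavily simplified and not the actual implementation; the real engine wires this into its own operator handling rather than into Python dunder methods):

class Limbo:
    """Placeholder for a value the runtime doesn't actually know."""
    def __init__(self, text):
        self.text = text                      # e.g. "Date.now()" or "navigator.userAgent"

    def __repr__(self):
        return f"<limbo {self.text}>"

    def __add__(self, other):
        # Any operation on a limbo yields another limbo describing the expression.
        return Limbo(f"({self.text} + {other!r})")

    def __getattr__(self, name):
        return Limbo(f"{self.text}.{name}")

    def __call__(self, *args):
        print("limbo call:", self.text, args) # calls involving limbos are logged
        return Limbo(f"{self.text}(...)")

    def __bool__(self):
        # Branching on an unknown value is a fatal error in the real engine.
        raise RuntimeError(f"branch on unknown value: {self.text}")

ua = Limbo("navigator.userAgent")
print(ua + "!")                               # <limbo (navigator.userAgent + '!')>
ua.toLowerCase()                              # prints: limbo call: navigator.userAgent.toLowerCase ()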

Part 7. Debugging

At some point, after mocking a dozen JS standard library functions, I got the following very suspicious API call: new Set(Set). Nobody in their right mind would do that, right? After inspecting the backtrace, it turned out that the following code was triggering it:

(H) = (((C[a6(249)][a6(212)]) && (C[a6(245)])) ? (C[a6(249)][a6(212)](new C[a6(245)](H))) : ...

Which, after deobfuscating the a6 calls (which really just return constant strings), yields the following:

H = window.Array.from && window.Set ? window.Array.from(new window.Set(H)) : ...

It turned out that, in my implementation of the parser, the ?: ternary operator was left-associative and had the wrong priority, so it got parsed as:

((H = window.Array.from && window.Set) ? window.Array.from(new window.Set(H)) : ...) ? ...

After fixing that, this specific error went away. But wait, how can one debug JS engine misbehavior on a piece of code they don't control? Simple: by adding debug prints everywhere! I made a script that inserted a console.log of a unique number after each ; in the source, then ran the script both in the browser and in my custom JS engine. Although I didn't specifically implement console.log, the limbo type ensured that all calls to unknown functions got logged, so the printed numbers showed up anyway. And the numbers did not match. Tracing why they didn't match revealed a bug in my implementation of the !== operator, which I had implemented the same as === and forgot to negate the result.
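
The injection itself can be as dumb as a regex replacement; something along these lines (the filename is made up, and a naive version like this also hits semicolons inside string literals and for(;;) headers, which may need fixing up by hand):

import re

# Insert a console.log with a unique number after every ";" so that the execution
# traces from the real browser and from the custom engine can be diffed.
src = open("challenge.js").read()
counter = 0

def tag(match):
    global counter
    counter += 1
    return f";console.log({counter});"

open("challenge_traced.js", "w").write(re.sub(";", tag, src))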

While testing in the browser, I noticed two other important things. First of all, my fully deterministic execution order did not match the actual execution order: in the actual web browser, the page would have reloaded before the server had the chance to serve the challenge script, and so the script never got to actually run. This meant that I could replace it with an empty string and not have to emulate that part of the code (which I immediately did). The second thing was this message in the developer console:

[Spoiler: a developer console message mentioning a cookie named "pow"]

From the context, it was pretty obvious that the "pow" in the cookie name does not mean exponentiation. I supposed (and, in the end, was right) that it means "proof of work". This gave me a clue about what to search for: I needed to know which hash function is used, how its argument is constructed, and what the exit condition is for the PoW to be considered "solved".

The next error I hit was an attempt to call the toLowerCase() method on a NaN value. Inspecting the backtrace revealed that the erroneous NaN value came from the following function:

var _0x27054a = '', _0x23d630, _0xc498df;
for((_0x23d630) = (7); (_0x23d630) >= (0); _0x23d630--)
{
  (_0xc498df) = (((_0x5336ac) >>> ((_0x23d630) * (4))) & (15)), (_0x27054a) += (_0xc498df[(((_cs[143]) + (_cs[48])) + (_cs[159])) + (_cs[178])](16));
}

From the code structure it's pretty obvious that this function is computing the hex representation of a 32-bit integer. It turned out that I had forgotten that the += operator can be used for string concatenation, not only for numeric addition, and my implementation ended up converting its arguments to numbers.

After fixing that obvious issue it... just hung. There were a few messages about limbos being called, and then it just hung, printing nothing. "Mining the PoW", I assumed. "Just too slow to be practically usable".

Part 8. Making it fast

In Part 4, I said that we're going to run non-computational code, and that the performance of our JS engine does not matter. It turns out that I was wrong. So, do we really have to implement multiple tiers of JIT to solve the PoW? Not really. After all, this is our engine, and we can do arbitrary instrumentation on the code being executed. And so I came up with a plan: add a hook that gets called after a JS function returns. This hook will have access to both the arguments and the return value. I will then check whether the return value is the hash of the argument, and if so, replace that function with a fast native implementation of the same hash algorithm. But which hash do we need to search for?
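
Sketched in Python (the hook body and the fields on the JS function object are hypothetical; in the real engine this sits inside the call machinery, and the candidate list was later narrowed down to SHA-1):

import hashlib

# Candidate hash functions to test a returning JS function against.
CANDIDATES = {
    "sha1": lambda data: hashlib.sha1(data).hexdigest(),
    "sha256": lambda data: hashlib.sha256(data).hexdigest(),
}

def after_return_hook(js_function, args, ret):
    # Called by the engine right after an interpreted JS function returns.
    if len(args) != 1 or not isinstance(args[0], str) or not isinstance(ret, str):
        return
    data = args[0].encode()
    for name, native in CANDIDATES.items():
        if ret.lower() == native(data):
            print(f"{js_function.name} behaves like {name}, swapping in native code")
            js_function.native_override = native    # hypothetical engine attribute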

I decided to first search for the hex function I'd seen earlier. And... nothing! No function ever returned a hex representation of its argument. After adding code that printed every function ever called, I saw that, past a certain point, only one function was being called: the one that performs a bit rotation operation. After examining the backtrace (yes, again), I saw the following for loop:

[Spoiler: an obfuscated for loop that looks like a hash compression loop]

This looks very similar to the hashing loop of something like SHA-256, which reinforced my guess that this JS blob is actually solving a PoW challenge. The bug turned out to be simple: after I fixed the += code to handle strings properly, it no longer updated the variable when applied to numbers! Another dumb bug fixed, and then I finally found the hex function in the trace. After inspecting the backtrace once again, I found the place it was called from:

var _0x365471 = ((((_0x21b5bb(_0x199b1a)) + (_0x21b5bb(_0x475b8a))) + (_0x21b5bb(_0x45b004))) + (_0x21b5bb(_0x52f4a0))) + (_0x21b5bb(_0x871e14));

This expression concatenates five 32-bit hex numbers, so the result is a 160-bit hash, I assumed. After looking at the list of hash functions on Wikipedia, I found that there are 4 hashes with this output size: FSB, HAS-160, RIPEMD-160, and SHA-1. Given that I had never heard of the first 3, I assumed it was SHA-1. I wrote the instrumentation code to find calls to functions that compute the SHA-1 of something and... nothing.

This turned out to be another bug, this time in the implementation of Number.prototype.toString(). The function formatting the hex did so by calling .toString(16) on a number, and my implementation simply ignored the argument and returned the number in decimal. After fixing that issue, the instrumentation was finally able to spot the function computing SHA-1. The engine turned out to be so slow that it could only do about 10 hashes per second...

After replacing this function with a native one, though, I got the following log:

...
52670_ 38238aa639fab7523eeaa1d89aae97f88d82660f
52671_ 5fe53a4ed901c3785f8e494022a09357ceff43b5
52672_ 8d571855c868f9d63be97b6e86403d8af1f2c4dd
52673_ 9a7f4acf99e0bd93541a0c3cd6bca368a2b25242
52674_ cc46cfc5c5f832cc9960644b7987878fb96530c3
52675_ bbe4999ee2e55890fbf0d5eba45c02eafe2767d4
52676_ 84a83434d338f2cbcc13174294fd1a4995192aad
52677_ f24be1a953f3530522ae11d145785438ae1ffbec
52678_ a223af9383b546e84758c63f0a41f2695edad57c
52679_ cab141d1878198e02607a9c4351f72adb6158b86
52680_ 9e71dec00bee61c2475b1328240aff38913121bc
52681_ d56db065065403f9d74b698e3cd3c32e93a36b76
52682_ 8db1b0cf3fce304c0c3116c65ade74978327cff0
52683_ 3bc9a2125c6fb758cae9790fe808b27588b3829f
52684_ f6b9aa8650048d657e51837c226e8ca51b38c1dc
52685_ 999599a78d896d6dd12da55ac7efd8656a486219
52686_ 00008aba374f0c30ea6a1094f750c9e539c8dd63

It then crashed on an unimplemented standard library function. After implementing a few remaining stubs, the script finally ran to the end, and document.cookie was set to "pow=52686_".

Conclusion

In the end, the algorithm to "bypass" the anti-bot protection turned out to be simple:

  1. Grab the suffix from the "pow" cookie (of the response from which you got the JS blob).
  2. Calculate SHA-1 hashes of strings like "0_"+suffix, "1_"+suffix, etc., until you find a hash whose first 4 hex digits (16 bits) are zero (see the sketch after this list).
  3. After that, request the login page again, setting the "pow" cookie to the string you've calculated (including the suffix).
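
A minimal sketch of step 2, assuming the candidate format and the 16-zero-bit target I observed at the time (Cloudflare may of course change either):

import hashlib

def solve_pow(suffix):
    # Find i such that SHA-1 of "i_" + suffix starts with 16 zero bits (4 hex zeros).
    i = 0
    while True:
        candidate = f"{i}_{suffix}"
        if hashlib.sha1(candidate.encode()).hexdigest().startswith("0000"):
            return candidate                  # this becomes the value of the "pow" cookie
        i += 1

# suffix comes from the "pow" cookie of the response that served the JS blob:
# print(solve_pow(suffix_from_cookie))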

I implemented this algorithm in my CLI tool and, after debugging a few minor issues, the login worked!

The concept of using proof-of-work to guard against bots is actually clever. Instead of performing some heavily obfuscated tricks to generate a browser fingerprint, or something like that, they simply make the suspected bot perform about a CPU-second of useless computation in order to be granted access. This does not really bother real users, but owners of big bot farms would have to spend colossal amounts of electricity to get access for all of their bots. Instead of preventing automation, this simply makes it impractical at scale.

To reemphasise: this does not "bypass" any "security measure". My custom client is doing the same wasteful computations as any legit user, and this essay does not describe any way to avoid doing that. Also, using custom clients is not prohibited by the Codeforces terms of service, nor by the official contest rules. Ideally, this algorithm should be officially documented somewhere, so that other authors of custom clients can implement it and get access. Once again, this does not defeat the security: any custom client will still need to burn a fixed amount of electricity to get access, so this physics-based rate limit will still work.

As you can see, I also ended up not using the custom JS engine in the actual CLI tool, and reimplemented the algorithm instead. If Codeforces or Cloudflare ever change the PoW hash function or the way the string is formatted (which, again, could be avoided by publicly documenting the algorithm), I might actually use it in production. But otherwise I don't know whether I will publish this code or not. Please let me know in the comments if you are interested.


»
2 months ago

Fun story, good storytelling, and neat tricks all around. Thanks, I might use this in some of my projects. Does this mean that alternate front-ends to Codeforces are possible to implement without triggering Cloudflare defenses?

  • »
    »
    2 months ago

    It seems that this custom protection only covers the login page and is fairly easy to bypass; however, the website is also protected by the ordinary Cloudflare checks. If you read Part 2, you'll see that I had to use a specific TLS fingerprint and a User-Agent header to get past that check, and I don't know whether it will trigger under high load (such as during an official contest, or if your proxy suddenly becomes popular).

»
2 months ago

Is your CLI client publicly available?

»
2 months ago

I'm gonna use this. Thank you for your time.