Hey all,
We are excited to invite everyone to the Quora Programming Challenge 2021, a free programming competition open to participants from around the world*! The competition will take place from Saturday, February 6th, 2021 to Sunday, February 7th, 2021, in two divisions of 4 hours each, to accommodate different timezones.
Quora is a platform to ask questions, get useful answers, and share what you know with the world. On this contest, we include some problems inspired by real-world challenges that Quora engineers faced in building and growing the product. In addition to competitive programming-style questions, there will also be some machine learning problems. We hope that this contest will be fun for everyone!
The top contestants in each division will be able to win the following prizes:
- 1st place: $2000
- 2nd place: $1000
- 3rd place: $500
- 4th — 10th: $200
- Top 50 finishers: Quora Challenge T-Shirt
Our recruiting team will also be reaching out to top participants for their interest in interviewing for roles at Quora. We recently became a remote-first company, so you don’t have to relocate to work with us.
Register for the competition here by 2/5/2021: https://www.quora.com/careers/challenge.
Hope to see you at the contest!
- Except for excluded countries. Full details, including the dates and rules of the competition, can be found here: https://www.quora.com/careers/challenge
Auto comment: topic has been updated by vlchen888 (previous revision, new revision, compare).
Deja vu.
Why is LinkedIn profile a required field?
If it is so much required, why does it accept
https://linked.in/none
as answer?Mind that the answer is accurate and truthful to the extent possible, as required in the rules.
It's not really surprising that the form doesn't validate that the URL is a real profile URL, they did some rough validation and that's enough; do you expect forms to also automatically check that your City is a "real city" against some magical database before allowing your submission? No one ever does that.
LinkedIn profile is required because, clearly, Quora is running this (for free, with real prize money!) as a tool to recruit people to work there.
For important stuff, there's a way to validate an address: "We sent you a mail / message / etc. with a code, please input it in the next form to complete the registration process".
This one looks mandatory (required field in the form) but not enforced (no validation and no specific mention of LinkedIn profile in the rules, only "fill the form accurately and truthfully"). Hence the question, which is more about the company's intent: if it is really required, it can be clearer in the rules or via validation, if not, the required status can be dropped.
Validation only ever really happens during actual account creation; no one ever validates things like that in registration forms.
I guess the real question is whether they'll disqualify you if you don't have a LinkedIn? I'm not sure if they will, but from the rules about "accurately and truthfully", it sounds like they can, so I'd just make a LinkedIn and leave it mostly empty.
The problems in the 2014 version of this challenge has a not-strictly-algorithmic challenge, in particular Sorted Set where you have to parallelize a task, which is not common in CP. Are these types of questions fair game under your description of "competitive programming-style questions"?
Hi, thanks for the question! This iteration of the challenge will contain problems that fall into two categories.
I think many people are familiar with type 1, but for type 2, do you think you could provide a simple example problem ahead of time?
+1. Will any machine learning background be required to solve those problems?
Hi Anand, you can take a look at this problem from our 2014 contest for an example of what we’re considering an ML problem: https://www.quora.com/q/quorahaqathon/Quora-Haqathon-Labeler. These problems will not have perfect solutions, and you’ll find knowledge of machine learning useful for solving them. Resources are constrained though, so you won’t be training a SOTA model, and heuristic based approaches may be sufficient or superior for some problems.
Have you considered splitting the challenge into two contests? Also it's been my experience that it's quite hard to give a meaningful solution to a ML problem in 5 hours :(.
Will it be clear on which problems you can guarantee a full score?
Since I don't really like Quora as a platform (I believe in the idea, but not the current implementation), is there any way for me to register for the contest without logging into a Quora account? I try to minimize my social media accounts, especially those based around ads and data monetization (which is all of them nowadays, except Microsoft Circles).
Additionally, what's the approximate value of a Quora Challenge T-Shirt? I'm trying to decide whether it's worth attending based on the expected value.
I think a shirt like that mainly has value to you, so no one else can answer that question. Unless you mean the monetary value of a shirt which can't be over 15 dollars (and could be a lot less)
Reminder that registration for this contest is closing soon!
Am I right that all problems allow partial scoring, and that wrong attempts don't increase penalty? Will there be a limit on a number of submissions per problem?
29 January 2021 — "In the week before the contest, you should receive another email from us with the link to the contest platform and instructions on how to log in. We will use the same link for both divisions of the contest, and you will be able to log in to either one (or both) of the divisions."
I have not received another email ...
Ok I have received an email
"The Quora Programming Challenge 2021 is almost upon us!
Here is the login information that you will need to access the contest (please save this information):
Contest URL: challenge.quora.com ..."
Missed registration by some time, thought I will be able to register by 5 Feb IST time :(
I thought we had all of February 5th (until 23:59 UTC) at least to register... ):
I thought the same thing, two rounds missed.
+1, can the registration time be extended? :(
The time limit and memory limit of each problem is shown in the corresponding problem page.
I find this interesting for discussion :)
Doubling the time limit is understandable but not needed; doubling the memory limit is just weird.
I think they were right to allow double time limit (for java at least, no idea about python).
I personally solved the datacenter problem from division 2 first in java, got TLE, implemented same approach (almost same code) which got accepted.
Any Java Coders out here, I keep getting the error, "Execution failed because the return code was nonzero". I named my file as Example.java and removed the package and also removed the public keyword from my class. But this continues to show up. I tried submitting a solution using CPP and It got accepted. (BTW this is for the practice session.)
EDIT : (I managed to solve it, my file name had to be "example.java" and not "Example.java")
There isn't a custom invocation like what Codeforces have that can show us what is the error code :/
Actually, I missed the practice session. Can you tell me that what should be the name of class for Java solution ? Or can we name it anything ?
What I think is, class and file name will be the same as question name.
But the thing is, they have not mentioned this anywhere and if you name it something else then a weird error will come
"Since the grading system renames files to match the task name, you will need to use that same task name for the public class name in Java solutions. In the practice problem, this would be "example", since the grading system renames the file to be example.java" . They've sent something like this as an announcement. So, according to it name your file as the "questionname".java and make sure there are no capitals.
Ok, Thank you.
Will standings page be available during the contest? I don't see one in practice session.
"A Participant choosing to participate in both divisions could win prizes in both divisions. Limit one (1) prize per division per Participant.".
So, is it possible that best 50 in both divisions is exactly the same?
FYI, here are the Installed Python packages quoted from the official website:
Is there any possibility of late registration for either division ?
anton would've surely rejected atleast 6/7 problems. :)
seeing same type of comments nowadays It feels that anton have redefined competitive programming.
I think this problemset would have been rejected by 2015 coordinators as well.
why because they were too difficult ? I usually observe that people like kind of easy problem set and dislike difficult one.
Also notice that format was completely different . It was 4 hrs instead of 2 hrs.
-is-this-fft- i don't know why i can't reply two times in 10 minutes , So you want to say because problems were kind of popular or well known and thus they were boring ? From "standard" i usually feel well known problems which people solve while learning some topic.
Was only me who found problemset was only for > master people ?
No, because of problem style. Most of the problems were very standard, even a few years ago. I basically quit the contest out of boredom.
Probably because of unrated and low contribution.
Yes, that. For example there was a basic "learning König's theorem" problem and a pretty basic "learning HLD" problem. Note that using these techniques to set problems is fine, but not if knowing the thing is basically the whole problem. I also think that it is fine if there is an "educational" problem from time to time, but if all problems are like that then the problemset is bad.
Well, not master+, more like expert+. But you're right that this contest wasn't very green-friendly.
There was no need for HLD, binary climb was more than enough
In problem C(Problem ID: students), won't the graph always be bipartite? (Solving with this assumption fetched me 21 points though) or was it an implementation mistake?
It is bipartite but your complexity is wrong.
Yeah I did the same. But I dont know anything about Maxmimum Independent set. So I just copied some template from github.
I am really curious about the expected solution.
Yes, the graph is bipartite. But I think you are assuming that you can always get a maximum independent set in a bipartite graph by taking all the nodes in one of it sides (same color). That's clearly not true.
Yes, you're absolutely right, so is finding maximum independent set of a graph some standard problem?
for a bipartite graph, it is
Can you please tell me what is wrong with my code. I also used bipartite graph and took max of two colors for every connected component. https://ideone.com/cKVuu2
Imagine bipartie graph
Each half has 3 vertices, so your solution will output 3. But if we take 1, 2 from left side and 2, 3 from right side we can achieve 4
I'm sorry but didn't understand what you mean by 1 2 and 3 in that array. Basically I did a bfs on each connected components and assigned 0 or 1 color to each node and took the maximum of two colors for each connected components. Can you please tell me what is wrong with my methodology here.
Sorry, it was badly formatted. Hope it is understandable now
I think you have to find the maximum independent set instead of choosing only one color.
Maximum independent set = total vertexes — edges in maximum matching
Do you mean maximum bipartite matching? If yes can you share some insight of how you got this formula(sorry if it's something very standard)
From König's theorem we know that number of vertexes in minimal cover equals to number of edges in maximum matching. So total number of vertex is sum of vertex in minimum cover and maximal independent set so we have this equation.
How to find maximum matching for this graph? My Dinic only gets sub 2.
If u use dinic on vertex only if it appears in the input, which is $$$4\times 10^5$$$, then it would be fast enough. Since it works in O(sqrt(V) * E) in bipartite graph.
I got TLE when I did what you mentioned. I got AC by running dinic for every connetected component.
Ah, I forgot it :|. Thanks
Is the wall problem (forgot the exact name) a connected component dp problem? Can't find a valid transition during contest.
Yea it is, you just need to maintain the current type of connected components in each column and also which squares in the column are "active".
Any chance you could send me your solution? I'm not sure why mine doesn't pass, and I'd like to stress test it to find a countercase.
Code
Thank you!
I coded the bruteforce solution during the contest to debug my full solution. I felt like it was worth it since it is difficult to debug otherwise, and at least I can gain 14 pts from the first subtask in case I did not manage to debug it until the end of the contest.
Even the O(2^N) solution requires a MST. :|
I also wrote a brute force in contest, both to submit for 14 points and to use for stress testing. Unfortunately, I was unable to find a countercase using that solution, so I requested other competitors' solutions so that I could try to generate a countercase with a higher N. Unfortunately, even that didn't work out, so apparently there's a relatively obscure bug somewhere in my solution.
Did your brute force give you the countercase you were looking for, and if so, would you mind sharing the case? It sounds like a fair number of others failed on the same test case I did (TC7 from subtask one), so there's a decent chance countercases that worked for others will also break my solution.
Fortunately, yes. This is the countercase
It should return 111 where my buggy full solution code returns 115. It does not pass the first hidden testcase though. The issue was related to the first and the third row squares being connected while the second is not.
Thank you!
Thanks man
How can people apart from top-50 check their ranks?
Did anyone get an issue on the wall task where on test case 7 the code would throw a runtime error? I know others who had the same problem
I didn't end up with an RTE, but I got WA7 on the first subtask and couldn't finish debugging before the end of the round. I'm also looking for a countercase; I've stress tested my solution on thousands of cases against a slow solution I wrote and the correct solutions I've just been sent and can't find a case that I WA. If anyone happens to have gotten WA7/RTE7, found a countercase, and successfully debugged their solution, that case would be very much appreciated!
Can you show your code?
Sure, here's a link: https://pastebin.com/pRjgYdNH.
Essentially, I use a state that, after all vertical walls prior to a given column and all horizontal walls in that column have been processed, indicates the connectivity structure of the current column and which connected components contain workers. There are then 22 states, though I track them as values from 0 to 47 to make keeping track of the underlying values easier--essentially, for a given state S, column 0 is in connected component 0, S % 2 is the connected component containing column 1, (S / 2) % 3 is the connected component containing column 2, and S / 6 is a bitmask representing which components contain workers, and 32 possible transitions at each column (as there are five walls that could be removed in each transition). I precalculate all possible transitions, after which my solution is a fairly straightforward DP approach.
I ran a stress-test against your solution compiled with sanitizers, and what I found is that your code fails with RE when $$$n$$$ is $$$1$$$. The problem is that in line 150 you allocate an array of size $$$N-1$$$, and it seems to be not legal (maybe it is undefined behaviour, I am not sure): I get an error "variable length array evaluates to non-positive value 0".
Ooh, gotcha—thanks! I’ll see if fixing that helps.
Will tasks be available to submit somewhere later?
what was test case 7 of problem escape. it was getting stuck.
One thing that I missed in first attempt and got WA at test 7 was that guards will also be stopped by the walls. So you can't just check manhattan distance of a cell from a guard and determine its feasibility.
Ya I got it now that I also missed this only . Why not it was in bold? Thanks
How to solve flipped? My 93 points code: CODE
I got AC by assuming each column originally followed a normal distribution with its post-flip mean and standard deviation and picking the rows where the product of the normal PMF corresponding to each value assuming the row is not flipped divided by the product of the normal PMF corresponding to each value assuming the row is flipped is minimized. Intuitively, I assumed that the flips didn't substantially change the overall distribution of each column and then picked the rows where flipping most drastically increases the probability that the values in the row appear in their corresponding distributions.
Here's my code (excluding my template): https://pastebin.com/YtYNwPh5
I computed the mean and standard deviation of each column. Then, for each row $$$r$$$, let $$$f(r)$$$ denote the sum of squared distance from 0.5 of all the value of the cumulative normal distribition on each value in the row (drawn from the estimated normal distribution for each column). Do the same for reversed row $$$r'$$$ and finally sort all rows by the value $$$f(r')-f(r)$$$ and output the first $$$b$$$ rows.
Seems like the problemset is not (yet) publicly available. I uploaded the problemset to my server for public consumption (so those who missed the registration can also follow the discussion).
thanks!
And here is the second division problemset.
Does anyone know the ML techniques that should be used for the last 2 problems?
Also, I was getting RTE for 3 cases in problem Flipped with Cpp. I switched to Java and I was able to pass those cases. I suspect that I had to use the extra memory limit.
For the second last problem, we want to select the reversed rows. A cell is misplaced if its value deviates from the mean of the column. We do want to normalise the deviation by dividing with the standard deviation of the column. The rows with many very misplaced cells are likely reversed.
We can reverse the b worst rows, and the result was already quite good (around 70?).
The issue with reversing many rows at once is that we might reverse the wrong row. One way is to reverse one row at a time, and recalculate the mean and standard deviation. But this is too slow.
The middle ground is to reverse a portion of the b rows, in maybe 16 portions. I scored 99 points (even though it displayed 100 on the console).
The "ML technique" here is statistical knowledge of mean and standard deviation.
Code: https://github.com/tonghuikang/codecomp/blob/quora-2021-1/template/f.py
For the last problem, I used LightGBM (which is one the packages Quora allows, and is faster than XGBoost).
The challenge is feature engineering. Each data should have the same set of features, but what we see in the data is that each user can visit multiple sites. What I did is to cycle through the visits — each user will revisit the first site again. For the time, I calculated the time difference, and cycle. Each data will have 200+200 features.
I was prepared to handle more feature engineering and fine-tune the threshold, but I got full marks after a few tries.
The "ML technique" feature engineering, and trust the black box GBDT model and pray.
Code: https://github.com/tonghuikang/codecomp/blob/quora-2021-1/template/spam.ipynb
For the last problem, I think trying to do feature engineering is unnecessary. They tell you the exact model used to generate the data (a Markov chain for each user type), and there are pretty much no hidden variables, so you can directly estimate the parameters of the Markov chain using the input data, and then directly compute the Bayesian probabilities of the user types given the test inputs.
For problem Walls, how can we use the size of the grid to our advantage, or is there any algorithm to find the minimum spanning tree of a subgraph?
Will there be editorial?
Sorry for Asking, but can anyone help me with problem B " Escape ". I could not find the idea how to mark all the cell visible to atleast 1 guard any help.
For each cell $$$c$$$ that a guard with radius $$$r$$$ is on, set $$$dist[c] = -(r+1)$$$. Now do a multisource shortest path using a priority queue starting from all these cells. For any cell, if the shortest distance is found to be less than 0 at the end, then it is covered by atleast one guard.
Do you mean we should use multisource BFS with priority queue to prioritize popping the cells with lower $$$dist[c]$$$ first? Because when using a normal queue it won't be correct I think, for instance on this example.
Yep, you're right! Sorry I missed mentioning priority queue.
shiven But what if I come across a cell that is already marked(monitored by another guard), now should I add this cell in the queue ? If yes then time complexity will be quadratic, if No, then what if a cell adjacent to that cell is not monitored by the previous guards, but is monitored by the current guard ?
You'll be using a priority queue. It's almost dijkstra.
ok, but what should we do if we come across a cell that is already guarded by someone else, should we add it to priority queue ? If No then what if a cell adjacent to that cell is not monitored by the previous guards, but is monitored by the current guard ?
Exactly like you handle relaxations in Dijkstra.
How to solve problem "Escape". My approach was to first block all the cells which are at the distance <= d from guard. Then I find the length of the shortest path using bfs. It runs on subtask 1 only. Code
Hey, I used multisource BFS, pushed the coordinate of every guard in the queue with their distance, and mark the cell visited if a guard can reach the cell. Now, if our source is visited means at least one guard can reach our initial position and we can't go any further, so
"IMPOSSIBLE"
. If our destination is visited means we can't go to destination, so"IMPOSSIBLE"
and if all the neighbors of destination are visited then there is no chance to go to destination cell so the answer is"IMPOSSIBLE"
. If none of the conditions is satisfied above then apply simple BFS from source to destination. If we can reach the destination cell then print the distance else"IMPOSSIBLE"
.Division 1 recently concluded! Congratulations to all the winners and really happy to see all of the interesting discussion here. We hope you enjoyed the Division 1 problems and hope to see you at Division 2, which is starting in less than 6 hours: Feb 7 2021, 01:00 to 05:00 (UTC + 00:00)
Can someone provide me the final ranklist link?
It's Out Now: Quora Programming Challenge 2021 Full Ranking (https://docs.google.com/spreadsheets/d/1xVzryOFrhq7G9YpePcstexg8owvQPyev7EOXogdaI2c/edit?usp=sharing)
How tf ML problems are ML?
Does anyone happen to have the sample I/O for Spam downloaded, and if so, would you mind posting it publicly? I wrote a solution to the problem after the contest ended, and I'm curious about how it would perform on a larger test case.
I have input, not sure if there was an output
Thanks!
Any link for tutorial?
Can anybody tell the efficient solution for escape problem for full points ? None of the comments in this blog mention the efficient solution .
I think Multi-source BFS would give TLE as well , because you will visit a cell say (x,y) which is being gaurd by multiple enemies , so you will have to mark it multiple times which will waste time , and you cant ignore to mark it multiple times as well as each enemy may have different 'd' .
Please someone throw light on the efficient solution for escape ?
You only need to keep track of the enemy with the longest remaining range if starting from that cell. That way you only need to propagate from that cell once.
Did you use a priority queue to maintain the "enemy with longest remaining range"?
Very surprised that https://en.wikipedia.org/wiki/Strong_connectivity_augmentation is not given as a CP problem before, or maybe my Google-fu skill is bad.
Same, I found a solution on Google, but I could only get the first subtask because I had some weird bug causing me to WA on the second subtask.
Best I could find was this link in Russian, which I tried to piece together with Google Translate :)
Doesn't this give the solution as well?
Yes, but good luck trying to understand / implement this in 45 mins if you haven't seen the idea before.
That gives the correct answer, I believe, but getting the construction right is the (much) harder part of the problem. I attempted to implement the approach given in this paper http://www.springer.com/cda/content/document/cda_downloaddocument/9780387235288-c2.pdf but was unsuccessful in doing so (I suspect because I couldn't get the details of the T > S case, which isn't discussed in the paper, right).
I was not a fan of this problem, given that it's relatively standard and is essentially an implementation (or copying/pasting, if anyone was able to find code online) test. (I also liked malicious less than the other ML problems--it felt like it required much more esoteric knowledge of machine learning, while most of the other ML problems could be solved using general intuition.)
My guess is, it's very hard to give meaningful ML problems within the time frame of 3-4 hour contest that is not just merely pulling something from the library. Also the satisfaction depends on how quickly you get to a good score :)
I got quite frustrated with malicious; turns out I only scratch the surface idea, kudos to huikang for going into exploring the dataset and figuring out the correct insight!
Oops, I forgot the crucial line 327 of breaking when you find a matching. My code now AC's on CSES.
Submission
Actually, the paper I was following to implement this code also forgot to put that line in. Though it's definitely my fault for not checking their code first before blindly implementing :P
CSES' testcases are weak. I stresstested my code and it got wrong answer on simple testcase.
https://cses.fi/problemset/task/1685 The exact same problem.. This is a joke :/
The quora one got me wrong answer on TC 6.
Finally tourist and I have something in common — we are the only ones in the top 100 who completely solve malicious before the last hour.
https://www.quora.com/q/quoraprogrammingchallenge/Division-2-One-Hour-Remaining
Code: https://github.com/tonghuikang/codecomp/blob/quora-2021-2/template/malicious.ipynb
Where can the list of all Div1 problems be found ?
https://codeforces.net/blog/entry/86539?#comment-758158
A chance of editorials for the divisions in this contest?
By any chance if anyone who has downloaded the problemset for Div2 can share it here?
Jonathan has uploaded it on his website
https://jonathanirvin.gs/files/division2_3d16774b0423.pdf
Thank you!
you can find div2 problems in this pdf file **https://drive.google.com/file/d/1ztxHk9Y0Knkm662F2s5dEIOYiP-a7VSk/view?usp=sharing**
Thanks a lot bud!
Someone pls share the implementation of Escape and Students problem of DIV-1 Thanks in advance:)
Here is my code for Problem Escape (excluding my template):
EDIT: It seems the test-cases for this problems were weak, thanks to quinoa for giving a counter case for the above solution. The following solution seems correct to us now, though confirmation of the same is appreciated.
Did you get AC with this? I don't think it's correct since you're not prioritizing the nodes with greatest remaining distance when popping them from the queue. For instance it will fail on this testcase. Your code outputs "2" whereas it should say "IMPOSSIBLE".
Whereas when removing the first guard
your code gives "IMPOSSIBLE" (which is correct).
True, it seems like test-cases were weak, so how will we remove them prioritizing the distance, using a priority queue. Can you share your implementation?
I could not solve the problem, my solution only worked for the small test case. I think if you change your queue by a priority queue then it will work. Because then we will visit every cell when the remaining distance is maximal and then we should never visit it again. Whereas when using a normal queue we might visit a node when this distance is not maximal, and it is incorrect to label it as visited at this point.
But I'm also looking for some confirmation that this is the expected solution..
Here :
https://pastebin.com/dFs0MUt3
How do we solve datacenter and skyscraper(div-2) optimally?
Datacenter: transform each coordinate (x, y) into (x + y, x — y). The problem is now equivalent to finding the datacenter i that minimizes
which is much nicer to handle: precompute some prefix / suffix sum and iterate through each possible datacenter as the choice.
Skyscraper: maintain a segment tree where each node contains the maximum height (and possibly the minimum height, not sure if this is needed). The goal is, for a skyscraper i, to quickly find the minimum and maximum indices L and R such that all skyscrapers within [L; R] have heights no more than that of i. You can adapt the segment tree to quickly find these indices in O(log N) time.
I don't get the solution for Data Center. Can you please elaborate why it is equivalent?
We need to compute the distance of two points $$$(x_1, y_1)$$$ and $$$(x_2, y_2)$$$ under the original and transformed formula. By suitable translation and reflection, we can shift these two points to $$$(0, 0)$$$ and $$$(x, y)$$$, and furthermore we can assume $$$x \ge y \ge 0$$$ (the formula is similar for all other cases).
Under the original formula, the distance of two points is $$$\max (x, y) = x$$$.
Under the transformed formula, the distance of two transformed points $$$(0, 0)$$$ and $$$(x + y, x - y)$$$ are $$$(x + y) + (x - y) = 2x$$$. So the new distance under the transformed formula is just double the old distance.
Another way to think about this is we are rotating the entire coordinate by 45 degrees. So Manhattan distance becomes Chebyshev (chessboard) distance, and vice-versa.
How do you adapt segment tree for this problem?
Store the maximum for its range in every node.
Now, to find the smallest $$$i$$$ in the range $$$[l, r]$$$, for which $$$h[i] > x$$$, descend carefully down the segtree.
Something like:
This way, you will find the required index in $$$O(log N)$$$, per query.
Does it mean we have to decompose [i+1, n] into $$$O(\log n)$$$ intervals and find the first one having max > h[i]?
Yeah, in a way.
Dividing it into intervals on the fly.
This should help (see "Searching for the first element greater than a given amount")
Every time there is an asterisk after "world", I automatically assume it excludes Iran. What a shame :(
How to solve div2 message for any decent score? The English prediction problem.
I know that the dataset is generated from actual English phrases, so I've tried different greedy techniques of picking words based on some brewed up weighted score (frequency, position of the missing word, distance to other words of the same phrase, etc.). But I wasn't able to get anywhere 30+
We were tasked to predict the missing word in the sentence.
When I want to predict the missing word b in a triplet a,b,c — I consider for every possible b, the frequency of (a,b), the frequency of (b,c) and the freqency of (a,b,c). The candidate word for b that appears the most frequently will be my prediction.
It makes sense if we give greater weights to (a,b,c) because it is less likely to appear compared to (a,b) or (b,c). I count the freqency every appearance of (a,b,c) 10 times rather than once that I did for (a,b) and (b,c). This scored around 47/100 points.
To get 78/100 points, I considered two words before and two words after the missing word. In the final minutes of the contest, I was fine-tuning the weights and submitted the code repeatedly.
Code: https://github.com/tonghuikang/codecomp/blob/quora-2021-2/template/f.py
It is also possible to consider up to 30 characters before and after for a slightly higher score at a very little increase in complexity, but my code was not structured to do that.
For the full score, I think we need to use a neural network model. The general problem here is masked language encoding.
https://keras.io/examples/nlp/masked_language_modeling/
There are readily available templates out there, but it needs to download billon-parameter models which is not allowed by the online judge. We need to build (or copy) a downsized model that is suitable for the dataset.
I think I got lucky when hitting high score pretty quickly.
First I got 70 pts by keeping track of a frequency table of all 2-gram (an ordered pair of word (x, y)), and for each possible candidate w, its score is freq (p, w) * freq (w, r) where p and r are the word before and after w, respectively (if there's no such word p/r or the pair doesn't exist in the frequency table, assign the weight of 1 for that pair).
Return the word of highest score.
I got 82 by improving the above idea with including the 3-grams.
Got my T-shirt (Singapore)
Applied and got an offer, and I have accepted the offer ... Yay
congrats !
Did you get any link to track the T-shirt? (cause no link sent to me)
There was no tracking link sent to me too, the package does not indicate UPS or FedEx.