its_aks_ulure's blog

By its_aks_ulure, history, 5 years ago, In English

I have been preparing some problems for a college contest over the past few weeks and came up with this problem idea.

You are given $$$N$$$ words, each consisting of lowercase English letters and/or the special character '#'.
Find the expected number of nodes in the trie that is built after every special character ('#') in the words is replaced, independently and uniformly at random, with a letter from 'a' to 'z'.
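
For small inputs the quantity in question can be checked by brute force: enumerate every way of replacing the '#' characters, count the trie nodes of each outcome, and average. A minimal sketch, assuming nodes are counted without the root (the function name is only illustrative):

```python
from itertools import product

def expected_trie_nodes_bruteforce(words, alphabet='abcdefghijklmnopqrstuvwxyz'):
    # Positions of every '#' across all words.
    blanks = [(i, j) for i, w in enumerate(words) for j, c in enumerate(w) if c == '#']
    total, outcomes = 0, 0
    for letters in product(alphabet, repeat=len(blanks)):
        filled = [list(w) for w in words]
        for (i, j), c in zip(blanks, letters):
            filled[i][j] = c
        # Trie nodes (excluding the root) = distinct non-empty prefixes.
        prefixes = set()
        for chars in filled:
            w = ''.join(chars)
            for j in range(1, len(w) + 1):
                prefixes.add(w[:j])
        total += len(prefixes)
        outcomes += 1
    return total / outcomes
```

For example, on ["a#", "ab"] this returns $$$\frac{77}{26} \approx 2.96$$$: the two words always share the depth-$$$1$$$ node a, and their depth-$$$2$$$ nodes collide only when the # becomes b.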

I want to know if there exists any polynomial time solution for this problem.

Thanks


»
5 years ago, # |

Bump! Can anyone help me?

»
5 years ago, # |

Suppose we are given an ordered list of strings $$$S$$$. Assume the words are inserted in the order in which they appear in $$$S$$$. By linearity of expectation, the final answer is

$$$ \sum\limits_{i=1}^{n} \sum\limits_{j=1}^{|S_i|} P(i, j) $$$

where $$$P(i, j)$$$ is the probability that the node corresponding to the first $$$j$$$ characters of $$$S_i$$$ is not already in the trie when $$$S_i$$$ is inserted. Without loss of generality, assume that $$$S_{ij}$$$ is not #; if it is, we can just add up the probabilities over all characters a-z and divide by $$$26$$$. (edit: this isn't necessary, the algorithm below works perfectly fine if $$$S_{ij}$$$ is #)

The probability that this node is not already in the trie is simply the product, over all $$$k<i$$$, of the probabilities that $$$S_i$$$ does not share a prefix of length $$$j$$$ with $$$S_k$$$. For fixed $$$i,j,k$$$, we just do casework on the two prefixes. If they disagree at a position where both characters are fixed letters, they can never agree, so this factor is automatically $$$1$$$. Otherwise, the probability that they end up agreeing is $$$\frac{1}{26}$$$ raised to the number of positions where at least one of the two strings has a #; subtract this from $$$1$$$ to get the probability that they disagree.

Let $$$m$$$ be the maximum string length and let $$$a=26$$$. The total runtime is $$$O(an^2m)$$$, which I'm sure can be optimized further, but it is indeed a polynomial-time solution. This is just a rough sketch and may have errors; I welcome corrections in the replies.
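
For concreteness, a rough, untested sketch of this computation might look like the following (the function name is only illustrative, and the whole thing takes the independence over $$$k$$$ at face value):

```python
from fractions import Fraction

def expected_nodes_sketch(words, alphabet_size=26):
    n = len(words)
    # p_new[i][j - 1] = claimed probability that word i creates a new trie
    # node at depth j, i.e. P(i, j) in the notation above.
    p_new = [[Fraction(1)] * len(w) for w in words]
    for i in range(n):
        for k in range(i):                      # words inserted before word i
            s, t = words[i], words[k]
            agree = Fraction(1)                 # probability the two prefixes agree so far
            for j in range(min(len(s), len(t))):
                a, b = s[j], t[j]
                if a == '#' or b == '#':
                    agree /= alphabet_size      # a random letter matches with prob 1/26
                elif a != b:
                    break                       # fixed letters differ: the prefixes can never agree
                p_new[i][j] *= 1 - agree        # multiply over k, assuming independence
    return sum(p for row in p_new for p in row)
```

Using Fraction keeps the answer exact; a judge version would presumably work modulo a prime instead.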

  • »
    »
    5 years ago, # ^ |

    I don't believe the claim that the probability is "simply the product, over all $$$k<i$$$, of the probabilities that $$$S_i$$$ does not share a prefix of length $$$j$$$ with $$$S_k$$$", because those events can be dependent.

    Consider the case $$$S_1=\text{ab}$$$, $$$S_2=\text{ac}$$$, $$$S_3=\text{\#}$$$, with $$$i=3$$$ and $$$j=1$$$.
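
    Concretely, both $$$S_1$$$ and $$$S_2$$$ start with a, so the depth-$$$1$$$ node of $$$S_3$$$ is new exactly when its random letter is not a, which happens with probability $$$\frac{25}{26}$$$; the product over $$$k \in \{1, 2\}$$$ would instead give $$$\left(\frac{25}{26}\right)^2$$$.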

    • »
      »
      »
      5 years ago, # ^ |

      Yes, I see. So we would actually have to iterate over all configurations of $$$S_i$$$ obtainable by replacing each # with a letter from a to z, of which there are $$$O(a^m)$$$ in the worst case.

      If we ignore the case of a # appearing in the strings before $$$S_i$$$, we can permute the first $$$j$$$ indices to float all of the # characters of $$$S_i$$$ to the beginning, then delete every element of $$$S$$$ before index $$$i$$$ that disagrees with $$$S_i$$$ at some fixed (non-#) position, since such strings contribute nothing to $$$P(i, j)$$$. So, without loss of generality, we can assume the first $$$j$$$ characters of $$$S_i$$$ are all #, and the problem reduces to: given a bunch of strings of length $$$j$$$, what is the probability that a uniformly randomly chosen string of length $$$j$$$ is equal to none of them? This is easy to answer: it is $$$1 - \frac{\text{number of distinct strings}}{a^j}$$$.
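
      A minimal sketch of this reduced case, assuming the first $$$j$$$ characters of $$$S_i$$$ are all # and the earlier words contain no # (the helper name is purely illustrative):

      ```python
      from fractions import Fraction

      def p_new_node_reduced(prev_words, j, alphabet_size=26):
          # Distinct length-j prefixes among the earlier words that are long
          # enough to occupy a node at depth j.
          distinct = {w[:j] for w in prev_words if len(w) >= j}
          return 1 - Fraction(len(distinct), alphabet_size ** j)
      ```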

      So, if we ignore the case of # occurring in the earlier strings, we can solve it in polynomial time. We have a solution for when $$$S_i$$$ contains no # characters while the earlier strings may contain them (the one above), and a solution for when $$$S_i$$$ contains # characters while the earlier strings contain none. Is it possible to combine these two solutions somehow to get a general-case answer?

      I think the answer is no: if we try to perform the same trick of separating the positions of $$$S_i$$$ into # and non-# positions, we end up with potentially $$$2^n$$$ duplicate-deletion scenarios, as opposed to only one in the previous case. However, this is still an improvement over the $$$a^m$$$ time we had before. So, I now think that the problem is NP-hard.

»
5 years ago, # |

Would love to see the problem added to a judge by someone (or maybe it already exists?).

»
5 years ago, # |

I would suspect this problem is NP-hard, just like most counting problems that boil down to fixing some order of elements. I have no idea how to (dis)prove that, though, so I might be terribly wrong here.