Hi everyone!
Recently aryanc403 brought up a topic of subset convolution and some operations related to it.
This inspired me to write this blog entry as existing explanations on how it works seemed unintuitive for me. I believe that having viable interpretations of how things work is of extreme importance as it greatly simplifies understanding and allows us to reproduce some results without lust learning them by heart.
Also this approach allows us to easily and intuitively generalize subset convolution to sum over $$$i \cup j = k$$$ and $$$|i \cap j|=l$$$, while in competitive programming we usually only do it for $$$|i \cap j|=0$$$. Enjoy the reading!
Subset convolution $$$a \circ b$$$ is defined as follows, let $$$a=(a_0, ..., a_{2^n-1})$$$ and $$$b=(b_0, ..., b_{2^n-1})$$$. Then
\begin{equation} (a \star b)_k = \sum\limits_{i\subset k} a_i b_{k \setminus i} \end{equation}
That is, $$$k$$$-th term of this convolution is a sum over all disjoint partitions of $$$k$$$ into two submasks.
Generic approach to compute such kinds of sums is to make a transform which maps initial arrays into some space where convolution is done via pointwise (Hadamard) product $$$(a \circ b)_k = a_k b_k$$$. That is, to find some invertible transform $$$F$$$ such that
\begin{equation} F(a \star b) = F(a) \circ F(b) \end{equation}
These kinds of convolutions can be interpreted as a product in groupoid ring. Let $$$x$$$ and $$$y$$$ be integers from $$$0$$$ to $$$2^n-1$$$. We define
\begin{equation} x \star y = \begin{cases} x \cup y,&\text{if $$$x \cap y=0$$$},\newline \emptyset,&\text{otherwise} \end{cases} \end{equation}
where $$$\emptyset$$$ is a special "sink" element indicating that this operation is invalid and shouldn't be counted.
Now if we consider formal sums $$$a=\sum\limits_{g=0}^{2^n-1} a_g g$$$ and $$$b=\sum\limits_{g=0}^{2^n-1} b_g g$$$, their product over groupoid ring is defined as
\begin{equation} a \star b = \sum\limits_{i=0}^{2^n-1} \sum\limits_{j=0}^{2^n-1} a_i b_j (i \star j) \end{equation}
What would be extremely nice for us here is to introduce formal variable $$$x$$$ such that $$$x^{i}x^j=x^{i \star j}$$$. Then these convolutions can be interpreted as polynomial products where polynomials are taken over these weird variable. To compute this product, a common trick is to instead compute it in some specific values of $$$x$$$, do the component-wise product and interpolate it back to polynomial to find the answer.
And since we work with a bitwise operation, we will use $$$x^i=x_1^{i_1} x_2^{i_2} \dots x_n^{i_n}$$$, where $$$i_k$$$ is $$$0$$$ or $$$1$$$ depending on the $$$k$$$-th bit of $$$i$$$. Then
\begin{equation} x^i x^j = (x_1^{i_1} x_2^{i_2} \dots x_n^{i_n}) (x_1^{j_1} x_2^{j_2} \dots x_n^{j_n}) = x_1^{i_1 \star j_1} x_2^{i_2 \star j_2} \dots x_n^{i_n \star j_n} = x^{i \star j} \end{equation}
Now before working with subset convolution, let's take a step back and recall what we're doing in "or" and "xor" convolutions. For "or" convolution we would need $$$x^0 x^0 = x^0$$$, $$$x^0 x^1 = x^1$$$ and $$$x^1 x^1 = x^1$$$. What are the actual numbers we can think of for which such identities hold? There are only two such numbers, they're $$$x=0$$$ and $$$x=1$$$. So, to compute the "or" convolution is equivalent to compute the values of the multivariate polynomial in all the points of {$$$0, 1$$$}$$$^n$$$ boolean cube.
And what about "xor" convolution? For it we need $$$x^0 x^0 = x^0$$$, $$$x^0 x^1 = x^1$$$ and $$$x^1 x^1 = x^0$$$. What are numbers for which these identities hold? It is $$$x=-1$$$ and $$$x=1$$$. So, the "xor" convolution can be interpreted through computing values in {$$$-1, 1$$$}$$$^n$$$ cube (and that's exactly what Walsh-Hadamard transform does as it is an $$$n$$$-dimensional DFT of size $$$2$$$).
Now comes the "subset" convolution. We need some value such that $$$x^0 x^0 = x^0$$$, $$$x^0 x^1 = x^1$$$ and $$$x^1 x^1 = 0$$$. Woah, there is only one such number, $$$x=0$$$. One value is not enough for us to uniquely recover the polynomial. Now, what do we do if we desperately need an element with specific properties but we do not have one?
Correct, we formally introduce it and pretend that it actually exists! Remember $$$i$$$ which was introduced with $$$i^2 = -1$$$? Now here we formally introduce element $$$\varepsilon$$$ such that $$$\varepsilon^2=0$$$ but $$$\varepsilon \neq 0$$$ and treat all elements of extended ring as $$$a+b\varepsilon$$$ with corresponding addition and multiplication. What was just introduced here is the concept of dual numbers.
It would almost work as it is possible to compute the values of a polynomial in {$$$0,\varepsilon$$$}$$$^n$$$. The problem begins when we need to inverse the transform as it requires division by $$$\varepsilon$$$ which is not doable as it doesn't have an inverse. Here comes another trick. What do we do if we desperately need to compute something modulo prime number $$$p$$$, but in the process it can happen that we may need to divide by $$$p$$$? What if we at the same time certainly know that such division will occur at most once?
We calculate everything modulo $$$p^2$$$ and when the division is needed, we just divide it directly! And if we need to divide at most $$$k$$$ times, we may compute everything modulo $$$p^{k+1}$$$. What we know here is that while computing values in $$$\varepsilon$$$, we will multiply by it every time there is a bit in the number, same goes for dividing by $$$\varepsilon$$$ during the inverse calculations: we store elements of kind $$$a_0 + a_1 \varepsilon + a_2 \varepsilon^2 + \dots + a_k \varepsilon^k$$$ to make sure that when time comes we are able to divide them by $$$\varepsilon$$$ at most $$$k$$$ times (in our case, $$$k$$$ is the number of bits we use, that is $$$n$$$).
That being said, subset convolution also has a nice intuitive interpretation which is nothing more than computing the values of multivariate polynomial in {$$$0, \varepsilon$$$}$$$^n$$$ cube in the ring of polynomials over $$$\varepsilon$$$ as a formal variable.
And, as a sweet bonus, after the inverse convolution we will actually obtain some polynomials over $$$\varepsilon$$$ instead of simply numbers. Their "free" terms will correspond to $$$(a \star b)_k$$$, but what's the meaning of other terms? As it turns out, \begin{equation} [\varepsilon^l](a \star b)_k = \sum\limits_{i \cup j = k,\newline |i\cap j|=l} a_i b_j \end{equation} So after the inverse transform we obtain grouping not only by $$$i \cup j$$$, but also by the size of $$$i \cap j$$$. And it in fact makes much sense, as if $$$x$$$ is either $$$0$$$ or $$$\varepsilon$$$ then $$$x^i x^j = \varepsilon^{i \wedge j} x^{i \vee j}$$$, which generalizes for $$$x_1, \dots, x_n$$$ as:
\begin{equation} (x_1^{i_1} x_2^{i_2} \dots x_n^{i_n}) (x_{1}^{j_1} x_{2}^{j_2} \dots x_{n}^{j_n}) = \varepsilon^{(i_1 \wedge j_1) + \dots + (i_n \wedge j_n)}x_1^{i_1 \vee j_1} x_2^{i_2 \vee j_2} \dots x_n^{i_n \vee j_n} = \varepsilon^{|i \cap j|} x^{i \cup j} \end{equation}
In this way, as promised earlier, we make a convolution over both $$$i \cup j = k$$$ and $$$|i \cap j| = l$$$ and doing this we somewhat justify the additional $$$n$$$ factor in the complexity.
UPD: note that you can also do arbitrary operations over such series by doing them on polynomial values.
For example, exponent of $$$A$$$ is defined as
\begin{equation} \exp A = 1 + A + \frac{A^2}{2} + \dots + \frac{A^k}{k!} + \dots \end{equation}
This exponent can be computed by taking the polynomial exponent from all polynomial values in {$$$0,\varepsilon$$$}$$$^n$$$, leading to
\begin{equation} [\varepsilon^l x^k]\exp A = e^{a_0}\sum\limits_{\substack{i_1 \cup i_2 \cup \dots \cup i_m = k\newline f(i_1, i_2, \dots, i_m) = l\newline 0 < i_1, i_2, \dots, i_m}} \frac{a_{i_1} a_{i_2} \dots a_{i_m}}{m!}, \end{equation}
where $$$f(i_1,\dots, i_m)$$$ is a function defined as
\begin{equation} f(i_1,\dots, i_m) = |i_1|+|i_2|+\dots+|i_m|-|i_1 \cup i_2 \cup \dots \cup i_m| \end{equation} In particular,
\begin{equation} [\varepsilon^0 x^k]\exp A = e^{a_0} \sum\limits_{\substack{i_1 \sqcup i_2 \sqcup \dots \sqcup i_m = k\newline 0 < i_1 < i_2 < \dots < i_m}} a_{i_1} a_{i_2} \dots a_{i_m}. \end{equation} Which allows various computations across unordered disjoint partitions of $$$k$$$.
UPD2: I also solved Lights out using this approach (23730542).
UPD3: Instead of {$$$0,\varepsilon$$$}$$$^n$$$ cube you may evaluate polynomial over {$$$-z,z$$$}$$$^n$$$ (see here), which is more similar to Walsh-Hadamard Transform which is used for "xor" rather than the transform used in "or" convolution. In this way, coefficient near $$$z^0$$$ will still correspond to subset convolution and coefficient near $$$z^l x^k$$$ would correspond to the sum of $$$a_i b_j$$$ over $$$i \Delta j = k$$$ and $$$|i \cap j|=\frac{l}{2}$$$ where $$$\Delta$$$ is the symmetric difference of $$$i$$$ and $$$j$$$, thus corresponding to the "xor" of such numbers.
UPD4: A more formal way to understand it is through generalized Chinese remainder theorem. For univariate polynomial it is known that \begin{equation} p(x) \equiv p(a) \pmod{x-a} \end{equation} Thus, computing value of polynomial in different points $$$a_1,\dots,a_n$$$ is equivalent to computing it modulo $$$(x-a_1)\dots(x-a_n)$$$. Thus if you multiply polynomials component-wise and interpolate them back, you'll get the result modulo $$$(x-a_1)\dots(x-a_n)$$$. In particular, for component-wise multiplication in DFT space one obtains result modulo $$$(x-1)(x-\omega_n)\dots(x-\omega_n^{n-1})=x^n-1$$$.
For multidimensional case computing $$$p(x_1,\dots,x_n)$$$ in the point $$$(a_1, \dots, a_n)$$$ yields \begin{equation} p(x_1,\dots,x_n) \equiv p(a_1, \dots, a_n) \pmod{x_1-a_1, \dots, x_n-a_n} \end{equation} Meaning that $$$p(x_1,\dots,x_n) - p(a_1, \dots, a_n)$$$ lies in $$$\langle x_1-a_1, \dots, x_n-a_n\rangle$$$ ideal. Due to generalized CRT, computing it in several points is equivalent to getting result in the quotient ring corresponding to the intersection of ideals formed by these points. In particular, computing values in {$$$a,b$$$}$$$^n$$$ is equivalent to computing it modulo $$$\langle(x_1-a)(x_1-b), \dots, (x_n-a)(x_n-b)\rangle$$$ ideal.
Elegia's post and discussion with negiizhao in the comments helped me a lot in understanding this point of view on multivariate evaluation, and I hope that I got it correctly. I studied ring theory quite a long ago and I can't read original Chinese articles, so I just hope my vague understanding here is at least somewhat correct.