Hello everyone!
I'd like to share a simple algorithm for computing the Frobenius form (or called rational canonical form) of a matrix. Before stating what is the algorithm, let me first give some applications.
Matrix Exponentiation
We know that one can compute the matrix exponential of a matrix $$$A^K$$$ in $$$O(N^3 \log K)$$$ time, via binary exponentiation.
If one knows how to compute the characteristic polynomial $$$p(T)$$$ of $$$A$$$ in $$$O(N^3)$$$ time, then one can first use the Cayley–Hamilton theorem to reduce the problem into computing
where $$$r_i$$$ is the coefficient of
This reduces the time to $$$O(N^4 + N^2\log K)$$$, or even $$$O(N^4 + N\log N\log K)$$$ if you use FFT to speed up the computation of $$$T^K \bmod p(T)$$$.
Some sqrt tricks can further reduce the time to $$$O(N^{3.5} + N\log N \log K)$$$, but it is still not good enough.
When you attend some linear algebra course, you may learn that, if one can find the Jordan form of $$$A$$$, then one can compute $$$A^K$$$ faster. However, the Jordan form is usually not a good choice for computation, because it at least needs to find the root of the characteristic polynomial, which in general needs to consider the field extension.
The Frobenius form (or called rational canonical form) is a better choice. It states that any matrix $$$A$$$ can be uniquely written as $$$A = Q F Q^{-1}$$$, where $$$F$$$ has the form
where $$$C_i$$$ is the companion matrix of polynomial $$$p_i(T)$$$, satisfying $$$p_k \mid \cdots \mid p_2 \mid p_1$$$.
Note that the Frobenius form $$$F$$$ is very sparse (only contains $$$< 2N$$$ non-zero entries), then we can compute $$$A^K$$$
where can be computed in $$$O(N^3 + N\log N \log K)$$$ time.
The Algorithm
To state the algorithm, we first consider a proof of the Cayley–Hamilton theorem that directly gives an algorithm to compute the characteristic polynomial.
Let $$$A$$$ be a matrix, we first pick an arbitrary nonzero vector $$$v$$$, and consider the sequence $$$v, Av, A^2 v, \dots$$$, these vectors are linearly independent until some point, say $$$A^k v$$$, then we have a relation
Let $$$p(T) = T^k - \sum_{i=0}^{k-1} c_i T^i$$$, we have $$$p(A)v = 0$$$.
We call the above vector $$$v_1$$$, say $$$V_1$$$ be the span of $$${v_1,\dots, A^{k-1} v_1}$$$, and then pick another vector $$$v_2$$$ such that $$$v_2$$$ is not spanned by $$$v_1, Av_1, \dots, A^{k-1}v_1$$$, then consider the sequence $$$v_1,\dots, A^{k-1}v_1, v_2, Av_2, A^2 v_2, \dots$$$, we can find another polynomial $$$p_2(T)$$$ such that $$$p_2(A)v_2$$$ is in $$$V_1$$$. Then we take $$$V_2$$$ to be the span of $$${v_1,\dots, A^{k-1}v_1, v_2, \dots, A^{k-1}v_2}$$$, and we repeat such process until
where $$$V$$$ is the whole space. Let $$$\beta = {v_1,Av_1,\dots,}$$$ be the basis.
Under our chosen basis, the matrix looks like
where $$$C_i$$$ is the companion matrix of $$$p_i$$$.
Then for each vector $$$v$$$, we show that $$$p_1(A)\cdots p_\ell(A) v = 0$$$. First consider $$$p_\ell(A)v$$$, this eliminates all components of the basis elements given by $$$v_\ell$$$, so $$$p_\ell(A) \in V_{\ell - 1}$$$, then we can inductively show $$$p_i(A) \cdots p_\ell(A) v \subset V_{i-1}$$$, in conclusion, we have $$$p_1(A)\cdots p_\ell(A) v = 0$$$. Note that by direct computation, we have $$$p(T) = \det(T I - A) = p_1(T) \cdots p_\ell(T)$$$, so the characteristic polynomial annihilates all vectors, the Cayley–Hamilton theorem is proved.
Note that there is a difference that outside the diagonal blocks, there are still some terms. For each $$$A^j v_i$$$, if $$$j$$$ is not the largest one, then $$$A(A^j v_i)$$$ takes the vector to $$$A^{j+1} v_i$$$, so this columns only contains one $$$1$$$ in $$$[A]_\beta$$$. The only problem is that the column for each maximal $$$j$$$ such that $$$A^j v_i \in\beta$$$ may contain some other terms (this corresponds to the $$$*$$$ in $$$0|*$$$).
A simple randomized strategy may help us to remedy this problem. What if we randomly choose a vector $$$v_i$$$ each time? Actually, it can be shown that, when the base field is $$$\mathbb F_q$$$, we have probability $$$\geq 1 - O(N / q)$$$ that $$$p_1, \dots , p_\ell$$$ are exactly those polynomials in the rational canonical form! (They are also called the elementary divisors when we give $$$V$$$ the $$$\mathbb F_q[T]$$$-module structure by $$$f(T) v = f(A) v$$$, and use the structure theorem of finitely generated module over PID.)
So what happens when we are guaranteed that $$$p_\ell \mid \cdots \mid p_1$$$? We first look at what happened at $$$v_2$$$.
Firstly, we have
and since $$$p_1$$$ is the minimal polynomial $$$A$$$ of $$$v_1$$$, let $$$q(T) = p_1(T) / p_2(T)$$$, we have
since $$$v_1$$$ is only annihilated by a multiple of $$$p_1(T)$$$, we have $$$p_1 \mid q r = p_1 r / p_2$$$, thus we have $$$p_2 \mid r$$$. This shows that we can have some $$$s = r / p_2$$$, then consider
we would have
This really looks like the "orthogonalization" step as what was done in the Gram–Schmidt process!
This process can also be done at larger $$$i$$$ s. Now we can state the algorithm.
[Datas] Maintain a basis $$$B$$$ of vectors $$${ A^j v_i }$$$ for $$$0\leq j < d_i$$$, where $$$d_i$$$ is the degree of the elementary divisor $$$p_i$$$. Initially $$$i=0$$$.
[Initial vector] Increase $$$i$$$ by $$$1$$$. Uniformly pick some random vector $$$v_i$$$.
[Iteration] Take $$$B' = B$$$, repeatedly insert $$$B'$$$ with vectors $$$v_i, Av_i, \dots$$$ until finds a relation $$$p_i(A) v_i = \sum_{j < i} g_{ji}(A) v_j$$$.
[Orthogonalization] For each $$$j < i$$$, compute $$$s_{ji} = g_{ji} / p_i$$$, then take $$$v_i \gets v_i - \sum_j s_{ji}(A) v_j$$$.
[Update] For the new orthogonalized vector $$$v_i$$$, insert $$$B$$$ with vectors $$$v_i, Av_i, \dots, A^{d_i - 1} v_i$$$.
[Finish] If $$$B$$$ has $$$n$$$ vectors, halt, otherwise go to 1.
After a careful implementation, the algorithm can be implemented in $$$O(N^3)$$$ time, and the probability that the algorithm fails is $$$O(N / q)$$$, where $$$q$$$ is the size of the base field.
Although the ideas behind need some further analysis (for example I didn't prove the probability of success), but the final algorithm is quite clean. I wrote a code for this algorithm, for the Pow of Matrix problem on Library Checker. You can refer to this to have a more precise understanding of how this algorithm works.
orz
orz
orz
aeren OTZ OTL OFZ OLZ ORZ oz osz sto ozz oyz roz hso ost OJZ OPZ <(_ _)> O]L O[Z orx ozk OFX огz лго 口厂乙 ОГZ :weary:rz rto fTO PTO rso fSO stotz :wheelchair:rz
I mentioned some $$$O(n^3)$$$ paper algorithm for this earlier in the comments when people were discussing how to compute matrix's characteristic polynomial in $$$O(n^3)$$$. Ultimately, people told me that it's doable without Frobenius form, so I didn't bother actually learning what's going on there. Does the paper use the same algorithm as the blog?
Using it it to speed up pow of matrix so significantly and actually get the top place for Pow of Matrix is very impressive! Maybe it will give me enough motivation to actually learn what's going on here...
Note that $$$p_1(x)$$$ is the minimal polynomial of $$$A$$$, and we have a lot of blocks because it may be different from the characteristic polynomial. Apparently, in "black box linear algebra" you can multiply the matrix with a random diagonal matrix, and get a matrix whose characteristic polynomial is also minimal?
If I understand correctly, it should also allow to only have one Frobenius block, somewhat simplifying things?
I really don't have a lot of experience with this, so I may as well be spilling nonsense right now :)
UPD: Ok, now that I think of it, even if it is true, and we represent it as $$$A = DB$$$, where $$$B$$$ has a single Frobenius block, they probably don't commute, so it shouldn't help multiplying with $$$A^k$$$.
TL;DR but the paper you mentioned seems to differ from this blog in that:
I didn't learn this algorithm, but it looks like it is a matrix-theoretical style algorithm, while my approach is somehow called the "Krylov method" (pick a vector $$$v$$$ and iteratively compute $$$A^i v$$$). I think usually it will be somehow more clean, but has a larger constant compared to clever matrix operations instead. For example, my algorithm seems to have a larger constant than computing the Hessenberg form, when the goal is to just compute the characteristic polynomial. Maybe one reason is that maintaining a basis is not that cache-friendly.
Wow! Amazing blog!
Thanks for sharing! Really simple.
In the "orthogonalization" step, the statement "since $$$p_1$$$ is the minimal polynomial $$$A$$$ of $$$v_1$$$" should instead be a stronger thing "$$$p_1$$$ is the minimal polynomial of $$$A$$$", which is true by the assumption $$$p_\ell \mid \cdots \mid p_1$$$, right?
Also I'm adding a rough proof for the probability
as my exercise, correct me if I'm wrong. Failure to find $$$p_1(T)$$$ correctly means that $$$v_1$$$ is annihilated with some proper factor of $$$p_1(T)$$$. Let $$$V \cong \bigoplus_i \bigoplus_j \mathbb{F}_q[T] / (f_i(T))^{e_{i,j}}$$$ as $$$\mathbb{F}_q[T]$$$-modules, where $$$f_i$$$'s are distinct irreducible polynomials. We have $$$p_1(T) = \prod_i f_i(T)^{\max_j e_{i,j}}$$$. The failure means that $$$\exists i$$$, $$$\forall j$$$ s.t. $$$e_i = \max_j e_{i,j}$$$, the $$$(i, j)$$$ component of $$$v_1$$$ (through the isomorphism) is a multiple of $$$f_i(T)$$$. For a particular $$$(i, j)$$$ this happens with probability $$$q^{-\deg(d_i)} \le 1/q$$$, so "$$$\exists i$$$, $$$\forall j$$$" with probability $$$\le \sum_i 1/q \le \deg(p_1)/q$$$. The rest is the same for $$$V/V_1$$$, therefore the total failure probability is $$$\le \sum_i \deg(p_i)/q \le N/q$$$.Yes! This can be extended to a rigorous proof. :)
An alternative implementation in Algorithms for Matrix Canonical Forms is that after we compute the blocked upper triangular matrix like
where
and $$$v_1,\dots ,v_k\in\left(\mathbb{F}_q\right)^n$$$, we could write
where $$$\mathbf{B}_{b_j}$$$ is a column vector corresponding to the $$$*$$$ stuff above. $$$b_j=\sum_{i=0}^{n-1}c_ix^i$$$ is a polynomial and $$$\mathbf{B}_{b_j}=\begin{bmatrix}c_0&c_1&\cdots\end{bmatrix}^{\intercal}$$$, we can use
where
my code. It seems work.