What does the RREF tell us, and why do we spend so much time on it? Why do we define matrix multiplication the way we do?
I’ve stood by and let this go on for too long. As someone who has been tutoring/teaching linear algebra for five years now, I’ve looked at the subject as a whole through multiple different perspectives. We have this abstract difficult subject with a million terms and concepts (many terms actually being the SAME concept under a different name), problems that can look entirely different but end up asking the exact same question, AND most students have to figure it out while they are learning how to properly do a mathematical proof for the first time (which is NOT an easy skill to learn). No wonder people struggle with it! And, so, I can understand why many professors choose to offload the concepts and underlying “algebra” until later in the course, and start off with “Here’s a system of equations, row reduce it!” for three weeks. And though tedious, and generally unmotivated, it’s certainly… easy?
The problem is that with so much to cover, my experience with most students is that the key concepts that tie EVERYTHING together are not being emphasized. So when the more important topics like basis, linear independence, span, coordinate vectors, kernel, image, etc. are introduced, the pathways between the new topic and what the student has already learned aren’t being drawn, and it all seems so disconnected. More like you’re learning something completely different solved with a similar method, rather than learning an extension of a previous concept.
So that’s what I want to do today. I’m going to start from square one, and we’re going to change how we view row reduction, matrix multiplication, and even a matrix itself. And with this new perspective, I hope that everything feels more cohesive and straightforward.
This post is pretty much entirely about functions, and to really talk about functions properly, we’re going to use the proper terminology. These are concepts you’re probably familiar with, but perhaps under different names.
We denote \(f:X\to Y\) to mean \(f\) is a function that takes stuff in the domain \(X\) to the codomain \(Y\). i.e. everything in \(X\) can be put into the function, and every output is in \(Y\) (though we can’t necessarily reach everything in \(Y\)). For example, \(\exp:\mathbb{R}\to\mathbb{R}\) for \(\exp(x)=e^x\) is a function that takes in real numbers and outputs real numbers (so the \(\mathbb{R}\) is both the domain and codomain). This is because we can put in any real number, and we’ll get out a real number, so we write \(\exp:\mathbb{R}\to\mathbb{R}\). Even though every output of \(e^x\) is actually strictly positive (there’s no input that gives \(-1\), for example), it’s still correct to say it’s a function with codomain \(\mathbb{R}\).
If \(f(x)=y\), then we say \(y\) is the image of \(x\). Images are unique (because a function must be “well-defined”). If you vaguely remember that a function can’t send one input to two outputs, or that it must pass the “vertical line test”, this is basically the more rigorous way to say it: the image is unique.
It’s also very useful to say that \(f(x)=y\) means that \(x\) is a preimage of \(y\). Preimages are not necessarily unique. For example, with the function \(f(x)=x^2\), \(f(2)=f(-2)=4\), so \(4\) has two preimages: \(2,-2\).
We call the complete set of all possible outputs that \(f\) can produce the image/range of \(f\). And for special types of functions (like the ones we study in linear algebra), we call the set of all preimages of \(0\) the kernel.
This is important to talk about up front, because understanding why matrices are useful requires understanding how the images of a few special vectors under special functions allow us to simplify how we write such a function.
Now, some smart-asses (many of whom I love) might object to what I’m about to present and say, “a matrix is JUST an array of numbers which CAN induce a linear transformation!” And, yeah, but we’re talking in the context of linear algebra, here. In my opinion, the way to view a matrix that provides the most intuitive bang for your buck is as a Linear Transformation itself.
First, though, I’m going to give you the completely oversimplified low-down of what the majority of abstract algebra is like, on the motivational level.
“We have some things that interact in some nice way, let’s look at functions and substructures that preserve those interactions.”
In terms of linear algebra, our “things” are vectors (over some field of scalars), and the interaction is vector addition and scalar multiplication.
NOTE: You can think of a “field” as just some collection of “numbers” that you can add/subtract/multiply and divide (if they’re nonzero). Just think \(\mathbb{R}\) or \(\mathbb{C}\) for the purposes of this post.
We can do these things and still get a vector, and combining these operations is called taking a “linear combination”.
\[v=c_1v_1+\ldots+c_nv_n\]So, in a sense, we can view vector spaces as some collection of vectors, and we can take linear combinations of those vectors using scalars from the field the vector space is over.
NOTE: Saying scalars from a “field” is important! There are other nonvector things we can use to scale/multiply vectors by, but being able to divide by nonzero scalars is what makes linear algebra so nice (and what makes Modules, which are vector spaces over a ring instead of a field, much crazier).
So, algebraists, then, are interested in functions and substructures of the vector space that preserve the interactions of vectors (the interactions are linear combinations, in our case). The substructures of vector spaces that preserve linear combinations are subspaces, and the functions that preserve linear combinations are called Linear Transformations. Another sort of “type” of substructure is a quotient space (but that’s far beyond the scope of this post). Our focus today is on linear transformations.
For a linear transformation \(T\) to preserve linear combinations, it means that for any \(v=c_1v_1+\ldots+c_nv_n\),
\[\begin{gather*} T(v)=T(c_1v_1+\ldots+c_nv_n)\\=c_1T(v_1)+\ldots+c_nT(v_n) \end{gather*}\]That is, the image of a linear combination is the linear combination of the images.
But why are you (someone who is probably not an algebraist) interested in functions that preserve linear combinations? Well, if you’ve taken calculus, then you definitely are! How do you take the derivative of a polynomial?
\[\frac{d}{dx}(ax^2+bx+c)=a\frac{d}{dx}x^2+b\frac{d}{dx}x+c\frac{d}{dx}1\] \[=a(2x)+b(1)+c(0)=2ax+b\]You probably don’t think about it that way (with so many steps), and you can just see the derivative. But this is really what’s going on under the hood. We don’t have an explicit formula for the derivative of every single polynomial that exists, but we do know how to differentiate each individual power of \(x\) (the power rule), we know the derivative of a sum is the sum of the derivatives, and we know that constants pull out.
We can combine these properties to just say the derivative “preserves” linear combinations:
\[\frac{d}{dx}(c_1f(x)+c_2g(x))=c_1\frac{d}{dx}f(x)+c_2\frac{d}{dx}g(x)\]and so differentiating a polynomial is just breaking up the linear combination and applying the derivative to each power of \(x\). This is the key idea: we only need to know how the operation acts on each vector, and then we know exactly how it acts on any linear combination of those vectors. We don’t need to compute it from scratch.
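If you want to see this concretely, here’s a quick sympy check of that linearity property (the particular polynomials \(f\) and \(g\) below are just arbitrary picks of mine):

```python
import sympy as sp

x, c1, c2 = sp.symbols('x c1 c2')
f = x**2 + 3*x        # arbitrary example polynomials (my picks)
g = 5*x - 7

lhs = sp.diff(c1*f + c2*g, x)                 # derivative of the linear combination
rhs = c1*sp.diff(f, x) + c2*sp.diff(g, x)     # linear combination of the derivatives
print(sp.simplify(lhs - rhs))                 # prints 0: the two always agree
```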
And, in that way, the derivative is one of the most important examples of a linear transformation (and the study of the derivative as a linear operator on smooth functions is pretty much the most important part of an Ordinary Differential Equations course). The integral can also be considered a linear operator.
Hopefully, the derivative and integral are sufficient to convince you that “yeah, studying linear operators might be worth doing”, but in case you need a few more, the following are examples of things that are actually just applications of linear transformations:
Alright, so I’ll assume you’re on board with “linear transformations are worth studying”. Now, we’re going to restrict ourselves to linear transformations between finite dimensional vector spaces (the main topic for an introductory course in linear algebra). For our purposes, a finite dimensional vector space is necessary because it has a finite basis, which is the key fact that makes matrices so useful.
Now, what do I mean by finite basis? Well, let’s focus on the main dish of finite dimensional vector spaces: coordinate spaces! You might know them in the case of \(F=\mathbb{R}\) as “Euclidean vector spaces” of the form \(\mathbb{R}^n\). We’ll just call them \(F^n\): an array or list of \(n\) numbers from the underlying field of scalars (which we’re just calling \(F\) instead of specifying \(\mathbb{R}\) or \(\mathbb{C}\) or whatever other field we have).
Note: One of the reasons linear algebra is so powerful is that all the non-\(F^n\) finite dimensional vector spaces are actually just isomorphic to \(F^n\) by the coordinate map. In layman’s terms: it’s all just \(F^n\). We can, without loss of generality, assume we’re talking about \(F^n\) coordinate spaces, which mesh the best with matrices. And then we only have to slightly adjust the execution for general finite dimensional vector spaces.
Okay, so why do we care about a “basis”, and what is it? A basis is a “generating set” of vectors with unique representations. What does that mean?
The fact that it’s a “generating set” means that the basis can “generate” or produce any vector in the space. If a vector exists in the space, I can give you a linear combination of the basis vectors that equals that vector. If our basis is \(\left\{v_1,\ldots,v_n\right\}\), then that means for all \(x\in F^n\), we can find constants \(c_1,\ldots, c_n\) depending on \(x\) such that
\[x=c_1v_1+\ldots+c_nv_n\]You probably learned the word “span” to describe this concept, but “generate” is a more descriptive word (in my opinion), and the more general term used for similar concepts in abstract algebra. But it’s still the same thing: we’ll use “generates” as a synonym for “spans”.
For example, take \(F^2\) with the standard basis \(\left\{\begin{pmatrix}1\\0\end{pmatrix},\begin{pmatrix}0\\1\end{pmatrix}\right\}\). For any vector in the space, \(x=\begin{pmatrix}x_1\\x_2\end{pmatrix}\), we can easily find a linear combination of the basis vectors to reach \(x\):
\[x=\begin{pmatrix}x_1\\x_2\end{pmatrix}=x_1\begin{pmatrix}1\\0\end{pmatrix}+x_2\begin{pmatrix}0\\1\end{pmatrix}\]The constants in the linear combination are just the entries! That’s what makes this the “standard basis”. It’s the simplest possible basis, which is easiest to use. We’ll use \(e_i\) for the \(i\)th standard basis vector (that is, a \(1\) in the \(i\)th entry and zeroes everywhere else).
I would like to remark on the connection between span/generation and the substructure I mentioned earlier: “subspaces”. A subspace is a subcollection of vectors which is closed under linear combinations, meaning any linear combination of vectors in the subspace is still in the subspace. It turns out that subspaces are exactly just spans of some generating set. That is, every subspace is the span of some set of vectors in the subspace, and every span of a set of vectors is itself a subspace. But it’s better to think of span and subspace as synonymous ideas, rather than as a happenstance. Our primary interaction of vectors is linear combinations, and span is just taking all linear combinations. Can you justify why that would necessarily form a structure closed under linear combinations? The other direction is actually almost silly: any structure closed under linear combinations (a subspace) is just the span of itself. You can say something less trivial by using the axiom of choice to justify that any vector space (and any subspace is itself a vector space) has a basis (and, thus, a generating set).
Later I mention the row space and column space of a matrix. These are examples of subspaces which are defined by being generated by the rows or columns respectively. Note that the kernel is also a subspace.
Now, there’s one other part from the definition for basis I gave above: the generating set has “unique representation”, which essentially means that it’s “linearly independent”. This means that the representation of a vector as a linear combination of the basis vectors is unique. So if we have
\[x=c_1v_1+\ldots+c_nv_n=d_1v_1+\ldots+d_nv_n\]Then that implies that \(c_i=d_i\) for all \(i\).
Now, if you’ve already learned linear independence, you might be thinking “WTF THIS IS NOT WHAT I LEARNED LINEAR INDEPENDENCE MEANS”. And, yes, this is not the standard definition. But it is the most conceptually useful definition, especially for our purposes here (though not the easiest to use for proofs). See, if we stick with just this idea of “unique representations”, then what if \(x=0\)? We can clearly use \(d_1=\ldots=d_n=0\) as a linear combination to get \(x\) because \(0v_1+\ldots+0v_n=0\). However, this means that, by our established uniqueness,
\[c_1v_1+\ldots+c_nv_n=0\implies c_1=\ldots=c_n=0\]All the \(c_i\)’s must be zero! And this is now closer to what you probably learned before. In fact, you can actually read the more familiar definition as saying “the zero vector can only (uniquely) be represented as a linear combination of these vectors using the trivial linear combination of all zero coefficients (trivial solution)”. And, as it turns out, if \(0\) can only be represented by the trivial solution, then that means ALL representations are unique! This is not as obvious (and worth trying to prove if you can)!
The other super important and useful fact: all bases have the same size! Don’t take this for granted; the fact that we’re over a field is the only reason this is guaranteed to be true. There are Modules which can have linearly independent generating sets of any positive integer size. For us, it’s SUPER simple: \(F^n\) has \(n\) basis vectors. Don’t overthink it!
Okay, so basically I took your familiar “basis” = “spans” + “linearly independent” definition, and slapped on confusing words: “spans” = “generates” and “linearly independent” = “unique representations”. Why? Because now everything is going to come together.
Okay, so to summarize so far: linear transformations preserve linear combinations, every vector in \(F^n\) is a linear combination of the \(n\) basis vectors, and that representation is unique.
We’re going to use these facts together to get the main reason linear transformations are so nice and easy to study: To know how it acts on ANY vector, we ONLY need to know how it acts on the BASIS vectors of the domain.
This is HUGE. Functions can be very complicated and complex! It can be computationally intense to even compute many functions. But with linear transformations between finite dimensional vector spaces? Easy-peasy. Because if \(T\) is a linear transformation from \(F^n\to F^m\), then we only need to know the images of the \(n\) basis vectors of \(F^n\) to know where \(T\) sends EVERY vector. (I feel like you aren’t getting as excited about this as I am)
So suppose you are trying to tell your friend across the sea about your interesting linear transformation from \(\mathbb{R}^3\) to \(\mathbb{R}^2\). How do you do it? \(\mathbb{R}^3\) has infinitely many vectors (uncountably many, in fact). You could never write down every image (output) for every vector in \(\mathbb{R}^3\). But, you don’t need to! For,
\[\begin{multline*} T\begin{pmatrix}x_1\\x_2\\x_3\end{pmatrix}=T(x_1e_1+x_2e_2+x_3e_3)\\=x_1T(e_1)+x_2T(e_2)+x_3T(e_3) \end{multline*}\]Therefore, all we need is \(T(e_1),T(e_2),T(e_3)\). That’s just three \(\mathbb{R}^2\) vectors: which is \(6\) simple numbers total. That means, all you need to send to your friend is those six numbers, and they will have EXACTLY your linear transformation. Let’s say the images are
\[T(e_1)=\begin{pmatrix}1\\1\end{pmatrix},\quad T(e_2)=\begin{pmatrix}0\\1\end{pmatrix},\quad T(e_3)=\begin{pmatrix}1\\-1\end{pmatrix}\]So, if you were lazy, and you wanted to send those six numbers to your friend in as clear and simple a way as possible, how might you do it?
Well, sending just 1, 1, 0, 1, 1, -1
is kind of confusing and hard to parse. It’s also not immediately clear what this is a transformation between. It could be \(\mathbb{R}^1\) to \(\mathbb{R}^6\) or vice versa, or \(\mathbb{R}^2\) to \(\mathbb{R}^3\). So a one-dimensional array doesn’t really work.
Okay, you know exactly what I’m getting at: YOU PUT IT IN A MATRIX. A TWO-dimensional array. And we have the vectors already in their little column form, why not just concatenate them? Let’s define
\[A=\begin{pmatrix}1&0&1\\1&1&-1\end{pmatrix}\]to be our little package that encodes \(T\). This perfectly stores all the information we need to communicate \(T\). Each column tells us the image of the corresponding standard basis vector.
Now, let’s take it one step further. What if we could use \(A\) instead of \(T\)? What if we didn’t have to write the linear transformation at all? What if we could stick to just these simple arrays of numbers to carry all the information of our linear transformation? How would we do it?
Well we clearly want \(Ax=T(x)\). That much is obvious. And we know that
\[T\begin{pmatrix}x_1\\x_2\\x_3\end{pmatrix}=x_1\begin{pmatrix}1\\1\end{pmatrix}+x_2\begin{pmatrix}0\\1\end{pmatrix}+x_3\begin{pmatrix}1\\-1\end{pmatrix}\]So… then… let’s just define
\[\begin{multline*} \begin{pmatrix}1&0&1\\1&1&-1\end{pmatrix}\begin{pmatrix}x_1\\x_2\\x_3\end{pmatrix}\\ =x_1\begin{pmatrix}1\\1\end{pmatrix}+x_2\begin{pmatrix}0\\1\end{pmatrix}+x_3\begin{pmatrix}1\\-1\end{pmatrix} \end{multline*}\]And there, now we have matrix vector multiplication defined.
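As a quick sanity check of that definition, here’s the column-combination computation next to numpy’s built-in product (the input vector is an arbitrary choice of mine):

```python
import numpy as np

A = np.array([[1, 0, 1],
              [1, 1, -1]])      # the matrix encoding T from above
x = np.array([2, 5, 3])         # an arbitrary input vector (my choice)

# Column perspective: Ax is the x_i-weighted combination of the columns of A
col_combo = x[0]*A[:, 0] + x[1]*A[:, 1] + x[2]*A[:, 2]

print(col_combo)                # [5 4]
print(A @ x)                    # [5 4], the built-in product agrees
```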
Essentially, all we’re doing is
\[\begin{equation} A= \Bigg(\begin{matrix}T(e_1)&\cdots&T(e_n)\end{matrix}\Bigg) \end{equation}\]Observe that this also, at a glance, tells us exactly what \(T\) is a function between. Each column has two entries (i.e. \(A\) has two rows), so the codomain must be \(\mathbb{R}^2\). We also have three columns, so our domain has three basis vectors. Clearly, the domain must be \(\mathbb{R}^3\). Thus, by our construction of \(A\) as a “package” for \(T\), an \(m\times n\) matrix (\(m\) rows and \(n\) columns) encodes a linear transformation from \(F^n\to F^m\).
I’d like to acknowledge that with this perspective, a student doesn’t have to “memorize” the fact that \(m\times n\iff F^n\to F^m\). This isn’t a “theorem” or something that requires excessive scratch work. This is a damn feature of matrices, which makes them so informative and useful. Not only do the columns encode the outputs, the very shape itself tells you one of the most important aspects of the function: the dimension of the domain and codomain (AND their size relative to each other).
So by our construction, this allows us to utilize other theorems to know from a glance that, for example, a wide matrix (\(n>m\)) can never be injective (its kernel must be nontrivial), and a tall matrix (\(m>n\)) can never be surjective (its image can’t be all of \(F^m\)).
Now, it’s time for my trump card. Why you should definitely view a matrix from the perspective that the columns are the images of the basis vectors. Because that’s how we define a matrix for a linear transformation with respect to other bases! If your course covers finding a matrix with respect to a certain basis, then you absolutely should start with this perspective.
I see it, time and time again. A student asks how to find the matrix for \(T\) with respect to some given basis \(\beta\) of the domain and \(\gamma\) for the codomain (or maybe just the matrix for a described linear transformation), and they’re lost. And the inevitable first question someone asks the student is “do you know the ‘formula’ for the matrix with respect to the bases \(\beta\) and \(\gamma\)?”
\[[T]_\beta^\gamma = \Bigg(\begin{matrix}\left[T(\beta_1)\right]_\gamma&\cdots&\left[T(\beta_n)\right]_\gamma\end{matrix}\Bigg)\]But, really, this is the definition for the matrix representation for a given linear transformation when bases are specified for the domain and codomain. Now, if you adopt the view I’m trying to push here, then this definition is just… the natural generalization. The columns are still just the images of the basis vectors of the domain, we’re just being more explicit that we’re using a specific basis. But now we make sure the images are the coordinates with respect to the proper basis of the codomain. That is literally the only change. If both \(\beta\) and \(\gamma\) are just the standard basis of \(F^n\) and \(F^m\) respectively, it’s exactly what we described above.
If you haven’t learned coordinate vectors or change of basis, you can ignore this. But just know that this is literally how we define the matrix representation of a linear transformation!
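If you have seen coordinate vectors, here’s a small sympy sketch of that definition in action: each column of \([T]_\beta^\gamma\) is the \(\gamma\)-coordinate vector of \(T(\beta_j)\), found by solving against the \(\gamma\) basis. The particular \(T\), \(\beta\), and \(\gamma\) are hypothetical examples of mine.

```python
import sympy as sp

# Hypothetical example: T : R^2 -> R^2 given (in the standard basis) by the matrix below,
# with a chosen basis beta for the domain and gamma for the codomain (all my own picks).
T = sp.Matrix([[1, 2],
               [0, 1]])
beta  = [sp.Matrix([1, 1]), sp.Matrix([1, -1])]
gamma = [sp.Matrix([2, 0]), sp.Matrix([0, 3])]

C = sp.Matrix.hstack(*gamma)          # columns are the gamma basis vectors
# Column j of [T]_beta^gamma is the gamma-coordinate vector of T(beta_j),
# i.e. the solution of C * coords = T * beta_j.
cols = [C.solve(T * b) for b in beta]
print(sp.Matrix.hstack(*cols))        # Matrix([[3/2, -1/2], [1/3, -1/3]])
```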
In case you’ve seen me rant online about how column perspective is the ultimate definition of matrix vector multiplication, perhaps now you can see why I think so. With our perspective of a matrix as a compact definition for a linear transformation, this definition is just obvious.
\[\begin{equation} \Bigg(\begin{matrix}a_1&\cdots&a_n\end{matrix}\Bigg) \begin{pmatrix}x_1\\\vdots\\x_n\end{pmatrix} =x_1a_1+\ldots+x_na_n \end{equation}\]where \(a_i\) is the \(i\)th column of \(A\) (which, by definition, is \(Ae_i\)). As a side note, the following two facts are things that most students I’ve worked with don’t find obvious, but really are with this perspective:
Observe, also, that taking some linear combination of the columns is the same as just taking the image of the vector with the scalars as the entries. That is, \(c_1a_1+c_2a_2+\ldots+c_na_n=A\begin{pmatrix}c_1\\\vdots\\c_n\end{pmatrix}\). Hence, if you can find some combination of the column that gives you zero, then that gives you a vector in the kernel.
\[c_1a_1+\ldots+c_na_n=0\implies \begin{pmatrix}c_1\\\vdots\\c_n\end{pmatrix}\in\ker(A)\]More generally, if you can find a combination of the columns to give you some vector \(b=c_1a_1+c_2a_2+\ldots+c_na_n\), then \(\begin{pmatrix}c_1\\\vdots\\c_n\end{pmatrix}\) is a preimage of \(b\) under \(A\) (and thus a solution to \(Ax=b\)).
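Here’s that idea in numpy for the small matrix \(A\) from above (the particular combinations are ones I found by inspection):

```python
import numpy as np

A = np.array([[1, 0, 1],
              [1, 1, -1]])

# A combination of the columns that gives zero: 1*a1 - 2*a2 - 1*a3 = 0,
# so (1, -2, -1) should be in the kernel.
c = np.array([1, -2, -1])
print(A @ c)                    # [0 0]

# A combination that hits b = a1 + a3 = (2, 0) is a preimage of b:
x = np.array([1, 0, 1])
print(A @ x)                    # [2 0], so x solves Ax = b for b = (2, 0)
```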
So, then how does matrix-matrix multiplication fit into this perspective? Well, if \(T_A\) has the matrix \(A\), and \(T_B\) has the matrix \(B\) (and \(T_B\circ T_A\) is defined), then we want \(BAx=T_B(T_A(x))\). Once again, we just need to make sure that both \(BA\) and \(T_B\circ T_A\) map the standard basis vectors to the same thing.
\[T_B(T_A(e_j))=T_B(a_j),\quad BAe_j=Ba_j\]And with our definition, there’s nothing immediately wrong or strange we have to do. We just define the \(j\)th column of \(BA\) to be \(Ba_j\). That is,
\[\begin{multline} BA=B\Bigg(\begin{matrix}a_1&\ldots&a_n\end{matrix}\Bigg)\\ =\Bigg(\begin{matrix}Ba_1&\ldots&Ba_n\end{matrix}\Bigg) \end{multline}\]If you think about it, this is actually quite reasonable. \(Ax\) is going to be some linear combination of the columns of \(A\). And \(B(Ax)\) is just applying \(B\) to every single one of those vectors in the linear combination. So the image of the standard basis vectors under \(BA\) is just \(B\) applied to the images of the standard basis vectors under \(A\).
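A quick numpy check of this column-by-column construction (the matrix \(B\) here is an arbitrary example of mine, shaped so that \(BA\) is defined):

```python
import numpy as np

A = np.array([[1, 0, 1],
              [1, 1, -1]])                  # encodes T_A : F^3 -> F^2
B = np.array([[1, 2],
              [0, 1],
              [3, 0]])                      # encodes some T_B : F^2 -> F^3 (my choice)

# The jth column of BA is B applied to the jth column of A
BA_by_columns = np.column_stack([B @ A[:, j] for j in range(A.shape[1])])

print(BA_by_columns)
print(B @ A)                                # numpy's product gives the same matrix
```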
Am I going to talk about the row column rule now? No. Because there are a million videos on youtube about it, and your professor probably drilled it into your head. This post is focused on intuition. It doesn’t take that long to see that the row column rule follows directly from the column definition, once you just combine it all into one vector. To be clear, the row column rule is not evil, and I definitely use it because it’s quicker. But I think it’s better to keep in mind the column definition is really the ultimate definition, and it ties pretty much everything you do with matrices together.
Alright, so at the beginning I talked about how we call the set of all outputs of a function the “image” or “range”. For some reason, in linear algebra, we try to obscure the connection between the image/range of a linear transformation and the image/range of a matrix function by calling it a different name: “the column space of \(A\)”. This is an intuitive name for the span/set of all linear combinations of the columns of \(A\), but it distracts from the fact that we already define multiplication by \(A\) to be taking linear combinations of the columns of \(A\).
This is one of the most frustrating parts of tutoring linear algebra for me. Because for most students, span, column space, and range are basically separate concepts in their head, when it’s really just one thing. The fact that
”\(Ax=b\) is consistent if and only if \(b\) is in the column space of \(A\)”
has to be a theorem and is not as self-evident as “\(f(x)=y\) is possible if and only if \(y\) is an output of \(f\)” makes me want to tear my hair out. And it’s not the student’s fault! It’s how they’re being taught linear algebra.
So screw “column space” (and “null space” for that matter). All matrix functions in linear algebra are linear transformations, and we’re interested in the image and kernel of them. Consider this yeeting two redundant terms from the subject entirely.
(Plus, you’ll always sound smarter if you say “image” and “kernel” instead of “column space” and “null space”)
Alright, let’s talk about a new matrix \(A=\begin{pmatrix}1&-1&1&2\\1&-1&2&3\\1&-1&3&4\end{pmatrix}\). This is a \(3\times4\) matrix, so it’s a transformation from \(F^4\to F^3\). Now, as is usual in linear algebra, we’re interested in the image and kernel of this linear transformation.
The matrix is wider than it is tall, and so it’s trying to stuff a higher dimensional space into a lower dimensional one. Thus, it’s intuitive that we’ll have a nontrivial kernel. A wider matrix can map to the entire codomain, but it’s not immediately obvious if every vector in \(F^3\) will actually have a preimage under \(A\). It’s not even immediately obvious how you’d even determine that! All we theoretically know is that an output is a linear combination of the columns. So, can we reach any \(F^3\) vector with a linear combination of these columns? If not, how can we tell?
Well, if we think about it, what we really want is a basis for the range of \(A\). Of course, every column is in the range, so we just want to know which columns to take for a basis, and which columns are redundant. If we look carefully, we can see one major relationship: Column 2 = -(Column 1). This means that if we use both columns one and two, then our representations won’t be unique (because they are linearly dependent). So we should definitely exclude column 2 if we’re going to use column 1.
If we look at Column 3, it doesn’t have any obvious relationship to Column 1. In fact, you can show they are linearly independent. But, you might just notice that Column 4 = Column 1 + Column 3. Another dependence relationship. Thus, it appears that we also have to exclude Column 4, leaving just Columns 1 and 3. Hence,
\[\left\{\begin{pmatrix}1\\1\\1\end{pmatrix},\begin{pmatrix}1\\2\\3\end{pmatrix}\right\}\]is a basis for the image of \(A\). You can confirm this is true, because we can uniquely write every column of \(A\) as a linear combination of these vectors, so then we can write any linear combination of the columns as a linear combination of these vectors. And from here, we can see that the image is two-dimensional, so it won’t span or generate \(F^3\). Thus, \(A\) is not a surjective or onto transformation.
But what of the kernel? Well, Column 2 = -(Column 1) is actually just saying
\[A(e_2)=-A(e_1)\implies A(e_1+e_2)=0\]So \(\begin{pmatrix}1\\1\\0\\0\end{pmatrix}\) is in the kernel. Can you see how by similar logic \(\begin{pmatrix}1\\0\\1\\-1\end{pmatrix}\) is also in the kernel? You can verify it yourself by applying \(A\) to both vectors, and when you use column perspective, it should become clear. And we can justify through rank-nullity that these bases are complete because their sizes sum to the dimension of the domain (but let’s just vibe it).
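If you’d rather let the computer do the vibing, here’s a numpy verification of the column relationships and kernel vectors we just read off:

```python
import numpy as np

A = np.array([[1, -1, 1, 2],
              [1, -1, 2, 3],
              [1, -1, 3, 4]])

# The column relationships we spotted:
print(np.array_equal(A[:, 1], -A[:, 0]))            # True: Column 2 = -(Column 1)
print(np.array_equal(A[:, 3], A[:, 0] + A[:, 2]))   # True: Column 4 = Column 1 + Column 3

# The corresponding kernel vectors:
k1 = np.array([1, 1, 0, 0])
k2 = np.array([1, 0, 1, -1])
print(A @ k1, A @ k2)                               # both [0 0 0]
```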
Okay, so I led you through a problem you would have learned to do by row reduction, without ever row reducing.
Let’s try a much easier, seemingly unrelated problem:
\[R=\begin{pmatrix}1&-1&0&1\\0&0&1&1\\0&0&0&0\end{pmatrix}\]Here, it’s so much easier to find a basis for the image and kernel. Every column is a linear combination of the independent vectors \(e_1,e_2\), so they form a basis for the image. Oh, and look! Column 1 is \(e_1\) and Column 3 is \(e_2\). So Column 1 and Column 3 themselves form a basis for the image.
As for the kernel, clearly if we take Column 1 + Column 2, we’ll cancel them out and get zero. So actually similar to above \(\begin{pmatrix}1\\1\\0\\0\end{pmatrix}\) is in the kernel. And, similarly, it’s pretty easy to see that if we take Column 4 and then subtract off Columns 1 and 3, then that’s also zero. So \(\begin{pmatrix}1\\0\\1\\-1\end{pmatrix}\) is also in the kernel. Now… that seems familiar. Both \(A\) and \(R\) have the same kernel, the same column relationships, and the same columns form a basis for the image*. What is this sorcery??
Note*: It turns out that the kernel actually uniquely determines which columns can be a basis for the image.
It turns out that \(A\) and \(R\) are actually significantly linked. Because \(R\) can be obtained from \(A\) by applying an invertible operator on the left. But there is also a more fundamental observation we can make. We established that each column of \(A\) can be uniquely written as a linear combination of \(\left\{\begin{pmatrix}1\\1\\1\end{pmatrix},\begin{pmatrix}1\\2\\3\end{pmatrix}\right\}\). Well, let’s look at the linear combination for each column. We’ll call the vectors in that set \(v_1\) and \(v_2\) respectively.
\[A=\begin{pmatrix}1&-1&1&2\\1&-1&2&3\\1&-1&3&4\end{pmatrix}\]Well, the columns of \(A\) are, respectively: \(1v_1\), \(-1v_1\), \(1v_2\), \(1v_1+1v_2\). Thus, you can verify that
\[A=\begin{pmatrix}1&1\\1&2\\1&3\end{pmatrix} \begin{pmatrix}1&-1&0&1\\0&0&1&1\end{pmatrix}\]We often call this \(CR\) decomposition, since \(C\) contains a basis for the columns of \(A\), and \(R\) contains a basis for the rows of \(A\). Notice how the columns of the right matrix essentially tell you what linear combination of the columns of \(C\) you need to get the corresponding column of \(A\). We call these the “coordinate vectors” with respect to the pivot column basis.
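You can verify the \(CR\) decomposition numerically:

```python
import numpy as np

A = np.array([[1, -1, 1, 2],
              [1, -1, 2, 3],
              [1, -1, 3, 4]])
C = np.array([[1, 1],
              [1, 2],
              [1, 3]])                      # the basis columns of A
R = np.array([[1, -1, 0, 1],
              [0,  0, 1, 1]])               # the coordinate vectors (nonzero rows of rref(A))

print(np.array_equal(C @ R, A))             # True: A = CR
```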
Disregarding the \(CR\) decomposition, let’s talk about what I meant by “\(R\) can be obtained from \(A\) by applying an invertible operator on the left”.
\[\begin{multline*} \begin{pmatrix} 0 & 3 & -2 \\ 0 & -1 & 1 \\ 1 & -2 & 1 \end{pmatrix}\begin{pmatrix}1&-1&1&2\\1&-1&2&3\\1&-1&3&4\end{pmatrix}\\ =\begin{pmatrix}1&-1&0&1\\0&0&1&1\\0&0&0&0\end{pmatrix} \end{multline*}\]And since it’s invertible, we can also obtain \(A\) from \(R\) by applying the inverse on the left.
Note: If you want to know how I found the matrix, you can just augment \(A\) with the identity and row reduce. You will get \(R\) augmented with a matrix that puts \(A\) into its RREF.
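Here’s that trick in sympy. (The invertible matrix it spits out may differ from the particular one shown above, since more than one invertible matrix puts \(A\) into its RREF.)

```python
import sympy as sp

A = sp.Matrix([[1, -1, 1, 2],
               [1, -1, 2, 3],
               [1, -1, 3, 4]])

# Augment A with the identity and row reduce the whole thing.
aug = A.row_join(sp.eye(3))
reduced, _ = aug.rref()
R, E = reduced[:, :4], reduced[:, 4:]

print(R)              # rref(A)
print(E)              # an invertible matrix with E*A = rref(A)
print(E * A == R)     # True
```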
So the question becomes “why does applying an invertible operator on the left not change the kernel or column relationships”? Well, like I said above, it turns out that the kernel actually uniquely determines the column relationships of the matrix. See the end of the post for an algorithm that gives you the unique set of nonzero RREF rows that has the specified kernel.
So then why does applying an invertible operator on the left preserve the kernel? Well, this is actually a lot easier. If \(E\) is invertible, then
\[Ax=0\iff EAx=0\]So if \(EA=R\), then \(Ax=0\iff Rx=0\implies \ker(A)=\ker(R)\).
So, in summary, invertible operations on the left of a matrix don’t change the kernel, which also tells us which choices of columns give us a basis for the image.
There is another perspective for what an “invertible operator on the left” actually means using “row perspective”. It’s essentially the transpose of column perspective.
\[\begin{equation} AB=\begin{pmatrix} {A}_1^T\\{A}_2^T\\\vdots\\{A}_m^T \end{pmatrix}{B}= \begin{pmatrix} {A}_1^T{B}\\{A}_2^T{B}\\\vdots\\{A}_m^T{B} \end{pmatrix} \end{equation}\]Where we are denoting the \(i\)th row of \(A\) as \(A_i^T\).
If we think about what \(x^TB\) means, it’s just the transpose of \(B^Tx\), which is taking a linear combination of the columns of \(B^T\), which are the rows of \(B\). Then \(x^TB\) is just a row vector which is a linear combination of the rows of \(B\). Thus, the rows of \(AB\) are just linear combinations of the rows of \(B\). This means that every row of \(AB\) is in the row space of \(B\), and if \(A\) is invertible, then the row space of \(B\) is going to be entirely preserved!
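A one-line numpy check of that claim (the row vector \(x\) is an arbitrary pick of mine):

```python
import numpy as np

B = np.array([[1, -1, 1, 2],
              [1, -1, 2, 3],
              [1, -1, 3, 4]])
x = np.array([2, 0, -1])           # an arbitrary row vector (my choice)

print(x @ B)                       # [ 1 -1 -1  0]
print(2*B[0] + 0*B[1] - 1*B[2])    # the same combination of the rows of B
```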
This is where elementary matrices come into play. It’s a nifty theorem that every invertible matrix is some product of elementary matrices. That is, any invertible matrix is equivalent to some sequence of row operations. In a way, elementary matrices “generate” the invertible matrices (but not in exactly the same way a set of vectors “generates” or spans a subspace. Here it’s through noncommutative products and not linear combinations).
So when we say an “invertible operator on the left”, that can be thought of as performing some elementary row operations. Which, through all we have thus far established, preserves the row space, and thus also the kernel, and thus also the column relationships.
In case you forgot, the elementary row operations are the following: swapping two rows, multiplying a row by a nonzero scalar, and adding a scalar multiple of one row to another row. I encourage you to convince yourself that these three operations would preserve column relationships.
So, in summary, this is why we do row operations. They preserve the most fundamental aspects of a matrix (besides the image itself): the kernel and which columns generate the image. They can therefore be used to “simplify” a matrix, to its reduced row echelon form, for instance. The RREF, as we have seen, is the clearest and simplest way to see the relationships between the columns of the matrix.
Another quick remark: We’ve established that row reduction preserves the row space. And, looking at the structure of the RREF, it’s clear that the nonzero rows of the RREF are linearly independent. They thus form a basis for the row space. Note also that since we can switch around rows when row reducing, that we must take the RREF rows as our row space basis rather than the corresponding rows of the original matrix. I explain that more in-depth here. But I also want to point out the difference in how the RREF gives us a basis for the row space and column space of the original matrix.
Since row reduction can swap and change rows, we usually don’t end up with rows from the original matrix, but we do end up with a much nicer basis. Columns, on the other hand, never change their relative position, and that’s one of the reasons the relationships are preserved. However, though the rows of the RREF are still in the row space (just linear combinations of the original), the resulting columns are almost always completely different. This is why we go back to the original matrix. But this means that our column space basis is usually not as nice.
In summary: the RREF of \(A\) gives a nicer row space basis not usually containing the rows of \(A\), while it gives a not so nice column space basis using the original columns of \(A\). This means that the RREF of \(A^T\) gives a nicer column space basis for \(A\) not in terms of the original columns, but a not so nice basis for the row space of \(A\) in terms of the original rows.
All this to say, that row reduction not only shows us the relationships between the columns, but it also gives us the nicest possible basis for the rows. That means, we can also use the RREF to get the nicest possible basis for the span of any arbitrary set of vectors by sticking the vectors as the rows of a matrix and row reducing. Functionally also giving us a way to determine dimension computationally: just count the number of nonzero rows of the RREF.
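For instance, here’s that recipe in sympy with a small hypothetical set of vectors:

```python
import sympy as sp

# A hypothetical set of vectors whose span we want a nice basis for (my picks):
vectors = [sp.Matrix([1, 2, 3]), sp.Matrix([2, 4, 6]), sp.Matrix([1, 0, 1])]

# Stick them in as the rows of a matrix and row reduce:
M = sp.Matrix([list(v) for v in vectors])
R, _ = M.rref()

basis = [R.row(i) for i in range(R.rows) if any(R.row(i))]
print(basis)          # the nonzero RREF rows: a nice basis for the span
print(len(basis))     # the dimension (2 here, since the second vector is redundant)
```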
Now, you may realize that I’m only getting to solving systems of equations now. And this may be surprising because row operations and matrices are often introduced FOR solving systems of equations. But hear me out! Everything we have talked about thus far is going to make row reduction’s application to solving systems of equations extremely simple and intuitive.
See, we have thus far established row operations and row reduction as a way to see and make clear the relationship between a matrix’s columns. That is, row reduction can allow us to more easily see how to write one column as a linear combination of the others. Let’s see how this applies to systems of equations.
A system of equations has approximately four “forms”. The one we use most commonly is \(Ax=b\). Each perspective has its own intuitive benefit. \(Ax=b\), for example, emphasizes that we’re trying to find a preimage under the linear transformation defined by \(A\). For example,
\[\begin{pmatrix}1&-1&1\\1&-1&2\\1&-1&3\end{pmatrix}x=\begin{pmatrix}2\\3\\4\end{pmatrix}\]But we can also write it in augmented matrix form: \[\left(\begin{array}{ccc\|c}1&-1&1&2\\1&-1&2&3\\1&-1&3&4\end{array}\right)\]
You may have noticed that this augmented matrix is the matrix \(A\) from the Column relationships section! Recall that we found that Column 4 = Column 1 + Column 3. That is,
\[\begin{pmatrix}1\\1\\1\end{pmatrix} +\begin{pmatrix}1\\2\\3\end{pmatrix}=\begin{pmatrix}2\\3\\4\end{pmatrix}\]And, again, we can easily see this is true by looking at the RREF:
\[\left(\begin{array}{ccc\|c}1&-1&1&2\\1&-1&2&3\\1&-1&3&4\end{array}\right) \sim \left(\begin{array}{ccc\|c}1&-1&0&1\\0&0&1&1\\0&0&0&0\end{array}\right)\]This may seem completely inconsistent with how you learned it. You probably learned to just row reduce and then turn it back into equation form. Which would be
\[\begin{matrix}1x_1&-&1x_2&&&=&1\\&&&&x_3&=&1\end{matrix}\]And then solve for the pivot variables blah blah blah. But, instead, you could just look at the columns.
The augmented column 4, or \(b\) column, is Column 1 + Column 3. So a particular solution is \(\begin{pmatrix}1\\0\\1\end{pmatrix}\). We can identify that the kernel vector, or homogeneous solution, associated with the nonpivot column (Column 2) of the coefficient matrix is \(\begin{pmatrix}1\\1\\0\end{pmatrix}\) because Column 1 + Column 2 = 0. And that gives us our general solution: a particular solution plus a linear combination of the homogeneous solutions (a basis of the kernel of the coefficient matrix).
\[x=\begin{pmatrix}1\\0\\1\end{pmatrix}+c\begin{pmatrix}1\\1\\0\end{pmatrix}\]I’m not saying this is the only good way to solve systems, but I’m trying to convey the fact that with this perspective of matrices, matrix multiplication, and row reduction I’ve set up, the application to solving systems of equations is clear and intuitive, and it can be explained in a way that ties all of these concepts together.
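And, just to close the loop, a quick numpy check that every vector of that form really does solve the system:

```python
import numpy as np

A = np.array([[1, -1, 1],
              [1, -1, 2],
              [1, -1, 3]])
b = np.array([2, 3, 4])

particular  = np.array([1, 0, 1])      # Column 1 + Column 3 = b
homogeneous = np.array([1, 1, 0])      # Column 1 + Column 2 = 0 (a kernel vector)

for c in (0, 1, -3.5):                 # any scalar c works
    x = particular + c*homogeneous
    print(np.allclose(A @ x, b))       # True every time
```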
Alright, now I didn’t want to stick this in the middle of the post. I had a sort of rage flow going. But this is an algorithm that allows you to determine the RREF directly from the kernel. Specifically, from the kernel alone you can actually determine exactly what the nonzero rows of rref(\(A\)) must be. Here’s how:
The algorithm is mostly just something for you to think about. It’s a little advanced to explain exactly why it works, but it has to do with the fact that the row space is the orthogonal complement of the kernel, and the nonzero rows of the RREF are just a specific form of basis for the row space. I encourage you to try to think it through.
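To give you something concrete to poke at, here’s a sympy sketch of one way to turn that fact into a computation (a sketch of my own, leaning on the orthogonal-complement relationship directly; the function name rref_rows_from_kernel is mine): take a basis of the kernel, compute its orthogonal complement as a null space, and row reduce that.

```python
import sympy as sp

def rref_rows_from_kernel(kernel_basis, n):
    """Recover the nonzero rows of rref(A) from a basis of ker(A) inside F^n.

    The row space of A is the orthogonal complement of ker(A), i.e. the null
    space of the matrix K whose rows are the kernel basis vectors. Row reducing
    any basis of that space (stacked as rows) gives the unique nonzero rows of
    rref(A).
    """
    if kernel_basis:
        K = sp.Matrix([list(v) for v in kernel_basis])
        row_space_basis = K.nullspace()                 # ker(K) = row space of A
    else:
        # Trivial kernel: the row space is all of F^n.
        row_space_basis = [sp.eye(n).col(i) for i in range(n)]
    M = sp.Matrix([list(v) for v in row_space_basis])
    R, _ = M.rref()
    return R

# The kernel of the example matrix A from the post:
kernel = [sp.Matrix([1, 1, 0, 0]), sp.Matrix([1, 0, 1, -1])]
print(rref_rows_from_kernel(kernel, 4))
# Matrix([[1, -1, 0, 1], [0, 0, 1, 1]]) -- exactly the nonzero rows of rref(A)
```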