This repository holds materials for the second 6-unit "spoke" half of a new MIT course (Spring 2026) introducing numerical methods and numerical analysis to a broad audience. 18.C21B/16.C21B covers large-scale linear algebra: what do you do when the matrices get so huge that you probably can't even store them as dense arrays, much less run Θ(m³) dense algorithms on them?
- Prerequisites: 18.03, 18.06, or equivalents, and some programming experience. You should have taken the first half-semester numerical "hub" 18.C21/16.C21, or alternatively any other introductory numerical-methods course (e.g. 18.330, 18.C25, 18.C20, 16.90, 2.086, 12.010, 6.7330J/16.920J/2.097J, or 6.S955).
Taking both the hub and any spoke will count as an 18.3xx class for math majors, similar to 18.330, and as 16.90 for course-16 majors.
Instructor: Prof. Steven G. Johnson.
Lectures: MWF10 in 2-142 (Mar 30 – May 11); slides and notes posted below.
Grading (all assignments submitted electronically via Gradescope on Canvas):
- 30% four weekly psets, due Wednesdays at midnight: April 8, 15, 22, and 29.
- 10% pset check-ins: oral check-ins on selected psets, where you explain your work; pass/fail per problem.
- 30% final project: due May 11 (last day of class). The final project will be an 8–15 page paper reviewing, implementing, and testing some interesting numerical linear-algebra algorithm not covered in the course. A 1-page proposal will be due April 17 (but you are encouraged to submit earlier). See final-project/proposal information.
- 20% final-project presentation in class (last week).
- 10% attendance
Collaboration policy: Talk to anyone you want to and read anything you want to, with two caveats: First, make a solid effort to solve a problem on your own before discussing it with classmates or googling. Second, no matter whom you talk to or what you read, write up the solution on your own, without having their answer in front of you (this includes ChatGPT and similar). (You can use psetpartners.mit.edu to find problem-set partners.)
Teaching Assistant: Rodrigo Arietta Candia
Office Hours: Wednesday 3pm in 2-345 (Prof. Johnson)
Resources: Piazza discussion forum, math learning center, pset partners.
Textbook: No required textbook, but suggestions for further reading will be posted after each lecture. The book Fundamentals of Numerical Computation (FNC) by Driscoll and Braun is freely available online, has examples in Julia, Python, and Matlab, and is a valuable resource. Another important textbook for the course is Numerical Linear Algebra by Trefethen and Bau. (Readable online with MIT certificates, and there is also a PDF posted online at uchicago, though this is a graduate-level textbook and hence is somewhat more sophisticated than the coverage in this course.)
This document is a brief summary of what was covered in each lecture, along with links and suggestions for further reading. It is not a good substitute for attending lecture, but may provide a useful study guide.
- Overview and syllabus (this web page).
- Julia notebook with scaling examples
Reviewed the fact that traditional "dense" linear-algebra algorithms (factorizations LU, QR, diagonalization, SVD, etc.), which assume little or no special structure of the matrix, typically require Θ(m³) operations (and Θ(m²) storage) for m×m matrices.
However, huge matrices (m in the millions or more) arise all the time in practice, and for these, Θ(m³) computation, or even Θ(m²) storage, is impractical.
The trick is that huge matrices typically have some special structure that you can exploit, and the most common such structure is sparsity: the matrix entries are mostly zero. Ideally, an m×m matrix with O(m) nonzero entries should require only O(m) storage, and close to O(m) work to solve.
Showed how a sparse matrix, in fact a symmetric tridiagonal matrix, arises from discretizing a simple PDE on a grid with finite differences: Poisson's equation −u″(x) = f(x), whose second-difference matrix has 2 on the diagonal and −1 on the adjacent off-diagonals.
How can we generalize this to other sparsity patterns and other types of large-scale matrix structures?
Further reading: FNC book section 8.1: sparsity and structure. The example of discretizing a 1d Poisson equation with finite differences, resulting in a tridiagonal matrix, is discussed in many sources, e.g. these UTexas course notes.
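As an illustrative sketch (not code from the course), the 1d finite-difference Poisson discretization above can be assembled and solved with scipy.sparse; the grid size, right-hand side f, and Dirichlet boundary conditions here are made-up example choices:

```python
# Build the tridiagonal second-difference matrix for -u'' = f on (0,1)
# with u(0) = u(1) = 0, and solve it with a sparse-direct solver.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

m = 100                      # number of interior grid points
h = 1.0 / (m + 1)            # grid spacing
# second differences: 2 on the diagonal, -1 on the adjacent off-diagonals
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m), format="csc") / h**2

x = np.linspace(h, 1 - h, m)
f = np.pi**2 * np.sin(np.pi * x)        # chosen so the exact solution is sin(pi x)
u = spla.spsolve(A, f)

print(np.max(np.abs(u - np.sin(np.pi * x))))   # O(h^2) discretization error
```

Note that A has only 3m − 2 nonzero entries, so storage is O(m) rather than the O(m²) of a dense array.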
- Sparse matrices and data structures
- sparse-matrix slides from 18.335 (Fall 2006)
- Julia notebook on dense and sparse matrices from 18.06 (Fall 2022)
- pset 1: due Wed, Apr 8 at midnight
Further reading: Sparse matrices in CSC format, along with sparse-direct algorithms, are provided in Julia by the SparseArrays standard library, and in Python by scipy.sparse, along with additional packages that provide other algorithms and data structures; for example, a famous C library for this is PETSc, which also has Python and Julia interfaces, and there are many sparse-direct libraries such as MUMPS and Pardiso. Sparse-direct solvers are described in detail by the book Direct Methods for Sparse Linear Systems by Davis; the corresponding software library is Davis's SuiteSparse, which is used by SparseArrays in Julia and is available in Python via scikit-sparse. We will soon start talking about iterative methods; more advanced treatments include the book Numerical Linear Algebra by Trefethen and Bau, and surveys of algorithms can be found in the Templates books for Ax=b and Ax=λx. Some crude rules of thumb for solving linear systems (from 18.335 spring 2020) may be useful.
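To see what CSC storage actually looks like, here is an illustrative peek at the three arrays scipy.sparse uses for a small made-up matrix:

```python
# CSC (compressed sparse column) stores three arrays: the nonzero values
# (column by column), the row index of each value, and pointers marking
# where each column's data begins.
import numpy as np
import scipy.sparse as sp

A = sp.csc_matrix(np.array([[1.0, 0.0, 2.0],
                            [0.0, 0.0, 3.0],
                            [4.0, 5.0, 0.0]]))

print(A.data)      # [1. 4. 5. 2. 3.]  (nonzeros, column-major)
print(A.indices)   # [0 2 2 0 1]       (row index of each nonzero)
print(A.indptr)    # [0 2 3 5]         (column j is data[indptr[j]:indptr[j+1]])
```

So a matrix with nnz nonzeros costs nnz values, nnz row indices, and m+1 column pointers, instead of m² dense entries.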
Iterative methods: the big picture is to solve Ax=b (or Ax=λx) by a sequence of approximate steps, each of which uses only matrix–vector products $Av$, converging towards the solution.
- Pro: can be fast whenever $Av$ is fast (e.g. if $A$ is sparse, low-rank, Toeplitz, etc.). Can scale to huge problems.
- Con: hard to design an iteration that converges quickly; how well the methods work is often problem-dependent, and often requires problem-dependent tuning and experimentation (e.g. preconditioners).
The simplest iterative method is the power method for eigenvalues: repeatedly multiplying a random initial vector by $A$ (normalizing on each step) converges towards an eigenvector of the largest-magnitude eigenvalue.
Analyzed the convergence of the power method: if the eigenvalues are ordered |λ₁| > |λ₂| ≥ ⋯, the error in the eigenvector estimate shrinks proportional to |λ₂/λ₁|ⁿ after n iterations, so convergence is slow when the two largest-magnitude eigenvalues are close together.
Given an estimate x for an eigenvector, a good estimate for the corresponding eigenvalue is the Rayleigh quotient R(x) = xᵀAx/xᵀx.
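A minimal power-method sketch (illustrative, not the course's code; the diagonal test matrix is made up so that the eigenvalues are known):

```python
# Power method: repeatedly multiply by A and normalize, then estimate the
# eigenvalue with the Rayleigh quotient.
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([10.0, 3.0, 1.0])      # test matrix with known eigenvalues
x = rng.standard_normal(3)         # random starting vector

for _ in range(50):
    x = A @ x
    x /= np.linalg.norm(x)         # normalize to avoid overflow/underflow

lam = x @ A @ x / (x @ x)          # Rayleigh quotient
print(lam)                         # converges to the largest |lambda|, here 10
```

After 50 iterations the eigenvector error is roughly (3/10)⁵⁰, so `lam` matches λ₁ = 10 to machine precision.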
To find other eigenvectors and eigenvalues, one possibility is an algorithm called deflation. It exploits the fact that for real-symmetric A = Aᵀ the eigenvectors are orthogonal: once you know the dominant eigenpair (λ₁, q₁), you can apply the power method to the "deflated" matrix A − λ₁q₁q₁ᵀ, which has the same eigenvectors as A but with λ₁ replaced by 0, so the iteration converges to the next-largest |λ|.
Deflation is a terrible scheme if you want the smallest-magnitude eigenvalue, however, since you'd have to compute all the other eigenvalues/vectors first. Instead, to find the smallest |λ| one can simply apply the power method to A⁻¹ (whose largest-magnitude eigenvalue is 1/λ for the smallest-|λ| eigenvalue of A): at each step, solve a linear system with A, e.g. using a factorization of A computed once up front. This is called inverse iteration.
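An illustrative inverse-iteration sketch (not from the course materials; the diagonal test matrix is made up). The key practical point is that A is factorized once, so each iteration is just a cheap triangular solve:

```python
# Inverse iteration: the power method applied to A^{-1}, implemented by
# factorizing A once (sparse LU) and solving a system at each step.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

A = sp.csc_matrix(np.diag([4.0, 1.0, 0.25]))   # known eigenvalues for checking
lu = spla.splu(A)                              # factorization computed ONCE

rng = np.random.default_rng(1)
x = rng.standard_normal(3)
for _ in range(60):
    x = lu.solve(x)            # one step of the power method on A^{-1}
    x /= np.linalg.norm(x)

lam = x @ (A @ x)              # Rayleigh quotient (x is already normalized)
print(lam)                     # converges to the smallest |lambda|, here 0.25
```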
Further reading: FNC book section 8.2: power iteration and section 8.3: inverse iteration. Trefethen & Bau, lecture 27. See this notebook on the power method from 18.06.
- Handwritten notes from spring 2025, page 29+
If you want to find the smallest |λ| instead of the biggest, you can simply apply the power method to A⁻¹: instead of multiplying by A at each step, solve a linear system with A (typically by factorizing A once up front). This is inverse iteration, from last lecture.
Proved that, for a real-symmetric (Hermitian) matrix A=Aᵀ, the Rayleigh quotient R(x)=xᵀAx/xᵀx is bounded above and below by the largest and smallest eigenvalues of A (the "min–max theorem"). This theorem is useful for lots of things in linear algebra. Here, it helps us understand why the Rayleigh quotient is so accurate: the power method converges to a maximum-|λ| eigenvalue, which is either the smallest (most negative) or the largest (most positive) λ of a real-symmetric A, and hence that λ is an extremum (minimum or maximum) of the Rayleigh quotient where its gradient is zero. In fact, you can show that ∇R=0 for any eigenvector (not necessarily min/max λ). This means, if we Taylor expand R(x+δx) around an eigenvector x where R(x)=λ, you get R(x+δx)=λ+O(‖δx‖^2): the error in the eigenvalue λ goes as the square of the error in the eigenvector (for real-symmetric A).
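A quick numerical check of this quadratic accuracy (illustrative; the diagonal test matrix and perturbation direction are made up):

```python
# For real-symmetric A, the Rayleigh-quotient error should scale as the
# SQUARE of the eigenvector error: R(x + eps*d) = lambda + O(eps^2).
import numpy as np

A = np.diag([5.0, 2.0, 1.0])     # symmetric; eigenvector e1 has lambda = 5
v = np.array([1.0, 0.0, 0.0])    # exact eigenvector
d = np.array([0.0, 1.0, 1.0]) / np.sqrt(2.0)   # unit perturbation direction

def R(x):
    return x @ A @ x / (x @ x)   # Rayleigh quotient

for eps in [1e-2, 1e-3, 1e-4]:
    print(eps, abs(R(v + eps * d) - 5.0))   # error shrinks ~100x per 10x in eps
```

Shrinking the eigenvector error by 10× shrinks the eigenvalue error by about 100×, as the Taylor expansion predicts.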
Above, we considered inverse iteration. A more general idea is shifted inverse iteration: at each step, compute x ← (A − μI)⁻¹x (normalized), for some shift μ. This converges to the eigenvalue closest to μ, and converges quickly if μ is much closer to that eigenvalue than to any other, so a good guess μ ≈ λ gives extremely rapid convergence.
But where would you get a good guess for λ? A simple answer is to use the Rayleigh quotient R(x), where x comes from previous steps of the power iteration. Even if the power iteration is converging slowly, once you have even a rough approximation for λ you can use it as a shift. This leads to the algorithm of Rayleigh-quotient iteration: at each step, compute x ← (A − R(x)I)⁻¹x (normalized), updating the shift from the Rayleigh quotient of the current iterate. Each step requires a new factorization (the shift changes), but the convergence is extremely fast: asymptotically cubic for Hermitian problems, roughly tripling the number of correct digits per step.
The big problem with Rayleigh-quotient iteration, like Newton's method, is the need for a good initial guess — if you have a bad initial guess, it can be quite unpredictable what eigenvalue it converges to! But any time you can obtain a rough idea of where the desired eigenvalue is, it means you can zoom into the exact value extremely quickly.
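A Rayleigh-quotient-iteration sketch (illustrative; the small symmetric test matrix and initial guess are made up, and dense solves stand in for the sparse factorizations you would use at scale):

```python
# Rayleigh-quotient iteration: shifted inverse iteration where the shift is
# updated to the Rayleigh quotient of the current iterate at every step.
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
x = np.array([0.0, 0.0, 1.0])      # rough initial eigenvector guess
mu = x @ A @ x / (x @ x)           # initial Rayleigh-quotient shift

for _ in range(10):
    try:
        # each step needs a NEW solve/factorization, since the shift changed
        y = np.linalg.solve(A - mu * np.eye(3), x)
    except np.linalg.LinAlgError:
        break                      # shift hit an eigenvalue exactly: done
    x = y / np.linalg.norm(y)
    mu_new = x @ A @ x             # x is normalized, so no denominator needed
    if abs(mu_new - mu) < 1e-12:
        mu = mu_new
        break
    mu = mu_new

print(mu)    # converges very rapidly to an eigenvalue near the initial shift
```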
Further reading: FNC book section 8.3: inverse iteration; however, beware that the book currently shows a less accurate (for real-symmetric/Hermitian A) method to estimate eigenvalues (issue fnc#16). Trefethen & Bau, lecture 27 covers these algorithms in much more depth. These slides by Per Persson (2006) are a useful summary.
- Handwritten notes from spring 2025, page 35+
- pset 1 solutions: coming soon
- pset 2: due April 15
To find other eigenvectors and eigenvalues of a Hermitian problem, one possibility is an algorithm called deflation. It exploits the fact that for real-symmetric A = Aᵀ the eigenvectors are orthogonal, so once you have found the dominant eigenpair (λ₁, q₁) you can apply the power method to the deflated matrix A − λ₁q₁q₁ᵀ, which replaces λ₁ by 0 and hence converges to the next-largest |λ|.
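A deflation sketch (illustrative; the diagonal test matrix is made up so the answer is known, and the dominant eigenpair is assumed to have been found already):

```python
# Deflation: given the dominant eigenpair (lambda1, q1) of a symmetric A,
# run the power method on A - lambda1 * q1 q1^T, whose eigenvalues are those
# of A but with lambda1 replaced by 0.
import numpy as np

A = np.diag([5.0, 3.0, 1.0])
q1 = np.array([1.0, 0.0, 0.0])      # dominant eigenvector (lambda1 = 5)
B = A - 5.0 * np.outer(q1, q1)      # deflated matrix: eigenvalues 0, 3, 1

rng = np.random.default_rng(2)
x = rng.standard_normal(3)
for _ in range(60):
    x = B @ x
    x /= np.linalg.norm(x)

print(x @ A @ x)    # Rayleigh quotient with A: the second eigenvalue, 3
```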
Introduced Krylov subspaces, and the idea of Krylov subspace methods: ideally, we want to find the "best" solution in the whole subspace 𝒦ₙ spanned by {x₀,Ax₀,...,Aⁿ⁻¹x₀}, which is the only subspace you can get starting from x₀ if you are only allowed linear operations and matrix–vector products.
The Arnoldi algorithm is a Krylov algorithm for eigenproblems. It basically has two components:
- Find an orthonormal basis Qₙ for 𝒦ₙ. Essentially, we will do this by a form of Gram–Schmidt, to be determined.
- Given the basis, give the "best" estimate in 𝒦ₙ for one or more eigenvectors and eigenvalues.
How do we construct the orthonormal basis Qₙ? The key point is that 𝒦ₙ₊₁ adds only one new direction (Aⁿx₀) to 𝒦ₙ, so we can build the basis one column at a time: given Qₙ, multiply the last basis vector qₙ by A and orthogonalize Aqₙ against the previous columns (Gram–Schmidt) to obtain qₙ₊₁.
Further reading: FNC book, section 8.4 on Krylov subspaces and Arnoldi. Trefethen lecture 33 on Arnoldi. This 2009 course on numerical linear algebra by Zhaojun Bai has useful notes on Krylov methods, including a discussion of the Rayleigh–Ritz procedure.
Discussed what it means to find the "best" solution in the Krylov subspace 𝒦ₙ. Discussed the general principle of Rayleigh–Ritz methods for approximately solving the eigenproblem in a subspace: finding the Ritz vectors/values (= eigenvector/value approximations) with a residual perpendicular to the subspace (a special case of a Galerkin method).
For Hermitian matrices A, I showed that the max/min Ritz values are the maximum/minimum of the Rayleigh quotient in the subspace, via the min–max theorem. In this sense, at least for Hermitian matrices, the Ritz vectors are optimal in the sense of maximizing (or minimizing) the Rayleigh quotient in the Krylov space. Another sense in which they are optimal for Hermitian A is that the Ritz vectors/values minimize ‖AV-VD‖₂ over all possible orthonormal bases V of the Krylov space and all possible diagonal matrices D (see the Bai notes below for a proof). (Later, we will discuss an "optimal polynomial" interpretation of Arnoldi+Rayleigh–Ritz from Trefethen lecture 34.)
Moreover, showed that the dot products taken during the Gram–Schmidt process are exactly the entries of our Rayleigh–Ritz matrix Hₙ = Qₙ*AQₙ, so we get the Ritz matrix "for free" from the Arnoldi process. Furthermore, Hₙ is upper-Hessenberg: zero below the first subdiagonal.
Showed that in the case where A is Hermitian, Hₙ is Hermitian as well; since it is also upper-Hessenberg, it must be tridiagonal, and most of the dot products in the Arnoldi process are exactly zero. Hence Arnoldi reduces to a three-term recurrence, and the Ritz matrix is tridiagonal. This is called the Lanczos algorithm.
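An illustrative Lanczos sketch (random symmetric test matrix made up for the example; in exact arithmetic this is Arnoldi, but note the loss-of-orthogonality caveat discussed below):

```python
# Lanczos: for real-symmetric A, Arnoldi's Gram-Schmidt collapses to a
# three-term recurrence, and the Ritz matrix is tridiagonal with diagonal
# alpha and off-diagonal beta.
import numpy as np

def lanczos(A, b, n):
    m = len(b)
    Q = np.zeros((m, n + 1))
    alpha = np.zeros(n)
    beta = np.zeros(n)
    Q[:, 0] = b / np.linalg.norm(b)
    for j in range(n):
        v = A @ Q[:, j]
        alpha[j] = Q[:, j] @ v
        v -= alpha[j] * Q[:, j]
        if j > 0:
            v -= beta[j - 1] * Q[:, j - 1]   # only a three-term recurrence
        beta[j] = np.linalg.norm(v)
        Q[:, j + 1] = v / beta[j]            # assumes no breakdown (beta != 0)
    return Q, alpha, beta

rng = np.random.default_rng(4)
A = rng.standard_normal((40, 40))
A = A + A.T                                  # symmetric test matrix
n = 15
Q, alpha, beta = lanczos(A, rng.standard_normal(40), n)
T = np.diag(alpha) + np.diag(beta[:-1], 1) + np.diag(beta[:-1], -1)
print(np.linalg.eigvalsh(T).max())           # Ritz estimate of the largest eigenvalue
```

Each step touches only three basis vectors, so the work per iteration is one matrix–vector product plus O(m) vector operations, independent of n.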
Further reading: for Gram–Schmidt, see e.g. Strang Intro to Linear Algebra, chapter 4, and Strang 18.06 lecture 17. Modified Gram–Schmidt is analyzed in Trefethen lecture 8, and a detailed analysis with proofs can be found in e.g. this 2006 paper by Paige et al. [SIAM J. Matrix Anal. Appl. 28, pp. 264-284]. See also Per Persson's 18.335 slides on Gram–Schmidt. See also the links on Arnoldi from last lecture.
Reviewed some experimental results with a very simple implementation of the Arnoldi algorithm (see notebook above). Arnoldi indeed converges much faster than power iterations, and can give multiple eigenvalues at once. Like the power method, convergence is slower if the desired eigenvalues are clustered closely with undesired ones. Unlike the power method, it can converge not just to the largest |λ| but to any desired "edge" of the set of eigenvalues (the "spectrum"), e.g. the λ with the most positive or most negative real parts. Also unlike the power method, the convergence of the Arnoldi algorithm is shift-invariant: it is the same for A and A + μI for any shift μ, since shifting changes the eigenvalues but not the Krylov subspace.
Discussed how rounding problems cause a loss of orthogonality in Lanczos, leading to "ghost" eigenvalues where extremal eigenvalues re-appear in the Ritz values. In Arnoldi, we explicitly store and orthogonalize against all of the previous Krylov vectors, so this problem does not arise, but at the cost of Θ(mn) storage and Θ(mn²) work for n iterations with an m×m matrix.
A solution to the loss of orthogonality in Lanczos and the growing computational effort in Arnoldi, along with the growing storage, is restarting schemes, where we go for n steps and then restart with the k "best" eigenvectors. If we restart with k=1 every step, then we essentially have the power method, so while restarting makes the convergence worse, the algorithm still converges (at least for the largest |λ| eigenvalues), and converges significantly faster than the power method for n>1.
Further reading: Trefethen, lecture 36. See the section on implicitly restarted Lanczos in Templates for the Solution of Algebraic Eigenvalue Problems. Restarting schemes for Arnoldi (and Lanczos) turn out to be rather subtle — you first need to understand why the most obvious idea (taking the best k Ritz vectors and simply starting a new iteration from them) runs into trouble, which motivates the "implicit" restarting schemes that preserve a Krylov structure.