Sequential Monte Carlo: A Unified Review

Sequential Monte Carlo methods, also known as particle filters, offer approximate solutions to filtering problems for nonlinear state-space systems. These filtering problems are notoriously difficult to solve in general due to a lack of closed-form expressions and challenging expectation integrals. The essential idea behind particle filters is to employ Monte Carlo integration techniques in order to ameliorate both of these challenges. This article presents an intuitive introduction to the main particle filter ideas and then unifies three commonly employed particle filtering algorithms. This unified approach relies on a nonstandard presentation of the particle filter, which has the advantage of highlighting precisely where the differences between these algorithms stem from. Some relevant extensions and successful application domains of the particle filter are also presented.


INTRODUCTION
Decision-making in the presence of uncertainty is a fundamental aspect of our modern world. For example, consider an autonomous car that is faced with an obstacle in its path. Assuming that it is important to avoid colliding with obstacles, a decision is required to determine the best course of action for the steering, braking, and acceleration. Importantly, these decisions must be made without complete knowledge of the environment, the response of the vehicle, or even the location and orientation of the car. Another example is the many decisions surrounding the best way to manage a disease epidemic. These decisions are inevitably made without perfect knowledge of their impact.
In any case, it is wise to consider any prior knowledge and available evidence when making decisions. For example, using prior knowledge of vehicle dynamics and measurements from sensors (such as GPS, lidar, radar, inertial measurement units, and cameras) has proven to be hugely successful in autonomous vehicle applications (1). Similarly, dynamic models of disease spread and infection rate measurements have been successfully combined to help predict outbreaks and therefore allow decision makers to take evasive action (2).
In many important cases, new evidence is obtained sequentially. Therefore, it is also prudent to repeat the decision-making process as the sequence evolves; otherwise, actions may lead to devastating consequences. For example, forward-looking vehicle lidar measurements may indicate the presence of an obstacle that was previously (in time) occluded from view. Ignoring this new information may have catastrophic effects.
Combining prior knowledge with evidence that is revealed sequentially is the topic of this article. In a general sense, these types of problems can be considered within a so-called filtering framework. It is important to recognize that there are many different theoretical frameworks for considering this filtering problem. This article is concerned with a probabilistic approach, where tools from probability theory and statistics are used to describe levels of belief or uncertainty.
We will consider a general filtering problem that is expressed in so-called probabilistic state-space form (3). In principle, the solution to this filtering problem is already known and relies on the sequential use of Bayes' rule, among other probabilistic identities. There are two fundamental problems with this general solution: (a) It is impossible to obtain closed-form solutions in general, and (b) the solution requires evaluation of potentially large-dimensional integrals.
Due to the importance of filtering and the difficulty associated with solving the general problem, enormous research attention has been directed toward various approximations. In many cases, these approximations require implicitly or explicitly modifying the problem from its original form in order to employ a particular approximation method (this essentially simplifies the integration problem). Two well-known examples include the extended Kalman filter (4) and unscented Kalman filter (5) approaches. These approaches are very popular and attractive due to their performance and low computational complexity. A potential drawback of these methods is that the approximation accuracy cannot, in general, be arbitrarily improved.
This article reviews a popular alternative approximation where the problem remains unaltered from its stated form and the approximation of the required integrals can be made arbitrarily accurate. The key mechanism relies on Monte Carlo integration techniques, which are applicable to a broad set of problems. Since the filtering problem requires a sequential update based on newly available evidence, we employ a specialized form of Monte Carlo integration called sequential Monte Carlo (SMC) methods. The SMC methods are also often called particle filters.
These ideas have their origins in works by Gordon et al. (6), Kitagawa (7), and Stewart & McCarty (8), and there have been many relevant reviews (9-11) and important monographs (12-16). Importantly, the efficacy of SMC methods has been demonstrated in many and disparate situations, from autonomous vehicles (1) to disease modeling (17) to machine learning (18) to searching for the MH370 aircraft (19). This article is therefore not intended to argue the case for SMC methods. Rather, we are interested in achieving two primary goals: First, we intend to introduce the key ideas in an intuitive manner using a pedagogical example of estimating the position and velocity of a target mirror using the interferometer apparatus shown in Figure 1a, and second, we abstract from the particulars of this example and present the details of three popular particle filters in a unified manner.
In particular, we use a nonstandard presentation of the particle filter in order to unify three commonly used algorithms.This perspective highlights the different ways of employing Monte Carlo integration and the impact of these choices within the filter.
The article is organized as follows. Section 2 presents a motivating example and the essential ideas underlying the particle filter. Section 3 details the key technical ideas and a unifying framework to consider several popular particle filtering variants. Section 4 shows how the particle filters play an important role in other related fields and how they can be generalized to a broad class of problems. Section 5 provides some concluding remarks. It is assumed that the reader has a basic working knowledge of probability theory and statistics, but for those who do not, potentially helpful resources include books by Gut (20) and Gelman et al. (21).

A PEDAGOGICAL EXAMPLE
This section introduces the key principles and assumptions underlying SMC methods. To make these ideas concrete, we introduce the essential components by way of a pedagogical example. It is important to note that this example is intended for illustrative purposes and is not, in any way, intended to represent state-of-the-art estimation in this case.
More specifically, we consider a problem of estimating the position and velocity of a moving mirror using an interferometer apparatus (shown in Figure 1a). Interferometry dates back more than 100 years and was vital in accurately measuring flatness, an essential ingredient in the development of modern science and engineering (22). Perhaps the most famous interferometers today are the two large instruments that form the Laser Interferometer Gravitational-Wave Observatory (LIGO). This apparatus was used in the 2015 detection of gravitational waves (23), for which Rainer Weiss, Barry Barish, and Kip Thorne were awarded the 2017 Nobel Prize in Physics.
The particular setup in Figure 1a is a homodyne Michelson interferometer that is commonly used for very-high-precision displacement measurements (see, e.g., 24).These displacement measurements have found application in many areas, including vibration measurement, gas flow analysis, high-precision gyroscopes, and high-precision position control, to name just a few (25).
The basic principle of interferometry is to use the wave properties of light to measure distance. Splitting a coherent light source into two separate paths and then recombining them will generate an interference pattern. Differences in the path length for the split light will produce changes in the interference pattern, which is the observed quantity. That is, assuming a light wavelength of λ, path-length differences that are integer multiples of λ result in constructive interference, which results in high levels of measured light intensity. Fractional path-length differences result in lower levels of measured light intensity, and fully destructive interference results in the lowest levels of measured light intensity.
To make these ideas more concrete, the apparatus shown in Figure 1a can be represented by the schematic in Figure 1b. The laser, with wavelength λ, provides a coherent light source that is split in two using a beam splitter. The split light is then reflected by two separate mirrors, one that is fixed (the top mirror, called the reference mirror) and one that is allowed to move (the rightmost mirror, called the target mirror). The moving mirror then provides a time-dependent path-length difference. Between the beam splitter and the target mirror is a wave plate, which adds a λ/8 phase shift between the horizontal and vertical polarization directions of the light. This wave plate adds the same phase shift on the returning light, for a total λ/4 phase shift between polarizations. The polarizing beam splitter then splits the incident light depending on the polarization direction, and the newly split light intensity is detected by the two photodetectors.
If we label the two measured intensities from the photodetectors as y_1 and y_2, then we can model the measured signals as

y_1 = α_1 + β_1 cos(κd) + v_1,  (1)
y_2 = α_2 + β_2 sin(κd) + v_2,  (2)

where d is the position of the target mirror and κ ≜ 2π/λ. The sine and cosine terms stem from the wave properties of light and the phase shift introduced by the wave plate. The parameters α_{1,2} and β_{1,2} account for the offsets and gain terms of the measured intensities, respectively. The noise terms v_{1,2} account for uncertainty in the measured intensity. Figure 2a shows a segment of the measured light intensities for the simulated target mirror position shown in Figure 3b, and Figure 2b shows a scatter plot of the measurements for the full data sequence.
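As a concrete illustration, this measurement model can be sketched in code. All numerical values below (wavelength, offsets α, gains β, and the noise level), as well as the function names, are assumptions for demonstration only, not values from the article's simulation.

```python
import numpy as np

# Sketch of the interferometer measurement model (Equations 1 and 2).
# All numerical values are assumed for illustration.
LAM = 633e-9                     # assumed HeNe laser wavelength (m)
KAPPA = 2.0 * np.pi / LAM        # kappa = 2*pi / lambda

ALPHA = np.array([1.0, 1.0])     # intensity offsets alpha_1, alpha_2 (assumed)
BETA = np.array([0.5, 0.5])      # intensity gains beta_1, beta_2 (assumed)

def g(d):
    """Noise-free intensities for target-mirror position d."""
    return ALPHA + BETA * np.array([np.cos(KAPPA * d), np.sin(KAPPA * d)])

def measure(d, noise_std=0.01, rng=np.random.default_rng(0)):
    """Noisy measurement y = g(d) + v with v ~ N(0, noise_std^2 I)."""
    return g(d) + noise_std * rng.standard_normal(2)

y = measure(100e-9)
```

Note that `g` is periodic in `d` with period λ, which is the source of the position ambiguity discussed later in this section.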
(Figure 3a is further discussed below, and Figure 3b-d is further explained at the end of this section.) Assuming that the intensities can be measured at regular time intervals spaced T seconds apart, we use the notation y_1(k) and y_2(k) to indicate the measured intensities at time instant k, where k is an integer called the discrete-time index. We can further collect these measurements at index k in the so-called output variable y_k, defined as

y_k ≜ [y_1(k), y_2(k)]^T.  (3)

Given these intensity measurements, the aim is to estimate both the position d and the associated velocity ḋ at each discrete-time index k. We can assume for simplicity that the accelerations causing the mirror to move are unknown, and we can therefore model the discrete-time evolution of position and velocity via a simple kinematic, stochastic state-space model,

x_{k+1} = [1 T; 0 1] x_k + η_k,   x_k ≜ [d(k), ḋ(k)]^T,  (4)

where the implicit definition of x_k is the so-called model state vector, and η_k is a random variable used to model uncertainty in the state evolution model.
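A minimal sketch of this constant-velocity state evolution follows; the sampling interval T and the process-noise covariance Q are assumed values chosen purely for illustration.

```python
import numpy as np

# Sketch of a constant-velocity kinematic model with state
# x = [position, velocity]; T and Q are assumed illustrative values.
T = 1e-3                                  # assumed sampling interval (s)
A = np.array([[1.0, T],
              [0.0, 1.0]])               # position advances by T * velocity

def transition(x, Q=np.diag([1e-18, 1e-12]), rng=np.random.default_rng(1)):
    """One stochastic step x_{k+1} = A x_k + eta_k, with eta_k ~ N(0, Q)."""
    return A @ x + rng.multivariate_normal(np.zeros(2), Q)

x_next = transition(np.array([0.0, 1e-6]))
```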
Restating the aim, we wish to estimate the state x_k (the position and velocity of the target mirror) given all the measurements y_1:k ≜ {y_1, . . ., y_k}. There are many ways to attack this, but here we follow a probabilistic approach. The main idea is to model our belief of the state using a probability density function. It is important to clarify that the actual position and velocity are not uncertain; rather, it is our knowledge that is uncertain. In particular, we are interested in providing a distribution of the state conditioned on all of the available data, that is, we seek to find p(x_k | y_1:k).
It is perhaps not immediately clear how we can arrive at this distribution, which we now turn our attention to. Suppose that our belief of the state at time k = 1, prior to obtaining any measurements, can be described via the distribution p(x_1). For example, suppose that we believe that p(x_1) can be adequately modeled via a multivariate normal distribution according to

p(x_1) = N(x_1; μ_1, P_1),  (7)

where μ_1 is the mean and P_1 is the covariance. The actual values employed for each depend on prior knowledge. Figure 3a shows samples (also referred to as particles) from p(x_1).
Then, at time k = 1, we are provided with light intensity measurements y_1, and we would like to update our belief of the state; that is, we want p(x_1 | y_1). To this end, we can employ Bayes' rule to obtain

p(x_1 | y_1) = p(y_1 | x_1) p(x_1) / p(y_1).  (8)

For this to be useful, we need to consider each of the distributions on the right-hand side. We start by noting that the denominator term p(y_1) is a normalizing constant that ensures a proper density function, and it may be safely ignored for now. Concerning p(x_1), notice that this is the prior distribution available in Equation 7. We now concentrate on p(y_1 | x_1), which describes our belief of the measurements y_1 given the state. This is a fundamentally important object, known as the measurement likelihood model, relating the state to the measurements.
We already have such a relationship described in Equation 3, but it is not quite in the required probabilistic form p(y_1 | x_1). To remedy this, we will further assume a distribution for the measurement noise term v_k, which is independent over k and where for each k,

v_k ∼ N(0, R).  (9)

With this in place, we can then say that the distribution of y_1 given x_1 is multivariate normal via

p(y_1 | x_1) = N(y_1; g(x_1), R),  (10)

where g(x_k) is defined as (refer to Equation 3)

g(x_k) ≜ [α_1 + β_1 cos(κ x_k(1)), α_2 + β_2 sin(κ x_k(1))]^T,  (11)

where x_k(1) is used to indicate the position state at discrete-time index k. Put another way, we believe that given the state x_1, the measurements y_1 are a realization from p(y_1 | x_1), and that if the target mirror were fixed, then repeated measurements would just be different realizations from p(y_1 | x_1). In principle, with the measurement likelihood model in Equation 10 and prior in Equation 7 in place, we can describe the posterior p(x_1 | y_1) via Equation 8. At the same time, it is generally not possible to express this posterior in closed form. A major benefit of SMC methods is that they approximate the posterior using a finite number of terms. The essential idea is to represent the filtered density p(x_1 | y_1) as an empirical distribution represented using a finite weighted sum of point-mass distributions (Dirac delta functions),

p(x_1 | y_1) ≈ Σ_{i=1}^M w_1^i δ(x_1 − x_1^i),  (12)

where the locations of the point masses are determined by the so-called particles x_1^i. The associated weight w_1^i represents the relative importance of the ith particle. To be more specific, consider each of the M = 1,000 samples for the initial state x_1 ∼ p(x_1) from Figure 3a. Intuitively, we can determine the importance of each particle x_1^i by pretending that the target mirror had this position and velocity and then generating a virtual measurement using this state via Equation 11. This can then be compared with the actual measurement, where strong agreement would result in high importance, and strong disagreement would result in low importance. This comparison is neatly captured by our likelihood model
(Equation 10), which will have larger values when y_1 − g(x_1^i) is small and small values when y_1 − g(x_1^i) is large. Therefore, intuitively speaking, we could weight each of the particles by the following so-called importance weight:

w_1^i = p(y_1 | x_1^i) / Σ_{j=1}^M p(y_1 | x_1^j).  (13)

For the current example, Figure 3a also shows that 99.99% of the importance comes from just 30 particles. This indicates that most of the prior states are extremely unlikely, and we have effectively reduced our options to just 30 possibilities. Also note that the likely particles occur in bands, spaced roughly λ apart in the position state, which stems from the cyclic nature of the measurements. Suppose, for the moment, that at time k = 2 we are presented with new light intensity measurements y_2 and wish to provide a distribution of the state x_2 based on all the measurements so far (i.e., y_1, y_2). That is, we seek

p(x_2 | y_1:2) = p(y_2 | x_2, y_1) p(x_2 | y_1) / p(y_2 | y_1),  (14)

where the right-hand side again stems from application of Bayes' rule. These distributions require some discussion. The denominator term is again a normalizing constant and can be safely ignored.
The term p(y_2 | x_2, y_1) is similar to the measurement likelihood previously defined but includes conditioning on y_1. Intuitively, it may be argued that if the state x_2 were given, then the current intensity measurements y_2 should not depend on previous measurements. This is often called a conditional independence assumption and results from the Markov nature of state-space systems (see, e.g., 3), where we assume that y_k is conditionally independent of the past when given the state x_k; that is, we will henceforth assume that

p(y_k | x_k, y_1:k−1) = p(y_k | x_k).  (15)

Considering the remaining term p(x_2 | y_1) from Equation 14, we have at our disposal the posterior p(x_1 | y_1), which from the law of total probability affords

p(x_2 | y_1) = ∫ p(x_2, x_1 | y_1) dx_1 = ∫ p(x_2 | x_1, y_1) p(x_1 | y_1) dx_1,  (16)

where the second equality stems from an application of conditional probability. In principle, p(x_1 | y_1) is already known via Equation 8. The remaining term, p(x_2 | x_1, y_1), is a distribution of the state at k = 2 given the previous state x_1 and past measurements; this is another fundamentally important distribution, known as the transition distribution. A commonly employed modeling assumption is that x_k is independent of past data given x_{k−1}, and we say that x_k satisfies the Markov property (see, e.g., 3); that is, we assume

p(x_k | x_{k−1}, y_1:k−1) = p(x_k | x_{k−1}).  (17)

Therefore, the distribution p(x_2 | y_1) (known as the prediction distribution since it involves predicting the state based on previous data) can be expressed as

p(x_2 | y_1) = ∫ p(x_2 | x_1) p(x_1 | y_1) dx_1.  (19)
For the current example, the transition distribution p(x_2 | x_1) is a model of the dynamic behavior of the target mirror over the time interval between points k = 1 and k = 2. We already have such a model in Equation 4, but again it is not in the required form p(x_2 | x_1). In a similar manner to before, we remedy this by imposing an assumption on the distribution for η_k (the state noise); specifically, we assume that

η_k ∼ N(0, Q).  (20)

With this in place, the distribution for x_{k+1} given x_k is given by

p(x_{k+1} | x_k) = N(x_{k+1}; f(x_k), Q),  (21)

where f(x_k) denotes the deterministic part of the state evolution model (refer to Equation 4). In principle, we have all the terms required to provide the prediction distribution from Equation 19. At the same time, we face the same challenge as before, in that we cannot express this solution in closed form in general. We also face another challenge in that the integral is intractable for even modest dimensions of x_k (for a detailed discussion of these ideas, see 26, 27). A major benefit of the SMC approach is that it overcomes both of these problems. In essence, we can simply substitute the approximation from Equation 12 into Equation 19 to reveal

p(x_2 | y_1) ≈ Σ_{i=1}^M w_1^i p(x_2 | x_1^i).  (23)

Therefore, the SMC approach results in an approximation of the prediction density that involves a weighted combination of the transition model (i.e., our dynamics model for the target mirror). Faced with a new measurement y_2, we can in principle repeat the above process to arrive at an approximation of p(x_3 | y_1:2). More specifically, we note that

p(x_3 | y_1:2) = ∫ p(x_3 | x_2) p(x_2 | y_1:2) dx_2,  (24)

and that the filter distribution p(x_2 | y_1:2) is given by Equation 14. Following a similar line of argument to the above discussion, we can see that the essential idea is to once again approximate this filter distribution as

p(x_2 | y_1:2) ≈ Σ_{i=1}^M w_2^i δ(x_2 − x_2^i),  (25)

where the particles x_2^i are this time samples from the prior x_2^i ∼ p(x_2 | y_1) and the weights w_2^i are defined via the likelihood p(y_2 | x_2^i); this has the same interpretation as previously, in that particles that are highly likely will have higher weight values and vice versa. Substitution of Equation 25 into
Equation 24 reveals that

p(x_3 | y_1:2) ≈ Σ_{i=1}^M w_2^i p(x_3 | x_2^i).  (26)

These steps can be repeated for each new sample time k as new measurements y_k become available. It is important to remember that the above discussion is using finite-sample approximations. Under some mild conditions (see Section 3.6), the resulting approximation is guaranteed to converge to the true underlying distribution as M → ∞.
As a preview of the utility of this approach, using M = 1,000 particles and a variant of Algorithm 1 [using the sequential importance resampling (SIR) choice], given in Section 3.4, results in the position estimates shown in Figure 3b. For the purposes of simple comparison, an unscented Kalman filter estimate is also provided. Figure 3c,d provides more detail around the start of the simulation, where the particle locations demonstrate the multimodal nature of the state distribution; three regions of support are provided by the particles, each separated by a single wavelength (refer to Figure 3a). Note that the unscented Kalman filter mean jumps several wavelengths around the 40th sample, which is perfectly well supported by the measurements.
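The initial weighting step of this example can be sketched as follows. The intensity model, prior parameters, and noise levels are assumed illustrative values, not those used for the article's figures; the state is [position, velocity].

```python
import numpy as np

# Minimal sketch of the initial particle weighting step.
# All numerical values are assumed for illustration.
rng = np.random.default_rng(0)
LAM = 633e-9
KAPPA = 2.0 * np.pi / LAM

def g(d):
    """Assumed noise-free intensity model for position d."""
    return np.array([1.0 + 0.5 * np.cos(KAPPA * d),
                     1.0 + 0.5 * np.sin(KAPPA * d)])

M = 1000
mu1 = np.zeros(2)
P1 = np.diag([(2.0 * LAM) ** 2, 1e-8])                 # assumed prior covariance
particles = rng.multivariate_normal(mu1, P1, size=M)   # x_1^i ~ p(x_1)

d_true = 0.25 * LAM
y1 = g(d_true) + 0.01 * rng.standard_normal(2)         # simulated measurement

R = 0.01 ** 2
resid = y1 - np.array([g(p[0]) for p in particles])
logw = -0.5 * np.sum(resid ** 2, axis=1) / R           # log p(y_1 | x_1^i) + const
w = np.exp(logw - logw.max())
w /= w.sum()                                           # normalized weights
ess = 1.0 / np.sum(w ** 2)                             # effective sample size
```

An effective sample size far below M mirrors the observation above that almost all of the importance concentrates in a small number of particles.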

SEQUENTIAL MONTE CARLO: A UNIFIED APPROACH
The above discussion of the position estimation problem highlighted a number of key elements when considering a probabilistic estimation approach. The first key element was that the quantities of interest at time k were collected into a vector x_k ∈ R^{n_x}, which is called the state. The second was the availability of a belief about the state x_1 prior to collecting any measurements (but possibly conditioned on any other relevant knowledge of the problem). The third was the availability of a measurement likelihood model that relates the state x_k to the measurements y_k. And finally, the fourth was the availability of a state transition model that predicts the distribution of the state one step into the future x_{k+1} based on knowledge of the current state x_k.
In this section, we abstract from the particulars of this pedagogical example and return to it when pertinent to do so. Our focus in this section is to consider the same type of problem from the example, namely, that at time k we seek the distribution of the state based on all the available data so far. This is known as the Bayesian filtering problem for state-space models. Using slightly more formal notation than in the example, the state at time index k is treated as a random variable and denoted as X_k. We assume that the state evolves over time according to the following conditional distribution (akin to the dynamic model from the example):

X_{k+1} | (X_k = x_k) ∼ p(x_{k+1} | x_k).  (27)
We note that this state transition model does not explicitly rely on inputs or other possibly important knowledge. This is purely for ease of exposition, and the above model is allowed to depend on inputs and other knowledge where appropriate. We further assume that the measured outputs are a realization of the random variable Y_k and that the state is related to Y_k via the following conditional distribution (the measurement likelihood model):

Y_k | (X_k = x_k) ∼ p(y_k | x_k).  (28)
The so-called Bayes filter aims to provide the state distribution at time k conditioned on a collection of output measurements y_1:k ≜ {y_1, . . ., y_k}; that is, the filter distribution p(x_k | y_1:k). This filtered distribution can be expressed using the measurement likelihood p(y_k | x_k) and a prior on X_k given the previous data,

p(x_k | y_1:k) = p(y_k | x_k) p(x_k | y_1:k−1) / p(y_k | y_1:k−1),  (30)

where the distribution p(x_k | y_1:k−1) is known as the prediction distribution and describes the distribution of the state given all measurements except y_k. This prediction distribution can also be linked to the previous filtered distribution and the state transition model p(x_k | x_{k−1}) via

p(x_k | y_1:k−1) = ∫ p(x_k | x_{k−1}) p(x_{k−1} | y_1:k−1) dx_{k−1}.  (32)

The filtering and prediction stages can be combined to provide a recursion from one prediction distribution to the next via

p(x_{k+1} | y_1:k) = ∫ p(x_{k+1} | x_k) [p(y_k | x_k) p(x_k | y_1:k−1) / p(y_k | y_1:k−1)] dx_k.  (33)

Equation 33 serves as the central object throughout the remainder of this article.
Without loss of generality, we can assume that the initial state X_1 is distributed according to a finite mixture distribution:

p(x_1) = Σ_{i=1}^M w_0^i p(x_1 | x_0^i).  (34)

The distributions p(x_1 | x_0^i) are allowed to depend on regular variables x_0^i, in case this is useful. Furthermore, the distributions p(x_1 | x_0^i) should not be confused with the state transition model (although this is technically allowed and possibly useful in some circumstances). As a concrete example, the prior from the interferometry example can be recovered with M = 1, w_0^1 = 1, and p(x_1 | x_0^i) = p(x_1). We note that this level of generality may not be warranted, but it does allow for a parsimonious presentation of the various SMC methods without any special consideration regarding the initial distribution.
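Sampling from a finite mixture of this form is straightforward: first draw a component index with probability proportional to its weight, then draw from that component. The component weights, means, and spread below are assumed illustrative values.

```python
import numpy as np

# Sketch of drawing from a finite mixture prior
# p(x_1) = sum_i w_0^i p(x_1 | x_0^i), with assumed Gaussian components.
rng = np.random.default_rng(0)
w0 = np.array([0.5, 0.3, 0.2])     # mixture weights w_0^i (assumed)
mus = np.array([-1.0, 0.0, 2.0])   # one assumed component mean per i
sigma = 0.1                        # assumed component standard deviation

def sample_prior(M):
    """Pick component i with probability w0[i], then draw x_1 ~ N(mus[i], sigma^2)."""
    comp = rng.choice(len(w0), size=M, p=w0)
    return mus[comp] + sigma * rng.standard_normal(M)

x1 = sample_prior(5000)
```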
Armed with the initial state distribution and presented with a single observation y_1, the application of the Bayes filter equations results in a prediction distribution p(x_2 | y_1), which takes the form of a sum of integrals over the mixture components. As mentioned previously, it is not generally possible to obtain closed-form solutions to this sum integral, except in some well-known and important cases (see, e.g., 13, 28). Therefore, we seek instead to arrive at asymptotic solutions that exhibit convergence to the desired solution, with a guaranteed convergence rate. Toward this goal, the primary mechanism employed in SMC methods relies on the law of large numbers (LLN), often called Monte Carlo integration, whereby the sample mean converges to the expected value:

lim_{N→∞} (1/N) Σ_{i=1}^N f(z^i) = E[f(z)].  (36)

The above assumes that z^i ∼ p(z) are independent and identically distributed (i.i.d.) for i = 1, . . ., N and that f(·) is a measurable function. In general, it may be difficult to sample directly from p(z) but possible to evaluate it pointwise. If we suppose that it is instead straightforward to sample from another distribution q(z), it follows under mild conditions that

lim_{N→∞} (1/N) Σ_{i=1}^N [p(z^i)/q(z^i)] f(z^i) = E_p[f(z)],  (37)

where z^i ∼ q(z) are i.i.d. Furthermore, and importantly, the LLN can be used to approximate expectations for joint discrete-continuous random variables. Indeed,

lim_{N→∞} (1/N) Σ_{i=1}^N f(j^i, z^i) = E[f(j, z)],  (38)

where (j^i, z^i) are i.i.d. with the joint discrete-continuous distribution q(j, z).
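The two mechanisms above, direct Monte Carlo integration and importance sampling, can be demonstrated on a toy problem: estimating E_p[f(Z)] for Z ∼ N(2, 1) with f(z) = z², whose exact value is 2² + 1 = 5. The proposal q = N(0, 3²) is an assumed, deliberately broader density.

```python
import numpy as np

# Monte Carlo integration and importance sampling on a toy expectation.
rng = np.random.default_rng(0)
N = 200_000
f = lambda z: z ** 2

# Direct Monte Carlo: z^i ~ p = N(2, 1).
z_p = rng.normal(2.0, 1.0, size=N)
est_direct = np.mean(f(z_p))

# Importance sampling: z^i ~ q = N(0, 3^2), reweighted by p(z)/q(z).
z_q = rng.normal(0.0, 3.0, size=N)
log_ratio = -0.5 * (z_q - 2.0) ** 2 - (-0.5 * (z_q / 3.0) ** 2 - np.log(3.0))
est_is = np.mean(f(z_q) * np.exp(log_ratio))
```

Both estimates approach the exact value 5 as N grows, as guaranteed by Equations 36 and 37.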
The various incarnations of particle filters can all be explained by observing that each employs the LLN to approximate various integrals and summations within the Bayes filter equations. As a summary of what follows, all SMC methods considered in this article result in an approximation of the prediction distribution in the form of a mixture,

p(x_{k+1} | y_1:k) ≈ Σ_{i=1}^M w_k^i p(x_{k+1} | x_k^i).  (39)

Toward this result, in the subsequent sections we introduce the key ideas behind three popular particle filters and then discuss a unified algorithm that encapsulates all three. We remark that this presentation is somewhat nonstandard in that our focus is directed toward the prediction distribution rather than the filter p(x_k | y_1:k), which is more commonly discussed. Due to its importance, we address the filter distribution following the discussion on various particle filter methods.

The Sequential Importance Sampling Particle Filter
The sequential importance sampling (SIS) particle filter has a long history in the field of Bayesian filtering (29). In the framework of the current article, the essential idea behind the SIS approach is to approximate the integral in

p(x_2 | y_1) = (1/p(y_1)) Σ_{i=1}^M w_0^i ∫ p(x_2 | x_1) p(y_1 | x_1) p(x_1 | x_0^i) dx_1  (40)

for each i via Monte Carlo integration, but, importantly, using only a single sample.
We have a number of options at this point. We could draw a sample from p(x_1 | x_0^i) (which is the most common choice in the literature), or, as mentioned in the previous section (see Equations 37 and 38), it may be beneficial to draw a sample from a more general proposal distribution, q(x_1 | x_0^i, y_1).
It is important to reflect on this proposal for a moment. Notice that the proposal distribution has the flexibility to be conditioned on the most recent observation y_1; this flexibility can help in proposing particles in more likely locations since we are allowed to use the most recent data. We can therefore express the required sum integral as

p(x_2 | y_1) = (1/p(y_1)) Σ_{i=1}^M w_0^i ∫ p(x_2 | x_1) [p(y_1 | x_1) p(x_1 | x_0^i) / q(x_1 | x_0^i, y_1)] q(x_1 | x_0^i, y_1) dx_1.  (42)

The SIS particle filter approximates the above integral for each i using a single sample (also called a particle) from the proposal distribution, x_1^i ∼ q(x_1 | x_0^i, y_1). This leads to the following approximation for the prediction distribution:

p(x_2 | y_1) ≈ (1/p(y_1)) Σ_{i=1}^M w̃_1^i p(x_2 | x_1^i),   w̃_1^i ≜ w_0^i p(y_1 | x_1^i) p(x_1^i | x_0^i) / q(x_1^i | x_0^i, y_1).  (44)

We note that the above-defined weights w̃_1^i are not guaranteed to sum to one, which unfortunately implies that the above approximation is not a proper distribution. A simple way to correct this is to replace w̃_1^i with a new normalized weight, defined as

w_1^i ≜ w̃_1^i / Σ_{j=1}^M w̃_1^j.  (45)

Here we notice that the action of normalizing removes the dependence on p(y_1) since it cancels in the fraction; we can therefore safely ignore this term. Combining Equation 44 and the normalized weights in Equation 45 allows the prediction distribution p(x_2 | y_1) to be approximated by the following mixture distribution:

p(x_2 | y_1) ≈ Σ_{i=1}^M w_1^i p(x_2 | x_1^i).  (46)

It is important to highlight that the mixture relies only on weights w_1^i and particles x_1^i. Interestingly, our approximation of the prediction distribution is also a mixture distribution, which was the assumed form for the prior p(x_1) = Σ_{i=1}^M w_0^i p(x_1 | x_0^i). Therefore, at least in principle, we could replace the unknown prediction distribution with this approximation and repeat the above process. To see this, note that the next prediction density p(x_3 | y_1:2) is given by

p(x_3 | y_1:2) = ∫ p(x_3 | x_2) [p(y_2 | x_2) p(x_2 | y_1) / p(y_2 | y_1)] dx_2 ≈ (1/p(y_2 | y_1)) Σ_{i=1}^M w_1^i ∫ p(x_3 | x_2) p(y_2 | x_2) p(x_2 | x_1^i) dx_2,  (47)

where the final approximation comes from substituting Equation 46 for p(x_2 | y_1). This is the same form as Equation 42, and we can therefore repeat the same reasoning to arrive at the mixture

p(x_3 | y_1:2) ≈ Σ_{i=1}^M w_2^i p(x_3 | x_2^i).  (49)
The sequential nature of using Monte Carlo integration in this way reveals the source of the term sequential Monte Carlo.The SIS particle filter is summarized in Algorithm 1 (see Section 3.4) with the SIS choice.
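To make the SIS recursion concrete, here is a minimal sketch on an assumed one-dimensional toy model (not the interferometry example), using the transition as the proposal so that the incremental weight reduces to the likelihood.

```python
import numpy as np

# Minimal SIS sketch on an assumed 1-D toy model:
#   x_{k+1} = 0.9 x_k + eta_k (variance Q),  y_k = x_k + e_k (variance R),
# with the transition used as proposal; no resampling is performed.
rng = np.random.default_rng(0)
M, Q, R = 500, 0.1, 0.1

def sis_step(particles, logw, y):
    """One SIS update: propagate every particle, then reweight."""
    particles = 0.9 * particles + np.sqrt(Q) * rng.standard_normal(M)
    logw = logw - 0.5 * (y - particles) ** 2 / R   # accumulate log p(y_k | x_k^i)
    return particles, logw - logw.max()            # subtract max for stability

particles = rng.standard_normal(M)   # x_1^i from an assumed N(0, 1) prior
logw = np.zeros(M)
for y in [0.1, -0.2, 0.05]:          # assumed measurement sequence
    particles, logw = sis_step(particles, logw, y)
w = np.exp(logw)
w /= w.sum()                          # normalized weights
estimate = np.sum(w * particles)      # weighted posterior-mean estimate
```

Running this for many more time steps would expose the weight degeneracy discussed next: the normalized weights concentrate on ever fewer particles.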

The Sequential Importance Resampling Particle Filter
A potential issue with the SIS approach is that employing only one sample for each i in the LLN approximation generally leads to high variance in the estimate (i.e., a poor approximation). In practice, the lack of particle diversity caused by using a single sample often means that the current measurement is not well supported, and most (if not all) particles will be very unlikely. Since it is typical that the variance is reduced as the number of samples increases, we could just use more samples for each i, so that

x_1^{i,j} ∼ q(x_1 | x_0^i, y_1),   j = 1, . . ., N.  (50)

Following similar reasoning as before, we arrive at the following approximation of the prediction distribution with NM components:

p(x_2 | y_1) ≈ Σ_{i=1}^M Σ_{j=1}^N w_1^{i,j} p(x_2 | x_1^{i,j}).  (51)
While this is perfectly fine, repeating this approach will ultimately produce an exponential growth in the number of mixture components. To combat this, we could take a slightly different approach and use the LLN to approximate both the integral and the sum:

p(x_2 | y_1) = (1/p(y_1)) Σ_{i=1}^M ∫ p(x_2 | x_1) w_0^i p(y_1 | x_1) p(x_1 | x_0^i) dx_1.  (52)

Introducing the distribution q(x_1 | x_0^i, y_1) provides

p(x_2 | y_1) = (1/p(y_1)) Σ_{i=1}^M ∫ p(x_2 | x_1) [w_0^i p(y_1 | x_1) p(x_1 | x_0^i) / q(x_1 | x_0^i, y_1)] q(x_1 | x_0^i, y_1) dx_1.  (53)

In contrast to the SIS case, where we used a single sample for each i, here we choose to sample both the index i and x_1 jointly; this added flexibility allows the possibility of ignoring certain indices and concentrating on others of higher utility. It will be important later that we have knowledge of which index i was sampled. Therefore, we denote the jth sample of i as the integer a_1^j to indicate that this was the jth sample at k = 1; this is called an ancestor index. We sample (a_1^j, x_1^j) ∼ q(a_1, x_1), where we choose the joint distribution to be defined as

q(a_1 = i, x_1) ≜ w_0^i q(x_1 | x_0^i, y_1).  (55)
Importantly, we can straightforwardly sample from this joint distribution by first sampling a_1^j from the categorical distribution

a_1^j ∼ Cat({w_0^i}_{i=1}^M).  (56)

Choosing the index a_1^j based on the previous weights {w_0^i}_{i=1}^M is known as resampling. In contrast to the SIS case, where all indices were explicitly considered, the effect of resampling is to choose indices at random with a probability proportional to the weights w_0^i. One interpretation of this effect is that resampling concentrates attention on mixture components with higher weights. This is not always well justified, since the utility of a mixture component relates to both the weight w_0^i and the distribution p(x_1 | x_0^i). Nevertheless, having sampled a_1^j, we then sample x_1^j conditioned on a_1^j via

x_1^j ∼ q(x_1 | x_0^{a_1^j}, y_1).  (57)

(The categorical distribution Cat({w^i}_{i=1}^M) is parameterized by the nonnegative numbers w^i ≥ 0, with Σ_{i=1}^M w^i = 1, and the probability mass P(j = i) = w^i. There are many ways to sample from this distribution, each with slightly different properties; see 30.)
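Sampling the ancestor indices from the categorical distribution can be sketched as follows; this is multinomial resampling, while the systematic and stratified schemes mentioned in reference 30 differ only in how the underlying uniform draws are generated.

```python
import numpy as np

# Sketch of sampling ancestor indices a ~ Cat({w^i}) via multinomial resampling.
rng = np.random.default_rng(0)

def resample_multinomial(w, N):
    """Draw N ancestor indices with P(a = i) = w[i]."""
    return rng.choice(len(w), size=N, p=w)

w = np.array([0.1, 0.6, 0.2, 0.1])      # example normalized weights (assumed)
idx = resample_multinomial(w, 10_000)
freq = np.bincount(idx, minlength=4) / idx.size   # empirical frequencies
```

With many draws, the empirical frequencies approach the weights, so high-weight components are duplicated and low-weight components tend to disappear.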
Using this approach, we arrive at the following approximation by using N samples of (a_1^j, x_1^j):

p(x_2 | y_1) ≈ (1/(N p(y_1))) Σ_{j=1}^N w̃_1^j p(x_2 | x_1^j),   w̃_1^j ≜ p(y_1 | x_1^j) p(x_1^j | x_0^{a_1^j}) / q(x_1^j | x_0^{a_1^j}, y_1).  (58)
It is common, but not essential, that N = M, which leads to the following approximation for the prediction distribution:

p(x_2 | y_1) ≈ (1/M) Σ_{j=1}^M [p(y_1 | x_1^j) p(x_1^j | x_0^{a_1^j})] / [q(x_1^j | x_0^{a_1^j}, y_1) p(y_1)] p(x_2 | x_1^j).    59.
Similar to the SIS case, the expression on the right-hand side is not guaranteed to be a distribution for x_2, since it may not have unit area. We can apply the same normalization strategy to arrive at the normalized weights via

w_1^j = [p(y_1 | x_1^j) p(x_1^j | x_0^{a_1^j})] / q(x_1^j | x_0^{a_1^j}, y_1),    w̄_1^j = w_1^j / Σ_{l=1}^M w_1^l.

Once again, we have that the prediction density is given by

p(x_2 | y_1) ≈ Σ_{j=1}^M w̄_1^j p(x_2 | x_1^j).

As with the SIS case, we can therefore repeat the above steps to arrive at the SIR particle filter provided in Algorithm 1 with the SIR choice.

The Auxiliary Particle Filter
In the SIR case, we chose the joint distribution according to Equation 55, but this is not essential. We can, in fact, use any reasonable joint distribution q(a_1, x_1). This flexibility was first employed within the auxiliary particle filter (APF) approach (31). In particular, the APF uses

q(a_1, x_1) = q(x_1 | x_0^{a_1}, y_1) v_0^{a_1},

where the main difference compared with Equation 55 is that we allow more flexibility in choosing the probability masses v_0^i for the indices a_1. Following similar reasoning to the SIS and SIR cases, we arrive at the following mixture approximation of the prediction distribution:

p(x_2 | y_1) ≈ Σ_{j=1}^M w̄_1^j p(x_2 | x_1^j),

where the normalized weights for the APF case are

w_1^j = [w_0^{a_1^j} p(y_1 | x_1^j) p(x_1^j | x_0^{a_1^j})] / [v_0^{a_1^j} q(x_1^j | x_0^{a_1^j}, y_1)],    w̄_1^j = w_1^j / Σ_{l=1}^M w_1^l.

A commonly used choice for the proposal is to set

v_0^i ∝ w_0^i p(y_1 | x_0^i)    and    q(x_1 | x_0^i, y_1) = p(x_1 | x_0^i, y_1).

The argument for this choice stems from the fact that⁸

p(y_1 | x_1) p(x_1 | x_0) = p(y_1 | x_0) p(x_1 | x_0, y_1),    68.
which implies that all particles are equally important, and we therefore minimize wasted effort. This approach is known as the fully adapted APF. In general, it is not possible to compute v_0^i and draw from q(·) defined in this way, in which case several authors have proposed replacing p(y_1 | x_0) and p(x_1 | x_0, y_1) with approximations (31). This leads to the so-called partially adapted APF. At any rate, we arrive back at another mixture distribution for the prediction density, and the above can be repeated sequentially to deliver the APF summarized in Algorithm 1 with the APF choice.

A Unified View
The above discussion shows that three important variants of the particle filter - SIS, SIR, and APF - all employ Monte Carlo integration to approximate various expectations. These algorithms may be differentiated on the basis of which part of the expectation they approximate and on the basis of the proposal they employ. These similarities and differences can be summarized in Algorithm 1, which highlights that the practical difference in implementation is very subtle. It essentially reduces to how the so-called ancestor index a_k^i is chosen. In summary, (a) the SIS filter selects all ancestor indices, even if they are extremely unlikely; (b) the SIR filter chooses ancestor indices randomly according to the weights w_{k-1}^i; (c) the APF chooses ancestor indices randomly according to a more flexible set of weights v_{k-1}^i, and this added flexibility can be exploited to minimize wasted effort; and (d) all filters allow for the use of a general proposal q(x_k | x_{k-1}^{a_k^i}, y_k).

Algorithm 1 (unified particle filter: generate particles and weights {x_k^i, w_k^i}_{i=1}^M, ∀k).
Require: M > 0 and the particle filter variant to use (SIS, SIR, or APF).
Initialize the particles {x_0^i}_{i=1}^M and weights {w_0^i}_{i=1}^M.
for k = 1, 2, . . . do
  for i = 1, . . . , M do
    Select the ancestor index a_k^i: a_k^i = i (SIS); a_k^i ~ Cat({w_{k-1}^l}_{l=1}^M) (SIR); a_k^i ~ Cat({v_{k-1}^l}_{l=1}^M) (APF).
    Sample x_k^i ~ q(x_k | x_{k-1}^{a_k^i}, y_k).
    Compute weights w_k^i according to w_k^i = [w_{k-1}^{a_k^i} p(y_k | x_k^i) p(x_k^i | x_{k-1}^{a_k^i})] / [v_{k-1}^{a_k^i} q(x_k^i | x_{k-1}^{a_k^i}, y_k)], with v_{k-1}^i = 1 (SIS) and v_{k-1}^i = w_{k-1}^i (SIR).
  end for
  Compute normalized weights w̄_k^i = w_k^i / Σ_{l=1}^M w_k^l.
end for

⁸Using conditional probability and the Markov property.
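As an illustration of the unified view, the sketch below implements a single time step in which the three variants differ only in the ancestor-selection rule. The callback names (`propagate`, `log_weight`) and the interface are our own assumptions for exposition, not the article's notation: `propagate(x_prev)` samples from the proposal q(x_k | x_prev, y_k), and `log_weight(a, x_new)` returns the variant-specific log importance weight.

```python
import math
import random

def categorical(weights, rng):
    """Draw one index i with probability proportional to weights[i]."""
    u, acc = rng.random() * sum(weights), 0.0
    for i, w in enumerate(weights):
        acc += w
        if acc >= u:
            return i
    return len(weights) - 1

def unified_step(particles, weights, variant, propagate, log_weight,
                 aux_weights=None, rng=None):
    """One time step of a unified particle filter (an illustrative sketch):
    only the ancestor-selection rule distinguishes SIS, SIR, and APF."""
    rng = rng or random.Random(0)
    M = len(particles)
    if variant == "SIS":
        ancestors = list(range(M))                      # keep every index
    elif variant == "SIR":
        ancestors = [categorical(weights, rng) for _ in range(M)]
    elif variant == "APF":
        ancestors = [categorical(aux_weights, rng) for _ in range(M)]
    else:
        raise ValueError("variant must be SIS, SIR, or APF")
    new_particles = [propagate(particles[a]) for a in ancestors]
    logw = [log_weight(a, x) for a, x in zip(ancestors, new_particles)]
    # Normalize the weights via log-sum-exp for numerical stability.
    m = max(logw)
    w = [math.exp(lw - m) for lw in logw]
    s = sum(w)
    return new_particles, [wi / s for wi in w], ancestors
```

Everything model-specific is pushed into the two callbacks, which mirrors the point made above: the filters share one skeleton and differ only in how a_k^i is drawn.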

The Bootstrap Particle Filter
We would be remiss not to mention that choosing the proposal q(x_k | x_{k-1}^{a_k^i}, y_k) = p(x_k | x_{k-1}^{a_k^i}) leads to the celebrated bootstrap particle filter, where some terms cancel to reveal

w_k^i = p(y_k | x_k^i).
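A minimal bootstrap filter can be sketched as follows for a toy scalar model of our own choosing (not one from the article): x_k = 0.9 x_{k-1} + v_k, y_k = x_k + e_k, with v_k, e_k ~ N(0, 1). Because the proposal is the dynamics itself, the weight reduces to the likelihood p(y_k | x_k^i).

```python
import math
import random

def categorical(weights, rng):
    """Draw one index with probability proportional to weights[i]."""
    u, acc = rng.random() * sum(weights), 0.0
    for i, w in enumerate(weights):
        acc += w
        if acc >= u:
            return i
    return len(weights) - 1

def bootstrap_pf(ys, M=500, rng=None):
    """Bootstrap particle filter for the illustrative model
    x_k = 0.9 x_{k-1} + v_k, y_k = x_k + e_k, v, e ~ N(0, 1), x_1 ~ N(0, 1).
    Returns the filtered posterior means E[x_k | y_{1:k}]."""
    rng = rng or random.Random(0)
    xs = [rng.gauss(0.0, 1.0) for _ in range(M)]
    means = []
    for y in ys:
        # Weight by the likelihood p(y_k | x_k^i) = N(y_k; x_k^i, 1).
        w = [math.exp(-0.5 * (y - x) ** 2) for x in xs]
        s = sum(w)
        wbar = [wi / s for wi in w]
        means.append(sum(wi * x for wi, x in zip(wbar, xs)))
        # Resample ancestors, then propagate through the dynamics.
        xs = [xs[categorical(wbar, rng)] for _ in range(M)]
        xs = [0.9 * x + rng.gauss(0.0, 1.0) for x in xs]
    return means
```

Feeding in a constant observation sequence shows the filter mean settling near the stationary Kalman solution for this linear Gaussian model.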

Sequential Monte Carlo Convergence
The convergence of SMC methods has received significant attention (for a full theoretical treatment, see, e.g., 16). In essence, provided that the particle filter can correct errors in the initial state particles (a type of forgetting behavior), it can be shown that the particle filter converges. To be a little more specific, we define two expectations, I_k and Î_k^M, via

I_k = ∫ ψ(x_k) p(x_k | y_{1:k-1}) dx_k,    Î_k^M = ∫ ψ(x_k) p̂^M(x_k | y_{1:k-1}) dx_k,

where p̂^M(x_k | y_{1:k-1}) denotes the M-particle approximation of the prediction distribution at time k, and ψ satisfies |ψ(x_k)| ≤ 1 for all x_k. Then it can be shown that (16)

E[(I_k - Î_k^M)^2]^{1/2} ≤ a b / √M,

where a and b are constants that do not depend on the number of observations N; this is the reason why particle methods can be used in online applications for state estimation. Unfortunately, while this is promising in terms of the data length N, the same does not necessarily hold for the state dimension n_x. When naively implemented, particle filters are known to be impractical for state dimensions greater than n_x ≈ 10 (see, e.g., 26, 27). In short, the main reason for this is that the terms a and b grow at an exponential rate in the state dimension; this is certainly the case for a standard implementation of the bootstrap particle filter (27).
This phenomenon is well known to the research community and to practitioners of SMC methods and has recently received theoretical underpinnings (26, 27). In particular, Rebeschini & van Handel (27) outlined a theoretical argument for constructing algorithms that do not suffer from the same exponential growth in state dimension, which fits within the unified presentation of this section. This was the motivation behind the work of Andersson et al. (32), who employed local SMC methods to adapt the proposal distribution within the broader SMC framework for large-dimension spatiotemporal systems.

PROBLEMS SUCCESSFULLY SOLVED BY SEQUENTIAL MONTE CARLO METHODS
The aim of this section is to provide a rough overview of problems where SMC has been - and will most likely continue to be - useful. The problem areas of system identification (Section 4.1) and state estimation (Section 4.2) are at the heart of both control and robotics, and we will see in Section 4.3 that SMC is in fact much more generally applicable than most of us first thought.

System Identification
In system identification, we build mathematical models of dynamical systems from measured data (33). SMC is a useful component in solving the problem when the dynamics are nonlinear (for overviews providing ample entry points into this development, see, e.g., 34, 35). The commonly used maximum likelihood formulation,

θ̂ = arg max_θ p_θ(y_{1:N}),    73.

requires the likelihood p_θ(y_{1:N}) - and often also its derivatives - to be available for evaluation. By writing the likelihood in the form

p_θ(y_{1:N}) = ∫ p_θ(y_{1:N}, x_{1:N}) dx_{1:N},    74.

we better see its relationship to the latent states. The mathematical motivation for Equation 74 is marginalization, with the interpretation that we average p_θ(y_{1:N}, x_{1:N}) over all possible state sequences x_{1:N}. Equivalently, we can write

p_θ(y_{1:N}) = ∏_{k=1}^N p_θ(y_k | y_{1:k-1}) = ∏_{k=1}^N ∫ p_θ(y_k | x_k) p_θ(x_k | y_{1:k-1}) dx_k,    75.

where y_{1:0} = ∅. This intractable integral explains why SMC is so natural for maximum likelihood estimation of the parameters in nonlinear state-space models. When SMC is used to approximate the integral in Equation 75, it is quite remarkable that the resulting likelihood estimate is in fact unbiased (16, 36). However, it is still a stochastic quantity, implying that the resulting optimization problem in Equation 73 is stochastic. One way around this stochastic optimization problem is to make use of the expectation maximization algorithm (37), which amounts to iteratively solving

θ_{i+1} = arg max_θ ∫ ln p_θ(x_{1:N}, y_{1:N}) p_{θ_i}(x_{1:N} | y_{1:N}) dx_{1:N}    76.
for i = 1, 2, . . ., initialized as θ_0. The sequence {θ_i}_{i≥0} computed in this way is guaranteed to not decrease the log-likelihood (37) [i.e., ln p_{θ_{i+1}}(y_{1:N}) ≥ ln p_{θ_i}(y_{1:N})], which explains why expectation maximization can be used to solve maximum likelihood problems. The additional challenge with the formulation in Equation 76 is that the smoothing distribution p_{θ_i}(x_{1:N} | y_{1:N}) is not available in closed form, but we can employ SMC-based methods to approximate this quantity, effectively replacing the integral with the following tractable sum over all particles:

∫ ln p_θ(x_{1:N}, y_{1:N}) p_{θ_i}(x_{1:N} | y_{1:N}) dx_{1:N} ≈ Σ_{j=1}^M w̄_N^j ln p_θ(x_{1:N}^j, y_{1:N}).

The particle trajectories {x_{1:N}^i}_{i=1}^M and their weights are computed for a model parameterized by the current iterate θ_i. Details on this solution are available in papers by Olsson et al. (38) and Schön et al. (39), and it was later further improved by Lindholm & Lindsten (40).
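The SMC likelihood estimate mentioned in connection with Equation 75 can be sketched as follows for a toy scalar model of our own choosing (not from the article): a bootstrap filter estimates each factor p_θ(y_k | y_{1:k-1}) by the mean unnormalized weight, and the log estimates are summed.

```python
import math
import random

def categorical(weights, rng):
    """Draw one index with probability proportional to weights[i]."""
    u, acc = rng.random() * sum(weights), 0.0
    for i, w in enumerate(weights):
        acc += w
        if acc >= u:
            return i
    return len(weights) - 1

def pf_loglik(ys, theta, M=300, rng=None):
    """Bootstrap-filter estimate of log p_theta(y_{1:N}) for the
    illustrative model x_k = theta x_{k-1} + v_k, y_k = x_k + e_k,
    v, e ~ N(0, 1), x_1 ~ N(0, 1)."""
    rng = rng or random.Random(7)
    xs = [rng.gauss(0.0, 1.0) for _ in range(M)]
    ll = 0.0
    for y in ys:
        # Unnormalized weights: w_k^i = N(y_k; x_k^i, 1).
        w = [math.exp(-0.5 * (y - x) ** 2) / math.sqrt(2.0 * math.pi)
             for x in xs]
        ll += math.log(sum(w) / M)   # log of estimated p(y_k | y_{1:k-1})
        s = sum(w)
        probs = [wi / s for wi in w]
        xs = [xs[categorical(probs, rng)] for _ in range(M)]
        xs = [theta * x + rng.gauss(0.0, 1.0) for x in xs]
    return ll
```

On data simulated from the model, the estimated log-likelihood is (with high probability) larger at the true parameter than at a badly wrong one, which is the property a likelihood-based identification procedure exploits.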
A more direct solution to the maximum likelihood problem is to acknowledge its stochastic nature in the first place and make use of stochastic optimization algorithms. These algorithms have - due to their importance in solving deep learning problems - seen rapid development over the past decade (see, e.g., 41). Stochastic optimization algorithms rely on a Markov chain of the form

θ_{i+1} = θ_i + α_i d_i,

providing an iterative updating mechanism for the parameters. Here, d_i is the search direction, and α_i > 0 denotes the so-called step length, also referred to as the learning rate. Wills & Schön (42) developed a second-order stochastic optimization algorithm specifically tailored for nonlinear system identification using SMC.
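The recursion above can be sketched generically as follows; the function and variable names are our own, the decaying step length α_i = a0/(1 + i) is one common Robbins-Monro-style choice, and the noisy gradient of a simple concave surrogate stands in for a noisy SMC-based log-likelihood gradient.

```python
import random

def stochastic_ascent(direction, theta0, steps=2000, a0=0.5, rng=None):
    """Generic sketch of theta_{i+1} = theta_i + alpha_i d_i with a
    decaying step length alpha_i = a0 / (1 + i).  `direction(theta, rng)`
    returns a noisy ascent direction d_i; all names are illustrative."""
    rng = rng or random.Random(0)
    theta = theta0
    for i in range(steps):
        alpha = a0 / (1.0 + i)
        theta = theta + alpha * direction(theta, rng)
    return theta

# Noisy gradient of the concave surrogate f(theta) = -(theta - 2)^2,
# standing in for a noisy simulation-based objective gradient:
noisy_grad = lambda theta, rng: -2.0 * (theta - 2.0) + rng.gauss(0.0, 1.0)
```

Despite the gradient noise, the decaying step lengths average the noise out and the iterates settle near the maximizer.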
The joint distribution of all random variables used in the model - sometimes referred to as the full probabilistic model - within the maximum likelihood formulation is given by

p_θ(x_{1:N}, y_{1:N}) = p_θ(x_1) ∏_{k=1}^{N-1} p_θ(x_{k+1} | x_k) ∏_{k=1}^N p_θ(y_k | x_k),    79.

where the latent states follow a prior distribution according to the model specification. The maximum likelihood formulation implies that we model the unknown parameters θ as deterministic variables. Depending on the problem setting, a Bayesian formulation (21) might be more useful. Here, the unknown parameters are instead modeled as random variables, implying that we need to complement the model with a prior p(θ) as well. The full probabilistic model is now instead given by p(x_{1:N}, y_{1:N}, θ) = p(x_{1:N}, y_{1:N} | θ)p(θ), where the first term is - with a slight change of notation - given in Equation 79. The goal is then to compute the posterior distribution

p(θ | y_{1:N}) = p(y_{1:N} | θ) p(θ) / p(y_{1:N}).

The first thing to note is that the likelihood p(y_{1:N} | θ) takes center stage in the Bayesian formulation as well. Over the past decade, we have seen a most interesting and useful development when it comes to Bayesian solutions based on SMC. To a large extent, this all started with the particle Markov chain Monte Carlo (MCMC) construction (43). Andrieu et al. (43) introduced both Metropolis-Hastings (44, 45) and Gibbs (46) constructions. The resulting MCMC algorithms are exact in the sense that the target distribution of interest - typically p(θ | y_{1:N}) or p_θ(x_{1:N} | y_{1:N}) - is the stationary distribution of the Markov chain, even though an SMC-based approximation of the likelihood is used in evaluating the acceptance probability. This has resulted in the somewhat peculiar but descriptive term for this class of algorithms: exact approximations.
The particle Metropolis-Hastings sampler makes use of SMC to guide a Metropolis-Hastings method through the parameter space (for a tutorial introduction, see 47). Slightly more specifically, it makes use of a nonnegative and unbiased likelihood estimate provided by SMC, which is a version of the pseudomarginal MCMC idea applied to this particular setting. A possibly underappreciated fact is that the particle Metropolis-Hastings algorithm provides a solution to the smoothing problem as well.
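A minimal particle Metropolis-Hastings sketch looks as follows: a random-walk chain over θ in which the intractable likelihood in the acceptance probability is replaced by an unbiased SMC estimate. The names (`loglik_est`, `logprior`) and the interface are our own illustrative assumptions; `loglik_est(theta, rng)` would typically be a particle-filter estimator of log p_θ(y_{1:N}).

```python
import math
import random

def particle_mh(loglik_est, logprior, theta0, iters=300, step=0.1, rng=None):
    """Minimal particle Metropolis-Hastings sketch with a Gaussian
    random-walk proposal on theta."""
    rng = rng or random.Random(0)
    theta = theta0
    ll = loglik_est(theta, rng)        # store the estimate with the sample
    chain = []
    for _ in range(iters):
        prop = theta + step * rng.gauss(0.0, 1.0)
        ll_prop = loglik_est(prop, rng)
        log_acc = (ll_prop + logprior(prop)) - (ll + logprior(theta))
        if math.log(rng.random()) < log_acc:
            theta, ll = prop, ll_prop  # accept and keep the new estimate
        chain.append(theta)
    return chain
```

Crucially, the stored log-likelihood estimate is reused for the current sample rather than recomputed, which is what makes the pseudomarginal construction exact despite the noisy likelihood.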
When it comes to the particle Gibbs construction, SMC is used as a high-dimensional proposal mechanism on the space of state trajectories x_{1:N}. The original particle Gibbs construction has been improved via the addition of a so-called ancestor sampling step (49).

State Estimation
We have already discussed aspects of the state estimation problem at length in the earlier sections of this review, via the practical example in Section 2, after which we used the nonlinear filtering problem to derive the SMC method in Section 3. The information about the state is represented using probability density functions of the form p(x_{r:k} | y_{1:s}). Depending on the relationships between the time indices r, k, and s, this problem falls into one of three main categories: filtering, prediction, or smoothing. Table 1 provides a more detailed enumeration of the most commonly encountered state inference problems. These densities may appear either on their own or as components in a larger algorithm.
When it comes to smoothing, the joint smoothing density p(x_{1:N} | y_{1:N}) is the main object we are after. The other smoothing solutions are marginal densities with respect to this density. One of the most commonly used strategies for computing smoothing solutions is to first run a (forward) filter and then perform a backward pass to carry information from the future backward in time. An early and widely used example of this strategy is the Rauch-Tung-Striebel smoother (50) for linear Gaussian state-space models. The equivalent idea for particle filters was introduced in 2000 (51), but it was not practical due to its high computational cost. Since then, however, we have seen very useful developments - in particular, backward simulation, which relies on a backward pass where the states are simulated backward in time, resulting in (uncorrelated) samples from p(x_{1:N} | y_{1:N}). Lindsten & Schön (52) provided a tutorial introduction to the backward simulation idea.
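A backward simulation pass can be sketched as follows, with all names our own illustrative choices: given the forward filter's particles and (positive, normalized) weights at each time, a trajectory is drawn by sampling the final state from the filter weights and then moving backward, reweighting each time-k particle by w_k^i · p(x_{k+1} | x_k^i).

```python
import math
import random

def categorical(weights, rng):
    """Draw one index with probability proportional to weights[i]."""
    u, acc = rng.random() * sum(weights), 0.0
    for i, w in enumerate(weights):
        acc += w
        if acc >= u:
            return i
    return len(weights) - 1

def backward_sim(particles, weights, trans_logpdf, rng=None):
    """Draw one trajectory from the particle approximation of the joint
    smoothing density p(x_{1:N} | y_{1:N}) by backward simulation.
    `particles[k]`/`weights[k]` are the forward filter output at time k
    and `trans_logpdf(x_next, x)` evaluates log p(x_{k+1} | x_k)."""
    rng = rng or random.Random(0)
    N = len(particles)
    traj = [None] * N
    # Initialize from the filtering weights at the final time.
    j = categorical(weights[-1], rng)
    traj[-1] = particles[-1][j]
    # Move backward, reweighting by w_k^i * p(x_{k+1} | x_k^i).
    for k in range(N - 2, -1, -1):
        logw = [math.log(w) + trans_logpdf(traj[k + 1], x)
                for w, x in zip(weights[k], particles[k])]
        m = max(logw)
        bw = [math.exp(lw - m) for lw in logw]
        j = categorical(bw, rng)
        traj[k] = particles[k][j]
    return traj
```

Repeating the call yields (approximately independent) trajectories from the joint smoothing distribution, which is exactly the property that distinguishes backward simulation from simply tracing surviving ancestral lines of the filter.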
Since the introduction of the particle MCMC construction (43) roughly a decade ago, we have seen the emergence of a completely different class of very capable smoothing algorithms - namely, those based on carefully engineered Markov kernels. The idea is to run SMC methods within an outer MCMC construction, implying an iterative algorithm. Svensson et al. (53) provided a concrete example of such a construction.

Sequential Monte Carlo Is Generally Applicable
SMC is useful not only for nonlinear state-space models but also for a much broader class of models. It can be used whenever the model contains a sequential structure, be it natural or artificial. Concrete examples include the class of probabilistic graphical models (54) and the even more general programmatic models (55) offered by probabilistic programming languages (56). The combined use of variational inference (57) and SMC has also seen useful developments recently (see, e.g., 58-60). Naesseth et al. (10) provided a tutorial introduction to SMC in this more general setting.

CONCLUSIONS AND FUTURE WORK
While the use of SMC to solve nonlinear filtering and system identification problems is starting to mature, its use in more general settings is only just starting to emerge. In this article, we have focused on providing a nonstandard and hopefully intuitive presentation of the SMC method when it is used to solve the nonlinear filtering problem.
A clear trend is that SMC methods are increasingly being used as components in various larger (often iterative) algorithms. Examples include Bayesian learning, where SMC is used within an MCMC algorithm (43); iterated filtering (61); maximum likelihood using stochastic optimization (42); maximum likelihood via expectation maximization (39); and more elaborate blends of variational inference and SMC (58-60), to name a few.
A rewarding way forward is likely offered by the two main design choices of SMC - the intermediate target distributions and the proposal distribution - which pave the way for new algorithm development. By multiplying the intermediate target in each step by a so-called twisting potential, one can use this design freedom to obtain a better approximation (62). Here, there are interesting possibilities in the use of deterministic algorithms to approximate the twisting potential. While algorithms based on variational inference are fast, the resulting estimates suffer from biases that are hard to quantify. Monte Carlo methods have the opposite profile: they enjoy asymptotic consistency and are well supported by theory but can suffer from a high computational cost. Hence, it is natural to follow the path started by Maddison et al. (58), Naesseth et al. (59), and Le et al. (60) and develop solutions that blend variational inference and SMC to achieve fast algorithms with theoretical guarantees.
Various deep architectures have also recently proved highly useful for modeling nonlinear dynamics [see, e.g., the ordinary differential equation variational autoencoder (ODE2VAE) (63) and Kalman variational autoencoder (64) constructions]. Including SMC in this setting is, naturally, an interesting next step. The main roadblock is arguably the fact that the essential resampling step is not differentiable with respect to the model and SMC parameters. However, recent developments (65) have introduced a differentiable particle filter, effectively opening the way for end-to-end training.

Figure 1 (a) A homodyne Michelson interferometer setup. (b) A schematic of the interferometer shown in panel a. Abbreviations: BS, beam splitter; PBS, polarizing beam splitter; PD, photodetector. Figure adapted from Reference 66 with permission from Petter Ersbo.

Figure 2 (a) Segment of measured light intensities. (b) Scatter plot of measured light intensities.

Figure 3 (a) Initial samples (particles) x_1^i, shown as red x's, drawn from a multivariate normal with mean μ_1 = 0 and covariance P_1 = (1/4)λ^2 I_{2×2}. Resampled particles based on y_1 are shown as blue circles, where three vertical bands are roughly λ apart. (b) Simulated (true) target mirror position (solid red line), the mean position estimate from an unscented Kalman filter (UKF) (dashed black line), and the mean position estimate from the particle filter (PF) (solid blue line). The transient is shown in panel c. (c) Simulated (true) target mirror position (solid red line), UKF mean estimate (dashed black line), and PF mean estimate (particles shown as blue dots), showing more detail at the start of the simulation. (d) Simulated (true) target mirror velocity (solid red line), UKF mean estimate (dashed black line), and PF mean estimate (particle mean shown as a solid blue line), again showing more detail at the start of the simulation.