
Conditional Independence by Typing

Published: 09 December 2021

Abstract

    A central goal of probabilistic programming languages (PPLs) is to separate modelling from inference. However, this goal is hard to achieve in practice. Users are often forced to re-write their models to improve efficiency of inference or meet restrictions imposed by the PPL. Conditional independence (CI) relationships among parameters are a crucial aspect of probabilistic models that capture a qualitative summary of the specified model and can facilitate more efficient inference.
    We present an information flow type system for probabilistic programming that captures conditional independence (CI) relationships and show that, for a well-typed program in our system, the distribution it implements is guaranteed to have certain CI-relationships. Further, by using type inference, we can statically deduce which CI-properties are present in a specified model.
    As a practical application, we consider the problem of how to perform inference on models with mixed discrete and continuous parameters. Inference on such models is challenging in many existing PPLs, but can be improved through a workaround, where the discrete parameters are used implicitly, at the expense of manual model re-writing. We present a source-to-source semantics-preserving transformation, which uses our CI-type system to automate this workaround by eliminating the discrete parameters from a probabilistic program. The resulting program can be seen as a hybrid inference algorithm on the original program, where continuous parameters can be drawn using efficient gradient-based inference methods, while the discrete parameters are inferred using variable elimination.
We implement our CI-type system and its example application in SlicStan: a compositional variant of Stan.

    1 Introduction

The number of probabilistic programming languages (PPLs) has grown rapidly, and so has the range of inference techniques they support. Some focus on problems that can be solved analytically and provide a symbolic solution Gehr et al. 2016; others are very flexible in the models they can express and use general-purpose inference algorithms Wood et al. 2014. Some use gradient-based methods Carpenter et al. 2017 or message-passing methods Minka et al. 2014 to provide an efficient solution at the cost of restricting the range of expressible programs. Each option presents its own challenges, whether in terms of speed, accuracy, or inference constraints, which is why PPL users are often required to learn a set of model re-writing techniques, so they can change the program until it can feasibly be used with the backend inference algorithm.
Take for example Stan Carpenter et al. 2017, which is used by practitioners in a wide range of sciences and industries to analyse their data using Bayesian inference. While efficient inference algorithms exist for continuous-only and for some discrete-only models, it is much less clear what algorithm to use for arbitrary models with large numbers of both discrete and continuous (latent, i.e., unobserved) parameters. Stan has made a conscious choice not to support probabilistic models with discrete parameters, in order to perform inference using (dynamic) Hamiltonian Monte Carlo (HMC) Neal et al. 2011; Betancourt and Girolami 2015; Hoffman and Gelman 2014, which provides efficient, gradient-based inference for differentiable models. As a result, Stan has often been criticised Gelman et al. 2015 for its lack of support for discrete parameters. What is usually overlooked is that many models with discrete parameters can, in fact, be accommodated in Stan, by manually marginalising (summing) out the discrete parameters and drawing them conditionally on the continuous parameters Stan Development Team 2019b, Chapter 7. Marginalisation is one of the core model re-writing techniques: summing over all possible values that a random variable can take to obtain a marginal density function that does not involve that variable. Marginalising efficiently is not always an obvious procedure, as it requires exploiting conditional independence relationships among the variables in the model. For probabilistic graphical models, there are well-known algorithms for enumerating all of the conditional independence assumptions implied by a model. But probabilistic programs are much more general, including control flow and assignment. For this more general case, it is much less clear how to determine conditional independence relationships automatically, and doing so requires combining ideas from traditional program analysis and from probabilistic graphical modelling.
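Concretely, for data \( \mathcal {D} \), continuous parameters \( \boldsymbol {\theta } \), and a discrete parameter z taking one of K values, marginalisation replaces the joint density with a sum, and the discrete draw with a conditional one:
\[ p(\mathcal {D}, \boldsymbol {\theta }) = \sum _{z=1}^{K} p(\mathcal {D}, \boldsymbol {\theta }, z), \qquad p(z \mid \mathcal {D}, \boldsymbol {\theta }) = \frac{p(\mathcal {D}, \boldsymbol {\theta }, z)}{\sum _{z^{\prime }=1}^{K} p(\mathcal {D}, \boldsymbol {\theta }, z^{\prime })}, \]
so gradient-based inference needs to target only \( p(\mathcal {D}, \boldsymbol {\theta }) \), while z can be drawn conditionally afterwards.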
In this article, we introduce an information flow type system that can deduce conditional independence relationships between parameters in a probabilistic program. Finding such relationships can be useful in many scenarios. As an example application, we implement a semantics-preserving source-to-source transformation that automatically marginalises discrete parameters. We work in SlicStan Gorinova et al. 2019, a form of Stan with a more compositional syntax than the original language. Our system extends SlicStan to support discrete parameters in the case when the discrete parameter space is bounded. This transformation corresponds to the variable elimination algorithm Zhang and Poole 1994; Koller and Friedman 2009: an exact inference algorithm that is efficient for models with sparse structure. Combining this transformation with an efficient algorithm for continuous parameters, like HMC, gives us a model-specific, automatically derived inference strategy, which is a composition of variable elimination and the algorithm of choice. While we only focus on one application in this article, our type system for conditional independence is applicable to program transformations of probabilistic programs more generally, and we believe it can enable other composed-inference strategies.
    In short, we make the following contributions:
    (1)
    Factorised semantics for SlicStan: As a basis for proving correctness of our transformation, we extend SlicStan’s type system, so shredding (which slices a SlicStan program into Stan for execution) correctly separates well-typed programs into data preprocessing, main model, and purely generative code (Theorem 1).
    (2)
Main theoretical result: We show how a very simple, relatively standard information flow type system can be used to capture conditional independence in probabilistic programs (Section 3), and we establish a correspondence between well-typed programs and conditional independence properties of the probability distributions they implement (Theorem 2, Theorem 3).
    (3)
    Main practical result: We describe and implement (in SlicStan) a source-to-source transformation that repeatedly uses the result from (2) to efficiently marginalise out the discrete parameters of the program, and we give a generative procedure for drawing these parameters (Section 4), thus automating inference for mixed discrete-continuous models. We prove that our transformation is semantics-preserving (Theorem 4).

    2 SlicStan: Extended Syntax and Semantics

SlicStan Gorinova et al. 2019 is a Stan-like probabilistic programming language. Compared to Stan, it provides extra compositionality by dropping the requirement that programs be block-structured. SlicStan uses type inference in an information-flow type system Volpano et al. 1996; Abadi et al. 1999; Gordon et al. 2015 to automatically rearrange the program into parts roughly corresponding to the block structure of Stan: pre-processing (data), model, and post-processing (generated quantities). Originally, this shredding was developed to compile SlicStan to Stan. In this article, we show that it can be used, more generally, to automatically compile to an efficient program-specific inference scheme.
    Like Stan, SlicStan is imperative and allows for deterministic assignment, for-loops, if-statements, probabilistic assignment, and factor-statements. One contribution of this work is that we present an updated version of SlicStan.
A key difference to the original version of SlicStan is the treatment of sampling (∼) statements. In the original SlicStan paper Gorinova et al. 2019, a statement such as \( x \sim \mathcal {N}(0, 1) \) was understood simply as syntactic sugar for \( \sf factor(\mathcal {N}(x \mid 0, 1)) \): adding a factor to the underlying density of the model, rather than performing actual sampling. In our updated version of SlicStan, sampling statements are part of the core syntax. The semantics of \( x \sim \mathcal {N}(0, 1) \) remains equivalent to that of \( \sf factor(\mathcal {N}(x \mid 0, 1)) \) in terms of density semantics; however, it can be implemented differently depending on the context. In particular, \( x \sim \mathcal {N}(0, 1) \) could be implemented as a simple call to a random number generator in Stan, \( x = \mathcal {N}_{rng}(0, 1) \), as in the example in Figure 1.
This way of treating ∼ statements differently is useful, as it extends the reach of SlicStan's information-flow analysis. Consider, for example, the SlicStan program on the right of Figure 1. Using the original type system, both μ and \( x_{\mathrm{pred}} \) will be of level \( \mathbf {{\rm\small MODEL}} \), as they are both involved in a ∼ statement. Thus, when translated to Stan, both μ and \( x_{\mathrm{pred}} \) must be inferred with HMC (or another similar algorithm), which is expensive. However, the updated type system of this article allows \( x_{\mathrm{pred}} \) to be of level \( \mathbf {{\rm\small GENQUANT}} \), which is preferable: In the context of Stan, this means only μ needs to be inferred with HMC, while \( x_{\mathrm{pred}} \) can simply be drawn using a random number generator. More generally, the updated SlicStan type system allows for factorising the density defined by the program: For data \( \mathcal {D} \), parameters \( \boldsymbol {\theta } \), and generated quantities Q, a program defining a density \( p(\mathcal {D}, \boldsymbol {\theta }, Q) \) can be sliced into two programs with densities \( p(\mathcal {D}, \boldsymbol {\theta }) \) and \( p(Q \mid \mathcal {D}, \boldsymbol {\theta }) \), respectively (Theorem 1). The parameters \( \boldsymbol {\theta } \) are inferred using HMC (or another general-purpose inference algorithm) according to \( p(\mathcal {D}, \boldsymbol {\theta }) \), while the quantities Q are directly generated according to \( p(Q \mid \mathcal {D}, \boldsymbol {\theta }) \).
Fig. 1. Example of difference to previous version of SlicStan.
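A minimal sketch of the kind of contrast Figure 1 depicts (a hypothetical reconstruction; the variable names follow the text):

real mu ~ normal(0, 1);       // parameter: level MODEL (sketch, reconstructed)
data real y ~ normal(mu, 1);  // observed data
real xpred ~ normal(mu, 1);   // prediction: level GENQUANT under the updated system

Under the original system, xpred would be level \( \mathbf {{\rm\small MODEL}} \) and sampled with HMC; under the updated system, its ∼ statement can be compiled to a Stan random-number-generator call, xpred = normal_rng(mu, 1), in the generated quantities block.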
Treating ∼ statements differently based on context is very similar in spirit to the approach of effect-handling-based PPLs Moore and Gorinova 2018 such as Edward2 and Pyro, where ∼ can be handled in different ways. However, in our case, this difference in treatment is determined statically, automatically, and only in the translation to Stan or another backend.
Another difference between Gorinova et al. 2019's SlicStan and our updated version is the \( \sf target(S) \) expression, which we use to locally capture the density defined by a statement S.
    These changes are a small but useful contribution of the current work: They are key to allowing us to decompose the program and compose different inference strategies for efficiency.
In the rest of this section, we give the updated formal syntax, typing, and semantics of SlicStan, and we describe shredding: the procedure at the heart of both the translation to Stan and the composition of inference.

    2.1 Syntax

SlicStan has the following types, programs, L-values, statements, and expressions. We highlight the differences from Gorinova et al. 2019 with boxes.
SlicStan programs consist of a pair Γ, S of a typing environment Γ (a finite map that assigns global variables x to their types T) and a statement S. Following the usual style of declaring variables in C-like languages, we informally present programs Γ, S in examples by sprinkling the type declarations of Γ throughout the statement S. For example, we write \( \sf data~\sf real~x \sim \sf normal(0, 1) \) for the program \( \lbrace x\mapsto (\sf real,\mathbf {{\rm\small DATA}})\rbrace , x \sim \sf normal(0, 1) \). Sometimes, we will leave out types or write incomplete types in our examples. In this case, we intend for the missing types to be determined using type inference.
As we discuss in detail in Section 2.3, a \( \sf factor(E) \) statement can be read as multiplying the current weight (the contribution to the model's joint density) of the program trace by the value of E. In turn, a \( \sf target(S) \) expression initialises the weight to 1 and returns the weight that is accumulated after evaluating S. For example, if:
\begin{align*} S &= \textsf {x} \sim \textsf {normal}(0,1);\; \textsf {y} = 2 * \textsf {x};\; \textsf {z} \sim \textsf {normal}(\textsf {y},1); \\ &= \textsf {factor}(\textsf {normal\_pdf}(\textsf {x} \mid 0,1));\; \textsf {y} = 2 * \textsf {x};\; \textsf {factor}(\textsf {normal\_pdf}(\textsf {z} \mid \textsf {y},1)); \end{align*}
Then \( \sf target(S) \) is semantically equivalent to \( \textsf {normal\_pdf}(\textsf {x} \mid 0,1) \times \textsf {normal\_pdf}(\textsf {z} \mid 2 * \textsf {x},1) \).
    We extend the base types of the language of Gorinova et al. 2019 with \( \sf int\langle n \rangle \), which denotes a positive integer constrained from above by an integer n. For example, if x is of type \( \sf int\langle 2 \rangle \), then x can only be 1 or 2. These types allow us to specify the support of discrete variables, and they can easily be extended to include both upper and lower bounds. For the purpose of our typing rules, we treat \( \sf int\langle n \rangle \) identically to \( \sf int \). We only differentiate between these types in Section 4, where our transformation uses the size annotation to eliminate a discrete variable.

    2.2 Typing

    Types T in SlicStan range over pairs \( (\tau , \ell) \) of a base type τ, and a level type ℓ. The level types ℓ form a lattice \( \left(\lbrace \mathbf {{\rm\small DATA}}, \mathbf {{\rm\small MODEL}}, \mathbf {{\rm\small GENQUANT}}\rbrace , \le \right) \), where \( \mathbf {{\rm\small DATA}} \le \mathbf {{\rm\small MODEL}} \le \mathbf {{\rm\small GENQUANT}} \). We write \( \bigsqcup _{i=1}^n \ell _i \) for the least upper bound of the levels \( \ell _1,\ldots ,\ell _n \). We call variables of level \( \mathbf {{\rm\small DATA}} \) data (variables), of level \( \mathbf {{\rm\small MODEL}} \) model parameters, and of level \( \mathbf {{\rm\small GENQUANT}} \) generated quantities. We refer to variables that are either of level \( \mathbf {{\rm\small MODEL}} \) or \( \mathbf {{\rm\small GENQUANT}} \) simply as parameters. Given a typing environment Γ, we can consider the well-typedness of expressions and statements, given the types assigned to variables by Γ. The judgment \( \Gamma \vdash E : (\tau , \ell) \) means that expression E has type τ and reads only level ℓ and below. The judgment \( \Gamma \vdash S : \ell \) means that statement S assigns only to level ℓ and above. We write \( \Gamma \vdash S \) as a shorthand for \( \Gamma \vdash S:\mathbf {{\rm\small DATA}} \).
The typing rules for expressions are those of Gorinova et al. 2019 with added rules for the two new constructs: array comprehensions and \( \sf target(S) \) expressions. The typing rules for statements are as in Gorinova et al. 2019, with three differences (highlighted in boxes). Factor and Sample add typing rules for the new language constructs \( \sf factor(E) \) and \( L \sim d(E_1,\ldots , E_n) \). The language supports a finite number of built-in functions f with type \( \tau _1,\ldots ,\tau _n\rightarrow \tau \) and (conditional) distributions \( d\in \mathrm{Dist}(\tau _1,\ldots ,\tau _n;\tau) \) over τ given values of types \( \tau _1,\ldots ,\tau _n \).
    In these rules, we make use of the following notation (see Appendix A for precise definitions).
    W(S): the set of variables x that have been assigned to in S.
    \( R_{\Gamma \vdash \ell }(S) \): the set of variables x that are read at level ℓ in S.
    \( W_{\Gamma \vdash \ell }(S) \): the set of variables x of level ℓ that have been assigned to in S.
    \( \widetilde{W}_{\Gamma \vdash \ell }(S) \): the set of variables x of level ℓ that have been ∼-ed in S.
    \( W\widetilde{W}_{\Gamma \vdash \ell }(S)= W_{\Gamma \vdash \ell }(S)\cup \widetilde{W}_{\Gamma \vdash \ell }(S) \)
The intention in SlicStan is that statements of level ℓ are executed before those of level \( \ell ^{\prime } \) if \( \ell \lt \ell ^{\prime } \). To follow that implementation strategy without reordering possibly non-commutative pairs of statements, we impose the condition \( \mathcal {S}(S_1, S_2) \) when we sequence \( S_1 \) and \( S_2 \) in Seq.
    Definition 1 (Shreddable Seq).
    \( \mathcal {S}(S_1, S_2) \triangleq \forall \ell _1,\ell _2. (\ell _2 \lt \ell _1) \Rightarrow R_{\Gamma \vdash \ell _1}(S_1) \cap W_{\Gamma \vdash \ell _2}(S_2) = \varnothing \).
    For example, this excludes the following problematic program:
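A minimal sketch of such a program, reconstructed from the discussion that follows (the concrete syntax is illustrative):

real sigma = 1;              // sigma : DATA (sketch, reconstructed)
real mu ~ normal(0, sigma);  // mu : MODEL
sigma = 2;                   // a later DATA-level assignment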
Above, \( \sf sigma \) and the statements \( \sf sigma = 1 \) and \( \sf sigma = 2 \) are of level \( \mathbf {{\rm\small DATA}} \), which means they should be executed before the statement \( \sf mu \sim normal(0,sigma) \), which is of level \( \mathbf {{\rm\small MODEL}} \). However, this would change the intended semantics of the program, giving \( \sf mu \) a \( \mathcal {N}(0, 2) \) prior instead of the intended \( \mathcal {N}(0, 1) \) prior. This problematic program fails to typecheck in SlicStan, as it is not shreddable: \( \lnot \mathcal {S}(\sf mu \sim normal(0,sigma),\; \sf sigma = 2) \).
    Definition 2 (Generative Seq).
    \( \mathcal {G}(S_1, S_2) \triangleq \forall \ell \ne \mathbf {{\rm\small MODEL}}.\, \widetilde{W}_{\Gamma \vdash \ell }(S_1)\,\cap \, W\widetilde{W}_{\Gamma \vdash \ell }(S_2)=\varnothing \, \wedge \, W\widetilde{W}_{\Gamma \vdash \ell }(S_1)\,\cap \, \widetilde{W}_{\Gamma \vdash \ell }(S_2)=\varnothing . \)
To be able to read \( x\sim \mathcal {N}(0,1) \) at level \( \mathbf {{\rm\small GENQUANT}} \), depending on the context, either as a probabilistic assignment to x or as a density contribution, we impose the condition \( \mathcal {G}(S_1, S_2) \) when we sequence \( S_1 \) and \( S_2 \). This excludes problematic programs like the following, in which the multiple ∼-statements and assignments to \( \sf y \) create a discrepancy between the density semantics of the program, \( p(y) = \mathcal {N}(y \mid 0, 1)\mathcal {N}(y \mid 0, 1) \), and its sampling-based semantics, under which \( \sf y \) ends with the value 5.
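A minimal sketch of such a program, reconstructed from the constraints listed below (the exact original listing is not shown; the shape is what matters):

real y ~ normal(0, 1);  // sketch, reconstructed
y ~ normal(0, 1);       // second density contribution for y
y = 5;                  // deterministic overwrite of y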
This problematic program fails to typecheck in SlicStan owing to the \( \mathcal {G} \) constraint: \( \lnot \mathcal {G}(\sf y \sim normal(0,1),\; \sf y \sim normal(0,1)) \), and also \( \lnot \mathcal {G}(\sf y \sim normal(0,1),\; \sf y = 5) \).

    2.3 Operational Semantics of SlicStan Statements

    In this article, we use a modified version of the semantics given in Gorinova et al. 2019. We extend the call-by-value operational semantics given in that paper and derive a more equational form that also includes the generated quantities.
We define a standard big-step operational semantics for SlicStan expressions and statements, with judgments of the forms \( (s, E) \Downarrow V \) and \( (s, S) \Downarrow s^{\prime }, w \):
Here, s and \( s^{\prime } \) are states, V is a value, and \( w\in \mathbb {R}_{\gt 0} \) is a weight. Our statements can read and write the state with arbitrary destructive updates. The weight can be thought of as an element of state that stores a positive real value that only gets accessed by multiplying it with the value of an expression E, through the use of \( \sf factor(E) \)-statements. It can only be read through a \( \sf target(S) \) expression that initialises the weight to 1, evaluates the statement S, and returns the final weight.
Formally, a state \( s = x_1 \mapsto V_1, \dots , x_n \mapsto V_n \) is a finite map from variables to values, and a value V is either a constant c or an array of values \( [V_1, \dots , V_n] \).
In the rest of the article, we use the following notation for states \( s = x_1 \mapsto V_1, \dots , x_n \mapsto V_n \):
    \( s[x \mapsto V] \) is the state s, but where the value of x is updated to V if \( x \in \mathrm{dom}(s) \), or the element \( x \mapsto V \) is added to s if \( x \notin \mathrm{dom}(s) \).
    \( s[-x] \) is the state s, but where x is removed from the domain of s (if it were present).
    We also define lookup and update operations on values:
    If U is an n-dimensional array value for \( n \ge 0 \) and c1, ..., cn are suitable indexes into U, then the lookup \( U[c_1]\dots [c_n] \) is the value in U indexed by c1, ..., cn.
    If U is an n-dimensional array value for \( n \ge 0 \) and c1, ..., cn are suitable indexes into U, then the (functional) update \( U[c_1]\dots [c_n] := V \) is the array that is the same as U except that the value indexed by c1, ..., cn is V.
    The relation \( \Downarrow \) is deterministic but partial, as we do not explicitly handle error states. The purpose of the operational semantics is to define a density function in Section 2.4, and any errors lead to the density being undefined. The big-step semantics is defined as follows:
    Most rules of the big-step operational semantics are standard, with the exception of Eval Factor and Eval Sample, which correspond to the PPL-specific language constructs \( \sf factor \) and \( L \sim d(E_1, \dots , E_n) \). While we refer to the latter construct as probabilistic assignment, its formal semantics is not that of an assignment statement: Both the left- and the right-hand side of the “assignment” are evaluated to a value, for the density contribution \( d(V \mid V_1, \dots , V_n) \) to be evaluated and factored into the weight of the current execution trace. Contrary to Eval Assign, there is no binding of a result to a variable in Eval Sample. Of course, as is common in probabilistic programming, it might, at times,4 be beneficial to execute these statements as actual probabilistic assignments. Our treatment of these statements is agnostic of such implementation details, however.
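For instance, under Eval Sample, the statement \( x \sim \sf normal(0, 1) \) leaves the state unchanged and contributes the density at the current value of x to the weight (a sketch, writing \( s(x) \) for the value of x in state s):
\[ (s,\ x \sim \textsf {normal}(0, 1)) \Downarrow s,\ \textsf {normal\_pdf}(s(x) \mid 0, 1). \]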
    The design of the type system ensures that information can flow from a level ℓ to a higher one \( \ell ^{\prime }\ge \ell \), but not a lower one \( \ell ^{\prime }\lt \ell \): a noninterference result. To state this formally, we introduce the notions of conformance between a state s and a typing environment Γ and ℓ-equality of states.
We define a conformance relation on states s and typing environments Γ. A state s conforms to an environment Γ, written \( s \models \Gamma \), whenever s provides values of the correct types for the variables used in Γ: that is, whenever \( \mathrm{dom}(s) = \mathrm{dom}(\Gamma) \) and \( s(x) \models \tau \) for every \( x : (\tau , \ell) \in \Gamma \).
    Here, \( V \models \tau \) denotes that the value V is of type τ, and it has the following definition:
    \( c \models \sf int \), if \( c \in \mathbb {Z} \), and \( c \models \sf real \), if \( c \in \mathbb {R} \).
    \( [V_1,\dots ,V_n] \models \tau [n] \), if \( \forall i \in 1\dots n. V_i \models \tau \).
    Definition 3 (ℓ-equal States)
    Given a typing environment Γ, states \( s_1 \models \Gamma \) and \( s_2 \models \Gamma \) are ℓ-equal for level ℓ (written \( s_1 \approx _{\ell } s_2 \)), if they differ only for variables of a level strictly higher than ℓ:
    \[ s_1 \approx _{\ell } s_2 \triangleq \forall x:(\tau , \ell ^{\prime }) \in \Gamma . \left(\ell ^{\prime } \le \ell \Rightarrow s_1(x) = s_2(x) \right). \]
    Lemma 1 (Noninterference of \( \vdash \))
    Suppose \( s_1 \models \Gamma \), \( s_2 \models \Gamma \), and \( s_1 \approx _{\ell } s_2 \) for some ℓ. Then for SlicStan statement S and expression E:
    (1)
    If \( ~\Gamma \vdash E:(\tau ,\ell) \) and \( (s_1, E) \Downarrow V_1 \) and \( (s_2, E) \Downarrow V_2 \), then \( V_1 = V_2 \).
    (2)
    If \( ~\Gamma \vdash S:\ell \) and \( (s_1, S) \Downarrow s_1^{\prime }, w_1 \) and \( (s_2, S) \Downarrow s_2^{\prime }, w_2 \), then \( s_1^{\prime } \approx _{\ell } s_2^{\prime } \).
    Proof.
    (1) follows by rule induction on the derivation \( \Gamma \vdash E:(\tau , \ell) \), and using that if \( \Gamma \vdash E:(\tau , \ell) \), E reads x and \( \Gamma (x) = (\tau ^{\prime }, \ell ^{\prime }) \), then \( \ell ^{\prime } \le \ell \). (2) follows by rule induction on the derivation \( \Gamma \vdash S:\ell \) and using (1). We present more details of the proof in Appendix A.□

    2.4 Density Semantics

The semantic aspect of a SlicStan program \( \Gamma , S \) that we are most interested in is the final weight w obtained after evaluating the program S. This is the value the program computes for the unnormalised joint density \( p^*(\mathbf {x}) = p^*(\mathcal {D},\boldsymbol {\theta }, Q) \) over the data \( \mathcal {D} \), the model parameters \( \boldsymbol {\theta } \), and generated quantities Q of the program (see Section 2.6). Given a program \( \Gamma , S \), we separate the typing environment Γ into disjoint parts: \( \Gamma _{\sigma } \) and \( \Gamma _{\mathbf {x}} \), such that \( \Gamma _{\sigma } \) contains precisely the variables that are deterministically assigned in S and \( \Gamma _{\mathbf {x}} \) contains those that never get deterministically assigned; that is, the variables \( \mathbf {x} \) with respect to which we define the target unnormalised density \( p^*(\mathbf {x}) \):
    \begin{align*} \Gamma _{\sigma }=\lbrace (x:T)\in \Gamma \mid x\in W(S)\rbrace \qquad \Gamma _{\mathbf {x}}=\Gamma \setminus \Gamma _{\sigma }. \end{align*}
    Similarly, any conforming state \( s\models \Gamma \) separates as \( \sigma \uplus \mathbf {x} \) with
    \begin{align*} \sigma =\lbrace (x\mapsto V)\in s\mid x\in W(S)\rbrace \quad \mathbf {x}=s\setminus \sigma . \end{align*}
    Then, \( \sigma \models \Gamma _{\sigma } \) and \( \mathbf {x}\models \Gamma _{\mathbf {x}} \).
    The semantics of a SlicStan program \( \Gamma _{\sigma }, \Gamma _{\mathbf {x}}, S \) is a function \( [\![ S ]\!] \) on states \( \sigma \models \Gamma _{\sigma } \) and \( \mathbf {x} \models \Gamma _{\mathbf {x}} \) that yields a pair of a state \( \sigma ^{\prime } \) and a weight w, such that:
\[ [\![ S ]\!] (\sigma)(\mathbf {x}) = \sigma ^{\prime }, w, \quad \text{ where } (\sigma \uplus \mathbf {x}, S) \Downarrow \sigma ^{\prime }\uplus \mathbf {x}, w. \]
We will sometimes refer only to one of the two elements of the pair \( \sigma ^{\prime }, w \). In those cases, we use the notation: \( [\![ S ]\!] _s(\sigma)(\mathbf {x}),[\![ S ]\!] _p(\sigma)(\mathbf {x})=[\![ S ]\!] (\sigma)(\mathbf {x}) \). We call \( [\![ S ]\!] _s \) the state semantics and \( [\![ S ]\!] _p \) the density semantics of \( \Gamma , S \). We will be particularly interested in the density semantics.
    The function \( [\![ S ]\!] _p(\sigma) \) is some positive function \( \phi (\mathbf {x}) \) of \( \mathbf {x} \). If \( \mathbf {x}_1, \mathbf {x}_2 \) is a partitioning of \( \mathbf {x} \) and \( \int \phi (\mathbf {x}) \mathrm{d}\mathbf {x}_1 \) is finite, we say \( \phi (\mathbf {x}) \) is an unnormalised density corresponding to the normalised density \( p(\mathbf {x}_1 \mid \mathbf {x}_2) = \phi (\mathbf {x}) / \int \phi (\mathbf {x}) \mathrm{d}\mathbf {x}_1 \) over \( \mathbf {x}_1 \) and we write \( [\![ S ]\!] _p(\sigma){(\mathbf {x})} \propto p(\mathbf {x}_1 \mid \mathbf {x}_2) \). Sometimes, when \( \sigma \) is clear from context, we will leave it implicit and simply write \( p(\mathbf {x}) \) for \( p(\mathbf {x}; \sigma) \).
    Next, we observe how the state and density semantics compose.
    Lemma 2 (Semantics Composes).
    The state and density semantics compose as follows:
\[ [\![ S_1; S_2 ]\!] _s(\sigma)(\mathbf {x}) = [\![ S_2 ]\!] _s([\![ S_1 ]\!] _s(\sigma)(\mathbf {x}))(\mathbf {x}) \qquad [\![ S_1; S_2 ]\!] _p(\sigma)(\mathbf {x}) = [\![ S_1 ]\!] _p(\sigma)(\mathbf {x}) \times [\![ S_2 ]\!] _p([\![ S_1 ]\!] _s(\sigma)(\mathbf {x}))(\mathbf {x}) \]
    Throughout the article, we use the following notation to separate the store in a concise way.
    Definition 4 (\( \Gamma _{\ell }(s) \) or \( s_{\ell } \))
    For a typing environment Γ and a store \( s \models \Gamma \), let \( \Gamma _{\ell }(s) = \lbrace (x \mapsto V) \in s \mid \Gamma (x) = (\_, \ell)\rbrace \). When it is clear which typing environment the notation refers to, we write simply \( s_{\ell } \) instead of \( \Gamma _{\ell }(s) \).
    Using this definition, we re-state the noninterference result in the following convenient form:
    Lemma 3 (Noninterference of \( \vdash \) Reformulated)
    Let \( ~\Gamma _{\sigma }, \Gamma _{\mathbf {x}}\vdash S \) be a well-typed SlicStan program. For all levels \( \ell \in \lbrace \mathbf {{\rm\small DATA}}, \mathbf {{\rm\small MODEL}}, \mathbf {{\rm\small GENQUANT}}\rbrace \), there exist unique functions \( f_{\ell } \), such that for all \( \sigma \models \Gamma _{\sigma } \), \( \mathbf {x} \models \Gamma _{\mathbf {x}} \) and \( \sigma ^{\prime } \) such that \( [\![ S ]\!] _s(\sigma)(\mathbf {x}) = \sigma ^{\prime } \), \( \sigma _{\ell }^{\prime } = f_{\ell }(\lbrace \sigma _{\ell ^{\prime }}, \mathbf {x}_{\ell ^{\prime }} \mid \ell ^{\prime } \le \ell \rbrace) \).

    2.5 Shredding and Translation to Stan

A key aim of SlicStan is to rearrange the input program into three phases of execution, corresponding to the levels of the type system: \( \mathbf {{\rm\small DATA}} \) preprocessing, core \( \mathbf {{\rm\small MODEL}} \) code to run MCMC or another inference algorithm on, and \( \mathbf {{\rm\small GENQUANT}} \), or generated quantities, which amount to sample post-processing after inference is performed. The motivation for these phases is that they all naturally appear in the workflow of probabilistic programming. The blocks of Stan are built around this phase distinction, and compilation of SlicStan to Stan and comparable backends requires it.
The phases impose different restrictions on the code and incur differing computational costs. The model phase is by far the most expensive to evaluate: Code in this phase tends to be executed repeatedly within the inner loop of an inference algorithm like an MCMC method. Further, it tends to be automatically differentiated Griewank and Walther 2008 when gradient-based inference algorithms are used, which restricts the available programming features and increases the space and time complexity of evaluation. Type inference in SlicStan combined with shredding allows the user to write their code without worrying about the performance of different phases, as code will be shredded into its optimal phase of execution.
The shredding relation is at the core of this rearrangement. Shredding takes a SlicStan statement S and splits it into three single-level statements (Definition 5). That is, \( S \Updownarrow _{\Gamma }S_D, S_M, S_Q \) means we split S into sub-statements \( S_D, S_M, S_Q \), where \( S_D \) mentions only \( \mathbf {{\rm\small DATA}} \) variables, \( S_M \) mentions \( \mathbf {{\rm\small DATA}} \) and \( \mathbf {{\rm\small MODEL}} \) variables, and \( S_Q \) is the rest of the program, such that the composition \( S_D; S_M; S_Q \) behaves the same as the original program S. When combined with type inference, shredding automatically determines optimal statement placement, such that only necessary work is executed in the “heavy-weight” \( \mathbf {{\rm\small MODEL}} \) part of inference.
    We adapt the shredding from Gorinova et al. 2019, so the following holds for the three sub-statements of a shredded well-typed SlicStan program \( \Gamma \vdash S \):
\( S_D \) implements deterministic data preprocessing: No contributions to the density are allowed.
\( S_M \) is the inference core: It is the least restrictive of the three slices (either or both of \( S_D \) and \( S_Q \) can be merged into \( S_M \)). It can involve contributions to the density that require advanced inference for sampling. Therefore, this is the part of the program that requires the most computation during inference (in Stan, this is what is run inside HMC).
\( S_Q \) represents sample post-processing: Any contributions to the density are generative. That is, they can immediately be implemented using draws from random number generators.
In terms of inference, we can run \( S_D \) once as a pre-processing step, then use a suitable inference algorithm for \( S_M \) (in the case of Stan, that is HMC, but other MCMC or VI algorithms can be used), and, finally, use ancestral sampling for \( S_Q \).
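As an illustration, consider the following hypothetical program; the comments indicate the slice each statement is shredded into:

data real x;
real c = 2 * x;            // DATA: placed in S_D, run once as preprocessing (illustrative sketch)
real mu ~ normal(c, 1);    // MODEL: placed in S_M, run inside the inference loop
real pred ~ normal(mu, 1); // GENQUANT: placed in S_Q, drawn with a random number generator after inference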
    Here, \( \Gamma (E) \) (Definition 14) gives the principal type of an expression E, while \( \Gamma (E_1, \dots , E_n) \) (Definition 15) gives the least upper bound of the principal types of \( E_1, \dots , E_n \).
    The Shred If and Shred For rules make sure to shred if and for statements so they are separated into parts that can be computed independently at each of the three levels. Note that the usage of \( \sf if \) and \( \sf for \) guards is simplified to avoid stating rules for when the guard(s) are of different levels. For example, if we have a statement \( \sf if(E)~S_1~\sf else~S_2 \), where E is of level \( \mathbf {{\rm\small MODEL}} \), then we cannot access E at level \( \mathbf {{\rm\small DATA}} \), thus the actual shredding rule we would use is:
    These shredding rules follow very closely those given by Gorinova et al. 2019. The main difference is that sample statements (\( L \sim d(E_1, \dots , E_n) \)) are allowed to be of \( \mathbf {{\rm\small GENQUANT}} \) level and can be included in the last, generative slice of the program (see rule Shred Sample). In other words, such \( \mathbf {{\rm\small GENQUANT}} \) sample statements are those statements that can be interpreted as probabilistic assignment (using random number generator functions) to directly sample from the posterior distribution according to ancestral sampling.
    We provide proofs for the following key results in Appendix A: Shredding produces single-level statements (Definition 5 and Lemma 4) and shredding is semantics preserving (Lemma 6).
    Intuitively, a single-level statement of level ℓ is one that updates only variables of level ℓ.
    Definition 5 (Single-level Statement \( \Gamma \vdash \ell (S) \))
    We define single-level statements S of level ℓ with respect to Γ (written \( \Gamma \vdash \ell (S) \)), by induction:
Lemma 4 (Shredding Produces Single-level Statements). If \( ~S \Updownarrow _{\Gamma } S_D, S_M, S_Q \), then \( \Gamma \vdash \mathbf {{\rm\small DATA}}(S_D) \), \( \Gamma \vdash \mathbf {{\rm\small MODEL}}(S_M) \), and \( \Gamma \vdash \mathbf {{\rm\small GENQUANT}}(S_Q) \).
    We prove a result about the effect of single-level statements on the store and weight of well-typed programs (Lemma 5). Intuitively, this result shows that a single-level statement of level ℓ acts on the state and weight in a way that is independent of levels greater than ℓ.
    Lemma 5 (Property of Single-level Statements).
    Let \( ~\Gamma _{\sigma }, \Gamma _{\mathbf {x}}, S \) be a SlicStan program, such that S is a single-level statement of level ℓ, \( \Gamma \vdash \ell (S) \). Then there exist unique functions f and φ, such that for any \( \sigma , \mathbf {x} \models \Gamma _{\sigma }, \Gamma _{\mathbf {x}} \):
\[ [\![ S ]\!] (\sigma)(\mathbf {x}) = f(\sigma _{\le \ell }, \mathbf {x}_{\le \ell })\cup \sigma _{\gt \ell } , \;\;\phi (\sigma _{\le \ell })(\mathbf {x}_{\le \ell }), \]
where we write \( \sigma _{\le \ell }=\lbrace (x\mapsto V)\in \sigma \mid \Gamma _{\sigma }(x)=(\_,\ell ^{\prime }) \text{ for some } \ell ^{\prime } \le \ell \rbrace \) and \( \sigma _{\gt \ell }=\sigma \setminus \sigma _{\le \ell } \), and similarly for \( \mathbf {x}_{\le \ell } \).
    Lemma 6 (Semantic Preservation of \( \Updownarrow _{\Gamma } \))
    If \( ~\Gamma \vdash S:\mathbf {{\rm\small DATA}} \) and \( S \Updownarrow _{\Gamma } (S_{D_{}}, S_{M_{}}, S_{Q_{}}) \), then \( [\![ S ]\!] = [\![ S_D; S_M; S_Q ]\!] \).

    2.6 Density Factorisation

As an extension of Gorinova et al. 2019, we show that shredding induces a natural factorisation of the density implemented by the program: \( p(\mathcal {D}, \boldsymbol {\theta }, Q) = p(\boldsymbol {\theta },\mathcal {D})p(Q \mid \boldsymbol {\theta },\mathcal {D}) \). This means that we can separate the program into a deterministic preprocessing part, a part that uses a “heavy-weight” inference algorithm, such as HMC, and a part that uses simple ancestral sampling.
    Theorem 1 (Shredding Induces a Factorisation of the Density).
    Suppose \( \Gamma \vdash S : \mathbf {{\rm\small DATA}} \) and \( ~S \Updownarrow _{\Gamma } S_D, S_M, S_Q \) and \( \Gamma = \Gamma _{\sigma }, \Gamma _{\mathcal {D}}, \Gamma _{\boldsymbol {\theta }}, \Gamma _{Q} \). For all \( \sigma \), \( \mathcal {D} \), \( \boldsymbol {\theta } \), and Q: if \( \sigma , \mathcal {D}, \boldsymbol {\theta }, Q\models \Gamma _{\sigma }, \Gamma _{\mathcal {D}}, \Gamma _{\boldsymbol {\theta }}, \Gamma _{Q} \), and \( [\![ S ]\!] _p(\sigma)(\mathcal {D}, \boldsymbol {\theta }, Q) \propto p(\mathcal {D}, \boldsymbol {\theta }, Q) \) and \( \widetilde{W}(S_Q)=\mathrm{dom}(\Gamma _Q) \) then:
    (1)
    \( [\![ S_M ]\!] _p(\sigma _D)(\mathcal {D}, \boldsymbol {\theta }, Q) \propto p(\boldsymbol {\theta }, \mathcal {D}), \)
    (2)
    \( [\![ S_Q ]\!] _p(\sigma _M)(\mathcal {D}, \boldsymbol {\theta }, Q) = p(Q \mid \boldsymbol {\theta }, \mathcal {D}), \)
    where \( \sigma _D = [\![ S_D ]\!] _s(\sigma)(\mathcal {D}, \boldsymbol {\theta }, Q) \), \( \sigma _M = [\![ S_M ]\!] _s(\sigma _D)(\mathcal {D}, \boldsymbol {\theta }, Q) \), and \( p(\mathcal {D}, \boldsymbol {\theta }, Q) = p(\mathcal {D}, \boldsymbol {\theta })p(Q \mid \mathcal {D}, \boldsymbol {\theta }) \).
    Proof.
    This follows by proving a more general result using induction on the structure of S, Lemma 6, Lemma 2, and Lemma 4. See Appendix A for full proof.□
    The given SlicStan program S defines a joint density \( p(\mathcal {D}, \boldsymbol {\theta }, Q) \). By shredding, we obtain a \( \mathbf {{\rm\small MODEL}} \) block SM that defines \( p(\boldsymbol {\theta }, \mathcal {D}) \) and a \( \mathbf {{\rm\small GENQUANT}} \) block SQ that defines \( p(Q \mid \boldsymbol {\theta }, \mathcal {D}) \). Hence, inference in Stan using these blocks recovers the semantics \( p(\mathcal {D}, \boldsymbol {\theta }, Q) \) of the SlicStan program.

    3 Theory: Conditional Independence by Typing

This section presents the main theoretical contribution of the article: an information flow type system for conditional independence. We present a type system and show that a well-typed program in that system is guaranteed to have certain conditional independencies in its density semantics. As a reminder, determining the conditional independence relationships between variables is important, as such relationships capture a qualitative summary of the specified model and can facilitate more efficient inference. For example, in Section 4, we present an application that uses our type system: a semantics-preserving transformation that allows for discrete parameters to be introduced in SlicStan, which was previously not possible due to efficiency constraints.
Our aim is to optimise probabilistic programs by transforming abstract syntax trees or intermediate representations (as in the Stan compiler) that are close to abstract syntax. Hence, we seek a way to compute conditional independencies by a type-based source analysis, rather than by explicitly constructing a separate graphical representation of the probabilistic model.
Given three disjoint sets of random variables (RVs) A, B, and C, we say that A is conditionally independent of B given C, written \( A \mathrel {\text{$\perp \perp $}}B \mid C \), if and only if their densities factorise as \( p(A, B \mid C) = p(A \mid C)p(B \mid C) \). (An alternative formulation states that \( A \mathrel {\text{$\perp \perp $}}B \mid C \) if and only if \( p(A, B, C) = \phi _1(A, C) \phi _2(B, C) \) for some functions \( \phi _1 \) and \( \phi _2 \).) Deriving conditional independencies in the presence of a graphical model (such as a factor graph) is straightforward, which is why some PPLs focus on building and performing inference on graphs (for example, Infer.NET Minka et al. 2014). However, building and manipulating a factor graph in generative PPLs (e.g., Gen Cusumano-Towner et al. 2019, Pyro Uber AI Labs 2017, Edward2 Tran et al. 2018, PyMC3 Salvatier et al. 2016), or imperative density-based PPLs (SlicStan, Stan) is not straightforward. Dependencies between modelled variables might be separated by various deterministic transformations, making it harder to track the information flow and—more importantly—more difficult to isolate parts of the model needed for transformations such as variable elimination. In the case of SlicStan, each program can still be thought of as specifying a factor graph implicitly. In this article, we focus on the problem of how to work with conditional independence information implicitly encoded in a probabilistic program, without having access to an explicit factor graph. For example, consider Program A:
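A sketch of Program A, reconstructed from the density and the deterministic variables discussed below (\( \sf bern \) denotes a distribution with density b; \( \sf foo \) is some deterministic function):

data real theta0;            // sketch, reconstructed
int<2> z1 ~ bern(theta0);
real theta1 = foo(1, z1);
int<2> y1 ~ bern(theta1);
real phi1 = foo(theta0, z1);
int<2> z2 ~ bern(phi1);
real phi2 = foo(1, z2);
int<2> y2 ~ bern(phi2);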
The factor graph of this program represents the factorisation of the joint density function over the parameters of the program: \( p(z_1, z_2, y_1, y_2) = b(z_1 \mid \theta _0) b(y_1 \mid \mathrm{foo}(1, z_1))b(z_2 \mid \mathrm{foo}(\theta _0, z_1)) b(y_2 \mid \mathrm{foo}(1, z_2)) \). Each of the four factors is represented by a square node in the graph, and it connects to the variables (circle nodes) that the factor depends on. This representation is useful for thinking about conditional independencies. For example, it is immediately evident from the graph that variables that connect to the same square node cannot be conditionally independent, as they share a factor. More generally, if there is a direct path (uninterrupted by observed variables) between two variables, then these two variables are not conditionally independent Frey 2002.
When looking at the factor graph, it is straightforward to see that \( z_1 \) and \( z_2 \) are not conditionally independent, and neither are \( z_1 \) and \( y_1 \), nor \( z_2 \) and \( y_2 \), as there is a direct path between each of these pairs. When looking at the program, however, we need to reason about the information flow through the deterministic variables \( \theta _1, \phi _1 \), and \( \phi _2 \) to reach the same conclusion.
Moreover, manipulating the program based on conditional independencies can also be more difficult without a factor graph. As an example, consider the problem of variable elimination (which we discuss in more detail in Section 4.3). To eliminate \( z_1 \) in the factor graph using variable elimination, we would simply merge the factors directly connected to \( z_1 \), sum over \( z_1 \), and attach the new factor to all former neighbours of \( z_1 \) (in this case \( y_1 \) and \( z_2 \), but not \( y_2 \)). However, in the case of an imperative program, we need to isolate all the statements that depend on \( z_1 \) and group them together without changing the meaning of the program beyond the elimination:
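A sketch of the resulting program (hypothetical concrete syntax; the loop body groups exactly the statements that depend on \( z_1 \)):

real[2] f;                   // sketch, illustrative
for (z1 in 1:2) {
  real theta1 = foo(1, z1);
  real phi1 = foo(theta0, z1);
  f[z1] = bern(z1 | theta0) * bern(y1 | theta1) * bern(z2 | phi1);
}
factor(f[1] + f[2]);         // new factor over y1 and z2, with z1 summed out
real phi2 = foo(1, z2);
int<2> y2 ~ bern(phi2);      // left out of the elimination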
We need a way to analyse the information flow to determine conditional independencies between variables. In the example above, we can leave \( y_2 \) out of the elimination of \( z_1 \), because \( z_1 \) and \( y_2 \) are conditionally independent given \( z_2 \), written \( z_1 \mathrel {\text{$\perp \perp $}}y_2 \mid z_2 \).
    To analyse the information flow, we introduce a novel type system, which we refer to via the relation \( \vdash _{2} \). It works with a lower semi-lattice \( (\lbrace \mathbf {{\rm\small L1}}, \mathbf {{\rm\small L2}}, \mathbf {{\rm\small L3}}\rbrace , \le) \) of levels, where \( \mathbf {{\rm\small L1}} \le \mathbf {{\rm\small L2}} \) and \( \mathbf {{\rm\small L1}} \le \mathbf {{\rm\small L3}} \) and \( \mathbf {{\rm\small L2}} \) and \( \mathbf {{\rm\small L3}} \) are unrelated. (Recall that a lower semi-lattice is a partial order in which any two elements \( \ell _1 \), \( \ell _2 \) have a greatest lower bound \( \ell _1 \sqcap \ell _2 \) but do not always have an upper bound.) A well-typed program induces a conditional independence relationship for the (random) variables (RVs) in the program: \( \mathbf {{\rm\small L2}}\text{-RVs} \mathrel {\text{$\perp \perp $}}\mathbf {{\rm\small L3}}\text{-RVs} \mid \mathbf {{\rm\small L1}}\text{-RVs} \).
In the example above, this result allows us to eliminate \( \mathbf {{\rm\small L2}} \)-variables (\( z_1 \)), while only considering \( \mathbf {{\rm\small L1}} \)-variables (\( y_1 \) and \( z_2 \)) and knowing that \( \mathbf {{\rm\small L3}} \)-variables (\( y_2 \)) are unaffected by the elimination. We can use a shredding relation almost identical to that of Section 2.5 to slice the program in a semantics-preserving way and isolate the sub-statements needed for elimination. Here, \( \theta _1 \) and \( \phi _1 \) must be of level \( \mathbf {{\rm\small L2}} \) for the program to be well-typed. Thus, all statements involving \( z_1, \theta _1 \), or \( \phi _1 \) are of level \( \mathbf {{\rm\small L2}} \), and the shredding relation groups them together inside the elimination loop for \( z_1 \).
Figure 3 shows the relationship between the levels \( \mathbf {{\rm\small L1}}, \mathbf {{\rm\small L2}}, \mathbf {{\rm\small L3}} \) and the shredding relation. Information flows from \( \mathbf {{\rm\small L1}} \) to \( \mathbf {{\rm\small L2}} \) and \( \mathbf {{\rm\small L3}} \), but there is no flow of information between \( \mathbf {{\rm\small L2}} \) and \( \mathbf {{\rm\small L3}} \) (Figure 3(b)). A \( \vdash _{2} \)-well-typed program S is shredded by \( \Updownarrow _{\Gamma } \) into \( S_1 \), \( S_2 \), and \( S_3 \), where \( S_1 \) only mentions \( \mathbf {{\rm\small L1}} \) variables, \( S_2 \) only mentions \( \mathbf {{\rm\small L1}} \) and \( \mathbf {{\rm\small L2}} \) variables, and \( S_3 \) only mentions \( \mathbf {{\rm\small L1}} \) and \( \mathbf {{\rm\small L3}} \) variables. This can be understood as a new factor graph formulation of the original program S, where each of the substatements \( S_1, S_2, S_3 \) defines a factor connected to any involved variables (Figure 3(a)).
Fig. 3. Intuition for the semi-lattice case \( \mathbf {{\rm\small L1}} \lt \mathbf {{\rm\small L2}} \) and \( \mathbf {{\rm\small L1}} \lt \mathbf {{\rm\small L3}} \), where \( \mathbf {x}_{\ell } \) is of level ℓ. We get \( \mathbf {x}_{\mathbf {{\rm\small L2}}} \perp \perp \mathbf {x}_{\mathbf {{\rm\small L3}}} \mid \mathbf {x}_{\mathbf {{\rm\small L1}}} \).
    Our approach relies on determining the \( \mathbf {{\rm\small L1}}, \mathbf {{\rm\small L2}}, \mathbf {{\rm\small L3}} \) level types by type inference, as they are not intrinsic to the variables or program in any way, but are designed solely to determine conditional independence relationships. These types are not accessible by the probabilistic programming user. Our type system makes it possible to answer various questions about conditional independence in a program. Assuming a program defining a joint density \( p(\mathbf {x}) \), we can use the type system to:
    (1)
Check if \( \mathbf {x}_2 \mathrel {\text{$\perp \perp $}}\mathbf {x}_3 \mid \mathbf {x}_1 \) for some partitioning \( \mathbf {x} = \mathbf {x}_1, \mathbf {x}_2, \mathbf {x}_3 \) (see the sketch after this list).
    (2)
    Find an optimal variable partitioning. Given a variable \( x \in \mathbf {x} \), find a partitioning \( \mathbf {x} = \mathbf {x}_1, \mathbf {x}_2, \mathbf {x}_3 \), such that \( x \in \mathbf {x}_2 \), \( \mathbf {x}_2 \mathrel {\text{$\perp \perp $}}\mathbf {x}_3 \mid \mathbf {x}_1 \), and \( \mathbf {x}_1 \) and \( \mathbf {x}_2 \) are as small as possible.
    (3)
    Ask questions about the Markov boundary of a variable. Given two variables x and \( x^{\prime } \), find the partitioning \( \mathbf {x} = x, \mathbf {x}_1, \mathbf {x}_2 \), such that \( x \mathrel {\text{$\perp \perp $}}\mathbf {x}_1 \mid \mathbf {x}_2 \) and \( \mathbf {x}_2 \) is as small as possible. Is \( x^{\prime } \) in \( \mathbf {x}_2 \)? In other words, is \( x^{\prime } \) in the Markov boundary of x?
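For instance, to answer question (1) for Program A with \( \mathbf {x}_2 = \lbrace z_1 \rbrace \), \( \mathbf {x}_3 = \lbrace y_2 \rbrace \), and \( \mathbf {x}_1 = \lbrace y_1, z_2 \rbrace \), we check whether the program typechecks under a level assignment of the following shape (a sketch; the levels of the deterministic variables are forced by the information flow):

// y1, z2           : L1   (the conditioning set)     -- hypothetical assignment
// z1, theta1, phi1 : L2
// y2, phi2         : L3

Such an assignment is derivable, since no information flows between the \( \mathbf {{\rm\small L2}} \) and \( \mathbf {{\rm\small L3}} \) variables, so the program is well-typed and \( z_1 \mathrel {\text{$\perp \perp $}}y_2 \mid y_1, z_2 \).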
    In the rest of Section 3, we give the \( \vdash _{2} \) type system (Section 3.1), state a noninterference result (Lemma 7, Lemma 8) and show that semantics is preserved when shredding \( \vdash _{2} \)-well-typed programs (Lemma 10). We present the type system and transformation rules in a declarative style. The implementation relies on type inference, which we discuss in Section 4.4. We derive a result about the way shredding factorises the density defined by the program (Theorem 2). We prove a conditional independence result (Section 3.2, Theorem 3) and discuss the scope of our approach with examples (Section 3.3).

    3.1 The \( \vdash _{2} \) Type System

    We introduce a modified version of SlicStan’s type system. Once again, types T range over pairs \( (\tau , \ell) \) of a base type τ, and a level type ℓ, but levels ℓ are one of \( \mathbf {{\rm\small L1}}, \mathbf {{\rm\small L2}}, \) or \( \mathbf {{\rm\small L3}} \), which form a lower semi-lattice \( \left(\lbrace \mathbf {{\rm\small L1}}, \mathbf {{\rm\small L2}}, \mathbf {{\rm\small L3}}\rbrace , \le \right) \), where \( \mathbf {{\rm\small L1}} \le \mathbf {{\rm\small L2}} \) and \( \mathbf {{\rm\small L1}} \le \mathbf {{\rm\small L3}} \). This means, for example, that an \( \mathbf {{\rm\small L2}} \) variable can depend on an \( \mathbf {{\rm\small L1}} \) variable, but an \( \mathbf {{\rm\small L3}} \) variable cannot depend on an \( \mathbf {{\rm\small L2}} \) variable, as level types \( \mathbf {{\rm\small L2}} \) and \( \mathbf {{\rm\small L3}} \) are incomparable.
The type system is a standard information flow type system, very similar to the \( \vdash \) system introduced in Section 2.2. We mark the only non-standard rules, Sample2, Factor2, and Seq2, which also differ from those of \( \vdash \). Sample2 and Factor2 both have the same effect as an assignment to an implicit weight variable that can be of any of the three levels. Seq2 is a less restrictive version of Seq, exactly as in Gorinova et al. 2019; it makes sure the program can be sliced later.
Note also that the non-interference between \( \mathbf {{\rm\small L2}} \) and \( \mathbf {{\rm\small L3}} \) relies on the PrimCall2 rule not being derivable when the least upper bound \( \bigsqcup _{i=1}^n \ell _i \) does not exist.
    We state and prove a noninterference result for \( \vdash _{2} \), which follows similarly to the result for \( \vdash \).
    Lemma 7 (Noninterference of \( \vdash _{2} \))
    Suppose \( s_1 \models \Gamma \), \( s_2 \models \Gamma \), and \( s_1 \approx _{\ell } s_2 \) for some ℓ. Then for a SlicStan statement S and expression E:
    (1)
    If \( ~\Gamma \vdash _{2}E: (\tau ,\ell) \) and \( (s_1, E) \Downarrow V_1 \) and \( (s_2, E) \Downarrow V_2 \), then \( V_1 = V_2 \).
    (2)
    If \( ~\Gamma \vdash _{2}S: \ell \) and \( (s_1, S) \Downarrow s_1^{\prime }, w_1 \) and \( (s_2, S) \Downarrow s_2^{\prime }, w_2 \), then \( s_1^{\prime } \approx _{\ell } s_2^{\prime } \).
    Proof.
    (1) follows by rule induction on the derivation \( \Gamma \vdash _{2}E:(\tau , \ell) \), and using that if \( \Gamma \vdash _{2}E:(\tau , \ell) \), \( x \in R(E) \) and \( \Gamma (x) = (\tau ^{\prime }, \ell ^{\prime }) \), then \( \ell ^{\prime } \le \ell \). (2) follows by rule induction on the derivation \( \Gamma \vdash _{2}S:\ell \) and using (1).□
    Once again, we derive a more convenient form of the noninterference result. Because the level types \( \mathbf {{\rm\small L2}} \) and \( \mathbf {{\rm\small L3}} \) are not comparable in the order \( \le \), changes in the store at \( \mathbf {{\rm\small L2}} \) do not affect the store at \( \mathbf {{\rm\small L3}} \) and vice versa.
    Lemma 8 (Noninterference of \( \vdash _{2} \)-well-typed Programs)
Let \( ~\Gamma _{\sigma }, \Gamma _{\mathbf {x}}, S \) be a SlicStan program, and \( \Gamma \vdash _{2}S : \mathbf {{\rm\small L1}} \). There exist unique functions \( f, g \), and h, such that for all \( \sigma \models \Gamma _{\sigma } \), \( \mathbf {x} \models \Gamma _{\mathbf {x}} \) and \( \sigma ^{\prime } \) such that \( [\![ S ]\!] _s(\sigma)(\mathbf {x}) = \sigma ^{\prime } \):
\[ \sigma ^{\prime }_{\mathbf {{\rm\small L1}}} = f(\sigma _{\mathbf {{\rm\small L1}}}, \mathbf {x}_{\mathbf {{\rm\small L1}}}), \qquad \sigma ^{\prime }_{\mathbf {{\rm\small L2}}} = g(\sigma _{\mathbf {{\rm\small L1}}}, \sigma _{\mathbf {{\rm\small L2}}}, \mathbf {x}_{\mathbf {{\rm\small L1}}}, \mathbf {x}_{\mathbf {{\rm\small L2}}}), \qquad \sigma ^{\prime }_{\mathbf {{\rm\small L3}}} = h(\sigma _{\mathbf {{\rm\small L1}}}, \sigma _{\mathbf {{\rm\small L3}}}, \mathbf {x}_{\mathbf {{\rm\small L1}}}, \mathbf {x}_{\mathbf {{\rm\small L3}}}). \]
    Proof.
    Follows from noninterference (Lemma 7).□
    Next, we extend the shredding relation from Section 2.5, and the concept of single-level statements, to SlicStan programs that are well-typed with respect to \( \vdash _{2} \). This is done by simply treating \( \mathbf {{\rm\small L1}} \) as \( \mathbf {{\rm\small DATA}} \), \( \mathbf {{\rm\small L2}} \) as \( \mathbf {{\rm\small MODEL}} \), and \( \mathbf {{\rm\small L3}} \) as \( \mathbf {{\rm\small GENQUANT}} \) for the purpose of shredding. We include the full definition of shredding with respect to \( \vdash _{2} \) for completeness below. We use the same notation \( \Updownarrow _{\Gamma } \), and we generally treat the standard shredding relation from Section 2.5 and the conditional independence shredding relation presented here, as the same relation, as there is no difference between the two, other than the naming of levels.
    As before, shredding produces single-level statements, and shredding preserves semantics with respect to \( \vdash _{2} \)-well-typed programs.
    Lemma 9 (Shredding Produces Single-level Statements, \( \vdash _{2} \))
If \( ~S \Updownarrow _{\Gamma } S_1, S_2, S_3 \), then \( \Gamma \vdash \mathbf {{\rm\small L1}}(S_1) \), \( \Gamma \vdash \mathbf {{\rm\small L2}}(S_2) \), and \( \Gamma \vdash \mathbf {{\rm\small L3}}(S_3) \).
    Lemma 10 (Semantic Preservation of \( \Updownarrow _{\Gamma } \), \( \vdash _{2} \))
    If \( ~\Gamma \vdash _{2}S:\mathbf {{\rm\small L1}} \) and \( S \Updownarrow _{\Gamma } S_1, S_2, S_3 \), then \( [\![ S ]\!] = [\![ S_1; S_2; S_3 ]\!] \).
    In addition, we derive a result about the effect of single-level statements on the store and weight of \( \vdash _{2} \)-well-typed programs.
    Lemma 11 (Property of \( \vdash _{2} \) Single-level Statements)
Let \( ~\Gamma _{\sigma }, \Gamma _{\mathbf {x}}, S \) be a SlicStan program, with \( \Gamma \vdash _{2}S : \mathbf {{\rm\small L1}} \), and let S be a single-level statement of level ℓ, \( \Gamma \vdash _{2}\ell (S) \). Then there exist unique functions f and φ, such that for any \( \sigma , \mathbf {x} \models \Gamma _{\sigma }, \Gamma _{\mathbf {x}} \):
    (1)
If \( \ell = \mathbf {{\rm\small L1}} \), then \( [\![ S ]\!] (\sigma)(\mathbf {x}) \;\;= \;\;\left(f(\sigma _{\mathbf {{\rm\small L1}}}, \mathbf {x}_{\mathbf {{\rm\small L1}}}), \sigma _{\mathbf {{\rm\small L2}}}, \sigma _{\mathbf {{\rm\small L3}}} \right)\!, \qquad \qquad \; \phi (\sigma _{\mathbf {{\rm\small L1}}})(\mathbf {x}_{\mathbf {{\rm\small L1}}}), \)
    (2)
If \( \ell = \mathbf {{\rm\small L2}} \), then \( [\![ S ]\!] (\sigma)(\mathbf {x}) \;\;= \;\;\left(\sigma _{\mathbf {{\rm\small L1}}}, f(\sigma _{\mathbf {{\rm\small L1}}}, \sigma _{\mathbf {{\rm\small L2}}}, \mathbf {x}_{\mathbf {{\rm\small L1}}}, \mathbf {x}_{\mathbf {{\rm\small L2}}}), \sigma _{\mathbf {{\rm\small L3}}} \right)\!, \;\;\phi (\sigma _{\mathbf {{\rm\small L1}}}, \sigma _{\mathbf {{\rm\small L2}}})(\mathbf {x}_{\mathbf {{\rm\small L1}}}, \mathbf {x}_{\mathbf {{\rm\small L2}}}), \)
    (3)
If \( \ell = \mathbf {{\rm\small L3}} \), then \( [\![ S ]\!] (\sigma)(\mathbf {x}) \;\;= \;\;\left(\sigma _{\mathbf {{\rm\small L1}}}, \sigma _{\mathbf {{\rm\small L2}}}, f(\sigma _{\mathbf {{\rm\small L1}}}, \sigma _{\mathbf {{\rm\small L3}}}, \mathbf {x}_{\mathbf {{\rm\small L1}}}, \mathbf {x}_{\mathbf {{\rm\small L3}}}) \right)\!, \;\;\phi (\sigma _{\mathbf {{\rm\small L1}}}, \sigma _{\mathbf {{\rm\small L3}}})(\mathbf {x}_{\mathbf {{\rm\small L1}}}, \mathbf {x}_{\mathbf {{\rm\small L3}}}). \)
    We give proofs for Lemma 9, 10, and 11 in Appendix A. These results allow us to derive the second key theorem of this article, Theorem 2, which, similarly to Theorem 1, gives us a result on the way shredding factorises the density defined by the program.
    Here, and throughout the article, we use subscripts to refer to specific subsets of Γ. For example, \( \Gamma _{\mathbf {{\rm\small L1}}} \) stands for the subset of the parameters \( \Gamma _{\mathbf {x}} \), such that \( x:(\tau , \ell) \in \Gamma _{\mathbf {{\rm\small L1}}} \) if and only if \( x:(\tau , \ell) \in \Gamma _{\mathbf {x}} \) and \( \ell = \mathbf {{\rm\small L1}} \).
    Theorem 2 (Shredding Induces a Factorisation of the Density (2)).
Suppose \( \Gamma \vdash _{2}S : \mathbf {{\rm\small L1}} \) with \( \Gamma = \Gamma _{\sigma }, \Gamma _{\mathbf {{\rm\small L1}}}, \Gamma _{\mathbf {{\rm\small L2}}}, \Gamma _{\mathbf {{\rm\small L3}}} \) and \( ~S \Updownarrow _{\Gamma } S_1, S_2, S_3 \). Then for \( \sigma , \boldsymbol {\theta }_1, \boldsymbol {\theta }_2, \boldsymbol {\theta }_3 \models \Gamma _{\sigma }, \Gamma _{\mathbf {{\rm\small L1}}}, \Gamma _{\mathbf {{\rm\small L2}}}, \Gamma _{\mathbf {{\rm\small L3}}} \), and \( ~\sigma ^{\prime }, \sigma ^{\prime \prime } \) such that \( [\![ S_1 ]\!] _s(\sigma)(\boldsymbol {\theta }_1, \boldsymbol {\theta }_2, \boldsymbol {\theta }_3) = \sigma ^{\prime } \) and \( ~[\![ S_2 ]\!] _s(\sigma ^{\prime })(\boldsymbol {\theta }_1, \boldsymbol {\theta }_2, \boldsymbol {\theta }_3) = \sigma ^{\prime \prime } \), we have:
    (1)
    \( [\![ S_1 ]\!] _p(\sigma)(\boldsymbol {\theta }_1, \boldsymbol {\theta }_2, \boldsymbol {\theta }_3) = \phi _1(\boldsymbol {\theta }_1), \)
    (2)
    \( [\![ S_2 ]\!] _p(\sigma ^{\prime })(\boldsymbol {\theta }_1, \boldsymbol {\theta }_2, \boldsymbol {\theta }_3) = \phi _2(\boldsymbol {\theta }_1, \boldsymbol {\theta }_2), \)
    (3)
    \( [\![ S_3 ]\!] _p(\sigma ^{\prime \prime })(\boldsymbol {\theta }_1, \boldsymbol {\theta }_2, \boldsymbol {\theta }_3) = \phi _3(\boldsymbol {\theta }_1, \boldsymbol {\theta }_3). \)
    Proof.
    By applying Lemma 11 to each of \( S_1, S_2, S_3 \), which are single-level statements (Lemma 9).□

    3.2 Conditional Independence Result for \( \vdash _{2} \)-Well-typed Programs

    Theorem 3 states the key theoretical result of this article: the typing in programs well-typed with respect to \( \vdash _{2} \) corresponds to a conditional independence relationship. In our proofs, we use the factorisation characterisation of conditional independence stated by Definition 6. This is a well-known result in the literature (e.g., Murphy 2012, Theorem 2.2.1).
    Definition 6 (Characterisation of Conditional Independence as Factorisation).
    For variables \( x, y, z \) and a density \( p(x, y, z) \), x is conditionally independent of y given z with respect to p, written \( x \mathrel {\text{$\perp \perp $}}_p y \mid z \), if and only if \( ~ \exists \phi _1, \phi _2 \text{ such that } p(x, y, z) = \phi _1(x, z) \phi _2(y, z) \).
    An equivalent formulation is \( p(x, y \mid z) = p(x \mid z)p(y \mid z) \).
    We extend the notion of conditional independence to apply to a general function \( \phi (x, y, z) \), using the notation \( x \perp _{\phi } y \mid z \) to mean \( \exists \phi _1, \phi _2 \text{ such that } \phi (x, y, z) = \phi _1(x, z) \phi _2(y, z) \).
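For example, in a chain-structured model \( x \rightarrow z \rightarrow y \), the joint density factorises along the graph, and grouping the factors exhibits exactly the form required by Definition 6:
\[ p(x, y, z) = p(x)\,p(z \mid x)\,p(y \mid z) = \underbrace{p(x)\,p(z \mid x)}_{\phi _1(x, z)} \cdot \underbrace{p(y \mid z)}_{\phi _2(y, z)}, \quad \text{so } x \mathrel {\text{$\perp \perp $}}_p y \mid z. \]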
    Theorem 3 (\( \vdash _{2} \)-Well-typed Programs Induce a Conditional Independence Relationship)
    For a SlicStan program \( \Gamma , S \) such that \( \Gamma \vdash _{2}S : \mathbf {{\rm\small L1}} \), \( \Gamma = \Gamma _\sigma ,\Gamma _{\mathbf {{\rm\small L1}}},\Gamma _{\mathbf {{\rm\small L2}}},\Gamma _{\mathbf {{\rm\small L3}}} \), and for \( \sigma , \boldsymbol {\theta }_1, \boldsymbol {\theta }_2, \boldsymbol {\theta }_3 \models \Gamma _{\sigma }, \Gamma _{\mathbf {{\rm\small L1}}},\Gamma _{\mathbf {{\rm\small L2}}},\Gamma _{\mathbf {{\rm\small L3}}} \), we have \( \boldsymbol {\theta }_2 \perp _{\phi } \boldsymbol {\theta }_3 \mid \boldsymbol {\theta }_1 \).
    When \( [\![ S ]\!] _p(\sigma)(\boldsymbol {\theta }_1, \boldsymbol {\theta }_2, \boldsymbol {\theta }_3) \propto p(\boldsymbol {\theta }_1, \boldsymbol {\theta }_2, \boldsymbol {\theta }_3) \), we have \( \boldsymbol {\theta }_2 \mathrel {\text{$\perp \perp $}}_p \boldsymbol {\theta }_3 \mid \boldsymbol {\theta }_1 \).
    Proof.
    Let \( \boldsymbol {\theta }= \boldsymbol {\theta }_1, \boldsymbol {\theta }_2, \boldsymbol {\theta }_3 \), \( S \Updownarrow _{\Gamma }S_1, S_2, S_3 \), and let \( \sigma ^{\prime } \) and \( \sigma ^{\prime \prime } \) be such that \( \sigma ^{\prime } = [\![ S_1 ]\!] _s(\sigma)(\boldsymbol {\theta }) \), and \( \sigma ^{\prime \prime } = [\![ S_2 ]\!] _s(\sigma ^{\prime })(\boldsymbol {\theta }) \). Then, by semantic preservation of shredding (Lemma 10), we have
    \begin{align*} [\![ S ]\!] _p(\sigma)(\boldsymbol {\theta }) &= [\![ S_1; S_2; S_3 ]\!] _p(\sigma)(\boldsymbol {\theta }) && \text{by Lemma~10,}\\ &= [\![ S_1 ]\!] _p(\sigma)(\boldsymbol {\theta }) \times [\![ S_2 ]\!] _p(\sigma ^{\prime })(\boldsymbol {\theta }) \times [\![ S_3 ]\!] _p(\sigma ^{\prime \prime })(\boldsymbol {\theta }) && \text{by Lemma~2,} \\ &= \phi _1(\boldsymbol {\theta }_1) \times \phi _2(\boldsymbol {\theta }_1, \boldsymbol {\theta }_2) \times \phi _3(\boldsymbol {\theta }_1, \boldsymbol {\theta }_3) && \text{by Theorem~2,} \\ &= \phi ^{\prime }(\boldsymbol {\theta }_1, \boldsymbol {\theta }_2) \times \phi _3(\boldsymbol {\theta }_1, \boldsymbol {\theta }_3), \end{align*}
for some \( \phi _1, \phi _2 \), and \( \phi _3 \), where \( \phi ^{\prime }(\boldsymbol {\theta }_1, \boldsymbol {\theta }_2) = \phi _1(\boldsymbol {\theta }_1) \times \phi _2(\boldsymbol {\theta }_1, \boldsymbol {\theta }_2) \). Thus, \( \boldsymbol {\theta }_2 \perp _{\phi } \boldsymbol {\theta }_3 \mid \boldsymbol {\theta }_1 \) by definition of \( \perp _{\phi } \).
    Suppose \( \phi (\boldsymbol {\theta }_1, \boldsymbol {\theta }_2, \boldsymbol {\theta }_3) \propto p(\boldsymbol {\theta }_1, \boldsymbol {\theta }_2, \boldsymbol {\theta }_3) \). Then \( p(\boldsymbol {\theta }_1, \boldsymbol {\theta }_2, \boldsymbol {\theta }_3) = \phi (\boldsymbol {\theta }_1, \boldsymbol {\theta }_2, \boldsymbol {\theta }_3) \times Z = \phi ^{\prime }(\boldsymbol {\theta }_1, \boldsymbol {\theta }_2) \times \phi _3(\boldsymbol {\theta }_1, \boldsymbol {\theta }_3) \times Z = \phi ^{\prime }(\boldsymbol {\theta }_1, \boldsymbol {\theta }_2) \times \phi ^{\prime \prime }(\boldsymbol {\theta }_1, \boldsymbol {\theta }_3) \), where Z is a constant and \( \phi ^{\prime \prime }(\boldsymbol {\theta }_1, \boldsymbol {\theta }_3) = \phi _3(\boldsymbol {\theta }_1, \boldsymbol {\theta }_3) \times Z \). Therefore, \( \boldsymbol {\theta }_2 \mathrel {\text{$\perp \perp $}}_p \boldsymbol {\theta }_3 \mid \boldsymbol {\theta }_1 \).□

    3.3 Scope of the Conditional Independence Result

    We have shown that \( \vdash _{2} \)-well-typed programs exhibit a conditional independence relationship in their density semantics. However, it is not the case that every conditional independence relationship can be derived from the type system. In particular, we can only derive results of the form \( \boldsymbol {\theta }_2 \mathrel {\text{$\perp \perp $}}\boldsymbol {\theta }_3 \mid \boldsymbol {\theta }_1 \), where \( \boldsymbol {\theta }_1, \boldsymbol {\theta }_2, \boldsymbol {\theta }_3 \) is a partitioning of \( \boldsymbol {\theta }\models \Gamma _{\mathbf {x}} \) for a SlicStan program \( \Gamma _{\sigma }, \Gamma _{\mathbf {x}}, S \). That is, the relationship includes all parameters in the program.
We discuss the scope of our approach using an example, and show a situation where attempting to derive a conditional independence relationship that does not hold leads to a failure to typecheck.

    3.3.1 Example of \( \vdash _{2} \)-well-typed Program \( \rightarrow \) Conditional Independence.

Consider the cross model in Figure 4: its SlicStan program (a), its directed graphical model (b), and the conditional independence (CI) relationships that hold for that model (c).
Fig. 4. The cross model, as written in SlicStan (a) with its DAG (b) and CI relationships (c).
    Out of the many relationships above, we can derive all relationships that involve all the variables. That is, we can use our type system to derive all conditional independence relationships that hold and are of the form \( A \mathrel {\text{$\perp \perp $}}B \mid C \), where \( A, B, C \) is some partitioning of \( \lbrace x_1, \dots , x_5\rbrace \). However, note the following properties of conditional independence:
    \[ A \mathrel {\text{$\perp \perp $}}B \mid C \iff B \mathrel {\text{$\perp \perp $}}A \mid C \quad \text{ and } \quad A \mathrel {\text{$\perp \perp $}}B_1, B_2 \mid C \iff A \mathrel {\text{$\perp \perp $}}B_1 \mid C \text{ and } A \mathrel {\text{$\perp \perp $}}B_2 \mid C. \]
    Some of the relationships above can be combined and written in other ways, e.g., \( x_1 \mathrel {\text{$\perp \perp $}}x_4 \mid x_2, x_3 \) and \( x_1 \mathrel {\text{$\perp \perp $}}x_5 \mid x_2, x_3 \) can be written as a single relationship \( x_1 \mathrel {\text{$\perp \perp $}}x_4, x_5 \mid x_2, x_3 \), thus expressing them as a single relationship that includes all variables in the program.
Exploring different mappings between the parameters \( x_1, \dots , x_5 \) and the type levels \( \mathbf {{\rm\small L1}}, \mathbf {{\rm\small L2}}, \mathbf {{\rm\small L3}} \) for which the above program typechecks, we can derive all CI relationships that hold for this model, except for one: the marginal independence \( x_1 \mathrel {\text{$\perp \perp $}}x_2 \), which our approach cannot derive.

    3.3.2 Conditional Independence Relationship does not hold \( \rightarrow \) Type Error.

    Suppose that we try to derive the result \( x_1 \mathrel {\text{$\perp \perp $}}x_2 \mid x_3, x_4, x_5 \). This does not hold for Program C. By Theorem 3, we have that a program being \( \vdash _{2} \)-well-typed implies that \( \mathbf {{\rm\small L2}} \mathrel {\text{$\perp \perp $}}\mathbf {{\rm\small L3}} \mid \mathbf {{\rm\small L1}} \). So, we can derive \( x_1 \mathrel {\text{$\perp \perp $}}x_2 \mid x_3, x_4, x_5 \) using Theorem 3 if we show that \( \Gamma \vdash _{2}S: \mathbf {{\rm\small L1}} \), for \( \Gamma = \lbrace x_1 : \mathbf {{\rm\small L2}}, x_2 : \mathbf {{\rm\small L3}}, x_3 : \mathbf {{\rm\small L1}}, x_4 : \mathbf {{\rm\small L1}}, x_5 : \mathbf {{\rm\small L1}} \rbrace \) and S being Program C.
To typecheck \( \Gamma \vdash _{2}S:\mathbf {{\rm\small L1}} \), we need to typecheck \( x_3 \sim \sf normal(x_1+x_2,1) \) at some level ℓ. Thus, by Sample2 and PrimCall2, \( x_1 \), \( x_2 \), and \( x_3 \) need to typecheck at ℓ. The types of \( x_1 \), \( x_2 \), and \( x_3 \) are \( \mathbf {{\rm\small L2}} \), \( \mathbf {{\rm\small L3}} \), and \( \mathbf {{\rm\small L1}} \), respectively. So, using ESub2, it must be the case that \( \mathbf {{\rm\small L2}} \le \ell \), \( \mathbf {{\rm\small L3}} \le \ell \), and \( \mathbf {{\rm\small L1}} \le \ell \). However, no such level exists in our lower semi-lattice, as \( \mathbf {{\rm\small L2}} \) and \( \mathbf {{\rm\small L3}} \) have no upper bound. Therefore, typechecking fails and we cannot derive \( x_1 \mathrel {\text{$\perp \perp $}}x_2 \mid x_3, x_4, x_5 \).
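To make the failure concrete, the following toy Python sketch (illustrative only, not part of the SlicStan implementation) computes least upper bounds in the lower semi-lattice and confirms that the three operand levels admit no common level ℓ:

def lub(a, b):
    # least upper bound in the lower semi-lattice L1 <= L2, L1 <= L3
    if a == b or a == "L1":
        return b
    if b == "L1":
        return a
    return None  # L2 and L3 have no upper bound

# x3 ~ normal(x1 + x2, 1) must typecheck at a level above all three operands
level = "L1"
for operand_level in ["L2", "L3", "L1"]:  # levels of x1, x2, x3
    level = lub(level, operand_level) if level is not None else None
print(level)  # None: no such level exists, so typechecking fails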

    4 Application: Discrete Parameters Support Through A Semantics-preserving Transformation

    This section presents the main practical contribution of our work: a semantics-preserving procedure for transforming a probabilistic program to enable combined inference of discrete and continuous model parameters, which we have implemented for SlicStan. The procedure corresponds to variable elimination (VE) for discrete parameters implemented in the probabilistic program itself, which can be combined with gradient-based methods, such as HMC, to perform inference on all parameters.
PPLs that have gradient-based methods at the core of their inference strategy do not, in general, support working with discrete parameters directly. Stan disallows discrete model parameters altogether, while Pyro Uber AI Labs 2017 and Edward2 Tran et al. 2018 throw a runtime error whenever discrete parameters are used within a gradient-based method. However, working with discrete parameters in these languages is still possible, albeit in an implicit way. In many cases, discrete parameters can be marginalised out manually, and then drawn conditionally on the continuous parameters. Stan’s user guide shows many examples of this approach Stan Development Team 2019a, Chapter 7. Pyro provides an on-request marginalisation functionality, which automates this implicit treatment for plated factor graphs Obermeyer et al. 2019.
    The key idea of the workaround is to marginalise out the discrete parameters by hand, so the resulting program corresponds to a density function that does not depend on any discrete parameters. That is, the user writes a program that computes \( \sum _{\boldsymbol {\theta }_d}p(\boldsymbol {\theta }_d, \boldsymbol {\theta }_c) = p(\boldsymbol {\theta }_c) \), where the density semantics of the original program was \( p(\boldsymbol {\theta }_d, \boldsymbol {\theta }_c) \) for discrete parameters \( \boldsymbol {\theta }_d \) and continuous parameters \( \boldsymbol {\theta }_c \). This allows for continuous parameters of the program to be sampled with HMC, or other gradient-based inference algorithms, whereas that would have not been possible for the program with both discrete and continuous latent variables.
Because a SlicStan program computes a density directly, it is easy to modify it to marginalise a variable. For a SlicStan program \( \Gamma , S \), with parameters \( \mathbf {x} \models \Gamma _{\mathbf {x}} \), and a discrete parameter z of type \( \sf int\langle K \rangle \), the program \( \sf elim(\sf int\langle K \rangle z)~S \triangleq \sf factor(\mathrm{sum}([\sf target(S) \mid z~\sf in~ 1 : K])) \) marginalises z.
In other words, we can easily marginalise out all discrete variables in a probabilistic program by encapsulating the entire program in nested loops (nested array comprehension expressions in our examples). However, this approach becomes infeasible for more than a few variables. Variable elimination Zhang and Poole 1994 Koller and Friedman 2009 exploits the structure of a model to do as little work as possible. Consider the HMM snippet (Program D) with three discrete (binary) hidden variables \( z_1, z_2 \), and \( z_3 \), and observed outcomes \( y_1, y_2 \), and \( y_3 \). Naively marginalising out the hidden variables results in nested loops around the original program (Program E). In the general case of N hidden variables, the resulting program is of complexity \( O(2^N) \).
However, this is wasteful: Expressions like \( z_3 \sim \mathrm{bernoulli}(\theta [z_2]) \) do not depend on \( z_1 \), and so do not need to be inside the \( z_1 \)-elimination loop. VE avoids this problem by pre-computing some of the work. Program F implements VE for this model: When eliminating a variable, say, \( z_1 \), we pre-compute the statements that involve \( z_1 \) for each possible value of \( z_1 \) and store the resulting density contributions in a new factor, \( f_1 \). This new factor depends on the variables involved in those statements—the neighbours of \( z_1 \)—in this case, solely \( z_2 \). We then repeat the procedure for the other variables, re-using the already computed factors where possible.
    In the special case of an HMM, and given a suitable elimination order, variable elimination recovers the celebrated forward algorithm Rabiner 1989, which has time complexity O(N). Our goal is to automatically translate the source code of Program D to Program F, exploiting statically detectable independence properties in the model.
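To make the contrast between Programs E and F concrete, the following NumPy sketch computes the marginal likelihood of a three-step binary HMM both naively and by elimination; the probability tables and observations are hypothetical placeholders, not values from the article:

import numpy as np

theta = np.array([[0.7, 0.3], [0.2, 0.8]])  # transition: theta[z_prev, z_next]
emit = np.array([[0.9, 0.1], [0.4, 0.6]])   # emission: emit[z, y]
prior = np.array([0.5, 0.5])                # P(z1)
y = [1, 0, 1]                               # observed outcomes y1, y2, y3

# Naive marginalisation: loop over all 2^3 hidden configurations (Program E style).
p_naive = sum(
    prior[z1] * emit[z1, y[0]] * theta[z1, z2] * emit[z2, y[1]]
    * theta[z2, z3] * emit[z3, y[2]]
    for z1 in (0, 1) for z2 in (0, 1) for z3 in (0, 1)
)

# Variable elimination (Program F style): eliminating z1 produces a factor f1
# over its only neighbour z2; eliminating z2 then produces f2 over z3.
f1 = np.array([sum(prior[z1] * emit[z1, y[0]] * theta[z1, z2] for z1 in (0, 1))
               for z2 in (0, 1)])
f2 = np.array([sum(f1[z2] * emit[z2, y[1]] * theta[z2, z3] for z2 in (0, 1))
               for z3 in (0, 1)])
p_ve = sum(f2[z3] * emit[z3, y[2]] for z3 in (0, 1))

assert np.isclose(p_naive, p_ve)  # same marginal, linear rather than exponential work

Each elimination step touches only the eliminated variable and its neighbours, which is what keeps the transformed program linear in the length of the chain.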

    4.1 Goal

    Our ultimate goal is to transform a program S with continuous parameters \( \boldsymbol {\theta }_c \), discrete parameters \( \boldsymbol {\theta }_d \), data \( \mathcal {D} \), and density semantics \( [\![ S ]\!] _p(\sigma)(\boldsymbol {\theta }_d, \boldsymbol {\theta }_c, \mathcal {D}) \propto p(\boldsymbol {\theta }_d, \boldsymbol {\theta }_c \mid \mathcal {D}) \) into two subprograms: \( S_{{\rm\small HMC}} \) and \( S_{{\rm\small GEN}} \), such that:
    The density defined by \( ~S_{{\rm\small HMC}} \) is the marginal \( p(\boldsymbol {\theta }_c \mid \mathcal {D}) \), with the discrete parameters \( \boldsymbol {\theta }_d \) marginalised out. This first statement, \( S_{{\rm\small HMC}} \), represents the marginalisation part of the program (see Section 4.3) and allows for Hamiltonian Monte Carlo (HMC) sampling of \( \boldsymbol {\theta }_c \), as it does not involve any discrete parameters.
    The density defined by \( S_{{\rm\small GEN}} \) is the conditional \( p(\boldsymbol {\theta }_d \mid \boldsymbol {\theta }_c, \mathcal {D}) \). This second statement, \( S_{{\rm\small GEN}} \), represents the generative part of the program (Section 4.5) and it encodes a way to draw \( \boldsymbol {\theta }_d \) generatively, without using HMC or another heavy-weight inference algorithm.
Similarly to the extended SlicStan slicing based on information-flow type inference, here we also want to transform and slice the program into sub-programs, each focusing on a subset of the parameters while preserving the overall meaning:
    \[ [\![ S ]\!] _p \propto p(\boldsymbol {\theta }_d, \boldsymbol {\theta }_c \mid \mathcal {D}) = p(\boldsymbol {\theta }_c \mid \mathcal {D}) \times p(\boldsymbol {\theta }_d \mid \boldsymbol {\theta }_c, \mathcal {D}) \propto [\![ S_{{\rm\small HMC}} ]\!] _p\times [\![ S_{{\rm\small GEN}} ]\!] _p = [\![ S_{{\rm\small HMC}}; S_{{\rm\small GEN}} ]\!] _p. \]
    Our approach performs a semantics-preserving transformation, guided by information-flow and type inference, which creates an efficient program-specific inference algorithm automatically, combining HMC with variable elimination.

    4.2 Key Insight

    The key practical insight of this work is to use the adaptation of SlicStan’s level types of Section 3 and its information flow type system to rearrange the program in a semantics-preserving way, so discrete parameters can be forward-sampled, instead of sampled using a heavy-weight inference algorithm. We achieve this by a program transformation for each of the discrete variables. Assuming that we are applying the transformation with respect to a variable z, we use:
The top-level information flow type system \( \Gamma \vdash S : \mathbf {{\rm\small DATA}} \) from Section 2.2, which involves the level types \( \mathbf {{\rm\small DATA}} \le \mathbf {{\rm\small MODEL}} \le \mathbf {{\rm\small GENQUANT}} \). This partitions the modelled variables \( \mathbf {x} \) into data \( \mathcal {D} \), model parameters \( \boldsymbol {\theta } \), and generated quantities Q. When we use type inference for \( \vdash \) in conjunction with shredding \( S \Updownarrow _{\Gamma }S_D, S_M, S_Q \) (Section 2.5), we slice the statement S into a data part \( S_D \) (involving only variables in \( \mathcal {D} \)), a non-generative part \( S_M \) (involving \( \mathcal {D} \) and \( \boldsymbol {\theta } \)), and a generative part \( S_Q \) (involving \( \mathcal {D} \), \( \boldsymbol {\theta } \), and Q).
The conditional independence information flow type system, \( \Gamma \vdash _{2}S : \mathbf {{\rm\small L1}} \) from Section 3, which uses a lower semi-lattice of level types \( \mathbf {{\rm\small L1}} \le \mathbf {{\rm\small L2}} \), \( \mathbf {{\rm\small L1}} \le \mathbf {{\rm\small L3}} \). A \( \vdash _{2} \)-well-typed program induces a conditional independence relationship: \( \mathbf {{\rm\small L2}} \)-variables are conditionally independent of \( \mathbf {{\rm\small L3}} \)-variables given \( \mathbf {{\rm\small L1}} \)-variables: \( \mathbf {x}_{\mathbf {{\rm\small L2}}} \mathrel {\text{$\perp \perp $}}\mathbf {x}_{\mathbf {{\rm\small L3}}} \mid \mathbf {x}_{\mathbf {{\rm\small L1}}} \), where \( \mathbf {x} = \mathbf {x}_{\mathbf {{\rm\small L1}}}, \mathbf {x}_{\mathbf {{\rm\small L2}}}, \mathbf {x}_{\mathbf {{\rm\small L3}}} = \boldsymbol {\theta }, \mathcal {D} \). When we use type inference for \( \vdash _{2} \) in conjunction with shredding \( S \Updownarrow _{\Gamma }S_1, S_2, S_3 \) (Section 2.5), we isolate \( S_2 \): a part of the program that does not interfere with \( S_3 \). We can marginalise out \( \mathbf {{\rm\small L2}} \)-variables in that sub-statement only, keeping the rest of the program unchanged.
    The discrete variable transformation relation \( \Gamma , S \xrightarrow {z} \Gamma ^{\prime }, S^{\prime } \) (defined in Section 4.6.2), which takes a SlicStan program \( \Gamma , S \) that has discrete model parameter z, and transforms it to a SlicStan program \( \Gamma ^{\prime }, S^{\prime } \), where z is no longer a \( \mathbf {{\rm\small MODEL}} \)-level parameter but instead one of level \( \mathbf {{\rm\small GENQUANT}} \). We define the relation in terms of \( \vdash \) and \( \vdash _{2} \) as per the Elim Gen rule.

    4.3 Variable Elimination

    VE Zhang and Poole 1994 Koller and Friedman 2009 is an exact inference algorithm often phrased in terms of factor graphs. It can be used to compute prior or posterior marginal distributions by eliminating, one-by-one, variables that are irrelevant to the distribution of interest. VE uses dynamic programming combined with a clever use of the distributive law of multiplication over addition to efficiently compute a nested sum of a product of expressions.
We already saw an example of variable elimination in Section 3 (Programs A and B). The idea is to eliminate (marginalise out) variables one-by-one. To eliminate a variable z, we multiply all of the factors connected to z to form a single expression, then sum over all possible values for z to create a new factor, remove z from the graph, and finally connect the new factor to all former neighbours of z. Recall Program D, with latent variables \( z_1, z_2, z_3 \) and observed data \( \mathbf {y} = y_1, y_2, y_3 \). Figure 5 shows the VE algorithm applied step-by-step to this program. We eliminate \( z_1 \) to get the marginal on \( z_2 \) and \( z_3 \) (Figure 5(a) and (b)), then eliminate \( z_2 \) to get the marginal on \( z_3 \) (Figure 5(c) and (d)).
Fig. 5. Step-by-step example of variable elimination.

    4.4 Conditional Independence Relationships and Inferring the Markov Blanket

The key property we are looking for, to be able to marginalise out a variable independently of another, is conditional independence given neighbouring variables. If we shred a \( \vdash _{2} \)-well-typed program into \( S_1, S_2 \), and \( S_3 \), and think of \( [\![ S_1 ]\!] _p, [\![ S_2 ]\!] _p \), and \( [\![ S_3 ]\!] _p \) as factors, it is easy to visualise the factor graph corresponding to the program: it is as in Figure 6(a). Eliminating all \( \mathbf {x}_{\mathbf {{\rm\small L2}}} \) variables ends up modifying only the \( [\![ S_2 ]\!] _p \) factor (Figure 6(b)).
    When using VE to marginalise out a parameter z, we want to find the smallest set of other parameters A, such that \( z \mathrel {\text{$\perp \perp $}}B \mid A \), where B is the rest of the parameters. The set A is also called z’s minimal Markov blanket or Markov boundary. Once we know this set, we can ensure that we involve the smallest possible number of variables in z’s elimination, which is important to achieve a performant algorithm.
For example, when we eliminate \( z_1 \) in Program D, both \( z_2 \) and \( y_1 \) need to be involved, as \( z_1 \) shares a factor with them. By contrast, there is no need to include \( y_2, z_3, y_3 \) and the statements associated with them, as they are unaffected by \( z_1 \), given \( z_2 \). The variables \( y_1 \) and \( z_2 \) form \( z_1 \)’s Markov blanket: Given these variables, \( z_1 \) is conditionally independent of all other variables. That is, \( z_1 \mathrel {\text{$\perp \perp $}}z_3, y_2, y_3 \mid z_2, y_1 \).
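The blanket can be read off mechanically from factor-to-variable adjacency. The dictionary below is a hypothetical Python encoding of Program D's factors, for illustration only:

# each factor of Program D, mapped to the set of variables in its scope
factors = {
    "p(z1)": {"z1"},            "p(y1|z1)": {"z1", "y1"},
    "p(z2|z1)": {"z1", "z2"},   "p(y2|z2)": {"z2", "y2"},
    "p(z3|z2)": {"z2", "z3"},   "p(y3|z3)": {"z3", "y3"},
}

def markov_blanket(var):
    # every other variable that shares a factor with `var`
    return {v for scope in factors.values() if var in scope for v in scope} - {var}

assert markov_blanket("z1") == {"y1", "z2"}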
    The type system we present in Section 3 can tell us if the conditional independence relationship \( \mathbf {x}_{\mathbf {{\rm\small L2}}} \mathrel {\text{$\perp \perp $}}\mathbf {x}_{\mathbf {{\rm\small L3}}} \mid \mathbf {x}_{\mathbf {{\rm\small L1}}} \) holds for a concrete partitioning of the modelled variables \( \mathbf {x} = \mathbf {x}_{\mathbf {{\rm\small L1}}}, \mathbf {x}_{\mathbf {{\rm\small L2}}}, \mathbf {x}_{\mathbf {{\rm\small L3}}} \). But to find the Markov blanket of a variable z we want to eliminate, we rely on type inference. We define a performance ordering between the level types \( \mathbf {{\rm\small L3}} \prec \mathbf {{\rm\small L1}} \prec \mathbf {{\rm\small L2}} \), where our first preference is for variables to be of level \( \mathbf {{\rm\small L3}} \), level \( \mathbf {{\rm\small L1}} \) is our second preference, and \( \mathbf {{\rm\small L2}} \) is our last resort. In our implementation, we use bidirectional type-checking Pierce and Turner 2000 to synthesise hard constraints imposed by the type system, and resolve them, while optimising for the soft constraints given by the \( \prec \) ordering. This maximises the number of variables that are conditionally independent of z given its blanket (\( \mathbf {{\rm\small L3}} \)) and minimises the number of variables forming the blanket (\( \mathbf {{\rm\small L1}} \)). Fixing z to be of \( \mathbf {{\rm\small L2}} \) level, and \( \mathbf {{\rm\small L2}} \) being the least preferred option, ensures that only z and variables dependent on z through deterministic assignment are of that level.
Fig. 6. The factor graph and VE induced by the shredding \( S\Updownarrow _{\Gamma }S_1,S_2,S_3 \) according to the semi-lattice \( \mathbf {{\rm\small L1}}\le \mathbf {{\rm\small L2}},\mathbf {{\rm\small L3}} \).

    4.5 Sampling the Discrete Parameters

    Variable elimination gives a way to efficiently marginalise out a variable z from a model defining density \( p(\mathbf {x}) \), to obtain a new density \( p(\mathbf {x} \setminus \lbrace z\rbrace) \). In the context of SlicStan, this means we have the tools to eliminate all discrete parameters \( \boldsymbol {\theta }_d \), from a density \( p(\mathcal {D}, \boldsymbol {\theta }_c, \boldsymbol {\theta }_d) \) on data \( \mathcal {D} \), continuous parameters \( \boldsymbol {\theta }_c \) and discrete parameters \( \boldsymbol {\theta }_d \). The resulting marginal \( \sum _{\boldsymbol {\theta }_d}p(\mathcal {D}, \boldsymbol {\theta }_c, \boldsymbol {\theta }_d) = p(\mathcal {D}, \boldsymbol {\theta }_c) \) does not involve discrete parameters and, therefore, we can use gradient-based methods to infer \( \boldsymbol {\theta }_c \). However, the method so far does not give us a way to infer the discrete parameters \( \boldsymbol {\theta }_d \).
    To infer these, we observe that \( p(\mathbf {x}) = p(\mathbf {x} \setminus \lbrace z\rbrace) p(z \mid \mathbf {x} \setminus \lbrace z\rbrace) \), which means that we can preserve the semantics of the original model (which defines \( p(\mathbf {x}) \)), by finding an expression for the conditional \( p(z \mid \mathbf {x} \setminus \lbrace z\rbrace) \). If \( \mathbf {x}_1, \mathbf {x}_2 \) is a partitioning of \( \mathbf {x} \setminus \lbrace z\rbrace \) such that \( z \mathrel {\text{$\perp \perp $}}\mathbf {x}_2 \mid \mathbf {x}_1 \), then (from Definition 6) \( p(\mathbf {x}) = \phi _1(z, \mathbf {x}_1)\phi _2(\mathbf {x}_1, \mathbf {x}_2) \) for some functions \( \phi _1 \) and \( \phi _2 \). Thus, \( p(z \mid \mathbf {x} \setminus \lbrace z\rbrace) = \phi _1(z, \mathbf {x}_1) \cdot \left(\phi _2(\mathbf {x}_1,\mathbf {x}_2)/p(\mathbf {x} \setminus \lbrace z\rbrace)\right) \propto \phi _1(z, \mathbf {x}_1) \).
    In the case when z is a discrete variable of finite support, we can calculate the conditional probability exactly: \( p(z \mid \mathbf {x} \setminus \lbrace z\rbrace) = \frac{\phi _1(z, \mathbf {x}_1)}{\sum _z \phi _1(z, \mathbf {x}_1)} \). We can apply this calculation to the factorisation of a program \( \Gamma \vdash _{2}S \) that is induced by shredding (Theorem 2). In that case, \( \mathbf {x}_{\mathbf {{\rm\small L2}}} \), \( \mathbf {x}_{\mathbf {{\rm\small L1}}} \), \( [\![ S_2 ]\!] _p \) play the roles of z, \( \mathbf {x}_1 \), and \( \phi _1 \), respectively. Consequently, we obtain a formula for drawing \( \mathbf {x}_{\mathbf {{\rm\small L2}}} \) conditional on the other parameters: \( \mathbf {x}_{\mathbf {{\rm\small L2}}} \sim \mathrm{categorical}([ \frac{[\![ S_2 ]\!] _p(\mathbf {x}_{\mathbf {{\rm\small L2}}}, \mathbf {x}_{\mathbf {{\rm\small L1}}})}{\sum _{\mathbf {x}_{\mathbf {{\rm\small L2}}}} [\![ S_2 ]\!] _p(\mathbf {x}_{\mathbf {{\rm\small L2}}}, \mathbf {x}_{\mathbf {{\rm\small L1}}})} \mid \mathbf {x}_{\mathbf {{\rm\small L2}}} \in \mathrm{supp}(\mathbf {x}_{\mathbf {{\rm\small L2}}}) ]) \).
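A minimal NumPy sketch of this generation step, assuming a hypothetical array phi1 that tabulates \( [\![ S_2 ]\!] _p \) for each value of a single discrete variable z at the current values of its Markov blanket:

import numpy as np

phi1 = np.array([0.2, 1.5, 0.3])    # unnormalised weights phi1(z, x1), one per value of z
probs = phi1 / phi1.sum()           # exact conditional p(z | x \ {z}) by normalisation
rng = np.random.default_rng(0)
z = rng.choice(len(phi1), p=probs)  # draw z ~ categorical(probs)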

    4.6 A Semantics-preserving Transformation Rule

In this section, we define a source-to-source transformation that implements a single step of variable elimination. The transformation re-writes a SlicStan program \( \Gamma , S \) with a discrete \( \mathbf {{\rm\small MODEL}} \)-level parameter z to a SlicStan program where z is a \( \mathbf {{\rm\small GENQUANT}} \)-level parameter. Combining the rule with the shredding presented in Section 2 results in support for efficient inference of both discrete and continuous random variables (see Section 4.8 for a discussion of limitations), where continuous variables can be inferred using gradient-based methods, such as HMC or variational inference, while discrete variables are generated using ancestral sampling. The transformation allows for SlicStan programs with explicit use of discrete parameters to be translated to Stan. We show a step-by-step example of our discrete parameter transformation in Section 4.7.

    4.6.1 The φ, \( \sf elim \), and \( \sf gen \) derived forms.

    We introduce three derived forms that allow us to state the rule concisely.
The elimination expression \( \sf elim(\sf int\langle K\rangle z)~S \) adds a new factor that is equivalent to marginalising z in S. In other words, \( [\![ \sf elim(\sf int\langle K\rangle z)~S ]\!] _p(\sigma)(\mathbf {x}) = \sum _{z = 1}^{K}[\![ S ]\!] _p(\sigma)(\mathbf {x}) \) (see Lemma 14). A φ-expression \( \phi (\sf int\langle K_1 \rangle ~z_1, \dots , \sf int\langle K_N \rangle ~z_N)~S \) simply computes the density of the statement S in a multidimensional array for all possible values of the variables \( z_1, \dots , z_N \). In other words, \( [\![ \left(f = \phi (\sf int\langle K_1 \rangle ~z_1, \dots , \sf int\langle K_N \rangle ~z_N)~S\right); \sf factor(f[z_1] \dots [z_N]) ]\!] _p(\sigma)(\mathbf {x}) = [\![ S ]\!] _p(\sigma)(\mathbf {x}) \) (Lemma 14). The φ-expression allows us to pre-compute all the work that we may need to do when marginalising other discrete variables, which results in efficient nesting. Finally, the generation expression computes the conditional of a variable z given the rest of the parameters, as in Section 4.5 (see Lemma 15).
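In an implementation, densities are accumulated in log space, so the sum over z's support in \( \sf elim \) becomes a log-sum-exp. The sketch below assumes a hypothetical helper log_target that returns the log weight of S for a given value of z:

import numpy as np

def elim_factor(log_target, K):
    # elim(int<K> z) S: factor(sum([target(S) | z in 1:K])), computed stably in log space
    return np.logaddexp.reduce([log_target(z) for z in range(1, K + 1)])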

    4.6.2 Eliminating a Single Variable z.

The Elim Gen rule below specifies a semantics-preserving transformation that takes a SlicStan program with a discrete \( \mathbf {{\rm\small MODEL}} \)-level parameter z, and transforms it to one where z is a \( \mathbf {{\rm\small GENQUANT}} \)-level parameter. In practice, we apply this rule once per discrete \( \mathbf {{\rm\small MODEL}} \)-level parameter, which eliminates those parameters one-by-one, similarly to the variable elimination algorithm. As in VE, the ordering in which we eliminate those variables can impact performance.
    The Elim Gen rule makes use of two auxiliary definitions that we define next. First, the neighbours of z, \( \Gamma _{\mathrm{ne}} \), are defined by the relation \( \mathrm{ne}(\Gamma , \Gamma ^{\prime }, z) \) (Definition 7), which looks for non-data and non-continuous \( \mathbf {{\rm\small L1}} \)-variables in \( \Gamma ^{\prime } \).
    Definition 7 (Neighbours of z, \( \mathrm{ne}\, (\Gamma , \Gamma ^{\prime }, z) \))
    For a \( \vdash \) typing environment Γ, a \( \vdash _{2} \) typing environment \( \Gamma ^{\prime } = \Gamma _{\sigma }^{\prime }, \Gamma _{\mathbf {x}}^{\prime } \), and a variable \( z \in \mathrm{dom}(\Gamma _{\mathbf {x}}^{\prime }) \), the neighbours of z are defined as:
Second, \( \mathrm{st}(S_2) \) (Definition 8) is a statement that has the same store semantics as \( S_2 \), but density semantics of 1: \( [\![ \mathrm{st}(S_2) ]\!] _s = [\![ S_2 ]\!] _s \), but \( [\![ \mathrm{st}(S_2) ]\!] _p = 1 \). This ensures that the transformation preserves both the density semantics and the store semantics of S; it is needed because \( \sf gen(z)S_2 \) discards any store computed by \( S_2 \), thus only contributing to the weight.
    Definition 8.
    Given a statement S, we define the statement \( \mathrm{st}(S) \) by replacing all \( \sf factor(E) \)- and \( L\sim d(E_1,\ldots ,E_n) \)-substatements in S by \( \sf skip \) (see Appendix A for the precise definition).
    We can use the Elim Gen rule to transform a SlicStan program, with respect to a parameter z, as described by Algorithm 1. This involves three main steps:
    (1)
Separate out \( S_M \)—the \( \mathbf {{\rm\small MODEL}} \)-level sub-part of S—using the top-level type system \( \vdash \) (line 1 of Algorithm 1).
    (2)
Separate out \( S_2 \)—the part of \( S_M \) that involves the discrete parameter z—using the conditional independence type system \( \vdash _{2} \) (lines 2–8).
    (3)
Perform a single VE step by marginalising out z in \( S_2 \) and sampling z from the conditional probability specified by \( S_2 \) (lines 10–11).
All other sub-statements of the program, \( S_D, S_1, S_3 \), and \( S_Q \), stay the same during the transformation. By isolating \( S_2 \) and transforming only this part of the program, we make sure we do not introduce more work than necessary when performing variable elimination.
    To efficiently marginalise out z, we want to find the Markov boundary of z given all \( \mathbf {{\rm\small DATA}} \) and continuous \( \mathbf {{\rm\small MODEL}} \) parameters: The data is given, and marginalisation happens inside the continuous parameters inference loop, so we can see continuous parameters as given for the purpose of discrete parameters marginalisation. Thus, we are looking for the relationship: \( z \mathrel {\text{$\perp \perp $}}\boldsymbol {\theta }_{d2} \mid \mathcal {D}, \boldsymbol {\theta }_c, \boldsymbol {\theta }_{d1} \), where \( \mathcal {D} \) is the data, \( \boldsymbol {\theta }_{c} \) are the continuous \( \mathbf {{\rm\small MODEL}} \)-level parameters, \( \boldsymbol {\theta }_{d1} \) is a subset of the discrete \( \mathbf {{\rm\small MODEL}} \)-level parameters that is as small as possible (the Markov blanket), and \( \boldsymbol {\theta }_{d2} \) is the rest of the discrete \( \mathbf {{\rm\small MODEL}} \)-level parameters. We can find an optimal partitioning of the discrete parameters \( \boldsymbol {\theta }_{d1}, \boldsymbol {\theta }_{d2} \) that respects this relationship of interest using the type system from Section 3 together with type inference.
    The judgement \( \Gamma _M \vdash _{2}S_M : \mathbf {{\rm\small L1}} \) induces a conditional independence relationship of the form \( \mathbf {x}_{\mathbf {{\rm\small L2}}} \mathrel {\text{$\perp \perp $}}\mathbf {x}_{\mathbf {{\rm\small L3}}} \mid \mathbf {x}_{\mathbf {{\rm\small L1}}} \), where \( \mathbf {x} \models \Gamma _{\mathbf {x}} \) (Theorem 3). The relation \( \Gamma \xrightarrow {z} \Gamma _M \) (Definition 9) constrains the form of \( \Gamma _M \) based on Γ. This is needed to make sure we are working with a relationship of the form we are interested in—\( z \mathrel {\text{$\perp \perp $}}\boldsymbol {\theta }_{d2} \mid \mathcal {D}, \boldsymbol {\theta }_c, \boldsymbol {\theta }_{d1} \)—and that base types τ are the same between Γ and \( \Gamma _M \). In particular, \( \Gamma \xrightarrow {z} \Gamma _M \) constrains z to be the only \( \mathbf {{\rm\small L2}} \) parameter in \( \Gamma _M \) and all \( \mathbf {{\rm\small DATA}} \) and continuous \( \mathbf {{\rm\small MODEL}} \)-level parameters of Γ are \( \mathbf {{\rm\small L1}} \) in \( \Gamma _M \). Note, \( \mathrm{dom}(\Gamma _M) \subseteq \mathrm{dom}(\Gamma) \) and \( \Gamma _M \) only contains variables that are of level \( \mathbf {{\rm\small MODEL}} \) and below in Γ. Variables that are of level \( \mathbf {{\rm\small GENQUANT}} \) in Γ are not in \( \Gamma _M \).
    Definition 9 (\( \Gamma \xrightarrow {z} \Gamma ^{\prime } \))
    For a \( \vdash \) typing environment Γ and a \( \vdash _{2} \) typing environment \( \Gamma ^{\prime } \), a variable z and a statement S, we have:
    Following convention from earlier in the article, we use level subscripts to refer to specific subsets of Γ: In the above definition, \( \Gamma ^{\prime }_{\mathbf {x}, \mathbf {{\rm\small L2}}} \) refers to the subset of parameters \( \mathbf {x} \) in \( \Gamma ^{\prime } \), which are of level \( \mathbf {{\rm\small L2}} \).

    4.7 Marginalising Multiple Variables: An Example

    To eliminate more than one discrete parameter, we apply the Elim Gen rule repeatedly. Here, we work through a full example, showing the different steps of this repeated Elim Gen transformation.
    Consider an extended version of the HMM model from the beginning of this section (Program D), reformulated to include transformed parameters:
The variables we are interested in transforming are \( z_1, z_2 \), and \( z_3 \): These are the \( \mathbf {{\rm\small MODEL}} \)-level discrete parameters of Program G. The variable \( \sf genz \) is already at \( \mathbf {{\rm\small GENQUANT}} \) level, so we can sample it with ancestral sampling (no need for automatic marginalisation).
We eliminate \( z_1, z_2 \), and \( z_3 \) one-by-one, in that order. The order of elimination generally has a significant impact on the complexity of the resulting program (see also Section 4.8), but we do not focus on how to choose an ordering here. The problem of finding an optimal ordering is well-studied Arnborg et al. 1987 Kjærulff 1990 Amir 2010 and is orthogonal to the focus of our work.

    4.7.1 Eliminating z1.

Eliminating a single variable happens in three steps, as shown in Figure 7: standard shredding into \( S_D, S_M \), and \( S_Q \); conditional independence shredding of \( S_M \) into \( S_1, S_2 \), and \( S_3 \); and combining everything based on Elim Gen.
Fig. 7. Step-by-step elimination of \( z_1 \) in Program G.
    (1)
Standard shredding: \( S \Updownarrow _{\Gamma }S_{D}, S_{M}, S_{Q} \). First, we separate out the parts of the program that depend on discrete parameters generatively; that is, any part of the program that would be in generated quantities with respect to the original program. In our case, this includes the last two lines of S. This would also include the \( \sf gen \) parts of the transformed program, which draw discrete parameters as generated quantities. Thus, \( S \Updownarrow _{\Gamma }S_{D}, S_{M}, S_{Q} \), where \( S_{D} \) is empty, \( S_{Q} = \left(\sf theta3 = \sf theta[z3]; \sf genz \sim \sf bernoulli(\sf theta3)\right) \), and \( S_M \) is the rest of the program (see Figure 7, (1)).
    (2)
Conditional independence shredding: \( S_M \Updownarrow _{\Gamma _M} S_1, S_2, S_3 \). In the next step, we want to establish a conditional independence relationship \( z_1 \mathrel {\text{$\perp \perp $}}A \mid \mathbf {y}, \phi _0, \theta _0, B \), where \( z_1 \) is the discrete parameter being eliminated and \( A, B \) is a partitioning of the rest of the discrete parameters in the model: \( \lbrace z_2, z_3\rbrace \). We derive a new \( \vdash _{2} \) typing environment \( \Gamma _M \), using \( \Gamma \xrightarrow {z} \Gamma _M \):
    Here, we use the notation \( {{?}} \) for a type placeholder, which will be inferred using type inference.
The optimal \( \Gamma _M \) under the type inference soft constraint \( \mathbf {{\rm\small L3}} \prec \mathbf {{\rm\small L1}} \prec \mathbf {{\rm\small L2}} \), such that \( \Gamma _M \vdash _{2}S_M : \mathbf {{\rm\small L1}} \), assigns level \( \mathbf {{\rm\small L2}} \) to \( \theta _{1} \) and \( \phi _{1} \), level \( \mathbf {{\rm\small L1}} \) to \( z_2 \), and level \( \mathbf {{\rm\small L3}} \) to \( \theta _2, \phi _2 \), and \( \phi _3 \). Shredding then gives us \( S_M \Updownarrow _{\Gamma _M} S_1, S_2, S_3 \), as in Figure 7, (2).
    (3)
Combining based on Elim Gen. Having rearranged the program into suitable sub-statements, we use Elim Gen to get Program G-1 (Figure 7, (3)).
Eliminating \( z_2 \). We apply the same procedure to eliminate the next variable, \( z_2 \), from the updated Program G-1. The variable \( z_1 \) is no longer a \( \mathbf {{\rm\small MODEL}} \)-level parameter, thus the only neighbouring parameter of \( z_2 \) is \( z_3 \). Note also that the computation of the factor \( f_1 \) does not include any free discrete parameters (both \( z_1 \) and \( z_2 \) are local to the computation due to \( \sf elim \) and φ). Thus, we do not need to include the computation of this factor anywhere else in the program (it does not get nested into other computations). We obtain a new program, Program G-2:
Eliminating \( z_3 \). Finally, we eliminate \( z_3 \), which is the only discrete \( \mathbf {{\rm\small MODEL}} \)-level parameter left in the program. Thus, \( z_3 \) has no neighbours and \( f_3 \) is of arity 0: It is a real number instead of a vector.
    The final program generated by our implementation is Program G-3:

    4.8 Relating to Variable Elimination and Complexity Analysis

Assume \( \mathcal {D}, \boldsymbol {\theta }_d \), and \( \boldsymbol {\theta }_c \) are the data, discrete model-level parameters, and continuous model-level parameters, respectively. As \( S_2 \) is a single-level statement of level \( \mathbf {{\rm\small L2}} \), the density semantics of \( S_2 \) is of the form \( \psi (\mathbf {x}_{\mathbf {{\rm\small L1}}}, \mathbf {x}_{\mathbf {{\rm\small L2}}}) = \psi (\mathcal {D}, \boldsymbol {\theta }_c, \boldsymbol {\theta }_{d, \mathbf {{\rm\small L1}}}, z) \) (Lemma 11).
As \( \sf elim(\sf int\langle K \rangle z)~\_ \) binds the variable z and \( \phi (\Gamma _{\mathrm{ne}}) \lbrace \_\rbrace \) binds the variables in \( \mathrm{dom}(\Gamma _{\mathrm{ne}}) \), the expression \( \phi (\Gamma _{\mathrm{ne}}) \lbrace \sf elim(\sf int\langle K \rangle z)~S_2 \rbrace \) depends only on continuous parameters and data, and it contains no free mentions of any discrete variables. This means that the expression will be of level \( \mathbf {{\rm\small L1}} \) and shredded into \( S_1 \) during the marginalisation of any subsequent discrete variable \( z^{\prime } \). The substatement \( S_2 \) will always be some sub-statement of the original program (prior to any transformations), up to potentially several constant factors of the form \( \sf factor(f[\mathrm{dom}(\Gamma _{\mathrm{ne}})]) \).
    This observation makes it easy to reason about how repeated application of the Elim Gen transform changes the complexity of the program. If the complexity of a SlicStan program with N discrete parameters of support \( 1, \dots , K \), is \( \mathcal {O}(S) \), then the complexity of a program where we naively marginalised out the discrete variables (Program E) will be \( \mathcal {O}(S \times K^N) \). In contrast, transforming with Elim Gen gives us a program of complexity at most \( \mathcal {O}(N \times S \times K^{M+1}) \) where M is the largest number of direct neighbours in the factor graph induced by the program. Further, the complexity could be smaller depending on the elimination ordering of choice. This result is not surprising, as we conjecture that repeated application of Elim Gen is equivalent to variable elimination (though we do not formally prove this equivalence), which is of the same complexity.
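For concreteness, instantiating these bounds for an assumed first-order chain of 20 binary parameters (so \( M = 1 \)):

K, N, M = 2, 20, 1           # 20 binary parameters, each with one discrete neighbour
naive = K ** N               # 1,048,576 terms after naive marginalisation
elim_gen = N * K ** (M + 1)  # 80 factor entries after repeated Elim Gen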
    It is clear from this complexity observation that VE is not always efficient. When the dependency graph is dense, M will be close to N, thus inference will be infeasible for large N. Fortunately, in many practical cases (such as those discussed in Section 5), this graph is sparse (\( M \ll N \)) and our approach is suitable and efficient. We note that this is a general limitation of exact inference of discrete parameters, and it is not a limitation of our approach in particular.

    4.9 Semantic Preservation of the Discrete Variable Transformation

    The result we are interested in is the semantic preservation of the transformation rule \( \xrightarrow {z} \).
    Theorem 4 (Semantic Preservation of \( \xrightarrow {z} \))
    For SlicStan programs \( \Gamma , S \) and \( \Gamma ^{\prime }, S^{\prime } \), and a discrete parameter z: \( \Gamma , S \xrightarrow {z} \Gamma ^{\prime }, S^{\prime } \) implies \( [\![ S ]\!] = [\![ S^{\prime } ]\!] \).
    Proof.
The proof uses the fact that shredding preserves semantics with respect to both \( \vdash \) and \( \vdash _{2} \) (Lemmas 6 and 10), the meaning of the derived forms (Lemmas 14 and 15), the properties of single-level statements (Lemma 11), and the results on the factorisation induced by shredding (Theorem 1) and on conditional independence (Theorem 3). We present the full proof in Appendix A.□
    In addition, we also show that it is always possible to find a program derivable with Elim Gen, such that a \( \mathbf {{\rm\small MODEL}} \)-level variable z is transformed to a \( \mathbf {{\rm\small GENQUANT}} \)-level variable.
    Lemma 12 (Existence of \( \mathbf {{\rm M}{\rm\small{ODEL}}} \) to \( \mathbf {{\rm G}{\rm\small{ENQUANT}}} \) Transformation)
For any SlicStan program \( \Gamma , S \) such that \( \Gamma \vdash S : \mathbf {{\rm\small DATA}} \), and a variable \( z \in \mathrm{dom}(\Gamma) \) such that \( \Gamma (z) = (\sf int\langle K \rangle , \mathbf {{\rm\small MODEL}}) \), there exists a SlicStan program \( \Gamma ^{\prime }, S^{\prime } \), such that:
\[ \Gamma , S \xrightarrow {z} \Gamma ^{\prime }, S^{\prime } \quad \text{and} \quad \Gamma ^{\prime }(z) = (\sf int\langle K \rangle , \mathbf {{\rm\small GENQUANT}}). \]
    Proof.
    By inspecting the level types of variables in each part of a program derivable using Elim Gen. We include the full proof in Appendix A.□
    The practical usefulness of Theorem 4 stems from the fact that it allows us to separate inference for discrete and continuous parameters. After applying Elim Gen to each discrete \( \mathbf {{\rm\small MODEL}} \)-level parameter, we are left with a program that only has \( \mathbf {{\rm\small GENQUANT}} \)-level discrete parameters (Lemma 12). We can then slice the program into \( S_{{\rm\small HMC}} \) and \( S_{{\rm\small GEN}} \) and infer continuous parameters by using HMC (or other algorithms) on \( S_{{\rm\small HMC}} \) and, next, draw the discrete parameters using ancestral sampling by running forward \( S_{{\rm\small GEN}} \). Theorem 4 tells us that this is a correct inference strategy.
    When used in the context of a model with only discrete parameters, our approach corresponds to exact inference through VE. In the presence of discrete and continuous parameters, our transformation gives an analytical sub-solution for the discrete parameters in the model.
    A limitation of our method is that, due to its density-based nature, it can only be applied to models of fixed size. It cannot, in its current form, support models where the number of random variables changes during inference, such as Dirichlet processes. However, this is a typical constraint adopted in Bayesian inference for efficiency. Another limitation is that discrete variables need to have finite (and fixed) support. For example, the method cannot be applied to transform a Poisson-distributed variable. In some but not all applications, truncating unbounded discrete parameters at a realistic upper bound would suffice to make the method applicable.
    An advantage of our method is that it can be combined with any inference algorithm that requires a function proportional to the joint density of variables. This includes gradient-based algorithms, such as HMC and variational inference, but it can also be used with methods that allow for (e.g., unbounded) discrete variables as an analytical sub-solution that can optimise inference. For example, consider a Poisson variable \( n \sim \mathrm{Poisson}(\lambda) \) and a Binomial variable \( k \sim \mathrm{Binomial}(n, p) \). While n is of infinite support, and we cannot directly sum over all of its possible values, analytically marginalising out n gives us \( k \sim \mathrm{Poisson}(\lambda p) \). Future work can utilise such analytical results in place of explicit summation where possible.
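This standard identity can be checked directly by summing out n:
\[ p(k) = \sum _{n=k}^{\infty } \frac{e^{-\lambda }\lambda ^n}{n!} \binom{n}{k} p^k (1-p)^{n-k} = \frac{e^{-\lambda }(\lambda p)^k}{k!} \sum _{m=0}^{\infty } \frac{(\lambda (1-p))^m}{m!} = \frac{e^{-\lambda p}(\lambda p)^k}{k!}. \]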
Fig. 8. A program with different conditional dependencies depending on control flow.

    4.10 Scope and Limitations of Elim Gen

    Previously, we discussed the scope of the conditional independence result of the article (Section 3.3). Similarly, here we demonstrate with an example, a situation where our approach of eliminating variables one-by-one using Elim Gen is not optimal.
Consider the simple control-flow Program H below. In this example, \( z_2 \) and \( z_3 \) are not conditionally independent given \( z_1 = 1 \), but they are conditionally independent given \( z_1 \gt K/2 \). This form of independence is also referred to as context-specific independence Minka and Winn 2009 Boutilier et al. 1996. We can use a different elimination strategy depending on which branch of the program we find ourselves in. Program H-A demonstrates this: Its complexity is \( \mathcal {O}(\frac{K}{2} \times K^2 + \frac{K}{2} \times 2 \times K) = \mathcal {O}(\frac{1}{2}K^3 + K^2) \).
    The typing relation \( \vdash _{2} \) can only detect overall (in)dependencies, where sets of variables are conditionally independent given some X, regardless of what value X takes. Thus, our static analysis is not able to detect that \( z_2 \mathrel {\text{$\perp \perp $}}z_3 \mid z_1 = 0 \). This results in Program H-B, which has complexity \( \mathcal {O}(K^3 + K^2 + K) \): the same complexity as the optimal Program H-A, but with a bigger constant.
Even if we extended our approach to detect that \( z_2 \) and \( z_3 \) are independent in one branch, it is unclear how to incorporate this new information. Our strategy is based on computing intermediate factors that allow re-using already computed information: Eliminating \( z_1 \) requires computing a new factor \( f_1 \) that no longer depends on \( z_1 \). We represent \( f_1 \) with a multidimensional array indexed by \( z_2 \) and \( z_3 \), and we need to define each element of that array, thus we cannot decouple them for particular values of \( z_1 \).
Runtime systems that compute intermediate factors in a similar way, such as Pyro Uber AI Labs 2017, face the same limitation. Birch Murray and Schön 2018, however, is able to detect the conditional independence in the case \( z_1 \gt K/2 \), but it will not marginalise \( z_1 \), as it cannot (analytically) marginalise over branches. Instead, it uses Sequential Monte Carlo (SMC) to repeatedly sample \( z_1 \) and proceed according to its value.

    5 Implementation and Empirical Evaluation

The transformation we introduce can be useful for a variety of models, and it can be adopted by PPLs to increase the efficiency of inference and improve usability. Most notably, it can be used to extend Stan to allow for direct treatment of discrete variables, where previously that was not possible.
    In this section, we present a brief overview of such a discrete parameter extension for SlicStan (Section 5.1). To evaluate the practicality of Elim Gen, we build a partial NumPyro Phan et al. 2019 backend for SlicStan and compare our static approach to variable elimination for discrete parameters to the dynamic approach of NumPyro (Section 5.2). We find that our static transformation strategy speeds up inference compared to the dynamic approach, but that for models with a large number of discrete parameters performance gains could be diminished by the exponentially growing compilation time (Section 5.3).
    In addition to demonstrating the practicality of our contribution through empirical evaluation, we also discuss the usefulness of our contribution through examples, in Appendix B.

    5.1 Implementation

We update the original SlicStan according to the modifications described in Section 2 and extend it to support automatic variable elimination through the scheme outlined in Section 4. As with the first version of SlicStan, the transformation produces a new SlicStan program that is then translated to Stan.
    The variable elimination transformation procedure works by applying Elim Gen iteratively, once for each discrete variable, as we show in Section 4.7. The level types \( \mathbf {{\rm\small L1}} \), \( \mathbf {{\rm\small L2,}} \) and \( \mathbf {{\rm\small L3}} \) are not exposed to the user and are inferred automatically. Using bidirectional type-checking, we are able to synthesise a set of hard constraints that the levels must satisfy. These hard constraints will typically be satisfied by more than one assignment of variables to levels. We search for the optimal types with respect to the soft constraints \( \mathbf {{\rm\small L3}} \prec \mathbf {{\rm\small L1}} \prec \mathbf {{\rm\small L2}} \) using the theorem prover Z3 De Moura and Bjørner 2008.
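The following toy z3py sketch shows the flavour of this encoding for two discrete parameters; the variable names are hypothetical, and the real implementation generates the hard constraints from the typing rules rather than by hand:

from z3 import And, Int, Not, Optimize, Or, sat

L1, L2, L3 = 1, 2, 3
lvl_z2, lvl_z3 = Int("lvl_z2"), Int("lvl_z3")

opt = Optimize()
for v in (lvl_z2, lvl_z3):
    opt.add(Or(v == L1, v == L2, v == L3))

# Hard constraint from a statement mentioning both variables: their levels need
# a common upper bound, and in the lower semi-lattice L2 and L3 have none.
opt.add(Not(And(lvl_z2 == L2, lvl_z3 == L3)))
opt.add(Not(And(lvl_z2 == L3, lvl_z3 == L2)))

# Soft preferences encoding L3 before L1 before L2.
for v in (lvl_z2, lvl_z3):
    opt.add_soft(v == L3, 2)
    opt.add_soft(v == L1, 1)

if opt.check() == sat:
    print(opt.model())  # here both variables can be assigned L3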

    5.2 Empirical Evaluation

    To evaluate the practicality of our approach, we compare to the prior work most closely related to ours: that of Obermeyer et al. 2019, who implement efficient variable-elimination for plated factor graphs in Pyro Uber AI Labs 2017. Their approach uses effect-handlers and dynamically marginalises discrete variables, so gradient-based inference schemes can be used for the continuous parameters. This VE strategy has also been implemented in NumPyro Phan et al. 2019.
As both our strategy and Pyro’s correspond to VE, we do not expect to see differences in the complexity of the resulting programs. However, in our case the VE algorithm is determined and set up at compile time, while in the case of Pyro/NumPyro this is done at runtime. The main question we aim to address is whether setting up the variable elimination logistics at compile time results in a practical runtime speed-up.
    To allow for this comparison, we built a partial NumPyro backend for SlicStan. For each model we choose, we compare the runtime performance of three NumPyro programs:
    (1)
    The NumPyro program obtained by translating a SlicStan program with discrete parameters to NumPyro directly (labelled “NumPyro”). This is the baseline: We leave the discrete parameter elimination to NumPyro.
    (2)
    The NumPyro program obtained by translating a transformed SlicStan program, where all discrete parameters have been eliminated according to Elim Gen (labelled “SlicStan”). The variable elimination setup is done at compile time; NumPyro does not do any marginalisation.
    (3)
    A hand-optimised NumPyro program, which uses the plate and markov program constructs to specify some of the conditional independencies in the program (labelled “NumPyro-Opt”).
    In each case, we measure the time (in seconds) for sampling a single chain consisting of 2,500 warm-up samples and 10,000 samples using NUTS Hoffman and Gelman 2014.
    In addition, we report three compilation times:
    (1)
    The compilation time of the NumPyro program obtained by translating a SlicStan program with discrete parameters to NumPyro directly (labelled “NumPyro”).
    (2)
    The compilation time of the NumPyro program obtained by translating a transformed SlicStan program, where all discrete parameters have been eliminated (labelled “SlicStan”).
    (3)
    The time taken for the original SlicStan program to be transformed using Elim Gen and translated to NumPyro code (labelled “SlicStan-to-NumPyro”).
    We consider different numbers of discrete parameters for each model, up to 25 discrete parameters. We do not consider more than 25 parameters due to constraints of the NumPyro baseline, which we discuss in more detail in Section 5.3. We run experiments on two classes of model often seen in practice: hidden Markov models (Section 5.2.1) and mixture models (Section 5.2.2). To ensure a fair comparison, the same elimination ordering was used across experiments. Experiments were run on a dual-core 2.30 GHz Intel Xeon CPU and a Tesla T4 GPU (when applicable). All SlicStan models used in the experiments are available at the SlicStan repo.

    5.2.1 Hidden Markov Models.

We showed several examples of simple HMMs throughout the article (Programs A, D, and G) and worked through a complete example of VE in an HMM (Section 4.7). We evaluate our approach on the simple first-order HMM seen previously and on two additional models: a second-order HMM and a factorial HMM.
First-order HMM. The first-order HMM is a simple chain of N discrete variables, each taking a value from 1 to K according to a categorical distribution. The event probabilities for the distribution of \( z_n \) are given by \( \boldsymbol {\theta }_{z_{n-1}} \), where \( \boldsymbol {\theta } \) is some given \( K \times K \) matrix. Each data point \( y_n \) is modelled as coming from a Gaussian distribution with mean \( \mu _{z_n} \) and standard deviation 1, where \( \boldsymbol {\mu } \) is a \( K \)-dimensional continuous parameter of the model.
    \[ \begin{gather*} \mu _k \sim \mathcal {N}(0, 1) \quad \text{for } k \in 1, \dots , K \\ z_1 \sim \mathrm{categorical}(\boldsymbol {\theta }_{1}) \\ z_n \sim \mathrm{categorical}(\boldsymbol {\theta }_{z_{n-1}}) \quad \text{for } n \in 2, \dots , N \\ y_n \sim \mathcal {N}(\mu _{z_{n}}, 1) \quad \text{for } n \in 1, \dots , N \end{gather*} \]
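As a reference for what the marginalised program computes, here is a NumPy sketch of the marginal likelihood \( p(\mathbf {y} \mid \boldsymbol {\mu }, \boldsymbol {\theta }) \) for this model via the forward algorithm; all numeric values are hypothetical placeholders:

import numpy as np
from scipy.stats import norm

K, N = 3, 5
rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(K), size=K)  # theta[k]: transition probabilities from state k
mu = rng.normal(size=K)                    # per-state emission means
y = rng.normal(size=N)                     # observed data

# forward recursion: alpha[k] = p(y_1..n, z_n = k), one K^2 update per step
alpha = theta[0] * norm.pdf(y[0], loc=mu, scale=1.0)  # z_1 ~ categorical(theta_1)
for n in range(1, N):
    alpha = (alpha @ theta) * norm.pdf(y[n], loc=mu, scale=1.0)
p_y = alpha.sum()  # marginal likelihood with all z_n summed out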
We measure the compilation time and the time taken to sample one chain with each of the three NumPyro programs corresponding to this model. We use \( K = 3 \) and different values for N, ranging from \( N=3 \) to \( N=25 \). Figure 9 shows a summary of the results. We see that both on CPU and GPU, the program transformed using SlicStan outperforms the automatically generated NumPyro program and also the manually optimised NumPyro-Opt. Each of the three programs has a compilation time that increases exponentially with the number of variables; however, SlicStan’s compilation time increases the fastest. We discuss this drawback in more detail in Section 5.3, highlighting the importance of an extended loop-level analysis being considered in future work.
Fig. 9. HMM results.
    Second-order HMM. The second-order HMM is very similar to the first-order HMM, but the discrete variables depend on the previous two variables, in this case taking the maximum of the two.
    \[ \begin{gather*} {\mu }_k \sim \mathcal {N}(0, 1) \quad \text{for } k \in 1, \dots , K \\ z_1 \sim \mathrm{categorical}(\theta _{1}), \quad z_2 \sim \mathrm{categorical}(\theta _{z_1}) \\ z_n \sim \mathrm{categorical}(\theta _{\mathrm{max}(z_{n-2}, z_{n-1})}) \quad \text{for } n \in 3, \dots , N \\ y_n \sim \mathcal {N}(\mu _{z_{n}}, 1) \quad \text{for } n \in 1, \dots , N \end{gather*} \]
Similarly to before, we run the experiment for \( K = 3 \) and different values for N, ranging from \( N=3 \) to \( N=25 \). The results, shown in Figure 10, once again show SlicStan outperforming NumPyro and NumPyro-Opt in terms of runtime, but with slower compilation time for a larger number of discrete parameters.
Factorial HMM. In a factorial HMM, each data point \( y_n \) is generated using two independent hidden states \( z_n \) and \( h_n \), which depend on their respective previous states \( z_{n-1} \) and \( h_{n-1} \).
    \[ \begin{gather*} \mu _k \sim \mathcal {N}(0, 1) \quad \text{for } k \in 1, \dots , K^2 \\ z_1 \sim \mathrm{categorical}(\theta _{1}), \quad h_1 \sim \mathrm{categorical}(\theta _{1}) \\ z_n \sim \mathrm{categorical}(\theta _{z_{n-1}}), \quad h_n \sim \mathrm{categorical}(\theta _{h_{n-1}}) \quad \text{for } n \in 2, \dots , N \\ y_n \sim \mathcal {N}(\mu _{z_{n} * h_{n}}, 1) \quad \text{for } n \in 1, \dots , N \end{gather*} \]
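Although the two chains are a priori independent, they are coupled through the observations \( y_n \), so eliminating the discrete variables requires factors over the \( K \times K \) joint state of \( (z_n, h_n) \). As a sketch of the general shape of the forward recursion (not the exact factors our transformation emits), for \( n \ge 2 \):
\[ f_n(z_n, h_n) = \mathcal {N}(y_n \mid \mu _{z_n * h_n}, 1) \sum _{z_{n-1}=1}^{K} \sum _{h_{n-1}=1}^{K} f_{n-1}(z_{n-1}, h_{n-1}) \, \boldsymbol {\theta }_{z_{n-1}, z_n} \, \boldsymbol {\theta }_{h_{n-1}, h_n}. \]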
Fig. 10. Second-order HMM results.
Fig. 11. Factorial HMM results.
We run the experiment for \( K = 3 \) and different chain lengths N, ranging from \( N=1 \) (two discrete parameters) to \( N=12 \) (24 discrete parameters). We show the results in Figure 11: similarly to before, SlicStan outperforms both NumPyro and NumPyro-Opt in terms of runtime. We also observe that, in the case of SlicStan, the time taken to sample a single chain increases more slowly as we increase the number of discrete variables.

    5.2.2 Mixture Models.

Mixture models are another common application of models with mixed discrete and continuous variables. We run experiments on two such models: soft K-means clustering and linear regression with outlier detection.
Soft K-means. The Gaussian mixture model underlies the celebrated soft K-means algorithm. Here, we are interested in modelling some D-dimensional data that belongs to one of K (unknown) Gaussian clusters. Each cluster k is specified by a D-dimensional mean \( \boldsymbol {\mu }_{., k} \). Each data point \( \mathbf {y}_{.,n} \) is associated with a cluster \( z_n \).
    \[ \begin{gather*} \mu _{d, k} \sim \mathcal {N}(0, 1) \quad \text{for } d \in 1, \dots , D \text{ and } k \in 1, \dots , K\\ z_n \sim \mathrm{categorical}(\pi) \quad \text{for } n \in 1, \dots , N \\ y_{d, n} \sim \mathcal {N}(\mu _{d, z_n}, 1) \quad \text{for } d \in 1, \dots , D \text{ and } n \in 1, \dots , N \end{gather*} \]
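Because the \( z_n \) are conditionally independent given \( \boldsymbol {\mu } \), eliminating them reduces to a local sum over the K cluster assignments for each data point. The following is a minimal, illustrative Python sketch of the marginal log-likelihood that the transformed program effectively computes; the uniform mixing weights are an assumption made for the example.

```python
# Illustrative sketch of the marginal log-likelihood after eliminating
# each z_n: a per-data-point log-sum-exp over the K cluster assignments.
import jax.numpy as jnp
from jax.scipy.special import logsumexp
import numpyro.distributions as dist

def marginal_log_lik(y, mu, log_pi):
    # y: (D, N) data, mu: (D, K) cluster means, log_pi: (K,) log weights.
    # log p(y_{.,n} | z_n = k), summed over the D dimensions.
    log_pk = dist.Normal(mu[:, None, :], 1.0).log_prob(y[:, :, None]).sum(0)
    # Eliminate z_n: log-sum-exp over k, then sum over the N data points.
    return logsumexp(log_pk + log_pi, axis=-1).sum()

D, N, K = 10, 25, 3
y, mu = jnp.zeros((D, N)), jnp.zeros((D, K))       # placeholder values
log_pi = jnp.log(jnp.full(K, 1.0 / K))             # assumed uniform weights
print(marginal_log_lik(y, mu, log_pi))
```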
    We run the experiments for \( K = 3 \), \( D = 10 \), and \( N = 3, \dots , 25 \) and show the results in Figure 12.
    We observe a clear linear trend of the runtime growing with N, with SlicStan performing better and its runtime growing more slowly than that of NumPyro. While the SlicStan-translated code runs faster than NumPyro-Opt for \( N \le 25 \), we observe that the SlicStan runtime grows faster than that of the manually optimised NumPyro-Opt code.
Outlier detection. The final model we consider is a Bayesian linear regression that allows for outlier detection. The model considers data points \( (x_n, y_n) \), where \( y_n \) lies on the line \( \alpha x_n + \beta \) with some added noise. The noise \( \sigma _{z_n} \) depends on a Bernoulli parameter \( z_n \), which indicates whether or not the point \( (x_n, y_n) \) is an outlier. The noise for outliers (\( \sigma _1 \)) and the noise for non-outliers (\( \sigma _2 \)) are given as hyperparameters.
    \[ \begin{gather*} \alpha \sim \mathcal {N}(0, 10), \quad \beta \sim \mathcal {N}(0, 10)\\ \pi ^{(raw)}_1 \sim \mathcal {N}(0, 1), \quad \pi ^{(raw)}_2 \sim \mathcal {N}(0, 1), \quad \pi = \frac{\exp {\pi ^{(raw)}_1}}{\exp {\pi ^{(raw)}_1} + \exp {\pi ^{(raw)}_2}} \\ z_n \sim \mathrm{bernoulli}(\pi) \quad \text{for } n \in 1, \dots , N \\ y_{n} \sim \mathcal {N}(\alpha x_n + \beta , \sigma _{z_n}) \quad \text{for } n \in 1, \dots , N \end{gather*} \]
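Eliminating each Bernoulli \( z_n \) is particularly simple here: writing \( \pi \) for the outlier probability, the transformed program replaces each \( z_n \) with a two-component mixture likelihood,
\[ p(y_n \mid \alpha , \beta , \pi) = \pi \, \mathcal {N}(y_n \mid \alpha x_n + \beta , \sigma _1) + (1 - \pi) \, \mathcal {N}(y_n \mid \alpha x_n + \beta , \sigma _2), \]
after which each \( z_n \) can be drawn conditionally on the continuous parameters, with probability proportional to the corresponding component's density.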
Similarly to the earlier HMM models, SlicStan has the smallest runtime per chain, but at the expense of fast-growing compilation time (Figure 13).
Fig. 12. Soft K-means results.
Fig. 13. Outliers results.

    5.3 Analysis and Discussion

Our method can be applied to general models containing a fixed and known number of finite-support discrete parameters, which significantly reduces the amount of manual effort that was previously required for such models in languages like Stan Damiano et al. 2018. In addition, as shown in Figures 9–13, SlicStan outperforms both the NumPyro baseline and the hand-optimised NumPyro-Opt in terms of runtime. This suggests that a static, compile-time discrete-variable optimisation, like the one introduced in this article, is indeed beneficial and speeds up inference.
    One limitation of our experimental analysis is the relatively small number of discrete parameters we consider. Due to the array dimension limit imposed by PyTorch/NumPy, Pyro cannot have more than 25 discrete variables (64 for CPU) unless the dependence between them is specified using markov or plate (as with NumPyro-Opt). For NumPyro this hardcoded limit is 32. Thus, it would not be possible to compare to the NumPyro baseline for a larger number of variables, though comparing to the hand-optimised NumPyro-Opt would still be possible.
Perhaps the biggest limitation of the discrete parameters version of SlicStan is the exponentially growing compilation time. Using a semi-lattice instead of a lattice in the \( \vdash _{2} \) level type analysis breaks the requirement of the bidirectional type system that ensures efficiency of type inference. The constraints generated by the type system can no longer be resolved by SlicStan’s original linear-time algorithm. While a polynomial-time constraint-solving strategy may still exist, we choose to employ Z3 De Moura and Bjørner 2008 to resolve the type inference constraints automatically, and leave the consideration of more efficient type inference algorithms for future work.
This also highlights the importance of a future SlicStan version that considers arrays of discrete parameters. Our algorithm currently supports only individual discrete parameters. Where the size of an array of discrete parameters is statically known, the Elim Gen procedure can be applied to a program in which such arrays have been “flattened” into a collection of individual discrete variables, which is the strategy we adopt for the experiments in this section. But to be applicable more widely, the Elim Gen rule needs to be generalised based on array-element-level dependence analysis, for example, by incorporating ideas from the polyhedral model Feautrier 1992. As the array-level dependence analysis required in most practical use cases is straightforward, we believe this would be a useful and feasible applied extension of our work. In addition, this would significantly decrease the number of program variables for which we need to infer a level type during the Elim Gen transformation, thus making compilation practical for larger or arbitrary numbers of discrete parameters.

    6 Related Work

This article provides a type system that induces conditional independence relationships, and it discusses one practical application of such a type system: an automatic marginalisation procedure for discrete parameters of finite support.
Conditional independence. The theoretical aim of our article is similar to that of Barthe et al. 2019, who discuss a separation logic for reasoning about independence, and the follow-up work of Bao et al. 2021, who extend the logic to capture conditional independence. One advantage of our method is that the verification of conditional independence is automated by type inference, while it would rely on manual reasoning in the works of Barthe et al. 2019 and Bao et al. 2021. However, the logic approach can be applied to a wider variety of verification tasks. Amtoft and Banerjee 2020 show a correspondence between variable independence and slicing a discrete-variables-only probabilistic program. The biggest difference to our work is that their work considers only conditional independence of variables given the observed data: that is, CI relationships of the form \( \mathbf {x}_1 \mathrel {\text{$\perp \perp $}}\mathbf {x}_2 \mid \mathcal {D} \) for some subsets of variables \( \mathbf {x}_1 \) and \( \mathbf {x}_2 \) and data \( \mathcal {D} \). The language of Amtoft and Banerjee 2020 requires observed data to be specified syntactically using an observe statement. Conditional independencies are determined only given this observed data, and the method for determining how to slice a program is tied to the observe statements. From the Amtoft and Banerjee 2020 paper: “A basic intuition behind our approach is that an observe statement can be removed if it does not depend on something on which the returned variable x also depends.” In contrast, we are able to find CI relationships given any variables we are interested in (\( \mathbf {x}_1 \mathrel {\text{$\perp \perp $}}\mathbf {x}_2 \mid \mathbf {x}_3 \) for some \( \mathbf {x}_1 \), \( \mathbf {x}_2 \), and \( \mathbf {x}_3 \)), and type inference provides a straightforward algorithm for finding such relationships. However, Amtoft and Banerjee 2020 permit an unbounded number of variables (e.g., while (y > 0) y ~ bernoulli(0.2)), while it is not clear how to extend SlicStan/Stan to support this. While not in a probabilistic programming setting, Lobo-Vesga et al. 2020 use taint analysis to find independencies between variables in a program, to facilitate an easy trade-off between privacy and accuracy in a differential privacy context.
    Automatic marginalisation. The most closely related previous work, in terms of the automatic marginalisation procedure, is that of Obermeyer et al. 2019 and that of Murray et al. 2018. Obermeyer et al. 2019 implement efficient variable-elimination for plated factor graphs in Pyro Uber AI Labs 2017. Their approach uses effect-handlers and can be implemented in other effect-handling-based PPLs, such as Edward2 Tran et al. 2018. Murray et al. 2018 introduce a “delayed sampling” procedure in Birch Murray and Schön 2018, which optimises the program via partial analytical solutions to sub-programs. Their method corresponds to automatic variable elimination and, more generally, automatic Rao–Blackwellization. While we focus on discrete variable elimination only, our conditional independence type system can be directly used for more general analysis. The method from Section 4 can be extended to marginalise out and sample continuous variables whenever they are part of an analytically tractable sub-program, similarly to delayed sampling in Birch. One key difference of our approach is that the program re-writes are guided by the type system and happen at compile time, before inference is run. In contrast, both Pyro and Birch maintain a dynamic graph that guides the analysis at runtime.
    Symbolic inference. Where a full analytical solution is possible, several probabilistic programming languages can derive it via symbolic manipulation, including Hakaru Narayanan et al. 2016 and PSI Gehr et al. 2016 2020, while Dice Holtzen et al. 2020 performs exact inference for models with discrete parameters only, by analysing the program structure. In contrast, we focus on re-writing the program and decomposing it into parts to be used with fast and more general asymptotically exact or approximate inference algorithms, such as HMC, variational inference, or others.
Extending HMC to support discrete parameters. The idea of modifying HMC to handle discrete variables and discontinuities has been previously explored Zhou 2020 Zhang et al. 2012 Pakman and Paninski 2013 Nishimura et al. 2017. More recently, Zhou et al. 2019 introduced the probabilistic programming language LF-PPL, which is designed specifically to be used with the Discontinuous Hamiltonian Monte Carlo (DHMC) algorithm Nishimura et al. 2017; the algorithm and their framework can also be extended to support discrete parameters. In other words, LF-PPL supports an HMC version that itself works with discontinuities, whereas our approach is to statically rewrite the program to match the constraints of Stan: vanilla HMC and its several well-optimised extensions, such as NUTS Hoffman and Gelman 2014.
    Composable and programmable inference. Recent years have seen a growing number of techniques that allow for tailored-to-the-program compilation to an inference algorithm. For example, Gen Cusumano-Towner et al. 2019 can statically analyse the model structure to compile to a more efficient inference strategy. In addition, languages such as Gen and Turing Ge et al. 2018 facilitate composable and programmable inference Mansinghka et al. 2018, where the user is provided with inference building blocks to implement their own model-specific algorithm. Our method can be understood as an automatic composition between two inference algorithms: variable elimination and HMC or any other inference algorithm that can be used to sample continuous variables.

    7 Conclusion

This article introduces an information flow type system that can be used to check and infer conditional independence relationships in probabilistic programs, through type checking and type inference, respectively. We present a practical application of this type system: a semantics-preserving transformation that makes it possible to use discrete parameters in SlicStan, Stan, and other density-based probabilistic programming languages, and to infer them efficiently and automatically. The transformed program can be seen as a hybrid inference algorithm on the original program, where continuous parameters can be drawn using efficient gradient-based inference methods, like HMC, while the discrete parameters are drawn using variable elimination.
    While the variable elimination transformation uses results on conditional independence of discrete parameters, our type system is not restricted to this usage. Conditional independence relationships can be of interest in many contexts in probabilistic modelling, including more general use of variable elimination, message-passing algorithms, Rao-Blackwellization, and factorising a program for a composed-inference approach. We believe conditional independence by typing can enable interesting future work that automates the implementation of such methods.

    Acknowledgments

We thank Vikash Mansinghka for suggesting the outlier detection example, which we used for evaluation, as well as Lawrence Murray for clarifying the behaviour of Birch, and the anonymous reviewers, whose helpful suggestions improved the article.

    Footnotes

1. The implementation is available at https://github.com/mgorinova/SlicStan.
2. We use \( \ell' \gt \ell \) as a shorthand for \( \ell \leq \ell' \wedge \neg \ell' \leq \ell . \)
3. Here, we use \( \Gamma (L) \) to look up the type of the L-value L in Γ. Sometimes we will use an overloaded meaning of this notation (Definition 14) to look up the level type of a general expression. Which \( \Gamma (.) \) we refer to will be clear from context.
4. \( f(V_1, \dots , V_n) \) means applying the built-in function f to the values \( V_1, \dots , V_n \).
5. Here, we write \( E[E^{\prime }/x] \) for the usual capture-avoiding substitution of \( E^{\prime } \) for x in E.
6. By \( d(V|V_1,\ldots ,V_n) \), we mean the result of evaluating the intended built-in conditional distribution d on \( V,V_1,\ldots ,V_n \).
7. For example, in our Stan backend for SlicStan, if such a statement is of level \( \mathbf {{\rm\small MODEL}} \), it will be executed as a density contribution, while if it is of level \( \mathbf {{\rm\small GENQUANT}} \), then it will be executed as a probabilistic assignment.
8. Ancestral (or forward) sampling refers to the method of sampling from a joint distribution by individually sampling variables from the factors constituting the joint distribution. For example, we can sample from \( p(x, y) = p(x)p(y \mid x) \) by randomly generating \( \hat{x} \) according to p(x), and then randomly generating \( \hat{y} \) according to \( p(y \mid x = \hat{x}) \).
9. Here, \( p(Q \mid \boldsymbol {\theta },\mathcal {D}) \) denotes the conditional probability density of Q, given the values of \( \boldsymbol {\theta } \) and \( \mathcal {D} \).
10. A factor graph is a bipartite graph that shows the factorisation of a multivariable function. Variables are circular nodes, and each factor of the function is a square node. An edge exists between a variable node x and a factor node φ if and only if φ is a function of x. See Program A and its corresponding factor graph as an example, or Koller and Friedman 2009 for details.
11. Here, we assume the function sum is available in the language.
12. This expression is simplified for readability.
13. “Neighbours” refers to the variables that are connected to a factor that connects to z.

    A Definitions and Proofs

    A.1 Definitions

    Definition 10 (Assigns-to Set W(S))
    W(S) is the set that contains the names of global variables that have been assigned to within the statement S. It is defined recursively as follows:
    Definition 11 (Reads Set R(S))
    R(S) is the set that contains the names of global variables that have been read within the statement S. It is defined recursively as follows: \( R(x) = \lbrace x\rbrace \)
    Definition 12 (Samples-to Set \( \widetilde{W}(S) \))
    Definition 13 (Free Variables FV(S))
    FV(S) is the set that contains the free variables that are used in a statement S. It is recursively defined as follows: \( FV(x)=\lbrace x\rbrace \)
    Definition 14.
    We overload the notation \( \Gamma (L) \) that looks up the type of an L-value in Γ. When applied to a more general expression E, \( \Gamma (E) \) looks up the type level of E in Γ: \( \Gamma (x) = \ell , \text{ where } \ell \text{ is the level of } x \text{ in } \Gamma \)
    Definition 15.
    \( \Gamma (E_1, \dots , E_n) \equiv \Gamma (E_1) \sqcup \dots \sqcup \Gamma (E_n) \).
Definition 16 (\( R_{\Gamma \vdash \ell }(S) \))
\( R_{\Gamma \vdash \ell }(S) \triangleq \lbrace x \in R(S) \mid \Gamma (x) = (\tau , \ell) \text{ for some } \tau \rbrace . \)
    Definition 17 (\( W_{\Gamma \vdash \ell }(S) \))
    \( W_{\Gamma \vdash \ell }(S) \triangleq \lbrace x \in W(S) \mid \Gamma (x) = (\tau , \ell) \text{ for some } \tau \rbrace . \)
    Definition 18 (\( \widetilde{W}_{\Gamma \vdash \ell }(S) \))
    \( \widetilde{W}_{\Gamma \vdash \ell }(S) \triangleq \lbrace x \in \widetilde{W}(S) \mid \Gamma (x) = (\tau , \ell) \text{ for some } \tau \rbrace . \)
    Definition 19.
    Given a statement S, we define the statement \( \mathrm{st}(S) \) by structural induction on S:
    Definition 20 (Neighbours of z, \( \mathrm{ne}(\Gamma , \Gamma ^{\prime }, z) \))
    For a \( \vdash \) typing environment Γ, a \( \vdash _{2} \) typing environment \( \Gamma ^{\prime } = \Gamma _{\sigma }^{\prime }, \Gamma _{\mathbf {x}}^{\prime } \) and a variable \( z \in \mathrm{dom}(\Gamma _{\mathbf {x}}^{\prime }) \), the neighbours of z are defined as:

    A.2 Proofs

    Lemma 1 (Noninterference of \( \vdash \)). Suppose \( s_1 \models \Gamma \), \( s_2 \models \Gamma \), and \( s_1 \approx _{\ell }\, s_2 \) for some ℓ. Then for SlicStan statement S and expression E:
    (1)
    If \( ~\Gamma \vdash E:(\tau ,\ell) \) and \( (s_1, E) \Downarrow V_1 \) and \( (s_2, E) \Downarrow V_2 \), then \( V_1 = V_2 \).
    (2)
    If \( ~\Gamma \vdash S:\ell \) and \( (s_1, S) \Downarrow s_1^{\prime }, w_1 \) and \( (s_2, S) \Downarrow s_2^{\prime }, w_2 \), then \( s_1^{\prime } \approx _{\ell } s_2^{\prime } \).
    Proof.
    (1) follows by rule induction on the derivation \( \Gamma \vdash E:(\tau , \ell) \), and using that if \( \Gamma \, \vdash E:(\tau , \ell) \), \( x \in R(E) \) and \( \Gamma (x) = (\tau ^{\prime }, \ell ^{\prime }) \), then \( \ell ^{\prime } \le \ell \). (2) follows by rule induction on the derivation \( \Gamma \vdash S:\ell \) and using (1).
    Most cases follow trivially from the inductive hypothesis. An exception is the Target case, which we show below.
Target. We use the premise \( \forall \ell ^{\prime } \gt \ell . R_{\Gamma \vdash \ell ^{\prime }}(S) = \varnothing \), together with a lemma that for \( S \), \( s_1 \), and \( s_2 \) such that \( s_1, S \Downarrow s_1^{\prime }, w_1 \), and \( s_2, S \Downarrow s_2^{\prime }, w_2 \), and \( \forall x \in R(S). s_1(x) = s_2(x) \), we have that \( w_1 = w_2 \). (This lemma follows by structural induction on S.) In the case of Target, \( s_1, \mathrm{target}(S) \Downarrow w_1 \) and \( s_2, \mathrm{target}(S) \Downarrow w_2 \), and \( R(S) = \bigcup _{\ell ^{\prime }} R_{\Gamma \vdash \ell ^{\prime }}(S) = \left(\bigcup _{\ell ^{\prime } \le \ell } R_{\Gamma \vdash \ell ^{\prime }}(S) \right) \cup \left(\bigcup _{\ell ^{\prime } \gt \ell } R_{\Gamma \vdash \ell ^{\prime }}(S) \right) = \bigcup _{\ell ^{\prime } \le \ell } R_{\Gamma \vdash \ell ^{\prime }}(S) \). Then, for any \( x \in R(S) \), \( x \in R_{\Gamma \vdash \ell ^{\prime }}(S) \) for some \( \ell ^{\prime } \le \ell \), so \( \Gamma (x) = (\tau , \ell _x) \) such that \( \ell _x \le \ell ^{\prime } \le \ell \). And thus, by definition of \( \approx _\ell \), \( s_1(x) = s_2(x) \) for any \( x \in R(S) \). By applying the lemma above, we then get \( w_1 = w_2 \), as required.
Lemma 4 (Shredding Produces Single-level Statements).
\[ S \Updownarrow _{\Gamma } S_D, S_M, S_Q \Rightarrow \Gamma \vdash \mathbf {{\rm\small DATA}}(S_D) \wedge \Gamma \vdash \mathbf {{\rm\small MODEL}}(S_M) \wedge \Gamma \vdash \mathbf {{\rm\small GENQUANT}}(S_Q). \]
    Proof.
    By rule induction on the derivation of \( S \Updownarrow _{\Gamma }S_D, S_M, S_Q \).□
Lemma 5 (Property of Single-level Statements). Let \( ~\Gamma _{\sigma }, \Gamma _{\mathbf {x}}\vdash S \) be a SlicStan program, such that S is a single-level statement of level ℓ, \( \Gamma \vdash \ell (S) \). Then there exist unique functions f and φ, such that for any \( \sigma , \mathbf {x} \models \Gamma _{\sigma }, \Gamma _{\mathbf {x}} \):
\[ [\![ S ]\!] (\sigma)(\mathbf {x}) = f(\sigma _{\le \ell }, \mathbf {x}_{\le \ell })\cup \sigma _{\gt \ell } , \;\;\phi (\sigma _{\le \ell })(\mathbf {x}_{\le \ell }), \]
where we write \( \sigma _{\le \ell }=\lbrace (x\mapsto V)\in \sigma \mid \Gamma _{\sigma }(x)=(\_,\ell ^{\prime }) \text{ for some } \ell ^{\prime } \le \ell \rbrace \) and \( \sigma _{\gt \ell }=\sigma \setminus \sigma _{\le \ell } \).
    Proof.
This property follows from noninterference (Lemma 1), if we understand factor and sample statements as assignments to reserved weight variables of different levels. Let \( \Gamma , S \) be a SlicStan program and suppose we obtain \( S^{\prime } \) by:
Substituting every \( \sf factor(E) \) statement with \( w_{\ell } = w_{\ell } * E \), where \( \Gamma (E) = \sf real, \ell \) and \( w_{\mathbf {{\rm\small DATA}}} \), \( w_{\mathbf {{\rm\small MODEL}}} \), and \( w_{\mathbf {{\rm\small GENQUANT}}} \) are write-only, distinct, and reserved variables in the program.
Substituting every \( L \sim d(E_1, \dots , E_n) \) statement with \( w_{\ell } = w_{\ell } * d_{\mathrm{pdf}}(L \mid E_1, \dots , E_n) \), where \( \Gamma (d_{\mathrm{pdf}}(L \mid E_1, \dots , E_n)) = \sf real, \ell \).
    Then for all \( \sigma , \mathbf {x} \models \Gamma \), we have \( [\![ S ]\!] _p(\sigma)(\mathbf {x}) = \prod _{\ell }\sigma ^{\prime }(w_{\ell }) \), where \( \sigma ^{\prime } = [\![ S^{\prime } ]\!] _s(\sigma , \forall \ell . w_{\ell } \mapsto 1)(\mathbf {x}) \). By non-interference (Lemma 1), for any level ℓ and store \( \sigma _2 \approx _{\ell } \sigma \), if \( \sigma _2^{\prime } = [\![ S^{\prime } ]\!] _s(\sigma _2, \forall \ell . w_{\ell } \mapsto 1)(\mathbf {x}) \), then \( \sigma _2^{\prime } \approx _{\ell } \sigma ^{\prime } \). Thus, \( \sigma _2^{\prime }(w_{\ell ^{\prime }}) = \sigma _2(w_{\ell ^{\prime }}) \) for \( \ell ^{\prime } \le \ell \), and therefore, when S is a single-level statement of level ℓ, \( [\![ S^{\prime } ]\!] _s(\sigma , \forall \ell . w_{\ell } \mapsto 1)(\mathbf {x}) = f(\sigma _{\le \ell }, \mathbf {x}_{\le {\ell }}), \sigma _{\gt \ell }, w_{\le \ell } \mapsto \phi (\sigma _{\le \ell }, \mathbf {x}_{\le {\ell }}), w_{\gt \ell } \mapsto 1 \), for some functions f and φ. Finally, this gives us \( [\![ S ]\!] _s(\sigma , \mathbf {x}) = (f(\sigma _{\le \ell }, \mathbf {x}_{\le {\ell }}), \sigma _{\gt \ell }) \), \( [\![ S ]\!] _p(\sigma , \mathbf {x}) = \phi (\sigma _{\le \ell }, \mathbf {x}_{\le {\ell }}) \).□
    Lemma 6 (Semantic Preservation of \( \Updownarrow _{\Gamma } \)). If \( ~\Gamma \vdash S:\mathbf {{\rm\small DATA}} \) and \( S \Updownarrow _{\Gamma } (S_{D_{}}, S_{M_{}}, S_{Q_{}}) \) then \( [\![ S ]\!] = [\![ S_D; S_M; S_Q ]\!] \).
    Proof.
    Follows by adapting proof from Gorinova et al. 2019.□
    Lemma 13.
    For a SlicStan expression E and a function \( \phi (x, y) = V \), where V is a value such that \( (\sigma , x, y), E \Downarrow V \) for every x and y and some \( \sigma \), if \( x \notin R(E) \), then:
    \[ \exists \phi ^{\prime } \text{ such that } \phi (x, y) = \phi ^{\prime }(y) \text{ for all } x, y. \]
    Proof.
    By induction on the structure of E.□
    Theorem 1 (Shredding Induces a Factorisation of the Density). Suppose \( \Gamma \vdash S : \mathbf {{\rm\small DATA}} \) and \( ~S \Updownarrow _{\Gamma } S_D, S_M, S_Q \) and \( \Gamma = \Gamma _{\sigma } \cup \Gamma _{\mathcal {D}} \cup \Gamma _{\boldsymbol {\theta }} \cup \Gamma _{Q} \). For all \( \sigma \), \( \mathcal {D} \), \( \boldsymbol {\theta } \), and Q: if \( \sigma , \mathcal {D}, \boldsymbol {\theta }, Q\models \Gamma _{\sigma }, \Gamma _{\mathcal {D}}, \Gamma _{\boldsymbol {\theta }}, \Gamma _{Q} \), and \( [\![ S ]\!] _p(\sigma)(\mathcal {D}, \boldsymbol {\theta }, Q) \propto p(\mathcal {D}, \boldsymbol {\theta }, Q) \) and \( \widetilde{W}(S_Q)=\mathrm{dom}(\Gamma _Q), \) then:
    (1)
    \( [\![ S_M ]\!] _p(\sigma _D)(\mathcal {D}, \boldsymbol {\theta }, Q) \propto p(\boldsymbol {\theta }, \mathcal {D}), \)
    (2)
    \( [\![ S_Q ]\!] _p(\sigma _M)(\mathcal {D}, \boldsymbol {\theta }, Q) = p(Q \mid \boldsymbol {\theta }, \mathcal {D}), \)
    where \( \sigma _D = [\![ S_D ]\!] _s(\sigma)(\mathcal {D}, \boldsymbol {\theta }, Q) \) and \( \sigma _M = [\![ S_M ]\!] _s(\sigma _D)(\mathcal {D}, \boldsymbol {\theta }, Q) \).
    Proof.
    We prove this by establishing a more general result:
    For \( \sigma , \mathcal {D}, \boldsymbol {\theta }, Q\models \Gamma _{\sigma }, \Gamma _{\mathcal {D}}, \Gamma _{\boldsymbol {\theta }}, \Gamma _{Q} \), \( A = \widetilde{W}(S_Q) \subseteq Q \) and some \( B \subseteq Q \setminus A \), if \( [\![ S ]\!] _p(\sigma)(\mathcal {D}, \boldsymbol {\theta }, Q) \propto p(\mathcal {D}, \boldsymbol {\theta }, A \mid B), \) then:
    (1)
    \( [\![ S_D ]\!] _p(\sigma)(\mathcal {D}, \boldsymbol {\theta }, Q) = 1, \)
    (2)
    \( [\![ S_M ]\!] _p(\sigma _D)(\mathcal {D}, \boldsymbol {\theta }, Q) = p(\boldsymbol {\theta }, \mathcal {D}), \)
    (3)
    \( [\![ S_Q ]\!] _p(\sigma _M)(\mathcal {D}, \boldsymbol {\theta }, Q) = p(A \mid \boldsymbol {\theta }, \mathcal {D}, B). \)
    Note that in the case where \( \widetilde{W}(S_Q) = Q \), we have \( A = Q \) and \( B = \varnothing \), and the original statement of the theorem, \( [\![ S_Q ]\!] _p(\sigma _M)(\mathcal {D}, \boldsymbol {\theta }, Q) = p(Q \mid \boldsymbol {\theta }, \mathcal {D}) \), holds.
We prove the extended formulation above by induction on the structure of S, using Lemmas 2, 4, 5, and 6.
Let \( \Phi (S, S_D, S_M, S_Q) \) denote the conjunction of properties (1)–(3) above.
    Take any \( \Gamma , S, S_D, S_M, S_Q \) such that \( S \Updownarrow _{\Gamma }S_D, S_M, S_Q \), \( A = \widetilde{W}(S_Q) \), and take any \( \sigma , \mathcal {D}, \boldsymbol {\theta }, Q\models \Gamma _{\sigma }, \Gamma _{\mathcal {D}}, \Gamma _{\boldsymbol {\theta }}, \Gamma _{Q} \), an unnormalised density p and \( B \subseteq Q \setminus A \), such that \( [\![ S ]\!] _p(\sigma)(\mathcal {D}, \boldsymbol {\theta }, Q) \propto p(\mathcal {D}, \boldsymbol {\theta }, A \mid B) \). We prove by rule induction on the derivation of \( S \Updownarrow _{\Gamma }S_D, S_M, S_Q \) that \( \Phi (S, S_D, S_M, S_Q) \).
    Shred Seq. Let \( S = S_1; S_2 \) and \( S_1 \Updownarrow _{\Gamma }S_{1D}, S_{1M}, S_{1Q} \) and \( S_2 \Updownarrow _{\Gamma }S_{2D}, S_{2M}, S_{2Q} \). Thus, \( S \Updownarrow _{\Gamma }(S_{1D}; S_{2D}), (S_{1M}; S_{2M}), (S_{1Q}; S_{2Q}) \).
    Assume \( \Phi (S_1, S_{1D}, S_{1M}, S_{1Q}) \) and \( \Phi (S_2, S_{2D}, S_{2M}, S_{2Q}) \).
    Let:
    \( A_1 = \widetilde{W}(S_{1Q}) \) and \( B_1 \subseteq Q \setminus A_1 \) is such that \( [\![ S_{1Q} ]\!] _p(\sigma _M)(\mathcal {D}, \boldsymbol {\theta }, Q) = p_1(A_1 \mid \mathcal {D}, \boldsymbol {\theta }, B_1) \).
    \( [\![ S_1 ]\!] (\sigma)(\mathcal {D}, \boldsymbol {\theta }, Q) = \sigma ^{\prime } \).
    \( [\![ S_1 ]\!] _p(\sigma)(\mathcal {D}, \boldsymbol {\theta }, Q) \propto p_1(\mathcal {D}, \boldsymbol {\theta }, A_1 \mid B_1) \).
    \( A_2 = \widetilde{W}(S_{2Q}) \) and \( B_2 \subseteq Q \setminus A_2 \) is such that \( [\![ S_{2Q} ]\!] _p(\sigma _M)(\mathcal {D}, \boldsymbol {\theta }, Q) = p_2(A_2 \mid \mathcal {D}, \boldsymbol {\theta }, B_2) \).
    \( [\![ S_2 ]\!] _p(\sigma ^{\prime })(\mathcal {D}, \boldsymbol {\theta }, Q) \propto p_2(\mathcal {D}, \boldsymbol {\theta }, A_2 \mid B_2) \).
    Thus, by Lemma 2, \( [\![ S ]\!] _p = [\![ S_1; S_2 ]\!] _p = [\![ S_1 ]\!] _p \times [\![ S_2 ]\!] _p \), so \( p(\mathcal {D}, \boldsymbol {\theta }, A \mid B) \propto p_1(\mathcal {D}, \boldsymbol {\theta }, A_1 \mid B_1)p_2(\mathcal {D}, \boldsymbol {\theta }, A_2 \mid B_2) \).
    For (1), we have \( \forall \sigma \models \Gamma _\sigma . [\![ S_{1D} ]\!] _p(\sigma)(\mathcal {D}, \boldsymbol {\theta }, Q) = [\![ S_{2D} ]\!] _p(\sigma)(\mathcal {D}, \boldsymbol {\theta }, Q) = 1 \). Thus, by Lemma 2, \( [\![ S_{1D}; S_{2D} ]\!] _p = [\![ S_{1D} ]\!] _p \times [\![ S_{2D} ]\!] _p = 1 \).
    From \( \Phi (S_1, S_{1D}, S_{1M}, S_{1Q}) \) and \( \Phi (S_2, S_{2D}, S_{2M}, S_{2Q}) \), we also have:
    \( [\![ S_{1Q} ]\!] _p(\sigma _M)(\mathcal {D}, \boldsymbol {\theta }, Q) = p(A_1 \mid \boldsymbol {\theta }, \mathcal {D}, B_1), \)
    \( [\![ S_{2Q} ]\!] _p(\sigma _M^{\prime })(\mathcal {D}, \boldsymbol {\theta }, Q) = p(A_2 \mid \boldsymbol {\theta }, \mathcal {D}, B_2), \)
    \[ A = \widetilde{W}(S_Q) = \widetilde{W}(S_{1Q}; S_{2Q}) = \widetilde{W}(S_{1Q}) \cup \widetilde{W}(S_{2Q}) = A_1 \cup A_2. \]
Since S is well typed, it must be the case that \( A_1 \cap A_2 = \varnothing \). Thus, we write \( A = A_1, A_2 \).
    We will prove that the property holds for \( B = B_1 \cup B_2 \setminus A_1 \setminus A_2 \).
    By semantic preservation of \( \Updownarrow _{\Gamma } \) (Lemma 6), \( [\![ S_1 ]\!] _p = [\![ S_{1D}; S_{1M}; S_{1Q} ]\!] _p = [\![ S_{1D} ]\!] _p \times [\![ S_{1M} ]\!] _p \times [\![ S_{1Q} ]\!] _p \propto 1 \times p_1(\boldsymbol {\theta }, \mathcal {D}) \times p_1(A_1 \mid \boldsymbol {\theta }, \mathcal {D}, B_1) \). Similarly, \( [\![ S_2 ]\!] _p \propto 1 \times p_2(\boldsymbol {\theta }, \mathcal {D}) \times p_2(A_2 \mid \boldsymbol {\theta }, \mathcal {D}, B_2) = p_2(\boldsymbol {\theta }, \mathcal {D}) p_2(A_2 \mid \boldsymbol {\theta }, \mathcal {D}, A_1, B_1) \).
    But \( p(\boldsymbol {\theta }, \mathcal {D}, A \mid B) \propto p_1(\boldsymbol {\theta }, \mathcal {D}, A_1 \mid B_1)p_2(\boldsymbol {\theta }, \mathcal {D}, A_2 \mid B_2) \), so:
    \[ p(\boldsymbol {\theta }, \mathcal {D}, A \mid B) \propto p_1(\boldsymbol {\theta }, \mathcal {D}) p_1(A_1 \mid \boldsymbol {\theta }, \mathcal {D}, B_1) p_2(\boldsymbol {\theta }, \mathcal {D}) p_2(A_2 \mid \boldsymbol {\theta }, \mathcal {D}, A_1, B_1). \]
    So,
    \begin{align*} p(\boldsymbol {\theta }, \mathcal {D}) &= \int p(\mathcal {D}, \boldsymbol {\theta }, A \mid B) p(B) dA dB \\ &\propto \int p_1(\boldsymbol {\theta }, \mathcal {D}) p_1(A_1 \mid \boldsymbol {\theta }, \mathcal {D}, B_1) p_2(\boldsymbol {\theta }, \mathcal {D}) p_2(A_2 \mid \boldsymbol {\theta }, \mathcal {D}, A_1, B_1) p(B) dA_1 dA_2 dB \\ &\propto p_1(\boldsymbol {\theta }, \mathcal {D})p_2(\boldsymbol {\theta }, \mathcal {D}) \int p(B) p_1(A_1 \mid \boldsymbol {\theta }, \mathcal {D}, B_1) p_2(A_2 \mid \boldsymbol {\theta }, \mathcal {D}, A_1, B_1) dA_1 dA_2 dB\\ &= p_1(\boldsymbol {\theta }, \mathcal {D})p_2(\boldsymbol {\theta }, \mathcal {D}) \int p(B) \left(\int p_1(A_1 \mid \boldsymbol {\theta }, \mathcal {D}, B_1) \left(\int p_2(A_2 \mid \boldsymbol {\theta }, \mathcal {D}, A_1, B_1) dA_2 \right) dA_1 \right) dB \\ &= p_1(\boldsymbol {\theta }, \mathcal {D})p_2(\boldsymbol {\theta }, \mathcal {D})\\ &\propto p_1(\boldsymbol {\theta }, \mathcal {D})p_2(\boldsymbol {\theta }, \mathcal {D}). \end{align*}
    \[ \text{Thus, } [\![ S_M ]\!] _p = [\![ S_{1M}; S_{2M} ]\!] _p \propto p_1(\boldsymbol {\theta }, \mathcal {D})p_2(\boldsymbol {\theta }, \mathcal {D}) \propto p(\boldsymbol {\theta }, \mathcal {D}). \]
Finally, for the last property, we use the chain rule of probability, the semantics of sequencing, and the results from above to get:
    \begin{align*} p(A \mid \mathcal {D}, \boldsymbol {\theta }, B) &= \frac{p(\mathcal {D}, \boldsymbol {\theta }, A \mid B)}{p(\mathcal {D}, \boldsymbol {\theta }\mid B)} \\ &\propto \frac{p_1(\mathcal {D}, \boldsymbol {\theta })p_2(\mathcal {D}, \boldsymbol {\theta })p_1(A_1 \mid \mathcal {D}, \boldsymbol {\theta }, B_1)p_2(A_2 \mid \mathcal {D}, \boldsymbol {\theta }, B_2)}{p(\mathcal {D}, \boldsymbol {\theta })} \times \frac{p(B)}{p(B \mid \mathcal {D}, \boldsymbol {\theta })} \\ &\propto p_1(A_1 \mid \mathcal {D}, \boldsymbol {\theta }, B_1)p_2(A_2 \mid \mathcal {D}, \boldsymbol {\theta }, B_2) \\ &= [\![ S_{1Q} ]\!] _p [\![ S_{2Q} ]\!] _p = [\![ S_Q ]\!] _p. \end{align*}
    Thus:
    \begin{align*} p(A \mid \mathcal {D}, \boldsymbol {\theta }, B) &= \frac{ p_1(A_1 \mid \mathcal {D}, \boldsymbol {\theta }, B_1)p_2(A_2 \mid \mathcal {D}, \boldsymbol {\theta }, B_2)}{Z} , \end{align*}
    where:
    \begin{align*} Z &= \int p_1(A_1 \mid \mathcal {D}, \boldsymbol {\theta }, B_1)p_2(A_2 \mid \mathcal {D}, \boldsymbol {\theta }, B_2) dA \\ &= \int p_1(A_1 \mid \mathcal {D}, \boldsymbol {\theta }, B_1)\left(\int p_2(A_2 \mid \mathcal {D}, \boldsymbol {\theta }, B_2) dA_2 \right) dA_1 \\ &= 1. \end{align*}
    So, \( Z = 1 \), and \( p(A \mid \mathcal {D}, \boldsymbol {\theta }, B) = p_1(A_1 \mid \mathcal {D}, \boldsymbol {\theta }, B_1)p_2(A_2 \mid \mathcal {D}, \boldsymbol {\theta }, B_2) = [\![ S_Q ]\!] _p \).
    Thus:
    \( [\![ S_D ]\!] _p = [\![ S_{1D}; S_{2D} ]\!] _p = 1, \)
    \( [\![ S_M ]\!] _p = [\![ S_{1M}; S_{2M} ]\!] _p \propto p_1(\boldsymbol {\theta }, \mathcal {D})p_2(\boldsymbol {\theta }, \mathcal {D}) = p(\boldsymbol {\theta }, \mathcal {D}), \)
    \( [\![ S_Q ]\!] _p = [\![ S_{1Q}; S_{2Q} ]\!] _p = p_1(A_1 \mid \boldsymbol {\theta }, \mathcal {D}, B_1)p_2(A_2 \mid \boldsymbol {\theta }, \mathcal {D}, A_1, B_1) = p(A_1, A_2 \mid \boldsymbol {\theta }, \mathcal {D}, B), \)
Thus, \( \Phi ((S_1;S_2), (S_{1D}; S_{2D}), (S_{1M}; S_{2M}), (S_{1Q}; S_{2Q})) \) holds, which concludes this case.□
    Lemma 9 (Shredding Produces Single-level Statements 2).
    \[ S \Updownarrow _{\Gamma } S_1, S_2, S_3 \Rightarrow \Gamma \vdash \mathbf {{\rm\small L1}}(S_1) \wedge \Gamma \vdash \mathbf {{\rm\small L2}}(S_2) \wedge \Gamma \vdash \mathbf {{\rm\small L3}}(S_3). \]
    Proof.
    By rule induction on the derivation of \( S \Updownarrow _{\Gamma }S_1, S_2, S_3 \).□
    Lemma 10 (Semantic Preservation of \( \Updownarrow _{\Gamma } \), \( \vdash _{2} \)). If \( ~\Gamma \vdash _{2}S:\mathbf {{\rm\small L1}} \) and \( S \Updownarrow _{\Gamma } S_1, S_2, S_3 \), then \( [\![ S ]\!] = [\![ S_1; S_2; S_3 ]\!] \).
Proof.
Follows by adapting the proof from Gorinova et al. 2019.□
Lemma 11 (Property of Single-level Statements 2). Let \( ~\Gamma _{\sigma }, \Gamma _{\mathbf {x}}, S \) be a SlicStan program, such that \( \Gamma \vdash _{2}S : \mathbf {{\rm\small L1}} \) and S is a single-level statement of level ℓ, \( \Gamma \vdash _{2}\ell (S) \). Then there exist unique functions f and φ, such that for any \( \sigma , \mathbf {x} \models \Gamma _{\sigma }, \Gamma _{\mathbf {x}} \):
\[ [\![ S ]\!] (\sigma)(\mathbf {x}) = f(\sigma _{\le \ell }, \mathbf {x}_{\le \ell })\cup \sigma _{\gt \ell } , \;\;\phi (\sigma _{\le \ell })(\mathbf {x}_{\le \ell }). \]
    Proof.
By understanding factor and sample statements as assignments to reserved weight variables of different levels (similarly to Lemma 5) and noninterference (Lemma 7).□
Lemma 12 (Existence of \( \mathbf {{\rm M}{\rm\small{ODEL}}} \) to \( \mathbf {{\rm G}{\rm\small{ENQUANT}}} \) Transformation). For any SlicStan program \( \Gamma , S \) such that \( \Gamma \vdash S : \mathbf {{\rm\small DATA}} \), and a variable \( z \in \mathrm{dom}(\Gamma) \) such that \( \Gamma (z) = (\sf int\langle K \rangle , \mathbf {{\rm\small MODEL}}) \), there exists a SlicStan program \( \Gamma ^{\prime }, S^{\prime } \) such that \( \Gamma , S \xrightarrow {z} \Gamma ^{\prime }, S^{\prime } \).
    Proof.
Take a SlicStan program \( \Gamma , S \), a typing environment \( \Gamma _M \), a variable z, and statements \( S_D, S_M, \) and \( S_Q \), such that:
    Take also statements \( S_1, S_2, S_3 \), and \( S_M^{\prime } \), and a typing environment \( \Gamma _{\mathrm{ne}} \) such that
Let \( \Gamma ^{\prime } \) be such that \( \mathrm{dom}(\Gamma ^{\prime }) = \mathrm{dom}(\Gamma) \cup \lbrace f\rbrace \) and for all \( x : \tau , \ell \in \Gamma \):
    By semantic preservation of shredding (Lemma 6, Lemma 10) and type preservation of the operational semantics (Gorinova et al. 2019), \( \Gamma \vdash S_D; S_1; S_2; S_3; S_Q : \mathbf {{\rm\small DATA}} \), and thus, by Seq, \( \Gamma \vdash S_D : \mathbf {{\rm\small DATA}} \), \( \Gamma \vdash S_1 : \mathbf {{\rm\small DATA}}, \dots , \Gamma \vdash S_Q : \mathbf {{\rm\small DATA}} \).
By definition of \( \Gamma ^{\prime } \), \( \Gamma ^{\prime }_{\mathbf {{\rm\small DATA}}} \subset \Gamma _{\mathbf {{\rm\small DATA}}} \). \( S_D \) is single-level of level \( \mathbf {{\rm\small DATA}} \) and \( \Gamma \vdash S_D : \mathbf {{\rm\small DATA}} \), so \( \Gamma _{\mathbf {{\rm\small DATA}}} \vdash S_D : \mathbf {{\rm\small DATA}} \) and thus \( \Gamma ^{\prime } \vdash S_D : \mathbf {{\rm\small DATA}} \). Similarly, \( \Gamma ^{\prime } \vdash S_1 : \mathbf {{\rm\small DATA}} \) and \( \Gamma ^{\prime } \vdash S_3 : \mathbf {{\rm\small DATA}} \).
    \( \Gamma \vdash S_2 : \mathbf {{\rm\small DATA}} \), so using Phi, Elim, and Factor, and noting that by definition \( \mathrm{dom}(\Gamma _{\mathrm{ne}}) \subset \mathrm{dom}(\Gamma _{M, \mathbf {{\rm\small L1}}}) \), so \( \Gamma _{\mathrm{ne}} \subset \Gamma \), we can derive:
    \[ \Gamma ^{\prime } \vdash f = \phi (\Gamma _{\mathrm{ne}}) \lbrace \sf elim(\sf int\langle K \rangle z)~S_2 \rbrace ; \sf factor(f[\mathrm{dom}(\Gamma _{\mathrm{ne}})]) : \mathbf {{\rm\small DATA.}} \]
    By \( \Gamma \vdash S_2 : \mathbf {{\rm\small DATA}} \) and the definition of \( \Gamma ^{\prime } \), and using Gen and definition of \( \mathrm{st} \), we also derive:
    \[ \Gamma ^{\prime } \vdash \sf gen(z)~ S_2; \mathrm{st}(S_2) : \mathbf {{\rm\small GENQUANT.}} \]
Finally, \( S_Q \) is a single-level statement of level \( \mathbf {{\rm\small GENQUANT}} \) and for all \( x : \tau , \ell \in \Gamma \), \( x : \tau , \ell ^{\prime } \in \Gamma ^{\prime } \), where \( \ell \le \ell ^{\prime } \). Therefore, \( \Gamma \vdash S_Q: \mathbf {{\rm\small DATA}} \) implies \( \Gamma ^{\prime } \vdash S_Q : \mathbf {{\rm\small DATA}} \).
Altogether, this gives us \( \Gamma ^{\prime } \vdash S_D; S_M^{\prime }; S_Q \), and so by Elim Gen, \( \Gamma , S \xrightarrow {z} \Gamma ^{\prime }, (S_D; S_M^{\prime }; S_Q) \).□
    Lemma 14.
Let \( \Gamma , S \) be a SlicStan program, such that \( \sigma , \mathbf {x} \models \Gamma \), \( [\![ S ]\!] _s(\sigma)(\mathbf {x}) = \sigma ^{\prime } \) and \( [\![ S ]\!] _p(\sigma)(\mathbf {x}) = \psi (\mathbf {x}) \) for some function \( \psi \). If \( f \notin \mathrm{dom}(\Gamma) \) is a fresh variable, \( z, z_1, \dots , z_n \in \mathrm{dom}(\Gamma _{\mathbf {x}}) \) are discrete variables of base types \( \sf int\langle K \rangle , \sf int\langle K_1 \rangle , \dots , \sf int\langle K_n \rangle \), respectively, and \( S^{\prime } \) is a statement such that
    \[ S^{\prime } = \quad f = \phi (\sf int\langle K_1 \rangle z_1, \dots \sf int\langle K_n \rangle z_n) \lbrace \sf elim(\sf int\langle K \rangle z)~S \rbrace ; \quad \sf factor(f[z_1, \dots , z_n]); \]
    then \( [\![ S^{\prime } ]\!] _s(\sigma)(\mathbf {x}) = \sigma ^{\prime \prime } \) with \( \sigma ^{\prime \prime }[-f] = \sigma ^{\prime } \) and \( [\![ S^{\prime } ]\!] _p(\sigma)(\mathbf {x}) = \sum _{z=1}^{K}\psi (\mathbf {x}) \).
    Proof.
    By examining the operational semantics of assignment, \( \sf factor \), and the derived forms \( \sf elim \) and φ.□
    Lemma 15.
    Let \( \Gamma , S \) be a SlicStan program, such that \( \sigma , \mathbf {x} \models \Gamma \), \( [\![ S ]\!] _s(\sigma)(\mathbf {x}) = \sigma ^{\prime } \) and \( [\![ S ]\!] _p(\sigma)(\mathbf {x}) = \psi (\mathbf {x}) \) for some function \( \psi \). If \( z \in \mathrm{dom}(\Gamma _{\mathbf {x}}) \) is a discrete variable of base type \( \sf int\langle K \rangle \), and \( S^{\prime } \) is a statement such that
    \[ S^{\prime } = \quad \sf gen(z)~ S; \quad \mathrm{st}(S); \]
    then \( [\![ S^{\prime } ]\!] _s(\sigma)(\mathbf {x}) = \sigma ^{\prime } \), \( \psi (\mathbf {x}) \) is normalisable with respect to z with \( \psi (\mathbf {x}) \propto p(z \mid \mathbf {x} \setminus \lbrace z\rbrace) \), and \( [\![ S^{\prime } ]\!] _p(\sigma)(\mathbf {x}) = p(z \mid \mathbf {x} \setminus \lbrace z\rbrace) \).
    Proof.
    By examining the operational semantics of ∼ and \( \sf target \), and by induction on the structure of S to prove \( [\![ \mathrm{st}(S) ]\!] _s = [\![ S ]\!] _s \) and \( [\![ \mathrm{st}(S) ]\!] _p = 1 \).□
Theorem 4 (Semantic Preservation of \( \xrightarrow {z} \)). For SlicStan programs \( \Gamma , S \) and \( \Gamma ^{\prime }, S^{\prime } \), and a discrete parameter z: \( \Gamma , S \xrightarrow {z} \Gamma ^{\prime }, S^{\prime } \Rightarrow [\![ S ]\!] = [\![ S^{\prime } ]\!] \).
    Proof.
    Let \( \Gamma , S \) and \( \Gamma ^{\prime }, S^{\prime } \) be SlicStan programs, and z be a discrete parameter, such that \( \Gamma , S \xrightarrow {z} \Gamma ^{\prime }, S^{\prime } \). Let \( S \Updownarrow _{\Gamma }S_D, S_M, S_Q \), \( S \Updownarrow _{\Gamma ^{\prime }} S_D^{\prime }, S_M^{\prime }, S_Q^{\prime } \), and \( S_M \Updownarrow _{\Gamma ^{\prime \prime }} S_1, S_2, S_3 \) for \( \Gamma ^{\prime \prime } \) such that \( \Gamma \xrightarrow {z} \Gamma ^{\prime \prime } \) and \( \Gamma ^{\prime \prime } \vdash _{2}S_M : \mathbf {{\rm\small L1}} \).
Let \( \Gamma = \Gamma _{\sigma }, \Gamma _{\mathbf {{\rm\small DATA}}}, \Gamma _{\mathbf {{\rm\small MODEL}}}, \Gamma _{\mathbf {{\rm\small GENQUANT}}} \), \( \Gamma ^{\prime } = \Gamma _{\sigma }^{\prime }, \Gamma _{\mathbf {{\rm\small DATA}}}^{\prime }, \Gamma _{\mathbf {{\rm\small MODEL}}}^{\prime }, \Gamma _{\mathbf {{\rm\small GENQUANT}}}^{\prime } \), and \( \Gamma ^{\prime \prime } = \Gamma _{\sigma }^{\prime \prime }, \Gamma _{\mathbf {{\rm\small L1}}}^{\prime \prime }, \Gamma _{\mathbf {{\rm\small L2}}}^{\prime \prime }, \Gamma _{\mathbf {{\rm\small L3}}}^{\prime \prime } \) be the usual partitioning of each of the typing environments.
    Let z be a store such that \( z \models \lbrace z : \Gamma (z)\rbrace \).
    Let \( \mathcal {D}, \boldsymbol {\theta } \) and Q be stores such that \( \mathcal {D}\models \Gamma _{\mathbf {{\rm\small DATA}}} \), \( z, \boldsymbol {\theta }\models \Gamma _{\mathbf {{\rm\small MODEL}}} \), and \( Q\models \Gamma _{\mathbf {{\rm\small GENQUANT}}} \).
    Let \( \boldsymbol {\theta }_1, \boldsymbol {\theta }_2 \) and \( \boldsymbol {\theta }_3 \) be a partitioning of \( \boldsymbol {\theta } \), such that \( \mathcal {D}, \boldsymbol {\theta }_1 \models \Gamma _{\mathbf {{\rm\small L1}}}^{\prime \prime } \), \( z, \boldsymbol {\theta }_2 \models \Gamma _{\mathbf {{\rm\small L2}}}^{\prime \prime } \), and \( \boldsymbol {\theta }_3 \models \Gamma _{\mathbf {{\rm\small L3}}}^{\prime \prime } \).
    Then, by definition of \( \Gamma \xrightarrow {z} \Gamma ^{\prime \prime } \), \( \boldsymbol {\theta }_2 = z \).
    By Theorem 1:
    \( [\![ S_D ]\!] _p(\sigma)(\mathcal {D}, z, \boldsymbol {\theta }, Q) = 1, \)
    \( [\![ S_M ]\!] _p(\sigma _D)(\mathcal {D}, z, \boldsymbol {\theta }, Q) \propto p(z, \boldsymbol {\theta }, \mathcal {D}), \)
    \( [\![ S_Q ]\!] _p(\sigma _M)(\mathcal {D}, z, \boldsymbol {\theta }, Q) = p(Q\mid z, \boldsymbol {\theta }, \mathcal {D}). \)
\( \Gamma , S \xrightarrow {z} \Gamma ^{\prime }, S^{\prime } \), thus \( S^{\prime } \) must be of the form
\[ S^{\prime } = S_D; \;\;S_{1}; \;\;f = \phi (\Gamma _{\mathbf {{\rm\small L1}}}^{\prime \prime }) \lbrace \sf elim(\sf int\langle K \rangle z)~S_{2} \rbrace ; \;\;\sf factor(f[\mathrm{dom}(\Gamma _{\mathbf {{\rm\small L1}}}^{\prime \prime })]); \;\;S_{3}; \;\;\sf gen(z) S_2; \;\;\mathrm{st}(S_2); \;\;S_Q, \]
    where \( \Gamma \vdash S : \mathbf {{\rm\small DATA}}, \;\;S \Updownarrow _{\Gamma }S_{D}, S_{M}, S_{Q}, \;\;\Gamma \xrightarrow {z} \Gamma ^{\prime \prime }, \;\;\Gamma \vdash _{2}S_M : \mathbf {{\rm\small L1}}, \;\; \) and \( S_M \Updownarrow _{\Gamma ^{\prime \prime }} S_{1}, S_{2}, S_{3} \).
    The relation \( \Updownarrow _{\Gamma } \) is semantics-preserving for well-typed programs with respect to both \( \vdash \) and \( \vdash _{2} \) (Lemma 6 and Lemma 10). Thus, \( [\![ S ]\!] = [\![ S_D; S_1; S_2; S_3; S_Q ]\!] \).
Figure 14 presents a diagrammatic derivation of the change in store and density that each sub-part of the original and the transformed program makes.
Fig. 14. Diagrammatic proof of semantic preservation of \( \xrightarrow {z} \).
    Combining all of these results gives that:
\[ [\![ S^{\prime } ]\!] _s(\sigma)(\mathcal {D}, \boldsymbol {\theta }, Q) = \sigma ^{\prime \prime } = \sigma ^{\prime }[f \mapsto v] = [\![ S ]\!] _s(\sigma)(\mathcal {D}, \boldsymbol {\theta }, Q)[f \mapsto v]. \]
In other words, the transformation \( \xrightarrow {z} \) preserves store semantics (up to the creation of one fresh variable f).
    For the density, we get:
\begin{align*} [\![ S^{\prime } ]\!] _p&(\sigma)(\mathcal {D}, \boldsymbol {\theta }, Q) \\ &= \phi _1(\mathcal {D}, \boldsymbol {\theta }_1) \left[ \sum _z \phi _2(\mathcal {D}, \boldsymbol {\theta }_1, z) \right] \phi _3(\mathcal {D}, \boldsymbol {\theta }_1, \boldsymbol {\theta }_{3}) p(z \mid \mathcal {D}, \boldsymbol {\theta }_1) p(Q \mid \mathcal {D}, \boldsymbol {\theta }) && \text{from Figure~14,} \\ &= \left[\sum _z \phi _1(\mathcal {D}, \boldsymbol {\theta }_1) \phi _2(\mathcal {D}, \boldsymbol {\theta }_1, z) \phi _3(\mathcal {D}, \boldsymbol {\theta }_1, \boldsymbol {\theta }_{3})\right] p(z \mid \mathcal {D}, \boldsymbol {\theta }_1) p(Q \mid \mathcal {D}, \boldsymbol {\theta }) && \!\!\!\begin{array}{l}\text{by the distributive}\\ \text{law,}\end{array} \\ &\propto \left[\sum _z p(\mathcal {D}, \boldsymbol {\theta }_1, z, \boldsymbol {\theta }_2)\right] p(z \mid \mathcal {D}, \boldsymbol {\theta }_1) p(Q \mid \mathcal {D}, \boldsymbol {\theta }) && \!\!\!\begin{array}{l}\text{by Theorem~1} \\ \text{and Lemma~10,}\end{array}\\ &= p(\mathcal {D}, \boldsymbol {\theta }_1, \boldsymbol {\theta }_2) p(z \mid \mathcal {D}, \boldsymbol {\theta }_1) p(Q \mid \mathcal {D}, \boldsymbol {\theta }) && \text{marginalisation of } z, \\ &= p(\mathcal {D}, \boldsymbol {\theta }_1, \boldsymbol {\theta }_2) p(z \mid \mathcal {D}, \boldsymbol {\theta }_1, \boldsymbol {\theta }_{3}) p(Q \mid \mathcal {D}, \boldsymbol {\theta }) && \!\!\!\begin{array}{l}\text{by } z \mathrel {\text{$\perp \perp $}}\boldsymbol {\theta }_{3} \mid \boldsymbol {\theta }_1\\ \text{ (Theorem~3),}\end{array} \\ &= p(\mathcal {D}, \boldsymbol {\theta }, Q) && \!\!\!\begin{array}{l}\text{by the chain rule} \\ \text{for probability,}\end{array} \\ &\propto [\![ S ]\!] _p(\sigma)(\mathcal {D}, \boldsymbol {\theta }, Q). \end{align*}
    Together, this gives us \( [\![ S ]\!] = [\![ S^{\prime } ]\!] \) (up to \( S^{\prime } \) creating one new fresh variable f).□

    B Examples

    B.1 Sprinkler

    Often, beginners are introduced to probabilistic modelling through simple, discrete variable examples, as they are more intuitive to reason about and often have analytical solutions. Unfortunately, one cannot express such examples directly in PPLs that do not support discrete parameters. One well-known discrete variable example, often used in tutorials on probabilistic modelling, is the “Sprinkler” example. It models the relationship between cloudy weather, whether it rains, whether the garden sprinkler is on, and the wetness of the grass. In Figure 15, we show a version of the sprinkler model written in SlicStan with discrete parameters (left) and the marginalisation part of its corresponding transformed version (right).
    As \( \mathrm{cloudy} \mathrel {\text{$\perp \perp $}}\mathrm{wet} \mid \mathrm{sprinkler}, \mathrm{rain} \), we do not need to include \( \mathrm{wet} \) in the elimination of \( \mathrm{cloudy} \), and the new factor is computed for different values of only \( \mathrm{sprinkler} \) and \( \mathrm{rain} \) (lines 2–6). The rest of the variables are eliminated one-by-one, involving all remaining variables (lines 7–15).
The snippet of the SlicStan code generated by our transformation is an exact implementation of the variable elimination algorithm for this model. This not only provides a platform for learning probabilistic programming using standard introductory models, but can also be a useful tool for learning concepts such as marginalisation, conditional independence, and exact inference methods, as illustrated by the sketch below.
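To make the elimination steps concrete, the following is a minimal NumPy sketch of the same computation; the conditional probability tables are illustrative stand-ins (the actual tables appear in Figure 15), and the code mirrors the structure of the generated program rather than reproducing it.

```python
# Variable elimination for the sprinkler model, with illustrative CPTs;
# all variables are binary (0 = false, 1 = true).
import numpy as np

p_c = np.array([0.5, 0.5])                        # p(cloudy)
p_s = np.array([[0.5, 0.5], [0.9, 0.1]])          # p(sprinkler | cloudy)
p_r = np.array([[0.8, 0.2], [0.2, 0.8]])          # p(rain | cloudy)
p_w = np.zeros((2, 2, 2))                         # p(wet | sprinkler, rain)
p_w[:, :, 1] = np.array([[0.0, 0.9], [0.9, 0.99]])
p_w[:, :, 0] = 1.0 - p_w[:, :, 1]

# Eliminate cloudy: wet is not a neighbour, so the new factor ranges over
# (sprinkler, rain) only, mirroring lines 2-6 of the generated code.
f_sr = np.einsum("c,cs,cr->sr", p_c, p_s, p_r)

# Eliminate sprinkler (and then rain), conditioning on wet = 1,
# mirroring the one-by-one eliminations in lines 7-15.
f_r = np.einsum("sr,srw->rw", f_sr, p_w)[:, 1]    # factor over rain
print("p(rain | wet = 1) =", f_r / f_r.sum())
```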
Fig. 15. The “Sprinkler” example.

B.2 Soft K-means Model

In Figure 16, we present the standard soft K-means clustering model as it is written in SlicStan with support for discrete model parameters (left). The right column shows the code that our program transformation generates: plain SlicStan code, on which inference can be performed without any support for discrete model parameters.
The model can be used for (softly) dividing N data points \( \mathbf {y} \) in D-dimensional Euclidean space into K clusters with means \( \boldsymbol {\mu } \) and mixing probabilities \( \boldsymbol {\pi } \).
Fig. 16. Soft K-means.

    B.3 A Causal Inference Example

The question of how to adapt PPLs to causal queries has recently been gaining popularity. One way to express interventions and reason about causality is to assume a discrete variable specifying the direction (or absence) of a causal relationship, and to specify different behaviour for each case using if statements Winn 2012. We show a simple causal inference example (Figure 17) written in SlicStan with direct support for discrete parameters (left) and the code that our transformation generates (right), on which we can perform inference using a combination of, e.g., HMC and ancestral sampling.
This model can be read as follows: assume that we want to answer a causal question based on N paired observations of A and B, in some of which we might have intervened on B (\( \mathrm{doB} \)). The model proceeds by drawing a (prior) probability that A causes B from a beta distribution, and then specifying A and B for each scenario (an intervention; A causes B with no intervention; B causes A with no intervention) using conditional statements.
Fig. 17. A causal inference example.

    References

    [1]
    Martín Abadi, Anindya Banerjee, Nevin Heintze, and Jon G. Riecke. 1999. A core calculus of dependency. In Symposium on Principles of Programming Languages. ACM, 147–160.
    [2]
    Eyal Amir. 2010. Approximation algorithms for treewidth. Algorithmica 56, 4 (2010), 448–479.
    [3]
    Torben Amtoft and Anindya Banerjee. 2020. A theory of slicing for imperative probabilistic programs. ACM Trans. Program. Lang. Syst. 42, 2 (2020), 1–71.
    [4]
    Stefan Arnborg, Derek G. Corneil, and Andrzej Proskurowski. 1987. Complexity of finding embeddings in a k-tree. SIAM J. Algeb. Discr. Meth. 8, 2 (1987), 277–284.
    [5]
    Jialu Bao, Simon Docherty, Justin Hsu, and Alexandra Silva. 2021. A bunched logic for conditional independence. In 36th Annual ACM/IEEE Symposium on Logic in Computer Science. IEEE, 1–14.
    [6]
    Gilles Barthe, Justin Hsu, and Kevin Liao. 2019. A probabilistic separation logic. Proc. ACM on Program. Lang. 4, POPL (2019), 1–30.
    [7]
    Michael Betancourt and Mark Girolami. 2015. Hamiltonian Monte Carlo for hierarchical models. Curr. Trends Bayesian Methodol. Applic. 79 (2015), 30.
    [8]
    Craig Boutilier, Nir Friedman, Moises Goldszmidt, and Daphne Koller. 1996. Context-specific independence in Bayesian networks. In 12th International Conference on Uncertainty in Artificial Intelligence. 115–123.
    [9]
    Bob Carpenter, Andrew Gelman, Matthew Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. Stan: A probabilistic programming language. J. Statist. Softw. 76, 1 (2017), 1–32. DOI:https://doi.org/10.18637/jss.v076.i01
    [10]
    Marco F. Cusumano-Towner, Feras A. Saad, Alexander K. Lew, and Vikash K. Mansinghka. 2019. Gen: A general-purpose probabilistic programming system with programmable inference. In 40th ACM SIGPLAN Conference on Programming Language Design and Implementation. Association for Computing Machinery, New York, NY, 221–236. DOI:https://doi.org/10.1145/3314221.3314642
    [11]
    Luis Damiano, Brian Peterson, and Michael Weylandt. 2018. A tutorial on hidden Markov models using Stan. StanCon (2018). DOI:https://doi.org/10.5281/zenodo.1284341.
    [12]
    Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An efficient SMT solver. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 337–340.
    [13]
    Paul Feautrier. 1992. Some efficient solutions to the affine scheduling problem. Part II. Multidimensional time. Int. J. Parallel Program. 21, 6 (1992), 389–420.
    [14]
    Brendan J. Frey. 2002. Extending factor graphs so as to unify directed and undirected graphical models. In 19th Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., San Francisco, CA, 257–264.
    [15]
    Hong Ge, Kai Xu, and Zoubin Ghahramani. 2018. Turing: A language for flexible probabilistic inference. In International Conference on Artificial Intelligence and Statistics. 1682–1690. DOI:http://proceedings.mlr.press/v84/ge18b.html.
    [16]
    Timon Gehr, Sasa Misailovic, and Martin Vechev. 2016. PSI: Exact symbolic inference for probabilistic programs. In International Conference on Computer-aided Verification. Springer, 62–83.
    [17]
    Timon Gehr, Samuel Steffen, and Martin Vechev. 2020. λPSI: Exact inference for higher-order probabilistic programs. In 41st ACM SIGPLAN Conference on Programming Language Design and Implementation. 883–897.
    [18]
    Andrew Gelman, Daniel Lee, and Jiqiang Guo. 2015. Stan: A probabilistic programming language for Bayesian inference and optimization. J. Educ. Behav. Statist. 40, 5 (2015), 530–543.
    [19]
    Andrew D. Gordon, Claudio V. Russo, Marcin Szymczak, Johannes Borgström, Nicolas Rolland, Thore Graepel, and Daniel Tarlow. 2015. Probabilistic programs as spreadsheet queries. In European Symposium on Programming Languages and Systems(Lecture Notes in Computer Science, Vol. 9032). Springer, 1–25.
    [20]
    Maria I. Gorinova, Andrew D. Gordon, and Charles Sutton. 2019. Probabilistic programming with densities in SlicStan: Efficient, flexible, and deterministic. Proce. ACM Program. Lang. 3, POPL (2019), 35.
    [21]
    Andreas Griewank and Andrea Walther. 2008. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM.
    [22]
    Matthew D. Hoffman and Andrew Gelman. 2014. The No-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15, 1 (2014), 1593–1623.
    [23]
    Steven Holtzen, Guy Van den Broeck, and Todd Millstein. 2020. Dice: Compiling discrete probabilistic programs for scalable inference. arXiv preprint arXiv:2005.09089 (2020).
    [24]
Uffe Kjærulff. 1990. Triangulation of graphs: Algorithms giving small total state space. Research Report R 90-09. Department of Mathematics and Computer Science, Aalborg University, Denmark.
    [25]
    Daphne Koller and Nir Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. The MIT Press.
    [26]
    Elisabet Lobo-Vesga, Alejandro Russo, and Marco Gaboardi. 2020. A programming framework for differential privacy with accuracy concentration bounds. In IEEE Symposium on Security and Privacy. IEEE, 411–428.
    [27]
    Vikash K. Mansinghka, Ulrich Schaechtle, Shivam Handa, Alexey Radul, Yutian Chen, and Martin Rinard. 2018. Probabilistic programming with programmable inference. In 39th ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, New York, NY, 603–616. DOI:https://doi.org/10.1145/3192366.3192409
    [28]
    Tom Minka and John Winn. 2009. Gates. In Advances in Neural Information Processing Systems, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou (Eds.), Vol. 21. Curran Associates, Inc. Retrieved from https://proceedings.neurips.cc/paper/2008/file/4b0a59ddf11c58e7446c9df0da541a84-Paper.pdf.
    [29]
    T. Minka, J. M. Winn, J. P. Guiver, S. Webster, Y. Zaykov, B. Yangel, A. Spengler, and J. Bronskill. 2014. Infer.NET 2.6. Microsoft Research Cambridge. Retrieved from http://research.microsoft.com/infernet.
    [30]
Dave Moore and Maria I. Gorinova. 2018. Effect handling for composable program transformations in Edward2. In International Conference on Probabilistic Programming. Retrieved from https://arxiv.org/abs/1811.06150.
    [31]
    Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.
    [32]
    Lawrence M. Murray, Daniel Lundén, Jan Kudlicka, David Broman, and Thomas B. Schön. 2018. Delayed sampling and automatic Rao-Blackwellization of probabilistic programs. In 21st International Conference on Artificial Intelligence and Statistics. 1037–1046. Retrieved from http://proceedings.mlr.press/v84/murray18a.html.
    [33]
    Lawrence M. Murray and Thomas B. Schön. 2018. Automated learning with a probabilistic programming language: Birch. Ann. Rev. Contr. 46 (2018), 29–43.
    [34]
    Praveen Narayanan, Jacques Carette, Wren Romano, Chung-chieh Shan, and Robert Zinkov. 2016. Probabilistic inference by program transformation in Hakaru (system description). In Functional and Logic Programming, Oleg Kiselyov and Andy King (Eds.). Springer International Publishing, Cham, 62–79.
    [35]
    Radford M. Neal et al. 2011. MCMC using Hamiltonian dynamics. Handb. Mark. Chain Monte Carlo 2, 11 (2011).
    [36]
    Akihiko Nishimura, David Dunson, and Jianfeng Lu. 2017. Discontinuous Hamiltonian Monte Carlo for sampling discrete parameters. arXiv preprint arXiv:1705.08510 (2017).
    [37]
    Fritz Obermeyer, Eli Bingham, Martin Jankowiak, Justin Chiu, Neeraj Pradhan, Alexander Rush, and Noah Goodman. 2019. Tensor variable elimination for plated factor graphs. In International Conference on Machine Learning.
    [38]
Ari Pakman and Liam Paninski. 2013. Auxiliary-variable exact Hamiltonian Monte Carlo samplers for binary distributions. In Advances in Neural Information Processing Systems.
    [39]
    Du Phan, Neeraj Pradhan, and Martin Jankowiak. 2019. Composable effects for flexible and accelerated probabilistic programming in NumPyro. arXiv preprint arXiv:1912.11554 (2019).
    [40]
    Benjamin C. Pierce and David N. Turner. 2000. Local type inference. ACM Trans. Program. Lang. Syst. 22, 1 (2000), 1–44.
    [41]
    Lawrence R. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 2 (1989), 257–286.
    [42]
    John Salvatier, Thomas V. Wiecki, and Christopher Fonnesbeck. 2016. Probabilistic programming in Python using PyMC3. PeerJ Comput. Sci. 2 (2016), e55.
    [43]
    Stan Development Team. 2019a. Stan language reference manual. Version 2.26.0. Retrieved from http://mc-stan.org.
    [44]
    Stan Development Team. 2019b. Stan user’s guide. Version 2.26.0. Retrieved from http://mc-stan.org.
    [45]
    Dustin Tran, Matthew D. Hoffman, Srinivas Vasudevan, Christopher Suter, Dave Moore, Alexey Radul, Matthew Johnson, and Rif A. Saurous. 2018. Edward2: Simple, distributed, accelerated. Retrieved from https://github.com/tensorflow/probability/tree/master/tensorflow_probability/python/edward2.
    [46]
    Uber AI Labs. 2017. Pyro: A deep probabilistic programming language. Retrieved from http://pyro.ai/.
    [47]
    Dennis M. Volpano, Cynthia E. Irvine, and Geoffrey Smith. 1996. A sound type system for secure flow analysis. J. Comput. Secur. 4, 2/3 (1996), 167–188.
    [48]
    John Winn. 2012. Causality with gates. In International Conference on Artificial Intelligence and Statistics. 1314–1322. Retrieved from http://proceedings.mlr.press/v22/winn12.html.
    [49]
    Frank Wood, Jan Willem van de Meent, and Vikash Mansinghka. 2014. A new approach to probabilistic programming inference. In 17th International Conference on Artificial Intelligence and Statistics. 1024–1032.
    [50]
    Nevin Lianwen Zhang and David Poole. 1994. A simple approach to Bayesian network computations. In Conference of the Canadian Society for Computational Studies of Intelligence. Canadian Information Processing Society, 171–178.
    [51]
    Yichuan Zhang, Zoubin Ghahramani, Amos J. Storkey, and Charles A. Sutton. 2012. Continuous relaxations for discrete Hamiltonian Monte Carlo. In Advances in Neural Information Processing Systems, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 3194–3202. Retrieved from http://papers.nips.cc/paper/4652-continuous-relaxations-for-discrete-hamiltonian-monte-carlo.pdf.
    [52]
Guangyao Zhou. 2020. Mixed Hamiltonian Monte Carlo for mixed discrete and continuous variables. In Advances in Neural Information Processing Systems.
    [53]
    Yuan Zhou, Bradley J. Gram-Hansen, Tobias Kohn, Tom Rainforth, Hongseok Yang, and Frank Wood. 2019. LF-PPL: A low-level first order probabilistic programming language for non-differentiable models. In International Conference on Artificial Intelligence and Statistics. Retrieved from http://proceedings.mlr.press/v89/zhou19b.html.

Cited By

• Compiling Probabilistic Programs for Variable Elimination with Information Flow. Proc. ACM Program. Lang. 8, PLDI (2024), 1755–1780. DOI:https://doi.org/10.1145/3656448
• Bit Blasting Probabilistic Programs. Proc. ACM Program. Lang. 8, PLDI (2024), 865–888. DOI:https://doi.org/10.1145/3656412
• Higher Order Bayesian Networks, Exactly. Proc. ACM Program. Lang. 8, POPL (2024), 2514–2546. DOI:https://doi.org/10.1145/3632926
• Type-Preserving, Dependence-Aware Guide Generation for Sound, Effective Amortized Probabilistic Inference. Proc. ACM Program. Lang. 7, POPL (2023), 1454–1482. DOI:https://doi.org/10.1145/3571243
• Smoothness Analysis for Probabilistic Programs with Application to Optimised Variational Inference. Proc. ACM Program. Lang. 7, POPL (2023), 335–366. DOI:https://doi.org/10.1145/3571205

Published In

ACM Transactions on Programming Languages and Systems, Volume 44, Issue 1 (March 2022), 279 pages.
ISSN: 0164-0925. EISSN: 1558-4593. DOI: 10.1145/3492457

Publisher

Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 09 December 2021
    Accepted: 01 September 2021
    Revised: 01 August 2021
    Received: 01 April 2021
    Published in TOPLAS Volume 44, Issue 1

    Author Tags

    1. Probabilistic programming
    2. information flow types
    3. static analysis
    4. conditional independence
    5. compiler correctness

    Funding Sources

    • EPSRC Centre for Doctoral Training in Data Science
    • UK Engineering and Physical Sciences Research Council
    • University of Edinburgh
    • European Union’s Horizon 2020 research and innovation programme
