Multivariate cluster point process to quantify and explore multi-entity configurations: Application to biofilm image data - Supplementary Materials
Suman Majumder1,∗ , Brent A. Coull2, Jessica L. Mark Welch3Patrick J. La Riviere4, Floyd E. Dewhirst3Jacqueline R. Starr5,†,Kyu Ha Lee2,†
1 University of Missouri, Columbia, MO, USA
2 Harvard T.H. Chan School of Public Health, Boston, MA, USA
3 Forsyth Institute, Cambridge, MA, USA
4 University of Chicago, Chicago, IL, USA
5 Brigham and Women’s Hospital, Boston, MA, USA
† Co-senior authors
A Neyman-Scott process
A Neyman-Scott process is a point process used for modeling parent-offspring clustering. In the simplest setting, consider the parent process to be a homogeneous Poisson point process with intensity . For each observation location , the cluster of offspring is an independent Poisson process with intensity , where is a probability distribution function parameterized by that determines the spread and distribution of the offspring locations around the parent , and is the expected number of offspring per cluster. The Neyman-Scott process is the union of all these offspring cluster processes, namely, . Further details can be found in illian2008statistical and chiu2013stochastic, for example.
B Images of the taxa from the human dental plaque biofilm data not visualized in the image included in the main text
C Computational details of the sampling algorithm
We use a Markov chain Monte Carlo (MCMC) method to draw samples from the joint posterior distribution of . In the MCMC scheme, parameters are updated either by exploiting conjugacies inherent to the proposed model or by using a Metropolis-Hastings algorithm.
C.1 Updating parameters associated with offspring densities
Let denote a set of parameters with removed. The full conditional distribution for is
where is the number of observations of taxon in the window.
C.2 Updating intensity parameters in homogeneous Poisson processes
Posterior conjugacy is also achieved in the full conditional distributions of intensity parameters, and , which are given by
where and are the numbers of observations for taxon and taxon within the window, respectively.
C.3 Updating bandwidth parameters
Since the full conditionals of the bandwidth parameters do not have standard forms, we use a random work Metropolis-Hastings step to update each of . Denote the sample for from iteration . For iteration , we propose a candidate sample as a random draw from , where is the prespecified variance of the proposal density. The corresponding acceptance ratio computes to
Then, we accept the proposed candidate as with probability or keep .
D Additional Tables from the simulation study
Here, we present additional details regarding the simulation scenarios (Table S.1) and results for the scenarios that included a taxon unrelated to the parent-offspring-type configurations of interest (Table S.2). The presence of an unrelated taxon (Table S.2) did not meaningfully affect the results (Table 1). Specifically, with or without this spatially unrelated taxon, the multivariate cluster point process (MCPP) performed better than the Neyman-Scott process (NSP) implementation in all aspects. The NSP often failed to converge, especially in scenarios where the bandwidth parameter was large.
Table S.1: A summary of twelve simulation scenarios considered in Section 4. The offspring density is controlled by setting for ‘Sparse’, for ‘Dense’ and for ‘Mixed’ densities. Bandwidth ‘Low’ sets and ‘High’ to . The setting “Unrelated taxon” refers to whether there exists a taxon in the data spatially unrelated to the multilayered arrangement.
Scenario
Unrelated taxon
Offspring density
Bandwidth
1
Absent
Sparse
Low
2
Absent
Sparse
High
3
Absent
Dense
Low
4
Absent
Dense
High
5
Absent
Mixed
Low
6
Absent
Mixed
High
7
Present
Sparse
Low
8
Present
Sparse
High
9
Present
Dense
Low
10
Present
Dense
High
11
Present
Mixed
Low
12
Present
Mixed
High
Table S.2: The true value, estimates (EST), and uncertainty measures for the offspring density (, ), bandwidth (, ), and parent process () parameters from the MCPP and NSP analyses in the last six simulated scenarios (those that included a spatially unrelated taxon). For the MCPP model, the estimates are the posterior means averaged over different datasets, the SD is computed by averaging the posterior standard deviation over different datasets, and the SDEST is computed as the standard deviation of the estimates over the datasets. For the NSP model, the estimates are the outputs of the minimum contrast method, and the SE is calculated similarly by using these estimates. The SD for the NSP model is not computed, as the method does not provide an uncertainty measure. The last column (F) refers to the percentage of datasets in which the NSP model failed to converge for a given scenario. There is no corresponding F column for the MCPP because all models converged.
True
MCPP
NSP
Scenario
Value
EST
SD
SDEST
EST
SE
F
1.50
1.53
0.10
0.10
1.46
0.33
1.00
1.02
0.08
0.09
3.31
20.25
7
0.01
0.01
0.01
2
0.02
0.02
0.04
0.09
150.00
161.06
12.91
12.20
171.35
34.72
1.50
1.48
0.11
0.09
198.46
283.69
1.00
1.02
0.08
0.09
0.98
0.28
8
0.10
0.08
0.01
0.01
10.30
28.72
36
0.01
0.01
0.01
150.00
160.25
12.86
12.57
939.70
2777.40
4.00
4.02
0.14
0.15
8.77
48.78
3.00
3.05
0.13
0.13
2.91
0.77
9
0.01
0.01
0.02
0.08
0
0.02
0.02
0.02
150.00
202.78
14.38
14.29
208.52
39.37
4.00
4.00
0.17
0.17
613.34
569.20
3.00
3.02
0.13
0.14
2.93
0.53
10
0.10
0.09
0.01
0.01
1.15
0.64
48
0.01
0.01
0.01
150.00
200.49
14.26
14.48
13.30
28.78
4.00
4.05
0.15
0.14
18.45
87.48
1.00
1.00
0.07
0.08
2.05
10.04
11
0.01
0.01
0.03
0.11
0
0.02
0.02
0.47
4.41
150.00
201.50
14.37
14.09
203.36
53.30
4.00
4.02
0.17
0.17
547.61
553.61
1.00
1.02
0.07
0.07
0.97
0.24
12
0.10
0.09
0.01
0.01
1.04
0.82
70
0.01
0.01
0.01
150.00
199.05
14.23
15.95
26.62
41.46
E Sensitivity analyses regarding choice of prior for the bandwidth parameters
As part of the simulation study described in Section 4, we also evaluated the MCPP method’s sensitivity to choice of prior distribution for the bandwidth parameters. We considered four different prior distributions; 1) half-normal, 2) uniform, 3) log-normal with a flat tail and high variance and 4) log-normal with a slim tail and higher peak. For the uniform prior, the lower and upper bounds were taken to be and , respectively. Both the log-normal priors had ; the flat-tailed prior had , and the high-peaked prior had as the hyperparameter. The hyperparameter setting for the half-normal prior was the same as in Section 4. We compared performance of the MCPP for the different prior distributions in Scenario 5 and 6 (Table 1): both scenarios considered mixed offspring density (=4 and =1), one had low bandwidth (=0.01 and =0.02), and the other had high bandwidth (=0.1 and =0.01).
We report the mean absolute percentage bias for estimating the corresponding parameters in the two scenarios for the four different prior settings: i) Half-normal, ii) Uniform, iii) a flat Log-normal, and iv) a tight Log-normal. The half-normal prior-based MCPP model performed the best, and the performance was similar to that for the original model. When the true bandwidth was low, all the models—irrespective of prior choice—generally performed well and similarly to each other, with almost all biases . Differences in performance emerged when the true bandwidth was high, where the analyses with tighter priors produced much less biased estimates ( except in one instance) than the analyses with flatter priors (4-137%; Table S.3). However, using an informative log-normal prior backfired even for the low-bandwidth scenario when the offspring density was also low, as for the second offspring process (20-25%).
Table S.3: Results of MCPP based analysis of simulated data, comparing different priors for the bandwidth parameters. The true values for the offspring densities were =4 and =1. The true values for the bandwidth parameters were =0.01 and =0.02 under low bandwidth and =0.1 and =0.01 under high bandwidth. The parent process is denoted . Results are presented as mean absolute percentage bias of the estimated parameter values based on posterior means of each of the simulated datasets. There were no other taxa unrelated to these multi-layered arrangements.
Log-normal
Log-normal
Half-normal
Uniform
(flat)
(tight)
0.03
0.03
0.03
0.03
Low
0.07
0.07
0.07
0.07
bandwidth
0.02
0.02
0.02
0.05
0.04
0.04
0.04
0.24
0.06
0.06
0.06
0.06
0.03
0.31
0.52
0.03
High
0.05
0.05
0.05
0.05
bandwidth
0.09
0.99
1.37
0.10
0.03
0.03
0.03
0.20
0.06
0.06
0.06
0.06
F Additional Figures and Tables for the Analysis of Human Microbiome Biofilm Image Data
Here, we present a visual representation of the four quadrants of the whole dental plaque data for subset analyses in Figure S.2 and the abundances of different taxa in the four quadrants of the subsetted data (Table S.4). The estimates for the intensity functions for the five taxa (Neisseriaceae, Capnocytophaga, Actinomyces, Fusobacterium, Leptotrichia) that have no visible spatial relationship with the parent-offspring-type configurations also varied across the quadrants (Table S.5). K-functions for the whole and subsetted analyses (Figures S.3 and S.4 through S.7) also varied noticeably by quadrant. The DIC estimates for the different models explored in Section 5.2 (Table S.6) indicate that the models with the Fusobacterium and Leptotrichia as an additional parent-offspring pair is a better fit to the data than the original model, while models fitting Streptococcus around Fusobacterium do not fit the data well.
Table S.4: The abundance (counts) of bacterial taxa of interest in the human dental plaque sample image data and its four subdivided quadrants.
Taxon
Quadrant
Total
I
II
III
IV
Actinomyces
119
280
154
223
776
Capnocytophaga
512
755
574
573
2414
Corynebacterium
58
219
186
245
708
Fusobacterium
92
250
141
173
656
Leptotrichia
191
411
234
339
1175
Neisseriaceae
339
479
402
491
1711
Pasteurellaceae
53
130
76
106
365
Porphyromonas
227
525
269
420
1441
Streptococcus
98
379
163
249
889
Table S.5: The posterior means of parameters associated with Neisseriaceae (), Capnocytophaga (), Actinomyces (), Fusobacterium () and Leptotrichia () obtained by applying the proposed MCPP method on the entire image and on each of the four quadrants of the dental plaque sample image. All results are rounded to two decimal places. The posterior standard deviations were all smaller than 0.01 and are not reported separately.
Full Image
0.04
0.06
0.02
0.02
0.03
Segment 1
0.04
0.06
0.01
0.01
0.02
Segment 2
0.05
0.07
0.03
0.02
0.04
Segment 3
0.04
0.05
0.01
0.01
0.02
Segment 4
0.05
0.06
0.02
0.02
0.03
Table S.6: The parent-offspring relationships explored in different models and their DIC. The abbreviations C, S, Po, Pa, F and L are used for Corynebacterium, Streptococcus, Porphyromonas, Pasteurellaceae, Fusobacterium and Leptotrichia. The implies parent-offspring relationship with the arrow directed from the parent to the offspring(s).
Identifier
Parent-Offspring Realtions Present
DIC
1
124985.4
2
125531.1
3
134007.1
4
124317.0
5
124303.5
6
133357.2
7
133345.2
We additionally present two images (Figures S.8 and S.9) to make the case for testing the different models as described in Sections 5.2.1 and 5.2.2 of the main manuscript.
G Model Formulation for Given Parent-Offspring Relationships
In this section, we illustrate the formulation of the MCPP model for a given parent-offspring configuration, as described in Section 3.1 of the main manuscript. Using the notation introduced in Section 3.5, we assume data containing four taxa and (). We detail the construction of the likelihood function according to equations (1)-(3) for the following four different configurations:
a)
b)
c)
d)
G.1 Modeling
The model represents a configuration involving two parent processes (taxa and , ) and two offspring processes (taxa and , ). That is, all taxa are interrelated within parent-offspring framework, with no taxa existing outside of these relationships (). Therefore, following equations (1)-(3), the corresponding model formulation can be written as follows:
It follows that
G.2 Modeling
In this model, we investigate the configuration where two offspring processes (taxa and , ) share the same parent process (taxon , ). Additionally, the model includes a separate process for taxon , which remains uninvolved in any parent-offspring relationship (). The model formulation can be written as
It follows that
G.3 Modeling
This scenario presents a more complex relationship where taxon behaves both as an offspring to taxon and as a parent to taxon . In this setup, there is only one process serving as a parent (taxon , ), two processes functioning as offspring (taxa and , ), and a separate process for taxon , which is uninvolved in any parent-offspring relationship (). It is worth noting that as the process for taxon is counted as an offspring process, while the locations of taxon will be used for modeling the process for taxon . Therefore, the MCPP under the configuration can be written as
It follows that
G.4 Modeling
In this model, we explore a more complex scenario where the two processes for taxa and share the same parent process (for taxon ). Furthermore, the process for taxon not only functions as an offspring process with respect to that of taxon but also acts as a parent process for taxon . This configuration involves one process serving exclusively as a parent (taxon , ) and three processes serving as offspring processes (taxa and , ). Consequently, no taxon remains uninvolved in parent-offspring relationships (). As in Section G.3, it is important to note that as the process for taxon is counted as an offspring process, despite its role as a parent process for taxon . The corresponding model formulation can be written as follows: