Model-Based Clustering and Variable Selection for Multivariate Count Data

Creative Commons BY License. ISSN 2824-7795.

Authors and Affiliations

Julien Jacques
Université Lumière Lyon 2, Université Claude Bernard Lyon 1, ERIC, 69007, Lyon, France

Thomas Brendan Murphy
School of Mathematics & Statistics, University College Dublin
Institut d’Études Avancées, Université de Lyon

Published

July 1, 2025

Modified

July 1, 2025

Keywords

Count data, Model-based clustering, Variable selection

Abstract

Model-based clustering provides a principled way of developing clustering methods. We develop a new model-based clustering method for count data. The method combines clustering and variable selection for improved clustering. The method is based on conditionally independent Poisson mixture models and Poisson generalized linear models. The method is demonstrated on simulated data and data from an ultra running race, where the method yields excellent clustering and variable selection performance.

1 Introduction

Multivariate count data are ubiquitous in statistical applications, such as ecology (Chiquet, Mariadassou, and Robin 2021) and genomics (Rau et al. 2015; Silva et al. 2019). These data arise when each observation consists of a vector of count values. Count data are often treated as continuous and therefore modeled by a Gaussian distribution; this assumption is particularly poor when the measured counts are low. Instead, we use the reference distribution for count data, the Poisson distribution (Agresti 2013; Inouye et al. 2017).

When a data set is heterogeneous, clustering makes it possible to extract homogeneous subsets from the whole data set. Many clustering methods, such as k-means (Hartigan and Wong 1979), are geometric in nature, whereas many modern clustering approaches are based on probabilistic models. In this work, we use model-based clustering, which has been developed for many types of data (Bouveyron et al. 2019; McLachlan and Peel 2000; Frühwirth-Schnatter, Celeux, and Robert 2018).

Modern data are often high-dimensional, that is, the number of variables is often large. Among these variables, some are useful for the task of interest, some are useless, and others are useful but redundant. There is a need to select only the relevant variables, whatever the task. Variable selection methods are widespread for supervised learning tasks, in particular to avoid overfitting. However, variable selection methods are less well developed for unsupervised learning tasks, such as clustering. Recently, several methods have been proposed for selecting the relevant variables in model-based clustering; we refer to Fop and Murphy (2018) and McParland and Murphy (2018) for recent detailed surveys.

The goal of the present work is to provide a clustering and variable selection method for multivariate count data, which, to the best of our knowledge, has not yet been studied in depth. A methodology based on a conditionally independent Poisson mixture is developed to achieve this goal. The method yields a final clustering model which is a conditionally independent Poisson mixture model for a subset of the variables.

2 Motivating Example

The International Association of Ultrarunners (IAU) 24 hour World Championships were held in Katowice, Poland from September 8th to 9th, 2012. Two hundred and sixty athletes representing twenty-four countries entered the race, which was held on a course consisting of a 1.554 km looped route. An update of the number of laps covered by each athlete was recorded approximately every hour (White and Murphy 2016). Figure 1 plots the number of loops recorded each hour for the three medalists.

Figure 1: Number of loops per hour for the three medalists.

We can see different strategies among these three runners: the second-placed runner lapped at a regular rate, the first-placed runner had a fast start but slowed later, and the third-placed runner also started fast but slowed more than the first-placed runner.

Our first goal is to analyze the whole data set to identify the different running strategies and to evaluate which strategies are the best. The second goal is to identify which variables allow the clusters to be distinguished, in order to determine which hours are essential in the management of this endurance race.

3 Independent Poisson Mixture

Let X_n=(X_{n1}, X_{n2}, \ldots, X_{nM}) be a random vector of counts for n=1, 2, \ldots, N. The goal is to cluster these N observations into G clusters. Let Z_n=(Z_{n1}, Z_{n2}, \ldots, Z_{nG}) be the latent cluster indicator vector, where Z_{ng}=1 if observation n belongs to cluster g and Z_{ng}=0 otherwise. We assume that {\mathbb P}\{Z_{ng}=1\} = \tau_g for g=1,2, \ldots, G, and let \tau=(\tau_1,\ldots,\tau_G) denote the vector of mixing proportions. The conditionally independent Poisson mixture model (Karlis 2018, sec. 9.4.2.1) assumes that the elements of X_n are independent Poisson distributed random variables, conditional on Z_n. That is, \begin{aligned} Z_n & \sim {\rm Multinomial}(1, \tau)\\ X_{nm}|(Z_{ng}=1) & \sim {\rm Poisson}(\lambda_{gm}), {\rm ~for~} m=1, 2, \ldots, M. \end{aligned} Alternative modelling frameworks exist, either to introduce some dependence between variables or to normalize the variables. We refer the interested reader to (Karlis 2018; Bouveyron et al. 2019, chap. 6) for more details.
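To make the generative model concrete, the following minimal R sketch simulates a data set from it; the parameter values are illustrative (they match the clustering variables of the simulation study in Section 5) and the code is a sketch, not taken from the authors’ package:

# Simulate from a conditionally independent Poisson mixture (illustrative sketch)
set.seed(1)
N <- 400; G <- 3; M <- 4
tau <- c(0.4, 0.3, 0.3)                                # mixing proportions
lambda <- matrix(c(1, 1, 1, 1,                         # lambda_{gm}: one row per cluster
                   2, 2, 1, 4,
                   4, 4, 4, 4), nrow = G, byrow = TRUE)
z <- sample(1:G, N, replace = TRUE, prob = tau)        # cluster labels, Z_n ~ Multinomial(1, tau)
x <- t(sapply(z, function(g) rpois(M, lambda[g, ])))   # X_nm | Z_ng = 1 ~ Poisson(lambda_gm)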

Denoting the model parameters by \theta=(\tau,\lambda), where \lambda=(\lambda_{gm})_{1\leq g \leq G, 1\leq m \leq M}, and the observations by X=(x_n)_{1\leq n \leq N}, the observed likelihood is L(\theta)=\prod_{n=1}^{N}\sum_{g=1}^{G}\tau_g\prod_{m=1}^M\phi(x_{nm},\lambda_{gm}), where \phi(x,\lambda)=\exp(-\lambda)\lambda^{x}/x! is the Poisson probability mass function.

Due to the form of the mixture distribution, there is no closed form for the maximum likelihood estimators, and an iterative EM algorithm (Dempster, Laird, and Rubin 1977) must be used to maximize the likelihood. The EM algorithm starts from an initial value \theta^{(0)} of the model parameters, and alternates between the two following steps until convergence of the likelihood.

At the qth iteration of the EM algorithm, the E-step consists of computing, for all 1\leq n\leq N and 1\leq g\leq G: t_{ng}^{(q)}=\frac{\tau_g^{(q)}\prod_{m=1}^M\phi(x_{nm},\lambda_{gm}^{(q)})}{\sum_{h=1}^{G}\tau_h^{(q)}\prod_{m=1}^M\phi(x_{nm},\lambda_{hm}^{(q)})}. In the M-step, the model parameters are updated as follows: \tau_g^{(q+1)}=\frac{\sum_{n=1}^Nt_{ng}^{(q)}}{N} \quad {\rm ~and~} \quad \lambda_{gm}^{(q+1)}=\frac{\sum_{n=1}^Nt_{ng}^{(q)}x_{nm}}{\sum_{n=1}^Nt_{ng}^{(q)}}. The two steps are iterated until convergence, which is declared when \log L(\theta^{(q+1)}) - \log L(\theta^{(q)}) < \epsilon.
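As an illustration, a minimal R implementation of these two steps could look as follows (a sketch, not the authors’ poissonmix package; responsibilities are computed on the log scale for numerical stability):

# EM algorithm for the conditionally independent Poisson mixture (sketch)
poisson_mix_em <- function(x, G, max_iter = 500, eps = 1e-8) {
  x <- as.matrix(x)
  N <- nrow(x); M <- ncol(x)
  z <- sample(G, N, replace = TRUE)   # crude random initialization; use several restarts in practice
  tau <- tabulate(z, G) / N
  lambda <- matrix(0, G, M)
  for (g in 1:G) lambda[g, ] <- colMeans(x[z == g, , drop = FALSE]) + 1e-6
  loglik_old <- -Inf
  for (q in 1:max_iter) {
    # E-step: log tau_g + sum_m log phi(x_nm, lambda_gm), for each n and g
    logdens <- sapply(1:G, function(g)
      log(tau[g]) + rowSums(dpois(x, matrix(lambda[g, ], N, M, byrow = TRUE), log = TRUE)))
    mmax <- apply(logdens, 1, max)
    loglik <- sum(mmax + log(rowSums(exp(logdens - mmax))))         # log L(theta), via log-sum-exp
    t_ng <- exp(logdens - mmax - log(rowSums(exp(logdens - mmax))))
    # M-step: closed-form updates of tau and lambda
    tau <- colMeans(t_ng)
    lambda <- (t(t_ng) %*% x) / colSums(t_ng) + 1e-10
    if (loglik - loglik_old < eps) break
    loglik_old <- loglik
  }
  list(tau = tau, lambda = lambda, loglik = loglik, t_ng = t_ng)
}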

The number of clusters G is selected using the Bayesian information criterion (BIC) (Schwarz 1978), BIC= 2 \log L(\hat\theta) - \{(G - 1) + GM\}\log(N), where \hat\theta is the maximum likelihood estimate of the model parameters; models with higher BIC are preferred to models with lower BIC.
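Continuing the sketch above, the BIC computation and the choice of G can be written as follows (the penalty counts the G-1 free mixing proportions and the GM Poisson rates; a single random start per G is used for brevity):

# BIC in the paper's convention (higher is better) for a poisson_mix_em fit
bic_mix <- function(fit, N, M, G) 2 * fit$loglik - ((G - 1) + G * M) * log(N)

# Fit G = 1, ..., Gmax on the simulated data x and keep the highest BIC
Gmax <- 10
fits <- lapply(1:Gmax, function(G) poisson_mix_em(x, G))
bics <- sapply(1:Gmax, function(G) bic_mix(fits[[G]], nrow(x), ncol(x), G))
G_hat <- which.max(bics)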

4 Variable selection

We develop a model-based clustering method with variable selection for multivariate count data. The method follows the approach of Raftery and Dean (2006) and Maugis, Celeux, and Martin-Magniette (2009) for continuous data and Dean and Raftery (2010) and Fop, Smart, and Murphy (2017) for categorical data. It consists of a stepwise model comparison approach in which variables are added to and removed from a set of clustering variables.

4.1 Model setup

The clustering and variable selection approach is based around partitioning {X}_n=(X_n^{C}, X_n^{P}, X_n^{O}) into three parts:

  • X_n^{C}: The current clustering variables,
  • X_n^{P}: The proposed variable to add to the clustering variables,
  • X_n^{O}: The other variables.

For simplicity of notation, C will be used to denote the set of indices of the current clustering variables, P the index of the proposed variable and O the indices of the other variables. Then (C,P,O) is a partition of \{1,\ldots,M\}.

The decision on whether to add the proposed variable to the clustering variables is based on comparing two models:

{\cal M}_1 (Clustering Model), which assumes that the proposed variable is useful for clustering: (X_n^C, X_n^P) \sim \sum_{g=1}^{G} \tau_g \prod_{m\in C\cup P} {\rm Poisson}(\lambda_{gm}). The {\cal M}_1 model is fitted for values of G between 1 and G_{max}, and the best clustering model is retained.

{\cal M}_2 (Non-Clustering Model), which assumes that the proposed variable is not useful for clustering, but is potentially linked to the clustering variables through a Poisson GLM, that is, \begin{aligned} X_n^C &\sim \sum_{g=1}^{G}\tau_g \prod_{m\in C} {\rm Poisson}(\lambda_{gm})\\ X_n^P| (X_n^C=x_n^C, Z_{ng}=1) & \sim {\rm PoissonGLM}(x_n^{C}), \end{aligned} where the Poisson GLM states that \log {\mathbb E}[X_n^P|X_n^C=x_n^C, Z_{ng}=1] = \alpha + \beta^\top x_n^C.

In order to avoid non-significant terms in the Poisson GLM, a standard stepwise variable selection approach (using BIC as the variable selection criterion) is considered. Thus, the proposed variable X_n^P will depend on only a subset X_n^R of the clustering variables X_n^C. We note that G is fixed in the non-clustering model, because an optimal value of G has been chosen previously. The other variables X_n^{O} are assumed to be conditionally independent of Z_n given X_n^{C} and X_n^{P}.
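A sketch of this regression step using base R only, where xP is the proposed variable and xC the current clustering variables; step() with penalty k = log(N) performs the stepwise search under BIC:

# Poisson GLM of the proposed variable on the current clustering variables,
# with stepwise variable selection using the BIC penalty k = log(N)
fit_m2_glm <- function(xP, xC) {
  dat <- data.frame(xP = xP, xC)
  full_fml <- reformulate(setdiff(names(dat), "xP"), response = "xP")
  null <- glm(xP ~ 1, data = dat, family = poisson)
  step(null, scope = list(lower = ~1, upper = full_fml),
       direction = "both", k = log(nrow(dat)), trace = 0)
}

The predictors retained by step() form the subset X_n^R of clustering variables on which X_n^P depends.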

The clustering and non-clustering models are represented as graphical models in Figure 2.


Figure 2: Graphical model representations of the clustering and non-clustering models.

Thus, there are two reasons why {\cal M}_2 can be preferred to {\cal M}_1: either X_n^P contains no information at all about the latent clustering variable (i.e., X_n^R=\emptyset), or X_n^P adds no further useful information about the clustering given the information already contained in the current clustering variables. In the first situation, we say that X_n^P is an irrelevant variable, because it contains no clustering information. In the second situation, we say that X_n^P is a redundant variable, because it contains no extra information about the clustering beyond the current clustering variables (X_n^C).

Additionally, both models assume the same form for the conditional distribution of X_n^{O}|(X_{n}^{C}, X_{n}^{P}), whose form does not need to be explicitly specified because it does not affect the model choice.

Variable P is added to C if the clustering model ({\cal M}_1) is preferred to the non-clustering model ({\cal M}_2). In order to compare {\cal M}_1 and {\cal M}_2, following Dean and Raftery (2010), we consider the Bayes factor B_{1,2}=\frac{p(X|{\cal M}_1)}{p(X|{\cal M}_2)}, which is asymptotically approximated (Fop, Smart, and Murphy 2017; Kass and Raftery 1995) using the difference of the BIC criteria of the two models: 2\log B_{1,2}\simeq BIC_{{\cal M}_1}-BIC_{{\cal M}_2}.
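In code, this comparison could be sketched as follows (fit_C, glm_fit, bic_m1 and M_C are hypothetical objects from the sketches above; note that stats::BIC() returns -2 log L + df log N, the opposite sign convention to the paper’s criterion, hence the subtraction):

# BIC of M2 = BIC of the mixture on X^C (G fixed) plus the Poisson GLM part;
# M_C is the number of current clustering variables
bic_m2 <- bic_mix(fit_C, N, M_C, G) - BIC(glm_fit)
two_log_B12 <- bic_m1 - bic_m2      # 2 log B_{1,2} ~ BIC_M1 - BIC_M2
add_P <- two_log_B12 > 0            # add X^P to the clustering set if positive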

The same modelling framework can be used for removing variables from the current set of clustering variables.

4.2 Interpretation

Comparing {\cal M}_1 and {\cal M}_2 is equivalent to comparing the following structures for X_n^P|(X_n^C=x_n^C).

The {\cal M}_1 (Clustering Model) assumes that X_n^P| (X_n^C=x_n^C) \sim \sum_{g=1}^{G} {\mathbb P}\{Z_{ng}=1|X_{n}^{C}=x_{n}^{C}\}\, {\rm Poisson}(\lambda_{gP}), where {\mathbb P}\{Z_{ng}=1|X_{n}^{C}=x_{n}^{C}\} = \frac{\tau_g \prod_{m\in C}\phi(x_{nm}, \lambda_{gm})}{\sum_{h=1}^{G}\tau_h \prod_{m\in C}\phi(x_{nm}, \lambda_{hm})}.

Whereas the {\cal M}_2 (Non-Clustering Model) assumes that X_n^P| (X_n^C=x_n^C) \sim {\rm PoissonGLM}(x_n^C).

The method contrasts which of these two conditional model structures better describes the distribution of the proposed variable X^P. The clustering model ({\cal M}_1) uses a mixture model with covariate-dependent weights for the conditional distribution, whereas the non-clustering model ({\cal M}_2) uses a Poisson generalized linear model. The model selection criterion chooses whichever model best fits this conditional distribution.

4.3 Stepwise selection algorithm

4.3.1 Screening variables: Initialization

We start with an initial choice of C by first screening each individual variable by fitting a mixture of univariate Poisson distributions (e.g., Everitt and Hand 1981, chap. 4.3), X_{nm} \sim \sum_{g=1}^{G}\tau_g {\rm Poisson}(\lambda_{gm}), {\rm ~for~} G=1,2,\ldots, G_{max}. The initial set of variables consists of those variables for which some model with G>1 is preferred to the G=1 model.
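A sketch of this screening step, reusing poisson_mix_em() and bic_mix() from the sketches of Section 3:

# Keep variable m if some univariate mixture with G > 1 beats the G = 1 model on BIC
screen_variables <- function(x, Gmax = 5) {
  x <- as.matrix(x)
  keep <- sapply(seq_len(ncol(x)), function(m) {
    xm <- x[, m, drop = FALSE]
    bics <- sapply(1:Gmax, function(G) bic_mix(poisson_mix_em(xm, G), nrow(x), 1, G))
    which.max(bics) > 1
  })
  which(keep)
}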

4.3.2 Stepwise algorithm: Updating

We consider a stepwise algorithm which alternates between adding and removing steps. In the removal step, all the variables in X^C are examined in turn to be removed from the set. In the adding step, all the variables in X^O are examined in turn to be added to the clustering set.

The algorithm also selects the number of clusters G, finding at each stage the optimal combination of clustering variables and number of clusters. The procedure stops when no change has been made to the set X^C after consecutive exclusion and inclusion steps.

With the present stepwise selection algorithm, it can happen that the process returns to a solution (a set of clustering variables) that has already been explored. Since the algorithm is not stochastic, this results in an infinite cycle. In this situation the algorithm is stopped, and the best solution according to BIC among the solutions of the cycle is kept.

The following pseudo-code summarizes our stepwise algorithm:

ALGORITHM Stepwise
BEGIN
  initialize X^C
  WHILE X^C changes:
    FOR each variable X_j not in X^C:
      estimate {\cal M}_1 on X^C \cup \{X_j\} and select the best G
      estimate {\cal M}_2 with the model for X^C (with the G selected at the
        previous step) and a Poisson regression for X_j given X^C
      add X_j to X^C if BIC_{{\cal M}_1} > BIC_{{\cal M}_2}
    FOR each variable X_j in X^C:
      estimate {\cal M}_2 on X^C \setminus \{X_j\}, select the best G, and use
        a Poisson regression for X_j given X^C \setminus \{X_j\}
      estimate {\cal M}_1 on X^C (with the G selected at the previous step)
      remove X_j from X^C if BIC_{{\cal M}_2} > BIC_{{\cal M}_1}
    test for an infinite cycle
  ENDWHILE
  return X^C and the {\cal M}_1 estimate
END
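As an illustration, an R skeleton of the adding sweep, reusing the sketch functions defined above (a hedged sketch, not the poissonmix implementation; the removal sweep and the cycle test are analogous):

# One "add" sweep over the variables currently outside X^C (indices in C)
add_sweep <- function(x, C, Gmax = 5) {
  x <- as.matrix(x)
  for (j in setdiff(seq_len(ncol(x)), C)) {
    # M1: mixture on X^C plus X_j, best G chosen by BIC
    fits1 <- lapply(1:Gmax, function(G) poisson_mix_em(x[, c(C, j)], G))
    bics1 <- sapply(1:Gmax, function(G) bic_mix(fits1[[G]], nrow(x), length(C) + 1, G))
    G1 <- which.max(bics1)
    # M2: mixture on X^C with G1 clusters, plus Poisson GLM of X_j on X^C
    fit2 <- poisson_mix_em(x[, C, drop = FALSE], G1)
    glm2 <- fit_m2_glm(x[, j], x[, C, drop = FALSE])
    bic2 <- bic_mix(fit2, nrow(x), length(C), G1) - BIC(glm2)
    if (max(bics1) > bic2) C <- c(C, j)   # add X_j if BIC_M1 > BIC_M2
  }
  C
}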

5 Simulation study

In this section, we evaluate the proposed variable selection method through three different simulation scenarios. We start with an illustrative example in which, using a data set simulated according to the proposed model, we show how to perform the variable selection.

Then, simulation studies are performed to evaluate the behavior of the proposed selection method when the data are simulated according to the proposed model (Scenario 1) and when the model assumptions are violated. In Scenario 2, the link between X^R and X^C is no longer a Poisson GLM but a more complex model. In Scenario 3, the clustering variables are no longer conditionally independent.

5.1 Illustrative example

In the first simulation setting we consider 10 Poisson random variables. Variables X_1, X_2, X_3 and X_4 are the clustering variables, distributed according to a conditionally independent Poisson mixture with G = 3 components and mixing proportions 0.4, 0.3, 0.3. Variables X_5, X_6 and X_7 are redundant variables, each generated dependent on the clustering variables; these three variables are linked to the first four through a Poisson GLM. The last three variables, X_8, X_9 and X_{10}, are irrelevant variables, unrelated to the previous ones. Table 1 shows the parameters of the Poisson distribution for each variable and each cluster.

Table 1: True values of component parameters (Scenario 1)
\lambda_{g1} \lambda_{g2} \lambda_{g3} \lambda_{g4} \lambda_{g5} \lambda_{g6} \lambda_{g7} \lambda_{g8} \lambda_{g9} \lambda_{g10}
g=1 1 1 1 1 \lambda_{g5} \lambda_{g6} \lambda_{g7} 4 2 1
g=2 2 2 1 4 \lambda_{g5} \lambda_{g6} \lambda_{g7} 4 2 1
g=3 4 4 4 4 \lambda_{g5} \lambda_{g6} \lambda_{g7} 4 2 1

with \lambda_{g5}=\exp(0.2X_2), \lambda_{g6}=\exp(0.2X_1-0.1X_2) and \lambda_{g7}=\exp(0.1 (X_1+X_3+X_4)).
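Continuing the simulation sketch of Section 3 (which generated the clustering variables in x), the redundant and irrelevant variables could be generated as follows:

# Redundant variables X5-X7: Poisson draws with log-linear links to X1-X4;
# irrelevant variables X8-X10: Poisson(4), Poisson(2) and Poisson(1)
x5 <- rpois(N, exp(0.2 * x[, 2]))
x6 <- rpois(N, exp(0.2 * x[, 1] - 0.1 * x[, 2]))
x7 <- rpois(N, exp(0.1 * (x[, 1] + x[, 3] + x[, 4])))
x8 <- rpois(N, 4); x9 <- rpois(N, 2); x10 <- rpois(N, 1)
x_full <- cbind(x, x5, x6, x7, x8, x9, x10)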

Below are the results obtained for one data set of size N = 400. The evaluation criteria are the selected variables (the true ones being X_1 to X_4) and the Adjusted Rand Index (Rand 1971; Hubert and Arabie 1985) obtained with the selected variables, compared with the values obtained with the full set of variables and with the true clustering variables.

The independent Poisson mixture model was fitted to the simulated data with N = 400 rows and M = 10 columns. Models with G = 1 to G = 10 were fitted using the EM algorithm.

The values of BIC for the independent Poisson mixture model are plotted in Figure 3.

Figure 3: Bayesian Information Criterion (BIC) for the independent Poisson mixture model.

The model with the highest BIC has G = 3 components and the resulting estimates of \tau and \lambda are given as:

Table 2: Estimates of the mixing proportions and component parameters.
\tau_g \lambda_{g1} \lambda_{g2} \lambda_{g3} \lambda_{g4} \lambda_{g5} \lambda_{g6} \lambda_{g7} \lambda_{g8} \lambda_{g9} \lambda_{g10}
g=1 0.29 4.09 4.00 4.15 4.34 2.51 1.87 3.95 4.04 1.85 1.12
g=2 0.42 2.04 2.11 1.34 3.74 1.64 1.27 2.00 3.91 2.06 0.96
g=3 0.29 0.93 0.88 1.08 0.96 1.13 1.01 1.16 3.82 2.02 1.00

Comparing these estimates with the true values in Table 1 shows that they are accurate (up to label switching).

Let us start by initializing the stepwise algorithm.

fit_screen <- poissonmix_screen(x, G = 1:Gmax)
jchosen <- fit_screen$jchosen

The variables selected by the screening procedure are {1, 2, 3, 4, 6, 7}.

Now, we execute the stepwise selection algorithm:

fit <- poissonmix_varsel(x, jchosen=jchosen, G = 1:Gmax)
[1] "Initial Selected Variables: 1,2,3,4,6,7"
[1] "Iteration: 1"
[1] "Add Variable: NONE 10 BIC Difference: -13.2"
[1] "Remove Variable: 6  BIC Difference: 83.7"
[1] "Current Selected Variables: 1,2,3,4,7"
[1] "Iteration: 2"
[1] "Add Variable: NONE 9 BIC Difference: -10.6"
[1] "Remove Variable: 7  BIC Difference: 50.1"
[1] "Current Selected Variables: 1,2,3,4"
[1] "Iteration: 3"
[1] "Add Variable: NONE 10 BIC Difference: -10.5"
[1] "Remove Variable: NONE 3 BIC Difference: -26.8"
[1] "Current Selected Variables: 1,2,3,4"

Note that the computing time is about 5 minutes on a laptop with a 2.3 GHz Intel Core i7 processor and 32 GB of RAM.

The final chosen variables are {1, 2, 3, 4}.

Finally, the ARI obtained with the selected variables, which turn out to be the true clustering variables, is 0.594, whereas it is 0.432 with all the variables.
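These ARI values can be computed, for instance, with the adjustedRandIndex() function of the mclust package (a sketch; z_true denotes the simulated labels, and fit_sel and fit_all are hypothetical mixture fits on the selected and full variable sets):

library(mclust)                              # provides adjustedRandIndex()
z_sel <- apply(fit_sel$t_ng, 1, which.max)   # MAP partition, selected variables
z_all <- apply(fit_all$t_ng, 1, which.max)   # MAP partition, all variables
adjustedRandIndex(z_sel, z_true)             # 0.594 in the example above
adjustedRandIndex(z_all, z_true)             # 0.432 in the example above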

5.2 Scenarios of simulation

In this section the three simulation scenarios are described. The first scenario is similar to the previous illustrative example.

The second scenario is similar to the first one, except that variables X_5, X_6 and X_7, while still redundant, are linked to the true clustering variables through linear, quadratic and exponential terms with an identity link function, respectively, rather than through a Poisson GLM with logarithmic link function. More precisely, X_5, X_6 and X_7 have Poisson distributions with respective parameters \lambda_{g5}=\exp(2X_2), \lambda_{g6}=\exp(X_1^2+X_3) and \lambda_{g7}=\exp(\exp(0.1 (X_1 + X_3+ X_4))). Thus, the data are simulated from a model which does not satisfy the assumptions of model {\cal M}_2.

The third scenario is similar to the second one, but some dependence between the clustering variables X_1 and X_2 is introduced, in order to create some redundancy among the true clustering variables. For this, X_1 and X_2 are simulated as in the previous setting, and a common term, simulated according to a Poisson distribution with parameter 2, is added to both of these variables.

5.3 Results

Table 3 shows the number of times, among the 100 simulated data sets, that each variable is selected. For Scenario 1, the model selection procedure performs perfectly, each time selecting only the true clustering variables. For Scenario 2, because the link between the redundant and the true clustering variables is not a standard Poisson GLM, the variable selection is perturbed and variable X_5 is sometimes selected. For Scenario 3, the dependence between X_1 and X_2 perturbs the variable selection, and only one of the two is selected (and sometimes even neither of them). The redundant variables X_5 and X_6, which are linked to the clustering variables but with a linear link, are also sometimes selected.

Table 3: Number of selections of each variable, for the three simulation scenarios.
X_{1} X_{2} X_{3} X_{4} X_{5} X_{6} X_{7} X_{8} X_{9} X_{10}
Scenario 1 100 100 100 100 0 0 0 0 0 0
Scenario 2 97 100 90 98 44 0 0 0 0 0
Scenario 3 48 35 89 88 65 34 3 0 0 0

Figure 4 plots the distribution of the ARI differences between the model with either the selected variables or all the variables, and the model with the true clustering variables. These plots show that, for all scenarios, the ARI of the model with the selected variables (left boxplot of each plot) is always closer to the optimal ARI (obtained with the true clustering variables).

Figure 4: Distribution of the ARI differences with the model with the true clustering variables, for the model with the selected variables and the model with all variables.

Finally, Figure 5 plots the histogram of the differences in ARI between the model with the selected variables and the model with all the variables. This plot illustrates the benefit of variable selection for the clustering results: for all scenarios, the ARI is better with the selected variables than with all the variables.

Figure 5: Distribution of the ARI differences for the model with the selected variables and the model with all variables.

6 International Ultrarunning Association Data

We apply the proposed procedure to the data from the 2012 International Ultrarunning Association World 24H Championships.

We start by initializing the stepwise algorithm, and find the variables selected by the screening procedure:

fit_screen <- poissonmix_screen(x, G = 1:Gmax)
jchosen <- fit_screen$jchosen
jchosen
 [1]  3  5  6  7  8  9 11 12 13 14 15 16 17 18 19 20 21 22 23 24

We then execute the proposed stepwise selection algorithm (the computing time is about 26 minutes on a laptop with a 2.3 GHz Intel Core i7 processor and 32 GB of RAM):

fit <- poissonmix_varsel(x, jchosen = jchosen, G = 1:Gmax)

The final chosen variables found by the algorithm are:

 [1]  9 10 11 12 14 15 16 17 18 19 20 21 22 24

The optimal number of clusters, G = 6, was chosen inside the stepwise selection algorithm. The same choice is obtained when looking for the best G with the conditionally independent Poisson mixture on the selected variables (Figure 6).

Figure 6: Bayesian Information Criterion (BIC) for the independent Poisson mixture model with the selected variables.

In order to illustrate the results, we plot the cluster means of the 24 variables for each cluster. For each variable not in the chosen variable set, a Poisson regression model is fitted with the chosen variables as predictors, and forward and backward variable selection is conducted on this regression; if the resulting regression model has any predictor variables, the variable is called “redundant”, and if it has no predictor variables, the variable is called “irrelevant”. Figure 7 shows the cluster mean for each variable, where the label indicates whether the variable is irrelevant for clustering (“I”), redundant (“R”) or useful (unlabelled point).
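This labeling step can be sketched with the fit_m2_glm() helper from Section 4 (a hedged illustration, not the authors’ code):

# A non-chosen variable is "redundant" if the stepwise Poisson GLM retains at
# least one chosen variable as predictor, and "irrelevant" otherwise
label_variable <- function(xj, xC) {
  fit <- fit_m2_glm(xj, xC)
  if (length(attr(terms(fit), "term.labels")) > 0) "redundant" else "irrelevant"
}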

Figure 7: Cluster means and usefulness of the variables.

The variables that discriminate the pacing strategies of the clusters are the numbers of laps covered during the last two thirds of the race (except during the 13th and 23rd hours). The number of laps covered during the first eight hours does not provide any additional clustering information, and the number of laps covered during the first hour provides no clustering information at all.

Figure 8 plots the density map for each cluster. Areas of high density (red) indicate the hours and the corresponding average numbers of laps specific to each cluster.

Figure 8: Density maps per cluster

Cluster 5 clearly contains the most efficient runners. Looking at the running strategies in Figure 7 and Figure 8, we can see that they start like the runners of Cluster 1 and Cluster 2, but manage to keep a constant pace during the second part of the race, unlike those of the other two clusters, who falter. The runners of Cluster 3 cover the fewest laps; indeed, looking at their running strategy, we can see that most of them stop after the first third of the race. Cluster 6 is relatively similar to Cluster 3, but its runners manage to continue running until half of the race is completed. Finally, Cluster 4 obtains slightly better results than Cluster 6, starting more carefully and managing to run until the end of the race, even if the pace over the last hours is not very constant.

Finally, Figure 9 shows boxplots of the total number of loops covered by the runners of each of the clusters.

Figure 9: Number of loops covered by the runners of each of the clusters.

7 Discussion

A method for clustering and variable selection for multivariate count data has been proposed. The method is shown to give excellent performance on both simulated and real data examples. The method selects a set of relevant variables for clustering; other variables are not selected if they are irrelevant or redundant for clustering purposes.

The proposed method is shown to give interesting insights in the application domain, where the members of some clusters are shown to perform better overall than others, and the benefits of constant (or near-constant) pacing are demonstrated.

The level of variable selection is determined by comparing the relative performance of the two models (as shown in Section 4.2). Alternative models to the Poisson GLM with greater flexibility could lead to a smaller set of selected variables than the proposed method achieves. This is a topic for future research.

The proposed method is based on a conditionally independent Poisson mixture model for the selected variables. It could be argued that the conditional independence assumption is unrealistic in the application. Hand and Yu (2001) consider the implications of incorrectly assuming conditional independence in a classification setting and show that it can make the group membership probabilities overconfident. Furthermore, in the conditionally independent Poisson mixture model, the number of clusters can be upwardly biased, where extra clusters are included to model dependence in the data. The approach taken in this paper could be extended to use other multivariate count distributions, including multivariate distributions without the conditional independence assumption (e.g., Karlis 2018; Karlis and Meligkotsidou 2007; Inouye et al. 2017).

The code for the proposed approach is available as an R package poissonmix_0.1.tar.gz.

8 Acknowledgements

This work was supported by the Science Foundation Ireland Insight Research Centre (SFI/12/RC/2289_P2) and a visit to the Collegium – Institut d’Études Avancées de Lyon.

Bibliography

Agresti, Alan. 2013. Categorical Data Analysis. Third. Wiley Series in Probability and Statistics. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ.
Bouveyron, Charles, Gilles Celeux, T. Brendan Murphy, and Adrian E. Raftery. 2019. Model-Based Clustering and Classification for Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge. https://doi.org/10.1017/9781108644181.
Chiquet, Julien, Mahendra Mariadassou, and Stéphane Robin. 2021. “The Poisson-Lognormal Model as a Versatile Framework for the Joint Analysis of Species Abundances.” Frontiers in Ecology and Evolution 9: 188. https://doi.org/10.3389/fevo.2021.588292.
Dean, Nema, and Adrian E. Raftery. 2010. “Latent Class Analysis Variable Selection.” Annals of the Institute of Statistical Mathematics 62 (1): 11–35. https://doi.org/10.1007/s10463-009-0258-9.
Dempster, Arthur P., Nan M. Laird, and Donald B. Rubin. 1977. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society: Series B 39: 1–38. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x.
Everitt, Brian S., and David J. Hand. 1981. Finite Mixture Distributions. Chapman & Hall.
Fop, Michael, and Thomas Brendan Murphy. 2018. “Variable Selection Methods for Model-Based Clustering.” Statistics Surveys 12: 18–65. https://doi.org/10.1214/18-SS119.
Fop, Michael, Keith Smart, and Thomas Brendan Murphy. 2017. “Variable Selection for Latent Class Analysis with Application to Low Back Pain Diagnosis.” Annals of Applied Statistics. 11: 2085–115.
Frühwirth-Schnatter, Sylvia, Gilles Celeux, and Christian P. Robert. 2018. Handbook of Mixture Analysis. Chapman; Hall/CRC. https://doi.org/10.1201/9780429055911.
Hand, David J., and Keming Yu. 2001. “Idiot’s Bayes—Not so Stupid After All?” International Statistical Review 69 (3): 385–98. https://doi.org/10.1111/j.1751-5823.2001.tb00465.x.
Hartigan, John A., and M. Anthony Wong. 1979. “A k-Means Clustering Algorithm.” Applied Statistics 28 (1): 100–108.
Hubert, Lawrence, and Phipps Arabie. 1985. “Comparing Partitions.” Journal of Classification 2 (1): 193–218. https://doi.org/10.1007/BF01908075.
Inouye, David I., Eunho Yang, Genevera I. Allen, and Pradeep Ravikumar. 2017. “A Review of Multivariate Distributions for Count Data Derived from the Poisson Distribution.” WIREs Computational Statistics 9 (3): e1398. https://doi.org/10.1002/wics.1398.
Karlis, Dimitris. 2018. “Mixture Modelling of Discrete Data.” In Handbook of Mixture Analysis, edited by Sylvia Frühwirth-Schnatter, Gilles Celeux, and Christian P. Robert, 193–218. CRC Press.
Karlis, Dimitris, and Loukia Meligkotsidou. 2007. “Finite Mixtures of Multivariate Poisson Distributions with Application.” Journal of Statistical Planning and Inference 137 (6): 1942–60. https://doi.org/10.1016/j.jspi.2006.07.001.
Kass, Robert E., and Adrian E. Raftery. 1995. “Bayes Factors.” Journal of the American Statistical Association 90 (430): 773–95. https://doi.org/10.1080/01621459.1995.10476572.
Maugis, Cathy, Gilles Celeux, and Marie-Laure Martin-Magniette. 2009. “Variable Selection in Model-Based Clustering: A General Variable Role Modeling.” Computational Statistics & Data Analysis 53 (11): 3872–82. https://doi.org/10.1016/j.csda.2009.04.013.
McLachlan, Geoffrey, and David Peel. 2000. Finite Mixture Models. New York: Wiley. https://doi.org/10.1002/0471721182.
McParland, Damien, and Thomas Brendan Murphy. 2018. “Mixture Modelling of High-Dimensional Data.” In Handbook of Mixture Analysis, edited by Sylvia Frühwirth-Schnatter, Gilles Celeux, and Christian P. Robert, 247–80. CRC Press.
Raftery, Adrian E., and Nema Dean. 2006. “Variable Selection for Model-Based Clustering.” Journal of the American Statistical Association 101 (473): 168–78. https://doi.org/10.1198/016214506000000113.
Rand, William M. 1971. “Objective Criteria for the Evaluation of Clustering Methods.” Journal of the American Statistical Association 66 (336): 846–50.
Rau, Andrea, Cathy Maugis-Rabusseau, Marie-Laure Martin-Magniette, and Gilles Celeux. 2015. “Co-expression Analysis of High-Throughput Transcriptome Sequencing Data with Poisson Mixture Models.” Bioinformatics 31 (9): 1420–27. https://doi.org/10.1093/bioinformatics/btu845.
Schwarz, Gideon. 1978. “Estimating the Dimension of a Model.” The Annals of Statistics 6 (2): 461–64. https://doi.org/10.1214/aos/1176344136.
Silva, Anjali, Steven J. Rothstein, Paul D. McNicholas, and Sanjeena Subedi. 2019. “A Multivariate Poisson-Log Normal Mixture Model for Clustering Transcriptome Sequencing Data.” BMC Bioinformatics 20 (1): 394. https://doi.org/10.1186/s12859-019-2916-0.
White, Arthur, and Thomas Brendan Murphy. 2016. “Exponential Family Mixed Membership Models for Soft Clustering of Multivariate Data.” Advances in Data Analysis and Classification 10: 521–40.

Reuse

Citation

BibTeX citation:
@article{jacques2025,
  author = {Jacques, Julien and Brendan Murphy, Thomas},
  publisher = {French Statistical Society},
  title = {Model-Based {Clustering} and {Variable} {Selection} for
    {Multivariate} {Count} {Data}},
  journal = {Computo},
  date = {2025-07-01},
  doi = {10.57750/6v7b-8483},
  issn = {2824-7795},
  langid = {en},
  abstract = {Model-based clustering provides a principled way of
    developing clustering methods. We develop a new model-based
clustering method for count data. The method combines clustering
    and variable selection for improved clustering. The method is based
    on conditionally independent Poisson mixture models and Poisson
    generalized linear models. The method is demonstrated on simulated
    data and data from an ultra running race, where the method yields
    excellent clustering and variable selection performance.}
}
For attribution, please cite this work as:
Jacques, Julien, and Thomas Brendan Murphy. 2025. “Model-Based Clustering and Variable Selection for Multivariate Count Data.” Computo, July. https://doi.org/10.57750/6v7b-8483.