class: center, middle, inverse, title-slide

# missSBM
## Inference in Stochastic Block Models from Missing Data
### P. Barbillon, J. Chiquet, T. Tabouy
Paris-Saclay, AgroParisTech, INRAE
Last update 09 December, 2021
### https://github.com/GrossSBM/missSBM
---

# Resources

### R/C++ package

Last stable release on CRAN, development version available on GitHub.

```r
install.packages("missSBM")
remotes::install_github("GrossSBM/missSBM@development")
```

```r
library(missSBM)
packageVersion("missSBM")
```

```
## [1] '1.0.1'
```

### Publications

<small>
The [missSBM website](https://grosssbm.github.io/missSBM/) contains the standard package documentation and a couple of vignettes for the top-level functions.

- Tabouy, T., P. Barbillon, and J. Chiquet (2019). "Variational Inference for Stochastic Block Models from Sampled Data". In: _Journal of the American Statistical Association_ 0.ja, pp. 1-20. DOI: [10.1080/01621459.2018.1562934](https://doi.org/10.1080%2F01621459.2018.1562934).
- Barbillon, P., J. Chiquet, and T. Tabouy (2022). "missSBM: An R Package for Handling Missing Values in the Stochastic Block Model". In: _Journal of Statistical Software_.
</small>

---
class: inverse, middle

# Outline

1. .large[Motivations]
2. .large[Binary SBM and variational inference]
3. .large[SBM inference from observed data]
4. .large[Illustration]

---

# Network data with missing entries

### Recommendation system: Epinions

Who-trusts-whom online social network of Epinions.com, a general consumer review site. Members of the site can decide whether to "trust" each other.

**Available** at http://www.trustlet.org/datasets/extended_epinions/user_rating.txt.gz

### Social networks in ethnobiology

A seed exchange network in Kenya is collected over a limited geographical area, where all 155 farmers are interviewed. Farmers only provide information about other farmers with whom they have interacted.

### Ecological networks: plant-pollinator network

Interaction network between predefined sets of plants and pollinators, built by direct observation.

How can we trust that a "0" in collected network data is not a missing entry?

---

# .small[Companion data set: French political blogosphere]

Single-day snapshot of almost 200 political blogs, automatically extracted on 14 October 2006 and manually classified by the "Observatoire Présidentielle" project.
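The snippets on the remaining slides assume a handful of companion packages are attached; a minimal setup sketch (the package list is inferred from the code shown later, only `missSBM` is strictly required):

```r
library(missSBM)   # inference of SBM from partially observed networks
library(sbm)       # simulation of simple SBMs (sampleSimpleSBM)
library(igraph)    # graph handling: frenchblog2007 is an igraph object
library(aricode)   # clustering comparison (ARI, NID)
library(magrittr)  # pipe operator (%>%) used with kableExtra below
```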
```r
data("frenchblog2007")
party <- vertex.attributes(frenchblog2007)$party
table(party) %>% kableExtra::kbl() %>% kableExtra::kable_classic()
```

<table class=" lightable-classic" style='font-family: "Arial Narrow", "Source Sans Pro", sans-serif; margin-left: auto; margin-right: auto;'>
<thead>
<tr> <th style="text-align:left;"> party </th> <th style="text-align:right;"> Freq </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> analyst </td> <td style="text-align:right;"> 11 </td> </tr>
<tr> <td style="text-align:left;"> center-left </td> <td style="text-align:right;"> 11 </td> </tr>
<tr> <td style="text-align:left;"> center-rigth </td> <td style="text-align:right;"> 32 </td> </tr>
<tr> <td style="text-align:left;"> far-left </td> <td style="text-align:right;"> 7 </td> </tr>
<tr> <td style="text-align:left;"> far-right </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> green </td> <td style="text-align:right;"> 9 </td> </tr>
<tr> <td style="text-align:left;"> left </td> <td style="text-align:right;"> 57 </td> </tr>
<tr> <td style="text-align:left;"> liberal </td> <td style="text-align:right;"> 25 </td> </tr>
<tr> <td style="text-align:left;"> right </td> <td style="text-align:right;"> 40 </td> </tr>
</tbody>
</table>

---

# French blog: graph view

<img src="slides_files/figure-html/blog-graph-1.png" width="95%" style="display: block; margin: auto;" />

---

# French blog: matrix view

<img src="slides_files/figure-html/blog-matrix-1.png" width="70%" style="display: block; margin: auto;" />

---
class: inverse, middle

# SBM: background

- Probabilistic model for random graphs
- Latent variable model
- Variational inference

<!-- SBM: background -->

---

# Stochastic Block Model

.large[A popular probabilistic model for network data]

.pull-left[
<div class="figure" style="text-align: center">
<img src="slides_files/figure-html/sbm-model-1.png" alt="The binary SBM model" />
<p class="caption">The binary SBM model</p>
</div>
]

.pull-right[
.content-box-purple[
Let

- *fixed* nodes `\(\{1, \dots, n \}\)`
- unknown colors in `\(\mathcal{C}=\{\color{#fab20a}{\bullet},\color{#0000ff}{\bullet},\color{#008000}{\bullet}\}\)`
- `\(\alpha_\bullet = \mathbb{P}(i \in \bullet)\)`, `\(\bullet\in\mathcal{C}\)`
- `\(\pi_{\color{#fab20a}{\bullet}\color{#0000ff}{\bullet}} = \mathbb{P}(i \leftrightarrow j | i\in\color{#fab20a}{\bullet},j\in\color{#0000ff}{\bullet})\)`

In other words,
`\begin{align*}
Z_i = \mathbf{1}_{\{i \in \bullet\}} \ & \sim^{\text{iid}} \mathcal{M}(1,\alpha), \\
Y_{ij} \ | \ \{i\in\color{#fab20a}{\bullet},j\in\color{#0000ff}{\bullet}\} & \sim^{\text{ind}} \mathcal{B}(\pi_{\color{#fab20a}{\bullet}\color{#0000ff}{\bullet}})\\
\end{align*}`
]
]

<small>
- Frank, O. and F. Harary (1982). "Cluster inference by using transitivity indices in empirical graphs". In: _J. Am. Stat. Soc._ 77.380, pp. 835-840.
- Holland, P. W., K. B. Laskey, and S. Leinhardt (1983). "Stochastic blockmodels: First steps". In: _Social networks_ 5.2, pp. 109-137.
</small>
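---

# Stochastic Block Model: .small[simulation sketch]

To make the two-level model above concrete, a minimal hand-rolled simulation with hypothetical parameter values (the next slides use `igraph::sample_sbm` for the same purpose):

```r
n     <- 50
alpha <- c(.4, .3, .3)                      # block proportions (hypothetical)
pi    <- matrix(.05, 3, 3); diag(pi) <- .3  # connectivity (hypothetical)
Z  <- sample(1:3, n, replace = TRUE, prob = alpha)  # latent colors Z_i
Y  <- matrix(0, n, n)
ij <- which(upper.tri(Y), arr.ind = TRUE)   # dyads (i < j)
Y[upper.tri(Y)] <- rbinom(nrow(ij), 1, pi[cbind(Z[ij[, 1]], Z[ij[, 2]])])
Y  <- Y + t(Y)                              # undirected graph: symmetrize
```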
---

# Examples of topology: .small[Community network]

```r
pi <- matrix(c(0.3,0.02,0.02,0.02,0.3,0.02,0.02,0.02,0.3),3,3)
communities <- igraph::sample_sbm(100, pi, c(25, 50, 25))
plot(communities, vertex.label=NA, vertex.color = rep(1:3,c(25, 50, 25)))
```

<img src="slides_files/figure-html/sample-sbm-community-1.png" style="display: block; margin: auto;" />

---

# Examples of topology: .small[Star network]

```r
pi <- matrix(c(0.05,0.3,0.3,0),2,2)
star <- igraph::sample_sbm(100, pi, c(4, 96))
plot(star, vertex.label=NA, vertex.color = rep(1:2,c(4,96)))
```

<img src="slides_files/figure-html/sample-sbm-star-1.png" style="display: block; margin: auto;" />

---

# Estimation in the SBM: .small[latent variable model]

.pull-left[
.pull-left[
<img src="slides_files/figure-html/sbm-inference-1.png" width="80%" style="display: block; margin: auto;" />
]
.pull-right[
<small>
.content-box-purple[
- *Fixed* nodes `\(\{1, \dots, n \}\)`
- latent colors `\(\mathcal{C}=\{\color{#fab20a}{\bullet},\color{#0000ff}{\bullet},\color{#008000}{\bullet}\}\)`
]
</small>
]
]

.pull-right[
.content-box-red[
Estimate the model parameters and the clustering:

- `\(\theta = \{\boldsymbol\alpha = (\alpha_\bullet), \boldsymbol\Pi = (\pi_{\color{#fab20a}{\bullet}\color{#0000ff}{\bullet}})\}\)`
- the colors of `\(i\)`, i.e. the `\(\mathbf{Z}_i\)`
]
]

.large[Marginal likelihood]

Integration over `\(\mathcal{Z}\)` is intractable: `\(\mathrm{card}(\mathcal{C})^n\)` terms!

`$$p_\theta(\mathbf{Y}) = \int_{\mathcal{Z}} \prod_{(i,j)} p_\theta(Y_{ij} | Z_i, Z_j ) \, p_\theta(\mathbf{Z}) \mathrm{d}\mathbf{Z}$$`

.large[Maximum likelihood for an incomplete data model: EM]

`$$\log p_\theta(\mathbf{Y}) = \mathbb{E}_{p_\theta(\mathbf{Z}\,|\,\mathbf{Y})} [\log p_\theta(\mathbf{Y}, \mathbf{Z})] + \mathcal{H}[p_\theta(\mathbf{Z}\,|\,\mathbf{Y})], \quad \text{ with } \mathcal{H}(p) = -\mathbb{E}_p(\log(p))$$`

.important[EM requires evaluating (some moments of)] `\(p_\theta(\mathbf{Z}\,|\,\mathbf{Y})\)`

---

# Intractable EM: solution(s)

.large[Variants of EM, MCMC/Bayesian approaches]

<small>
- Nowicki, K. and T. A. B. Snijders (2001). "Estimation and Prediction for Stochastic Blockstructures". In: _J. Am. Stat. Soc._ 96.455, pp. 1077-1087.
- Daudin, J., F. Picard, and S. Robin (2008). "A mixture model for random graphs". In: _Stat. comp._ 18.2, pp. 173-183.
- Latouche, P., É. Birmelé, and C. Ambroise (2012). "Variational Bayesian inference and complexity control for stochastic block models". In: _Stat. Modelling_ 12.1, pp. 93-115.
- Peixoto, T. P. (2014). "Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models". In: _Physical Review E_ 89.1, p. 012804.
</small>

.large[Variational approach]

Find a proxy `\(q_\psi(\mathbf{Z}) \approx p_\theta(\mathbf{Z} | \mathbf{Y})\)` picked in a convenient class of distributions `\(\mathcal{Q}\)`:

`$$q^\star(\mathbf{Z}) = \arg\min_{q\in\mathcal{Q}} D\left(q(\mathbf{Z}), p(\mathbf{Z} | \mathbf{Y})\right).$$`

The Kullback-Leibler divergence is a popular choice .small[(error averaged wrt the approximating distribution)]:

`$$KL\left(q(\mathbf{Z}), p(\mathbf{Z} | \mathbf{Y})\right) = \mathbb{E}_q\left[\log \frac{q(\mathbf{Z})}{p(\mathbf{Z} | \mathbf{Y})}\right] = \int_{\mathcal{Z}} q(z) \log \frac{q(z)}{p(z | \mathbf{Y})} \mathrm{d}z.$$`

---

# Variational EM for SBM

## Class of distributions: multinomial

`$$\mathcal{Q} = \Big\{q_\psi: \, q_\psi(\mathbf{Z}) = \prod_i q_{\psi_i}(\mathbf{Z}_i), \, q_{\psi_i}(\mathbf{Z}_i) = \mathcal{M}\left(\mathbf{Z}_i; \boldsymbol\tau_i\right), \, \psi_i = \{\boldsymbol{\tau}_i\}, \boldsymbol{\tau}_i \in \mathbb{R}^{K} \Big\}$$`

Maximize the ELBO (Evidence Lower BOund):

`$$J(\theta, \psi) = \log p_\theta(\mathbf{Y}) - KL[q_\psi (\mathbf{Z}) || p_\theta(\mathbf{Z} | \mathbf{Y})] = \mathbb{E}_{q} [\log p_\theta(\mathbf{Y}, \mathbf{Z})] + \mathcal{H}[q_\psi(\mathbf{Z})]$$`

## Variational EM

- Initialization: get `\(\mathbf{T}^0 = \{\tau_{ik}^0\}\)` with absolute spectral clustering
- M step: update `\(\theta^h = \{ \boldsymbol\alpha^h, \boldsymbol\Pi^h\}\)`
`$$\theta^h = \arg\max J(\theta, \psi^h) = \arg\max_{\theta} \mathbb{E}_{q_{\psi^h}} [\log p_{\theta}(\mathbf{Y}, \mathbf{Z})]$$`
- VE step: find the optimal `\(q_\psi\)` by updating `\(\psi^h= (\psi^h_{i})_i = \mathbf{T}^{h} = \mathbb{E}_{q^{h}} (\mathbf{Z})\)`:
`$$\psi^h = \arg \max J(\theta^h, \psi) = \arg\min_{\psi} KL[q_\psi(\mathbf{Z}) \,||\, p_{\theta^h}(\mathbf{Z}\,|\,\mathbf{Y})]$$`

---

# Variational EM for SBM: ingredients

### Variational bound

`$$J(\theta, \tau ; \mathbf{Y}) = \sum_{(i,j)} \sum_{(k,\ell)} \tau_{ik} \tau_{j\ell} \log b(Y_{ij},\pi_{k\ell }) + \sum_{i} \sum_{k} \tau _{ik} \log (\alpha_k/\tau_{ik})$$`

### M-step (analytical)

`$$\alpha_k = \frac{1}{n} \sum_{i} \tau_{i k} , \quad \pi_{k\ell } = \frac{\sum_{(i,j)} \tau_{ik}\tau_{j\ell} Y_{ij}}{\sum_{(i,j)} \tau_{ik}\tau_{j\ell}} \qquad \left({\boldsymbol\alpha} = \frac{1}{n}\mathbf{1}_n^\top\mathbf{T}, \quad {\boldsymbol\Pi} = \frac{\mathbf{T}^\top \mathbf{Y} \mathbf{T}}{\mathbf{T}^\top \mathbf{T}} \right)$$`

### Variational E-step (fixed point)

`$$\tau_{ik} \varpropto \alpha_k \prod_{j \neq i} \prod_{\ell} b(Y_{ij} ; \pi_{k\ell})^{\tau_{j\ell}}$$`

### Model selection

`$$\mathrm{vICL}(K) = \mathbb{E}_{q} [\log L(\hat{\theta}; \mathbf{Y}, \mathbf{Z})] - \frac{1}{2} \left(\frac{K(K+1)}{2} \log \frac{n(n-1)}{2} + (K-1) \log (n) \right)$$`
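---

# Variational EM for SBM: .small[a minimal R sketch]

The updates above fit in a dozen lines of R. A didactic sketch of one VEM iteration (dense binary `Y` without missing entries; `vem_step` and its inputs are illustrative names, not missSBM's optimized C++ implementation, which follows the same scheme):

```r
## One VEM iteration for a binary SBM -- a didactic sketch.
## Y: n x n adjacency (0/1, zero diagonal); tau: n x K current memberships.
vem_step <- function(Y, tau) {
  n <- nrow(Y); I <- diag(n)
  ## M-step: analytical updates of alpha and pi
  alpha <- colMeans(tau)
  pi    <- (t(tau) %*% Y %*% tau) / (t(tau) %*% (1 - I) %*% tau)
  pi    <- pmin(pmax(pi, 1e-6), 1 - 1e-6)          # guard against log(0)
  ## VE-step: one fixed-point update of tau, on the log scale
  logTau <- Y %*% tau %*% t(log(pi)) + (1 - Y - I) %*% tau %*% t(log(1 - pi))
  logTau <- sweep(logTau, 2, log(alpha), "+")
  tau    <- exp(logTau - apply(logTau, 1, max))    # avoid underflow
  list(alpha = alpha, pi = pi, tau = tau / rowSums(tau))
}
```

Iterating `vem_step` from a spectral-clustering initialization until the ELBO stabilizes reproduces the scheme of the previous slides.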
---

# Example: French political blogosphere

```r
blog <- as_adj(frenchblog2007, sparse = FALSE)
blocks <- 1:18
sbm_full <- estimateMissSBM(blog, blocks, "node")
```

```
## 
## 
## Adjusting Variational EM for Stochastic Block Model
## 
## Imputation assumes a 'node' network-sampling process
## 
## Initialization of 18 model(s).
## Performing VEM inference
## Model with 8 blocks. Model with 3 blocks. Model with 1 blocks. Model with 17 blocks. Model with 14 blocks. Model with 5 blocks. Model with 2 blocks. Model with 11 blocks. Model with 18 blocks. Model with 13 blocks. Model with 16 blocks. Model with 6 blocks. Model with 10 blocks. Model with 15 blocks. Model with 7 blocks. Model with 12 blocks. Model with 9 blocks. Model with 4 blocks.
## Looking for better solutions
## Pass 1 Going forward +++++++++++++++++ Pass 1 Going backward +++++++++++++++++
```

---

# Convergence monitoring (ELBO)

```r
plot(sbm_full, "monitoring")
```

<img src="slides_files/figure-html/fblog-simpleSbm-analysis-plot1-1.png" style="display: block; margin: auto;" />

---

# Model Selection (vICL)

```r
plot(sbm_full)
```

<img src="slides_files/figure-html/fblog-simpleSbm-analysis-plot2-1.png" style="display: block; margin: auto;" />

---

# Parameters

```r
plot(sbm_full$bestModel, "meso")
```

<img src="slides_files/figure-html/fblog-simpleSbm-analysis-theta-1.png" style="display: block; margin: auto;" />

---

# Clustering I

```r
plot(sbm_full$bestModel, dimLabels = list(row = "blogs", col = "blogs"))
```

<img src="slides_files/figure-html/fblog-simpleSbm-analysis-plot3-1.png" style="display: block; margin: auto;" />

---

# Clustering II

```r
plot(sbm_full$bestModel, "expected", dimLabels = list(row = "blogs", col = "blogs"))
```

<img src="slides_files/figure-html/fblog-simpleSbm-analysis-plot4-1.png" style="display: block; margin: auto;" />

---

# Clustering III

```r
aricode::ARI(sbm_full$bestModel$fittedSBM$memberships, party)
```

```
## [1] 0.463709
```

```r
aricode::NID(sbm_full$bestModel$fittedSBM$memberships, party)
```

```
## [1] 0.3920363
```

---
class: inverse, middle

# SBM from an observed network

- Missing data framework for SBM
- Modeling the observation process
- Inference with missing dyads

<!-- MISS-SBM -->

---

# .small[Inference from a partially observed network (missing dyads)]

.pull-left[
<small>
`$$\left(\begin{array}{cccccccccc}
 & 1 & \texttt{NA} & 1 & 0 & \texttt{NA} & 0 & 0 & 0 & 0 \\
1 & & 0 & 0 & 1 & 0 & 0 & 1 & \texttt{NA} & 0 \\
\texttt{NA} & 0 & & \texttt{NA} & 0 & 0 & 1 & \texttt{NA} & 1 & 0 \\
1 & 0 & \texttt{NA} & & 0 & 0 & 0 & \texttt{NA} & 1 & 0 \\
0 & 1 & 0 & 0 & & 1 & 0 & 0 & 0 & 0 \\
\texttt{NA} & 0 & 0 & 0 & 1 & & 0 & \texttt{NA} & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & & 0 & 0 & 0 \\
0 & 1 & \texttt{NA} & \texttt{NA} & 0 & \texttt{NA} & 0 & & \texttt{NA} & 0 \\
0 & \texttt{NA} & 1 & 1 & 0 & 1 & 0 & \texttt{NA} & & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 &
\end{array}\right)$$`
</small>
]

.pull-right[
.content-box-red[
Dyads are observed (or not) according to a specific sampling process, which must be taken into account during inference
]

.content-box-purple[
About the sampling:

- Completely random?
- Depends on the connectivity?
- Depends on hidden colors (groups)?
]
]

- Kolaczyk, E. D. (2009). _Statistical analysis of network data, methods and models_. Springer.
- Handcock, M. S. and K. J. Gile (2010). "Modeling Social Networks From Sampled Data". In: _The Annals of Applied Statistics_ 4.1, pp. 5-25.
- Frisch, G., J. Léger, and Y. Grandvalet (2020). "Learning from missing data with the Latent Block Model". In: _arXiv preprint arXiv:2010.12222_.
- Gaucher, S., O. Klopp, and G. Robin (2021). "Outlier detection in networks with missing links". In: _Computational Statistics and Data Analysis_ 164, p. 107308.
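---

# Missing dyads in practice: .small[a toy example]

In `missSBM`, unobserved dyads are simply encoded as `NA` entries of the adjacency matrix. A minimal sketch with hypothetical values (for an undirected network, the `NA` pattern must respect the symmetry of the graph):

```r
Y <- matrix(c( 0,  1, NA,  1,
               1,  0,  0, NA,
              NA,  0,  0,  1,
               1, NA,  1,  0), 4, 4)
isSymmetric(Y)  # TRUE: missing dyads occur symmetrically
mean(is.na(Y))  # proportion of missing dyads
```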
---

# Missing data: general framework

### Little and Rubin's framework

Let

- `\(R\sim p_\beta\)` be a random process defining the observation (sampling) process
- `\(Y\sim p_\theta\)` be some data split into two subsets `\(Y^o, Y^m\)` ("observed" and "missing")

Little and Rubin [LR14] define

- **MCAR** (Missing Completely At Random): `\(R \perp Y\)`
- **MAR** (Missing At Random): `\(R \perp Y^m | Y^o\)`
- **MNAR** (Missing Not At Random): other cases

Note that MCAR `\(\subset\)` MAR, and that in the MAR case, inference of `\(\theta\)` can be based on `\(Y^o\)` only:

`$$\begin{aligned}p_{\theta, \beta}(Y^o,R) & = \int p_{\theta}(Y^o,Y^m)p_{\beta}(R|Y^o,Y^m)dY^m \\ & = p_{\theta}(Y^o)p_{\beta}(R|Y^o)\end{aligned}$$`

---

# Missing data: SBM case

.large[Setting]

- The observation process is given by the sampling matrix
`$$(R_{ij})= \mathbf{1}_{\{Y_{ij} \text{ is observed}\} }$$`
- The process is **MAR** if `\(R \perp Y^m, Z | Y^o\)`, in which case
`$$p_{\theta, \beta}(Y^o,R) = \int p_{\theta}(Y^o,Y^m, Z)p_{\beta}(R|Y^o,Y^m, Z)\,dY^m \,dZ = p_{\theta}(Y^o)p_{\beta}(R|Y^o)$$`

.large[Typology of observation processes]

<img src="slides_files/figure-html/dags-1.png" style="display: block; margin: auto;" />

---

# Observation process .small[(a.k.a. "sampling design")]

### Some studied processes

*Notation*: .mar[M(C)AR], .mnar[MNAR], `\(S_i = \mathbf{1}_{\{\text{node } i \text{ is sampled}\}}\)` (i.e., `\(R_{ij} = 1\)` for all `\(j\)`)

.pull-left[
### Dyad-centered

- .mar[Random dyad sampling]
`$$R_{ij} \sim^{iid} \mathcal{B}(\rho)$$`
- .mnar[Double standard sampling]
`$$\begin{aligned}R_{ij} | Y_{ij}=1 & \sim^{ind} \mathcal{B}(\rho_1) \\ R_{ij} | Y_{ij} = 0 & \sim^{ind} \mathcal{B}(\rho_0)\end{aligned}$$`
- .mnar[Block dyad sampling]
`$$R_{ij}|Z_i, Z_j \sim^{ind} \mathcal{B}(\rho_{Z_i Z_j})$$`
]

.pull-right[
### Node-centered

- .mar[Node sampling]
`$$S_{i} \sim^{iid} \mathcal{B}(\rho)$$`
- .mnar[Degree sampling]
`$$\begin{aligned} S_{i}|D_i & \sim^{ind} \mathcal{B}(\mathrm{logistic}(a + b D_i)) \\ D_i & = \sum_j Y_{ij}\end{aligned}$$`
- .mnar[Block node sampling]
`$$S_{i}|Z_i \sim^{ind} \mathcal{B}(\rho_{Z_i})$$`
]

---

# Observation process: .small[illustration]

We first generate a community-shaped network

<small>
```r
## SBM parameters
N <- 300                               # number of nodes
K <- 3                                 # number of clusters
alpha <- rep(1, K)/K                   # block proportions
pi <- list(mean = diag(.45, K) + .05)  # connectivity matrix
## simulate an undirected binary SBM
sbm <- sbm::sampleSimpleSBM(N, alpha, pi)
plot(sbm)
```

<img src="slides_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
</small>

---

# Observation process: .small[sample network data]

We consider some sampling designs and their associated parameters

```r
sampling_parameters <- list(
  "dyad" = .3,
  "node" = .3,
  "double-standard" = c(0.2, 0.6),
  "block-node" = c(.3, .8, .5),
  "block-dyad" = pi$mean,
  "degree" = c(.1, .2)
)
observed_networks <- list()
for (sampling in names(sampling_parameters)) {
  observed_networks[[sampling]] <- missSBM::observeNetwork(
    adjacencyMatrix = sbm$networkData,
    sampling        = sampling,
    parameters      = sampling_parameters[[sampling]],
    clusters        = sbm$memberships
  )
}
```

---

# Observation process: output

<img src="slides_files/figure-html/plot-samplings1-1.png" style="display: block; margin: auto;" /><img src="slides_files/figure-html/plot-samplings1-2.png" style="display: block; margin: auto;" />
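As a quick sanity check (a small sketch reusing `observed_networks` from the previous slide), the share of dyads actually observed under each design:

```r
## proportion of observed (non-NA) dyads per sampling design
sapply(observed_networks, function(Y) mean(!is.na(Y)))
```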
---

# Identifiability

We build on the identifiability proof of [Cel+12] for the fully observed SBM (sort the marginal probabilities into a Vandermonde matrix, which is invertible, so that the parameters `\(\pi, \alpha\)` can be expressed as functions of these marginal probabilities).

### .content-box-red[SBM observed under MAR samplings (node/dyad-centered)]

.content-box-yellow[
Let `\(n\geq 2K\)` and assume that for any `\(1\leq k \leq K\)`, `\(\rho>0\)`, `\(\alpha_k >0\)` and the coordinates of `\(\pi \cdot \alpha\)` are pairwise distinct. Then, under dyad (resp. node) sampling, the SBM parameters are identifiable w.r.t. the distribution of the observed part of the SBM, up to label switching.
]

### .content-box-red[SBM observed under block sampling]

.content-box-yellow[
Let `\(n\geq 2K\)` and assume that for any `\(1\leq k \leq K\)`, `\(\rho_k>0\)`, `\(\alpha_k >0\)` and the coordinates of `\(\pi \cdot \alpha\)` are pairwise distinct. If the coordinates of `\(( \sum_k \pi_{1k} \rho_k \alpha_k, \dots, \sum_k \pi_{Kk} \rho_k \alpha_k)\)` are pairwise distinct, then, under block sampling, `\(\theta\)` and `\(\beta\)` are identifiable w.r.t. the distributions of the SBM and the sampling, up to label switching.
]

Identifiability of the SBM under double-standard and degree samplings: still open.

---

# .small[Inference of SBM from an observed network: .mar[MAR]]

### Setting

We now need to estimate

- the SBM parameters `\(\theta = \{\boldsymbol\alpha, \boldsymbol\Pi\}\)`
- the sampling parameters `\(\beta\)` (e.g., `\(\rho\)`, or `\(\rho_k\)`, etc., depending on the design)

### MAR case

Since

`$$p_{\theta, \beta}(Y^o,R) = p_{\theta}(Y^o)p_{\beta}(R|Y^o),$$`

we just have to .content-box-red[perform inference on the observed part of the data] `\(\rightsquigarrow\)` "usual" V-EM (with the possibility of reducing the memory footprint by sparsely encoding both `\(0\)` and `\(\texttt{NA}\)`).
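For instance (a sketch reusing the dyad-sampled network simulated earlier, mirroring the calls shown on the other slides), MAR inference is a single call:

```r
## fit an SBM on the network observed under random dyad sampling (MAR)
fit_mar <- estimateMissSBM(observed_networks$dyad, 1:5, "dyad")
fit_mar$bestModel$fittedSBM$connectParam
```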
---

# .small[Inference of SBM from an observed network: .mnar[MNAR]]

### Variational approximation

To evaluate `\(\mathbb{E}_{Z, Y^m | Y^o, R}\big(\cdot\big)\)`, the distribution `\(p_{\theta, \beta}(Z, Y^m | Y^o, R)\)` is approximated by

`$$\begin{aligned} q_\psi(Z, Y^m) = \prod_{i=1}^{n} m(Z_i;\tau_{i}) \prod_{Y_{ij} \in Y^m} b(Y_{ij};\nu_{ij}) = \prod_{i=1}^n\prod_{k=1}^K (\tau_{ik})^{\mathbf{1}_{\{Z_i = k\}}} \cdot \prod_{Y_{ij} \in Y^m} \nu_{ij}^{Y_{ij}}(1-\nu_{ij})^{1-Y_{ij}} \end{aligned}$$`

where `\(\psi = \{(\nu_{ij}), (\tau_{ik})\}\)` are the variational parameters to be optimized:

- `\(\tau_{ik}\)`, the posterior membership probabilities, are (almost) generic to any sampling design
- `\(\nu_{ij}\)`, the imputation values, are specific to the sampling design

### M-step

- `\(\beta\)`, the sampling parameters, are specific to the design
- `\(\theta = (\boldsymbol\alpha, \boldsymbol\Pi)\)` is generic:

`$$\hat{\alpha}_k=\frac{1}{n}\sum_i \hat{\tau}_{ik}, \qquad \hat{\pi}_{k\ell}=\frac{\sum_{(i,j)\in Y^o}\hat{\tau}_{ik}\hat{\tau}_{j\ell}Y_{ij} + \sum_{(i,j)\in Y^m}\hat{\tau}_{ik}\hat{\tau}_{j\ell}\hat{\nu}_{ij}}{\sum_{(i,j)}\hat{\tau}_{ik}\hat{\tau}_{j\ell}}.$$`

---

# .small[General Variational EM for MNAR inference]

Computations essentially separate into fitting the SBM and fitting the sampling design:

.content-box-yellow[
**Initialize** `\(\tau^{0}\)`, `\(\nu^{0}\)` and `\(\beta^{0}\)`

**Repeat**

`$$\begin{array}{lclrr} \theta^{h+1} & = & \arg\max_{\theta} J\left(Y^o,R ;\ \tau^{h},\nu^{h},\beta^{h},\theta \right) & \text{M-step a)} & \text{SBM} \\ \beta^{h+1} & = & \arg\max_{\beta} J\left(Y^o,R; \ \tau^{h},\nu^{h},\beta,\theta^{h+1} \right) & \text{M-step b)} & \text{Sampling} \\ \tau^{h+1} & = & \arg\max_{\tau} J\left(Y^o,R; \ \tau,\nu^{h},\beta^{h+1}, \theta^{h+1} \right) & \text{VE-step a)} & \text{SBM} \\ \nu^{h+1} & = & \arg\max_{\nu} J\left(Y^o,R ; \tau^{h+1},\nu,\beta^{h+1},\theta^{h+1} \right) & \text{VE-step b)} & \text{Sampling} \\ \end{array}$$`

**Until** `\(\left\|\theta^{h+1} - \theta^{h}\right\| < \varepsilon\)`
]

where we have the following decomposition:

`$$\begin{aligned} J(Y^o,R) & = \mathbb{E}_{q_{\psi}} [\log p_{\theta,\beta}(Y^o, R, Y^m, Z)] + \mathcal{H}(q_{\psi}(Z, Y^m))\\ & = \mathbb{E}_{q_{\psi}} [\log p_{\beta}(R | Y^o, Y^m, Z)] + \mathbb{E}_{q_{\tau}} [\log p_{\theta}(Y^o | Z)] + \mathbb{E}_{q_{\nu,\tau}} [\log p_{\theta}(Y^m | Z)] \\ & \qquad \qquad + \mathcal{H}(q_{\tau}(Z)) + \mathcal{H}(q_{\nu}(Y^m)) \end{aligned}$$`

---

# Design-specific updates

### Example: block-dyad sampling

Recall that

`$$R_{ij}|Z_i, Z_j \sim^{ind} \mathcal{B}(\rho_{Z_i Z_j}).$$`

The expected log-likelihood of the sampling w.r.t. the variational approximation `\(q_\psi\)` is

`$$\mathbb{E}_{q_{\psi}} [\log p_{\beta}(R | Y^o, Y^m, Z)] = \sum_{(i,j) \in Y^o} \sum_{k,\ell} \tau_{ik}\tau_{j\ell} \log(\rho_{k\ell}) + \sum_{(i,j)\in Y^m} \sum_{k,\ell} \tau_{ik}\tau_{j\ell} \log(1-\rho_{k\ell}),$$`

from which we derive

`$$\hat{\rho}_{k\ell}=\frac{\sum_{(i,j)\in Y^o} \tau_{ik}\tau_{j\ell} }{\sum_{(i,j)} \tau_{ik}\tau_{j\ell} }$$`

and

`$$\hat{\nu}_{ij} = \mathrm{logistic} \left( \sum_{k,\ell} \tau_{ik}\tau_{j\ell} \log\left(\frac{\pi_{k\ell}}{1-\pi_{k\ell}}\right) \right)$$`
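In matrix form, these two updates are one-liners. A sketch, assuming `tau` (`n` x `K`), the connectivity `pi` (`K` x `K`) and the binary observation matrix `R` (1 = observed dyad) are at hand:

```r
ones    <- matrix(1, nrow(tau), nrow(tau))
rho_hat <- (t(tau) %*% R %*% tau) / (t(tau) %*% ones %*% tau)  # block-dyad sampling rates
nu_hat  <- plogis(tau %*% log(pi / (1 - pi)) %*% t(tau))       # imputation probabilities
```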
---

# Variational estimators: .small[theoretical guarantees]

### Consistency & asymptotic normality

Inspired by the two following papers:

- [Bic+13] deal with the binary SBM under "sparse" conditions
- [BKM17] deal with the fully observed LBM with emission distribution in the one-dimensional exponential family

### .content-box-red[Theorem [MT20]]

.content-box-yellow[
Consider an SBM with `\(K\)` blocks and emission distribution in the *one-dimensional exponential family*, observed under *random dyad sampling* and the identifiability conditions stated above. Then the maximum-likelihood and variational estimators are *consistent* and *asymptotically normal*, with explicit asymptotic variance/covariance matrix.
]

`\(\rightarrow\)` Only for MAR sampling!

---

# SBM with covariates and missing data

Consider `\(m\)` external covariates `\(X_{ij}\in\mathbb{R}^m\)` defined at the dyad level. For covariates `\(X_i\)` defined at the node level, we can use a similarity `\(\phi(X_i, X_j) \to X_{ij}\)`.

`$$\begin{aligned} Z_i & \sim^{\text{iid}} \mathcal{M}(1,\alpha), \\ Y_{ij} \ | \ \{Z_i, Z_j, X_{ij} \} & \sim^{\text{ind}} \mathcal{B}(\text{logistic}(\pi_{Z_i Z_j} + \eta^\top X_{ij}))\\ \end{aligned}$$`

### Dyad-centered sampling

Let `\(\delta \in \mathbb{R}\)`, `\(\kappa \in \mathbb{R}^m\)`. The probability of observing a dyad is

`$$\mathbb{P}(R_{ij} = 1 |X_{ij}) = \text{logistic}(\delta + \kappa^\top X_{ij}).$$`

### Node-centered sampling

Let `\(\delta \in \mathbb{R}\)` and `\(\kappa \in \mathbb{R}^m\)`. The probability of observing all the dyads involving node `\(i\)` is

`$$\mathbb{P}(S_{i} = 1 |X_{i}) = \text{logistic}(\delta + \kappa^\top X_i).$$`

.content-box-red[These sampling designs are MNAR; however, conditionally on `\(X\)`, they are MCAR]

---
class: inverse, middle

# Illustrations

1. Numerical study of MNAR vs MAR
2. French blogosphere
3. PPI ER (ESR1) ego network

<!-- Illustration MNAR -->

---

# Block-dyad sampling

Consider a community-like network:

.pull-left[
```r
n <- 200
alpha <- c(1/3,1/3,1/3)
pi <- .15 + diag(3) * .25
theta <- list(mean = pi)
pi
```
]

.pull-right[
```
##      [,1] [,2] [,3]
## [1,] 0.40 0.15 0.15
## [2,] 0.15 0.40 0.15
## [3,] 0.15 0.15 0.40
```
]

Define sampling matrices with decreasing agreement with `\(\pi\)`

<img src="slides_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

# Performance of MAR vs MNAR

<img src="slides_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

100 replicates. The closer `\(\delta\)` is to zero, the closer we are to the MAR case.

<!-- Illustration blogosphere -->

---

# Back to the French blogosphere

### Control the network observation

- We sample the original network to obtain a partly observed blog network
- We sample more heavily in the highly connected communities

```r
samplingParameters <-
  .2 + ifelse(sbm_full$bestModel$fittedSBM$connectParam$mean < .1, 0, .6)
blog_obs <- observeNetwork(
  adjacencyMatrix = blog,
  sampling        = "block-dyad",
  parameters      = samplingParameters,
  clusters        = sbm_full$bestModel$fittedSBM$memberships
)
```

---

# French blogosphere sampled

<img src="slides_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

# .small[Compare MAR and MNAR with the model selection criterion]

```r
sbm_mar  <- estimateMissSBM(blog_obs, blocks, "dyad")
sbm_mnar <- estimateMissSBM(blog_obs, blocks, "block-dyad")
```

<img src="slides_files/figure-html/plotICL-fblog-1.png" style="display: block; margin: auto;" />

---

# Validation?
Compare the clusterings of the different models with the original *party* classification:

<small>
```r
ARI(party, sbm_full$bestModel$fittedSBM$memberships)
```

```
## [1] 0.463709
```

```r
ARI(party, sbm_mar$bestModel$fittedSBM$memberships)
```

```
## [1] 0.4169295
```

```r
ARI(party, sbm_mnar$bestModel$fittedSBM$memberships)
```

```
## [1] 0.5040523
```

```r
ARI(sbm_mnar$bestModel$fittedSBM$memberships, sbm_full$bestModel$fittedSBM$memberships)
```

```
## [1] 0.7266251
```

```r
ARI(sbm_mnar$bestModel$fittedSBM$memberships, sbm_mar$bestModel$fittedSBM$memberships)
```

```
## [1] 0.5840606
```
</small>

---

# French blogosphere: .small[original classification]

<img src="slides_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---

# French blogosphere: .small[MAR clustering]

<img src="slides_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---

# French blogosphere: .small[MNAR clustering]

<img src="slides_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

<!-- PPI-network -->

---

# Protein-Protein Network

### The data

- The PPI network in the neighborhood of ER (ESR1), composed of 741 proteins
- Valued dyads: `\(\omega_{ij} \in (0,1]\)` reflecting the level of confidence in the interaction
- Binarization of the network with a threshold `\(\gamma\)`:

`$$\mathbf{Y}^\gamma = \left(Y^\gamma_{ij}\right) = \left\{ \begin{array}{ll} 1 & \text{if } \omega_{ij} > 1-\gamma, \\ \texttt{NA} & \text{if } \gamma \leq \omega_{ij} \leq 1-\gamma, \\ 0 & \text{if } \omega_{ij} < \gamma. \\ \end{array} \right.$$`

### Questions

- Which `\(\gamma\)`?
- Which sampling design: MAR or MNAR?

---

# Model Selection

<img src="slides_files/ICL-PPI.png" width="1855" style="display: block; margin: auto;" />

- The ICL criterion selects `\(\gamma = .35\)` and MNAR sampling as those that best fit the data
- Number of selected blocks: `\(11\)` (MAR) and `\(13\)` (MNAR)
- ARI between the MNAR and MAR clusterings: `\(.39\)`
- The MNAR clustering is somewhat coherent with gene ontology

---

# Imputation

<img src="slides_files/net_esr1_NA_nu_both.jpg" width="2052" style="display: block; margin: auto;" />

---

# Gene Ontology (GO) enrichment analysis

*i.e.*, identifying classes of genes over-represented in a large set of genes.

MNAR finds `\(13\)` significant biological processes (MAR: only `\(1\)`)

<img src="slides_files/ontology.png" width="1248" style="display: block; margin: auto;" />

---

# Conclusion

## Perspectives/ongoing work

- Sampling
  - study robustness (does block sampling "include" double-standard sampling?)
- Other models
  - degree-corrected SBM
  - (ZI-)Poisson emission law
  - simple SBM `\(\to\)` bipartite SBM (a.k.a. latent block models)
- Other algorithms
  - SGD algorithms + PyTorch framework
  - 'exact' ICL maximization (with É. Côme)

## Advertisement

[https://computo.sfds.asso.fr](https://computo.sfds.asso.fr), a new journal promoting reproducible research

## THANK YOU

---

# References

Barbillon, P., J. Chiquet, and T. Tabouy (2022). "missSBM: An R Package for Handling Missing Values in the Stochastic Block Model". In: _Journal of Statistical Software_.

Bickel, P., D. Choi, X. Chang, et al. (2013). "Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels". In: _Ann. Stat._ 41.4, pp. 1922-1943.

Brault, V., C. Keribin, and M. Mariadassou (2017). "Consistency and Asymptotic Normality of Latent Blocks Model Estimators". Working paper or preprint.

Celisse, A., J. Daudin, L. Pierre, et al. (2012).
"Consistency of maximum-likelihood and variational estimators in the stochastic block model". In: _Electron. J. Stat._ 6, pp. 1847-1899. Little, R. J. and D. B. Rubin (2014). _Statistical analysis with missing data_. John Wiley & Sons. Mariadassou, M. and T. Tabouy (2020). "Consistency and asymptotic normality of stochastic block models estimators from sampled data". In: _Electronic Journal of Statistics_ 14.2, pp. 3672-3704. Tabouy, T., P. Barbillon, and J. Chiquet (2019). "Variational Inference for Stochastic Block Models from Sampled Data". In: _Journal of the American Statistical Association_ 0.ja, pp. 1-20. DOI: [10.1080/01621459.2018.1562934](https://doi.org/10.1080%2F01621459.2018.1562934).