class: center, middle, inverse, title-slide

# missSBM
## Inference in Stochastic Block Models from Missing Data
### P. Barbillon, J. Chiquet, T. Tabouy
Paris-Saclay, AgroParisTech, INRAE
Last update 09 December, 2021
### https://github.com/GrossSBM/missSBM
---

# Resources

### R/C++ package

Last stable release on CRAN, development version available on GitHub.

```r
install.packages("missSBM")
remotes::install_github("GrossSBM/missSBM@development")
```

```r
library(missSBM)
packageVersion("missSBM")
```

```
## [1] '1.0.1'
```

### Publications

<small>
The [missSBM website](https://grosssbm.github.io/missSBM/) contains the standard package documentation and a couple of vignettes for the top-level functions.

- Tabouy, T., P. Barbillon, and J. Chiquet (2019). "Variational Inference for Stochastic Block Models from Sampled Data". In: _Journal of the American Statistical Association_ 0.ja, pp. 1-20. DOI: [10.1080/01621459.2018.1562934](https://doi.org/10.1080%2F01621459.2018.1562934).
- Barbillon, P., J. Chiquet, and T. Tabouy (2022). "missSBM: An R Package for Handling Missing Values in the Stochastic Block Model". In: _Journal of Statistical Software_.
</small>

---
class: inverse, middle

# Outline

1. .large[Motivations]
2. .large[Binary SBM and variational inference]
3. .large[SBM inference from observed data]
4. .large[Illustration]

---

# Network data with missing entries

### Recommendation system: Epinions

Who-trusts-whom online social network of Epinions.com, a general consumer review site. Members of the site can decide whether to "trust" each other.

**Available** at http://www.trustlet.org/datasets/extended_epinions/user_rating.txt.gz

### Social networks in ethnobiology

A seed exchange network in Kenya is collected over a limited geographical area, where all 155 farmers are interviewed. Farmers only provide information about other farmers with whom they have interacted.

### Ecological networks: plant-pollinator network

Interaction network between predefined sets of plants and pollinators, built by direct observation.

How can we trust that a "0" in collected network data is not a missing entry?

---

# .small[Companion data set: French political blogosphere]

Single-day snapshot of almost 200 political blogs, automatically extracted on 14 October 2006 and manually classified by the "Observatoire Présidentielle" project.
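The snippets on the remaining slides assume a handful of companion packages are attached; a minimal setup sketch (the package list is inferred from the code shown later, only `missSBM` is strictly required):

```r
library(missSBM)   # inference of SBM from partially observed networks
library(sbm)       # simulation of simple SBMs (sampleSimpleSBM)
library(igraph)    # graph handling: frenchblog2007 is an igraph object
library(aricode)   # clustering comparison (ARI, NID)
library(magrittr)  # pipe operator (%>%) used with kableExtra below
```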
```r
data("frenchblog2007")
party <- vertex.attributes(frenchblog2007)$party
table(party) %>% kableExtra::kbl() %>% kableExtra::kable_classic()
```

<table class=" lightable-classic" style='font-family: "Arial Narrow", "Source Sans Pro", sans-serif; margin-left: auto; margin-right: auto;'>
<thead>
<tr> <th style="text-align:left;"> party </th> <th style="text-align:right;"> Freq </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> analyst </td> <td style="text-align:right;"> 11 </td> </tr>
<tr> <td style="text-align:left;"> center-left </td> <td style="text-align:right;"> 11 </td> </tr>
<tr> <td style="text-align:left;"> center-rigth </td> <td style="text-align:right;"> 32 </td> </tr>
<tr> <td style="text-align:left;"> far-left </td> <td style="text-align:right;"> 7 </td> </tr>
<tr> <td style="text-align:left;"> far-right </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> green </td> <td style="text-align:right;"> 9 </td> </tr>
<tr> <td style="text-align:left;"> left </td> <td style="text-align:right;"> 57 </td> </tr>
<tr> <td style="text-align:left;"> liberal </td> <td style="text-align:right;"> 25 </td> </tr>
<tr> <td style="text-align:left;"> right </td> <td style="text-align:right;"> 40 </td> </tr>
</tbody>
</table>

---

# French blog: graph view

<img src="slides_files/figure-html/blog-graph-1.png" width="95%" style="display: block; margin: auto;" />

---

# French blog: matrix view

<img src="slides_files/figure-html/blog-matrix-1.png" width="70%" style="display: block; margin: auto;" />

---
class: inverse, middle

# SBM: background

- Probabilistic model for random graphs
- Latent variable model
- Variational inference

<!-- SBM: background -->

---

# Stochastic Block Model

.large[A popular probabilistic model for network data]

.pull-left[
<div class="figure" style="text-align: center">
<img src="slides_files/figure-html/sbm-model-1.png" alt="The binary SBM model" />
<p class="caption">The binary SBM model</p>
</div>
]

.pull-right[
.content-box-purple[
Let

- *fixed* nodes `\(\{1, \dots, n \}\)`
- unknown colors in `\(\mathcal{C}=\{\color{#fab20a}{\bullet},\color{#0000ff}{\bullet},\color{#008000}{\bullet}\}\)`
- `\(\alpha_\bullet = \mathbb{P}(i \in \bullet)\)`, `\(\bullet\in\mathcal{C}\)`
- `\(\pi_{\color{#fab20a}{\bullet}\color{#0000ff}{\bullet}} = \mathbb{P}(i \leftrightarrow j | i\in\color{#fab20a}{\bullet},j\in\color{#0000ff}{\bullet})\)`

In other words,
`\begin{align*}
Z_i = \mathbf{1}_{\{i \in \bullet\}} \ & \sim^{\text{iid}} \mathcal{M}(1,\alpha), \\
Y_{ij} \ | \ \{i\in\color{#fab20a}{\bullet},j\in\color{#0000ff}{\bullet}\} & \sim^{\text{ind}} \mathcal{B}(\pi_{\color{#fab20a}{\bullet}\color{#0000ff}{\bullet}})\\
\end{align*}`
]
]

<small>
- Frank, O. and F. Harary (1982). "Cluster inference by using transitivity indices in empirical graphs". In: _J. Am. Stat. Soc._ 77.380, pp. 835-840.
- Holland, P. W., K. B. Laskey, and S. Leinhardt (1983). "Stochastic blockmodels: First steps". In: _Social networks_ 5.2, pp. 109-137.
</small>
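---

# Stochastic Block Model: .small[simulation sketch]

To make the two-level model above concrete, a minimal hand-rolled simulation with hypothetical parameter values (the next slides use `igraph::sample_sbm` for the same purpose):

```r
n     <- 50
alpha <- c(.4, .3, .3)                      # block proportions (hypothetical)
pi    <- matrix(.05, 3, 3); diag(pi) <- .3  # connectivity (hypothetical)
Z  <- sample(1:3, n, replace = TRUE, prob = alpha)  # latent colors Z_i
Y  <- matrix(0, n, n)
ij <- which(upper.tri(Y), arr.ind = TRUE)   # dyads (i < j)
Y[upper.tri(Y)] <- rbinom(nrow(ij), 1, pi[cbind(Z[ij[, 1]], Z[ij[, 2]])])
Y  <- Y + t(Y)                              # undirected graph: symmetrize
```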
---

# Examples of topology: .small[Community network]

```r
pi <- matrix(c(0.3,0.02,0.02,0.02,0.3,0.02,0.02,0.02,0.3),3,3)
communities <- igraph::sample_sbm(100, pi, c(25, 50, 25))
plot(communities, vertex.label=NA, vertex.color = rep(1:3,c(25, 50, 25)))
```

<img src="slides_files/figure-html/sample-sbm-community-1.png" style="display: block; margin: auto;" />

---

# Examples of topology: .small[Star network]

```r
pi <- matrix(c(0.05,0.3,0.3,0),2,2)
star <- igraph::sample_sbm(100, pi, c(4, 96))
plot(star, vertex.label=NA, vertex.color = rep(1:2,c(4,96)))
```

<img src="slides_files/figure-html/sample-sbm-star-1.png" style="display: block; margin: auto;" />

---

# Estimation in the SBM: .small[latent variable model]

.pull-left[
.pull-left[
<img src="slides_files/figure-html/sbm-inference-1.png" width="80%" style="display: block; margin: auto;" />
]
.pull-right[
<small>
.content-box-purple[
- *Fixed* nodes `\(\{1, \dots, n \}\)`
- latent colors `\(\mathcal{C}=\{\color{#fab20a}{\bullet},\color{#0000ff}{\bullet},\color{#008000}{\bullet}\}\)`
]
</small>
]
]

.pull-right[
.content-box-red[
Estimate the model parameters and the clustering:

- `\(\theta = \{\boldsymbol\alpha = (\alpha_\bullet), \boldsymbol\Pi = (\pi_{\color{#fab20a}{\bullet}\color{#0000ff}{\bullet}})\}\)`
- the colors of `\(i\)`, i.e. the `\(\mathbf{Z}_i\)`
]
]

.large[Marginal likelihood]

Integration over `\(\mathcal{Z}\)` is intractable: `\(\mathrm{card}(\mathcal{C})^n\)` terms!

`$$p_\theta(\mathbf{Y}) = \int_{\mathcal{Z}} \prod_{(i,j)} p_\theta(Y_{ij} | Z_i, Z_j ) \, p_\theta(\mathbf{Z}) \mathrm{d}\mathbf{Z}$$`

.large[Maximum likelihood for an incomplete data model: EM]

`$$\log p_\theta(\mathbf{Y}) = \mathbb{E}_{p_\theta(\mathbf{Z}\,|\,\mathbf{Y})} [\log p_\theta(\mathbf{Y}, \mathbf{Z})] + \mathcal{H}[p_\theta(\mathbf{Z}\,|\,\mathbf{Y})], \quad \text{ with } \mathcal{H}(p) = -\mathbb{E}_p(\log(p))$$`

.important[EM requires evaluating (some moments of)] `\(p_\theta(\mathbf{Z}\,|\,\mathbf{Y})\)`

---

# Intractable EM: solution(s)

.large[Variants of EM, MCMC/Bayesian approaches]

<small>
- Nowicki, K. and T. A. B. Snijders (2001). "Estimation and Prediction for Stochastic Blockstructures". In: _J. Am. Stat. Soc._ 96.455, pp. 1077-1087.
- Daudin, J., F. Picard, and S. Robin (2008). "A mixture model for random graphs". In: _Stat. comp._ 18.2, pp. 173-183.
- Latouche, P., É. Birmelé, and C. Ambroise (2012). "Variational Bayesian inference and complexity control for stochastic block models". In: _Stat. Modelling_ 12.1, pp. 93-115.
- Peixoto, T. P. (2014). "Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models". In: _Physical Review E_ 89.1, p. 012804.
</small>

.large[Variational approach]

Find a proxy `\(q_\psi(\mathbf{Z}) \approx p_\theta(\mathbf{Z} | \mathbf{Y})\)` picked in a convenient class of distributions `\(\mathcal{Q}\)`:

`$$q^\star(\mathbf{Z}) = \arg\min_{q\in\mathcal{Q}} D\left(q(\mathbf{Z}), p(\mathbf{Z} | \mathbf{Y})\right).$$`

The Kullback-Leibler divergence is a popular choice .small[(error averaged wrt the approximating distribution)]:

`$$KL\left(q(\mathbf{Z}), p(\mathbf{Z} | \mathbf{Y})\right) = \mathbb{E}_q\left[\log \frac{q(\mathbf{Z})}{p(\mathbf{Z} | \mathbf{Y})}\right] = \int_{\mathcal{Z}} q(z) \log \frac{q(z)}{p(z | \mathbf{Y})} \mathrm{d}z.$$`

---

# Variational EM for SBM

## Class of distributions: multinomial

`$$\mathcal{Q} = \Big\{q_\psi: \, q_\psi(\mathbf{Z}) = \prod_i q_{\psi_i}(\mathbf{Z}_i), \, q_{\psi_i}(\mathbf{Z}_i) = \mathcal{M}\left(\mathbf{Z}_i; \boldsymbol\tau_i\right), \, \psi_i = \{\boldsymbol{\tau}_i\}, \boldsymbol{\tau}_i \in \mathbb{R}^{K} \Big\}$$`

Maximize the ELBO (Evidence Lower BOund):

`$$J(\theta, \psi) = \log p_\theta(\mathbf{Y}) - KL[q_\psi (\mathbf{Z}) || p_\theta(\mathbf{Z} | \mathbf{Y})] = \mathbb{E}_{q} [\log p_\theta(\mathbf{Y}, \mathbf{Z})] + \mathcal{H}[q_\psi(\mathbf{Z})]$$`

## Variational EM

- Initialization: get `\(\mathbf{T}^0 = \{\tau_{ik}^0\}\)` with absolute spectral clustering
- M step: update `\(\theta^h = \{ \boldsymbol\alpha^h, \boldsymbol\Pi^h\}\)`
`$$\theta^h = \arg\max J(\theta, \psi^h) = \arg\max_{\theta} \mathbb{E}_{q_{\psi^h}} [\log p_{\theta}(\mathbf{Y}, \mathbf{Z})]$$`
- VE step: find the optimal `\(q_\psi\)` by updating `\(\psi^h= (\psi^h_{i})_i = \mathbf{T}^{h} = \mathbb{E}_{q^{h}} (\mathbf{Z})\)`:
`$$\psi^h = \arg \max J(\theta^h, \psi) = \arg\min_{\psi} KL[q_\psi(\mathbf{Z}) \,||\, p_{\theta^h}(\mathbf{Z}\,|\,\mathbf{Y})]$$`

---

# Variational EM for SBM: ingredients

### Variational bound

`$$J(\theta, \tau ; \mathbf{Y}) = \sum_{(i,j)} \sum_{(k,\ell)} \tau_{ik} \tau_{j\ell} \log b(Y_{ij},\pi_{k\ell }) + \sum_{i} \sum_{k} \tau _{ik} \log (\alpha_k/\tau_{ik})$$`

### M-step (analytical)

`$$\alpha_k = \frac{1}{n} \sum_{i} \tau_{i k} , \quad \pi_{k\ell } = \frac{\sum_{(i,j)} \tau_{ik}\tau_{j\ell} Y_{ij}}{\sum_{(i,j)} \tau_{ik}\tau_{j\ell}} \qquad \left({\boldsymbol\alpha} = \frac{1}{n}\mathbf{1}_n^\top\mathbf{T}, \quad {\boldsymbol\Pi} = \frac{\mathbf{T}^\top \mathbf{Y} \mathbf{T}}{\mathbf{T}^\top \mathbf{T}} \right)$$`

### Variational E-step (fixed point)

`$$\tau_{ik} \varpropto \alpha_k \prod_{j \neq i} \prod_{\ell} b(Y_{ij} ; \pi_{k\ell})^{\tau_{j\ell}}$$`

### Model selection

`$$\mathrm{vICL}(K) = \mathbb{E}_{q} [\log L(\hat{\theta}; \mathbf{Y}, \mathbf{Z})] - \frac{1}{2} \left(\frac{K(K+1)}{2} \log \frac{n(n-1)}{2} + (K-1) \log (n) \right)$$`
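---

# Variational EM for SBM: .small[a minimal R sketch]

The updates above fit in a dozen lines of R. A didactic sketch of one VEM iteration (dense binary `Y` without missing entries; `vem_step` and its inputs are illustrative names, not missSBM's optimized C++ implementation, which follows the same scheme):

```r
## One VEM iteration for a binary SBM -- a didactic sketch.
## Y: n x n adjacency (0/1, zero diagonal); tau: n x K current memberships.
vem_step <- function(Y, tau) {
  n <- nrow(Y); I <- diag(n)
  ## M-step: analytical updates of alpha and pi
  alpha <- colMeans(tau)
  pi    <- (t(tau) %*% Y %*% tau) / (t(tau) %*% (1 - I) %*% tau)
  pi    <- pmin(pmax(pi, 1e-6), 1 - 1e-6)          # guard against log(0)
  ## VE-step: one fixed-point update of tau, on the log scale
  logTau <- Y %*% tau %*% t(log(pi)) + (1 - Y - I) %*% tau %*% t(log(1 - pi))
  logTau <- sweep(logTau, 2, log(alpha), "+")
  tau    <- exp(logTau - apply(logTau, 1, max))    # avoid underflow
  list(alpha = alpha, pi = pi, tau = tau / rowSums(tau))
}
```

Iterating `vem_step` from a spectral-clustering initialization until the ELBO stabilizes reproduces the scheme of the previous slides.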
---

# Example: French political blogosphere

```r
blog <- as_adj(frenchblog2007, sparse = FALSE)
blocks <- 1:18
sbm_full <- estimateMissSBM(blog, blocks, "node")
```

```
## 
## 
## Adjusting Variational EM for Stochastic Block Model
## 
## Imputation assumes a 'node' network-sampling process
## 
## Initialization of 18 model(s).
## Performing VEM inference
## Model with 8 blocks. Model with 3 blocks. Model with 1 blocks. Model with 17 blocks. Model with 14 blocks. Model with 5 blocks. Model with 2 blocks. Model with 11 blocks. Model with 18 blocks. Model with 13 blocks. Model with 16 blocks. Model with 6 blocks. Model with 10 blocks. Model with 15 blocks. Model with 7 blocks. Model with 12 blocks. Model with 9 blocks. Model with 4 blocks.
## Looking for better solutions
## Pass 1 Going forward +++++++++++++++++ Pass 1 Going backward +++++++++++++++++
```

---

# Convergence monitoring (ELBO)

```r
plot(sbm_full, "monitoring")
```

<img src="slides_files/figure-html/fblog-simpleSbm-analysis-plot1-1.png" style="display: block; margin: auto;" />

---

# Model Selection (vICL)

```r
plot(sbm_full)
```

<img src="slides_files/figure-html/fblog-simpleSbm-analysis-plot2-1.png" style="display: block; margin: auto;" />

---

# Parameters

```r
plot(sbm_full$bestModel, "meso")
```

<img src="slides_files/figure-html/fblog-simpleSbm-analysis-theta-1.png" style="display: block; margin: auto;" />

---

# Clustering I

```r
plot(sbm_full$bestModel, dimLabels = list(row = "blogs", col = "blogs"))
```

<img src="slides_files/figure-html/fblog-simpleSbm-analysis-plot3-1.png" style="display: block; margin: auto;" />

---

# Clustering II

```r
plot(sbm_full$bestModel, "expected", dimLabels = list(row = "blogs", col = "blogs"))
```

<img src="slides_files/figure-html/fblog-simpleSbm-analysis-plot4-1.png" style="display: block; margin: auto;" />

---

# Clustering III

```r
aricode::ARI(sbm_full$bestModel$fittedSBM$memberships, party)
```

```
## [1] 0.463709
```

```r
aricode::NID(sbm_full$bestModel$fittedSBM$memberships, party)
```

```
## [1] 0.3920363
```

---
class: inverse, middle

# SBM from an observed network

- Missing data framework for SBM
- Modeling the observation process
- Inference with missing dyads

<!-- MISS-SBM -->

---

# .small[Inference from a partially observed network (missing dyads)]

.pull-left[
<small>
`$$\left(\begin{array}{cccccccccc}
 & 1 & \texttt{NA} & 1 & 0 & \texttt{NA} & 0 & 0 & 0 & 0 \\
1 & & 0 & 0 & 1 & 0 & 0 & 1 & \texttt{NA} & 0 \\
\texttt{NA} & 0 & & \texttt{NA} & 0 & 0 & 1 & \texttt{NA} & 1 & 0 \\
1 & 0 & \texttt{NA} & & 0 & 0 & 0 & \texttt{NA} & 1 & 0 \\
0 & 1 & 0 & 0 & & 1 & 0 & 0 & 0 & 0 \\
\texttt{NA} & 0 & 0 & 0 & 1 & & 0 & \texttt{NA} & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & & 0 & 0 & 0 \\
0 & 1 & \texttt{NA} & \texttt{NA} & 0 & \texttt{NA} & 0 & & \texttt{NA} & 0 \\
0 & \texttt{NA} & 1 & 1 & 0 & 1 & 0 & \texttt{NA} & & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 &
\end{array}\right)$$`
</small>
]

.pull-right[
.content-box-red[
Dyads are observed (or not) according to a specific sampling process, which must be taken into account during inference
]

.content-box-purple[
About the sampling:

- Completely random?
- Depends on the connectivity?
- Depends on hidden colors (groups)?
]
]

- Kolaczyk, E. D. (2009). _Statistical analysis of network data, methods and models_. Springer.
- Handcock, M. S. and K. J. Gile (2010). "Modeling Social Networks From Sampled Data". In: _The Annals of Applied Statistics_ 4.1, pp. 5-25.
- Frisch, G., J. Léger, and Y. Grandvalet (2020). "Learning from missing data with the Latent Block Model". In: _arXiv preprint arXiv:2010.12222_.
- Gaucher, S., O. Klopp, and G. Robin (2021). "Outlier detection in networks with missing links". In: _Computational Statistics and Data Analysis_ 164, p. 107308.
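---

# Missing dyads in practice: .small[a toy example]

In `missSBM`, unobserved dyads are simply encoded as `NA` entries of the adjacency matrix. A minimal sketch with hypothetical values (for an undirected network, the `NA` pattern must respect the symmetry of the graph):

```r
Y <- matrix(c( 0,  1, NA,  1,
               1,  0,  0, NA,
              NA,  0,  0,  1,
               1, NA,  1,  0), 4, 4)
isSymmetric(Y)  # TRUE: missing dyads occur symmetrically
mean(is.na(Y))  # proportion of missing dyads
```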
---

# Missing data: general framework

### Little and Rubin's framework

Let

- `\(R\sim p_\beta\)` be a random process defining the observation (sampling) process
- `\(Y\sim p_\theta\)` be some data split into two subsets `\(Y^o, Y^m\)` ("observed" and "missing")

Little and Rubin [LR14] define

- **MCAR** (Missing Completely At Random): `\(R \perp Y\)`
- **MAR** (Missing At Random): `\(R \perp Y^m | Y^o\)`
- **MNAR** (Missing Not At Random): other cases

Note that MCAR `\(\subset\)` MAR, and that in the MAR case, inference of `\(\theta\)` can be based on `\(Y^o\)` only:

`$$\begin{aligned}p_{\theta, \beta}(Y^o,R) & = \int p_{\theta}(Y^o,Y^m)p_{\beta}(R|Y^o,Y^m)dY^m \\ & = p_{\theta}(Y^o)p_{\beta}(R|Y^o)\end{aligned}$$`

---

# Missing data: SBM case

.large[Setting]

- The observation process is given by the sampling matrix
`$$(R_{ij})= \mathbf{1}_{\{Y_{ij} \text{ is observed}\} }$$`
- The process is **MAR** if `\(R \perp Y^m, Z | Y^o\)`, in which case
`$$p_{\theta, \beta}(Y^o,R) = \int p_{\theta}(Y^o,Y^m, Z)p_{\beta}(R|Y^o,Y^m, Z)\,dY^m \,dZ = p_{\theta}(Y^o)p_{\beta}(R|Y^o)$$`

.large[Typology of observation processes]

<img src="slides_files/figure-html/dags-1.png" style="display: block; margin: auto;" />

---

# Observation process .small[(a.k.a. "sampling design")]

### Some studied processes

*Notation*: .mar[M(C)AR], .mnar[MNAR], `\(S_i = \mathbf{1}_{\{\text{node } i \text{ is sampled}\}}\)` (i.e., `\(R_{ij} = 1\)` for all `\(j\)`)

.pull-left[
### Dyad-centered

- .mar[Random dyad sampling]
`$$R_{ij} \sim^{iid} \mathcal{B}(\rho)$$`
- .mnar[Double standard sampling]
`$$\begin{aligned}R_{ij} | Y_{ij}=1 & \sim^{ind} \mathcal{B}(\rho_1) \\ R_{ij} | Y_{ij} = 0 & \sim^{ind} \mathcal{B}(\rho_0)\end{aligned}$$`
- .mnar[Block dyad sampling]
`$$R_{ij}|Z_i, Z_j \sim^{ind} \mathcal{B}(\rho_{Z_i Z_j})$$`
]

.pull-right[
### Node-centered

- .mar[Node sampling]
`$$S_{i} \sim^{iid} \mathcal{B}(\rho)$$`
- .mnar[Degree sampling]
`$$\begin{aligned} S_{i}|D_i & \sim^{ind} \mathcal{B}(\mathrm{logistic}(a + b D_i)) \\ D_i & = \sum_j Y_{ij}\end{aligned}$$`
- .mnar[Block node sampling]
`$$S_{i}|Z_i \sim^{ind} \mathcal{B}(\rho_{Z_i})$$`
]

---

# Observation process: .small[illustration]

We first generate a community-shaped network

<small>
```r
## SBM parameters
N <- 300                               # number of nodes
K <- 3                                 # number of clusters
alpha <- rep(1, K)/K                   # block proportions
pi <- list(mean = diag(.45, K) + .05)  # connectivity matrix
## simulate an undirected binary SBM
sbm <- sbm::sampleSimpleSBM(N, alpha, pi)
plot(sbm)
```

<img src="slides_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
</small>

---

# Observation process: .small[sample network data]

We consider some sampling designs and their associated parameters

```r
sampling_parameters <- list(
  "dyad" = .3,
  "node" = .3,
  "double-standard" = c(0.2, 0.6),
  "block-node" = c(.3, .8, .5),
  "block-dyad" = pi$mean,
  "degree" = c(.1, .2)
)
observed_networks <- list()
for (sampling in names(sampling_parameters)) {
  observed_networks[[sampling]] <- missSBM::observeNetwork(
    adjacencyMatrix = sbm$networkData,
    sampling        = sampling,
    parameters      = sampling_parameters[[sampling]],
    clusters        = sbm$memberships
  )
}
```

---

# Observation process: output

<img src="slides_files/figure-html/plot-samplings1-1.png" style="display: block; margin: auto;" /><img src="slides_files/figure-html/plot-samplings1-2.png" style="display: block; margin: auto;" />
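As a quick sanity check (a small sketch reusing `observed_networks` from the previous slide), the share of dyads actually observed under each design:

```r
## proportion of observed (non-NA) dyads per sampling design
sapply(observed_networks, function(Y) mean(!is.na(Y)))
```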
---

# Identifiability

We build on the identifiability proof of [Cel+12] for the fully observed SBM (sort the marginal probabilities into a Vandermonde matrix, which is invertible, so that the parameters `\(\pi, \alpha\)` can be expressed as functions of these marginal probabilities).

### .content-box-red[SBM observed under MAR samplings (node/dyad-centered)]

.content-box-yellow[
Let `\(n\geq 2K\)` and assume that for any `\(1\leq k \leq K\)`, `\(\rho>0\)`, `\(\alpha_k >0\)` and the coordinates of `\(\pi \cdot \alpha\)` are pairwise distinct. Then, under dyad (resp. node) sampling, the SBM parameters are identifiable w.r.t. the distribution of the observed part of the SBM, up to label switching.
]

### .content-box-red[SBM observed under block sampling]

.content-box-yellow[
Let `\(n\geq 2K\)` and assume that for any `\(1\leq k \leq K\)`, `\(\rho_k>0\)`, `\(\alpha_k >0\)` and the coordinates of `\(\pi \cdot \alpha\)` are pairwise distinct. If the coordinates of `\(( \sum_k \pi_{1k} \rho_k \alpha_k, \dots, \sum_k \pi_{Kk} \rho_k \alpha_k)\)` are pairwise distinct, then, under block sampling, `\(\theta\)` and `\(\beta\)` are identifiable w.r.t. the distributions of the SBM and the sampling, up to label switching.
]

Identifiability of the SBM under double-standard and degree samplings: still open.

---

# .small[Inference of SBM from an observed network: .mar[MAR]]

### Setting

We now need to estimate

- the SBM parameters `\(\theta = \{\boldsymbol\alpha, \boldsymbol\Pi\}\)`
- the sampling parameters `\(\beta\)` (e.g., `\(\rho\)`, or `\(\rho_k\)`, etc., depending on the design)

### MAR case

Since

`$$p_{\theta, \beta}(Y^o,R) = p_{\theta}(Y^o)p_{\beta}(R|Y^o),$$`

we just have to .content-box-red[perform inference on the observed part of the data] `\(\rightsquigarrow\)` "usual" V-EM (with the possibility of reducing the memory footprint by sparsely encoding both `\(0\)` and `\(\texttt{NA}\)`).
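For instance (a sketch reusing the dyad-sampled network simulated earlier, mirroring the calls shown on the other slides), MAR inference is a single call:

```r
## fit an SBM on the network observed under random dyad sampling (MAR)
fit_mar <- estimateMissSBM(observed_networks$dyad, 1:5, "dyad")
fit_mar$bestModel$fittedSBM$connectParam
```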
---

# .small[Inference of SBM from an observed network: .mnar[MNAR]]

### Variational approximation

To evaluate `\(\mathbb{E}_{Z, Y^m | Y^o, R}\big(\cdot\big)\)`, the distribution `\(p_{\theta, \beta}(Z, Y^m | Y^o, R)\)` is approximated by

`$$\begin{aligned} q_\psi(Z, Y^m) = \prod_{i=1}^{n} m(Z_i;\tau_{i}) \prod_{Y_{ij} \in Y^m} b(Y_{ij};\nu_{ij}) = \prod_{i=1}^n\prod_{k=1}^K (\tau_{ik})^{\mathbf{1}_{\{Z_i = k\}}} \cdot \prod_{Y_{ij} \in Y^m} \nu_{ij}^{Y_{ij}}(1-\nu_{ij})^{1-Y_{ij}} \end{aligned}$$`

where `\(\psi = \{(\nu_{ij}), (\tau_{ik})\}\)` are the variational parameters to be optimized:

- `\(\tau_{ik}\)`, the posterior membership probabilities, are (almost) generic to any sampling design
- `\(\nu_{ij}\)`, the imputation values, are specific to the sampling design

### M-step

- `\(\beta\)`, the sampling parameters, are specific to the design
- `\(\theta = (\boldsymbol\alpha, \boldsymbol\Pi)\)` is generic:

`$$\hat{\alpha}_k=\frac{1}{n}\sum_i \hat{\tau}_{ik}, \qquad \hat{\pi}_{k\ell}=\frac{\sum_{(i,j)\in Y^o}\hat{\tau}_{ik}\hat{\tau}_{j\ell}Y_{ij} + \sum_{(i,j)\in Y^m}\hat{\tau}_{ik}\hat{\tau}_{j\ell}\hat{\nu}_{ij}}{\sum_{(i,j)}\hat{\tau}_{ik}\hat{\tau}_{j\ell}}.$$`

---

# .small[General Variational EM for MNAR inference]

Computations essentially separate into fitting the SBM and fitting the sampling design:

.content-box-yellow[
**Initialize** `\(\tau^{0}\)`, `\(\nu^{0}\)` and `\(\beta^{0}\)`

**Repeat**

`$$\begin{array}{lclrr} \theta^{h+1} & = & \arg\max_{\theta} J\left(Y^o,R ;\ \tau^{h},\nu^{h},\beta^{h},\theta \right) & \text{M-step a)} & \text{SBM} \\ \beta^{h+1} & = & \arg\max_{\beta} J\left(Y^o,R; \ \tau^{h},\nu^{h},\beta,\theta^{h+1} \right) & \text{M-step b)} & \text{Sampling} \\ \tau^{h+1} & = & \arg\max_{\tau} J\left(Y^o,R; \ \tau,\nu^{h},\beta^{h+1}, \theta^{h+1} \right) & \text{VE-step a)} & \text{SBM} \\ \nu^{h+1} & = & \arg\max_{\nu} J\left(Y^o,R ; \tau^{h+1},\nu,\beta^{h+1},\theta^{h+1} \right) & \text{VE-step b)} & \text{Sampling} \\ \end{array}$$`

**Until** `\(\left\|\theta^{h+1} - \theta^{h}\right\| < \varepsilon\)`
]

where we have the following decomposition:

`$$\begin{aligned} J(Y^o,R) & = \mathbb{E}_{q_{\psi}} [\log p_{\theta,\beta}(Y^o, R, Y^m, Z)] + \mathcal{H}(q_{\psi}(Z, Y^m))\\ & = \mathbb{E}_{q_{\psi}} [\log p_{\beta}(R | Y^o, Y^m, Z)] + \mathbb{E}_{q_{\tau}} [\log p_{\theta}(Y^o | Z)] + \mathbb{E}_{q_{\nu,\tau}} [\log p_{\theta}(Y^m | Z)] \\ & \qquad \qquad + \mathcal{H}(q_{\tau}(Z)) + \mathcal{H}(q_{\nu}(Y^m)) \end{aligned}$$`

---

# Design-specific updates

### Example: block-dyad sampling

Recall that

`$$R_{ij}|Z_i, Z_j \sim^{ind} \mathcal{B}(\rho_{Z_i Z_j}).$$`

The expected log-likelihood of the sampling w.r.t. the variational approximation `\(q_\psi\)` is

`$$\mathbb{E}_{q_{\psi}} [\log p_{\beta}(R | Y^o, Y^m, Z)] = \sum_{(i,j) \in Y^o} \sum_{k,\ell} \tau_{ik}\tau_{j\ell} \log(\rho_{k\ell}) + \sum_{(i,j)\in Y^m} \sum_{k,\ell} \tau_{ik}\tau_{j\ell} \log(1-\rho_{k\ell}),$$`

from which we derive

`$$\hat{\rho}_{k\ell}=\frac{\sum_{(i,j)\in Y^o} \tau_{ik}\tau_{j\ell} }{\sum_{(i,j)} \tau_{ik}\tau_{j\ell} }$$`

and

`$$\hat{\nu}_{ij} = \mathrm{logistic} \left( \sum_{k,\ell} \tau_{ik}\tau_{j\ell} \log\left(\frac{\pi_{k\ell}}{1-\pi_{k\ell}}\right) \right)$$`
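In matrix form, these two updates are one-liners. A sketch, assuming `tau` (`n` x `K`), the connectivity `pi` (`K` x `K`) and the binary observation matrix `R` (1 = observed dyad) are at hand:

```r
ones    <- matrix(1, nrow(tau), nrow(tau))
rho_hat <- (t(tau) %*% R %*% tau) / (t(tau) %*% ones %*% tau)  # block-dyad sampling rates
nu_hat  <- plogis(tau %*% log(pi / (1 - pi)) %*% t(tau))       # imputation probabilities
```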
---

# Variational estimators: .small[theoretical guarantees]

### Consistency & asymptotic normality

Inspired by the two following papers:

- [Bic+13] deal with the binary SBM under "sparse" conditions
- [BKM17] deal with the fully observed LBM with emission distribution in the one-dimensional exponential family

### .content-box-red[Theorem [MT20]]

.content-box-yellow[
Consider an SBM with `\(K\)` blocks and emission distribution in the *one-dimensional exponential family*, observed under *random dyad sampling* and the identifiability conditions stated above. Then the maximum-likelihood and variational estimators are *consistent* and *asymptotically normal*, with explicit asymptotic variance/covariance matrix.
]

`\(\rightarrow\)` Only for MAR sampling!

---

# SBM with covariates and missing data

Consider `\(m\)` external covariates `\(X_{ij}\in\mathbb{R}^m\)` defined at the dyad level. For covariates `\(X_i\)` defined at the node level, we can use a similarity `\(\phi(X_i, X_j) \to X_{ij}\)`.

`$$\begin{aligned} Z_i & \sim^{\text{iid}} \mathcal{M}(1,\alpha), \\ Y_{ij} \ | \ \{Z_i, Z_j, X_{ij} \} & \sim^{\text{ind}} \mathcal{B}(\text{logistic}(\pi_{Z_i Z_j} + \eta^\top X_{ij}))\\ \end{aligned}$$`

### Dyad-centered sampling

Let `\(\delta \in \mathbb{R}\)`, `\(\kappa \in \mathbb{R}^m\)`. The probability of observing a dyad is

`$$\mathbb{P}(R_{ij} = 1 |X_{ij}) = \text{logistic}(\delta + \kappa^\top X_{ij}).$$`

### Node-centered sampling

Let `\(\delta \in \mathbb{R}\)` and `\(\kappa \in \mathbb{R}^m\)`. The probability of observing all the dyads involving node `\(i\)` is

`$$\mathbb{P}(S_{i} = 1 |X_{i}) = \text{logistic}(\delta + \kappa^\top X_i).$$`

.content-box-red[These sampling designs are MNAR; however, conditionally on `\(X\)`, they are MCAR]

---
class: inverse, middle

# Illustrations

1. Numerical study of MNAR vs MAR
2. French blogosphere
3. PPI ER (ESR1) ego network

<!-- Illustration MNAR -->

---

# Block-dyad sampling

Consider a community-like network:

.pull-left[
```r
n <- 200
alpha <- c(1/3,1/3,1/3)
pi <- .15 + diag(3) * .25
theta <- list(mean = pi)
pi
```
]

.pull-right[
```
##      [,1] [,2] [,3]
## [1,] 0.40 0.15 0.15
## [2,] 0.15 0.40 0.15
## [3,] 0.15 0.15 0.40
```
]

Define sampling matrices with decreasing agreement with `\(\pi\)`

<img src="slides_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

# Performance of MAR vs MNAR

<img src="slides_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

100 replicates. The closer `\(\delta\)` is to zero, the closer we are to the MAR case.

<!-- Illustration blogosphere -->

---

# Back to the French blogosphere

### Control the network observation

- We sample the original network to obtain a partly observed blog network
- We sample more heavily in the highly connected communities

```r
samplingParameters <-
  .2 + ifelse(sbm_full$bestModel$fittedSBM$connectParam$mean < .1, 0, .6)
blog_obs <- observeNetwork(
  adjacencyMatrix = blog,
  sampling        = "block-dyad",
  parameters      = samplingParameters,
  clusters        = sbm_full$bestModel$fittedSBM$memberships
)
```

---

# French blogosphere sampled

<img src="slides_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

# .small[Compare MAR and MNAR with the model selection criterion]

```r
sbm_mar  <- estimateMissSBM(blog_obs, blocks, "dyad")
sbm_mnar <- estimateMissSBM(blog_obs, blocks, "block-dyad")
```

<img src="slides_files/figure-html/plotICL-fblog-1.png" style="display: block; margin: auto;" />

---

# Validation?
Compare the clusterings of the different models with the original *party* classification:

<small>
```r
ARI(party, sbm_full$bestModel$fittedSBM$memberships)
```

```
## [1] 0.463709
```

```r
ARI(party, sbm_mar$bestModel$fittedSBM$memberships)
```

```
## [1] 0.4169295
```

```r
ARI(party, sbm_mnar$bestModel$fittedSBM$memberships)
```

```
## [1] 0.5040523
```

```r
ARI(sbm_mnar$bestModel$fittedSBM$memberships, sbm_full$bestModel$fittedSBM$memberships)
```

```
## [1] 0.7266251
```

```r
ARI(sbm_mnar$bestModel$fittedSBM$memberships, sbm_mar$bestModel$fittedSBM$memberships)
```

```
## [1] 0.5840606
```
</small>

---

# French blogosphere: .small[original classification]

<img src="slides_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---

# French blogosphere: .small[MAR clustering]

<img src="slides_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---

# French blogosphere: .small[MNAR clustering]

<img src="slides_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

<!-- PPI-network -->

---

# Protein-Protein Network

### The data

- The PPI network in the neighborhood of ER (ESR1), composed of 741 proteins
- Valued dyads: `\(\omega_{ij} \in (0,1]\)` reflecting the level of confidence in the interaction
- Binarization of the network with a threshold `\(\gamma\)`:

`$$\mathbf{Y}^\gamma = \left(Y^\gamma_{ij}\right) = \left\{ \begin{array}{ll} 1 & \text{if } \omega_{ij} > 1-\gamma, \\ \texttt{NA} & \text{if } \gamma \leq \omega_{ij} \leq 1-\gamma, \\ 0 & \text{if } \omega_{ij} < \gamma. \\ \end{array} \right.$$`

### Questions

- Which `\(\gamma\)`?
- Which sampling design: MAR or MNAR?

---

# Model Selection

<img src="slides_files/ICL-PPI.png" width="1855" style="display: block; margin: auto;" />

- The ICL criterion selects `\(\gamma = .35\)` and MNAR sampling as those that best fit the data
- Number of selected blocks: `\(11\)` (MAR) and `\(13\)` (MNAR)
- ARI between the MNAR and MAR clusterings: `\(.39\)`
- The MNAR clustering is somewhat coherent with gene ontology

---

# Imputation

<img src="slides_files/net_esr1_NA_nu_both.jpg" width="2052" style="display: block; margin: auto;" />

---

# Gene Ontology (GO) enrichment analysis

*i.e.*, identifying classes of genes over-represented in a large set of genes.

MNAR finds `\(13\)` significant biological processes (MAR: only `\(1\)`)

<img src="slides_files/ontology.png" width="1248" style="display: block; margin: auto;" />

---

# Conclusion

## Perspectives/ongoing work

- Sampling
  - study robustness (does block sampling "include" double-standard sampling?)
- Other models
  - degree-corrected SBM
  - (ZI-)Poisson emission law
  - simple SBM `\(\to\)` bipartite SBM (a.k.a. latent block models)
- Other algorithms
  - SGD algorithms + PyTorch framework
  - 'exact' ICL maximization (with É. Côme)

## Advertisement

[https://computo.sfds.asso.fr](https://computo.sfds.asso.fr), a new journal promoting reproducible research

## THANK YOU

---

# References

Barbillon, P., J. Chiquet, and T. Tabouy (2022). "missSBM: An R Package for Handling Missing Values in the Stochastic Block Model". In: _Journal of Statistical Software_.

Bickel, P., D. Choi, X. Chang, et al. (2013). "Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels". In: _Ann. Stat._ 41.4, pp. 1922-1943.

Brault, V., C. Keribin, and M. Mariadassou (2017). "Consistency and Asymptotic Normality of Latent Blocks Model Estimators". Working paper or preprint.

Celisse, A., J. Daudin, L. Pierre, et al. (2012).
"Consistency of maximum-likelihood and variational estimators in the stochastic block model". In: _Electron. J. Stat._ 6, pp. 1847-1899. Little, R. J. and D. B. Rubin (2014). _Statistical analysis with missing data_. John Wiley & Sons. Mariadassou, M. and T. Tabouy (2020). "Consistency and asymptotic normality of stochastic block models estimators from sampled data". In: _Electronic Journal of Statistics_ 14.2, pp. 3672-3704. Tabouy, T., P. Barbillon, and J. Chiquet (2019). "Variational Inference for Stochastic Block Models from Sampled Data". In: _Journal of the American Statistical Association_ 0.ja, pp. 1-20. DOI: [10.1080/01621459.2018.1562934](https://doi.org/10.1080%2F01621459.2018.1562934).