In what follows I'll try to explain my basic understanding and interepretation of the semi-supervised framework based on Variational Autoencoders, as described in [1]. I shall assume a vector notation where bold symbols $\mathbf{a}$ represent vectors, whose $j$-th component can be represented as $a_j$.
The starting point of the framework is to consider a dataset \(D = S \cup U\), where:
\(S = \{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_n, \mathbf{y}_n)\}\),
\(U = \{\mathbf{x}_{n+1}, \ldots, \mathbf{x}_{n+m}\}\), with \(\mathbf{x}_i \in \mathbb{R}^N\) and \(\mathbf{y}_i \in \{0,1\}^C\) represents a one-hot encoding of a class in \(\{1, \ldots, C\}\).