As visual computing advances across domains such as image editing, autonomous driving, and digital twins, the need for high-fidelity yet computationally efficient representations has become increasingly critical. Traditional 2D models are constrained by fixed grids, limiting their adaptability and compactness, while emerging 3D techniques often deliver realism at the cost of excessive training time, memory usage, and energy consumption. This thesis tackles a central challenge across both 2D and 3D domains: how to construct scalable, high-quality visual representations without succumbing to inefficiency.
We examine Steered Mixture-of-Experts (SMoE)—a modular, kernel-based architecture that promises localized modeling and interpretability. Yet despite its expressive power, SMoE has historically suffered from impractical training regimes, bloated parameter counts, and poor support for high-dimensional data. This work pursues a cohesive answer to three research questions, aimed at making SMoE fast, compact, and capable of handling 3D visual content.
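To make the SMoE formulation concrete, the following is a minimal sketch of kernel-gated reconstruction: each pixel value is a convex combination of expert outputs, weighted by normalized steered (anisotropic) Gaussian kernels. The function and variable names are illustrative, and constant experts are assumed for brevity (full SMoE typically uses affine experts).

```python
import numpy as np

def smoe_reconstruct(coords, centers, covs, expert_vals):
    """Reconstruct pixel values as a kernel-gated mixture of experts.

    coords:      (P, 2) pixel coordinates
    centers:     (K, 2) kernel means
    covs:        (K, 2, 2) kernel covariances (the "steering")
    expert_vals: (K,) constant expert outputs (e.g. gray levels)
    """
    diffs = coords[None, :, :] - centers[:, None, :]          # (K, P, 2)
    inv = np.linalg.inv(covs)                                 # (K, 2, 2)
    mahal = np.einsum('kpi,kij,kpj->kp', diffs, inv, diffs)   # (K, P)
    kernels = np.exp(-0.5 * mahal)                            # unnormalized gates
    gates = kernels / (kernels.sum(axis=0, keepdims=True) + 1e-12)
    return gates.T @ expert_vals                              # (P,) reconstruction

# Tiny example: two kernels splitting a 1x4 "image" into left/right regions.
xs = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
centers = np.array([[0.0, 0.0], [3.0, 0.0]])
covs = np.stack([np.eye(2), np.eye(2)])
vals = np.array([0.0, 1.0])
y = smoe_reconstruct(xs, centers, covs, vals)
```

Because the gates are normalized, each kernel dominates near its own center, which is what gives SMoE its localized, interpretable structure.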
First, we confront the long-ignored problem of initialization. Through a segmentation-based method that aligns expert kernels with semantic image regions, we drastically reduce redundancy and training duration, producing models that are both compact and structurally aligned with the data. Second, we tackle the inefficiency of gradient-based optimization by introducing a rasterized training scheme, adapted from Gaussian splatting techniques in 3D rendering. By partitioning images into blocks and activating only relevant kernels during each optimization step, we reduce the computational footprint by an order of magnitude without sacrificing accuracy. Third, we generalize SMoE to 3D by reparameterizing its spatial kernels and integrating splatting-based differentiable rendering. This extension maintains the compactness of SMoE while supporting high-quality scene reconstruction, even under sparse supervision.
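The block-partitioned training idea in the second contribution can be illustrated with a simple culling table: the image is tiled into fixed-size blocks, and each block records only the kernels whose effective support overlaps it, so an optimization step on a block evaluates a small subset of kernels rather than all of them. This is an illustrative sketch with hypothetical names, assuming an axis-aligned bounding box per kernel (e.g. a 3-sigma radius) as the overlap test.

```python
import numpy as np

def assign_kernels_to_blocks(centers, radii, img_w, img_h, block=16):
    """Map each image block to the kernels whose footprint overlaps it.

    centers: (K, 2) kernel centers in pixel coordinates
    radii:   (K,)   effective support radius per kernel (e.g. 3 sigma)
    Returns a dict {(bx, by): [kernel indices]} so that a training step
    on a block only evaluates the kernels listed for that block.
    """
    nbx = (img_w + block - 1) // block
    nby = (img_h + block - 1) // block
    table = {(bx, by): [] for bx in range(nbx) for by in range(nby)}
    for k, ((cx, cy), r) in enumerate(zip(centers, radii)):
        # Blocks intersected by the kernel's bounding box, clamped to the image.
        x0 = max(int((cx - r) // block), 0)
        x1 = min(int((cx + r) // block), nbx - 1)
        y0 = max(int((cy - r) // block), 0)
        y1 = min(int((cy + r) // block), nby - 1)
        for bx in range(x0, x1 + 1):
            for by in range(y0, y1 + 1):
                table[(bx, by)].append(k)
    return table
```

With compact kernels, most blocks list only a handful of kernels, which is where the order-of-magnitude reduction in per-step compute comes from.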
Experimental results confirm that our methods outperform baseline SMoE implementations in both training speed and reconstruction quality across 2D and 3D tasks, and further surpass 3D Gaussian Splatting (3DGS) and related Gaussian-based approaches. Moreover, our approach enables previously infeasible applications—real-time training, compact deployment, and scalable modeling of complex scenes.
This thesis transforms SMoE from a theoretically elegant yet impractical construct into a viable backbone for efficient, high-fidelity visual data representation. By grounding mixture models in perceptual structure and exploiting block-level sparsity, we chart a broader design principle for structure-aware, rasterization-friendly learning systems.
Berlin: Technische Universität Berlin, 2026, p. 121
The thesis is part of a double PhD degree between Technische Universität Berlin and Mid Sweden University, and was published at TU Berlin.
At the time of the doctoral defence, the following papers were unpublished: papers 3 and 5 (manuscripts).