VAE-Based Compression of Light Field Images Using Disentangled Latent Modeling and Perceptual Quality Assessment
2026 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]
The demand for immersive visual experiences in applications like virtual reality and telepresence has highlighted the limitations of traditional 2D imaging. Light field (LF) imaging addresses this by capturing a 4D representation of a scene, encoding both spatial (texture) and angular (viewpoint) information. This richness enables true parallax and depth perception but creates a significant data bottleneck, as the massive data volumes are a major obstacle to efficient storage, transmission, and real-time processing. Conventional compression methods often treat LF data as a simple sequence of images, failing to effectively exploit the underlying spatial-angular structure, which leads to sub-optimal performance.
This thesis addresses the challenge of efficient LF compression by developing a principled, learning-based framework centered on spatial-angular disentanglement. The core of the work is a series of Variational Autoencoder (VAE)-based architectures that explicitly separate spatial and angular features into distinct latent representations. This approach provides greater flexibility and efficiency by allowing each domain to be modeled according to its unique statistical properties. The foundational VAE model is progressively advanced through two key contributions: first, the integration of dual-hyperprior entropy models to learn tailored probability distributions for each latent stream, improving rate-distortion performance; and second, the introduction of an information-theoretic regularizer to ensure robust feature separation. Finally, a lightweight, modular compression pipeline is proposed to further compress these latent representations without requiring network retraining.
The proposed methods were rigorously evaluated on standard public LF datasets as well as a novel spherical LF dataset created as part of this research to support immersive telepresence scenarios. Objective evaluations demonstrate that the disentangled frameworks achieve superior rate-distortion performance, with significant Bjøntegaard Delta Peak Signal-to-Noise Ratio (BD-PSNR) gains over state-of-the-art learning-based and traditional codecs. Crucially, the methods also offer substantially faster encoding and decoding times, a critical requirement for real-time applications. To assess perceptual performance, a formal subjective quality study was conducted, which confirmed that the proposed methods deliver improved visual quality, particularly in preserving angular consistency and reducing artifacts that impair the immersive experience.
In conclusion, this thesis demonstrates that explicitly disentangling, modeling, and compressing the spatial and angular components of light fields is a highly effective strategy. The developed frameworks and tools advance the state of the art by providing practical and scalable solutions that balance compression efficiency, computational speed, and perceptual quality. This work makes a significant contribution toward the feasibility of using high-quality LF imaging in bandwidth-constrained immersive applications. This compilation thesis is based on the contributions of six peer-reviewed scientific publications.
Place, publisher, year, edition, pages
Sundsvall: Mid Sweden University, 2026, p. 70
Series
Mid Sweden University doctoral thesis, ISSN 1652-893X ; 440
Series
École Doctoral: Centre Inria de l’Université de Rennes ; 601
National subject category
Computer Vision and Learning Systems
Identifiers
URN: urn:nbn:se:miun:diva-56370
ISBN: 978-91-90017-39-5 (print)
OAI: oai:DiVA.org:miun-56370
DiVA, id: diva2:2025552
Public defence
2026-01-22, L111, Holmgatan 10, Sundsvall, 09:15 (English)
Opponent
Supervisors
Note
As part of a double degree with Université de Rennes.
2026-01-09, 2026-01-07, 2026-01-19. Bibliographically approved
List of papers