Efficient Inference of parallel partitioned hybrid-Vision Transformers
2025 (English). In: 2025 Cyber-Physical Systems and Internet-of-Things Week (CPS-IoT Week Workshops), Association for Computing Machinery (ACM), 2025, article id 5. Conference paper, published paper (refereed).
Abstract [en]
Recent advancements have explored parallel partitioning of Transformer- and Convolutional Neural Network (CNN)-based models across networks of edge devices to accelerate deep neural network (DNN) inference. However, partitioning strategies for hybrid Vision Transformers, models that integrate convolutional and attention layers, remain underdeveloped, particularly in scenarios with low communication data rates. This work introduces a novel partitioning scheme tailored to hybrid Vision Transformers that addresses communication latency through efficient compressed communication and model size reduction. The proposed approach incorporates a trainable quantization and JPEG compression pipeline to minimize communication overhead. We evaluate our scheme on two state-of-the-art architectures, EdgeViT and CoAtNet. For a communication data rate of 10 MB/s and partitioning across 12 devices, we achieve up to a 1.74x speed-up and a 5.34x model size reduction for EdgeViT-XXS. Similarly, on a customized CoAtNet-0, our method achieves a 1.40x speed-up and a 2.66x reduction in model size, demonstrating the efficacy of the approach in real-world scenarios.
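The compressed-communication idea in the abstract can be illustrated with a short sketch: quantize an intermediate activation map to 8 bits and JPEG-encode it before sending it to the next device in the partition. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names are hypothetical, the example handles a single-channel 2-D map, and the actual pipeline trains its quantization parameters end-to-end rather than deriving them per tensor (Python with NumPy and Pillow):

```python
# Hypothetical sketch (not the authors' code): uniform 8-bit quantization
# plus JPEG encoding of an intermediate activation map before transmission.
import io
import numpy as np
from PIL import Image

def compress_activation(act: np.ndarray, quality: int = 75) -> tuple[bytes, float, float]:
    """Quantize a 2-D float activation map to uint8 and JPEG-encode it."""
    lo, hi = float(act.min()), float(act.max())
    scale = (hi - lo) or 1.0  # avoid division by zero for constant maps
    q = np.round((act - lo) / scale * 255.0).astype(np.uint8)  # uniform 8-bit quantization
    buf = io.BytesIO()
    Image.fromarray(q, mode="L").save(buf, format="JPEG", quality=quality)
    return buf.getvalue(), lo, scale  # receiver needs (lo, scale) to dequantize

def decompress_activation(payload: bytes, lo: float, scale: float) -> np.ndarray:
    """Invert JPEG decoding and dequantization (a lossy round trip)."""
    q = np.asarray(Image.open(io.BytesIO(payload)), dtype=np.float32)
    return q / 255.0 * scale + lo

# Round trip on a fake 64x64 feature map:
act = np.random.randn(64, 64).astype(np.float32)
payload, lo, scale = compress_activation(act)
rec = decompress_activation(payload, lo, scale)
print(f"{act.nbytes} B -> {len(payload)} B, max err {np.abs(act - rec).max():.3f}")
```

A fuller pipeline would compress every channel of the activation tensor (for example, by tiling channels into one image) and tune the JPEG quality to trade reconstruction error against the available link budget, such as the 10 MB/s rate evaluated in the paper.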
Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2025. Article id 5.
Keywords [en]
Partitioning, Parallelization, IoT, ViT, hybrid Transformers
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:miun:diva-54746
DOI: 10.1145/3722567.3727849
ISI: 001498351300005
ISBN: 9798400716119 (print)
OAI: oai:DiVA.org:miun-54746
DiVA, id: diva2:1975888
Conference
4th International Workshop on Real-time and Intelligent Edge Computing (RAGE), May 6-9, 2025, Irvine, CA
Available from: 2025-06-24. Created: 2025-06-24. Last updated: 2025-09-25. Bibliographically approved.