Efficacy of lightweight Vision Transformers in diagnosis of pneumonia

Research Article | DOI: https://doi.org/10.37579/2834-5142/020

  • Muhammad Tayyeb Bukhari

Beaconhouse Main Campus Abbottabad, Khyber Pakhtunkhwa, Pakistan.

*Corresponding Author: Muhammad Tayyeb Bukhari. Beaconhouse Main Campus Abbottabad, Khyber Pakhtunkhwa, Pakistan.

Citation: Muhammad T. Bukhari, (2024), Efficacy of lightweight Vision Transformers in diagnosis of pneumonia, International Journal of Clinical Nephrology. 3(1); DOI:10.37579/2834-5142/020

Copyright: © 2024, Muhammad Tayyeb Bukhari. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Received: 01 February 2024 | Accepted: 15 February 2024 | Published: 26 February 2024

Keywords: rheumatic autoimmune disease; autoimmune rheumatic diseases

Abstract

Pneumonia is one of the leading causes of death in children under five, particularly in resource-limited settings. The timely and accurate detection of pneumonia, often conducted through chest X-rays, remains a challenge due to the scarcity of trained professionals and the limitations of traditional diagnostic methods. In recent years, Artificial Intelligence (AI) models, especially Convolutional Neural Networks (CNNs), have been increasingly applied to automate pneumonia detection. However, CNN models are often computationally expensive and lack the ability to capture long-range dependencies in images, limiting their efficacy in certain medical applications. To address these limitations, lightweight hybrid Vision Transformers (ViTs), which combine the strengths of CNNs and transformers, offer a promising solution. This study compares the efficacy of two lightweight CNNs (EfficientNet Lite0 and MobileNetV3 Large) with two hybrid ViTs (MobileViT Small and EfficientFormerV2 S0) for pneumonia detection. The models were evaluated on a publicly available chest X-ray dataset using accuracy, F1 score, precision, and recall. Results show that MobileViT Small outperformed all other models in both accuracy (97.50%) and F1 score (0.9664), and that both hybrid models achieved higher recall than the CNNs, demonstrating the potential of ViT-based models for medical imaging tasks. The findings suggest that hybrid models provide superior recall, reducing false negatives, which is crucial for medical diagnostics. Further research should focus on optimizing these hybrid models to improve computational efficiency while maintaining high diagnostic performance.

2. Introduction

In 2019, pneumonia killed more than 740,000 children under the age of five, making it one of the leading causes of death in vulnerable populations. Pneumonia is an acute respiratory infection caused by a viral or bacterial pathogen that targets the lungs, in particular the alveoli, filling them with pus and fluid. Typical symptoms include cough, shortness of breath, chest pain, fatigue, and fever (World Health Organization [WHO], 2022). In regions with limited healthcare resources, the timely and accurate diagnosis of pneumonia poses significant challenges due to a shortage of trained professionals and inadequate facilities (Simkovich et al., 2021). Pneumonia is commonly detected using chest X-rays; however, traditional methods are time-consuming and prone to human error (Alapat et al., 2022). This is where automated image analysis comes into play: Artificial Intelligence (AI) tools, such as Convolutional Neural Networks (CNNs), are increasingly being applied to improve accuracy in detecting pneumonia, particularly in resource-limited settings (An et al., 2024). Many powerful AI models exist, but because they are computationally expensive, they are not suited to resource-constrained environments (Jia et al., 2023). In such cases, lightweight models like MobileNet, which balance efficiency and accuracy, are employed (Trivedi & Gupta, 2021). The purpose of this study is to compare the efficacy of lightweight hybrid Vision Transformers (ViTs) and CNNs for pneumonia detection, focusing on both performance and efficiency, with the hypothesis that hybrid ViTs, which combine elements of CNNs and ViTs, will outperform traditional CNNs.

3. Materials and Methods

3.1 Dataset

The dataset used was the Mendeley chest X-ray pneumonia dataset (Kermany et al., 2018). Its training images, which totalled 5216 (of which 3875 were labelled as pneumonia and 1341 were labelled as normal), were used for training, validation, and testing. The images were split into training, validation, and test datasets in the ratio 7:2:1.
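To make the 7:2:1 ratio concrete, here is a minimal sketch of a shuffled index split. The helper name `split_indices`, the use of seed 0 for the shuffle, and the rounding rule (floor for the first two parts, remainder to the last) are assumptions for illustration; the source does not state how the split was implemented or rounded.

```python
import random


def split_indices(n_items, ratios=(7, 2, 1), seed=0):
    """Shuffle indices 0..n_items-1 and cut them into consecutive parts by ratio."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    total = sum(ratios)
    # Floor every part except the last; the last part absorbs the remainder.
    # This rounding rule is an assumption, not taken from the study.
    sizes = [n_items * r // total for r in ratios[:-1]]
    sizes.append(n_items - sum(sizes))
    parts, start = [], 0
    for size in sizes:
        parts.append(indices[start:start + size])
        start += size
    return parts


train_idx, val_idx, test_idx = split_indices(5216)
print(len(train_idx), len(val_idx), len(test_idx))  # 3651 1043 522
```

Under a different rounding rule the part sizes can differ by an image or two, so these counts should not be read as the exact sizes used in the study.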

3.2 Data Augmentation

The seed was set to 0 to ensure reproducibility using the following code:
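A minimal sketch of such seeding is given here, assuming a Python stack. The helper name `seed_everything` is hypothetical, and the NumPy and PyTorch calls are applied only if those libraries happen to be installed; the source does not say which generators were seeded.

```python
import random


def seed_everything(seed=0):
    """Seed the random number generators that are available in this environment."""
    random.seed(seed)  # Python's built-in generator
    try:
        import numpy as np
        np.random.seed(seed)  # NumPy's global generator, if NumPy is installed
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)  # CPU generator
        torch.cuda.manual_seed_all(seed)  # all GPU generators; harmless without a GPU
    except ImportError:
        pass


seed_everything(0)
```

Seeding makes the shuffling and random augmentations repeatable, although some GPU operations can remain non-deterministic unless further settings are applied.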

3.2.1 Training dataset augmentation

Various augmentations from the albumentations (Buslaev et al., 2020) library were implemented to ensure model generalisation by introducing variability in the images. The augmentations applied are as follows:

• Rotation: The images were rotated by up to ±180° with a probability of 0.5.

• Affine: Scaling between 0.9 and 1.1, translation of up to 10%, and shear between -2 and 2 were applied with a probability of 0.5.

• Flipping: Images were randomly flipped horizontally and vertically, each with a probability of 0.5.

• Resizing: All images were resized to 224x224 pixels.

• Normalization: The RGB values of pixels were mapped to be in the range 0 to 1.
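The stochastic part of this pipeline can be illustrated without any image library. The sketch below only samples which transformations would be applied to one image and with which parameters; it does not transform pixels and is not the albumentations code used in the study. The function name, the dictionary keys, a 224x224 target size (as in the validation pipeline), and 8-bit input images (so that dividing by 255 maps values into 0 to 1) are all assumptions.

```python
import random


def sample_train_augmentation(rng):
    """Draw one random augmentation configuration from the listed ranges."""
    params = {
        "resize": (224, 224),        # always applied; assumed target size
        "pixel_scale": 1.0 / 255.0,  # always applied; assumes 8-bit images
    }
    if rng.random() < 0.5:  # rotation, probability 0.5
        params["rotate_deg"] = rng.uniform(-180.0, 180.0)
    if rng.random() < 0.5:  # affine transform, probability 0.5
        params["affine"] = {
            "scale": rng.uniform(0.9, 1.1),
            "translate_frac": (rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)),
            "shear": rng.uniform(-2.0, 2.0),  # units as given in the text (not stated)
        }
    params["hflip"] = rng.random() < 0.5  # horizontal flip, probability 0.5
    params["vflip"] = rng.random() < 0.5  # vertical flip, probability 0.5
    return params


rng = random.Random(0)
print(sample_train_augmentation(rng))
```

Because each image receives a fresh draw in every epoch, the model rarely sees exactly the same input twice, which is the source of the variability described above.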

3.2.2 Validation and Test dataset augmentation

The transformation pipeline for the validation dataset and test dataset was simpler and was as follows:

• Resizing: Images were resized to 224x224 pixels.

• Normalization: RGB values were normalized to be in the range 0 to 1.

3.3 Model Architectures

Four lightweight image classification models were employed in this study:

• EfficientNet Lite0

• MobileNetV3 Large

• MobileViT Small

• EfficientFormerV2 S0

3.3.1 EfficientNet Lite0

EfficientNet Lite0 is a scaled-down version of the original EfficientNet model, created with mobile and edge devices in mind. It is built from MBConv layers based on depthwise separable convolutions; unlike the original EfficientNet, the Lite variants omit the squeeze-and-excitation blocks and replace the swish activation with ReLU6 for better hardware support. By keeping performance high while lowering the number of parameters, these layers strike a balance between accuracy and efficiency. The EfficientNet family employs a compound scaling method that uniformly scales the depth, width, and resolution of the network based on a predefined coefficient. This enables the model to perform well across various tasks with fewer computations (Tan & Le, 2019).

3.3.2 MobileNetV3 Large

MobileNetV3 Large is intended for use on mobile devices with constrained computational resources. It uses squeeze-and-excitation layers, inverted residuals, and depthwise separable convolutions. For increased accuracy at low computational cost, it also makes use of the hard-swish activation function, a cheaper approximation of swish. The MobileNetV3 design was further optimized using Neural Architecture Search (NAS) to identify the best layer configurations. The Large version gives priority to higher accuracy while keeping the lightweight design (Howard et al., 2019).

3.3.3 MobileViT Small

The goal of MobileViT, a ViT-CNN hybrid model, is to integrate the local feature extraction of CNNs with the global receptive field of Vision Transformers (ViTs). MobileViT combines convolutional layers that extract local features with transformer encoders that capture long-range dependencies in the image. Because of this combination, MobileViT offers a balance between accuracy and efficiency that is well suited to mobile applications (Mehta & Rastegari, 2021).

3.3.4 EfficientFormerV2 S0

EfficientFormerV2 is an efficient vision transformer designed for fast inference on edge devices. To achieve high performance on image classification tasks, it combines the simplicity of depthwise separable convolutions with the efficiency of transformers. EfficientFormerV2 models, such as the S0 variant, reduce memory and computational costs by using an optimized transformer block and a hierarchical design. The architecture integrates MLP (Multi-Layer Perceptron) blocks and multi-head self-attention mechanisms. The S0 version, one of the smallest models, was created especially for mobile and edge applications (Li et al., 2022).

3.4 Training and Optimization

All models were trained for 30 epochs using the following setup:

• Optimizer: The Adam optimizer with a learning rate of 1e-4 and a weight decay of 1e-5 was used to minimize the loss.

• Loss function: Binary Cross-Entropy with Logits Loss was used as the criterion, as it suits this binary classification problem.

• Scheduler: ReduceLROnPlateau was employed to reduce the learning rate by a factor of 0.1 whenever the validation loss plateaued.
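Binary cross-entropy with logits applies a sigmoid to the model's single raw output and then takes the binary cross-entropy against the 0/1 label. The following framework-free sketch shows that computation in its numerically stable form; the function names are hypothetical, and it is an illustration rather than the study's code, which presumably relied on a deep learning library's built-in loss.

```python
import math


def bce_with_logits(logit, target):
    """Binary cross-entropy for one raw score (logit) and a 0/1 target.

    Uses the stable form max(x, 0) - x*y + log(1 + exp(-|x|)), which equals
    -[y*log(sigmoid(x)) + (1-y)*log(1 - sigmoid(x))] without overflow.
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def mean_bce_with_logits(logits, targets):
    """Average the per-sample loss over a batch."""
    losses = [bce_with_logits(x, y) for x, y in zip(logits, targets)]
    return sum(losses) / len(losses)


# A logit of 0 means "no preference", so the loss is log(2) for either label.
print(round(bce_with_logits(0.0, 1), 4))  # 0.6931
```

Working on logits rather than probabilities avoids taking the logarithm of a value that has been rounded to exactly 0 or 1.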

The following code was used to configure the optimizer, the loss function, and the scheduler:

In each epoch, the model was trained on the training dataset and then evaluated on the validation dataset. The validation loss was passed to the scheduler so that the learning rate was reduced once the validation loss stopped decreasing.
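The plateau rule can be sketched in a few lines. The class below is a simplified, hypothetical stand-in for a reduce-on-plateau scheduler, not the library implementation: it omits options such as thresholds and cooldown periods, and the patience of 2 epochs in the usage example is an assumed value, since the source does not state the patience that was used.

```python
class PlateauScheduler:
    """Multiply the learning rate by `factor` once validation loss stalls."""

    def __init__(self, lr=1e-4, factor=0.1, patience=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience  # assumed value; not stated in the source
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the learning rate to use next."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Reduce only after more than `patience` epochs without improvement.
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


scheduler = PlateauScheduler(lr=1e-4, factor=0.1, patience=2)
val_losses = [0.40, 0.30, 0.31, 0.32, 0.33, 0.29]
history = [scheduler.step(loss) for loss in val_losses]
print(history)
```

In this example the loss fails to improve for three consecutive epochs after reaching 0.30, so the learning rate drops from 1e-4 to roughly 1e-5 at the fifth epoch and stays there.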

3.5 Evaluation Metrics

After training, each model was evaluated on the test set using the following metrics:

• Accuracy score: the proportion of correctly predicted labels out of the total.

• F1 score: the harmonic mean of precision and recall.

• Precision score: the ratio of true positives to the sum of true positives and false positives.

• Recall score: the ratio of true positives to the sum of true positives and false negatives.

The metrics were calculated using the methods provided in the scikit-learn library, which require a list of the predicted labels and a list of the true labels (Pedregosa et al., 2011).
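For binary labels, all four metrics follow from the counts of true and false positives and negatives; recall in particular divides the true positives by the sum of true positives and false negatives. The sketch below computes them by hand as an illustration of the definitions. The function name is hypothetical, the toy labels are invented for the example, and treating label 1 as the positive (pneumonia) class is an assumption.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example: 3 true positives, 1 false negative, 1 false positive, 3 true negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(binary_metrics(y_true, y_pred))
# {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```

A false negative here corresponds to a pneumonia case predicted as normal, which is why recall is the metric most closely tied to missed diagnoses.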

4. Results

The following results were obtained from the evaluation of the models on the test dataset and computation of the above-mentioned metrics for each of the four models.

Model                   Accuracy score   F1 score   Precision score   Recall score
EfficientNet Lite0      0.9539           0.9385     0.9241            0.9558
MobileNetV3 Large       0.9578           0.9433     0.9306            0.9583
MobileViT Small         0.9750           0.9664     0.9539            0.9808
EfficientFormerV2 S0    0.9558           0.9418     0.9234            0.9655

Table (a): Evaluation metrics on the test dataset of each model

Model                   Accuracy score   F1 score   Parameters (M)   GMACs (giga multiply-add operations)
EfficientNet Lite0      0.9539           0.9385     4.7              0.4
MobileNetV3 Large       0.9578           0.9433     5.5              0.2
MobileViT Small         0.9750           0.9664     5.6              2.0
EfficientFormerV2 S0    0.9558           0.9418     3.6              0.4

Table (b): Comparison of models in context of parameter count and computational cost

5. Discussion

5.1 Comparison of CNN and Hybrid ViT Models

The results reveal that both the CNN models and the hybrid ViT models performed well on the test dataset. The best performing model was MobileViT Small, with an accuracy score of 0.9750 and an F1 score of 0.9664; it outperformed the other models on every metric used in the study. EfficientFormerV2 S0 scored higher than EfficientNet Lite0 but slightly lower than MobileNetV3 Large in accuracy score and F1 score. The best performing CNN model was MobileNetV3 Large, which achieved an accuracy score of 0.9578 and an F1 score of 0.9433 but fell short of both hybrid ViT models in recall score.

The CNN models rely heavily on local feature extraction through convolutional layers, which makes them efficient in identifying localized patterns within an image, such as the texture of lung tissues (Tan & Le, 2019; Howard et al., 2019). However, their ability to capture long-range dependencies across the image is limited, which could explain why they underperformed compared to hybrid models like MobileViT Small, which integrate both local and global feature extraction. This blend enables hybrid models to capture more complex relationships across the entire X-ray image, which is crucial in medical diagnostics where subtle, distributed patterns may indicate disease.

EfficientFormerV2 S0 was outperformed by both MobileNetV3 Large and MobileViT Small in terms of accuracy score and F1 score, achieving 0.9558 and 0.9418, respectively. It did, however, have a low GMACs value of 0.4 (Li et al., 2022) while attaining a high recall score of 0.9655, outperforming both CNN models in this regard. This suggests that transformer components in hybrid architectures can improve recall by allowing the model to consider the entire image context and so produce fewer false negatives.

5.2 Practical Implications and Further Research

The results highlight that lightweight hybrid ViT models, such as MobileViT Small and EfficientFormerV2 S0, exhibit superior recall compared to lightweight CNN models, which is critical in medical applications like pneumonia detection, where false negatives can lead to missed diagnoses and delayed treatment. This makes hybrid models particularly valuable for preventing underdiagnosis, ensuring that more cases of pneumonia are correctly identified, even in resource-constrained settings. However, CNN models like MobileNetV3 Large still performed strongly, particularly in terms of computational efficiency, indicating that CNNs are not far behind and continue to be viable options for real-time applications where model complexity needs to be minimized. Further research should focus on optimizing hybrid models to reduce their computational demands while maintaining high recall. Additionally, exploring hybrid architectures in diverse medical imaging tasks, along with investigating how CNNs can be enhanced to capture long-range dependencies, would provide valuable insights into creating more efficient and accurate diagnostic tools.

References
