Air quality refers to the condition or cleanliness of the air in our environment, typically measured by the presence and concentration of pollutants such as particulate matter (PM), NO₂, SO₂, O₃, and CO. It is a crucial indicator of environmental and public health, as poor air quality can lead to severe health issues, including respiratory problems, cardiovascular diseases, and premature death (Chen and Kan, 2008).
In recent years, growing concern over air pollution has made air quality monitoring a global priority (Kalajdjieski et al., 2020). Accurate classification of air quality levels enables timely warnings, informed decision-making, and the development of mitigation strategies. Traditional air quality monitoring systems rely on specialized sensor networks, which are often expensive and geographically limited. Consequently, there is a rising interest in alternative approaches that are scalable, cost-effective, and accessible to a broader population (Jin et al., 2022).
Image-based air quality assessment is promising due to the ubiquity of smartphone cameras, which enable data collection across large populations and diverse environments without the need for specialized hardware. This democratizes monitoring and allows for rapid expansion of air quality datasets. Deep learning (DL) methods have shown strong performance in image classification tasks in domains such as cyberbullying image classification (Tabassum and Nunavath, 2024), spam detection (Altunay and Albayrak, 2024), and healthcare (Esteva et al., 2019). They are well suited to multi-class air quality classification because they can automatically learn complex visual features and subtle cues, such as haze, color tones, and visibility, that traditional methods may struggle to capture.
The motivation for this study arises from the limited work conducted on multi-class classification of air quality data (Pavikars and Jansi, 2025; Hardini et al., 2023). Moreover, no prior studies have explored the application of transformer-based models for this task. To address this gap, we employed both traditional deep learning models (CNN and ResNet-50) and transformer architectures (ViT and Swin transformer) for multi-class classification of air quality images, aiming to enhance performance and broaden the methodological landscape for environmental monitoring. Additionally, we included traditional machine learning classifiers, SVM and Random Forest, to provide a comprehensive performance comparison.
This study contributes to the field of environmental monitoring by focusing on image-based multi-class air quality classification and highlighting the advantages of transformer-based models. The primary objective of this work is to develop a deep learning-based framework for multi-class classification of air quality from image data. We compare the performance of six models: SVM, Random Forest, Convolutional Neural Network (CNN), ResNet-50, Vision Transformer (ViT), and Swin Transformer. The main contribution of this paper is a comprehensive comparison of traditional machine learning models with both conventional and transformer-based deep learning approaches, demonstrating the superior effectiveness of deep learning, especially transformer models, in multi-class air quality image classification.
The rest of the paper is organized as follows: Section 2 discusses related work in this area, Section 3 presents the methodology, including dataset preparation and model details, Section 4 shows the experimental results, and Section 5 discusses our work. Finally, Section 6 concludes the paper and offers suggestions for future work.
2. Literature review
Israt and Nunavath (2024) demonstrated that images can be effectively classified using a ViT, CNN, and ResNet-50 model, achieving 99.5% accuracy on a multi-class classification task with the ViT model. This result indicates that ViT is highly capable of capturing complex visual features and performing accurate image classification. Similarly, Liu et al. (2021) showed that the Swin Transformer, with its hierarchical and shifted window design, achieves state-of-the-art performance on various image classification benchmarks, highlighting its strength in balancing computational efficiency and accuracy.
Pavikars and Jansi (2025) proposed a deep learning-based method for pollution classification and severity assessment using air quality image datasets from India and Nepal, selected due to similar climatic conditions. They utilized transfer learning with the InceptionV3 model pre-trained on ImageNet, applying fine-tuning along with dropout and L2 regularization to improve generalization. Their model classified images into three categories, including “good,” “moderate,” and “severe,” achieving an accuracy of 92.65%, which outperformed VGG and non-transfer learning approaches.
Hardini et al. (2023) proposed a convolutional neural network (CNN) model optimized using genetic algorithms for air quality estimation based on images. They utilized a combined dataset from three open-access sources, covering a wide range of AQI categories, including “good,” “moderate,” “unhealthy,” “very unhealthy,” and “hazardous.” To enhance model generalization and mitigate overfitting, they incorporated techniques such as dropout layers, batch normalization, and early stopping. The optimized CNN model demonstrated strong performance, achieving an overall accuracy of 95.44% and outperforming traditional baseline methods across all evaluation metrics.
Kow et al. (2022) and Jin et al. (2022) developed an image-based deep learning framework (CNN-RC) that combines a convolutional neural network with a regression classifier to estimate air quality levels from captured images. Their model was trained and tested on 3,549 hourly datasets collected from the Linyuan air quality monitoring station in Kaohsiung City, Taiwan, incorporating images, PM2.5, PM10, and AQI values. To improve reliability and accuracy, different combinations of current images, baseline images, and HSV color statistics were used as inputs. The model achieved strong performance, with R² scores for PM2.5, PM10, and AQI reaching up to 83%, 84%, and 76%, respectively, depending on the time of day. Their approach enables rapid, reliable, and scalable multi-pollutant estimation using only image data, offering a cost-effective alternative to traditional air quality monitoring systems.
Table 1. Related work on image-based air quality data.
2.1 Research gap
While numerous studies have focused on air quality prediction, most have emphasized forecasting. However, no research has implemented multi-class air quality classification using image data with transformer-based deep learning models. To address this gap, this paper explores the application of these models for the multi-class classification of air quality using image data.
3. Materials and Methods
This section outlines a comprehensive methodology for multi-class classification of air quality using image data. The complete workflow for this study is illustrated in Figure 1. It begins with the SAPID dataset and progresses through several preprocessing stages, including data cleaning, labeling, image resizing, sampling, augmentation, and feature extraction. The processed data is then divided into training, validation, and test sets (Section 3.3). These sets are used to train the models and to evaluate their performance.

3.1 Data collection
The dataset used in this work was provided by Zhu (2024) and is publicly available at https://data.mendeley.com/datasets/j654cspb6r/2. It consists of 456 images categorized into five Air Quality Index (AQI) levels as defined by the United States Environmental Protection Agency (EPA). Specifically, the dataset includes 135 images labeled as “good”, 188 as “moderate”, 29 as “unhealthy for sensitive groups”, 78 as “unhealthy”, and 26 as “very unhealthy”. The data was collected using smartphone cameras and is referred to as the smartphone-based air pollution image dataset (SAPID).
3.2 Data pre-processing
To ensure robust and reliable performance of the classification models, a comprehensive data pre-processing pipeline was implemented. This pipeline aimed to standardize the input data, balance class distributions, and enhance model learning through data augmentation and feature engineering.
3.2.1 Label encoding
The classification task involved categorical air quality labels expressed in textual form. To convert these labels into a machine-readable format, label encoding was applied. Each class label was assigned a unique numerical identifier using a predefined mapping: "very unhealthy" was encoded as class 0, "unhealthy" as class 1, "unhealthy for sensitive groups" as class 2, "moderate" as class 3, and "good" as class 4. This encoding facilitated efficient computation and model training in supervised learning settings.
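The mapping described above can be sketched as a simple lookup table. The class-to-integer assignment comes from the text; the names LABEL_TO_ID, encode_labels, and decode_labels are hypothetical helpers introduced only for illustration, and the case-insensitive matching is an assumption.

```python
# Mapping taken from Section 3.2.1; helper names are illustrative only.
LABEL_TO_ID = {
    "very unhealthy": 0,
    "unhealthy": 1,
    "unhealthy for sensitive groups": 2,
    "moderate": 3,
    "good": 4,
}
ID_TO_LABEL = {idx: name for name, idx in LABEL_TO_ID.items()}


def encode_labels(labels):
    """Convert textual AQI labels to integer class ids (case-insensitive)."""
    return [LABEL_TO_ID[label.strip().lower()] for label in labels]


def decode_labels(ids):
    """Convert integer class ids back to their textual AQI labels."""
    return [ID_TO_LABEL[i] for i in ids]


encoded = encode_labels(["Good", "moderate", "Very Unhealthy"])
```

Keeping the inverse mapping alongside the forward one makes it straightforward to report predictions with their original category names.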
3.2.2 Image resizing and normalization
All images in the dataset were resized to a fixed dimension of 224×224 pixels to ensure uniformity and compatibility with widely used pre-trained deep learning architectures. Additionally, pixel values were normalized to a range between 0 and 1. This standard normalization step improves numerical stability during model training, accelerates convergence, and helps avoid issues related to varying image scales and lighting conditions.
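A minimal NumPy sketch of this step is given below. It assumes 8-bit RGB input and uses nearest-neighbour sampling purely to stay dependency-free; the interpolation method actually used in the study is not stated, and an image library would normally perform the resize. The function names and the random test image are hypothetical.

```python
import numpy as np


def resize_nearest(image, size=224):
    """Nearest-neighbour resize of an H x W x C array to size x size (assumed method)."""
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size   # source row for each output row
    cols = np.arange(size) * w // size   # source column for each output column
    return image[rows][:, cols]


def preprocess(image, size=224):
    """Resize, then scale 8-bit pixel values from [0, 255] to [0, 1]."""
    resized = resize_nearest(image, size)
    return resized.astype(np.float32) / 255.0


# Stand-in for a smartphone photo; real images would be loaded from disk.
rng = np.random.default_rng(0)
photo = rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
x = preprocess(photo)
```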
3.2.3 Handling class imbalance
The dataset revealed significant class imbalance, with certain air quality categories underrepresented. To mitigate this, the RandomOverSampler technique from the imblearn library was employed. This method duplicates instances of the minority classes until all classes have equal representation. Random oversampling was preferred over under-sampling, which could discard valuable data from the majority classes, and over synthetic oversampling methods such as SMOTE, which might introduce unrealistic artifacts in image data, because it preserves data integrity and avoids introducing potential noise.
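The effect of random oversampling on the reported class sizes can be illustrated with a short NumPy stand-in. This is not the imblearn implementation; random_oversample is a hypothetical function that mimics the behaviour described above, and the placeholder features are image indices rather than pixel data.

```python
import numpy as np


def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]                      # always keep every original sample
    for cls, count in zip(classes, counts):
        if count < target:
            idx = np.flatnonzero(y == cls)
            keep.append(rng.choice(idx, size=target - count, replace=True))
    order = np.concatenate(keep)
    return X[order], y[order]


# Class sizes reported for SAPID, keyed by encoded label (0 = very unhealthy ... 4 = good).
sizes = {0: 26, 1: 78, 2: 29, 3: 188, 4: 135}
y = np.concatenate([np.full(n, cls) for cls, n in sizes.items()])
X = np.arange(len(y)).reshape(-1, 1)                # placeholder features: one image index per row
X_bal, y_bal = random_oversample(X, y)
```

Oversampling image indices instead of pixel arrays keeps memory use low, since duplicated entries can point back to the same file.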
3.2.4 Image augmentation
To further address class imbalance and enhance dataset diversity, image augmentation techniques were applied. These included rotations at varying angles, horizontal and vertical flipping, and cropping. By artificially expanding the dataset in this way, the model was exposed to a broader variety of patterns and visual features, which improved generalization and reduced the risk of overfitting, especially in classes with initially fewer instances.
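A hedged sketch of such an augmentation routine follows. The rotation angles, flip probabilities, and crop size used in the study are not reported, so the 90-degree rotation steps, the 0.5 flip probability, and the crop_fraction parameter are all assumptions made for illustration.

```python
import numpy as np


def augment(image, rng, crop_fraction=0.9):
    """Apply a random 90-degree rotation, random flips, and a random crop (assumed settings)."""
    out = np.rot90(image, k=int(rng.integers(0, 4)), axes=(0, 1))
    if rng.random() < 0.5:
        out = out[:, ::-1]                 # horizontal flip (reverse the width axis)
    if rng.random() < 0.5:
        out = out[::-1, :]                 # vertical flip (reverse the height axis)
    h, w = out.shape[:2]
    ch, cw = int(h * crop_fraction), int(w * crop_fraction)
    top = int(rng.integers(0, h - ch + 1))
    left = int(rng.integers(0, w - cw + 1))
    return out[top:top + ch, left:left + cw]


rng = np.random.default_rng(42)
image = np.zeros((224, 224, 3), dtype=np.float32)   # placeholder for a preprocessed image
augmented = augment(image, rng)
```

In a full pipeline the cropped image would presumably be resized back to 224×224 before being passed to a model.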
3.2.5 Feature extraction
A variety of advanced feature extraction techniques were utilized to maximize the representational power of the input data. These included the use of deep learning-based pre-trained models through transfer learning, conventional Convolutional Neural Networks (CNNs), and the patch-embedding technique. The patch-embedding method involved splitting images into smaller patches for sequential processing, enhancing spatial understanding of visual elements. Transfer learning allowed for the leveraging of previously learned weights from large datasets, which significantly improved classification accuracy and reduced training time, particularly beneficial in scenarios with limited training data.
After oversampling, each class contained 188 images, resulting in a balanced dataset of 940 images (188 images per class across five classes). Before balancing, class sizes ranged from as few as 26 images (very unhealthy) to 188 images (moderate). Equalizing all classes to 188 samples with RandomOverSampler mitigated class bias and prevented the models from favoring the majority classes during training.
3.3 Splitting data into train, validation, and test sets
The balanced dataset was divided into three subsets: 80% for training, 10% for validation, and 10% for testing. To ensure that each subset preserved the original distribution of class labels, stratified sampling was used during the splitting process. This approach maintains a consistent representation of each class across the training, validation, and testing sets, supporting fair model evaluation and preventing bias from imbalanced splits.
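The stratified 80/10/10 split can be sketched as follows. The per-class rounding rule (truncate the training and validation counts, assign the remainder to the test set) is an assumption; library implementations may round differently, so the exact subset sizes in the study may differ slightly from those produced here.

```python
import numpy as np


def stratified_split(y, train=0.8, val=0.1, seed=0):
    """Split sample indices per class so each subset keeps the class proportions."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx, test_idx = [], [], []
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        n_train = int(train * len(idx))
        n_val = int(val * len(idx))
        train_idx.extend(idx[:n_train])
        val_idx.extend(idx[n_train:n_train + n_val])
        test_idx.extend(idx[n_train + n_val:])       # remainder goes to the test set
    return np.array(train_idx), np.array(val_idx), np.array(test_idx)


# Balanced labels: five classes with 188 samples each, as reported after oversampling.
y = np.repeat(np.arange(5), 188)
train_idx, val_idx, test_idx = stratified_split(y)
```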
3.4 Model architecture
3.4.1 Convolutional neural network (CNN) model architecture
We classified air quality image data using the CNN architecture proposed by Tabassum and Nunavath (2024), configured here for five-class image classification (Figure 2).
The model takes 224×224 RGB images as input and processes them through three Conv2d layers, each with a 3×3 kernel, a stride of 1, and ReLU activation, followed by 2×2 max-pooling. These layers contain 32, 64, and 128 filters, respectively, making the CNN lightweight and suitable for the small dataset. The output is flattened and passed into a fully connected layer with ReLU activation, followed by a Dropout layer with a rate of 0.5 to prevent overfitting. An additional fully connected (linear) layer precedes the final SoftMax layer, which outputs probabilities for the five classes (class 0 to class 4). Early stopping based on validation loss was employed to further regularize training and avoid overfitting.
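To make the size of the flattened feature vector concrete, the sketch below traces the spatial dimensions through the three convolution and pooling stages. The padding is not stated in the text, so both the unpadded case and a padding of 1 are shown as assumptions; the helper names are hypothetical.

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    """Output side length of a square convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2):
    """Output side length of non-overlapping max-pooling."""
    return size // window


def feature_map_sizes(input_size=224, filters=(32, 64, 128), padding=0):
    """Shape (height, width, channels) after each conv + pool stage."""
    size = input_size
    shapes = []
    for channels in filters:
        size = pool_out(conv_out(size, padding=padding))
        shapes.append((size, size, channels))
    return shapes


unpadded = feature_map_sizes(padding=0)   # 3x3 convolutions without padding
padded = feature_map_sizes(padding=1)     # padding chosen to preserve the size before pooling
flat_unpadded = unpadded[-1][0] * unpadded[-1][1] * unpadded[-1][2]
flat_padded = padded[-1][0] * padded[-1][1] * padded[-1][2]
```

Under either assumption the flattened vector has on the order of 10^5 entries, so the first fully connected layer holds most of the network's parameters.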

3.4.2 ResNet-50 model architecture
We employed the ResNet-50 model architecture, consistent with that used by Tabassum and Nunavath (2024), to classify air quality image data. Figure 3 illustrates the architecture used in our study for multi-class air quality classification. The architecture begins with an initial convolution and max pooling, followed by four stages of residual blocks: 3 blocks in Conv1_x, 4 in Conv2_x, 6 in Conv3_x, and 3 in Conv4_x. Each residual block adds its input to its output through a shortcut connection, which helps prevent vanishing gradients and enables effective training of the deep network. After feature extraction, average pooling reduces the feature map size before the fully connected layer and SoftMax output classify the images into five classes. Fine-tuning only the deeper layers improved accuracy while avoiding overfitting, given the limited data.
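The block counts above account for the "50" in the model name, as the short calculation below shows. It assumes the standard bottleneck design, in which each residual block contains three convolutions (1×1, 3×3, 1×1), and it does not count the projection convolutions on the shortcut paths.

```python
# Residual blocks per stage, as listed in the text (Conv1_x to Conv4_x).
blocks_per_stage = (3, 4, 6, 3)

# Assumption: standard bottleneck blocks with three convolutions each.
convs_per_block = 3

initial_conv = 1      # the convolution before the first max-pooling
final_fc = 1          # the fully connected classification layer

weighted_layers = initial_conv + convs_per_block * sum(blocks_per_stage) + final_fc
```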

3.4.3 Swin transformer model architecture
We employed the Swin transformer architecture as outlined by Song et al. (2023) for image classification tasks. Images were resized to 224×224 pixels and normalized before being input into the model. The Swin transformer begins by dividing the input image into non-overlapping patches, which are linearly embedded into feature vectors. These embeddings are then processed through a sequence of hierarchical stages composed of Swin Transformer blocks. Each block applies local self-attention within shifted windows, allowing the model to efficiently capture fine-grained visual patterns while maintaining computational efficiency.
Between stages, patch merging operations reduce spatial resolution and increase feature dimensionality, similar to pooling layers in CNNs. This hierarchical process builds progressively richer feature representations. The network consists of four stages, Layer 1 (two Swin Transformer blocks), Layer 2 (two blocks), Layer 3 (six blocks), and Layer 4 (two blocks), with patch merging occurring between stages. The model was initialized with ImageNet-pretrained weights and fine-tuned on our dataset. The final output features are then utilized for multi-class image classification related to air quality data. Figure 4 illustrates the Swin Transformer model used in this study for multi-class air quality classification.
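The hierarchical structure can be summarized by tracking the token grid and feature width through the four stages. The stage depths come from the text; the 4×4 patch size and the initial embedding dimension of 96 are assumptions that correspond to the Swin-T configuration of Liu et al. (2021), since the exact variant is not stated here.

```python
def swin_stage_shapes(image_size=224, patch_size=4, embed_dim=96, depths=(2, 2, 6, 2)):
    """Token count and feature dimension per stage (patch_size and embed_dim are assumed)."""
    side = image_size // patch_size       # tokens per side after patch embedding
    dim = embed_dim
    stages = []
    for i, depth in enumerate(depths):
        if i > 0:
            side //= 2                    # patch merging groups each 2x2 neighbourhood of tokens
            dim *= 2                      # and doubles the feature dimension
        stages.append({"stage": i + 1, "blocks": depth, "tokens": side * side, "dim": dim})
    return stages


stages = swin_stage_shapes()
```

Under these assumptions the token count falls from 3,136 to 49 while the feature dimension grows from 96 to 768, which mirrors the resolution and channel progression of a CNN backbone.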

3.4.4 Vision transformer (ViT) model architecture
For classifying air quality image data, we employed the ViT architecture as outlined by Tabassum and Nunavath (2024). Figure 5 illustrates the ViT model implemented in our study. The model processes an input image by first dividing it into fixed-size patches. Images are resized to 224×224 pixels and segmented into 16×16 pixel patches, resulting in 196 patches per image. These patches are then flattened into one-dimensional vectors. Each flattened patch is embedded with positional encoding to preserve the spatial relationships among the patches.
We implemented the ViT-Base/16 variant, which passes these embedded patches through multiple layers of multi-head self-attention and feed-forward networks in the Transformer encoder. This architecture effectively captures long-range dependencies and complex patterns across the entire image. Following the encoding process, a classification head aggregates the extracted features, and a final SoftMax output layer produces the predicted probabilities across the five output classes (class 0 to class 4), completing the multi-class classification task.
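The patch-splitting step can be illustrated with a few lines of NumPy. The sketch covers only the division into 16×16 patches and the flattening; the learned linear projection and the positional embeddings are omitted, and image_to_patches is a hypothetical helper name.

```python
import numpy as np


def image_to_patches(image, patch=16):
    """Split an H x W x C image into non-overlapping patches and flatten each one."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch                   # patches per column and per row
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)                    # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


# Placeholder image with distinct values so the patch layout can be checked.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
patches = image_to_patches(image)
```

Each of the 196 rows has 16 × 16 × 3 = 768 values, and patches are ordered row by row from the top-left corner of the image.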

3.4.5 Support Vector Machine (SVM) model architecture
In the SVM architecture, input feature vectors were mapped into a high-dimensional space using a non-linear RBF kernel. The model then constructed an optimal decision boundary that maximized the margin between different class clusters. This approach is well-suited for problems with complex, non-linear class separations. We followed the same architecture as Álvarez-Alvarado et al. (2021) in our model.
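For reference, the RBF kernel measures similarity as K(x, z) = exp(-gamma * ||x - z||^2), where gamma is one of the hyper-parameters tuned in Section 3.5. A small NumPy sketch is shown below; the feature vectors and the gamma value are made up for illustration.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """Pairwise RBF kernel: K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq_dist = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))   # clip tiny negatives from rounding


# Hypothetical 2-D feature vectors; real inputs would be extracted image features.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel(X, X, gamma=0.5)
```

A larger gamma makes the similarity fall off faster with distance, giving a more flexible but more overfitting-prone decision boundary.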
3.4.6 Random Forest classifier model architecture
The Random Forest model utilized a different paradigm, consisting of an ensemble of decision trees. Each tree was trained on a random subset of the training data and a random subset of features, promoting diversity among the trees. During inference, each tree generated a class prediction, with the final output determined by majority voting across all trees. This ensemble strategy enhances robustness and reduces overfitting. We followed the same architecture as Shafi et al. (2020) in our model.
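The majority-voting step can be sketched as follows. The per-tree predictions are invented for illustration, and the tie-breaking rule (lowest class id wins) is an assumption. Note that some libraries, scikit-learn among them, combine trees by averaging predicted class probabilities instead of counting hard votes.

```python
import numpy as np


def majority_vote(tree_predictions, n_classes=5):
    """Combine per-tree class predictions of shape (n_trees, n_samples) by majority vote."""
    votes = np.stack([(tree_predictions == c).sum(axis=0) for c in range(n_classes)])
    return votes.argmax(axis=0)            # ties resolve to the lowest class id


# Hypothetical predictions from three trees for three images.
tree_predictions = np.array([
    [0, 3, 4],
    [0, 3, 3],
    [1, 4, 3],
])
final = majority_vote(tree_predictions)
```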
3.5 Experimental setup and hyperparameter optimization
For the deep learning models, we implemented a standard hyper-parameter tuning strategy based on validation set performance to enhance our model's performance on image data. The AdamW optimizer was used to fine-tune the model parameters, following the methodology suggested by Reyad et al. (2023). The training process was conducted for a maximum of 20 epochs, with a batch size of 20 and a learning rate set to 0.00002. For the multi-class classification task, we employed the "Sparse-Categorical Crossentropy" loss function, which is well-suited for this purpose. To further improve generalization and reduce the risk of overfitting, we applied an early stopping mechanism that terminated training if no improvement in validation loss was observed for three consecutive epochs. Additionally, a learning rate scheduler (StepLR) with a decay factor of 0.9 per epoch was used to gradually reduce the learning rate during training. The model with the best validation performance was retained for final evaluation on the test dataset. We used the same hyper-parameter optimization process for all deep learning models.
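The learning-rate schedule and the early-stopping rule described above can be expressed framework-independently, as in the sketch below. The hyper-parameter values come from the text; the function names and the validation-loss sequence are hypothetical.

```python
def step_lr(initial_lr=2e-5, gamma=0.9, epochs=20):
    """Learning rate used in each epoch when it is multiplied by gamma after every epoch."""
    return [initial_lr * gamma ** epoch for epoch in range(epochs)]


def early_stopping_epoch(val_losses, patience=3):
    """Return (epoch at which training stops, epoch with the best validation loss), 1-based."""
    best_loss, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


learning_rates = step_lr()

# Hypothetical validation losses: no improvement after epoch 3, so training stops at epoch 6.
stopped_at, best_at = early_stopping_epoch([0.90, 0.70, 0.60, 0.61, 0.65, 0.62, 0.50])
```

With a decay factor of 0.9 per epoch, the learning rate in the twentieth epoch is roughly 13.5% of its initial value.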
For the machine learning models, hyper-parameter tuning was performed using GridSearchCV with 5-fold cross-validation on the training set. For the SVM, we tuned C and gamma with an RBF kernel. For Random Forest, we optimized n_estimators, max_depth, and min_samples_split. The best model, based on validation accuracy, was evaluated on the test set.
All experiments were conducted using TensorFlow 2.x on a system equipped with an NVIDIA RTX 3080 GPU and 32 GB of RAM. This setup enabled efficient training and evaluation of the deep learning models on the air quality image dataset.
3.6 Model performance evaluation
After training, final performance was measured on the test set, which comprised 10% of the dataset and was held out during training. Evaluation was conducted using multiple metrics: accuracy, precision, recall, F1-score, and a confusion matrix, providing a comprehensive overview of model effectiveness. We followed the same approach for our five-class image classification as described by Tabassum and Nunavath (2024).
4. Results
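All of these metrics can be derived from the confusion matrix, as the sketch below illustrates. The three-class matrix is a made-up example, and the averaging scheme used to obtain a single precision, recall, and F1 value per model (for example macro or weighted averaging) is not stated in the text, so the function returns per-class values.

```python
import numpy as np


def metrics_from_confusion(cm):
    """Per-class precision, recall, F1 and overall accuracy; cm[i, j] = true i, predicted j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)             # column sums: how often each class was predicted
    actual = cm.sum(axis=1)                # row sums: how often each class truly occurred
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return {"accuracy": tp.sum() / cm.sum(), "precision": precision, "recall": recall, "f1": f1}


# Hypothetical confusion matrix for three classes with ten test images each.
cm = [[8, 2, 0],
      [1, 9, 0],
      [0, 1, 9]]
report = metrics_from_confusion(cm)
```

Precision and recall can diverge noticeably when a model over-predicts some classes, which is why both are reported alongside accuracy.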
In this study, we evaluate the performance of four deep learning models—CNN, ResNet-50, Swin Transformer, and ViT—and two machine learning models, SVM and Random Forest classifier, for multi-class classification of air quality using image data.
Table 2 summarizes the overall performance of each deep learning model in this classification task. The traditional CNN model achieves moderate results, with an accuracy of 78% and nearly balanced precision, recall, and F1-score ranging from 0.77 to 0.78. ResNet-50 shows the weakest performance among the models, with an accuracy of only 64% and matching recall and F1-score of 0.64, although it achieves a higher precision of 0.80. This suggests that the predictions ResNet-50 makes for a given class are usually correct (high precision), but that it misses many instances of each class (lower recall), leading to a poorer overall balance.
Table 2. Model performance of air quality-based image data for deep learning models.
In contrast, the ViT achieves the best overall performance, with the highest accuracy of 97% and precision, recall, and F1-score values of 0.97, demonstrating strong and consistent ability across all evaluation metrics. The Swin Transformer follows closely, with an accuracy of 96% and precision, recall, and F1-score values of 0.96 each. These results clearly highlight the superior capability of transformer-based models (Swin and ViT) compared to traditional CNN architectures for the complex task of air quality classification from image data.
Table 3 summarizes the overall performance of each machine learning model for multi-class classification of air quality using image data. The SVM model shows solid performance with an accuracy of 88%, along with precision, recall, and F1-score values ranging from 0.88 to 0.90, reflecting reliable classification ability. The Random Forest classifier outperforms the SVM, achieving an accuracy of 91%, with precision, recall, and F1-score values of 0.91–0.92, indicating strong and consistent performance across metrics.
Table 3. Model performance of air quality-based image data for machine learning models.
Overall, both traditional machine learning classifiers prove effective as baseline models for air quality image classification; however, the ViT model outperformed the traditional machine learning models.
4.1 Result evaluation with confusion matrices
The confusion matrices compare the performance of various models in multi-class classification of air quality image data. The CNN model demonstrated moderate accuracy but showed significant inter-class confusion, particularly between class 2 and class 4, which were misclassified in 18% and 15% of cases, respectively (Figure 6). This suggests that these categories share overlapping visual features. While class 3 achieved the highest recall at 82%, class 1 fell behind at 73%, indicating uneven discriminative capability.

The ResNet-50 model demonstrated incremental improvements; however, persistent challenges remained. Classes 2 and 4 continued to be confused, with a misclassification rate of 12–15%, while class 1 remained the weakest performer, achieving only 78% recall, which underscores how difficult this class is to identify (Figure 7).

In contrast, the Swin Transformer achieved significantly better results, with nearly all classes surpassing 95% in both precision and recall. The only notable exception was class 3, which was misclassified as class 2 in 3% of cases, likely due to subtle feature similarities (Figure 8).

The ViT model demonstrated exceptional performance, with all classes achieving over 99% accuracy and no misclassifications in classes 1, 4, and 5 (Figure 9).

As shown in Figure 10, the SVM model performs well on classes 0-2, achieving an accuracy of 85-95%. However, it struggles significantly under more polluted conditions (class 3-5). Class 3 is often misclassified as class 5, with a misclassification rate of 31.6%, while class 5 itself has an accuracy of only 40%.

The confusion matrix, as shown in Figure 11, confirms the strong performance of the Random Forest model, with most predictions falling along the diagonal (indicating correct classifications). Misclassifications are minimal, primarily occurring in class 3 (“Moderate”), which was occasionally predicted as class 4 (“Good”).

4.2 Result evaluation with ROC-AUC curves
The ROC curves help us understand how effectively each model distinguishes between the air quality classes.
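For a multi-class problem, one common way to obtain per-class ROC-AUC scores is the one-vs-rest scheme, in which each class is scored against all others. The sketch below assumes this scheme (the text does not state which was used) and computes the AUC as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The predicted probabilities are invented for illustration.

```python
import numpy as np


def binary_auc(scores, is_positive):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = scores[is_positive]
    neg = scores[~is_positive]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))


def one_vs_rest_auc(probabilities, y_true):
    """One AUC per class, treating that class as positive and all others as negative."""
    n_classes = probabilities.shape[1]
    return [binary_auc(probabilities[:, c], y_true == c) for c in range(n_classes)]


# Hypothetical predicted probabilities for six images and three classes.
y_true = np.array([0, 0, 1, 1, 2, 2])
probabilities = np.array([
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.20, 0.50, 0.30],
    [0.40, 0.35, 0.25],
    [0.10, 0.20, 0.70],
    [0.20, 0.45, 0.35],
])
aucs = one_vs_rest_auc(probabilities, y_true)
```

An AUC of 1.0 means every image of the class is scored above every image of the other classes, while 0.5 corresponds to random ranking.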

The CNN model demonstrated a moderate ability to distinguish between classes, with ROC-AUC scores ranging from 0.781 for class 1 to 0.891 for class 5 (Figure 12). These scores indicate that the model faced challenges in clearly separating certain classes. The ResNet-50 model (Figure 13) performed slightly better, with ROC-AUC scores between 0.841 and 0.911; however, it still exhibited some overlap, particularly among the middle classes.

The Swin Transformer performed exceptionally well, achieving scores close to perfect, ranging from 0.991 to 1.000. This indicates its ability to clearly distinguish between most classes, although class 3 had a slightly lower score of 0.991 (Figure 14).

Finally, the ViT model achieved perfect results, with a ROC-AUC of 1.000 for all classes, demonstrating its exceptional ability to accurately identify all air quality levels (Figure 15).

Figure 16 shows the ROC curves of the SVM classifier across the five air quality classes. The model demonstrates excellent results, with AUC values ranging from 0.96 to 1.00. Class 0 ("Very Unhealthy") achieves a perfect AUC of 1.00, while classes 1 and 2 score 0.99, indicating that the SVM effectively distinguishes between categories.

Figure 17 presents the ROC curve, illustrating the performance of the Random Forest classifier, which slightly outperforms the SVM. Four out of five classes have AUC values of 0.99 or 1.00. Classes 0 and 1 achieve perfect AUCs, demonstrating that the model is highly effective in class separation.

5. Discussion
The findings of this study reveal key insights into the comparative effectiveness of traditional machine learning models and advanced deep learning architectures for air quality classification based on image data. The traditional CNN model achieved a moderate performance with 78% accuracy, and its precision, recall, and F1-score remained balanced around 0.77–0.78. This suggests that while CNNs can capture spatial features effectively, their representational power may not be sufficient for more complex multi-class tasks without additional architectural depth or contextual modeling (Zhao et al., 2024; Wu et al., 2022). Notably, ResNet-50, despite being a deeper convolutional neural network, underperformed with only 64% accuracy. Although its precision was relatively high (0.80), the recall and F1-score both dropped to 0.64. This performance gap indicates that ResNet-50 may be overfitting to certain dominant patterns or failing to generalize across minority classes in the imbalanced dataset. The low recall suggests it struggles to correctly identify all relevant cases, undermining its overall classification balance (Sambasivam et al., 2025; Wen et al., 2025).
In contrast, transformer-based models delivered significantly superior results. The Swin Transformer achieved 96% accuracy with uniformly high precision, recall, and F1-score (0.96). Even more impressively, the ViT reached a peak accuracy of 97%, with precision, recall, and F1-score values of 0.97. These results underscore the transformers' capacity to model complex spatial relationships and extract deep contextual information from image data (Anjali et al., 2025; Nafi et al., 2022). Unlike CNNs, which focus on local patterns via convolutions, ViT and Swin utilize self-attention mechanisms, allowing them to capture both global and local dependencies across the entire image. This likely contributed to their enhanced performance, especially in distinguishing between visually similar air quality categories (Qezelbash-Chamak and Hicklin, 2025).
Traditional machine learning models also exhibited robust performance. The SVM achieved an accuracy of 88%, with balanced precision, recall, and F1-score values around 0.88–0.90. The Random Forest classifier slightly outperformed the SVM, achieving a 91% accuracy and consistent metrics around 0.91–0.92. These results affirm that classical models, when paired with effective feature extraction, can perform reliably as baseline classifiers (Gyasi-Agyei, 2025; Islam et al., 2024). However, their performance remains slightly below the transformer-based models, likely due to their limited ability to capture deep spatial and semantic patterns from raw image data.
The confusion matrix analysis provides deeper insight into how well each model distinguishes between different air quality categories using image data. The CNN model, while achieving moderate overall accuracy, revealed notable weaknesses in class separation, especially between class 2 ("unhealthy for sensitive groups") and class 4 ("good"), which were misclassified in 18% and 15% of instances, respectively. This suggests overlapping visual characteristics between these classes, making them harder to distinguish. Additionally, the uneven recall values, highest for class 3 (82%) and lowest for class 1 (73%), indicate that the CNN struggles with consistent classification across all categories (Kalantari et al., 2024). ResNet-50 showed slight improvement over CNN, with marginally reduced confusion between classes 2 and 4. However, the recall for class 1 remained low at 78%, reflecting ongoing difficulty in accurately identifying this class. This persistent misclassification underscores the model's limited capacity to differentiate between visually similar air quality levels, particularly in middle categories where the boundaries are less distinct (Nawaz et al., 2025).
On the other hand, the Swin Transformer significantly enhanced classification accuracy, achieving over 95% precision and recall for nearly all classes. Its only noticeable error was a 3% misclassification of class 3 as class 2, likely due to subtle visual similarities. This highlights the model’s superior ability to extract both global and local image features, making it more effective for fine-grained classification tasks (Zhou et al., 2025). The ViT delivered the highest performance among all models, with perfect or near-perfect classification. It achieved over 99% accuracy across all classes and showed no misclassification in classes 1, 4, and 5. This result reinforces the strength of transformer-based architectures in handling complex, high-dimensional image data with minimal confusion across categories (Ortiz et al., 2025).
Among the machine learning models, SVM performed well for classes 0 to 2, achieving 85–95% accuracy. However, it struggled significantly with classes 3 to 5—most notably misclassifying class 3 as class 5 in over 31% of cases and identifying class 5 correctly only 40% of the time. These errors suggest that SVM has difficulty modeling the complexity and subtle visual gradients in more polluted conditions. In contrast, the Random Forest classifier showed stronger and more balanced performance. Most of its predictions aligned with the correct labels, as seen by the concentration of values along the diagonal of its confusion matrix. Its only minor confusion was between class 3 and class 4, where "Moderate" was sometimes predicted as "Good" (Imani et al., 2025)
The ROC curve analysis provides a comprehensive view of each model’s capability to distinguish between air quality categories. The CNN model showed only moderate class discrimination, with ROC-AUC scores ranging from 0.781 to 0.891. Its particularly low performance for class 1 suggests difficulty in identifying that category accurately, likely due to overlapping visual features with neighboring classes (Nakagawa et al., 2022). ResNet-50 improved slightly on the CNN, achieving ROC-AUC scores between 0.841 and 0.911. While this reflects better separation overall, the model still struggled with the middle-range air quality classes, indicating that deeper CNNs alone may not fully resolve inter-class ambiguity in complex image datasets (Wen et al., 2025). The Swin Transformer exhibited excellent discriminative power, with ROC-AUC scores close to 1.0 across all classes. Even its lowest-performing class, class 3, scored 0.991, indicating highly reliable classification with minimal overlap between categories (Tanwar et al., 2025). The ViT achieved perfect separation across all five classes, with ROC-AUC scores of 1.000. This result confirms ViT’s exceptional ability to distinguish between even the most visually similar air quality levels, highlighting the strength of self-attention mechanisms in extracting fine-grained, global image features.
Among the traditional classifiers, SVM performed impressively, achieving ROC-AUC values between 0.96 and 1.00. It particularly excelled in identifying classes 0, 1, and 2, with near-perfect scores. However, its performance still fell short of the transformer models, especially in handling more nuanced class boundaries. The Random Forest classifier slightly outperformed SVM, with four out of five classes achieving AUC scores of 0.99 or higher, and perfect scores for classes 0 and 1. These results suggest that ensemble-based machine learning models can offer strong class separability, though still not at the level of transformer-based approaches (Hosain et al., 2025; Ravin et al., 2025).
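Per-class ROC-AUC scores such as those above are commonly obtained with a one-vs-rest scheme, in which each class in turn is treated as the positive label. The sketch below assumes that scheme and computes the AUC through its rank interpretation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half). The probability table and function names are hypothetical, and the source does not state how its scores were computed.

```python
from typing import List, Sequence


def binary_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correctly ranked pair
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true: Sequence[int], probs: Sequence[Sequence[float]],
                    n_classes: int) -> List[float]:
    """One AUC per class: class k is positive, every other class is negative."""
    aucs = []
    for k in range(n_classes):
        labels = [1 if t == k else 0 for t in y_true]
        scores = [row[k] for row in probs]
        aucs.append(binary_auc(labels, scores))
    return aucs


# Hypothetical predicted class probabilities for six samples, three classes.
y_true = [0, 0, 1, 1, 2, 2]
probs = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.3, 0.2],  # a class-1 sample that the model scores as class 0
    [0.1, 0.2, 0.7],
    [0.1, 0.3, 0.6],
]

aucs = one_vs_rest_auc(y_true, probs, n_classes=3)
print(aucs)  # [1.0, 0.875, 1.0]
```

In this toy example only class 1 falls below 1.0, because one of its samples receives the same class-1 score as two negatives. This is the kind of overlap with neighboring classes that lowers a per-class AUC.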
6. Conclusions
This study presents a deep learning-based approach for multi-class classification of air quality using image data. Using the Smartphone-Based Air Pollution Image Dataset (SAPID), we evaluated four deep learning models (CNN, ResNet-50, ViT, and Swin Transformer) and two machine learning classifiers (SVM and Random Forest). Our results demonstrate that transformer-based architectures, particularly ViT and Swin Transformer, significantly outperform traditional convolutional models in terms of accuracy, precision, recall, and F1-score. The ViT model achieved the highest classification accuracy of 0.97, followed closely by the Swin Transformer at 0.96, highlighting the potential of vision transformers in capturing complex features in environmental image data. These findings underscore the effectiveness of transformer-based models for environmental monitoring tasks and suggest a promising direction for scalable, image-based air quality assessment systems. Future work should evaluate the proposed models on larger, more diverse datasets that encompass various geographic regions, lighting conditions, and pollution sources. Expanding the dataset will not only improve model robustness but also provide deeper insight into the adaptability of transformer-based approaches in real-world environmental monitoring scenarios.
Acknowledgements
Not applicable.
Ethical approval statement
No ethical approval was required to conduct the study.
Data availability
The presented data are deposited in Mendeley Data (https://data.mendeley.com/datasets/j654cspb6r/1).
Informed consent statement
No informed consent was required to conduct the study.
Conflict of interest
The authors declare no conflict of interest.
Authors’ contribution
Conceptualization: Israt Tabassum, G. M. Ansarul Kabir, Iffat Tasnuva, and Shaheena Sultana; Data collection: Israt Tabassum, G. M. Ansarul Kabir, and Iffat Tasnuva; Data analysis: Israt Tabassum, G. M. Ansarul Kabir, and Iffat Tasnuva; Figure preparation: Israt Tabassum, G. M. Ansarul Kabir, Iffat Tasnuva, and Shaheena Sultana. All authors critically reviewed the manuscript and agreed to submit the final version of the manuscript.