1 Introduction

Visual object classification is a core enabling technology for deploying diverse applications on UAV platforms. Low-altitude aerial images are obtained from drones flying within a limited height above the ground; in this work, we consider aerial images captured by drones flying approximately 100 m or less above the land. Applications of unmanned aerial vehicles (UAVs) include autonomous driving [1], object detection and classification [2], spotting violent crowd behavior [3], traffic monitoring [4], and aerial terrain analysis [5]. Low-altitude aerial images retrieved from UAVs also support public safety in vehicle accidents [6], ship collisions [7], border and power line inspection [8], crowd surveillance [9], and energy inspection of solar farms [10]. Low-altitude aerial images of urban settings have different characteristics than remote sensing or standard datasets, and they present significant challenges for object classification, such as payload weight constraints and multiple overlapping or differently scaled objects [11]. In this paper, we perform object classification on multiple low-altitude aerial objects.

1.1 Motivation

Research on low-altitude aerial datasets is relatively new, and this paper experimentally compares leading deep learning methods for object classification on such datasets. The advent of artificial intelligence technologies has boosted drone-based systems across a wide range of applications. In this paper, we compare machine learning- and deep learning-based approaches on five classes of low-altitude aerial objects. Because the inherent characteristics of low-altitude aerial images differ from those of standard images, the challenges encountered are harder to solve, and classification algorithms behave differently when applied to such images. Versatile UAV applications, including crowd surveillance [9], traffic monitoring [4], and autonomous navigation [1], have become more feasible due to recently formed drone policies, so it is worth studying multi-object classification models for low-altitude aerial images alongside these applications. We aim to identify a suitable model for classification in this underexplored domain. This study targets researchers new to low-altitude UAV imagery who must choose between machine learning and deep network approaches for object classification, a setting in which human experts are relatively inefficient at evaluating recognition outcomes without proper visualization. This paper is an attempt in this direction, and its significant contributions include the following:

  • Comparison between machine learning-based classifiers and a deep handcrafted CNN for object classification in low-altitude aerial images.

  • Comparison between a deep handcrafted CNN and pretrained deep models for object classification in low-altitude aerial images.

  • Performance evaluation of machine learning-based classifiers and pretrained deep models for low-altitude UAV object classification.

  • Recommendation of a suitable choice between machine learning-based classifiers and pretrained deep models for recognizing objects in low-altitude aerial images.

The organization of this paper is as follows: Sect. 2 highlights the challenges of low-altitude UAV objects and reviews machine learning- and deep learning-based object classification techniques. Section 3 describes the experimental setup, covering the methodology for applying classification algorithms to low-altitude UAV datasets, the training process, the evaluation parameters, and the low-altitude UAV dataset itself. Section 4 analyzes the results obtained from machine learning-based classifiers and pretrained deep models under different parameters. The last section concludes the results and recommends a feasible model choice for multi-object classification in low-altitude UAV datasets; the future scope of the proposed work is also discussed there.

2 Related work

Over the last decade, convolutional neural networks (CNNs) have emerged as an optimal choice for a range of image analysis tasks such as object detection, recognition [2], semantic segmentation, and pose estimation [12]. CNNs enable real-time applications on low-altitude UAV data to operate robustly in civilian airspace. Complex applications built on low-altitude aerial images include crowd surveillance by estimating violent human poses [12], recycling of plastic waste in the wild [13], monitoring power infrastructure [14], identifying mosquito breeding areas [15], and detecting landslide accidents [16]. In this section, we discuss the challenges of low-altitude UAV-based object classification, machine learning-based classifiers, and deep models.

2.1 Challenges of UAV-based object classification

Multi-object classification in low-altitude aerial images is a challenging problem due to overlapping objects, limited contextual information, scale differences among objects, etc. Compared with standard images, low-altitude UAV-based object detection faces significant challenges, such as:

  1. Immense variations in the scale of aerial objects.

  2. Dense distribution of small objects.

  3. Arbitrary orientations of objects in low-altitude aerial images.

  4. High illumination that underexposes the dark regions of high-resolution images.

  5. Occlusion in the form of proximity to other objects in the scene.

All the above challenges have led object detection and recognition techniques for low-altitude aerial images to rely on deep features. We first describe machine learning-based classifiers, then a handcrafted CNN model, and finally pretrained deep learning-based models. Object classification experiments were performed with these models on a low-altitude aerial dataset.

2.2 Machine learning-based classifiers

The machine learning classifiers implemented here are K-nearest neighbor (KNN), decision trees [17], random forests (RF) [18], and naïve Bayes [19]. These classifiers have become strong baseline models in object recognition systems in recent times [20]. K-nearest neighbor is among the oldest nonparametric algorithms; the number of neighbors k is determined by cross-validation on the input data. The decision tree classifier attempts to split the feature space so as to yield a suitable generalization. Decision trees are widely used for classifying categorical and numerical data, and nonlinear relationships among parameters do not affect their performance. In our setup, the decision tree classifier is instantiated with random state = 0 and then fit on the training data. The design of decision trees involves choices of attribute selection and pruning methods; an object is then classified by the class voted by the existing predictors [21]. The most frequently used attribute selection measures are the information gain ratio and the Gini index. For a given training set T, the probability that a randomly chosen sample belongs to class Ci is f(Ci, T)/|T|, and the Gini index is given in Eq. (1).

$$\sum_{i} \sum_{j \ne i} \left( f(C_{i}, T)/\left| T \right| \right)\left( f(C_{j}, T)/\left| T \right| \right).$$
(1)
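
To make Eq. (1) concrete, the following is a minimal sketch of the Gini impurity computed from class frequencies; `gini_index` is an illustrative helper written for this paper's five-class setting, not code from the original study:

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity of a label set T, as in Eq. (1):
    sum over i != j of p(C_i) p(C_j), with p(C_i) = f(C_i, T) / |T|."""
    total = len(labels)
    probs = [count / total for count in Counter(labels).values()]
    # For a probability vector, sum_{i != j} p_i p_j = 1 - sum_i p_i^2
    return 1.0 - sum(p * p for p in probs)

# Example: a tree node holding 3 'car' samples and 1 'person' sample
print(gini_index(["car", "car", "car", "person"]))  # 0.375
```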

The machine learning-based RF classifier uses a random combination of features at every node of a tree. RF is an ensemble of unpruned decision trees built on bootstrap samples of the input using random feature subsets. We use a random forest without hyperparameter tuning or clustering. The naïve Bayes classifier is based on the maximum a posteriori principle: the posterior probability of class c, computed with Bayes' theorem, is proportional to the prior times the likelihood, as in Eq. (2):

$$P(C = c \mid x_{1}, \ldots, x_{n}) \propto P(C = c)\,P(x_{1}, \ldots, x_{n} \mid C = c).$$
(2)

This approach extends to multiple classes and assumes conditional independence among features. Naïve Bayes classifiers assign the most probable class to a sample described by its feature vector, learning under the feature-independence assumption. We compared these machine classifiers with a customized approach, a deep handcrafted CNN, in our methodology; the intent is to design an efficient and lightweight network from scratch rather than adapt an existing system to low-altitude aerial images. Machine learning-based classifiers have a strong record in image processing for optimized object recognition. Reference [22] described a hybrid approach for detecting objects in UAV imagery that jointly used the Viola-Jones detector and a histogram of oriented gradients (HOG) [23]-based support vector machine (SVM) classifier [24]. The scheme adopted an orientation adjustment method that rotated the UAV image into horizontal alignment and further integrated the two detectors based on their detection speed to improve efficiency. Reference [25] implemented a cascading classifier that concatenated online learning-based classifiers exploiting multiscale HOG features; the input feature dimensions were expanded in multiscale HOG to supply richer information for aerial images. Reference [26] used the AdaBoost classifier with a sliding-window region proposal method and integrated channel descriptions to detect independently moving features from aerial views; segmentation techniques such as contour extraction and blob extraction were evaluated to reduce the merging of similar motion clusters. References [27, 28] used scale-invariant feature transform (SIFT) descriptors [29] for keypoint extraction of vehicle objects in UAV imagery; the number of objects was given by the number of final keypoints retained by the SVM classifier during classification and merging, and different combinations of SIFT features with color and morphology were used to compute detection and false alarm rates.
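
Returning to our own setup, one plausible Scikit-learn instantiation of the four classifiers described at the start of this section follows; only the decision tree's random state = 0 is stated explicitly in the text, so the Gaussian variant of naïve Bayes and the candidate grid for k are assumptions for illustration:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV

dt = DecisionTreeClassifier(random_state=0)  # fixed random state, as in the text
rf = RandomForestClassifier()                # no hyperparameter tuning, as described
nb = GaussianNB()                            # maximum a posteriori under feature independence

# k chosen by cross-validation over an assumed candidate grid
knn = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7, 9]}, cv=5)
```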

Inspired by the above works, we found it interesting to compare machine and deep approaches to classify low-altitude aerial images. A comprehensive explanation of CNN-based deep models for multiple aerial object classification is discussed in the next sections.

2.3 Deep learning-based classification models

In the recent era, artificial intelligence has revolutionized machine learning for computer vision [30]. Deep learning-based models subsequently evolved in image processing and achieved far better object recognition results than traditional approaches [31]. CNNs have been the most successful object classification architectures in deep learning; they work analogously to the human brain, comprising neurons that respond to their environment [32]. Well-known CNN architectures have been deployed as feature extractors for object classification, with their classifiers subsequently tuned. During training, filters and parameters are randomly initialized and updated through forward propagation. In low-altitude aerial studies, 2D CNNs have commonly been used to extract spatial features for object detection, recognition, and semantic segmentation of high-resolution aerial images [33], medical image-based disease diagnosis [34], and COVID-related measures [35]. Reference [34] proposed a VGG-inspired classification network to study attention mechanisms for Alzheimer's disease; eighteen-way data augmentation was proposed to avoid overfitting, and the precision and accuracy were 97.87 ± 1.53 and 97.76 ± 1.13, respectively. Reference [35] identified COVID-19 patients through a novel artificial intelligence model on a chest CT dataset: a novel VGG-style base network served as the backbone, a convolutional block attention module was introduced as the attention module, and an improved multiple-way data augmentation method was used to resist overfitting; the model achieved per-class precision above 95% and a micro-averaged F1 score of 96.87%, higher than 11 state-of-the-art approaches. Reference [36] improved building extraction accuracy in complex building areas through a framework that applies deep learning-based semantic segmentation to UAV images with a digital surface model; the combination identified small buildings that were usually low and partly covered by tree branches, and results on an open standard dataset indicate an overall 4% accuracy increase from RGB to RGBD. Reference [37] compared the classification results of three deep models, AlexNet, VGG16, and VGG19, on ten classes of UAV landing sites with respect to different performance parameters, offering an understanding of typical false objects among landing site classes. Reference [38] proposed a dual inspection mechanism that identifies missed targets in suspicious areas to help single-stage detection branches produce reliable results; the method improved mAP by 2.7% on the VisDrone2020 dataset, 1.0% on the UAVDT dataset, and 1.8% on the MS COCO dataset. Reference [39] reviewed vehicle detection from UAV imagery using deep learning techniques such as convolutional neural networks, recurrent neural networks, autoencoders, and generative adversarial networks, and their impact on improving the vehicle detection task. Reference [40] introduced a novel deep CNN architecture to identify anthracnose disease in mangoes, validated on a real-time dataset captured in farms of Karnataka, Maharashtra, and New Delhi; compared with other state-of-the-art approaches, the algorithm gives a higher classification accuracy of approximately 96.16%.
Reference [41] evaluated transfer learning and fine-tuning on several CNN architectures; the highest accuracy, 88%, was obtained by fine-tuning the ResNet50 model. The testing results show that transfer learning helps generalization and demonstrates strong potential for real-time forest fire detection.

CNNs were explicitly designed for object classification tasks, i.e., assigning single or multiple class labels to an entire scene. A breakthrough in object classification came at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012, where CNNs outperformed state-of-the-art models based on handcrafted appearance descriptors [42]. Subsequent extensions, such as additional trainable layers increasing model capacity [43], the introduction of dropout [44] and batch normalization [45], and strategies allowing better gradient propagation, such as rectified linear unit (ReLU) nonlinearities [46], enabled efficient training of deeper CNNs. Correctly annotated datasets for training and powerful GPUs for inference made CNNs the de facto standard for solving object classification problems. The classification of low-altitude UAV images containing multiple categories of objects is the primary contribution of this study. Pretrained deep models such as VGG16 [43], InceptionV3 [47], ResNet50 [48], and DenseNet121 [49], trained on ImageNet, have been implemented for diverse object classification; detailed information about their parameter counts, accuracy rates, and required input image sizes is presented in Table 1. The VGG-D models comprise VGG16 and VGG19 with 13 and 16 convolutional layers, respectively, and their training was regularized by several mechanisms, especially for the fully connected layers. InceptionV3 [47] eliminates connections between convolutional layers that contribute little and carry redundant information due to the correlation between them. Inception-ResNetV2 [48] takes advantage of both the Inception and ResNet designs and outperforms leading deep models. The Xception [54] architecture is built on a linear stack of depthwise separable convolution layers with linear residual connections; it has two key layers: a depthwise convolutional layer, in which a spatial convolution is carried out independently on each input channel, and a pointwise convolutional layer, a 1 × 1 convolution that maps the depthwise outputs onto a new channel space. The DenseNet [49] network was designed to address the vanishing gradient problem arising from network depth, which hampers the training of every deep network due to the long paths that information and gradients must traverse. These models were initially trained on ImageNet, and feature extraction was then performed on the customized low-altitude UAV dataset by transferring the weights of the initial layers only.

Table 1 Parameters of pre-trained deep models
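
As a concrete illustration of this transfer-learning setup, the sketch below freezes an ImageNet-pretrained VGG16 base and attaches a new five-class softmax head; the head design (global average pooling plus dropout) is an assumption for illustration, not the exact configuration used in the experiments:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Load ImageNet weights, drop the 1000-class top, and freeze the convolutional base
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # transfer the pretrained weights of the initial layers only

# New classification head for the five low-altitude aerial classes
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.2),
    layers.Dense(5, activation="softmax"),
])
```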

3 Experimental setup

We have considered multiple classes of objects in low-altitude aerial images on which object classification experiments have been performed. The methodology for applying machine learning-based classifiers to a low-altitude aerial dataset includes importing the necessary Python libraries, loading the image files with their classes, scaling and transforming the training and test data, instantiating the classification model, fitting the visualizer and the model, and evaluating the model on the test data. The machine learning-based classifiers and pretrained deep networks discussed above are trained on a customized low-altitude UAV dataset for multiple-object classification. The dataset description, training strategies, and performance evaluation methods are presented in the following sections.
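
As a concrete sketch of the classifier pipeline just outlined, the code below walks through these steps end to end with Scikit-learn; the directory layout (`uav_dataset/<class_name>/*.jpg`) and the 64 × 64 working resolution are assumptions for illustration:

```python
import numpy as np
from pathlib import Path
from PIL import Image
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load image files with their classes (hypothetical folder-per-class layout)
X, y = [], []
for class_dir in Path("uav_dataset").iterdir():
    for img_path in class_dir.glob("*.jpg"):
        img = Image.open(img_path).convert("RGB").resize((64, 64))
        X.append(np.asarray(img, dtype=float).ravel())  # flatten pixels into a feature vector
        y.append(class_dir.name)
X, y = np.array(X), np.array(y)

# Scale and transform, split, instantiate, fit, and evaluate on the test data
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier().fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```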

3.1 Deep network-based handcrafted model

An end-to-end deep object classification model, referred to as the deep handcrafted model, has been trained on the multiple objects present in the low-altitude aerial dataset. The architectural details of the proposed handcrafted model are described in Fig. 1 and Table 2. The network contains six convolutional and pooling layers and takes 150 × 150 input images; low-altitude aerial images of other dimensions are resized before being fed into the network. The filters learn different feature types, and each filter slides over the input images. The layers after the convolutional stages are global average pooling, dropout, and fully connected layers; the flattening step converts the 3D feature maps into 1D feature vectors. The activation function is ReLU, which applies a threshold operation to the input to purge the effect of dark and noisy regions. Max pooling and global average pooling apply a maximum and an average operation, respectively, to each filter response while reducing the spatial dimensions of the feature maps. Class scores are calculated through a softmax classifier, with activation values corresponding to different abstraction layers; the top layer of the model is a softmax class layer that selects the label with the highest probability.

Fig. 1
figure 1

Used handcrafted deep network

Table 2 Architectural details of handcrafted deep network for object recognition
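
Based on the description above, a minimal Keras sketch of such a handcrafted network follows; the per-stage filter counts are assumptions (Table 2 holds the exact design):

```python
from tensorflow.keras import layers, models

model = models.Sequential()
model.add(layers.InputLayer(input_shape=(150, 150, 3)))
for filters in (32, 32, 64, 64, 128, 128):   # six convolution + pooling stages
    model.add(layers.Conv2D(filters, 3, activation="relu", padding="same"))
    model.add(layers.MaxPooling2D(2))
model.add(layers.GlobalAveragePooling2D())   # 3D feature maps -> 1D feature vector
model.add(layers.Dropout(0.2))
model.add(layers.Dense(5, activation="softmax"))  # five aerial object classes
```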

3.2 Training process

The machine learning classifiers are implemented through Python's Scikit-learn library on the customized low-altitude aerial dataset, which consists of images and corresponding labels. The task is to predict the low-altitude aerial class to which each image belongs. During the training process, the dataset is loaded and then split into attributes and labels. The standard scaler function is employed to transform the data before splitting it into training and testing sets. The final step is to compute inferences on the testing data; the classification report method is used to calculate precision, recall, and F-1 score for the employed models. The deep learning architectures have been implemented in Keras with a TensorFlow 1.10 backend. We applied uniform data augmentation in all experiments, including random horizontal and vertical flipping, random scaling, and rotations of the input images. The input data are shuffled randomly and split into training and validation sets (3:1 ratio) before being passed to the deep learning classification models; this process is repeated multiple times so that a fair evaluation can be inferred. Root mean square propagation (RMSProp) was employed to optimize the network loss function, starting with a learning rate of 0.001, and each network was trained for 1000 epochs. For our multiclass classification of low-altitude UAV images, the categorical cross-entropy loss function provides stable training and significant results. The dropout rate is 0.2 as a regularization technique for the deep neural networks, and a batch size of 32 is kept due to the size of the input data. The final trained model was saved to disk for later visualization of the results. Training and validation were computed on a cluster of 2 NVIDIA Titan XP GPUs; throughout the experiments, an Ubuntu 16.04 LTS platform with an Intel Core i7-6850K CPU @ 3.60 GHz × 12 and 64 GB RAM was used. The main components of the proposed analysis are implemented in Python, supported by the Sklearn [50], OpenCV [51], and Keras [52] libraries with the TensorFlow backend [53]. The deep models utilized the various pretrained CNNs [43, 47, 48], partially fine-tuned on a widely deployed dataset, and were implemented with NVIDIA CUDA toolkits [55] to run on desktop graphical processing units (GPUs).
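
A sketch of this training configuration follows, written against the modern Keras API (the original used Keras on TensorFlow 1.10, where some argument names differ); the augmentation ranges and the `uav_dataset` path are assumptions:

```python
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation roughly matching the described flips, scaling, and rotations
datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    horizontal_flip=True,
    vertical_flip=True,
    zoom_range=0.2,         # random scaling (assumed range)
    rotation_range=20,      # random rotations (assumed range)
    validation_split=0.25,  # 3:1 training/validation split
)
train_gen = datagen.flow_from_directory(
    "uav_dataset", target_size=(150, 150), batch_size=32, subset="training")
val_gen = datagen.flow_from_directory(
    "uav_dataset", target_size=(150, 150), batch_size=32, subset="validation")

model.compile(optimizer=RMSprop(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_gen, validation_data=val_gen, epochs=1000)
model.save("handcrafted_cnn.h5")  # save the trained model to disk
```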

3.3 Evaluation parameters

To evaluate the accuracy of each model, popular classification evaluation metrics have been employed to visualize the results precisely. The classification report was generated from the predicted data to measure recall, precision, and F-1 score. Precision is the fraction of true positives out of the sum of true positives and false positives; recall is the fraction of true positives out of the sum of true positives and false negatives; and the F1 score is the harmonic mean of precision and recall.

$$\text{Precision} = \frac{\text{True positives}}{\text{True positives} + \text{False positives}},$$
(3)
$$\text{Recall} = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}},$$
(4)
$$F_{1}\ \text{measure} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$
(5)
$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}.$$
(6)

The accuracy score counts a prediction as correct when the class with the maximum probability matches the true class; the metrics are represented in Eqs. (3)–(6).
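
A short sketch of how these per-class metrics can be computed directly from a confusion matrix follows; `per_class_metrics` is an illustrative helper mirroring Eqs. (3)–(6), not the authors' code:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_metrics(y_true, y_pred, classes):
    """Per-class precision, recall, and F1 from the confusion matrix."""
    cm = confusion_matrix(y_true, y_pred, labels=classes)
    for i, cls in enumerate(classes):
        tp = cm[i, i]
        fp = cm[:, i].sum() - tp  # predicted as cls but actually another class
        fn = cm[i, :].sum() - tp  # actually cls but predicted as another class
        p = tp / (tp + fp) if tp + fp else 0.0       # Eq. (3)
        r = tp / (tp + fn) if tp + fn else 0.0       # Eq. (4)
        f1 = 2 * p * r / (p + r) if p + r else 0.0   # Eq. (5)
        print(f"{cls}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
    print("accuracy =", np.trace(cm) / cm.sum())     # Eq. (6)
```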

3.4 Description of low-altitude UAV dataset

We have considered annotated low-altitude UAV datasets, including CARPK [56], Okutama [57], VEDAI [58], UAVBD [13], and a birds dataset [59], and combined them to form five different categories of objects. This wide variety of low-altitude UAV datasets has been merged to produce multiple object classes, such as vehicles, persons, cars, plastic bottles, etc. The description, annotation support, and dataset size information are presented in Table 3. The CARPK dataset [56] provides localization and counting of cars in parking lots to gather free-space information for new entrants. The UAVBD dataset [13] is dedicated to locating waste plastic bottles in mountains and wild grasses for recycling from a drone's view. The Okutama dataset [57] is specifically dedicated to detecting human actions among multiple people. The birds dataset [59], captured at a low resolution of about 25 pixels per object using cameras and telephoto lenses, targets bird detection at wind farms for ecological conservation. The combined dataset has five classes, named birds, cars, persons, bottles, and vehicles, as depicted in Fig. 2, totaling 5000 low-altitude UAV images for the machine and deep learning classification models. The original images were resized according to each pretrained network's input size, such as 224 × 224 for VGG and 299 × 299 for the Xception model. The low-altitude image data were shuffled to avoid ordering bias, and the performance comparisons of the various machine and deep network methods are made on these UAV datasets. The next section analyzes the results of multiple-object classification on the low-altitude UAV dataset.
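
Before moving on, the per-backbone resizing mentioned above can be expressed as a small helper; this sketch uses OpenCV, which the authors list among their libraries, and the size table simply restates the input sizes given above:

```python
import cv2

# Required input size per pretrained backbone (see Table 1)
INPUT_SIZE = {
    "VGG16": 224, "VGG19": 224, "DenseNet121": 224,
    "InceptionV3": 299, "Xception": 299, "InceptionResNetV2": 299,
}

def load_resized(path, backbone):
    """Read a low-altitude aerial image and resize it for the chosen network."""
    size = INPUT_SIZE[backbone]
    img = cv2.imread(path)  # loads in BGR channel order
    return cv2.resize(img, (size, size))
```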

Table 3 Low-altitude UAV dataset for object recognition
Fig. 2
figure 2

Multiple classes of low-altitude UAV dataset

4 Results and discussion

In this section, a comprehensive quantitative analysis is presented for the various machine classifiers and deep learning architectures used to predict urban objects in low-altitude aerial images. The experiments show that the handcrafted CNN achieved a maximum accuracy score of 92.48% compared with the machine classifiers; among KNN, naïve Bayes, decision trees, and random forests, random forests obtained the highest value of 90% on the low-altitude aerial data. Our experimental results indicate that deep networks are the right choice for achieving significant improvements in low-altitude aerial image classification. The overall accuracy score of the handcrafted CNN (92.48%), shown in Table 4, is higher than those of the machine-based classifiers in Table 6, yet the handcrafted CNN in turn degraded when compared with the pretrained networks. The deep network models were trained on various input sizes of the multiple low-altitude aerial datasets. Deep architectures such as VGG16, VGG19, InceptionV3, Xception, DenseNet121, and InceptionResNetV2 were used in the experiments; the acquired low-altitude aerial images were resized to 224 × 224 for the VGG16, VGG19, and DenseNet121 networks and to 299 × 299 for the InceptionV3, Xception, and InceptionResNetV2 networks.

Table 4 Performance results for handcrafted CNN

4.1 Analysis of performance metrics

In this section, performance metrics such as precision, recall, and F-1 score are analyzed. Confusion matrices for each machine learning-based classifier are used to better understand true positives and false positives for multi-object classification in low-altitude aerial images: Table 5 presents the confusion matrices for the KNN, naïve Bayes, decision tree, and random forest classifiers, whose diagonal values represent the true predictions out of the total samples. Performance metrics were evaluated for both the machine classifiers and the deep learning networks on low-altitude aerial images; precision, recall, F-1 score, and accuracy were calculated from the classification report. Detailed classification reports for the handcrafted CNN and the machine learning classifiers, with individual classes of the low-altitude UAV dataset, are displayed in Tables 4 and 6, respectively, together with per-class classification performance metrics. Combining the precision, recall, and F-1 scores of each machine classifier and the deep handcrafted CNN shows that the deep handcrafted model outperformed the machine classifiers. Detailed classification reports for the deep learning-based models with individual classes of the low-altitude UAV dataset are displayed in Fig. 3 and Table 7. The experiments show that the Xception model needed the longest time to train for the required number of epochs, as depicted in Fig. 4. The VGG16 and VGG19 models converged quickly, after which their performance stagnated. The deep networks behaved differently when trained on low-altitude aerial data than on standard images; Xception, DenseNet121, and InceptionResNetV2 outperformed InceptionV3 on the evaluation parameters. Our experimental results helped identify deep network choices for multiclass object classification in low-altitude aerial images. The accuracy of the handcrafted CNN (92.48%) is higher than those of the machine learning-based classifiers KNN (82.26%), naïve Bayes (83.26%), decision trees (79%), and random forests (90%). We therefore conclude that training a handcrafted deep neural network is preferable to the machine classifiers, as the accuracy obtained by the CNN exceeds each employed machine classifier. The machine learning-based classifiers could not achieve high performance because they face the following problems with low-altitude UAV images [26]:

  • Manually engineered features relying on aerial domain knowledge may not be adequate for object recognition tasks.

  • Handcrafted feature engineering is a time-consuming and quite tedious process.

  • The rigid mathematical models and assumptions behind machine classifiers restrict their flexibility in handling the varied shapes of aerial objects.

Table 5 Confusion matrix of machine learning-based classifiers
Table 6 Comparison of classification accuracy in machine learning-based classifiers
Fig. 3
figure 3

Performance of various deep learning models

Table 7 Comparison of classification accuracy between pretrained deep CNNs
Fig. 4
figure 4

Training duration of various deep learning models

The pretrained networks perform even better than the deep handcrafted network because the handcrafted CNN starts from randomly initialized weights, whereas the pretrained networks, trained on the large ImageNet dataset, provide better end-to-end learning; the difference lies in how the models' weights were trained. The six pretrained transfer learning-based deep networks show different multiple-object recognition results compared with the previous findings. Inception-ResNetV2 achieved an accuracy of 98.64% and a loss of 0.2041, matching the accuracy of the Xception network. InceptionV3 obtained 96.00% accuracy and 0.5740 loss, indicating that InceptionResNetV2 improves over InceptionV3 in our settings; Xception also outperformed InceptionV3 with its accuracy score of 98.64%. The more recently developed DenseNet121 likewise showed significant performance, owing to its concatenation of preceding layers' feature maps to form each layer's input, with an accuracy of 99.68% and a loss value of 0.0414.

4.2 Comparisons of accuracy and loss graphs

The training process of the deep networks for multiple object recognition was executed for 500 epochs. For each epoch, a summary of accuracy and loss is generated, and the resulting TensorBoard graphs for the deep networks are presented in Figs. 5 and 6. The plots in Fig. 6 show that the models' validation accuracy appears to have converged; the line plots for both accuracy and loss show good convergence behavior, although they are somewhat bumpy, and none of the models shows signs of over- or underfitting. The loss and accuracy values change very little after 400 epochs, from which we can assume that the models are trained. DenseNet121 performed very well on the multiple-object UAV dataset and achieved 99.68% accuracy. Fast convergence can be seen in the accuracy plot of Inception-ResNetV2 due to the learning capacity of the network. InceptionV3 did not perform well in our settings and obtained a loss value of 0.5714, higher than the other pretrained deep networks; Xception performed better than InceptionV3 but relatively poorly compared with the other deep networks trained on the low-altitude UAV dataset. The loss and accuracy values change little after 200 epochs for the VGG models, and both VGG16 and VGG19 performed best on the low-altitude UAV dataset.
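
For reference, per-epoch summaries such as those in Figs. 5 and 6 can be logged with a TensorBoard callback; a minimal sketch, assuming the `train_gen`/`val_gen` generators from Sect. 3.2:

```python
from tensorflow.keras.callbacks import TensorBoard

# Log per-epoch accuracy and loss for later inspection in TensorBoard
tb = TensorBoard(log_dir="logs/densenet121")
model.fit(train_gen, validation_data=val_gen, epochs=500, callbacks=[tb])
```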

Fig. 5
figure 5

Validation accuracy graphs of deep learning models

Fig. 6
figure 6

Validation loss graphs of deep learning models

In the comparison with state-of-the-art studies in Table 8, [60, 63, 64] made use of descriptor-based classification methods, which require hand engineering and complex methodology, and [66] employed hyperspectral images, developing a hail vegetation index to identify agricultural patterns. Our dataset contains objects of multiple sizes, and the impressive results of the VGG networks reveal that network depth is an important factor in obtaining high classification accuracy. The evaluation presented in Fig. 7 indicates that deep networks trained on standard images have a different scope than those trained on low-altitude aerial views: owing to inherent characteristics such as small object sizes, capture angle, resolution, orientation, and scale, low-altitude aerial images differ from natural images.

Table 8 Comparison with existing classification methods
Fig. 7
figure 7

Performance comparison of all algorithms

5 Conclusion

This paper has analyzed various machine learning- and deep learning-based classification networks to recognize multiple objects in low-altitude UAV images. The proposed evaluation compares the machine classifiers KNN, naïve Bayes, random forest, and decision trees with deep models such as the handcrafted CNN, VGG16, VGG19, InceptionV3, Xception, DenseNets, etc. Machine- and deep model-based classification experiments were conducted on low-altitude UAV images. Among the employed machine classifiers, random forests achieved better results than KNN, decision trees, and naïve Bayes; however, even the leading machine classifier, random forests, degraded on low-altitude aerial images when compared with the handcrafted CNN. Among the pretrained deep models for object recognition, VGG-D, InceptionV3, DenseNet121, Inception-ResNetV2, and Xception behaved differently when trained on low-altitude aerial data: DenseNet121 and Inception-ResNetV2 performed better than InceptionV3 and Xception, while VGG16 and VGG19 performed better than Xception, DenseNet121, and Inception-ResNetV2 due to the inherent characteristics of low-altitude data. Our experimental results provide academia and the research community with a reference for dealing with multiple object classification in low-altitude aerial images. Classification reports for each individual class, in terms of precision, recall, and F-1 score, are presented to analyze the models better.

Deep learning-based object classification in low-altitude aerial data appears to have a bright future. The widespread deployment of applications has influenced the aerial imaging market, which is expected to grow at a rate of 14.2% in the coming years. Among the major factors creating advanced prospects for the aerial imaging classification market are the recently published drone policies of the Government of India and the availability of artificial intelligence-based technologies. As part of our future work, we intend to explore human activity recognition and the detection of abnormal behaviors in surveillance-based UAV applications.