Plant identification with convolutional neural networks and transfer learning

Abstract


Introduction
Classification of plant and flower species has been the subject of numerous studies, and various methods and models have been developed for this task. In recent years, Convolutional Neural Networks (CNNs) have become prominent among these approaches because they offer important advantages over traditional ones. They have far fewer parameters than traditional fully connected artificial neural networks, so they are more computationally efficient. They are less prone to overfitting because parameters are shared across different regions of the image [1]. They can also learn more abstract features. For these reasons, state-of-the-art image classification models are based on Convolutional Neural Networks.
CNNs have three main types of layers: convolutional layers, pooling layers, and fully connected layers [1]. Convolutional layers apply the convolution operation to their inputs via kernels whose values are learned during training. Pooling layers downsample their inputs to reduce spatial size, and fully connected layers use the features learned by the preceding convolutional layers to perform classification.
Advances in hardware and the use of GPUs for parallel computation have made it possible to train deep networks on large-scale databases. ImageNet is one such database, and it is commonly used to evaluate the performance of visual recognition algorithms [2]. Deep Convolutional Neural Network models have demonstrated impressive performance on the ImageNet benchmark [3]. As a result, they are used in a wide variety of computer vision tasks.
Combining Convolutional Neural Networks with a technique called transfer learning significantly improves classification results, so transfer learning has also been used in this study. Transfer learning is accomplished by reusing parameters learned in a different but related domain. In this way, the learned parameters can be adapted to the task at hand and the model can achieve better generalization capacity [4]. In this regard, transfer learning can be applied by adapting large-scale networks trained on ImageNet to the target task.
Another useful technique for improving the generalization capability of computer vision models is data augmentation. In data augmentation, the number of samples is increased by generating synthetic data. New samples can be generated by rotating, flipping, cropping, or distorting the original samples. This can improve the model's generalization capability because patterns are presented to the model under different conditions [5], which yields a more robust model. Data augmentation techniques such as random rotation at four angles, random brightness change in the range [-0.2, 0.2] and horizontal flip have been used in this study.
In order to develop a plant and flower identification system, a dataset has been created. The image database has been split into training and test sets by stratified sampling [6]. The final training set consists of 29932 images and the final test set consists of 7483 images. Thanks to stratified sampling, each class is represented in both the training and test sets.
Center cropping and normalization were applied to the images before they were provided as input to the model. After 15 epochs, the proposed model achieved 0.9971 accuracy on the training set and 0.9897 accuracy on the test set.
The trained model has also been used to develop a mobile plant and flower identification application for the Android platform. Using this application, the proposed model has been tested on wild plants.
Tan et al. described the problems that can be overcome with transfer learning and the advantages it provides. They categorized approaches to transfer learning and presented this categorization [9]. L. Perez and J. Wang compared various data augmentation techniques and their effects on image classification. They applied traditional and recent data augmentation methods to a subset of the ImageNet dataset and presented their results. They also proposed their own method, called neural augmentation, in which a neural network learns the augmentations that best improve a classifier [10].

Related work
Dyrmann et al. developed a CNN model to classify 22 weed and crop species at early growth stages in order to support site-specific weed management. Their image dataset consists of 10,413 images compiled from six different datasets. They also used preprocessing and data augmentation techniques to improve the classification capacity of the network. They removed all non-green pixels from the images via segmentation and applied data augmentation by mirroring and rotating the original images. They achieved 86.20% accuracy on the test set [11].
Guillermo L. Grinblat et al. replaced the feature extraction and classification stages of a previous machine learning pipeline with a CNN model in order to classify white bean, red bean and soybean using vein morphological patterns. In their proposed pipeline, vein morphological patterns are extracted via UHMT, and central patch extraction is applied to the binary image outputs of the UHMT method. These stages ensure that color and shape information is removed and that only vein morphological patterns are used for classification. The outputs of these stages are used as inputs to the CNN models in two setups, named S1 and S2. They trained several CNN models, from 2 to 6 layers deep, on these images of vein morphological patterns and obtained the best results at a depth of 5 layers for both setups: 92.6% mean accuracy for S1 and 96.9% mean accuracy for S2 [12].
Kaya, A., et al. analyzed transfer learning methods and different transfer learning scenarios for automatic flower identification. They designed and implemented five classification models, evaluated them on four different datasets, and showed that transfer learning has a positive impact on the performance of deep learning models [13].
Sue Han Lee et al. used Convolutional Neural Networks to find the features that best represent leaf samples. They used a CNN model to learn representations of leaf images and then analyzed the learned features with feature visualization techniques to find the most important ones. They also quantified the features and identified those needed to represent the leaf data by training a CNN-based model on raw leaf data and then applying a Deconvolutional Network to its outputs. They analyzed how the CNN characterizes leaf data and found that the shape of the leaves is not a dominant feature, whereas the orders of venation are. They also proposed new features that describe the whole leaf structure together with local features that focus on leaf venation. They demonstrated that this hybrid global-local feature extraction improves the classification power of plant classification systems [14].
Mario Lasseck presented deep learning techniques for LifeCLEF 2017. In this work, Lasseck used deep convolutional networks and transfer learning via fine-tuning to classify 10,000 species, and trained several models on different datasets. He also applied bagging to the results of the models and used data augmentation techniques such as cropping square patches from each image at random positions, horizontal flipping, rotation, and random variation of saturation and lightness. His system achieved a mean reciprocal rank (MRR) of 92% and a top-5 accuracy of 96% on the official PlantCLEF test set [15].
Ghazi MM et al. used transfer learning by fine-tuning pre-trained GoogLeNet, AlexNet, and VGGNet models on the LifeCLEF 2015 dataset. To decrease the chance of overfitting, they also applied data augmentation techniques such as rotation, translation, reflection, and scaling. They fused these classifiers to improve overall performance. Their combined system achieved 80% accuracy on the validation set [16].
Milan Sulc et al. proposed an automatic recognition system for 10,000 plant species for ExpertLifeCLEF 2018. Their system is based on the Inception-ResNet-v2 and Inception-v4 Convolutional Neural Network architectures and their ensembles. They used preprocessing techniques such as random crop, random left-right flip, and brightness and saturation distortion. Instead of using the parameter values after the last training epoch, they used running averages of the trained network parameters and observed that this increases the accuracy of their models. Because the class frequencies in the training data follow a long-tailed distribution and therefore have different prior probabilities than the test set, they used the EM algorithm to estimate the test set priors by maximizing the likelihood of the test set observations. As a result, they achieved 88.4% accuracy on the full test set [17].

Data sources
The images in the database were compiled from web searches and the Oxford Flowers Dataset [18]. In total, the database consists of 5345 flower and plant images. Data augmentation was then applied to the created dataset, which increased its size to 37415 images [19].
There are 76 different plant species in the image database: 64 of them are flower species and 12 of them are other plant species. Species that belong to Turkey's flora or that have been brought to Turkey from abroad were chosen as target species.
A second data source that stores information about the species in the image database has also been created [19]. Each species has a corresponding record in this data source. Information about the species was collected from online sources and stored as a JSON file [20], [21]. The information for each species consists of its name, family name, genus and origin.
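As an illustration of what one record in such a JSON file might look like, the sketch below parses a single entry containing the four fields listed above. The key names and the example species are hypothetical; the actual layout of the data source is not specified in the text.

```python
import json

# Hypothetical record: the key names and the example species are assumptions
# made for illustration only; they are not taken from the real data source.
record_json = """
{
  "name": "Lavandula angustifolia",
  "family": "Lamiaceae",
  "genus": "Lavandula",
  "origin": "Mediterranean region"
}
"""

record = json.loads(record_json)

# A predicted class could then be mapped to its descriptive fields.
for field in ("name", "family", "genus", "origin"):
    print(f"{field}: {record[field]}")
```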

Data augmentation
After the image database was created, data augmentation techniques were applied to the images. Data augmentation is useful for increasing the generalization ability of the model. For each image, four random rotations, a random brightness change and a horizontal flip have been applied.
For random rotation, four angles were sampled uniformly, without replacement, from the range of -90 to 90 degrees, and each image was rotated once by each of these angles. Reflection was used to fill in the pixel values left empty after rotation.
After random rotation, a random brightness change in the range of -0.2 to 0.2 was applied, so that the output image's pixel values were randomly changed within 0.2 times below and above the original pixel values.
Finally, a horizontal flip was applied to the images. These transformations produce six new images for each original image (four rotations, one brightness change and one flip), which increased the dataset from 5345 to 37415 images. The transformations are shown in Figure 1.
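The augmentation steps described above can be sketched in plain Python on a small grayscale image stored as a list of rows. This is a minimal illustration and not the authors' implementation: the nearest-neighbour rotation, the multiplicative reading of the brightness change, and all function names are assumptions.

```python
import math
import random


def reflect_index(i, n):
    """Mirror an out-of-range index back into [0, n) at the image border."""
    i %= 2 * n
    return i if i < n else 2 * n - 1 - i


def rotate(image, angle_deg):
    """Rotate about the centre (nearest neighbour, reflected borders)."""
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Inverse mapping: locate the source pixel of each output pixel.
            sx = cos_a * (x - cx) + sin_a * (y - cy) + cx
            sy = -sin_a * (x - cx) + cos_a * (y - cy) + cy
            row.append(image[reflect_index(round(sy), h)][reflect_index(round(sx), w)])
        out.append(row)
    return out


def change_brightness(image, delta):
    """Assumption: scale pixels in [0, 1] by (1 + delta), then clip to [0, 1]."""
    return [[min(1.0, max(0.0, p * (1 + delta))) for p in row] for row in image]


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def augment(image, rng):
    """Return six variants: four rotations, one brightness change, one flip."""
    # Continuous uniform draws are distinct with probability one.
    angles = [rng.uniform(-90, 90) for _ in range(4)]
    variants = [rotate(image, a) for a in angles]
    variants.append(change_brightness(image, rng.uniform(-0.2, 0.2)))
    variants.append(horizontal_flip(image))
    return variants


image = [[0.0, 0.5, 1.0], [0.2, 0.4, 0.6], [0.9, 0.1, 0.3]]
variants = augment(image, random.Random(0))
total_images = 5345 * (1 + len(variants))  # originals plus six variants each
print(len(variants), total_images)  # prints "6 37415"
```

Keeping each original alongside its six variants gives seven images per original, which matches the reported growth from 5345 to 37415 images.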

Preprocessing
Preprocessing operations were applied to the images before they were provided to the CNN model as inputs.
In order to extract the most useful features, the images were cropped around the center. After center cropping, the images were rescaled to 224 x 224 pixels.
Finally, normalization was applied to the images in the dataset, scaling pixel values to the range of -1 to 1. In practice, normalization can speed up learning.
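A minimal sketch of the center crop and normalization steps is given below, using a toy grayscale image stored as a list of rows. It assumes 8-bit input pixels and the mapping x / 127.5 - 1, neither of which is stated in the text. The function names are illustrative, and the resize to 224 x 224 is left to an image library.

```python
def center_crop(image):
    """Crop the largest centred square from a 2-D image (list of rows)."""
    h, w = len(image), len(image[0])
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    return [row[left:left + side] for row in image[top:top + side]]


def normalize(image):
    """Assumption: map 8-bit pixel values from [0, 255] to [-1, 1]."""
    return [[p / 127.5 - 1.0 for p in row] for row in image]


# Toy 2 x 4 image; a real pipeline would resize the crop to 224 x 224
# with an image library before normalizing.
image = [[0, 64, 128, 255],
         [255, 128, 64, 0]]
cropped = center_crop(image)   # keeps the two middle columns
scaled = normalize(cropped)
print(cropped)
print(scaled)
```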

Proposed model
The proposed approach is based on transfer learning. Transfer learning can be applied to neural networks in two ways: by using a pre-trained network as a feature extractor, or by removing the final layer of a pre-trained network and adding a classifier for the specified task. The second way can be thought of as stacking two networks. All or some of the parameters of the pre-trained network can be kept as they are or trained together with the parameters of the appended classifier. This approach is called fine-tuning [22].
In this work, the fine-tuning approach has been used to transfer learning from the ImageNet domain to our task. MobileNetV2, trained on the ImageNet database, has been used as the pre-trained network.
MobileNet is an efficient, lightweight CNN architecture that targets mobile and embedded platforms with limited computational resources [23]. In this architecture, convolutional layers are replaced by so-called depthwise separable convolutions in order to perform the convolution operation more efficiently.
Convolutional layers are the backbone of CNN models. The first convolutional layer takes an image as input and performs the convolution operation on it by applying convolution filters. The number of convolutional filters is a hyperparameter, and it also determines the depth of the output. Convolutional layers preserve the spatial structure of the input image, and each applied filter produces a two-dimensional output. Stacking these two-dimensional outputs together yields the feature maps. In subsequent convolutional layers, both inputs and outputs are feature maps, and the depth of these feature maps depends on the number of filters in each convolutional layer.
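The relationship between the number of filters and the depth of the output can be seen in the following plain-Python sketch of a standard convolutional layer. It assumes stride one and no padding, and the function name and toy values are illustrative.

```python
def conv2d(image, filters):
    """Stride-1 convolution without padding (cross-correlation form).

    image:   [channels][height][width]
    filters: [num_filters][channels][k][k]
    returns: [num_filters][out_height][out_width]
    """
    channels, h, w = len(image), len(image[0]), len(image[0][0])
    k = len(filters[0][0])
    out_h, out_w = h - k + 1, w - k + 1
    feature_maps = []
    for f in filters:
        # Each filter spans all input channels and yields one 2-D output.
        fmap = [
            [
                sum(
                    f[c][i][j] * image[c][y + i][x + j]
                    for c in range(channels)
                    for i in range(k)
                    for j in range(k)
                )
                for x in range(out_w)
            ]
            for y in range(out_h)
        ]
        feature_maps.append(fmap)
    return feature_maps


# 3-channel 4 x 4 input filled with ones, and two 3 x 3 filters.
image = [[[1.0] * 4 for _ in range(4)] for _ in range(3)]
filters = [[[[v] * 3 for _ in range(3)] for _ in range(3)] for v in (1.0, 0.5)]
out = conv2d(image, filters)
print(len(out), len(out[0]), len(out[0][0]))  # depth, height, width
```

Two filters produce an output of depth two, regardless of the three input channels.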
Figure 2 shows the operations of a Convolutional Neural Network [24]. Depthwise separable convolution consists of two operations: depthwise convolution and pointwise convolution. Depthwise convolution applies a single filter to each input channel, in contrast to standard convolutional layers, in which a filter is applied to all channels at once. Pointwise convolution then applies a 1x1 convolution to the outputs of the depthwise convolution and combines them [23]. The operation of a depthwise separable convolution block is given in Figure 3 [25].
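The two stages of a depthwise separable convolution can be sketched in plain Python as follows. This is a simplified illustration with stride one and no padding, and without the batch normalization and ReLU used in MobileNet. The function names and toy values are assumptions.

```python
def depthwise_conv(image, dw_filters):
    """Apply one k x k filter per input channel; channels are not mixed.

    image:      [channels][height][width]
    dw_filters: [channels][k][k]
    """
    k = len(dw_filters[0])
    h, w = len(image[0]), len(image[0][0])
    out_h, out_w = h - k + 1, w - k + 1
    return [
        [
            [sum(f[i][j] * ch[y + i][x + j] for i in range(k) for j in range(k))
             for x in range(out_w)]
            for y in range(out_h)
        ]
        for ch, f in zip(image, dw_filters)
    ]


def pointwise_conv(image, pw_weights):
    """1x1 convolution: each output channel linearly combines input channels.

    pw_weights: [out_channels][in_channels]
    """
    h, w = len(image[0]), len(image[0][0])
    return [
        [
            [sum(wgt * image[c][y][x] for c, wgt in enumerate(weights))
             for x in range(w)]
            for y in range(h)
        ]
        for weights in pw_weights
    ]


# Two input channels (all ones, all twos) and one 3 x 3 filter per channel.
image = [[[1.0] * 3 for _ in range(3)], [[2.0] * 3 for _ in range(3)]]
dw_filters = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]
pw_weights = [[1.0, 1.0], [1.0, -1.0], [0.5, 0.0]]  # 2 channels in, 3 out

depthwise_out = depthwise_conv(image, dw_filters)
separable_out = pointwise_conv(depthwise_out, pw_weights)
print(depthwise_out)
print(separable_out)
```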
The output of a standard convolutional layer, assuming stride one and padding, is computed as below [23]:
$$G_{k,l,n} = \sum_{i,j,m} W_{i,j,m,n} \cdot X_{k+i-1,\,l+j-1,\,m}$$
where $X$ is the input feature map, $W$ is the convolution kernel and $G$ is the output feature map. The standard convolution operation requires $K \times K \times C_i \times C_o$ parameters, where $K$ is the spatial width and height of the square kernel, $C_i$ is the number of input channels and $C_o$ is the number of output channels. The total computational cost is $K \times K \times C_i \times C_o \times F \times F$, where $F$ is the spatial width and height of a square feature map.
In a depthwise separable convolution block, depthwise convolution is used to apply a single filter per input channel, and a 1x1 convolution called pointwise convolution is then used to create a linear combination of the outputs of the depthwise convolution layer. Batch normalization and ReLU nonlinearities are also used for both layers. Depthwise convolution is computed as below [23]:
$$\hat{G}_{k,l,m} = \sum_{i,j} \hat{W}_{i,j,m} \cdot X_{k+i-1,\,l+j-1,\,m}$$
where $\hat{W}$ is the depthwise kernel, whose $m$-th filter is applied to the $m$-th input channel. Depthwise convolution has a computational cost of $K \times K \times C_i \times F \times F$. Pointwise convolution has a cost of $C_i \times C_o \times F \times F$, so the total cost of the depthwise separable block is $K \times K \times C_i \times F \times F + C_i \times C_o \times F \times F$. Compared with the standard convolutional layer, this leads to a reduction in computation of:
$$\frac{K \times K \times C_i \times F \times F + C_i \times C_o \times F \times F}{K \times K \times C_i \times C_o \times F \times F} = \frac{1}{C_o} + \frac{1}{K^2}$$
MobileNetV2 introduces some architectural changes over MobileNetV1. Although it still uses depthwise and pointwise convolutions, its main building block has been changed. In addition to the two layers of the previous depthwise separable block, a new 1x1 convolution layer has been introduced before the depthwise convolution layer [23]. The first 1x1 convolution layer increases the number of channels of the feature map before it is passed to the depthwise convolution layer, and the second 1x1 convolution layer decreases the number of channels of the output feature map of the depthwise convolution layer. A residual connection that behaves like a short-cut has also been introduced to the network.
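The cost comparison between a standard convolution (K x K x Ci x Co x F x F) and a depthwise separable block (K x K x Ci x F x F + Ci x Co x F x F) can be checked numerically with a short sketch. The layer sizes below are illustrative assumptions and are not taken from the paper.

```python
def standard_conv_cost(K, Ci, Co, F):
    """Multiply-accumulate count of a standard convolutional layer."""
    return K * K * Ci * Co * F * F


def separable_conv_cost(K, Ci, Co, F):
    """Cost of a depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise = K * K * Ci * F * F
    pointwise = Ci * Co * F * F
    return depthwise + pointwise


# Illustrative sizes: 3 x 3 kernel, 32 -> 64 channels, 112 x 112 feature map.
K, Ci, Co, F = 3, 32, 64, 112
standard = standard_conv_cost(K, Ci, Co, F)
separable = separable_conv_cost(K, Ci, Co, F)
ratio = separable / standard

print(standard, separable)
print(round(ratio, 4), round(1 / Co + 1 / K**2, 4))  # the two values agree
```

For these sizes the separable block needs roughly one eighth of the computation of the standard layer, and the ratio equals 1/Co + 1/K^2.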
Short-cut connections ease the vanishing gradient problem of deep networks and improve the ability of gradients to propagate across the network [26]. The main building block of MobileNetV2, called the bottleneck residual block, is given in Figure 4 [25]. Compared with MobileNetV1's depthwise separable blocks, bottleneck residual blocks have one additional 1x1 convolution layer; however, the new architecture provides the ability to use smaller input and output dimensions and a more efficient ...
Each layer is repeated n times in the architecture; e denotes the expansion factor, c denotes the number of output channels and s denotes the stride.
For fine-tuning, the final layer of the MobileNetV2 architecture has been removed and replaced with a one-layer neural network classifier consisting of 76 neurons.
In the training phase, the parameters of the first layers of MobileNet were kept constant, and only the parameters of its last 20 layers and of the added classifier were trained.
Images have been provided as input to the network in batches of 32 images.
The categorical cross-entropy function has been used as the loss function for training, and the Adamax [27] optimization method has been used as the gradient descent optimizer. The initial learning rate has been set to 0.02.
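The fine-tuning setup described above might look roughly as follows in Keras. This is a sketch under stated assumptions and not the authors' code: the global average pooling before the classifier and the softmax activation are assumptions, and `build_model` is an illustrative name.

```python
import tensorflow as tf

NUM_CLASSES = 76  # number of species in the image database


def build_model(weights="imagenet"):
    """MobileNetV2 base with a new 76-way classifier; last 20 base layers trainable."""
    base = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3),
        include_top=False,  # drop the original ImageNet classifier
        weights=weights,
        pooling="avg",      # assumption: global average pooling before the classifier
    )
    # Keep the early layers fixed; only the last 20 base layers stay trainable.
    for layer in base.layers[:-20]:
        layer.trainable = False

    # Assumption: softmax output, consistent with categorical cross-entropy.
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(base.output)
    model = tf.keras.Model(inputs=base.input, outputs=outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adamax(learning_rate=0.02),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Training would then call `model.fit` on batches of 32 preprocessed images with one-hot labels for 15 epochs. The data pipeline is omitted here because it is not described in enough detail to reproduce.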

Experimental results
The model has been developed in the Python programming language and trained on Google Colab in order to make use of GPUs. Keras, OpenCV and various other libraries were used.
Additionally, precision, recall and F1-score metrics have been evaluated for each class. The terms positive and negative refer to the predictions of the classifier, and the terms true and false refer to whether the predicted class is correct or not [28].
A mobile plant identification application based on the proposed model has also been developed for the Android platform [19]. The application has been developed in the Java programming language, and the trained model has been integrated into it. It allows users to select images from their gallery or take images with their camera, and the application then predicts the species in the image. Information such as name, family, genus and origin is shown to the user based on the prediction. The application has been tested on wildflowers and plants. A sample is given in Figure 10.
Figure 10. A sample from our mobile application, tested in the wild.
O'Shea et al. described Convolutional Neural Networks and their advantages over traditional neural networks. The authors explained the building blocks of CNNs, their overall architecture and the operation of each layer [1]. Alex Krizhevsky et al. proposed their CNN model for ImageNet classification. They demonstrated the success of CNN models and the use of GPUs to accelerate deep learning [7]. Asifullah Khan et al. presented a survey of recent CNN architectures. They described the history of CNNs, their components and the architectural evolution of CNN models. They also discussed applications of CNNs and recent techniques used with them [8]. C. Tan et al. presented a survey on deep transfer learning.

Figure 1. Original image and transformed images (random rotation, random brightness change and horizontal flip applied to the original image, respectively).

Figure 2. Operation of a standard convolutional layer.
Before the training and test phases, the samples in the database were shuffled and then split into training and test sets by stratified sampling. The training set consists of 29932 images and the test set consists of 7483 images, so the test set contains 1/5 of all images in the database. In the training process, an accuracy of 0.9971 was achieved on the training set after 15 epochs. The test set was also evaluated in each epoch. Plots of accuracy and loss for the training set are given in Figure 5 and Figure 6, respectively.
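A stratified split of this kind can be sketched in plain Python as follows. This is an illustrative implementation with an assumed function name and toy labels. With a test ratio of 1/5, the same procedure applied to 37415 images would give approximately the 29932 / 7483 split reported above.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_ratio=0.2, seed=0):
    """Return (train_indices, test_indices), splitting each class at test_ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class before splitting
        n_test = round(len(indices) * test_ratio)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test


# Toy labels: three classes with 10, 20 and 30 samples.
labels = ["a"] * 10 + ["b"] * 20 + ["c"] * 30
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # prints "48 12"
```

Because the split is made class by class, every class appears in both sets in the same proportion.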

Figure 5. Accuracy with respect to epoch on the training set.

Figure 6. Loss with respect to epoch on the training set.

The model was evaluated on the test set after each epoch, and an accuracy of 0.9897 was achieved on the test set. Plots of accuracy and loss for the test set are given in Figure 7 and Figure 8.

Figure 7. Accuracy with respect to epoch on the test set.

Figure 8. Loss with respect to epoch on the test set.

After training, some examples were drawn randomly from the test set and used for prediction in order to assess the model visually. Some of these samples are given in Figure 9. They were predicted correctly.

Figure 9. Some additional visual tests on the test set.

Table 1. Comparison with other studies.
A comparison with other studies is given in Table 1. Per-class precision, recall and F1-score metrics are provided in Appendix A.
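Per-class precision, recall and F1-score can be computed from true and predicted labels as in the following sketch. The function name and the toy labels are hypothetical and do not come from the experiments.

```python
def per_class_metrics(y_true, y_pred):
    """Compute precision, recall and F1-score for each class label."""
    pairs = list(zip(y_true, y_pred))
    metrics = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in pairs)  # true positives
        fp = sum(t != cls and p == cls for t, p in pairs)  # false positives
        fn = sum(t == cls and p != cls for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[cls] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Hypothetical labels for illustration only.
y_true = ["rose", "rose", "tulip", "tulip", "daisy"]
y_pred = ["rose", "tulip", "tulip", "tulip", "daisy"]
metrics = per_class_metrics(y_true, y_pred)
for cls, values in metrics.items():
    print(cls, values)
```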