Classifying Apples and Oranges: A Machine Learning Example with Images for Beginners

Python Script
Oct 3, 2021

In this article we will:

  • Learn about texture and color features in an image
  • Learn how a decision tree works
  • Train a decision tree on texture and color features to differentiate between apples and oranges

Requirements

Things you will need for this walkthrough:

  1. Data set containing apple and orange images
  2. Anaconda Python
  3. OpenCV library
  4. Sklearn (scikit-learn) library, which is included with Anaconda

Introduction

A decision tree is a supervised machine learning algorithm which learns to classify data into two or more classes. Let’s take some images of apples and oranges and use a decision tree to classify them. First, we will have to extract a few features from the images using image processing. Then, these features will be used to train a decision tree that we build. Finally, we can test the working of the decision tree on images it has never seen before.

About Database

A supervised learning algorithm always needs labeled data. This means that we need to tell the decision tree which images are apples and which images are oranges. In order to do this, we separate our data into folders, and the folder name is taken as our label.
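
As a rough illustration of the folder-as-label idea, the sketch below walks a directory tree and pairs every image file with the name of the folder it sits in. The helper name `list_labeled_images` and the demo file names are made up for this example; the demo builds a throwaway folder tree with empty placeholder files rather than touching the real data set.

```python
import os
import tempfile

def list_labeled_images(root, extensions=(".jpg", ".jpeg", ".png")):
    """Return (file_path, label) pairs, using each subfolder name as the label."""
    pairs = []
    for label in sorted(os.listdir(root)):
        folder = os.path.join(root, label)
        if not os.path.isdir(folder):
            continue  # skip stray files next to the class folders
        for name in sorted(os.listdir(folder)):
            if name.lower().endswith(extensions):
                pairs.append((os.path.join(folder, name), label))
    return pairs

# Demo on a throwaway folder tree with empty placeholder files
# (hypothetical file names, not the real data set).
with tempfile.TemporaryDirectory() as demo_root:
    for label, names in {"apple": ["a1.jpg", "a2.jpg"], "orange": ["o1.jpg"]}.items():
        os.makedirs(os.path.join(demo_root, label))
        for name in names:
            open(os.path.join(demo_root, label, name), "w").close()
    demo_pairs = list_labeled_images(demo_root)

demo_labels = [label for _, label in demo_pairs]
print(demo_labels)  # ['apple', 'apple', 'orange']
```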

In Machine Learning, it is common practice to split the dataset into two (or sometimes more) parts. This is done to detect a problem called ‘overfitting’. Overfitting occurs when an ML algorithm fits its training data so closely, including its quirks and noise, that it performs poorly on data it has not seen.

For example, the images we are using aren’t the only possible images of oranges and apples. There are many varieties of these fruits, and each individual fruit looks different. An overfitted model would recognize the images it has learnt but make unreliable predictions for any other image.

To guard against this problem, the data is split into two parts: training data and testing data. The ML algorithm is trained only on the training data. The testing data is then used to check how well the algorithm handles images it has never been trained on. The results obtained on the testing data can therefore, ideally, be used as an estimate of the real-world accuracy of the ML algorithm.
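
The data set used here already comes split into “train” and “test” folders, but the idea can be sketched in a few lines. The example below is only an illustration: the helper names `split_train_test` and `accuracy`, the 20% test fraction, and the file names are assumptions made for this sketch.

```python
import random

def split_train_test(samples, test_fraction=0.2, seed=0):
    """Shuffle the samples and hold out a fraction of them for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical file names paired with labels, standing in for real images.
samples = [("apple_%d.jpg" % i, "apple") for i in range(5)] + \
          [("orange_%d.jpg" % i, "orange") for i in range(5)]
train_set, test_set = split_train_test(samples)
print(len(train_set), len(test_set))  # 8 2

# Three of these four made-up predictions are right.
print(accuracy(["apple", "orange", "apple", "orange"],
               ["apple", "orange", "orange", "orange"]))  # 0.75
```

The key point is that no sample appears in both parts, so the accuracy measured on the test part reflects images the model has not seen.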

You can navigate to the “Baby Machine Data\ap-or-database\train” folder. Here you will find two folders — “apple” and “orange”. Each of these folders has been loaded with images downloaded from Google Images. You can explore these images and see the variety that they have.

Apple and Orange database for training

About Decision Tree Algorithm

A decision tree consists of a tree of nodes, where each node asks a question: it checks whether a feature satisfies a condition. Based on the answer to that question, the tree decides which question to ask next. This process continues until it is able to separate the data into classes.

In our case, each node compares one of the extracted numbers (a color value or a texture value) with a threshold, which amounts to asking questions such as “how red is the dominant color in this image?” or “how rough is the texture of this image?”. During training, the tree learns which feature to test at each node and which threshold best separates oranges from apples.
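
To make the idea of nodes asking questions concrete, here is a tiny decision tree written by hand as nested conditions. It is only a sketch: the two features (`redness` and `roughness`, both assumed to be scaled between 0 and 1) and the thresholds are invented for illustration, whereas a real decision tree learns them from the training data.

```python
def classify_fruit(redness, roughness):
    """Walk a tiny hand-written decision tree and return a label.

    Both inputs are assumed to be scaled to the range 0..1, and the
    thresholds are made up for illustration, not learned from data.
    """
    if redness > 0.6:            # root node: is the fruit strongly red?
        return "apple"
    if roughness > 0.5:          # second node: is the surface rough?
        return "orange"
    return "apple"               # smooth and not very red, e.g. a green apple

print(classify_fruit(redness=0.9, roughness=0.2))  # apple
print(classify_fruit(redness=0.3, roughness=0.8))  # orange
```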

Working of a Decision Tree

Step 1 : Feature Extraction

In order to train our decision tree, we have to extract the features from our training images. We observe that color and texture are the most prominent features distinguishing apples from oranges. So, let’s extract these features.

First, we’ll extract the texture feature. The Haralick texture features are a set of numbers computed from an image that give us an idea about its texture. The method looks at each pixel together with the pixels in its neighborhood and records how often each pair of gray levels occurs next to each other. From these counts it calculates statistics, such as how strongly neighboring pixels differ, which summarise the texture of the image as a whole. The features are computed for several directions of adjacency, and we will average them over those directions before using them.
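
The sketch below shows the underlying idea on a tiny scale. It counts pairs of horizontally adjacent gray levels and computes one statistic, contrast, from those counts. This is a simplified illustration and not how mahotas is implemented: the real Haralick features cover several directions and more statistics, and the function name `cooccurrence_contrast` is made up for this example.

```python
def cooccurrence_contrast(image, levels):
    """Contrast of horizontally adjacent pixel pairs in a small gray image.

    `image` is a list of rows holding integer gray levels in range(levels).
    """
    # Count how often gray level a appears immediately left of gray level b.
    counts = [[0] * levels for _ in range(levels)]
    total = 0
    for row in image:
        for a, b in zip(row, row[1:]):
            counts[a][b] += 1
            total += 1
    # Contrast: squared gray-level difference, weighted by how common the pair is.
    return sum((a - b) ** 2 * counts[a][b]
               for a in range(levels) for b in range(levels)) / total

smooth = [[1, 1, 1, 1],
          [1, 1, 1, 1]]
rough = [[0, 3, 0, 3],
         [3, 0, 3, 0]]
print(cooccurrence_contrast(smooth, levels=4))  # 0.0
print(cooccurrence_contrast(rough, levels=4))   # 9.0
```

A uniform patch scores zero, while a patch that flips between dark and bright scores high, which is the kind of difference that helps separate a smooth apple skin from a dimpled orange peel.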

In order to extract the Haralick features, we will be using a command from the mahotas library. We will need to first install this library:

conda install -c conda-forge mahotas

Note : Make sure all instances of Anaconda are closed (including Anaconda Navigator, Spyder, and Jupyter) while installing the mahotas library.

Once it finishes installing, you can test whether it has installed properly. Type python (or python3 on some systems) in your command line. The “>>>” prompt indicates that you can now type Python code. Then type import mahotas as mt. If you do not get any error, then mahotas has been installed correctly.

We will also be extracting the most dominant color from the image. This will give us an idea if the image is largely red, or largely orange, which our decision tree can use to make its decisions. To do this, we will have to use OpenCV to analyze the pixels of the image. First, we will install OpenCV.

conda install -c conda-forge opencv

Note : Make sure all instances of Anaconda are closed (including Anaconda Navigator, Spyder, and Jupyter) while installing the OpenCV library.

Once it finishes installing, you can test whether it has installed properly. Type python (or python3 on some systems) in your command line. The “>>>” prompt indicates that you can now type Python code. Then type import cv2. If you do not get any error, then OpenCV has been installed correctly.
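
If you would rather check all the libraries in one go, a small script along these lines may help. It is a sketch that uses only the Python standard library; the helper name `is_installed` is made up for this example, and it only checks that Python can locate each module, not that the module works.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can locate the module, without importing it."""
    return importlib.util.find_spec(module_name) is not None

for name in ("mahotas", "cv2", "sklearn"):
    status = "found" if is_installed(name) else "MISSING"
    print("{:8s} {}".format(name, status))
```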

  • First, download this data set onto your PC. Unzip it to a folder.
  • Open Spyder. On the top right corner, there will be an icon of a folder. Click on it and navigate to the extracted database folder. Make sure you are inside the Data set folder. If you select any other folder, including the “ap-or-database” folder, the code will not work, since it is looking for a particular folder.
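  • Copy the code below to your editor window and save it as “fruit_feature.py” (with an underscore), since the training script in Step 2 imports it under that name. You will also find this code as “fruit-feature.py” in your downloaded folder.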

import cv2
import mahotas as mt
import numpy as np


def extract_features(image):
    """Return the mean Haralick texture values followed by the dominant B, G, R color."""
    # calculate Haralick texture features for 4 directions of adjacency
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    textures = mt.features.haralick(gray)
    # average over the 4 directions to get one value per texture feature
    ht_mean = textures.mean(axis=0)

    # cluster the pixel colors into n_colors groups with k-means
    pixels = np.float32(image).reshape((-1, 3))
    n_colors = 3
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 200, .1)
    flags = cv2.KMEANS_RANDOM_CENTERS
    _, labels, centroids = cv2.kmeans(pixels, n_colors, None, criteria, 10, flags)
    palette = np.uint8(centroids)

    # the dominant color is the center of the cluster with the most pixels
    # (np.bincount replaces scipy.stats.itemfreq, which recent SciPy releases no longer provide)
    counts = np.bincount(labels.flatten(), minlength=n_colors)
    dominant_color = palette[np.argmax(counts)]

    # one feature vector: texture values followed by the dominant color
    feat = np.concatenate((ht_mean, dominant_color))
    return feat
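
To build some intuition for what “dominant color” means, here is a much simpler stand-in that needs no libraries. Instead of k-means clustering, it rounds every pixel to a coarse color bin and returns the most common bin. This is a different and cruder technique than the one in extract_features, shown only as an illustration; the function name and the bin size of 64 are assumptions made for this sketch.

```python
from collections import Counter

def dominant_color_by_counting(pixels, step=64):
    """Return the most common coarse color among (b, g, r) pixel tuples.

    Each channel is rounded down to a multiple of `step` so that similar
    shades land in the same bin. This is a simpler stand-in for k-means.
    """
    bins = Counter(tuple((channel // step) * step for channel in pixel)
                   for pixel in pixels)
    return bins.most_common(1)[0][0]

# Six reddish pixels and two greenish ones, in OpenCV's B, G, R order.
pixels = [(10, 20, 200)] * 4 + [(30, 40, 220)] * 2 + [(20, 180, 40)] * 2
print(dominant_color_by_counting(pixels))  # (0, 0, 192)
```

Both shades of red fall into the same bin, so red wins with six of the eight pixels. K-means does something similar, but it chooses the groups from the data instead of using fixed bins.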

Step 2 : Training the Decision Tree

You can now start training the decision tree. You can do this with the functions provided by sklearn, which comes pre-installed with Anaconda.

  • Copy the code below to your editor window and run the code. You will also find this code as “baby-machine.py” in your downloaded folder.

import glob
import os

import cv2
from sklearn import tree

from fruit_feature import extract_features

# load the training dataset
train_path = "ap-or-database/train"
train_names = os.listdir(train_path)

# empty lists to hold feature vectors and train labels
train_features = []
train_labels = []

# loop over the training dataset; each folder name is used as the label
for train_name in train_names:
    cur_path = train_path + "/" + train_name
    cur_label = train_name
    i = 1
    for file in glob.glob(cur_path + "/*.jpg"):
        print("Processing Image - {} in {}".format(i, cur_label))
        # read the training image; skip files that OpenCV cannot open
        image = cv2.imread(file)
        if image is None:
            continue
        # extract texture and color from the image
        features = extract_features(image)
        # append the feature vector and label
        train_features.append(features)
        train_labels.append(cur_label)
        # show loop update
        i += 1

# create the classifier
clf = tree.DecisionTreeClassifier()

# train the classifier
print("Training model..")
clf.fit(train_features, train_labels)

# loop over the test images
test_path = "ap-or-database/test"
for file in glob.glob(test_path + "/*.jpg"):
    # read the input image; skip files that OpenCV cannot open
    image = cv2.imread(file)
    if image is None:
        continue
    # extract texture and color from the image
    features = extract_features(image)
    # predict the label for this single feature vector
    prediction = str(clf.predict(features.reshape(1, -1))[0])
    # draw the label on the image
    cv2.putText(image, prediction, (20, 30), cv2.FONT_HERSHEY_SIMPLEX, 1.0, (235, 188, 0), 3)
    print("Prediction - {}".format(prediction))
    # display the output image and wait for a key press
    cv2.imshow("Test_Image", image)
    cv2.waitKey(0)

cv2.destroyAllWindows()

Running this script trains your decision tree. The code then takes each image from the “test” folder and shows you the label that the decision tree predicts for it; press any key to move on to the next image. You can add more images to this folder to see what output you get. You can also add more images to the training data to improve results.
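
If you are curious about which questions a trained tree has actually learned, sklearn can print them as text. The sketch below trains a classifier on a handful of invented feature vectors (the values and the feature names `redness` and `roughness` are assumptions for illustration, not output of extract_features) and prints the learned rules with `export_text`.

```python
from sklearn import tree

# Toy feature vectors [redness, roughness]; the values are invented for illustration.
X = [[0.9, 0.2], [0.8, 0.3], [0.85, 0.25], [0.3, 0.8], [0.2, 0.7], [0.25, 0.9]]
y = ["apple", "apple", "apple", "orange", "orange", "orange"]

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Print the learned questions as indented if/else rules.
print(tree.export_text(clf, feature_names=["redness", "roughness"]))

# Classify two new, clearly different samples.
print(clf.predict([[0.95, 0.1], [0.1, 0.95]]))
```

With the real classifier from the script above, the same call works without feature_names, and the rules then refer to the features by their position in the feature vector.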

Output

Predicted Image
Predicted Image

Python Script

Data Science enthusiast | Kaggler | Machine Hack Concept-A-Thon Winner | Technical blogger based in New Delhi, India