A Primer on Machine Learning with Python
In the past decade, machine learning has moved from scientific research labs into everyday web and mobile apps. Machine learning enables your applications to perform tasks that were previously very difficult to program, such as detecting objects and faces in images, detecting spam and hate speech, and generating smart replies for emails and messaging apps.
But performing machine learning is fundamentally different from classic programming. In this article, you’ll learn the basics of machine learning and will create a basic model that can predict the species of flowers based on their measurements.
How Does Machine Learning Work?
Classic programming relies on well-defined problems that can be broken down into distinct classes, functions, and if–else commands. Machine learning, on the other hand, relies on developing its behavior based on experience. Instead of providing machine learning models with rules, you train them through examples.
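The contrast between the two approaches can be sketched in a few lines. This is only an illustration, not part of the flower project: the temperature data, the 25-degree threshold, and the function name classify_by_rule are all made up for the example, and it assumes Scikit-learn is already installed (installation is covered later in this article).

```python
from sklearn.tree import DecisionTreeClassifier

def classify_by_rule(temp_c):
    """Classic programming: the developer writes the rule explicitly."""
    return "hot" if temp_c >= 25 else "cold"

# Machine learning: the model infers a rule from labeled examples.
# These temperatures and labels are invented for illustration.
temps = [[5], [12], [18], [22], [27], [30], [35], [40]]
labels = ["cold", "cold", "cold", "cold", "hot", "hot", "hot", "hot"]
model = DecisionTreeClassifier(random_state=0).fit(temps, labels)

print(classify_by_rule(33))        # answer from the hand-written rule
print(model.predict([[33]])[0])    # answer from the learned model
```

In the first case the threshold lives in the source code. In the second, nobody wrote a threshold: the model derived one from the examples it was given.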
There are different categories of machine learning algorithms, each of which can solve specific problems.
Supervised learning
Supervised learning is suitable for problems where you want to go from input data to outcomes. The common trait of all supervised learning problems is that there’s a ground truth against which you can test your model, such as labeled images or historical sales data.
Supervised learning models can solve regression or classification problems. Regression models predict quantities (such as the number of items sold or the price of a stock), while classification models try to determine the category of input data (such as cat/dog/fish/bird or fraud/not fraud).
Image classification, face detection, stock price prediction, and sales forecasting are examples of problems supervised learning can solve.
Some popular supervised learning algorithms include linear and logistic regression, support vector machines, decision trees, and artificial neural networks.
Unsupervised learning
Unsupervised learning is suitable for problems where you have data but instead of outcomes, you’re looking for patterns. For instance, you might want to group your customers into segments based on their similarities. This is called clustering in unsupervised learning. Or you might want to detect malicious network traffic that deviates from the normal activity in your enterprise. This is called anomaly detection, another unsupervised learning task. Unsupervised learning is also useful for dimensionality reduction, a technique that simplifies machine learning tasks by reducing the number of features, either by dropping irrelevant ones or by combining correlated ones.
Some popular unsupervised learning algorithms include K-means clustering and principal component analysis (PCA).
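As a rough sketch of both algorithms, the snippet below runs K-means and PCA on the flower measurements used later in this article, while ignoring the species labels. The choice of three clusters and two components is an assumption made for the example, not something the data dictates.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # measurements only; the labels are not used

# Clustering: group the samples into 3 clusters based on similarity.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: project the 4 features onto 2 components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(cluster_ids[:10])                # cluster index of the first 10 samples
print(X_2d.shape)                      # (150, 2)
print(pca.explained_variance_ratio_)   # share of variance kept per component
```

Note that the cluster indices are arbitrary: cluster 0 is not guaranteed to correspond to any particular species.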
Reinforcement learning
Reinforcement learning is a branch of machine learning in which an intelligent agent tries to achieve a goal by interacting with its environment. Reinforcement learning involves actions, states, and rewards. An untrained RL agent starts by randomly taking actions. Each action changes the state of the environment. If the agent finds itself in the desired state, it receives a reward. The agent tries to find sequences of actions and states that produce the most rewards.
Reinforcement learning is used in recommendation systems, robotics, and game-playing bots such as Google’s AlphaGo and AlphaStar.
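To make the loop of actions, states, and rewards concrete, here is a minimal sketch of tabular Q-learning, one common reinforcement learning algorithm, on a made-up environment: a corridor of five positions where the agent is rewarded only for reaching the rightmost one. The environment, the constants, and the step function are all hypothetical and chosen purely for illustration.

```python
import random

random.seed(0)

N_STATES = 5          # positions 0..4; position 4 is the goal
ACTIONS = [-1, +1]    # index 0 moves left, index 1 moves right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2  # learning rate, discount, exploration

# Q-table: estimated future reward for each (state, action) pair.
q = [[0.0, 0.0] for _ in range(N_STATES)]

def step(state, action_idx):
    """Apply an action and return (next_state, reward, done)."""
    next_state = min(max(state + ACTIONS[action_idx], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

for _ in range(200):  # 200 training episodes
    state, done = 0, False
    while not done:
        # Act randomly sometimes (or when the estimates tie); otherwise
        # take the action with the higher estimated reward.
        if random.random() < EPSILON or q[state][0] == q[state][1]:
            a = random.randrange(2)
        else:
            a = 0 if q[state][0] > q[state][1] else 1
        nxt, reward, done = step(state, a)
        target = reward + (0.0 if done else GAMMA * max(q[nxt]))
        q[state][a] += ALPHA * (target - q[state][a])
        state = nxt

# The learned policy: the preferred action in each non-goal state.
policy = ["left" if q[s][0] > q[s][1] else "right" for s in range(N_STATES - 1)]
print(policy)
```

The agent starts out moving at random, as described above, and gradually learns that moving right from every position is the sequence of actions that leads to the reward.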
Setting Up the Python Environment
In this post, we’ll focus on supervised learning, because it’s the most popular branch of machine learning and its results are easier to evaluate. We will be using Python, because it has many features and libraries that support machine learning applications. But the general concepts can be applied to any programming language that has similar libraries.
(In case you’re new to Python, freeCodeCamp has a great crash course that will get you started with the basics.)
One of the Python libraries often used for data science and machine learning is Scikit-learn, which provides implementations of popular machine learning algorithms. Scikit-learn is not part of the base Python installation and you must install it manually.
Most Linux distributions, and many versions of macOS, come with Python preinstalled. To install the Scikit-learn library, type the following command in a terminal window:
pip install scikit-learn
Or for Python 3:
python3 -m pip install scikit-learn
On Microsoft Windows, you must install Python first. You can get the installer of the latest version of Python 3 for Windows from the official website. After installing Python, type the following command in a command-line window:
python -m pip install scikit-learn
Alternatively, you can install the Anaconda distribution, which includes an independent installation of Python 3 along with Scikit-learn and many other libraries used for data science and machine learning, such as NumPy, SciPy, and Matplotlib. You can find the installation instructions for the free Individual Edition of Anaconda on its official website.
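Whichever route you take, a quick way to confirm that the installation worked is to import the library and print its version. This is just a sanity check; the version number you see will depend on when and how you installed it.

```python
import sklearn

# If this import succeeds, Scikit-learn is available to this Python interpreter.
print(sklearn.__version__)
```

If the import fails with ModuleNotFoundError, the library was most likely installed for a different Python interpreter than the one you are running.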
Step 1: Define the Problem
The first step to every machine learning project is knowing what problem you want to solve. Defining the problem will help you determine the kind of data you need to gather and give you an idea of the kind of machine learning algorithm you’ll need to use.
In our case, we want to create a model that predicts the species of a flower based on the measurements of the petal and sepal length and width.
This is a supervised classification problem. We’ll need to gather a list of measurements of different specimens of flowers and their corresponding species. Then we’ll use this data to train and test a machine learning model that can map measurements to species.
Step 2: Gather the Data
One of the trickiest parts of machine learning is gathering data to train your models. You’ll have to find a source where you can gather data in the quantity needed to train your model. You’ll also need to verify the quality of your data, make sure it’s representative of the different cases your model will handle, and avoid collecting data that contains hidden biases.
Luckily for us, Scikit-learn contains several toy datasets to try out different machine learning algorithms. One of them is the “Iris flower dataset”, which happens to contain the exact data that we need for our problem. All we need to do is to load it from the library.
The following code loads the Iris dataset:
from sklearn.datasets import load_iris
iris = load_iris()
The Iris dataset contains 150 observations, each containing four measurements (iris.data) and the target flower species (iris.target). The names of data columns can be seen in iris.feature_names:
print(iris.feature_names)
'''
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']
'''
iris.target contains the numerical index (0–2) of one of three flower species registered in the dataset. The names of the flower species are available in iris.target_names:
print(iris.target_names)
'''['setosa' 'versicolor' 'virginica']'''
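Before moving on, it can help to look at the shape of the data and at how many samples belong to each species. The sketch below does that with NumPy's bincount function; the variable name counts is just a label chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

print(iris.data.shape)     # (150, 4): 150 observations, 4 measurements each
print(iris.target.shape)   # (150,): one species index per observation

# First observation and its species index.
print(iris.data[0], iris.target[0])

# Number of observations per species.
counts = np.bincount(iris.target)
for name, count in zip(iris.target_names, counts):
    print(name, count)
```

The dataset is perfectly balanced, with 50 observations for each of the three species, which makes accuracy a reasonable metric for evaluating a classifier on it.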
Step 3: Split the Dataset
Before beginning the training, you must split your data into a train and test set. You’ll use the train set to train your machine learning model and the test set to verify its accuracy.
This is to make sure your model has not overfit the training data. Overfitting happens when your machine learning model performs well on the training examples but poorly on unseen data. It can result from choosing the wrong machine learning algorithm, misconfiguring the model, having poor training data, or having too few training examples.
Depending on the kind of problem you’re solving and the amount of data you have, you must determine how much of your data you’ll allocate to the test set. Usually, when you have a lot of data (on the order of tens of thousands of examples), even a small sample of about one percent will be adequate to test your model. In the case of the Iris dataset, which contains a total of 150 records, we’ll choose a 75–25 split.
Scikit-learn has a train_test_split function that splits the dataset into train and test datasets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, stratify=iris.target, random_state=42)
train_test_split takes the data and target datasets and returns two pairs of datasets for training (X_train and y_train) and testing (X_test and y_test). The test_size parameter sets the fraction (between 0 and 1) of the data that will be allocated to testing. The stratify parameter makes sure that the train and test sets preserve the class proportions of the original dataset, so each species is represented in both. The random_state parameter, which is present in many Scikit-learn functions, seeds the random number generator so that the split is reproducible.
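To see what the split actually produced, you can check the sizes of the resulting arrays and the number of samples per species in each one. The following sketch repeats the split from above so that it can run on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, stratify=iris.target, random_state=42
)

print(X_train.shape, X_test.shape)   # (112, 4) (38, 4)

# Samples per species in each set; stratify keeps them nearly equal.
print(np.bincount(y_train))
print(np.bincount(y_test))
```

Because 25 percent of 150 is 37.5, the test set is rounded up to 38 samples and the remaining 112 go to training. With stratification, the per-species counts within each set differ by at most one.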
Step 4: Build the Model
Now that our data is ready, we can create a machine learning model and train it on the train set. There are many different machine learning algorithms that can solve classification problems like the one we’re dealing with. In our case, we’ll use the “logistic regression” algorithm, which is very fast and suitable for classification problems that are simple and don’t contain too many dimensions.
Scikit-learn’s LogisticRegression class implements this algorithm. After instantiating it, we train it on our train set (X_train and y_train) by calling the fit function. This will tune the model’s parameters to find a mapping between the measurements and the flower species.
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
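If you are curious about what "tuning the model's parameters" leaves behind, the fitted model exposes them as attributes. The sketch below refits the model so it can run on its own. Note one deviation from the code above: it passes max_iter=1000, an assumption made here to give the solver more room to converge, which is not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, stratify=iris.target, random_state=42
)

# max_iter=1000 is an assumption for this sketch, not the article's setting.
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

print(lr.classes_)          # the class indices the model learned: [0 1 2]
print(lr.coef_.shape)       # (3, 4): one weight per class and measurement
print(lr.intercept_.shape)  # (3,): one bias term per class
```

Each row of coef_ holds the four weights the model applies to the measurements when scoring one species.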
Step 5: Evaluate the Model
Now that we’ve trained the model, we want to measure its accuracy. The LogisticRegression class has a score method that returns the accuracy of the model. First, we’ll measure the accuracy of the model on the training data:
print(lr.score(X_train, y_train))
This will print approximately 0.97, which means the model correctly predicts the class of 97 percent of the training examples. That is pretty good, given that we only had around 37 training examples per species.
Next, we’ll check the accuracy of the model on the test set:
print(lr.score(X_test, y_test))
This will give us around 95 percent, a bit lower than the training accuracy, which is natural because these are examples that the model has never seen before. By creating a larger dataset or trying another machine learning algorithm (such as support vector machines), we might be able to further improve the model’s accuracy and bridge the gap between training and test performance.
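One way to explore the suggestion of trying another algorithm is to compare models with cross-validation, which averages accuracy over several train/test splits instead of relying on a single one. The sketch below is one possible setup, not the only one: the five folds, the default SVC settings, and max_iter=1000 are all assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

iris = load_iris()

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "support vector machine": SVC(),
}

results = {}
for name, model in models.items():
    # Accuracy on each of 5 held-out folds.
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

On a dataset this small, a difference of one or two misclassified flowers moves the score noticeably, so small gaps between the two models should not be over-interpreted.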
Finally, we want to see how we can use our trained model on new examples. The LogisticRegression class has a predict function that takes an array of observations as input and returns the predicted class. In the case of our flower classifier model, we need to provide it with an array of four measurements (sepal length, sepal width, petal length, petal width) and it will return an integer that represents the class of the flower:
output = lr.predict([[4.4, 3.2, 1.3, 0.2]])
print(iris.target_names[output[0]])
'''setosa'''
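Besides the predicted class, it is often useful to know how confident the model is. LogisticRegression also provides a predict_proba method that returns one probability per class. The sketch below retrains the model so it can run on its own; max_iter=1000 is again an assumption made here, not a setting from the code above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, stratify=iris.target, random_state=42
)
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

sample = [[4.4, 3.2, 1.3, 0.2]]
probs = lr.predict_proba(sample)[0]  # one probability per species

for name, p in zip(iris.target_names, probs):
    print(f"{name}: {p:.3f}")
```

The three probabilities sum to 1, and the class that predict returns is the one with the highest probability.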
Congratulations! You’ve created your first machine learning model. We can now put it together into an application that takes measurements from users and returns the flower species:
sepal_l = float(input("Sepal length (cm):"))
sepal_w = float(input("Sepal width (cm):"))
petal_l = float(input("Petal length (cm):"))
petal_w = float(input("Petal width (cm):"))
measurements = [[sepal_l, sepal_w, petal_l, petal_w]]
output = lr.predict(measurements)
print(f"Your flower is {iris.target_names[output[0]]}")
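In a real application, user input should be checked before it reaches the model. The following is one possible way to wrap the prediction in a function that validates its input. The function name predict_species and its validation rules are hypothetical, and for brevity the sketch trains on the full dataset rather than on the train split used above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

def predict_species(model, target_names, measurements):
    """Validate four positive measurements (in cm) and return the species name."""
    values = np.asarray(measurements, dtype=float)
    if values.shape != (4,):
        raise ValueError("expected exactly four measurements")
    if not np.all(np.isfinite(values)) or np.any(values <= 0):
        raise ValueError("measurements must be positive, finite numbers")
    class_index = model.predict(values.reshape(1, -1))[0]
    return str(target_names[class_index])

iris = load_iris()
# Assumption for this sketch: train on all 150 samples, with max_iter=1000.
model = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

print(predict_species(model, iris.target_names, [4.4, 3.2, 1.3, 0.2]))
```

Keeping validation separate from the input prompts makes the same function reusable from a command-line script, a web form, or a test suite.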
Hopefully, this will be your first step toward becoming a machine learning guru. From here, you can continue to learn other machine learning algorithms, learn more about the fundamental concepts of machine learning, and move on to more advanced topics such as neural networks and deep learning. With a bit of study and practice, you’ll be able to create remarkable applications that can detect objects in images, process voice commands, and engage in conversations with users.