Machine Learning and Regression Models with PyTorch: Part I

Introduction

PyTorch is a machine learning framework for Python. Along with TensorFlow, it is probably one of the most popular machine learning frameworks. I recently got interested in machine learning when I discovered that there was more to it than just image classification: people are using it to do physics. So, I started learning about machine learning in general, and then tried to understand how PyTorch works by following along with some tutorials.

One of the difficulties I ran into was vocabulary. For example, there are two types of machine learning models: classification and regression. Most examples I found online were classification models, usually designed to classify images. I was interested in regression models, models that take a set of inputs and compute a set of numerical outputs. Machine learning models can be used to perform linear regression, for example: given a set of x-y data points, the machine can "learn" the slope and y-intercept of the data. There is a lot of overlap between machine learning and function fitting, function minimization, statistical modeling, etc. To me, it seems that the vocabulary used in most machine learning resources is similar to that of function optimization and statistical modeling, but there are differences. For example, in machine learning, we define a "loss" function. In other fields, this may be referred to as the "cost" function. So there are slight differences here and there. In addition, I'm a physicist, and we often use a different vocabulary.

Fundamentally, machine learning is function optimization. The idea is that there is some function that describes a complex process: it takes some inputs (maybe hundreds) and produces some outputs. The goal of machine learning is to find an approximation for this function based on observed data. The different types of models, i.e. linear models, non-linear models, neural networks, etc., are just different ways that the function can be approximated. At the end of the day, a machine learning model contains a function with parameters that may be adjusted, and these parameters are adjusted by training the model on a data set that contains examples of outputs known to correspond to given inputs.

This post walks through some PyTorch examples and explains them from the perspective of a physicist.

Linear Regression

Note: the full source code for this example is given at the end of the section.

In this example, we'll set up a simple model to do linear regression. First, we need some data. Let's just create a very simple dataset:

data = [ [0,1], [1,3], [2,5], [3,7] ]
print(data)
# outputs:
# [[0, 1], [1, 3], [2, 5], [3, 7]]

Here we create a data set of four points that lie on the line $y = 2x + 1$.

Creating a Model

To use PyTorch, we need to import the torch module.

import torch

Now we can build a model. In PyTorch, we do this by creating a class that inherits from the torch.nn.Module class. This class is provided by PyTorch, and provides all of the machinery necessary for learning. All we have to do is:

  1. initialize the model
  2. implement the forward method, which will calculate the model outputs.

class LinearFit(torch.nn.Module):
  def __init__(self):
    super().__init__()
    self.f = torch.nn.Linear(in_features=1,out_features=1)

  def forward(self,x):
    return self.f(x)

Our "model" here represents the equation for a line, $y = mx + b$. The parameters $m$ and $b$ will initially be set to random numbers, and then we will "train" the model to "learn" the correct values for $m$ and $b$, given our data set. But first, let's look at the model definition. In Python, the __init__ method is called when we create an instance of a class, so this is where the model is setup. We need to call super().__init__() so that PyTorch can do the setup it needs to for us. After that, we just have one line

    self.f = torch.nn.Linear(in_features=1,out_features=1)

Here is where we create the linear function that we will fit to the data. PyTorch provides a model for linear functions, and we have to use it so that PyTorch can optimize it. We can create linear functions with any number of inputs and outputs. For $y = mx + b$, we have one input, $x$, and one output, $y$ (if we were fitting $z = mx + ny + b$, we would have two inputs, $x$ and $y$, and one output, $z$). It is common in machine learning to refer to inputs and outputs of a model as "features", which is why PyTorch uses in_features and out_features to specify the number of inputs and outputs.
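
Just to make the two-input case concrete, here is a small stand-alone sketch (not part of the main example; g is simply a name I made up here). A torch.nn.Linear layer stores its parameters in its weight and bias attributes:

# a stand-alone two-input linear layer, computing z = m*x + n*y + b (illustration only)
g = torch.nn.Linear(in_features=2,out_features=1)
print(g.weight)                      # a 1x2 tensor holding m and n
print(g.bias)                        # a 1-element tensor holding b
print(g(torch.FloatTensor([1,2])))   # evaluate the layer at x=1, y=2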

After model setup, we define the forward method. This function will be called by PyTorch internally, so it has to be called forward, but all this function needs to do is calculate the model output for the inputs passed in as an argument. For this simple case, all we have to do is pass the inputs to the torch.nn.Linear instance and return its results.

We are now ready to fit the model to our data. We create an instance of the model

fit = LinearFit()

First, let's take a look at the current model parameters

print("initial parameters")
for p in fit.parameters():
  print(p)
# outputs:
# initial parameters
# Parameter containing:
# tensor([[0.8126]], requires_grad=True)
# Parameter containing:
# tensor([-0.2360], requires_grad=True)

This model contains two parameters, the slope and y-intercept (the y-intercept is often called the "bias"), which have been randomly initialized. The first parameter is the slope, the second is the y-intercept. We give the model an input, and it will calculate an output. This is usually called a "prediction": the model takes an input and predicts an output, but really the model just takes an input and computes the output with its current parameter values. For $x = 1$ then, this model should return $y = 0.8126 \cdot 1 + (-0.2360) = 0.5766$. Let's check.

To call, or evaluate, the model, we must pass the inputs as a tensor, which is a data type provided by PyTorch. The PyTorch tensor is very similar to NumPy's array. In fact, PyTorch claims that tensors can often be used in place of arrays. So, to pass $x = 1$ to the model, we have to wrap it in a tensor

x = torch.FloatTensor([1])
y = fit(x)
print(x,y)
# outputs:
# tensor([1.]) tensor([0.5766], grad_fn=<AddBackward0>)

Here we are creating a tensor with a single element. We use torch.FloatTensor to ensure that the tensor elements are stored as floats, rather than integers. Note the square brackets around 1; these are required, even for a single-element tensor. To evaluate the model, we just call it like a function and pass it the input. We can see that the output is returned as a tensor as well, and the value is what we expect. Note that each time we run this code, the initial model parameters will be different, so we will get different values.
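
If you want to convince yourself that the model really is just computing $mx + b$, you can read the slope and y-intercept out of the underlying linear layer and do the arithmetic by hand. A small check along these lines (using the f attribute we defined in __init__, and the weight and bias attributes that torch.nn.Linear provides) might look like:

# manually compute m*x + b from the layer's parameters for x = 1
m = fit.f.weight.item()   # the slope stored by torch.nn.Linear
b = fit.f.bias.item()     # the y-intercept ("bias")
print(m*1.0 + b)          # should match the value returned by fit(torch.FloatTensor([1]))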

Training the Model

We are now ready to "train" our model. Training just means "determine the best value for each model parameter". We do this by giving the model a set of outputs with known inputs, meaning we need to have a set of model input values with the output value that should be produced for each. This is the dataset we created on the first line. This dataset is called the "training data" in machine learning vocabulary. Here we have just generated the training data, but in most applications, this would be data that has been observed.

As I said above, fundamentally, machine learning is function optimization. We have a function (our model), and we want to optimize its parameters (the slope and y-intercept) to fit our data. We do this by defining a loss function (sometimes called a "cost" function in other fields) and minimizing this function using a minimization algorithm. For function fitting, the loss we usually want to minimize is the sum of the squared deviations. Given a set of x-y pairs, $\{(x_i,y_i)\}$, the loss for any given values of $m$ and $b$ is $$ l(m,b) = \sum_i (y_i - (mx_i + b))^2 $$ In the case of simple linear regression, we could minimize this function analytically, but in general we will have to use a numerical method. Several numerical methods exist, and the best method depends on the problem. One common method is called stochastic gradient descent. This method essentially works by repeatedly calculating the gradient of the loss function with respect to the model parameters and walking downhill (it's more complicated than that, but that's the idea).

So, next we create the loss function and optimizer

calc_loss = torch.nn.MSELoss()
optimizer = torch.optim.SGD(fit.parameters(),lr=0.1)

Here we are using the mean squared error loss function provided by PyTorch (which is proportional to the sum of squared deviations) and the SGD minimization algorithm.
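
If you want a quick sanity check that MSELoss really does compute the mean of the squared deviations, you can compare it against the same calculation written out by hand (the numbers below are made up and have nothing to do with our fit):

# compare MSELoss to an explicit mean of squared deviations (made-up values)
a = torch.FloatTensor([1.0,2.0,3.0])
b = torch.FloatTensor([1.5,2.0,2.0])
print(calc_loss(a,b))        # mean squared error computed by PyTorch
print(((a-b)**2).mean())     # the same value computed by hand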

Before we start training, we have to get the training dataset into the correct format. PyTorch expects the training data inputs to be in a 2D tensor. A 2D tensor is similar to a matrix: it has rows and columns. PyTorch expects each row of the input tensor to correspond to one set of inputs. In our case, we only have one input, so the input tensor will only have one column. The training data outputs must also be in a tensor with the same format. Each row of the output tensor should contain the outputs for the corresponding row of the input tensor. Again, here we only have one output, so our tensor will have one column. The code below creates input and output tensors named x and y.

x = []
y = []
for i in range(len(data)):
  x.append([data[i][0]])
  y.append([data[i][1]])
x = torch.FloatTensor(x)
y = torch.FloatTensor(y)
print(x)
print(y)
# outputs:
# tensor([[0.],
#         [1.],
#         [2.],
#         [3.]])
# tensor([[1.],
#         [3.],
#         [5.],
#         [7.]])

Note the square brackets in the .append call. These are required. We basically create x and y as a Python list-of-lists first, and then create tensors from them.
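
As an aside, an equivalent and slightly more compact way to build these tensors (nothing later depends on which version you use) is to convert the whole dataset to a tensor and slice out the columns:

# the same x and y tensors, built by slicing columns out of the full dataset
t = torch.tensor(data,dtype=torch.float32)  # a 4x2 tensor
x = t[:,:1]   # first column, as a 4x1 tensor of inputs
y = t[:,1:]   # second column, as a 4x1 tensor of outputs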

The training process consists of 5 steps. Step 1, we compute outputs for each of the training data inputs using the model's current parameters.

pred = fit(x)
print("initial loss",loss)
# outputs
# initial prediction: tensor([[-0.2360],
#         [ 0.5766],
#         [ 1.3893],
#         [ 2.2019]], grad_fn=<AddmmBackward>)

Note that our model returns four values here. When we pass a tensor to the model, each row is treated as a separate set of inputs, and PyTorch automatically evaluates the model on each row for us. Notice that the second output is 0.5766, which corresponds to the input $x = 1$ case we computed earlier. The model parameters have not been changed yet, so the model still returns the same output for a given input.

Step 2, we compute the loss function for the outputs computed by the model

loss = calc_loss(pred,y)
print("initial loss",loss)
# outputs:
# initial loss tensor(10.8649, grad_fn=<MseLossBackward>)

The loss function takes two tensors: one containing the outputs calculated by the model, and one containing the outputs from our training data, i.e. the "correct" outputs.

Step 3, we compute the gradient of the loss with respect to the model parameters. This is where the PyTorch magic happens. The computed loss is a PyTorch tensor, and tensors keep track of how they were computed. The loss was computed from the model outputs. The model outputs were computed using the model parameters. So PyTorch can compute the derivative of the loss function with respect to each of the model parameters automatically. This is called auto-differentiation, and it is one of the things that has made machine learning practical. To compute the derivatives of a tensor, we call .backward()

loss.backward()

The way that the derivatives are stored is a little strange. The loss tensor does not directly store the derivatives; rather, the derivatives are stored in the tensors that were used to calculate the loss. For example, our model contains a tensor that stores the slope $m$. The derivative of the loss with respect to $m$ is stored in this tensor. Likewise, the derivative of the loss with respect to $b$ is stored in its tensor. So, the point is that the derivatives of the loss with respect to our model parameters will be stored in the parameters themselves.
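
We can look at these stored derivatives directly. After calling loss.backward(), each parameter's .grad attribute holds the derivative of the loss with respect to that parameter:

# inspect the derivatives stored in the model parameters by loss.backward()
for p in fit.parameters():
  print(p.grad)   # d(loss)/dm for the slope, d(loss)/db for the y-intercept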

Step 4, we update the model parameters based on the gradient.

optimizer.step()

Recall that when we created the optimizer, we passed in fit.parameters(). So the optimizer knows what our model parameters are, and each of our model parameters stores the derivative of the loss with respect to itself. The optimizer can therefore use the gradient of the loss function to update our model parameters. Let's look at the new parameters:

print("new parameters")
for p in fit.parameters():
  print(p)
# outputs:
# new parameters
# Parameter containing:
# tensor([[2.0146]], requires_grad=True)
# Parameter containing:
# tensor([0.3674], requires_grad=True)
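
Under the hood, the update that optimizer.step() performs for plain SGD is simple: each parameter is moved a small step opposite to its stored gradient, scaled by the learning rate, roughly $p \rightarrow p - \mathrm{lr} \cdot \partial l/\partial p$. A hand-written sketch of that update looks something like the following (this is just to show what the optimizer is doing, not something you would normally run yourself):

# roughly what optimizer.step() does for plain SGD (sketch only)
with torch.no_grad():        # don't track this bookkeeping in the computation graph
  for p in fit.parameters():
    p -= 0.1*p.grad          # lr=0.1, p.grad was filled in by loss.backward()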

So our model slope and y-intercept have been updated. Now we just need to repeat the process. However, before we do, we need to clear the derivative data in our parameters. When PyTorch computes the derivatives of a tensor, it stores them in the tensors used to compute it. If the derivatives are computed again (by calling .backward() again), they are added to the values already stored. So, we need to zero out these values before repeating the steps.

Step 5, clear the derivative data.

optimizer.zero_grad()
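
If you want to see the accumulation behaviour for yourself, here is a small optional aside (it recomputes the loss twice and then cleans up after itself, so it does not affect the training below):

# optional aside: gradients accumulate across backward() calls
calc_loss(fit(x),y).backward()
g1 = fit.f.weight.grad.clone()     # gradient after one backward pass
calc_loss(fit(x),y).backward()     # backward again, without zeroing first
print(fit.f.weight.grad/g1)        # roughly 2: the gradients were added together
optimizer.zero_grad()              # clean up before the real training loop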

We are now ready to repeat steps 1 - 5. Let's do it 1000 times.

for i in range(1000):
  pred = fit(x)
  loss = calc_loss(pred,y)
  loss.backward()
  optimizer.step()
  optimizer.zero_grad()

Now we have trained our model. Let's see what it has learned.

print("final parameters")
for p in fit.parameters():
  print(p)
# outputs:
# final parameters
# Parameter containing:
# tensor([[2.0000]], requires_grad=True)
# Parameter containing:
# tensor([1.0000], requires_grad=True)

Our model has correctly learned the slope and y-intercept of our data.
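
We can also ask the trained model for a prediction at a point that was not in the training data. For $x = 4$ it should return something very close to $2 \cdot 4 + 1 = 9$:

# evaluate the trained model at a new input
print(fit(torch.FloatTensor([[4.0]])))   # should be very close to 9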

Here is the entire example:

data = [ [0,1], [1,3], [2,5], [3,7] ]
print(data)

import torch

class LinearFit(torch.nn.Module):
  def __init__(self):
    super().__init__()
    self.f = torch.nn.Linear(in_features=1,out_features=1)

  def forward(self,x):
    return self.f(x)


fit = LinearFit()

print("initial parameters")
for p in fit.parameters():
  print(p)

x = torch.FloatTensor([1])
y = fit(x)
print(x,y)



calc_loss = torch.nn.MSELoss()
optimizer = torch.optim.SGD(fit.parameters(),lr=0.1)


x = []
y = []
for i in range(len(data)):
  x.append([data[i][0]])
  y.append([data[i][1]])
x = torch.FloatTensor(x)
y = torch.FloatTensor(y)
print(x)
print(y)


pred = fit(x)
print("initial prediction:",pred)

loss = calc_loss(pred,y)
print("initial loss",loss)

loss.backward()
optimizer.step()
optimizer.zero_grad()

print("new parameters")
for p in fit.parameters():
  print(p)

for i in range(1000):
  pred = fit(x)
  loss = calc_loss(pred,y)
  loss.backward()
  optimizer.step()
  optimizer.zero_grad()

print("final parameters")
for p in fit.parameters():
  print(p)

Polynomial Regression

Let's say we wanted to fit a quadratic to our data instead of a line. $$ y = ax^2 + bx + c $$ This would then include a straight line as a special case ($a = 0$). How would we modify our model?

It turns out that we can use a linear model for this, since $y$ is linearly proportional to $x$ and $x^2$. We just need to do a little preprocessing on our data. First, we create a linear model that takes two input features instead of one. Then, we process our training input dataset to pass $x$ and $x^2$ as inputs. The rest is the same as before, but you will probably find that the learning rate (the lr parameter to torch.optim.SGD that we used for the linear regression case) is too large. When I ran the example with the original learning rate, the model parameters ended up as nan. We can reduce the learning rate; in this example we'll change it from 0.1 to 0.01. This means that our model will learn more slowly, so it will likely require more iterations to learn the correct coefficients, and we'll increase the number of training iterations too. The complete example is

# quadratic regression
data = [ [0,1], [1,3], [2,5], [3,7] ]
print(data)

import torch

class QuadraticFit(torch.nn.Module):
  def __init__(self):
    super().__init__()
    self.f = torch.nn.Linear(in_features=2,out_features=1)

  def forward(self,inputs):
    return self.f(inputs)


fit = QuadraticFit()

print("initial parameters")
for p in fit.parameters():
  print(p)

x = torch.FloatTensor([1,1])
y = fit(x)
print(x,y)



calc_loss = torch.nn.MSELoss()
optimizer = torch.optim.SGD(fit.parameters(),lr=0.01)


x = []
y = []
for i in range(len(data)):
  x.append([data[i][0]**2,data[i][0]])
  y.append([data[i][1]])
x = torch.FloatTensor(x)
y = torch.FloatTensor(y)
print(x)
print(y)


pred = fit(x)
print("initial prediction:",pred)

loss = calc_loss(pred,y)
print("initial loss",loss)

loss.backward()
optimizer.step()
optimizer.zero_grad()

print("new parameters")
for p in fit.parameters():
  print(p)

for i in range(10000):
  pred = fit(x)
  loss = calc_loss(pred,y)
  loss.backward()
  optimizer.step()
  optimizer.zero_grad()

print("final parameters")
for p in fit.parameters():
  print(p)

When we run this example, we will notice two things. The last print statement will produce something similar to

final parameters
Parameter containing:
tensor([[1.1671e-05, 2.0000e+00]], requires_grad=True)
Parameter containing:
tensor([1.0000], requires_grad=True)

First, we see that the $a$ and $b$ coefficients in our quadratic are wrapped into one parameter, a two-element tensor. Second, the value for $a$ ends up being very small, but not exactly zero. This is common when we work with floating point numbers on a computer. In this case, $a$ is so much smaller than $b$ (5 orders of magnitude) that it will not matter, and is essentially zero.

Let's check that it works for an actual quadratic. Comment out the first line and use points on the curve $y = 3x^2 + 2x + 1$ as training data instead.

# data = [ [0,1], [1,3], [2,5], [3,7] ]
data = [ [0,1], [1,6], [2,17], [3,34] ]

Running this example will print

final parameters
Parameter containing:
tensor([[3.0000, 1.9999]], requires_grad=True)
Parameter containing:
tensor([1.0000], requires_grad=True)

at the bottom. Again, the $b$ coefficient is not exactly 2, but it is very close.

This method could also be used to fit polynomials of arbitrary order. We just need to create a linear model that accepts enough inputs (for an $n$-degree polynomial, we need $n$ inputs) and pre-process the data before passing it to the model, as sketched below.
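
As a rough sketch of what that might look like for a general degree $n$ (the names PolyFit and make_features here are my own, not from PyTorch), the preprocessing step could build rows of the form $[x^n, x^{n-1}, \dots, x]$ and feed them to a linear layer with $n$ input features:

# sketch: an n-degree polynomial fit using a linear layer plus preprocessing
class PolyFit(torch.nn.Module):
  def __init__(self,degree):
    super().__init__()
    self.f = torch.nn.Linear(in_features=degree,out_features=1)

  def forward(self,inputs):
    return self.f(inputs)

def make_features(xs,degree):
  # one row per data point: [x**degree, ..., x**2, x]
  return torch.FloatTensor([[x**p for p in range(degree,0,-1)] for x in xs])

n = 3
fit = PolyFit(n)
x = make_features([d[0] for d in data],n)
y = torch.FloatTensor([[d[1]] for d in data])

The training loop itself would be unchanged from the examples above (although the learning rate and number of iterations may need further adjustment as the degree grows).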
