
Machine Learning: Logistic Regression

Reading time: ~20 min

In this section we discuss logistic regression, which is a discriminative model for binary classification.

Consider a binary classification problem where the two classes are equally probable, the class-0 conditional density is a standard multivariate normal distribution in two dimensions, and the class-1 conditional density is a multivariate normal distribution with mean [1,1] and covariance I (the identity matrix). Find the class boundary for the Bayes classifier.

Solution. The Bayes classifier is (x,y) \mapsto \operatorname{argmax}_i p_if_i(x,y), where p_0 = p_1 = \frac{1}{2} are the class probabilities and

\begin{align*}f_0(x,y) &= \frac{1}{2\pi} \operatorname{e}^{-\frac{1}{2}(x^2+y^2)}\text{, and}\\ f_1(x,y) &= \frac{1}{2\pi} \operatorname{e}^{-\frac{1}{2}((x-1)^2+(y-1)^2)}\text{.}\end{align*}

By symmetry, the classifier will predict class 1 for every point above the line x + y = 1 and class 0 for every point below the line. We can obtain the same result by solving the equation f_0(x,y) = f_1(x,y). We get

\begin{align*}-\frac{1}{2}(x^2+y^2) = -\frac{1}{2}((x-1)^2+(y-1)^2), \end{align*}

which simplifies to x + y = 1, as desired.
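The boundary can also be checked numerically. The following is a minimal sketch that uses only plain Julia; the names f0, f1 and bayes are chosen here for illustration and are not part of the original example.

```julia
# Class-conditional densities from the example (identity covariance).
f0(x, y) = exp(-(x^2 + y^2) / 2) / (2π)
f1(x, y) = exp(-((x - 1)^2 + (y - 1)^2) / 2) / (2π)

# Bayes classifier with equal priors: pick the class with the larger density.
bayes(x, y) = f1(x, y) > f0(x, y) ? 1 : 0

# On the line x + y = 1 the two densities agree (up to rounding).
on_boundary = all(isapprox(f0(t, 1 - t), f1(t, 1 - t)) for t in -3:0.5:3)

println(on_boundary)       # do the densities match along x + y = 1?
println(bayes(2.0, 0.5))   # a point with x + y > 1
println(bayes(-1.0, 0.5))  # a point with x + y < 1
```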

Find the regression function r(\mathbf{x}) = \mathbb{E}[Y | \mathbf{X} = \mathbf{x}] = \mathbb{P}(Y = 1 | \mathbf{X} = \mathbf{x}) for the example above. Plot a heatmap of this function.

Solution. Let's use the multivariate normal type MvNormal from the Distributions package.

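using Plots, Distributions, Optim
mycgrad = cgrad([:MidnightBlue,:SeaGreen,:Gold,:Tomato])
gr(aspect_ratio=1,fillcolor=mycgrad) # Plots.jl defaults
# class-conditional densities for class 0 (A) and class 1 (B)
A = MvNormal([0.0,0.0],[1.0 0; 0 1])
B = MvNormal([1.0,1.0],[1.0 0; 0 1])
xgrid = -5:1/2^5:5
ygrid = -5:1/2^5:5
# posterior probability of class 1, with prior probability 0.5 for each class
r(x,y) = 0.5pdf(B,[x,y])/(0.5pdf(A,[x,y])+0.5pdf(B,[x,y]))
heatmap(xgrid,ygrid,r)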

We can see from the heatmap that restricting r(\mathbf{x}) to any line of slope 1 yields a function which asymptotes to 0 in the southwest direction and to 1 in the northeast direction, increasing smoothly in between. Such a function is called a sigmoid function.

Given the regression function r(\mathbf{x}) = \mathbb{P}(Y = 1 | \mathbf{X} = \mathbf{x}), we can recover the Bayes classifier by predicting class 1 whenever r(\mathbf{x}) > \frac{1}{2} and class 0 whenever r(\mathbf{x}) < \frac{1}{2}. However, the value of the regression function also conveys the degree of confidence associated with the prediction. If r(\mathbf{x}_1) = 0.65 and r(\mathbf{x}_2) = 0.95, then observations at \mathbf{x}_1 and \mathbf{x}_2 are both predicted as class 1, but the latter with much more confidence.
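As a small illustration of this thresholding rule, here is a sketch in plain Julia. The helper names classify and confidence are hypothetical, and the rescaled distance from 1/2 is only one crude way to express confidence; it is not defined in the text.

```julia
# Turn a posterior probability r = P(Y = 1 | X = x) into a class label.
classify(r) = r > 0.5 ? 1 : 0

# Illustrative confidence score: distance of r from 1/2, rescaled to [0, 1].
confidence(r) = 2 * abs(r - 0.5)

for r in (0.65, 0.95, 0.20)
    println("r = ", r, ": class ", classify(r), ", confidence ", confidence(r))
end
```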

The graph in the example above suggests modeling r parametrically as a composition of a linear map and a sigmoid function. Specifically, we posit the model r(\mathbf{x}) = \sigma(\alpha + \boldsymbol{\beta} \cdot \mathbf{x}), where \boldsymbol{\beta} \in \mathbb{R}^p, \alpha \in \mathbb{R}, and \sigma(x) = 1 / (1+\operatorname{e}^{-x}).
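The model can be written down directly in code. This is a sketch under the assumption of two-dimensional features; the function name model and the parameter values are made up for illustration and are not fitted to any data.

```julia
using LinearAlgebra  # for dot

# Logistic sigmoid.
σ(u) = 1 / (1 + exp(-u))

# Model r(x) = σ(α + β ⋅ x).
model(α, β, x) = σ(α + dot(β, x))

# Made-up parameter values, for illustration only.
α, β = 0.5, [2.0, -1.0]

println(σ(0.0))                     # 0.5: the sigmoid is centered at zero
println(σ(3.0) + σ(-3.0))           # symmetry: σ(u) + σ(-u) = 1 up to rounding
println(model(α, β, [0.25, 1.0]))   # a point with α + β ⋅ x = 0
println(model(α, β, [3.0, -3.0]))   # far on the positive side, close to 1
```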

To select the parameters \boldsymbol{\beta} and \alpha, we penalize lack of confident correctness for each training sample. We give a sample (\mathbf{x}_i, y_i) of class 1 the penalty \log\left(\frac{1}{r(\mathbf{x}_i)}\right) (which is large if r(\mathbf{x}_i) is close to zero and small if r(\mathbf{x}_i) is close to 1). Likewise, we penalize a sample of class 0 by \log\left(\frac{1}{1-r(\mathbf{x}_i)}\right).
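To get a feel for the size of these penalties, the sketch below evaluates them at a few probabilities. The function name penalty and the sample probabilities are chosen here for illustration.

```julia
# Penalty for one training sample with label y ∈ {0, 1} and predicted
# probability r of class 1.
penalty(y, r) = y == 1 ? log(1 / r) : log(1 / (1 - r))

println(penalty(1, 0.95))  # class 1, confident and correct: small penalty
println(penalty(1, 0.50))  # class 1, undecided: log(2)
println(penalty(1, 0.05))  # class 1, confident and wrong: large penalty
println(penalty(0, 0.05))  # class 0 with r near 0: small penalty again
```

The penalty grows without bound as the predicted probability of the true class approaches zero, so a single confidently wrong prediction can dominate the total.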

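For example, with one-dimensional features we can search numerically for the values of \alpha and \beta that minimize the total penalty. In the code below, Z holds the class-0 samples and O holds the class-1 samples.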

using Optim

Z = [-1.2, -0.8, -0.7, 0.4, -2.4, 1.13] # class-0 samples
O = [2.2, 1.3, 0.8, 2.5, 2.62]          # class-1 samples

# model probability of class 1 at x
f(α, β, x) = 1/(1+exp(-α-β*x))
# total penalty for θ = [α, β]
function loss(Z, O, θ)
    α, β = θ
    sum(log(1/(1-f(α, β, x))) for x in Z) +
        sum(log(1/f(α, β, x)) for x in O)
end

optimize(θ->loss(Z,O,θ), [0.0, 1.0])
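The same minimization can be sketched without any packages by using fixed-step gradient descent in place of the Optim call. The step size η, the iteration count, and the helper names gradient and descend are ad hoc choices made for this illustration, not part of the original example.

```julia
# Same data as above: Z holds class-0 samples, O holds class-1 samples.
Z = [-1.2, -0.8, -0.7, 0.4, -2.4, 1.13]
O = [2.2, 1.3, 0.8, 2.5, 2.62]

σ(u) = 1 / (1 + exp(-u))

# Total penalty for parameters θ = [α, β].
function loss(θ)
    α, β = θ
    sum(log(1 / (1 - σ(α + β * x))) for x in Z) +
        sum(log(1 / σ(α + β * x)) for x in O)
end

# Gradient of the loss: each sample contributes (σ(α + βx) - y) * [1, x].
function gradient(θ)
    α, β = θ
    g = zeros(2)
    for (xs, y) in ((Z, 0), (O, 1)), x in xs
        g .+= (σ(α + β * x) - y) .* [1.0, x]
    end
    g
end

# Fixed-step gradient descent; η and the iteration count are ad hoc choices.
function descend(θ₀; η = 0.05, iterations = 20_000)
    θ = copy(θ₀)
    for _ in 1:iterations
        θ .-= η .* gradient(θ)
    end
    θ
end

θ̂ = descend([0.0, 1.0])
println("θ̂ = ", θ̂, ", loss = ", loss(θ̂))
```

Because the loss is convex in (α, β), a sufficiently small fixed step is enough here; a general-purpose optimizer is still the more practical choice.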

Sample 1000 points by choosing one of the two multivariate Gaussian distributions uniformly at random and then sampling from the selected distribution. Find the function of the form \sigma(\boldsymbol{\beta} \cdot \mathbf{x} + \alpha) which minimizes

\begin{align*}L(r) = \sum_{i=1}^{n} \left[y_i \log \frac{1}{r(x_i)} + (1-y_i)\log\frac{1}{1-r(x_i)}\right].\end{align*}
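One way to read this loss, under the assumption that each label y_i is an independent Bernoulli draw with success probability r(x_i), is as a negative log-likelihood:

\begin{align*}-\log \prod_{i=1}^{n} r(x_i)^{y_i}\,(1-r(x_i))^{1-y_i} &= -\sum_{i=1}^{n} \left[y_i \log r(x_i) + (1-y_i)\log(1-r(x_i))\right] \\ &= \sum_{i=1}^{n} \left[y_i \log \frac{1}{r(x_i)} + (1-y_i)\log\frac{1}{1-r(x_i)}\right] = L(r).\end{align*}

Under that assumption, minimizing L(r) over functions of the given form amounts to maximum likelihood estimation of \boldsymbol{\beta} and \alpha.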

The decision boundary between two Gaussian class conditional densities with equal covariance is a straight line.

Solution. We begin by sampling the points as suggested.

observations = [rand(Bool) ? (rand(A),0) : (rand(B),1) for i in 1:1000]
cs =  [c for ((x,y),c) in observations]
scatter([(x,y) for ((x,y),c) in observations], group=cs, markersize=2)

Next, we define the loss function and minimize it:

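σ(u) = 1/(1 + exp(-u))
# β = [α, β₁, β₂]: the intercept is the first component
# (this definition replaces the two-argument r(x,y) defined earlier)
r(β,x) = σ(β'*[1;x])
C(β,xᵢ,yᵢ) = yᵢ*log(1/r(β,xᵢ))+(1-yᵢ)*log(1/(1-r(β,xᵢ)))
L(β) = sum(C(β,xᵢ,yᵢ) for (xᵢ,yᵢ) in observations)
β̂ = Optim.minimizer(optimize(L,ones(3),BFGS()))
heatmap(xgrid,ygrid,(x,y)->r(β̂,[x,y]))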

We can see that the resulting heatmap looks quite similar to the actual regression function.
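For readers without Optim or Distributions installed, the fit can be sketched with the standard library alone. The version below swaps in Newton's method for BFGS and draws the samples with randn; the seed, the iteration count, and the name fit_logistic are assumptions made for this illustration.

```julia
using LinearAlgebra, Random

Random.seed!(1)  # arbitrary seed, only for reproducibility

# Sample n points: class 0 ~ N([0,0], I), class 1 ~ N([1,1], I), equal priors.
n = 1000
y = rand(Bool, n)
X = [ones(n) (randn(n, 2) .+ y)]  # first column of ones carries the intercept

σ(u) = 1 / (1 + exp(-u))

# Newton's method for the logistic loss, started from zero.
function fit_logistic(X, y; iterations = 25)
    β = zeros(size(X, 2))
    for _ in 1:iterations
        p = σ.(X * β)                    # predicted probabilities
        g = X' * (p .- y)                # gradient of the loss
        H = X' * (X .* (p .* (1 .- p)))  # Hessian of the loss
        β -= H \ g
    end
    β
end

β̂ = fit_logistic(X, y)
println(β̂)  # fitted [intercept, coefficient of x, coefficient of y]
```

The fitted boundary β̂[1] + β̂[2]x + β̂[3]y = 0 can be compared with the Bayes boundary x + y = 1 found earlier; with 1000 samples the two should be close but not identical.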

In the example above, is it true that r(\mathbf{x}) = \sigma(\boldsymbol{\beta} \cdot \mathbf{x} + \alpha) for some \boldsymbol{\beta} and \alpha?

Solution. We calculate

\begin{align*} \frac{\frac{1}{2}f_1(x,y)}{\frac{1}{2}f_0(x,y) + \frac{1}{2}f_1(x,y)} = \frac{\operatorname{e}^{- \frac{\left(x - 1\right)^{2}}{2} - \frac{\left(y - 1\right)^{2}}{2}}}{\operatorname{e}^{- \frac{x^{2}}{2} - \frac{y^{2}}{2}} + \operatorname{e}^{- \frac{\left(x - 1\right)^{2}}{2} - \frac{\left(y - 1\right)^{2}}{2}}} = \frac{1}{1+\operatorname{e}^{1 - x - y}}, \end{align*}

which is equal to \sigma(\boldsymbol{\beta} \cdot \mathbf{x} + \alpha) if \alpha = -1 and \boldsymbol{\beta} = [1,1]. So the assumption was correct in this example.
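As a sanity check on the algebra, the posterior probability of class 1 computed directly from the two densities can be compared numerically with the sigmoid \sigma(x + y - 1). This is a sketch in plain Julia; the names posterior1, grid and maxerr are chosen here for illustration.

```julia
# Class-conditional densities of the example.
f0(x, y) = exp(-(x^2 + y^2) / 2) / (2π)
f1(x, y) = exp(-((x - 1)^2 + (y - 1)^2) / 2) / (2π)

# Posterior probability of class 1 with equal priors.
posterior1(x, y) = 0.5 * f1(x, y) / (0.5 * f0(x, y) + 0.5 * f1(x, y))

σ(u) = 1 / (1 + exp(-u))

# Largest discrepancy between the posterior and σ(x + y - 1) on a grid.
grid = -3:0.25:3
maxerr = maximum(abs(posterior1(x, y) - σ(x + y - 1)) for x in grid, y in grid)
println(maxerr)  # expected to be at rounding-error level
```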

Consider a binary classification problem for which the regression function r satisfies r(\mathbf{x}) = \sigma(\boldsymbol{\beta}\cdot\mathbf{x} + \alpha) for some \boldsymbol{\beta} \in \mathbb{R}^p and \alpha \in \mathbb{R}. Show that the decision boundary is linear.

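Solution. We solve r(\mathbf{x}) = \frac{1}{2} to find the decision boundary. Since \sigma(u) = \frac{1}{2} exactly when \operatorname{e}^{-u} = 1, that is, when u = 0, this equation is equivalent to \boldsymbol{\beta} \cdot \mathbf{x} + \alpha = 0. This is a linear equation in \mathbf{x}, so its solution set is a hyperplane in \mathbb{R}^p (a line when p = 2), provided \boldsymbol{\beta} \neq \mathbf{0}.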

This exercise shows that directly applying logistic regression always yields linear decision boundaries. However, we can use logistic regression to find nonlinear decision boundaries by appending components to the feature vectors which are derived from the original features. For example, if we apply the map [x_1, x_2] \mapsto [x_1,x_2,x_1^2,x_2^2,x_1x_2] to each feature vector, then the linear boundary we discover in \mathbb{R}^5 will correspond to a conic section (a curve defined by a quadratic equation) in the original feature space \mathbb{R}^2.
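The feature map can be sketched in a few lines. The parameters below are hypothetical and chosen by hand (not fitted) so that the linear boundary in \mathbb{R}^5 pulls back to the unit circle in \mathbb{R}^2; the names φ and classify are likewise illustrative.

```julia
using LinearAlgebra  # for dot

σ(u) = 1 / (1 + exp(-u))

# Quadratic feature map from the text: [x₁, x₂] ↦ [x₁, x₂, x₁², x₂², x₁x₂].
φ(x) = [x[1], x[2], x[1]^2, x[2]^2, x[1] * x[2]]

# Hand-picked parameters: the boundary α + β ⋅ φ(x) = 0 is the unit circle
# x₁² + x₂² = 1.
α = -1.0
β = [0.0, 0.0, 1.0, 1.0, 0.0]

r(x) = σ(α + dot(β, φ(x)))
classify(x) = r(x) > 0.5 ? 1 : 0

println(classify([0.2, 0.1]))    # inside the circle
println(classify([1.5, 0.0]))    # outside the circle
println(classify([-1.0, -1.0]))  # outside the circle
```

In practice the five coefficients and the intercept would be fitted by minimizing the same loss as before, applied to the transformed feature vectors.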
