
Statistics 3 Computer Practical 3

AI悦创 · 30 November 2023 · University of Bristol, UK

set.seed(12345)

1. Instructions

  1. Reports should be handed in as a single PDF file using Blackboard, by noon on the due date. RMarkdown, image, Word and zip files, for example, will not be marked.
  2. You can work alone or in a group with up to two other people.
  3. One person per group should hand in a report on Blackboard. The names and student numbers of all group members must be on the first page.
  4. Your answers should combine text and code snippets in R. It is recommended that you use RMarkdown to prepare your reports, since this is typically easier for students, but this is not mandatory.
  5. You must explain what you are doing clearly to obtain full marks on each question. You can use comments (which start with #) to annotate your code. Mathematical derivations can be written using LaTeX commands in RMarkdown or on paper, with a photo then appended to the end of the PDF being submitted.
  6. This practical counts for 10% of your assessment for Statistics 3.

2. Data description

This coursework focuses on analysing data about passengers on the Titanic. The Titanic was a ship that sank in the Atlantic Ocean during her maiden voyage from Southampton in England to New York in the United States. The data were extracted from the GitHub account of the book "Efficient Amazon Machine Learning", published by Packt: https://github.com/alexisperrier/packt-aml/blob/master/ch4/original_titanic.csv.

These data can be loaded in using the following code:

titanic_original = read.csv("original_titanic.csv")

This dataset includes information about 1309 passengers, including their ticket class (pclass), survival status (survived), name, sex, age, number of siblings/spouses aboard (sibsp), number of parents/children aboard (parch), ticket number, fare, cabin, port of embarkation (embarked), lifeboat (boat), body identification number (body) and home destination (home.dest).

Upon observing the data, it becomes clear that it is incomplete. We assume that the missing ages, fares and embarkation points are missing at random, so we can remove the rows where these are NA or empty. We also remove the other columns with missing data, since they are not important for our analysis (ticket, cabin, boat, body and home.dest).

titanic = titanic_original[which(!is.na(titanic_original$age) & 
                                   titanic_original$embarked != "" &
                                   !is.na(titanic_original$fare)), c(1:7,9,11)]

This should leave you with 1043 observations.

3. Comparing exponential populations

We are first interested in whether there is a difference in the ticket fares for men and women. We assume that $\mathbf{X}=(X_1,X_2,\ldots,X_n)$ and $\mathbf{Y}=(Y_1,Y_2,\ldots,Y_m)$ are independent, with $X_i\stackrel{iid}{\sim} \text{Exp}(\lambda_1)$ and $Y_i\stackrel{iid}{\sim}\text{Exp}(\lambda_2)$. Let the ticket fares for men $\mathbf{x}=(x_1,\ldots,x_n)$ and the ticket fares for women $\mathbf{y}=(y_1,\ldots,y_m)$ be realizations of $\mathbf{X}$ and $\mathbf{Y}$ respectively. We want to perform the following test for $\boldsymbol{\theta}=(\lambda_1,\lambda_2)$:

$$ H_{0}:\lambda_{1}=\lambda_{2}\qquad\text{vs.}\qquad H_{1}:\lambda_{1}\neq \lambda_{2}. \qquad (1) $$

The generalized likelihood ratio (GLR) in this case is

$$ \Lambda(\mathbf{x},\mathbf{y})=\frac{\sup\limits_{\boldsymbol{\theta} \in \Theta_{0}}L(\lambda_{1},\lambda_{2};\mathbf{x},\mathbf{y})}{\sup\limits_{\boldsymbol{\theta} \in\Theta}L(\lambda_{1},\lambda_{2};\mathbf{x},\mathbf{y})}, $$

where $\Theta=\{(\lambda_{1},\lambda_{2}):\lambda_{1}>0,\lambda_{2}>0\}$ and $\Theta_0=\{(\lambda_{1},\lambda_{2}):\lambda_{1}=\lambda_{2}>0\}$.

Question 1 [2 marks] Show that

$$ \Lambda(\mathbf{x},\mathbf{y})=\left(\frac{\hat{\lambda}_0}{\hat{\lambda}_1}\right)^{n}\left(\frac{\hat{\lambda}_0}{\hat{\lambda}_2}\right)^m, $$

where $\hat{\lambda}_{1}=\frac{n}{\sum_{i=1}^{n}x_{i}}$, $\hat{\lambda}_{2}=\frac{m}{\sum_{i=1}^{m}y_{i}}$ and $\hat{\lambda}_0=\frac{n+m}{\sum_{i=1}^{n}x_{i}+\sum_{i=1}^{m}y_{i}}$, and state the approximate distribution of $T=-2\log \Lambda(\mathbf{x},\mathbf{y})$.

You may assume that the Hessian of the log-likelihood under $\Theta$ is negative definite.


Solution

Your solution here.

  1. Define the likelihood functions
    • For the exponential distribution of the male fares, the likelihood is $L(\lambda_1; \mathbf{x}) = \lambda_1^n e^{-\lambda_1 \sum_{i=1}^{n} x_i}$.
    • For the exponential distribution of the female fares, the likelihood is $L(\lambda_2; \mathbf{y}) = \lambda_2^m e^{-\lambda_2 \sum_{j=1}^{m} y_j}$.
    • When $\lambda_1 = \lambda_2 = \lambda_0$, the likelihood becomes $L(\lambda_0; \mathbf{x}, \mathbf{y}) = \lambda_0^{n+m} e^{-\lambda_0 \left(\sum_{i=1}^{n} x_i + \sum_{j=1}^{m} y_j\right)}$.
Steps

The restriction $\lambda_1 = \lambda_2 = \lambda_0$ comes from the null hypothesis $H_0$. It is part of the hypothesis test: we want to test whether the exponential rate ($\lambda$) of the fare distribution is the same for men and women.

In statistical hypothesis testing, the null hypothesis ($H_0$) usually represents the situation of no effect or no difference.

Here, the null hypothesis states that the exponential rates of the fares of the two groups (men and women) are equal, i.e. $\lambda_1 = \lambda_2$. We use $\lambda_0$ to denote this common rate.

This assumption is used in the calculation of the likelihood ratio ($\Lambda$): the ratio of the likelihood maximised under the null hypothesis to the likelihood maximised under the alternative ($\lambda_1 \neq \lambda_2$). If the null hypothesis is true we expect this ratio to be close to 1, whereas if it is false we expect it to be far from 1.

So, writing $\lambda_1 = \lambda_2 = \lambda_0$ means that under the null hypothesis $H_0$ a single parameter $\lambda_0$ describes the fare rate for both men and women; this is the condition we want to test.

"Common rate": under the null hypothesis $H_0: \lambda_1 = \lambda_2$, the exponential rates of the male and female fares are assumed to be the same, so the fare distributions of both groups can be described by a single common rate $\lambda_0$.

In hypothesis testing, the null hypothesis is usually set up as the situation in which there is no difference between the two groups being compared. For exponential distributions, "no difference" means that the rate parameters of the two groups are equal. We therefore denote this common parameter by $\lambda_0$; its maximum likelihood estimate is computed under the null hypothesis from all of the data, without distinguishing between men and women.

  2. Log-likelihood functions
  3. Maximum likelihood estimates
Steps

For the male passengers' fares we have $X_i \sim \text{Exp}(\lambda_1)$, with likelihood $L(\lambda_1; \mathbf{x}) = \lambda_1^n e^{-\lambda_1 \sum_{i=1}^{n} x_i}$. Taking logs gives the log-likelihood:

$$ \log L(\lambda_1; \mathbf{x}) = n \log(\lambda_1) - \lambda_1 \sum_{i=1}^{n} x_i $$

Differentiating with respect to $\lambda_1$ gives:

$$ \frac{\partial}{\partial \lambda_1} \log L(\lambda_1; \mathbf{x}) = \frac{n}{\lambda_1} - \sum_{i=1}^{n} x_i $$

To find the maximum we set this derivative equal to zero:

$$ \frac{n}{\lambda_1} - \sum_{i=1}^{n} x_i = 0 $$

Solving this equation gives the maximum likelihood estimate of $\lambda_1$:

$$ \hat{\lambda}_1 = \frac{n}{\sum_{i=1}^{n} x_i} $$

Similarly, for the female passengers' fares $Y_j \sim \text{Exp}(\lambda_2)$ the likelihood is $L(\lambda_2; \mathbf{y}) = \lambda_2^m e^{-\lambda_2 \sum_{j=1}^{m} y_j}$. The log-likelihood and its derivative are:

$$ \log L(\lambda_2; \mathbf{y}) = m \log(\lambda_2) - \lambda_2 \sum_{j=1}^{m} y_j $$

$$ \frac{\partial}{\partial \lambda_2} \log L(\lambda_2; \mathbf{y}) = \frac{m}{\lambda_2} - \sum_{j=1}^{m} y_j $$

Setting the derivative equal to zero again gives the maximum likelihood estimate of $\lambda_2$:

$$ \hat{\lambda}_2 = \frac{m}{\sum_{j=1}^{m} y_j} $$

Finally, for the common parameter $\lambda_0$ used in the likelihood ratio, we pool the male and female fares and obtain:

$$ L(\lambda_0; \mathbf{x}, \mathbf{y}) = \lambda_0^{n+m} e^{-\lambda_0 \left(\sum_{i=1}^{n} x_i + \sum_{j=1}^{m} y_j\right)} $$

$$ \log L(\lambda_0; \mathbf{x}, \mathbf{y}) = (n+m) \log(\lambda_0) - \lambda_0 \left(\sum_{i=1}^{n} x_i + \sum_{j=1}^{m} y_j\right) $$

$$ \frac{\partial}{\partial \lambda_0} \log L(\lambda_0; \mathbf{x}, \mathbf{y}) = \frac{n+m}{\lambda_0} - \left(\sum_{i=1}^{n} x_i + \sum_{j=1}^{m} y_j\right) $$

Setting this derivative equal to zero gives the maximum likelihood estimate of $\lambda_0$:

$$ \hat{\lambda}_0 = \frac{n+m}{\sum_{i=1}^{n} x_i + \sum_{j=1}^{m} y_j} $$

  4. Likelihood ratio
Steps

We have the following maximum likelihood estimates (MLEs):

  • $\hat{\lambda}_1 = \frac{n}{\sum_{i=1}^{n}x_{i}}$, the MLE for the male fares,
  • $\hat{\lambda}_2 = \frac{m}{\sum_{j=1}^{m}y_{j}}$, the MLE for the female fares,
  • $\hat{\lambda}_0 = \frac{n+m}{\sum_{i=1}^{n}x_{i} + \sum_{j=1}^{m}y_{j}}$, the MLE under the null hypothesis ($\lambda_1 = \lambda_2$).

The likelihood ratio is defined as:

$$ \Lambda(\mathbf{x}, \mathbf{y}) = \frac{L(\hat{\lambda}_0; \mathbf{x}, \mathbf{y})}{L(\hat{\lambda}_1, \hat{\lambda}_2; \mathbf{x}, \mathbf{y})} $$

We therefore need to evaluate the likelihood $L$ twice: once at the restricted optimum $\hat{\lambda}_0$, and once at the separate optima $\hat{\lambda}_1$ and $\hat{\lambda}_2$.

The likelihood under $\Theta$ is:

$$ L(\hat{\lambda}_1, \hat{\lambda}_2; \mathbf{x}, \mathbf{y}) = \prod_{i=1}^{n} \hat{\lambda}_1 e^{-\hat{\lambda}_1 x_i} \prod_{j=1}^{m} \hat{\lambda}_2 e^{-\hat{\lambda}_2 y_j} $$

Substituting the MLEs into this expression gives:

$$ L(\hat{\lambda}_1, \hat{\lambda}_2; \mathbf{x}, \mathbf{y}) = \left(\frac{n}{\sum_{i=1}^{n}x_{i}}\right)^n e^{-n} \left(\frac{m}{\sum_{j=1}^{m}y_{j}}\right)^m e^{-m} $$

The likelihood under $\Theta_0$ is:

$$ L(\hat{\lambda}_0; \mathbf{x}, \mathbf{y}) = \left(\frac{n+m}{\sum_{i=1}^{n}x_{i} + \sum_{j=1}^{m}y_{j}}\right)^{n+m} e^{-(n+m)} $$

We can now compute the likelihood ratio $\Lambda$:

$$ \Lambda(\mathbf{x}, \mathbf{y}) = \frac{\left(\frac{n+m}{\sum_{i=1}^{n}x_{i} + \sum_{j=1}^{m}y_{j}}\right)^{n+m} e^{-(n+m)}}{\left(\frac{n}{\sum_{i=1}^{n}x_{i}}\right)^n e^{-n} \left(\frac{m}{\sum_{j=1}^{m}y_{j}}\right)^m e^{-m}} $$

The exponential factors $e^{-(n+m)}$ and $e^{-n}e^{-m}$ cancel, leaving:

$$ \Lambda(\mathbf{x}, \mathbf{y}) = \left(\frac{\sum_{i=1}^{n}x_{i}}{n}\right)^n \left(\frac{\sum_{j=1}^{m}y_{j}}{m}\right)^m \left(\frac{n+m}{\sum_{i=1}^{n}x_{i} + \sum_{j=1}^{m}y_{j}}\right)^{n+m} $$

Since $1/\hat{\lambda}_1$ and $1/\hat{\lambda}_2$ are the sample means of the male and female fares respectively, and $1/\hat{\lambda}_0$ is the sample mean of all the fares, we can express everything in terms of the MLEs:

$$ \Lambda(\mathbf{x}, \mathbf{y}) = \left(\frac{1/\hat{\lambda}_1}{1/\hat{\lambda}_0}\right)^n \left(\frac{1/\hat{\lambda}_2}{1/\hat{\lambda}_0}\right)^m $$

which simplifies to the final expression for the likelihood ratio:

$$ \Lambda(\mathbf{x}, \mathbf{y}) = \left(\frac{\hat{\lambda}_0}{\hat{\lambda}_1}\right)^n \left(\frac{\hat{\lambda}_0}{\hat{\lambda}_2}\right)^m $$

  5. Test statistic
    • The test statistic $T$ is $-2$ times the log of the likelihood ratio, i.e. $T = -2 \log \Lambda(\mathbf{x}, \mathbf{y})$.
    • By large-sample theory, if the null hypothesis holds then $T$ approximately follows a chi-squared distribution with 1 degree of freedom, since $H_0$ imposes one restriction (see the sketch below).
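
To make the result concrete, here is a small illustrative R sketch (the helper name glr_exp_test is our own, not part of the practical) that computes the MLEs, the statistic $T = -2\log\Lambda$ and the corresponding $\chi^2_1$ p-value for two exponential samples.

# Sketch: GLR test statistic for two exponential samples.
# glr_exp_test is an illustrative helper, not a function from the practical.
glr_exp_test <- function(x, y) {
  n <- length(x); m <- length(y)
  lambda1_hat <- n / sum(x)                    # MLE for the first sample
  lambda2_hat <- m / sum(y)                    # MLE for the second sample
  lambda0_hat <- (n + m) / (sum(x) + sum(y))   # pooled MLE under H0
  # log Lambda = n*log(lambda0/lambda1) + m*log(lambda0/lambda2)
  log_Lambda <- n * log(lambda0_hat / lambda1_hat) + m * log(lambda0_hat / lambda2_hat)
  T_stat <- -2 * log_Lambda
  p_value <- pchisq(T_stat, df = 1, lower.tail = FALSE)
  list(T = T_stat, p_value = p_value)
}

# Quick check on simulated data where H0 is true:
x_sim <- rexp(200, rate = 0.05)
y_sim <- rexp(150, rate = 0.05)
glr_exp_test(x_sim, y_sim)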

We can plot our data to gain intuition about the differences in our distributions. First we split our data into males and females.

fare_male <- titanic$fare[titanic$sex=='male']
fare_female <- titanic$fare[titanic$sex=='female']

Then we can plot histograms of the fares.

# Defining colours so can plot histograms over each other
c1 <- rgb(173, 216, 230, max = 255, alpha = 80, names = "lt.blue")
c2 <- rgb(255, 192, 203, max = 255, alpha = 80, names = "lt.pink")

# Plots histograms
male_hist = hist(fare_male, plot = FALSE)
female_hist = hist(fare_female, plot = FALSE)
plot(male_hist, col = c1, xlab = "Ticket fare", main = (""))
plot(female_hist, col = c2, add = TRUE)
legend("topright", legend = c("Males", "Females"), col = c(c1, c2), lty = 1, lwd = 10)

Question 2 [0 mark] From these figures, what result do you think the hypothesis test will return? (Note this question is worth no marks since you do the test in the next question; it is intended to get you to think about what the data show as well as completing the hypothesis test.)


Solution

Your solution here.

We notice a clear difference between the frequency distributions of male and female fares. The histograms show that in the lower fare range the male frequencies exceed the female ones, meaning that a larger proportion of male passengers bought cheaper tickets. In the higher fare range, however, the female frequencies appear to exceed the male ones, indicating that a larger proportion of female passengers paid higher fares. Given these observed differences, we can expect the hypothesis test to indicate a statistically significant difference between the fare distributions of men and women. More specifically, we expect to reject the null hypothesis of no difference ($H_0: \lambda_{1} = \lambda_{2}$) in favour of the alternative that the fare distributions differ ($H_1: \lambda_{1} \neq \lambda_{2}$). These observations agree with the likelihood ratio test carried out in Question 3, which finds a significant difference between the exponential rates ($\lambda$) of the male and female fare distributions.


Question 3 [1 mark] Perform test (1) to compare the ticket fares for males and females (fare_male and fare_female) and describe the outcome for a significance level of 0.05.


Solution

Your solution here.

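The original code for this solution is not included on the page. A minimal sketch, reusing the illustrative glr_exp_test helper defined after Question 1 (our own function, not supplied with the practical), could look like this.

# Minimal sketch: apply the GLR test to the male and female fares.
result <- glr_exp_test(fare_male, fare_female)
result$T        # observed value of T = -2 log Lambda
result$p_value  # p-value from the chi-squared(1) approximation

# Decision at the 5% level: reject H0 if T exceeds the critical value,
# or equivalently if the p-value is below 0.05.
qchisq(0.95, df = 1)
result$p_value < 0.05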

4. Testing multinomial distributions

We assume that the numbers of passengers in each ticket class come from $Y{\sim}\text{Multinomial}(\textbf{p})$, where $\textbf{p}=(p_1, p_2, p_3)$. Here $\textbf{p}$ is a vector of non-negative probabilities that sums to 1. Alice suggests that the probability of a passenger being in each class is uniformly distributed.

Formally, we would like to test

$$ H_0:\textbf{p}=\textbf{p}_0=(1/3, 1/3, 1/3)\qquad \text{vs.}\qquad H_{1}:\textbf{p} \neq \textbf{p}_0. \qquad (2) $$

We can tabulate the number of passengers in each category using the following code:

tab = table(titanic$pclass)
tab

##
##   1   2   3
## 282 261 500

Question 4. [2 marks] Use Pearson's chi-squared test statistic to perform hypothesis test (2) for a significance level of 0.05.


Solution

Your solution here.

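The R code used for this solution is not shown on the page. The following minimal sketch (object names are ours) computes the Pearson statistic by hand and should reproduce the figures quoted below.

# Minimal sketch: Pearson's chi-squared statistic for H0: p = (1/3, 1/3, 1/3).
observed <- as.numeric(tab)                 # counts per class: 282, 261, 500
p0       <- rep(1/3, 3)
expected <- sum(observed) * p0

X2 <- sum((observed - expected)^2 / expected)   # Pearson statistic
df <- length(observed) - 1                      # 3 categories - 1 = 2
p_value <- pchisq(X2, df = df, lower.tail = FALSE)
c(X2 = X2, df = df, p_value = p_value)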

The chi-squared statistic is $X^2 = 100.75$ with 2 degrees of freedom (df) and a p-value smaller than $2.2 \times 10^{-16}$. Since the p-value is far below the usual significance levels (such as 0.05 or 0.01), we can reject the null hypothesis $H_0: \textbf{p} = \textbf{p}_0 = (1/3, 1/3, 1/3)$ with great confidence.

This means there is strong statistical evidence that the passengers are not distributed uniformly across the ticket classes. In particular, the result contradicts the hypothesis that each class contains the same proportion of passengers.

Summary for Question 4: since the p-value is far below the 0.05 significance level, we reject the null hypothesis $H_0: \textbf{p} = \textbf{p}_0 = (1/3, 1/3, 1/3)$. There are significant differences in the distribution of passengers across the ticket classes, so we have strong evidence that the class distribution is not uniform.


Question 5 [1 mark] Bob suggests that a passenger is twice as likely to be in third class as in first class, and also twice as likely to be in third class as in second class. Evaluate this statement formally using hypothesis testing. You should define your hypotheses, calculate your Pearson test statistic and evaluate your p-value.


Solution

Your solution here.

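Again the page's own code is hidden. Under Bob's hypothesised probabilities $\textbf{p}_0 = (1/4, 1/4, 1/2)$, a minimal sketch of the calculation behind the numbers quoted below (object names are ours) could be:

# Minimal sketch: Pearson's test of H0: p = (1/4, 1/4, 1/2) vs H1: p != p0.
observed <- as.numeric(tab)              # 282, 261, 500
p0       <- c(1/4, 1/4, 1/2)             # Bob's claim: class 3 twice as likely as 1 or 2
expected <- sum(observed) * p0

X2 <- sum((observed - expected)^2 / expected)
df <- length(observed) - 1
p_value <- pchisq(X2, df = df, lower.tail = FALSE)
c(X2 = X2, df = df, p_value = p_value)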

After performing Pearson's chi-squared test we obtain a statistic of 2.6184, with 2 degrees of freedom and a p-value of 0.27. At the 0.05 significance level, since the p-value is greater than 0.05, we do not have sufficient evidence to reject the null hypothesis. This means that, according to our sample, the distribution of passengers across the classes is not significantly different from the proportions suggested by Bob (i.e. the probability of being in third class is twice that of being in first class and twice that of being in second class). In short, the data are consistent with Bob's statement.


5. Logistic regression

Next we are interested in working out which of our explanatory variables are significant in determining whether a person survived the sinking of the Titanic. Since survived is a binary outcome (0 or 1) we consider a logistic model. Here $Y_1,\ldots, Y_n$ are independent random variables with

$$ Y_i \overset{\text{ind}}{\sim} {\rm Bernoulli}(\sigma(\theta^T x_i)), \qquad i \in \{1,\ldots,n\}, $$

where $x_1,\ldots,x_n$ are $d$-dimensional real (non-random) vectors of explanatory variables, such as the age, sex and fare of the passengers, and $\sigma$ is the standard logistic function

$$ \sigma(z) = \frac{1}{1+\exp(-z)}. $$

We define the $n \times d$ matrix $X = (x_{ij})$.

Question 6. [1 mark] Show that the log-likelihood function is

$$ \ell(\theta ; {\bf y}) = \sum_{i=1}^n y_i \log (\sigma(\theta^Tx_i)) + (1-y_i) \log (1 - \sigma(\theta^Tx_i)). $$


Solution

Your solution here.

In logistic regression, the response variable $Y_i$ for each observation $i$ follows a Bernoulli distribution whose success probability is obtained by applying the logistic function $\sigma$ to the linear predictor $\theta^T x_i$. Here $\theta$ is the vector of model parameters and $x_i$ is the vector of explanatory variables for the $i$-th observation.

1. Logistic function and log-likelihood

The logistic function is defined as:

$$ \sigma(z) = \frac{1}{1 + e^{-z}}. $$

Therefore, for a given observation $x_i$, we have:

$$ p_i = P(Y_i = 1) = \sigma(\theta^T x_i) = \frac{1}{1 + e^{-\theta^T x_i}}. $$

For a Bernoulli random variable, the probability mass function (PMF) is:

$$ P(Y_i = y_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}. $$

Taking logs gives the log-likelihood contribution of observation $i$:

$$ \log P(Y_i = y_i) = y_i \log p_i + (1 - y_i) \log(1 - p_i). $$

2. Summing the log-likelihood contributions

Summing over all observations $i$ gives the full log-likelihood:

$$ \ell(\theta) = \sum_{i=1}^n \log P(Y_i = y_i) = \sum_{i=1}^n \left( y_i \log p_i + (1 - y_i) \log(1 - p_i) \right). $$

Replacing $p_i$ by $\sigma(\theta^T x_i)$ gives:

$$ \ell(\theta) = \sum_{i=1}^n \left( y_i \log \sigma(\theta^T x_i) + (1 - y_i) \log(1 - \sigma(\theta^T x_i)) \right). $$

3. Explicit form of the log-likelihood

Since $\log \sigma(\theta^T x_i) = \log \left( \frac{1}{1 + e^{-\theta^T x_i}} \right)$ and $\log(1 - \sigma(\theta^T x_i)) = \log \left( \frac{e^{-\theta^T x_i}}{1 + e^{-\theta^T x_i}} \right)$, substituting these into the expression above gives the explicit form:

$$ \ell(\theta) = \sum_{i=1}^n \left( y_i \log \left( \frac{1}{1 + e^{-\theta^T x_i}} \right) + (1 - y_i) \log \left( \frac{e^{-\theta^T x_i}}{1 + e^{-\theta^T x_i}} \right) \right). $$

4. Final form

Finally, the log-likelihood can be written as:

$$ \ell(\theta ; \mathbf{y}) = \sum_{i=1}^n y_i \log (\sigma(\theta^T x_i)) + (1-y_i) \log (1 - \sigma(\theta^T x_i)), $$

as required.
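
As a small illustration (our own sketch, with hypothetical names sigma and log_lik, not code supplied with the practical), the log-likelihood above can be implemented directly in R:

# Sketch: direct implementation of the logistic log-likelihood l(theta; y).
sigma <- function(z) 1 / (1 + exp(-z))

log_lik <- function(theta, X, y) {
  p <- sigma(X %*% theta)          # p_i = sigma(theta^T x_i) for each row of X
  sum(y * log(p) + (1 - y) * log(1 - p))
}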


Question 7. [1 mark] Show that each component of the score is

$$ \frac{\partial \ell (\theta ; {\bf y})}{\partial \theta_j} = \sum_{i=1}^n [y_i - \sigma(\theta^T x_i)] x_{ij},\qquad j \in \{1,\ldots,d \}. \qquad (A) $$

I suggest that you use the fact that $\frac{d\sigma(z)}{dz} = \sigma(z)[1-\sigma(z)]$.


Solution

Your solution here.

The chain rule is a basic result of calculus: if a variable $z$ depends on a variable $y$, which in turn depends on a third variable $x$, then the derivative of $z$ with respect to $x$ is the product of the derivative of $z$ with respect to $y$ and the derivative of $y$ with respect to $x$:

$$ \frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx} $$

In our model,

$$ Y_i \overset{\text{ind}}{\sim} \text{Bernoulli}(\sigma(\theta^T x_i)), $$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the standard logistic function. We first differentiate it. Writing $\sigma(z) = (1 + e^{-z})^{-1}$, we apply the chain rule: for the inner function $u = 1 + e^{-z}$ we have $\frac{du}{dz} = -e^{-z}$, and for the outer function $v = u^{-1}$ we have $\frac{dv}{du} = -u^{-2} = -(1 + e^{-z})^{-2}$. Since $\frac{dv}{dz} = \frac{dv}{du} \cdot \frac{du}{dz}$, we obtain:

$$ \frac{d\sigma(z)}{dz} = -(1 + e^{-z})^{-2} \cdot (-e^{-z}) = \frac{e^{-z}}{(1 + e^{-z})^{2}} = \frac{1}{1 + e^{-z}} \left(1 - \frac{1}{1 + e^{-z}}\right) = \sigma(z)(1 - \sigma(z)). $$

So the derivative of the logistic function is $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.

The log-likelihood of the logistic regression model is:

$$ \ell(\theta ; \mathbf{y}) = \sum_{i=1}^n y_i \log (\sigma(\theta^T x_i)) + (1-y_i) \log (1 - \sigma(\theta^T x_i)) $$

The contribution of sample $i$ to the log-likelihood is:

$$ \ell_i(\theta) = y_i \log (\sigma(\theta^T x_i)) + (1-y_i) \log (1 - \sigma(\theta^T x_i)) $$

Our goal is to compute the partial derivative of the log-likelihood with respect to $\theta_j$. We will need the derivative of the logistic function derived above:

$$ \frac{d\sigma(z)}{dz} = \sigma(z)(1-\sigma(z)) $$

We now differentiate $\ell_i(\theta)$ with respect to $\theta_j$:

$$ \frac{\partial \ell_i(\theta)}{\partial \theta_j} = \frac{\partial}{\partial \theta_j} \left( y_i \log (\sigma(\theta^T x_i)) + (1-y_i) \log (1 - \sigma(\theta^T x_i)) \right) $$

There are two terms to which we apply the chain rule:

  1. For the term $y_i \log (\sigma(\theta^T x_i))$, we first differentiate $\log (\sigma(\theta^T x_i))$ with respect to $\sigma(\theta^T x_i)$, then $\sigma(\theta^T x_i)$ with respect to $\theta^T x_i$, and finally $\theta^T x_i$ with respect to $\theta_j$, which gives $x_{ij}$.
  2. For the term $(1-y_i) \log (1 - \sigma(\theta^T x_i))$ the steps are analogous, except that the chain rule is applied to $1 - \sigma(\theta^T x_i)$.

For the first term, $y_i \log (\sigma(\theta^T x_i))$:

Multiplying the three derivatives from the chain rule together gives:

$$ \frac{d}{d\theta_j} y_i \log (\sigma(\theta^T x_i)) = y_i \cdot \frac{1}{\sigma(\theta^T x_i)} \cdot \sigma(\theta^T x_i)(1-\sigma(\theta^T x_i)) \cdot x_{ij} $$

For the second term, $(1-y_i) \log (1 - \sigma(\theta^T x_i))$:

Multiplying the three derivatives from the chain rule together gives:

$$ \frac{d}{d\theta_j} (1-y_i) \log (1 - \sigma(\theta^T x_i)) = (1-y_i) \cdot \frac{-1}{1-\sigma(\theta^T x_i)} \cdot \sigma(\theta^T x_i)(1-\sigma(\theta^T x_i)) \cdot x_{ij} $$

Adding the two terms together, we obtain:

$$ \frac{\partial \ell_i(\theta)}{\partial \theta_j} = y_i \cdot \frac{1}{\sigma(\theta^T x_i)} \cdot \sigma(\theta^T x_i)(1-\sigma(\theta^T x_i)) \cdot x_{ij} + (1-y_i) \cdot \frac{-1}{1-\sigma(\theta^T x_i)} \cdot \sigma(\theta^T x_i)(1-\sigma(\theta^T x_i)) \cdot x_{ij} $$

which simplifies (the first term is $y_i (1-\sigma(\theta^T x_i)) x_{ij}$ and the second is $-(1-y_i)\sigma(\theta^T x_i) x_{ij}$) to:

$$ \frac{\partial \ell_i(\theta)}{\partial \theta_j} = (y_i - \sigma(\theta^T x_i)) x_{ij} $$

Finally, the score for the whole data set is the sum of the per-sample contributions:

$$ \frac{\partial \ell (\theta ; \mathbf{y})}{\partial \theta_j} = \sum_{i=1}^n \frac{\partial \ell_i(\theta)}{\partial \theta_j} = \sum_{i=1}^n [y_i - \sigma(\theta^T x_i)] x_{ij} $$



From Equation A, the score can be written as

$$ \nabla \ell(\theta; {\bf y}) = X^T [{\bf y} - {\bf p}(\theta)], $$

where ${\bf p}(\theta)$ is the vector $(p_1(\theta),\ldots,p_n(\theta))$ and $p_i(\theta) = \sigma(\theta^T x_i)$.
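
Continuing the illustrative sketch above (again our own helper, not code provided with the practical), the score in this matrix form can be computed as:

# Sketch: score (gradient of the log-likelihood) in matrix form X^T (y - p(theta)).
score <- function(theta, X, y) {
  p <- sigma(X %*% theta)
  as.vector(t(X) %*% (y - p))
}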

6. Hypothesis testing in logistic regression

If an explanatory variable has no effect on the probability of the response variable then we expect the corresponding coefficient to be equal to $0$. This can be examined more formally using hypothesis testing.

Assume that we consider the logistic model described in Section 5 and we want to test if age is a significant variable. It is useful to add a column of 1s to X, so that there is an "intercept" term in the model. Mathematically, the value of $\theta_0$ determines the probability when the explanatory variables are all 0. Therefore, our vector of parameters becomes $\boldsymbol{\theta}=(\theta_0,\theta_{\textrm{sex}},\theta_{\textrm{age}},\theta_{\textrm{fare}})$ and our hypothesis test is

$$ H_0: \theta_{\textrm{age}} = 0 \qquad \text{vs.} \qquad H_{1}:\theta_{\textrm{age}} \neq 0. \qquad (3) $$

We can consider the generalised likelihood ratio statistic for this test,

$$ \Lambda_n=\frac{\sup_{\boldsymbol{\theta}\in\Theta_0} L(\boldsymbol{\theta};\boldsymbol{y})}{\sup_{\boldsymbol{\theta}\in\Theta} L(\boldsymbol{\theta};\boldsymbol{y})}=\frac{L(\hat{\boldsymbol{\theta}}_0;\boldsymbol{y})}{L(\hat{\boldsymbol{\theta}}_{MLE};\boldsymbol{y})} $$

where $\hat{\boldsymbol{\theta}}_0$ is the maximum likelihood estimator under the null hypothesis, $\hat{\boldsymbol{\theta}}_{MLE}$ is the maximum likelihood estimator for the full model (which we derived in Section 5), $\Theta=\{(\theta_0,\theta_{\textrm{sex}},\theta_{\textrm{age}},\theta_{\textrm{fare}}):\theta_0 \in \mathbb{R}, \theta_{\textrm{sex}} \in \mathbb{R}, \theta_{\textrm{age}} \in \mathbb{R}, \theta_{\textrm{fare}} \in \mathbb{R}\}$ and $\Theta_{0}=\{(\theta_0,\theta_{\textrm{sex}},\theta_{\textrm{age}},\theta_{\textrm{fare}}):\theta_0 \in \mathbb{R}, \theta_{\textrm{sex}} \in \mathbb{R}, \theta_{\textrm{age}} \in \{0\}, \theta_{\textrm{fare}} \in \mathbb{R}\}$.

Then,

$$ -2\log \Lambda_n=-2\{\ell(\hat{\boldsymbol{\theta}}_0;\boldsymbol{y})-\ell(\hat{\boldsymbol{\theta}}_{MLE};\boldsymbol{y})\} $$

has a $\chi^2_r$ distribution under the null hypothesis (notice that $r$ is the number of restrictions under the null hypothesis).

Here we extract the variables we are interested in for the analysis, encode sex as a numeric value (0 for men and 1 for women) and add our intercept.

survived = titanic$survived
sex = ifelse(titanic$sex == "male", 0, 1)
age = titanic$age
fare = titanic$fare
intercept = rep(1, length(survived))
data = data.frame(intercept, sex, age, fare, survived)

Question 8 [2 marks] Use the generalised likelihood ratio test to decide whether age is statistically significant for surviving the Titanic when sex, age and fare are considered, for a significance level of 0.05.

To do this, start by forming the matrices that correspond to the restricted model under the null hypothesis (X_rest) and to the full model (X_full). In this case X_rest is formed by removing the variable age because this is the one we want to test.

X_full <- as.matrix(data[c('intercept', 'sex', 'age', 'fare')])
X_rest <- as.matrix(data[c('intercept', 'sex', 'fare')])
Y <- data[, 'survived']

You may wish to use the following functions for your calculation.
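
The functions referred to here are not included on this page. As a hedged sketch, helpers along the following lines, building on the illustrative sigma, log_lik and score functions sketched in Section 5, could be used to maximise the log-likelihood numerically (fit_logistic is our own name).

# Sketch: numerically maximise the logistic log-likelihood with optim().
# fit_logistic is an illustrative helper, not code from the practical.
fit_logistic <- function(X, y) {
  d <- ncol(X)
  fit <- optim(par = rep(0, d),
               fn = function(theta) -log_lik(theta, X, y),  # optim minimises
               gr = function(theta) -score(theta, X, y),
               method = "BFGS")
  list(theta_hat = fit$par, max_loglik = -fit$value)
}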


Solution

Your solution here.

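The page's own solution code is hidden. A minimal sketch of the generalised likelihood ratio test, reusing the hypothetical fit_logistic helper above, might look like this.

# Sketch: GLRT for H0: theta_age = 0 in the logistic model.
fit_full <- fit_logistic(X_full, Y)   # full model: intercept, sex, age, fare
fit_rest <- fit_logistic(X_rest, Y)   # restricted model: age removed

T_stat  <- -2 * (fit_rest$max_loglik - fit_full$max_loglik)
p_value <- pchisq(T_stat, df = 1, lower.tail = FALSE)  # r = 1 restriction

# Reject H0 at the 5% level if T_stat > qchisq(0.95, 1), i.e. if p_value < 0.05.
c(T = T_stat, p_value = p_value)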

7. Epilogue

There are some functions in R that can be used to calculate the p-values for questions 4, 5 and 8. These can only be used to check your answers. This section is worth no marks and does not need to be completed.

7.1 Pearson's Chi-squared Test for Count Data

We can obtain the p-value for a Pearson's chi-squared test using the function chisq.test, where counts is a vector of counts for each category and p is a vector of the proposed probabilities. The vectors counts and p must have the same length. Use ?chisq.test for more information.

res = chisq.test(counts, p = p)  # the p-value is available as res$p.value
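
For example, applied to the Titanic class counts (a check of Questions 4 and 5, assuming the table tab from Section 4 is still in the workspace):

counts <- as.numeric(tab)
chisq.test(counts, p = c(1/3, 1/3, 1/3))   # check for Question 4 (Alice's hypothesis)
chisq.test(counts, p = c(1/4, 1/4, 1/2))   # check for Question 5 (Bob's hypothesis)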

7.2 Likelihood Ratio Test of Nested Models

We can obtain the estimated values of the parameters in logistic regression using the glm function. One can then use the lrtest function (the lmtest package needs to be loaded) to perform hypothesis testing for nested models. For example, if we have two explanatory variables x1 and x2 and we want to test if x1 is significant for the response y, we can test this in R as follows:

library(lmtest)
model_full = glm(y ~ x1 + x2, data = data, family=binomial(link='logit'))
model_restricted = glm(y ~ x2, data = data, family=binomial(link='logit'))  # x1 removed, since x1 is the variable being tested

lrtest(model_full, model_restricted)