The purpose of this project is to determine whether there is a linear relationship between life expectancy, fertility rate, and access to education across countries from different regions of the world. The research is based on a multiple linear regression model, constructed as an explanatory model that allows forecasting the value of the dependent variable given the values of the independent variables (Coad, 2010). This research is valuable since it touches upon the problem of the declining birth rate in developed countries, and thus makes it possible to determine whether such an issue really exists. In addition, it determines whether life expectancy and school attendance are statistically significant factors. The constructed model is tested for heteroscedasticity and multicollinearity to determine whether it can be used to forecast fertility rates based on the available information about the independent variables.
For the purpose of this research, a data sample was formed. The data was retrieved from the World Bank official website and corresponds to the statistical information for 2015, which is the latest data available (The World Bank, 2017). For the variable corresponding to the access to education among children, the latest data points were collected since the information on the topic is rather limited. The data collected mostly dates back to 2014 and 2015.
In this project, the independent variables are life expectancy and the percentage of children out of school. The dependent variable is the fertility rate, which corresponds to the number of births per woman in the corresponding country. The sample is composed of 139 observations, each corresponding to a country or region. Countries that were missing either the dependent or independent values were excluded from the sample and thus are not included in the linear regression model. It should also be noted that some extreme values that could potentially have influenced the outcome of the model were not included in the sample. In the process of collecting data, it was observed that surveys conducted in the least developed countries dated back to the early 2010s and therefore could not be used, since they do not reflect up-to-date information. However, the fertility rates, life expectancy, and share of children out of school in those countries were rather extreme compared to the mean values across the sample. Had up-to-date information for the least developed countries been available, the results of the model might have been slightly different. However, the basic relationships and tendencies are expected to hold.
Prior to conducting the research, a hypothesis was made about the relationship between the variables. In general, higher life expectancy is associated with a higher level of development. Indeed, the highest life expectancy was observed in the European and North American regions, and in the developed Asian countries, where healthcare standards and ecological conditions are more favorable for the citizens. At the same time, the developed countries are experiencing population ageing. This phenomenon is related to an increase in the average age of the population and a decrease in the birth rate. Meanwhile, in the poorest countries, where life expectancy is not very high, a higher fertility rate is observed. This applies mostly to the African region, where the number of births per woman is the highest. In addition, it is expected that broader access to education will reduce extremely high fertility rates. Therefore, it is expected that an increase in life expectancy and wider access to education among children will, on average, correspond to lower fertility rates. This assumption will be tested in the course of the research using the linear regression model. In addition, it will be checked whether heteroscedasticity is present in the model, that is, whether the variance of the residuals differs across subgroups of the data. Multicollinearity will also be checked to determine whether the independent variables in the model are correlated with each other. All calculations and test results are presented in the Excel spreadsheet.
The calculations for this research were conducted using Excel. All results and calculations will be submitted along with the report. To simplify the notation, Life Expectancy is denoted x1, Children out of School is denoted x2, and Fertility Rate is denoted y.
Using Excel, the following linear regression model corresponding to the data set was obtained:
y = 10.98987924 - 0.117554198 x1 + 0.019989 x2
From the equation, the intercept of the regression line with the y axis is 10.98987924. This is the predicted fertility rate if both life expectancy and the percentage of children out of school were zero. However, this figure is not meaningful in itself since such a situation is not feasible. It therefore serves a purely mathematical purpose in the econometric analysis (Columbia University). It can also be observed from the least squares regression equation that, holding the other variable constant, an increase in life expectancy by 1 year corresponds to a decrease in the fertility rate by 0.117554198 units. Meanwhile, an increase in the percentage of children who do not attend school by 1 percentage point corresponds to an increase in the fertility rate by 0.019989 units (Hastie, 2009). The results obtained from the linear regression model support the assumptions stated at the beginning of the research. The value of r-squared in this model is 0.6514, which means that 65.14% of the variation in fertility rates is explained by the independent variables, that is, life expectancy and the percentage of children out of school (Columbia University).
In order to test the reliability of the model, a few calculations were performed using real observations of the independent variables. For example, in the US the recorded average life expectancy in 2015 was 78.74146341 years, and the percentage of children out of school was 5.54395. According to the obtained linear equation, the expected fertility rate is 1.844305843. The true value of the fertility rate recorded in the US in the respective year was 1.843, which is very close to the value obtained from the linear regression model.
In Norway, the average life expectancy is 82.1 years, and the percentage of children who do not attend school is 0.21542. The expected fertility rate from the linear model is 1.342986, while the true value is 1.75.
However, the linear regression model is less accurate for shorter life expectancies, where the distribution of fertility rates is more dispersed. For instance, the life expectancy in Ethiopia is 64.57804878 years, and the percentage of children who do not attend school is 13.51149. The expected fertility rate is 3.668535255. The true value from the sample is 4.275, which is noticeably higher than the prediction.
In Uganda, where the average life expectancy is 59.17907317, the expected fertility rate is 4.343185. Meanwhile, the true value of fertility rate is 5.682, which is significantly higher than the expected value.
Therefore, it can be concluded that the model is more suitable for predicting the fertility rate for countries with a higher life expectancy, whereas for the lower life expectancy groups, the predictive power of the model is weaker. Such a difference can be attributed to a degree of heteroscedasticity in the model, which means that the variance of the errors differs across subgroups of the data set. The respective tests will be conducted to check the model for heteroscedasticity.
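The fitted values quoted above for the United States, Norway, and Ethiopia can be reproduced from the coefficients reported in the text. The sketch below is only an illustration: the function and variable names are assumptions introduced here, Uganda is omitted because its share of children out of school is not stated, and the coefficient on x2 is used as rounded in the text, so the results agree with the quoted figures only to about four decimal places.

```python
# Coefficients as reported in the text (intercept, life expectancy, children out of school).
INTERCEPT = 10.98987924
B_LIFE_EXPECTANCY = -0.117554198
B_OUT_OF_SCHOOL = 0.019989  # rounded in the text


def predict_fertility(life_expectancy, out_of_school_pct):
    """Return the fitted fertility rate (births per woman) for one country."""
    return (INTERCEPT
            + B_LIFE_EXPECTANCY * life_expectancy
            + B_OUT_OF_SCHOOL * out_of_school_pct)


# (life expectancy, % children out of school, observed fertility rate) from the text.
countries = {
    "United States": (78.74146341, 5.54395, 1.843),
    "Norway": (82.1, 0.21542, 1.75),
    "Ethiopia": (64.57804878, 13.51149, 4.275),
}

predictions = {}
for name, (x1, x2, observed) in countries.items():
    fitted = predict_fertility(x1, x2)
    predictions[name] = fitted
    # Residual = observed minus fitted; a positive value means the model underestimates.
    print(f"{name}: fitted={fitted:.4f}, observed={observed}, "
          f"residual={observed - fitted:+.4f}")
```

Printing the residual next to each fitted value makes it easier to compare the size of the prediction error across countries.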
In order to determine whether the model is statistically significant, the F-test is conducted, which tests the joint significance of all the independent variables in the model. The null hypothesis for this test is that all the slope coefficients of the regression are zero. The alternative hypothesis is that at least one of them is nonzero.
H0: β1 = β2 = 0
H1: at least one of β1, β2 ≠ 0
The obtained F-statistic from Excel is 127.0720021, which is far higher than the 5% critical F-value of approximately 3.06 for 2 and 136 degrees of freedom (Critical F-Value Calculator). In addition, the p-value is very small, well below 0.05, which means that an F-statistic this large would be very unlikely if both beta coefficients were 0. Therefore, the null hypothesis is rejected, and the independent variables are jointly statistically significant at the 5% level.
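As a cross-check, the overall F-statistic can be recovered from the reported r-squared, the sample size, and the number of independent variables. The following is a minimal sketch rather than the calculation performed in Excel: it assumes 139 observations and 2 regressors (hence 2 and 136 degrees of freedom), uses the rounded r-squared of 0.6514, and relies on a closed-form expression for the F distribution that is valid only when the numerator degrees of freedom equal 2. All function names are introduced here for illustration.

```python
n = 139             # observations in the sample (from the text)
k = 2               # independent variables: life expectancy, children out of school
r_squared = 0.6514  # rounded value reported in the text

df_model = k
df_resid = n - k - 1  # residual degrees of freedom

# Overall F-statistic expressed through R-squared.
f_stat = (r_squared / df_model) / ((1.0 - r_squared) / df_resid)


def f_sf_df1_2(f, df2):
    """P(F > f) for an F distribution with 2 and df2 degrees of freedom.

    This closed form only holds when the numerator degrees of freedom equal 2.
    """
    return (1.0 + 2.0 * f / df2) ** (-df2 / 2.0)


def f_critical_df1_2(alpha, df2):
    """Critical value c with P(F > c) = alpha, again only for 2 numerator df."""
    return (df2 / 2.0) * (alpha ** (-2.0 / df2) - 1.0)


p_value = f_sf_df1_2(f_stat, df_resid)
f_crit = f_critical_df1_2(0.05, df_resid)

print(f"F = {f_stat:.2f}, 5% critical value = {f_crit:.2f}, p-value = {p_value:.3g}")
```

Because r-squared is rounded, the recomputed statistic matches the Excel figure only approximately. Under the degrees of freedom assumed here, the 5% critical value comes out slightly above 3, so the conclusion of the test does not hinge on the exact critical value.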
In order to test for heteroscedasticity, the following graph of the squared residuals was constructed.
It can be observed that as the predicted value increases, the squared residuals tend to increase as well, which suggests that heteroscedasticity might be present in the model. In order to determine whether heteroscedasticity is truly present, formal tests have to be conducted.
First, the Breusch-Pagan test is applied, in which the squared residuals are regressed on the independent variables in the model (Williams, 2015). For the purpose of this test, the null hypothesis is that the errors are homoscedastic. The alternative hypothesis is that the errors are heteroscedastic (Williams, 2015). The test yields a p-value lower than 0.05; therefore, the null hypothesis is rejected, and heteroscedasticity is assumed.
In order to confirm heteroscedasticity, White’s general test is used. In this test, the squared residuals are regressed on the predicted value and the predicted value squared (Williams, 2015). The null hypothesis under this test assumes homoscedasticity, while the alternative hypothesis implies heteroscedasticity. The obtained p-value is smaller than 0.05; therefore, the null hypothesis is rejected, and heteroscedasticity is assumed. Since both tests confirm heteroscedasticity, it has to be corrected. One way of correcting heteroscedasticity in the model is to transform the dependent variable. The possible ways to transform the dependent variable are the log transformation or the exponential transformation. Another remedy is the use of robust standard errors, which leave the coefficient estimates unchanged but keep the inference valid under heteroscedasticity.
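The two tests described above share one recipe: regress the squared residuals on a set of auxiliary regressors and compare n times the r-squared of that regression with a chi-square distribution. The sketch below illustrates this on synthetic, deterministic data, not on the World Bank sample, and is not the Excel procedure used in the report. The helper names are assumptions, the LM (n times r-squared) form of the Breusch-Pagan test is used, and the p-value formula is exact only for 2 degrees of freedom, which is the case for both auxiliary regressions here.

```python
import math


def ols(X, y):
    """OLS coefficients via the normal equations; X rows must include a leading 1."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def fitted_values(X, beta):
    return [sum(b * x for b, x in zip(beta, row)) for row in X]


def r_squared(y, fitted):
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def lm_test(aux_X, squared_resid):
    """LM statistic n * R^2 from regressing squared residuals on aux_X.

    The p-value uses the chi-square survival function exp(-x / 2), which is
    exact only for 2 degrees of freedom (two auxiliary regressors).
    """
    beta = ols(aux_X, squared_resid)
    lm = len(squared_resid) * r_squared(squared_resid, fitted_values(aux_X, beta))
    return lm, math.exp(-lm / 2.0)


# Synthetic, deterministic data (NOT the World Bank sample): the error size
# grows with x1, so the data are heteroscedastic by construction.
n = 60
x1 = [float(i) for i in range(1, n + 1)]
x2 = [float((7 * i) % 11) for i in range(1, n + 1)]
errors = [(-1) ** i * 0.05 * i for i in range(1, n + 1)]
y = [10.0 - 0.1 * a + 0.02 * b + e for a, b, e in zip(x1, x2, errors)]

X = [[1.0, a, b] for a, b in zip(x1, x2)]
beta = ols(X, y)
y_hat = fitted_values(X, beta)
sq_resid = [(yi - fi) ** 2 for yi, fi in zip(y, y_hat)]

# Breusch-Pagan: squared residuals on the original regressors.
bp_lm, bp_p = lm_test(X, sq_resid)
# White's test, special form: squared residuals on fitted values and their squares.
white_lm, white_p = lm_test([[1.0, f, f * f] for f in y_hat], sq_resid)

print(f"Breusch-Pagan: LM={bp_lm:.2f}, p={bp_p:.3g}")
print(f"White (special form): LM={white_lm:.2f}, p={white_p:.3g}")
```

A p-value below 0.05 in either test leads to rejecting homoscedasticity, which is the decision rule applied in the report.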
In order to test for multicollinearity, a correlation matrix was created to determine whether the independent variables are correlated (Williams, 2015). The correlation between life expectancy and the percentage of children out of school is positive, but not strong. Therefore, neither variable can be largely explained by the other, and multicollinearity is not assumed. The correlation matrix also indicates that the relationship between the fertility rate and life expectancy is quite strong and negative, whereas the relationship between the dependent variable and the percentage of children not attending school is positive but weak. However, this does not imply that X2 is not statistically significant, since its p-value is smaller than 0.05 (Williams, 2015).
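The correlation check described above can be sketched as follows. The data values are hypothetical and chosen only to illustrate a weak positive correlation; they are not taken from the World Bank sample. The variance inflation factor is not used in the report and is included here as an assumption, being a common companion measure; with exactly two regressors it reduces to 1 / (1 - r^2).

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def vif_two_regressors(r):
    """Variance inflation factor when the model has exactly two regressors."""
    return 1.0 / (1.0 - r ** 2)


# Hypothetical values for illustration only (NOT the World Bank sample).
x1 = [60.0, 65.0, 70.0, 75.0, 80.0, 85.0]  # life expectancy
x2 = [4.0, 12.0, 3.0, 9.0, 6.0, 10.0]      # % children out of school

r = pearson(x1, x2)
print(f"correlation = {r:.3f}, VIF = {vif_two_regressors(r):.3f}")
```

A variance inflation factor close to 1 indicates that the regressors carry largely separate information, which matches the reading of a weak correlation in the matrix.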
In summary, the research was aimed at determining the relationship between life expectancy, the percentage of children out of school, and the fertility rate. The findings confirmed the initial assumptions, namely that an increase in life expectancy corresponds to a lower fertility rate, while a higher percentage of children who do not attend school corresponds to a higher fertility rate. In particular, an increase in life expectancy by 1 year corresponds to a decrease in the fertility rate by 0.11755 units, while an increase in the share of children who do not attend school by 1 percentage point corresponds to an increase in the fertility rate by 0.019989 units. However, heteroscedasticity was detected in the data set, which means that the variability of the errors differs across the data set. Therefore, the data should be transformed, or robust standard errors should be applied, to improve the results of the research. In addition, the model could be improved if additional observations were added to the sample; however, due to the lack of data on the least developed countries, the sample size had to be reduced. The value of r-squared could also be increased if additional independent variables were added to the model. The variables of interest in this case might be the share of GDP spent on healthcare, the literacy rate among adults, GDP per capita, and the unemployment rate, among others. However, if new variables are added to the model, it is necessary to conduct additional multicollinearity tests, since some variables (such as GDP per capita) may affect both the dependent and the independent variables. Therefore, if any modifications or expansions are made, a new set of tests is required to make sure that the new model provides a better measure of the relationship than the one constructed in this research.
- Coad, M. (2010). Mathematics for the International Student: Mathematical Studies. S.l.: Haese and Harris Pub.
- Columbia University. “Statistical Sampling and Regression: Covariance and Correlation.” Available from: http://ci.columbia.edu/ci/premba_test/c0331/s7/s7_5.html [Accessed 7 May 2017].
- Critical F-Value Calculator. Available from: http://www.danielsoper.com/statcalc/calculator.aspx?id=10 [Accessed 7 May 2017].
- Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
- Kuningas, M. et al. (2011). “The relationship between fertility and lifespan in humans.” AGE, 33:615–622. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3220400/pdf/11357_2010_Article_9202.pdf [Accessed 7 May 2017].