Regression Analysis

Regression analysis is a statistical technique with the following objectives/benefits:

        1.)  Regression analysis indicates whether the relationship between the variables is statistically significant.

        2.)  It indicates the relative strength of each independent variable (x) in explaining the dependent variable (y). In simple terms, it helps us determine which variable is more important for predicting the dependent variable.

        3.)  It lets us make predictions.

Suppose we have a dataset where we have to predict house prices based on a few variables, such as the size of the house and its location. Regression analysis will tell us whether these variables (location, size) are actually significant for predicting the price of the house. It will also tell us which factor/variable is more important in predicting the price, that is, the strength of each variable. It will also help us predict the future price of houses on the basis of these two variables, as the sketch below illustrates.
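To make this concrete, here is a minimal sketch in Python using statsmodels, with entirely made-up size, location and price figures. The fitted summary reports p-values (significance of each variable) and coefficients (relative strength), and the model can then predict the price of a new house.

import numpy as np
import statsmodels.api as sm

# Hypothetical data: size in square feet, location as a 0/1 indicator
size = np.array([1200, 1500, 1700, 2000, 2300, 2600, 3000, 3400])
location = np.array([0, 1, 0, 1, 0, 1, 1, 0])   # 1 = city centre
price = np.array([240, 330, 310, 420, 400, 510, 580, 560])  # in $1000s

X = sm.add_constant(np.column_stack([size, location]))
model = sm.OLS(price, X).fit()

print(model.summary())                 # p-values: which predictors are significant
print(model.predict([[1, 1800, 1]]))   # predicted price for a new 1800 sq ft house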

Researchers select the regression model they want to use for the estimation. Regression models contain the following components:

        1.)  Y = dependent variable (outcome)

        2.)  X = independent variable (predictor)

        3.)  β = unknown parameters (coefficients)

        4.)  ε = error term/residuals

Mathematically, simple linear regression is given as follows:

Y = β₀ + β₁X + ε

where β₀ is the intercept and β₁ is the slope of the line.
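For example, with hypothetical estimates β₀ = 50 and β₁ = 0.2 (price in $1000s, size in square feet), a 1,500-square-foot house would be predicted at Y = 50 + 0.2 × 1,500 = 350, i.e. $350,000.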
Regression analysis is about finding the best-fit line, which describes how y changes as x changes. We use this line to predict the value of y for a given value of x. The line is only a useful predictor if there is a reasonably strong positive or negative correlation between the variables, and it is not necessary for it to pass through the origin. The best-fit line shows the trend and is drawn with the minimum error. It is also called the least-squares regression line, because it minimizes the sum of the squared vertical distances between the data points and the line, as the sketch below shows.
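As a rough sketch with made-up numbers, the slope and intercept of the least-squares line can be computed directly from the classic closed-form formulas:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# slope = cov(x, y) / var(x); intercept = y_bar - slope * x_bar
x_bar, y_bar = x.mean(), y.mean()
slope = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
intercept = y_bar - slope * x_bar

print(f"best-fit line: y = {intercept:.2f} + {slope:.2f}x")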

SST, SSR and SSE

SST stands for the total sum of squares: the sum of the squared differences between each observed value of the dependent variable and its mean ȳ. SSR stands for the sum of squares due to regression: the sum of the squared differences between each predicted value and the mean of the dependent variable. It is also known as the explained sum of squares, as it measures the explained variability. SSE stands for the sum of squared errors: the sum of the squared differences between the observed and predicted values, and it measures the unexplained variability. These quantities determine how well the model fits: a lower SSE results in a better regression, and a higher error results in a less powerful regression. SST is the total variation, and it is the sum of the variation explained by the regression (SSR) and the variation left unexplained (SSE):

SST = SSR + SSE

The sketch below computes all three quantities and verifies this decomposition.
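A minimal sketch, using the same made-up x and y as above:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit the least-squares line (degree-1 polynomial) and get fitted values
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

sst = np.sum((y - y.mean()) ** 2)      # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variation
sse = np.sum((y - y_hat) ** 2)         # unexplained variation

print(sst, ssr + sse)  # the two numbers agree: SST = SSR + SSE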

R-squared measures the strength of the relationship between the variables. If the value of R-squared is high, the data have a stronger linear relationship; if it is low, the data are scattered and there is little pattern. R-squared gives the percentage of the variation that is explained, and it is calculated by dividing SSR by SST. Its value ranges between 0 and 1. A higher R-squared goes with a higher SSR and a lower SSE. When the residuals are smaller, the data points lie closer to the regression line; if SSE increases, SSR decreases, which reduces R-squared and means the data are scattered everywhere.
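One more sketch with the same made-up data: for simple linear regression, R-squared also equals the squared Pearson correlation between x and y, so it can be checked without recomputing the sums of squares.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# For a single predictor, R-squared = (correlation of x and y) squared,
# which matches SSR / SST from the previous sketch.
r = np.corrcoef(x, y)[0, 1]
print(f"R-squared: {r ** 2:.3f}")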

 

We will discuss the assumptions of linear regression in the next blog. Stay tuned!


