Data is often used to make predictions; in fact, prediction is a primary reason for analyzing data in the first place. Sales data for a specific product, for example, is used to predict how much of that product will be purchased in the coming months or years. This helps the company that makes the product decide how much of it to produce and at what rate.

## Linear Regression

One of the most common prediction methods, largely because it is easy to use, is linear regression. Linear regression takes a set of data, such as how much of a product was sold over a period of time, and uses it to find the equation of the line that “best fits” the data. Using the equation, which relates time (say, in months) to sales of the product, a company can put in a certain number of months and predict how much of the product will have sold by then.

Linear regression is not appropriate for all situations, and it may only be useful for short-term predictions. The linear regression equation relates the dependent variable (usually called the “y” variable; e.g., the amount of a product sold) to the independent variable (usually called the “x” variable; e.g., the number of months the product has been on the market). The equation contains just two numbers, which the computer calculates from the data using statistical formulas: the y-intercept and the slope.
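To make those “statistical formulas” concrete, here is a minimal sketch that assumes the usual least-squares formulas (the article does not say which method the computer uses). The function name `fit_line` and the small sales dataset are hypothetical, made up purely for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how much x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: months on the market (x) and total units sold so far (y).
months = [1, 2, 3, 4, 5]
total_units = [120, 190, 310, 390, 480]

slope, intercept = fit_line(months, total_units)
print(f"slope = {slope:.1f}, y-intercept = {intercept:.1f}")
# prints: slope = 92.0, y-intercept = 22.0
```

In practice a spreadsheet or statistics package would do this calculation; the sketch only shows that the two numbers come directly from the data.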

## Y-intercept

The y-intercept can be thought of as the starting point: what the sales of the product would theoretically have been at the very beginning, when the product first went on sale, if the regression equation perfectly described the data all the way back to the start. This number is almost never accurate, because at the moment the product first went on sale (the beginning of month 1) there had been no sales yet. That does not matter for the regression equation, though; the y-intercept simply positions the line so that it fits the data as well as possible.

## Slope

The other number is the slope of the line; it tells how much the dependent variable changes for each unit of change in the independent variable. As an example, let’s say that regression analysis of sales data from the first 36 months of a manufacturer’s release of a product gives a slope of 4,000. Because the independent variable is time, measured in months, this means that on average 4,000 units of the product were sold each month. The slope matters to the manufacturer because it tells them how much of the product is expected to sell in each of the coming months, so they can plan to produce that amount. The caution is that the rate of sales may change (hopefully for the better for the manufacturer), so the analysis needs to be repeated as new data comes in.
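As a rough illustration of how the slope feeds into production planning, the sketch below multiplies the slope from the example (4,000 units per month) by a number of months to look ahead. The function name `expected_additional_units` is hypothetical, and the calculation assumes sales continue at the same average rate.

```python
SLOPE_UNITS_PER_MONTH = 4000  # slope from the example above


def expected_additional_units(slope, months_ahead):
    """Units expected to sell over the next `months_ahead` months,
    assuming the average monthly rate (the slope) stays the same."""
    return slope * months_ahead


for months_ahead in (1, 3, 6):
    units = expected_additional_units(SLOPE_UNITS_PER_MONTH, months_ahead)
    print(f"next {months_ahead} month(s): about {units:,} units")
```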

## Prediction Model

Putting the y-intercept and slope together in an equation gives the prediction model, a basic linear equation in two variables. Building on the example started above, let’s say the y-intercept of the data is 300. Theoretically, this means that 300 units of the product sold immediately when the product was put on the market. With these values, we get the linear equation y = 4000x + 300, where x is the number of months on the market and y is the total number of units sold. (Note that putting in 0 for x, the moment sales first started, gives y = 300.) This model can then be used to predict what total sales would be if sales continue at the same rate. For example, if the manufacturer wanted to know what total sales would be after 48 months of selling this product, they could simply put in 48 for x:

y = 4000(48) + 300

y = 192,300

Based on past sales data, the model predicts that a total of 192,300 units of the product will have been sold by the end of the fourth year of the product being on the market, provided sales continue at the same rate.
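The same calculation can be written as a small function, which makes it easy to try other values of x. This is only a sketch: the function name `predict_total_sales` is hypothetical, while the slope (4,000) and y-intercept (300) are the values from the example.

```python
SLOPE = 4000       # units sold per month, from the example
Y_INTERCEPT = 300  # theoretical units sold at month 0, from the example


def predict_total_sales(months, slope=SLOPE, intercept=Y_INTERCEPT):
    """Evaluate the linear model y = slope * x + intercept."""
    return slope * months + intercept


print(predict_total_sales(0))   # 300
print(predict_total_sales(48))  # 192300
```

Because the model is a straight line, the further x moves beyond the 36 months of data it was built from, the more the prediction depends on sales continuing at the same rate.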
