Styleforum › Forums › General › General Chat › Anyone a 'stats geek'?

# Anyone a 'stats geek'?

Want to help me with a project?

I wanted to test a series of variables against a constant to see what correlation may exist. I think I can set this up pretty easily in Excel, but I'm not exactly sure what I'm looking at in the results. I know the closer r^2 is to 1, the better, but I don't know how to interpret the rest of the results.

I will pay you back with infinite real estate knowledge.
There will be no correlation if you regress variables against a CONSTANT.
see. I am already off on the wrong foot.

What should I be doing?
Quote:
Originally Posted by jgold47

see. I am already off on the wrong foot.
What should I be doing?

Use the almighty power of Google!

For example, this calculator - plug in the two series you want to compare and voila....

http://www.easycalculation.com/statistics/correlation.php
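If Excel or that calculator feels like a black box, the same computation is a couple of lines in Python with NumPy (the store numbers below are made up purely for illustration):

```python
import numpy as np

# Two hypothetical series: store size (sq ft) and sales ($K)
store_size = np.array([1200, 1500, 1800, 2100, 2400, 3000])
sales = np.array([310, 420, 500, 560, 700, 850])

# Pearson correlation coefficient r, and r^2
r = np.corrcoef(store_size, sales)[0, 1]
r_squared = r ** 2

print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

An `r` near +1 or -1 means a strong linear relationship between the two series; near 0 means essentially none.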
So what two series are you trying to analyze? As someone said here, I am not sure there is much sense in trying to correlate one variable with a constant. Also, r-squared is one of the statistics you should be looking at, but certainly not the only one. Be more specific about what you are trying to do and I am sure someone here can help you.
Are you trying to see if there is a significant difference between the variables and the constant? I've never really come across this. You can use a t-test to see if there are significant differences between two variables, but I don't think that really flies with a constant. You can't really look at correlations because there is no covariance. R^2 doesn't make any sense because it is the percent of variance in the dependent variable explained by the independent variable. If your DV is the constant.....there's no variation to explain. If your DV is the variable.....how can a constant value explain any variance in the variable? You're basically saying that a value of 10 would explain a value of 10, 15, 20, 35, 45, etc. on the DV. Even if you had two variables, don't go betting too much on the R^2 value, since there is some debate about the utility of R^2, but that's beyond the scope of this.

So yea, as gnatty said explain the problem a bit more and I can see if I can think of something.

Edit: The one thing I can think of is to look at the mean and standard deviation of the variable to get an idea of the shape of the distribution and then compare how that relates to the constant. For instance, let's say your variable has a mean of 10 with a std. dev. of 2. If the variable is roughly normal, about 68% of the values will lie between 8 and 12, and roughly 95% between 6 and 14. You could then see what relationship those ranges have to the constant you're examining.
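A quick sanity check on those coverage rules of thumb (roughly 68% of a normal distribution falls within one standard deviation of the mean, about 95% within two), simulated in Python with an assumed mean of 10 and std. dev. of 2:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=10, scale=2, size=100_000)  # mean 10, std dev 2

within_1sd = np.mean((values >= 8) & (values <= 12))   # +/- 1 sd
within_2sd = np.mean((values >= 6) & (values <= 14))   # +/- 2 sd

print(f"within 1 sd: {within_1sd:.3f}")  # roughly 0.68
print(f"within 2 sd: {within_2sd:.3f}")  # roughly 0.95
```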
ok - thanks everyone for the responses.

Let me clarify.

I have a table of stores and their respective sales for a measuring period. I have 6-8 variables (demographics, store size, etc.) that I want to 'test' and see what effect each of the variables has on the sales of the stores.

From that relationship, I want to take the top two or three most significant 'drivers' of sales and build a forecasting model (using averages at this point, but I would experiment with a per-unit factor against the r^2 value).

I also understand that these 'variables' may not affect anything independently and may co-affect sales together, but that's a bit much for this.

The original premise of the project was to see if we could find a relationship between a sales forecasting model for one store concept and translate it to another store concept without having to have someone rebuild us a real model (they are very expensive). So, if model A says 1.3m for a store, we could expect to do 500K in concept B. Getting from A to B isn't hard and we don't need a high level of accuracy (hence the averages), but do I split the group by store size, by number of competitors, etc.? I thought testing the variables would suss that out for me in a more accurate way.

Make sense? It doesn't to me!
Okay, first how much stats knowledge do you have? Just curious because that might affect how I explain things.

Let me get it straight first. Your data looks something like this

store | sales  | var1 | var2 | var3 | var4
1     | number | #    | #    | #    | #
etc...

correct?

It's still a little difficult to figure out what exactly you're looking for, especially in light of that last paragraph. I think, however, that a simple linear regression should be fine. You might want to consider something like fixed effects to account for the possibility that "store" is having some type of impact on the DV, but that may be beyond what you're looking at doing. So basically, again if I'm parsing this out correctly, just do a linear regression with all the variables in the model. You'll then get some output that includes a beta coefficient, some other things and a p-value. There will also be an r^2 and an adjusted r^2. If you do a multivariate model you'll want to focus on adjusted r^2 since it takes into account any added explanatory power that might come about as a result of just adding in more variables.

You'll also want to look at the p-value. The p-value is the probability of seeing an effect at least as large as the one you observed if there were really no effect at all (the null hypothesis). A p of .05 means there's only a 5% chance you'd get a result that strong from chance alone, which is conventionally treated as evidence of a real effect. You can adjust what values you consider significant based on your requirements, such as whether you're more worried about type I or type II error (it seems like you'd be more concerned with type I). Then finally, and maybe most importantly, you have the beta coefficient. In a multivariate model the beta coefficient represents the impact that a one-unit change in the IV has on the DV, while controlling for the effects of the other IVs. So, if you have an IV like "store size" measured in square feet and a DV of "sales" measured in dollars, a beta coefficient of 35 indicates that adding one square foot to the store is associated with a \$35 increase in sales.
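If you want some intuition for where a p-value comes from without trusting canned output, one rough sketch is a permutation test: shuffle one variable many times and count how often chance alone produces a correlation as strong as the one you observed. The numbers below are invented for illustration, not the OP's data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative data: square footage vs. sales, with a true $35/sq ft effect
sqft = np.array([1200., 1500, 1800, 2100, 2400, 3000, 2700, 1300, 1900, 2200])
sales = 35 * sqft + rng.normal(0, 5000, size=sqft.size)

observed_r = np.corrcoef(sqft, sales)[0, 1]

# Break the pairing by shuffling sales; count how often |r| is as extreme
n_perm = 5000
extreme = 0
for _ in range(n_perm):
    shuffled = rng.permutation(sales)
    if abs(np.corrcoef(sqft, shuffled)[0, 1]) >= abs(observed_r):
        extreme += 1

p_value = extreme / n_perm
print(f"observed r = {observed_r:.3f}, permutation p = {p_value:.4f}")
```

A small `p_value` here says: random re-pairings of the data almost never produce a correlation this strong, so the observed relationship is unlikely to be pure chance.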

So, that's a basic crash course in regression analysis. A very, very basic crash course. I glossed over a LOT of the nuance and some pretty important points. There's also debate about how to interpret these statistics (not even accounting for the Bayesians!). I also don't know that I'd call a simple OLS regression a forecasting model.
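For what it's worth, here is a bare-bones sketch of that multivariate OLS in plain Python/NumPy on synthetic store data (it skips the standard errors and p-values that Stata/R/Excel would also report):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Synthetic store data: size (sq ft) and competitor count driving sales
size = rng.uniform(1000, 3000, n)
competitors = rng.integers(0, 10, n)
sales = 50_000 + 35 * size - 4_000 * competitors + rng.normal(0, 10_000, n)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), size, competitors])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)

resid = sales - X @ beta
ss_res = resid @ resid
ss_tot = ((sales - sales.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot
k = X.shape[1] - 1  # number of predictors (excluding the intercept)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print("intercept, beta_size, beta_competitors:", np.round(beta, 2))
print(f"r^2 = {r2:.3f}, adjusted r^2 = {adj_r2:.3f}")
```

Because the data was generated with known coefficients (35 per square foot, -4,000 per competitor), you can check that the regression recovers roughly those betas.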
Want a high r^2? Just keep adding variables and you'll reach one
Quote:
Originally Posted by stevent

Want a high r^2? Just keep adding variables and you'll reach one

Adjusted r^2 takes into account the adding in of more and more variables.
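You can actually watch this happen: bolt columns of pure noise onto a regression and plain R^2 can only go up, while adjusted R^2 pays a penalty for every extra variable. A sketch with made-up data:

```python
import numpy as np

def fit_r2(X, y):
    """OLS fit; return (r^2, adjusted r^2)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    n, k = X.shape[0], X.shape[1] - 1  # k excludes the intercept
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

rng = np.random.default_rng(7)
n = 50
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), x])                 # one real predictor
X2 = np.column_stack([X1, rng.normal(size=(n, 10))])  # plus 10 junk columns

r2_a, adj_a = fit_r2(X1, y)
r2_b, adj_b = fit_r2(X2, y)
print(f"real only:  r^2={r2_a:.3f}  adj={adj_a:.3f}")
print(f"plus junk:  r^2={r2_b:.3f}  adj={adj_b:.3f}")
```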

And while I'm here, why do businesses use Excel for stats work? I've never used it for statistical analysis, but it seems like it would be a major pain in the ass. Why not just learn Stata, R, or (god forbid) SPSS?
Quote:
Originally Posted by John152

Adjusted r^2 takes into account the adding in of more and more variables.
And while I'm here, why do businesses use Excel for stats work? I've never used it for statistical analysis, but it seems like it would be a major pain in the ass. Why not just learn Stata, R, or (god forbid) SPSS?

Because people are cheap and don't want to learn how to use new/better/more powerful tools.

OP, I'm not sure how academically rigorous you want your model to be, because doing what you're doing is known in my circles as 'data mining', an offense subject to intellectual ridicule for the foreseeable future.

For every single new variable you add in, you need some sort of justification from prior evidence to back you up. Moreover, you may encounter collinearity, seeing as your data does not seem to be continuous. In that case, if you do want to test for correlation before adding a variable into your model, you will need to use tetrachoric/polychoric correlations for integer-based data.

Moreover, and this is the final nail in the coffin if you want to be 'serious business' about this, any violation of the linear model assumptions in your misspecification tests will result in throwing out your entire model. If your X matrix turns out to be correlated to your error term, tough luck broski. You're going to have to argue that your massive (>1000) sample size results in your estimators approaching consistency in that case, despite being biased.
Quote:
Originally Posted by Nereis

Because people are cheap and don't want to learn how to use new/better/more powerful tools.
OP, I'm not sure how academically rigorous you want your model to be, because doing what you're doing is known in my circles as 'data mining', an offense subject to intellectual ridicule for the foreseeable future.
For every single new variable you add in, you need some sort of justification from prior evidence to back you up. Moreover, you may encounter collinearity seeing as your data does not seem to be continuous. In that case, you will need to use tetra/polychoric correlation scores if your data is integer based if you do want to test first for correlation before addition into your model.
Moreover, and this is the final nail in the coffin if you want to be 'serious business' about this, any violation of the linear model assumptions in your misspecification tests will result in throwing out your entire model. If your X matrix turns out to be correlated to your error term, tough luck broski. You're going to have to argue that your massive (>1000) sample size results in your estimators approaching consistency in that case, despite being biased.

Data mining isn't necessarily a bad thing depending on the context, and what you really mean by "data mining." There is machine-learning type data mining, which attempts to pick up imperceptible patterns in the data, and then there is atheoretical trolling of the data to try to find a result (gotta find those stars in the output). The biggest problem I see with what he's doing is, again depending on what kind of data he is looking at, it probably needs a time-series model and probably even a time-series cross-sectional model if the observations are clustered by store.
Quote:
Originally Posted by John152

Data mining isn't necessarily a bad thing depending on the context, and what you really mean by "data mining." There is machine-learning type data mining, which attempts to pick up imperceptible patterns in the data, and then there is atheoretical trolling of the data to try to find a result (gotta find those stars in the output). The biggest problem I see with what he's doing is, again depending on what kind of data he is looking at, it probably needs a time-series model and probably even a time-series cross-sectional model if the observations are clustered by store.

The data mining I'm talking about is simply adding things into the model to try and make it fit the sample better, with no real regard as to the causal effect of the additional explanatory variable.

In this manner you can 'demonstrate' that the growing size of chickens is linked to global warming.

OP, don't even focus on R squared. It doesn't tell you anything of real importance unless you're choosing between two equally well specified models. If you can rule out misspecification and can provide sensitivity analysis, then you're already ahead of the game so far as corporate use of statistics is concerned.
Quote:
Originally Posted by jgold47

ok - thanks everyone for the responses.
Let me clarify.
I have a table of stores and their respective sales for a measuring period. I have 6-8 variables (demographics, store size, etc.) that I want to 'test' and see what effect each of the variables has on the sales of the stores.
From that relationship, I want to take the top two or three most significant 'drivers' of sales and build a forecasting model (using averages at this point, but I would experiment with a per-unit factor against the r^2 value).
I also understand that these 'variables' may not affect anything independently and may co-affect sales together, but that's a bit much for this.
The original premise of the project was to see if we could find a relationship between a sales forecasting model for one store concept and translate it to another store concept without having to have someone rebuild us a real model (they are very expensive). So, if model A says 1.3m for a store, we could expect to do 500K in concept B. Getting from A to B isn't hard and we don't need a high level of accuracy (hence the averages), but do I split the group by store size, by number of competitors, etc.? I thought testing the variables would suss that out for me in a more accurate way.
Make sense? It doesn't to me!

You want regression. The response variable data points should be the sales for each quarter: Q1, Q2, Q3, etc. The X variables would be the demographics, store size, and so on. You are comparing the sales averaged over several quarters: Y = A*x1 + B*x2 + C*x3 + ..., where Y is the sales and x1 is store size.

For the categorical variables such as demographics, you need categorical regression (i.e., encode them as dummy/indicator variables).
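A rough sketch of what that looks like in practice: each category becomes a 0/1 dummy column, with one category dropped as the baseline so the betas read as shifts relative to it (the region names here are invented for illustration):

```python
import numpy as np

# Hypothetical categorical demographic code per store
region = np.array(["urban", "suburban", "rural", "urban", "suburban", "rural"])
sales = np.array([900., 700, 400, 950, 650, 450])

# One-hot encode, dropping the first category as the baseline
categories = sorted(set(region))          # ['rural', 'suburban', 'urban']
dummies = np.column_stack(
    [(region == c).astype(float) for c in categories[1:]]
)

# Design matrix: intercept + dummies; betas are shifts vs. the baseline (rural)
X = np.column_stack([np.ones(len(sales)), dummies])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
print("baseline (rural) mean:", round(beta[0], 1))
print("suburban shift:", round(beta[1], 1), " urban shift:", round(beta[2], 1))
```

With only a categorical predictor, the intercept is just the baseline group's mean and each dummy coefficient is that group's mean difference from the baseline.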
Quote:
Originally Posted by John152

Okay, first how much stats knowledge do you have? Just curious because that might affect how I explain things.
Let me get it straight first. Your data looks something like this
store | sales  | var1 | var2 | var3 | var4
1     | number | #    | #    | #    | #
etc...
correct?
It's still a little difficult to figure out what exactly you're looking for, especially in light of that last paragraph. I think, however, that a simple linear regression should be fine. You might want to consider something like fixed effects to account for the possibility that "store" is having some type of impact on the DV, but that may be beyond what you're looking at doing. So basically, again if I'm parsing this out correctly, just do a linear regression with all the variables in the model. You'll then get some output that includes a beta coefficient, some other things and a p-value. There will also be an r^2 and an adjusted r^2. If you do a multivariate model you'll want to focus on adjusted r^2 since it takes into account any added explanatory power that might come about as a result of just adding in more variables.
You'll also want to look at the p-value. The p-value is the probability of seeing an effect at least as large as the one you observed if there were really no effect at all (the null hypothesis). A p of .05 means there's only a 5% chance you'd get a result that strong from chance alone, which is conventionally treated as evidence of a real effect. You can adjust what values you consider significant based on your requirements, such as whether you're more worried about type I or type II error (it seems like you'd be more concerned with type I). Then finally, and maybe most importantly, you have the beta coefficient. In a multivariate model the beta coefficient represents the impact that a one-unit change in the IV has on the DV, while controlling for the effects of the other IVs. So, if you have an IV like "store size" measured in square feet and a DV of "sales" measured in dollars, a beta coefficient of 35 indicates that adding one square foot to the store is associated with a \$35 increase in sales.
So, that's a basic crash course in regression analysis. A very, very basic crash course. I glossed over a LOT of the nuance and some pretty important points. There's also debate about how to interpret these statistics (not even accounting for the Bayesians!). I also don't know that I'd call a simple OLS regression a forecasting model.

This is pretty on point, but there are still many more things to consider, such as getting the data specification right. For example, a lot of these kinds of variables have the common issue of being tied to time. Over time these types of variables just go up, so if you regress all of them together you are going to have insane multicollinearity and an inflated R^2, regardless of whether you use adjusted or not. You have to break the trends by differencing, % change, logs and so on. Even with that you have to test your VARs and so on. It is a much more intense process than I feel you can get through an online forum if you want to do it right.
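A quick illustration of the time-trend problem: two completely unrelated series that both drift upward look highly correlated in levels, but the correlation largely disappears once you difference them. A sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(200)

# Two independent series that each trend upward over time
a = 0.5 * t + rng.normal(0, 5, t.size)
b = 0.3 * t + rng.normal(0, 5, t.size)

r_levels = np.corrcoef(a, b)[0, 1]                    # spuriously high
r_diffs = np.corrcoef(np.diff(a), np.diff(b))[0, 1]   # near zero

print(f"levels r = {r_levels:.2f}, differenced r = {r_diffs:.2f}")
```

The shared trend, not any real relationship, is what drives the high correlation in levels; differencing removes the trend and the apparent relationship with it.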

Also, this last paragraph is important: a model is not a forecast. A model is an explanation of what happened at a point in time; this may or may not be applicable to the future.