Anyone a 'stats geek'?

Discussion in 'General Chat' started by jgold47, Apr 20, 2012.

1. jgold47Well-Known Member

Messages:
1,629
Joined:
Mar 23, 2008
Location:
The Mitten
Want to help me with a project?

I wanted to test a series of variables against a constant to see what correlation may exist. I think I can set this up pretty easily in Excel, but I am not exactly sure what I am looking at for the results. I know the closer r² is to 1, the better, but I don't know how to interpret the rest of the results.

I will pay you back with infinite real estate knowledge.

2. 89826Well-Known Member

Messages:
576
Joined:
Oct 31, 2006
Location:
Santa Monica
There will be no correlation if you regress variables against a CONSTANT.

3. jgold47Well-Known Member

Messages:
1,629
Joined:
Mar 23, 2008
Location:
The Mitten
See, I am already off on the wrong foot.

What should I be doing?

4. EbichumanWell-Known Member

Messages:
522
Joined:
Jun 22, 2011
Location:
North and East from the centre
5. gnatty8Well-Known Member

Messages:
9,455
Joined:
Nov 12, 2006
Location:
Not in Atlanta, GA
So what two series are you trying to analyze? As someone said here, I am not sure there is much sense in trying to correlate one variable with a constant. Also, r-squared is one of the statistics you should be looking at, but certainly not the only one. Be more specific about what you are trying to do and I am sure someone here can help you.

6. John152Well-Known Member

Messages:
72
Joined:
Aug 11, 2007
Are you trying to see if there is a significant difference between the variables and the constant? I've never really come across this. You can use a t-test to see if there are significant differences between two variables, but I don't think that really flies with a constant. You can't really look at correlations because there is no covariance. R^2 doesn't make any sense because it is the percent of variance in the dependent variable explained by the independent variable. If your DV is the constant.....there's no variation to explain. If your DV is the variable.....how can a constant value explain any variance in the variable? You're basically saying that a value of 10 would explain a value of 10, 15, 20, 35, 45, etc. on the DV. Even if you had two variables, don't go betting too much on the R^2 value, since there is some debate about the utility of R^2, but that's beyond the scope of this.

So yea, as gnatty said explain the problem a bit more and I can see if I can think of something.

Edit: The one thing I can think of is to look at the mean and standard deviation of the variable to get an idea of the shape of the distribution and then compare how that relates to the constant. For instance, let's say your variable has a mean of 10 with a std. dev. of 2. If the distribution is roughly normal, about 68% of the values will lie between 8 and 12 (one standard deviation), and about 95% will lie between 6 and 14 (two standard deviations). You could then see where the constant you're examining falls relative to that spread.
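A quick simulated check of that rule of thumb (made-up data, not anyone's actual sales figures), assuming the variable is roughly normal with mean 10 and std. dev. 2:

```python
import random

# Simulate a normal variable with mean 10, std. dev. 2 and check the
# empirical rule: ~68% within one sd (8 to 12), ~95% within two (6 to 14).
random.seed(42)
values = [random.gauss(10, 2) for _ in range(100_000)]

within_1sd = sum(8 <= v <= 12 for v in values) / len(values)
within_2sd = sum(6 <= v <= 14 for v in values) / len(values)
print(f"within 1 sd: {within_1sd:.3f}, within 2 sd: {within_2sd:.3f}")
```

If the variable is badly skewed, those percentages won't hold and you'd want to look at percentiles instead.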

Last edited: Apr 21, 2012
7. jgold47Well-Known Member

Messages:
1,629
Joined:
Mar 23, 2008
Location:
The Mitten
ok - thanks everyone for the responses.

Let me clarify.

I have a table of stores and their respective sales for a measuring period. I have 6-8 variables (demographics, store size, etc...) that I want to 'test' to see what effect each of the variables has on the sales of the stores.

From that relationship, I want to take the top two or three most significant 'drivers' of sales and build a forecasting model (using averages at this point, but I would experiment with a per-unit factor against the r² value).

I also understand that these 'variables' may not affect anything independently and may co-affect sales together, but that's a bit much for this.

The original premise of the project was to see if we could find a relationship between a sales forecasting model for one store concept and translate it to another store concept without having to have someone rebuild us a real model (they are very expensive). So, if model A says 1.3M for a store, we could expect to do 500K in concept B. Getting from A to B isn't hard and we don't need a high level of accuracy (hence the averages), but do I split the group by store size, by number of competitors, etc.? I thought testing the variables would suss that out for me in a more accurate way.

Make sense? It doesn't to me!

8. John152Well-Known Member

Messages:
72
Joined:
Aug 11, 2007
Okay, first how much stats knowledge do you have? Just curious because that might affect how I explain things.

Let me get it straight first. Your data looks something like this

store | sales  | var1 | var2 | var3 | var4
1     | number | #    | #    | #    | #
etc...

correct?

It's still a little difficult to figure out what exactly you're looking for, especially in light of that last paragraph. I think, however, that a simple linear regression should be fine. You might want to consider something like fixed effects to account for the possibility that "store" is having some type of impact on the DV, but that may be beyond what you're looking at doing. So basically, again if I'm parsing this out correctly, just do a linear regression with all the variables in the model. You'll then get some output that includes a beta coefficient, some other things and a p-value. There will also be an r^2 and an adjusted r^2. If you do a multivariate model you'll want to focus on adjusted r^2 since it takes into account any added explanatory power that might come about as a result of just adding in more variables.

You'll also want to look at the p-value. The p-value is the probability of seeing a result at least as extreme as the one you observed if there were actually no effect (the null hypothesis). A p of .05 means that, if there were truly no relationship, you'd see an effect this large only about 5% of the time. You can adjust what values you consider significant based on your requirements, such as whether you're more worried about type I or type II error (it seems like you'd be more concerned with type I). Then finally, and maybe most importantly, you have the beta coefficient. In a multivariate model the beta coefficient represents the impact that a one-unit change in the IV has on the DV, while controlling for the effects of the other IVs. So, if you have an IV like "store size" measured in square feet and a DV of "sales" measured in dollars, a beta coefficient of 35 indicates that each additional square foot is associated with a $35 increase in sales.
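To make the pieces concrete, here's a tiny worked sketch of the one-IV case with completely made-up numbers, computing the beta coefficient, intercept, R², and adjusted R² from scratch:

```python
# Made-up example: regress sales (dollars) on store size (sq ft), one IV.
sqft  = [1000, 1500, 2000, 2500, 3000]
sales = [40000, 58000, 71000, 90000, 105000]

n = len(sqft)
mx, my = sum(sqft) / n, sum(sales) / n

# OLS slope (beta) and intercept (alpha)
beta  = sum((x - mx) * (y - my) for x, y in zip(sqft, sales)) \
        / sum((x - mx) ** 2 for x in sqft)
alpha = my - beta * mx

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = sum((y - (alpha + beta * x)) ** 2 for x, y in zip(sqft, sales))
ss_tot = sum((y - my) ** 2 for y in sales)
r2     = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 1 - 1)   # k = 1 predictor

print(f"beta={beta:.1f}  alpha={alpha:.0f}  r2={r2:.4f}  adj_r2={adj_r2:.4f}")
```

Here beta works out to about 32.4, i.e. in this toy data each extra square foot is associated with roughly $32 more in sales. A real stats package would also give you the standard errors and p-values for each coefficient.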

So, that's a basic crash course in regression analysis. A very, very basic crash course. I glossed over a LOT of the nuance and some pretty important points. There's also debate about how to interpret these statistics (not even accounting for the Bayesians!). I also don't know that I'd call a simple OLS regression a forecasting model.

Last edited: Apr 24, 2012
9. steventWell-Known Member

Messages:
9,554
Joined:
Feb 16, 2010
Want a high r^2? Just keep adding variables and you'll reach one
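Easy to demonstrate: with least squares, adding a regressor can never lower R², even when the new variable is pure noise. A quick simulation (fake data, numpy):

```python
import numpy as np

def fit_r2(X, y):
    """OLS fit with an intercept; returns (R^2, adjusted R^2)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    k = X.shape[1] - 1                      # number of predictors
    adj = 1 - (1 - r2) * (len(y) - 1) / (len(y) - k - 1)
    return r2, adj

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)              # true model uses only x

r2_true, adj_true = fit_r2(x[:, None], y)
junk = rng.normal(size=(n, 10))             # ten pure-noise regressors
r2_junk, adj_junk = fit_r2(np.column_stack([x[:, None], junk]), y)

print(f"R^2: {r2_true:.3f} -> {r2_junk:.3f} (never falls)")
print(f"adjusted R^2: {adj_true:.3f} -> {adj_junk:.3f}")
```

Adjusted R² charges a penalty per added regressor, so it typically flattens or drops when the extra variables explain nothing real.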

1 person likes this.
10. John152Well-Known Member

Messages:
72
Joined:
Aug 11, 2007


Adjusted r^2 takes into account the adding in of more and more variables.

And while I'm here, why do businesses use Excel for stats work? I've never used it for statistical analysis, but it seems like it would be a major pain in the ass. Why not just learn Stata, R, or (god forbid) SPSS?

Last edited: Apr 24, 2012
11. NereisWell-Known Member

Messages:
1,374
Joined:
Feb 12, 2009


Because people are cheap and don't want to learn how to use new/better/more powerful tools.

OP, I'm not sure how academically rigorous you want your model to be, because doing what you're doing is known in my circles as 'data mining', an offense subject to intellectual ridicule for the foreseeable future.

For every single new variable you add in, you need some sort of justification from prior evidence to back you up. Moreover, you may encounter collinearity, seeing as your data does not seem to be continuous. In that case, you will need to use tetrachoric/polychoric correlations if your data is integer-based and you want to test for correlation before adding a variable to your model.

Moreover, and this is the final nail in the coffin if you want to be 'serious business' about this, any violation of the linear model assumptions in your misspecification tests will result in throwing out your entire model. If your X matrix turns out to be correlated with your error term, tough luck, broski. You're going to have to argue that your massive (>1000) sample size results in your estimators approaching consistency in that case, despite being biased.

12. John152Well-Known Member

Messages:
72
Joined:
Aug 11, 2007


Data mining isn't necessarily a bad thing depending on the context, and what you really mean by "data mining." There is machine-learning type data mining, which attempts to pick up imperceptible patterns in the data, and then there is atheoretical trolling of the data to try to find a result (gotta find those stars in the output). The biggest problem I see with what he's doing is, again depending on what kind of data he is looking at, it probably needs a time-series model and probably even a time-series cross-sectional model if the observations are clustered by store.

13. NereisWell-Known Member

Messages:
1,374
Joined:
Feb 12, 2009


The data mining I'm talking about is simply adding things into the model to try and make it fit the sample better, with no real regard as to the causal effect of the additional explanatory variable.

In this manner you can 'demonstrate' that the growing size of chickens is linked to global warming.

OP, don't even focus on R squared. It doesn't tell you anything of real importance unless you're choosing between two equally well specified models. If you can rule out misspecification and can provide sensitivity analysis, then you're already ahead of the game so far as corporate use of statistics is concerned.

14. CoburnWell-Known Member

Messages:
624
Joined:
Apr 29, 2009
Location:
Seattle


You want regression. The response variable data points should be the sales for each quarter (Q1, Q2, Q3, etc.). The X variables would be the demographics, store size, etc. You are comparing the sales averaged over several quarters: Y = A·x1 + B·x2 + C·x3 + ..., where Y is the sales and x1 is store size.

For the categorical variables such as demographics, you need categorical regression
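The standard way to get a nominal variable like that into a regression is dummy (indicator) coding: one 0/1 column per category, with one category left out as the baseline. A sketch using a hypothetical "region" variable (the names are made up for illustration):

```python
# Hypothetical nominal variable "region", dummy-coded for regression.
stores = [
    {"store": 1, "region": "urban"},
    {"store": 2, "region": "suburban"},
    {"store": 3, "region": "rural"},
    {"store": 4, "region": "urban"},
]

levels = sorted({s["region"] for s in stores})   # ['rural', 'suburban', 'urban']
baseline = levels[0]            # omit one level to avoid the dummy variable trap
for s in stores:
    for lvl in levels[1:]:
        s[f"is_{lvl}"] = 1 if s["region"] == lvl else 0

print(stores[0])
```

Each dummy's coefficient is then read as the difference in the DV relative to the omitted baseline category.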

15. patrickBOOTHWell-Known Member

Messages:
33,325
Joined:
Oct 16, 2006
Location:
New York City


This is pretty on point, but there are still many more things to consider, such as getting the data specification right. For example, a lot of these kinds of variables have the common issue of being tied to time: over time they just go up, so if you regress them all together you are going to have insane multicollinearity and an inflated R², regardless of whether you use the adjusted version or not. You have to break the trends by differencing, % change, logs, and so on. Even with that you have to test your VARs and so on. It is a much more intense process than I feel you can get through an online forum if you want to do it right.
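A small sketch of the trend-breaking transforms mentioned above (differencing, % change, log differences), on a made-up series that grows 10% per quarter:

```python
import math

# Hypothetical quarterly sales growing 10% per quarter
sales = [100.0, 110.0, 121.0, 133.1, 146.41]

diffs = [b - a for a, b in zip(sales, sales[1:])]                       # first differences
pct   = [(b - a) / a for a, b in zip(sales, sales[1:])]                 # % change
logd  = [math.log(b) - math.log(a) for a, b in zip(sales, sales[1:])]   # log differences

print(diffs)  # still trending upward: 10.0, 11.0, 12.1, 13.31
print(pct)    # flat at ~0.10 -- trend broken
print(logd)   # flat at ~0.0953, i.e. log(1.1)
```

Note that simple differencing doesn't remove a multiplicative trend (the differences keep growing), while % change and log differences flatten it, which is one reason log transforms are so common with sales data.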

Also, this last paragraph is important: a model is not a forecast. A model is an explanation of what happened at a point in time; this may or may not be applicable to the future.

Last edited: May 8, 2012
16. John152Well-Known Member

Messages:
72
Joined:
Aug 11, 2007


I don't think he has time series data, but if he does you are correct. There should, as always, be tests of multicollinearity before running the model. Also, depending on what the variables look like, he probably needs to break things like "demographics" into several dichotomous variables since "demographics" is probably a nominal non-ordered variable. Finally, like I said above, fixed effects would probably be a serious consideration.

In short, I think this thread shows that there is a lot more to statistics than just throwing variables into a model. Good statistics can be very helpful and tell us some interesting things about the world, but I really believe that I'd rather have no statistics than bad statistics.

17. patrickBOOTHWell-Known Member

Messages:
33,325
Joined:
Oct 16, 2006
Location:
New York City


This is all true. A log-log transformation would be good for cross-sectional data. However, you are correct: sound theory always makes for a better model.

18. david3558Well-Known Member

Messages:
885
Joined:
Sep 26, 2007
Location:
San Francisco
Holy cow, hate to interrupt, but what did everyone major in / what's your current profession? I'm a marketing student and I've gotten pretty interested in stats myself.

19. patrickBOOTHWell-Known Member

Messages:
33,325
Joined:
Oct 16, 2006
Location:
New York City
Finance, economics, psychology.

I still don't know what is wrong with me.

20. david3558Well-Known Member

Messages:
885
Joined:
Sep 26, 2007
Location:
San Francisco
Haha, I'm marketing, entr., and economics - though entr as a major is total BS in my opinion.