
Anyone a 'stats geek'?

jgold47

Distinguished Member
Joined
Mar 23, 2008
Messages
1,617
Reaction score
13
Want to help me with a project?

I wanted to test a series of variables against a constant to see what correlation may exist. I think I can set this up pretty easily in Excel, but I am not exactly sure how to read the results. I know the closer R^2 is to 1 the better, but I don't know how to interpret the rest of the output.

I will pay you back with infinite real estate knowledge.
 

89826

Senior Member
Joined
Oct 31, 2006
Messages
708
Reaction score
154
There will be no correlation if you regress variables against a CONSTANT.
 

jgold47

Distinguished Member
Joined
Mar 23, 2008
Messages
1,617
Reaction score
13
See, I am already off on the wrong foot.

What should I be doing?
 

gnatty8

Stylish Dinosaur
Joined
Nov 12, 2006
Messages
12,663
Reaction score
6,204
So what two series are you trying to analyze? As someone said here, I am not sure there is much sense in trying to correlate one variable with a constant. Also, r-squared is one of the statistics you should be looking at, but certainly not the only one. Be more specific about what you are trying to do and I am sure someone here can help you.
 

John152

Well-Known Member
Joined
Aug 11, 2007
Messages
71
Reaction score
0
Are you trying to see if there is a significant difference between the variables and the constant? I've never really come across this. You can use a t-test to see if there are significant differences between two variables, but I don't think that really flies with a constant. You can't really look at correlations because there is no covariance.

R^2 doesn't make any sense here, because it is the percent of variance in the dependent variable explained by the independent variable. If your DV is the constant, there's no variation to explain. If your DV is the variable, how can a constant value explain any variance in a variable? You're basically saying that a value of 10 would explain values of 10, 15, 20, 35, 45, etc. on the DV. Even if you had two variables, don't go betting too much on the R^2 value, since there is some debate about the utility of R^2, but that's beyond the scope of this.

So yeah, as gnatty said, explain the problem a bit more and I'll see if I can think of something.

Edit: The one thing I can think of is to look at the mean and standard deviation of the variable to get an idea of the shape of the distribution, and then compare how that relates to the constant. For instance, say your variable has a mean of 10 with a std. dev. of 2. Then roughly 68% of the values will lie between 8 and 12 (one standard deviation) and about 95% between 6 and 14 (two standard deviations), assuming the distribution is roughly normal. You could then see where the constant you're examining falls in that range.
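As a quick numeric check of that idea in Python (the sample values below are made up):

    import numpy as np

    # Hypothetical variable with mean near 10 and std. dev. near 2.
    x = np.array([9.1, 10.4, 8.7, 11.2, 10.0, 12.3, 7.8, 10.5, 9.6, 10.4])

    mean, sd = x.mean(), x.std(ddof=1)         # ddof=1 -> sample standard deviation
    low, high = mean - 2 * sd, mean + 2 * sd   # ~95% of a normal distribution lies here

    print(f"mean={mean:.2f}, sd={sd:.2f}, ~95% interval [{low:.2f}, {high:.2f}]")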
 

jgold47

Distinguished Member
Joined
Mar 23, 2008
Messages
1,617
Reaction score
13
OK, thanks everyone for the responses.

Let me clarify.

I have a table of stores and their respective sales for a measuring period. I have 6-8 variables (demographics, store size, etc.) that I want to 'test' to see what effect each of the variables has on the sales of the stores.

From that relationship, I want to take the top two or three most significant 'drivers' of sales and build a forecasting model (using averages at this point, but I would experiment with a per-unit factor against the R^2 value).

I also understand that these 'variables' may not affect anything independently and may co-affect sales together, but that's a bit much for this.

The original premise of the project was to see if we could find a relationship between a sales forecasting model for one store concept and translate it to another store concept without having to have someone rebuild us a real model (they are very expensive). So, if model A says 1.3M for a store, we could expect to do 500K in concept B. Getting from A to B isn't hard and we don't need a high level of accuracy (hence the averages), but do I split the group by store size, by number of competitors, etc.? I thought testing the variables would suss that out for me in a more accurate way.

Make sense? It doesn't to me!
 

John152

Well-Known Member
Joined
Aug 11, 2007
Messages
71
Reaction score
0
Okay, first how much stats knowledge do you have? Just curious because that might affect how I explain things.

Let me get it straight first. Your data looks something like this

store   sales    var1   var2   var3   var4
1       number   #      #      #      #
etc.

correct?

It's still a little difficult to figure out what exactly you're looking for, especially in light of that last paragraph. I think, however, that a simple linear regression should be fine. You might want to consider something like fixed effects to account for the possibility that "store" is having some type of impact on the DV, but that may be beyond what you're looking at doing. So basically, again if I'm parsing this out correctly, just do a linear regression with all the variables in the model. You'll then get some output that includes a beta coefficient, some other things, and a p-value. There will also be an r^2 and an adjusted r^2. If you do a multivariate model you'll want to focus on adjusted r^2, since it corrects for the mechanical increase in explanatory power that comes from just adding in more variables.

You'll also want to look at the p-value. The p-value is the probability of seeing an effect at least as large as the one you observed if there were actually no effect at all. A p of .05 means that, if the true effect were zero, you'd get a result this extreme only about 5% of the time. You can adjust what values you consider significant based on your requirements, such as whether you're more worried about type I or type II error (it seems like you'd be more concerned with type I). Then finally, and maybe most importantly, you have the beta coefficient. In a multivariate model the beta coefficient represents the impact that a one-unit change in the IV has on the DV, while controlling for the effects of the other IVs. So, if you have an IV like "store size" measured in square feet and a DV of "sales" measured in dollars, a beta coefficient of 35 indicates that each additional square foot is associated with a $35 increase in sales.
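To make that concrete, here's a minimal sketch of the whole workflow in Python with statsmodels; every number, variable name, and coefficient below is invented for illustration:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Fake version of the store table sketched above.
    rng = np.random.default_rng(0)
    n = 60
    df = pd.DataFrame({
        "store_size": rng.uniform(2000, 12000, n),      # square feet
        "median_income": rng.uniform(35000, 95000, n),  # demographic variable
        "competitors": rng.integers(0, 8, n),
    })
    df["sales"] = (35 * df["store_size"] + 2 * df["median_income"]
                   - 40000 * df["competitors"] + rng.normal(0, 150000, n))

    model = smf.ols("sales ~ store_size + median_income + competitors",
                    data=df).fit()
    print(model.summary())     # betas, p-values, r^2, adjusted r^2 in one table
    print(model.rsquared_adj)  # adjusted r^2 by itself

Reading the summary() table is the whole game here: the coef column holds the betas, P>|t| holds the p-values, and Adj. R-squared sits in the header.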

So, that's a basic crash course in regression analysis. A very, very basic crash course. I glossed over a LOT of the nuance and some pretty important points. There's also debate about how to interpret these statistics (not even accounting for the Bayesians!). I also don't know that I'd call a simple OLS regression a forecasting model.
 

stevent

Distinguished Member
Joined
Feb 16, 2010
Messages
9,564
Reaction score
1,483
Want a high r^2? Just keep adding variables and you'll reach one
 

John152

Well-Known Member
Joined
Aug 11, 2007
Messages
71
Reaction score
0

Want a high r^2? Just keep adding variables and you'll reach one


Adjusted r^2 accounts for the adding in of more and more variables; it only rises when a new variable adds real explanatory power.
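A quick way to see the difference, sketched in Python: regress pure noise on an ever-growing pile of junk variables, and plain r^2 climbs while adjusted r^2 stays near zero (everything below is invented data):

    import numpy as np
    import statsmodels.api as sm

    # Pure noise outcome: no regressor should genuinely explain anything.
    rng = np.random.default_rng(1)
    n = 50
    y = rng.normal(size=n)

    X = np.ones((n, 1))  # start from an intercept-only model
    for k in range(1, 21):
        X = np.column_stack([X, rng.normal(size=n)])  # add one junk regressor
        fit = sm.OLS(y, X).fit()
        if k % 5 == 0:
            print(f"{k:2d} junk vars: R^2={fit.rsquared:.3f}, "
                  f"adj R^2={fit.rsquared_adj:.3f}")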

And while I'm here, why do businesses use Excel for stats work? I've never used it for statistical analysis, but it seems like it would be a major pain **********. Why not just learn Stata, R, or (god forbid) SPSS?
 

Nereis

Distinguished Member
Joined
Feb 12, 2009
Messages
1,358
Reaction score
44

Adjusted r^2 takes into account the adding in of more and more variables.
And while I'm here, why do businesses use Excel for stats work? I've never used it for statistical analysis, but it seems like it would be a major pain **********. Why not just learn Stata, R, or (god forbid) SPSS?


Because people are cheap and don't want to learn how to use new/better/more powerful tools.

OP, I'm not sure how academically rigorous you want your model to be, because doing what you're doing is known in my circles as 'data mining', an offense subject to intellectual ridicule for the foreseeable future.

For every single new variable you add in, you need some sort of justification from prior evidence to back you up. Moreover, you may encounter collinearity, seeing as your data does not seem to be continuous. In that case, if your data is integer-based and you want to test for correlation before adding a variable to the model, you will need to use tetrachoric/polychoric correlation scores.

Moreover, and this is the final nail in the coffin if you want to be 'serious business' about this: any violation of the linear model assumptions in your misspecification tests will result in throwing out your entire model. If your X matrix turns out to be correlated with your error term, tough luck broski. You're going to have to argue that your massive (>1000) sample size results in your estimators approaching consistency, despite being biased.
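For what it's worth, the collinearity check and one common misspecification test are both one-liners in statsmodels. Here is a rough sketch in Python with fabricated data; note that correlation between X and the error term (endogeneity) can't be tested directly from residuals, so this only covers the testable parts:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.stats.diagnostic import het_breuschpagan

    # Invented data; x2 is built to be nearly collinear with x1.
    rng = np.random.default_rng(2)
    n = 100
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.1, size=n)
    y = 1 + 2 * x1 + rng.normal(size=n)

    X = sm.add_constant(np.column_stack([x1, x2]))
    fit = sm.OLS(y, X).fit()

    # Variance inflation factors: values far above ~10 flag collinearity trouble.
    for i in range(1, X.shape[1]):
        print(f"VIF x{i}: {variance_inflation_factor(X, i):.1f}")

    # Breusch-Pagan: a small p-value suggests heteroskedastic errors,
    # one of the standard misspecification red flags.
    lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
    print(f"Breusch-Pagan p-value: {lm_pvalue:.3f}")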
 

John152

Well-Known Member
Joined
Aug 11, 2007
Messages
71
Reaction score
0

OP, I'm not sure how academically rigorous you want your model to be, because doing what you're doing is known in my circles as 'data mining' […]


Data mining isn't necessarily a bad thing depending on the context, and what you really mean by "data mining." There is machine-learning type data mining, which attempts to pick up imperceptible patterns in the data, and then there is atheoretical trawling of the data to try to find a result (gotta find those stars in the output). The biggest problem I see with what he's doing is, again depending on what kind of data he is looking at, that it probably needs a time-series model, and probably even a time-series cross-sectional model if the observations are clustered by store.
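If the observations do repeat by store, a cheap middle ground short of a full panel model is cluster-robust standard errors. A minimal sketch in Python; the store counts, sizes, and sales are all invented:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Invented panel: 10 stores observed for 8 quarters each.
    rng = np.random.default_rng(3)
    df = pd.DataFrame({
        "store": np.repeat(np.arange(10), 8),
        "store_size": np.repeat(rng.uniform(2000, 12000, 10), 8),
        "sales": rng.normal(1_000_000, 200_000, 80),
    })

    # Same OLS point estimates, but standard errors that allow for
    # correlation among observations within the same store.
    fit = smf.ols("sales ~ store_size", data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["store"]}
    )
    print(fit.summary())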
 

Nereis

Distinguished Member
Joined
Feb 12, 2009
Messages
1,358
Reaction score
44

Data mining isn't necessarily a bad thing depending on the context, and what you really mean by "data mining." […]


The data mining I'm talking about is simply adding things into the model to try and make it fit the sample better, with no real regard as to the causal effect of the additional explanatory variable.

In this manner you can 'demonstrate' that the growing size of chickens is linked to global warming.

OP, don't even focus on R squared. It doesn't tell you anything of real importance unless you're choosing between two equally well specified models. If you can rule out misspecification and can provide sensitivity analysis, then you're already ahead of the game so far as corporate use of statistics is concerned.
 

Coburn

Senior Member
Joined
Apr 29, 2009
Messages
631
Reaction score
51

OK, thanks everyone for the responses. […] Make sense? It doesn't to me!


You want regression. The response variable data points should be the sales for each quarter (Q1, Q2, Q3, etc.). The X variables would be the demographics, store size, and so on. You are modeling the sales averaged over several quarters:

Y = A*x1 + B*x2 + C*x3 + ...

where Y is the sales and x1 is store size.

For the categorical variables such as demographics, you need categorical regression (i.e., dummy-coded predictors).
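To be concrete about the dummy-coding, here's a minimal sketch in Python; the data and the 'region' variable are made up, and C() just tells the formula interface to expand the category into indicator columns:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Made-up store data with one categorical predictor.
    df = pd.DataFrame({
        "sales":  [1.2e6, 0.9e6, 1.5e6, 0.7e6, 1.1e6, 1.3e6],
        "sqft":   [8000, 5000, 11000, 4000, 7000, 9000],
        "region": ["urban", "rural", "urban", "rural", "suburb", "suburb"],
    })

    # C(region) dummy-codes the levels; one level is held out as the reference.
    fit = smf.ols("sales ~ sqft + C(region)", data=df).fit()
    print(fit.params)  # one coefficient per non-reference region level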
 

patrickBOOTH

Stylish Dinosaur
Dubiously Honored
Joined
Oct 16, 2006
Messages
38,393
Reaction score
13,643

Okay, first how much stats knowledge do you have? […] I also don't know that I'd call a simple OLS regression a forecasting model.


This is pretty on point, but there are still many more things to consider, such as getting the data specification right. For example, a lot of these kinds of variables have the common issue of being tied to time. Over time these types of variables just go up, so if you regress all of them together you are going to have insane multicollinearity and an inflated R^2, adjusted or not. You have to break the trends by differencing, % change, logs, and so on. Even with that you have to test your VARs and so on. It is a much more intense process than I feel you can get through an online forum if you want to do it right.
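For the trend-breaking step, the usual transforms are one-liners; a tiny sketch in Python on an invented upward-trending series:

    import numpy as np
    import pandas as pd

    # Invented quarterly sales that trend steadily upward.
    sales = pd.Series([100, 110, 125, 138, 155, 171, 190, 212], dtype=float)

    log_diff = np.log(sales).diff()   # approximate per-quarter growth rate
    pct = sales.pct_change()          # exact percent change

    # Either transformed series is far closer to trend-free than the raw levels.
    print(pd.DataFrame({"log_diff": log_diff, "pct_change": pct}))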

Also, this last paragraph is important: a model is not a forecast. A model is an explanation of what happened at a point in time, which may or may not be applicable to the future.
 
