
Anyone a 'stats geek'?

jgold47

Distinguished Member
Joined
Mar 23, 2008
Messages
1,629
Reaction score
13
Want to help me with a project?

I wanted to test a series of variables against a constant to see what correlation may exist. I think I can set this up pretty easily in Excel, but I am not exactly sure what I am looking at in the results. I know the closer r^2 is to 1, the better, but I don't know how to interpret the rest of the results.

I will pay you back with infinite real estate knowledge.
 

89826

Senior Member
Joined
Oct 31, 2006
Messages
714
Reaction score
151
There will be no correlation if you regress variables against a CONSTANT.
 

jgold47

Distinguished Member
Joined
Mar 23, 2008
Messages
1,629
Reaction score
13
See, I am already off on the wrong foot.

What should I be doing?
 

gnatty8

Distinguished Member
Joined
Nov 12, 2006
Messages
9,942
Reaction score
2,020
So what two series are you trying to analyze? As someone said here, I am not sure there is much sense in trying to correlate one variable with a constant. Also, r-squared is one of the statistics you should be looking at, but certainly not the only one. Be more specific about what you are trying to do and I am sure someone here can help you.
 

John152

Well-Known Member
Joined
Aug 11, 2007
Messages
72
Reaction score
0
Are you trying to see if there is a significant difference between the variables and the constant? I've never really come across this. You can use a t-test to see if there are significant differences between two variables, but I don't think that really flies with a constant. You can't really look at correlations because there is no covariance. R^2 doesn't make any sense because it is the percent of variance in the dependent variable explained by the independent variable. If your DV is the constant, there's no variation to explain. If your DV is the variable, how can a constant value explain any variance in a variable? You're basically saying that a value of 10 would explain a value of 10, 15, 20, 35, 45, etc. on the DV. Even if you had two variables, don't go betting too much on the R^2 value, since there is some debate about the utility of R^2, but that's beyond the scope of this.
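The no-covariance point above is easy to verify numerically. A minimal pure-Python sketch (the data values are made up for illustration):

```python
# Covariance of any variable with a constant is exactly zero, so Pearson
# correlation (cov / (sd_x * sd_y)) becomes 0/0 -- undefined.

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

sales = [10, 15, 20, 35, 45]        # a variable with real spread
constant = [10] * len(sales)        # the "constant" series

print(covariance(sales, constant))  # 0.0 -- no covariance, so no correlation
```

Every deviation of the constant from its own mean is zero, so every term in the sum vanishes regardless of what the other series does.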

So yeah, as gnatty said, explain the problem a bit more and I'll see if I can think of something.

Edit: The one thing I can think of is to look at the mean and standard deviation of the variable to get an idea of the shape of the distribution and then compare how that relates to the constant. For instance, let's say your variable has a mean of 10 with a std. dev. of 2. For a roughly normal distribution, about 68% of the values will lie between 8 and 12, and roughly 95% between 6 and 14. You could then see what relationship this has to the constant you're examining.
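That empirical rule (about 68% of a normal variable within one standard deviation of the mean, about 95% within two) can be checked by simulation; a Python sketch using the same illustrative parameters, mean 10 and sd 2:

```python
# Simulate a normal variable (mean 10, sd 2) and count how much of it
# falls within one and two standard deviations of the mean.
import random

random.seed(42)
values = [random.gauss(10, 2) for _ in range(100_000)]

mean = sum(values) / len(values)
sd = (sum((v - mean) ** 2 for v in values) / (len(values) - 1)) ** 0.5

within_1sd = sum(8 <= v <= 12 for v in values) / len(values)
within_2sd = sum(6 <= v <= 14 for v in values) / len(values)

print(f"mean={mean:.2f} sd={sd:.2f}")
print(f"within 1 sd: {within_1sd:.1%}, within 2 sd: {within_2sd:.1%}")
```

The counts come out near 68% and 95% respectively, matching the usual 68-95-99.7 rule.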
 
Last edited:

jgold47

Distinguished Member
Joined
Mar 23, 2008
Messages
1,629
Reaction score
13
ok - thanks everyone for the responses.

Let me clarify.


I have a table of stores and their respective sales for a measuring period. I have 6-8 variables (demographics, store size, etc.) that I want to 'test' to see what effect each of the variables has on the sales of the stores.

From that relationship, I want to take the top two or three most significant 'drivers' of sales and build a forecasting model (using averages at this point, but I would experiment with a per-unit factor against the r^2 value).

I also understand that these 'variables' may not affect anything independently and may co-affect sales together, but that's a bit much for this.

The original premise of the project was to see if we could find a relationship between a sales forecasting model for one store concept and translate it to another store concept without having to have someone rebuild us a real model (they are very expensive). So, if model A says 1.3m for a store, we could expect to do 500K in concept B. Getting from A to B isn't hard and we don't need a high level of accuracy (hence the averages), but do I split the group by store size, by number of competitors, etc.? I thought testing the variables would suss that out for me in a more accurate way.



Make sense? It doesn't to me!
 

John152

Well-Known Member
Joined
Aug 11, 2007
Messages
72
Reaction score
0
Okay, first how much stats knowledge do you have? Just curious because that might affect how I explain things.

Let me get it straight first. Your data looks something like this

store    sales     var1    var2    var3    var4
1        number    #       #       #       #
...

correct?

It's still a little difficult to figure out what exactly you're looking for, especially in light of that last paragraph. I think, however, that a simple linear regression should be fine. You might want to consider something like fixed effects to account for the possibility that "store" is having some type of impact on the DV, but that may be beyond what you're looking at doing. So basically, again if I'm parsing this out correctly, just do a linear regression with all the variables in the model. You'll then get some output that includes a beta coefficient, some other things and a p-value. There will also be an r^2 and an adjusted r^2. If you do a multivariate model you'll want to focus on adjusted r^2 since it takes into account any added explanatory power that might come about as a result of just adding in more variables.

You'll also want to look at the p-value. The p-value is the probability of seeing an effect at least as large as the one you observed if there were actually no effect (i.e., if the null hypothesis were true). A p of .05 means that, if there were no real effect, a result this extreme would come up only 5% of the time; it is not literally a 95% probability that your observed effect is a true effect, though it often gets loosely read that way. You can adjust what values you consider significant based on your requirements, such as whether you're more worried about type I or type II error (it seems like you'd be more concerned with type I). Then finally, and maybe most importantly, you have the beta coefficient. In a multivariate model the beta coefficient represents the impact that a one-unit change in the IV has on the DV, while controlling for the effects of the other IVs. So, if you have an IV like "store size" measured in square feet and a DV of "sales" measured in dollars, a beta coefficient of 35 indicates that each additional square foot is associated with a $35 increase in sales (associated, not necessarily caused).

So, that's a basic crash course in regression analysis. A very, very basic crash course. I glossed over a LOT of the nuance and some pretty important points. There's also debate about how to interpret these statistics (not even accounting for the Bayesians!). I also don't know that I'd call a simple OLS regression a forecasting model.
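For anyone who wants to poke at the mechanics behind that crash course, here is a minimal pure-Python OLS sketch (not Excel's built-in tooling; the store data below is invented for illustration) that fits sales on two predictors and reports coefficients, R^2, and adjusted R^2:

```python
# Multivariate OLS via the normal equations X'X b = X'y, solved with
# Gauss-Jordan elimination. All numbers are made up for demonstration.

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(X, y):
    """OLS fit; X rows must include a leading intercept column of 1s."""
    n, k = len(X), len(X[0])
    XtX = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
           for i in range(k)]
    Xty = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    beta = solve(XtX, Xty)
    yhat = [sum(b * x for b, x in zip(beta, row)) for row in X]
    ybar = sum(y) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(y, yhat))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)  # k counts the intercept
    return beta, r2, adj_r2

# columns: intercept, store size (000 sq ft), competitors nearby
X = [[1, 10, 3], [1, 12, 2], [1, 8, 5], [1, 15, 1], [1, 9, 4], [1, 14, 2]]
y = [520, 610, 430, 760, 470, 700]  # sales in $000, made up

beta, r2, adj_r2 = ols(X, y)
print("coefficients:", [round(b, 1) for b in beta])
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```

The coefficient on store size is the "per square foot" effect holding the other predictor fixed, which is exactly the interpretation described above.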
 
Last edited:

stevent

Distinguished Member
Joined
Feb 16, 2010
Messages
9,562
Reaction score
1,445
Want a high r^2? Just keep adding variables and you'll reach one
 

John152

Well-Known Member
Joined
Aug 11, 2007
Messages
72
Reaction score
0

stevent said:
Want a high r^2? Just keep adding variables and you'll reach one

Adjusted r^2 takes into account the adding in of more and more variables.

And while I'm here, why do businesses use Excel for stats work? I've never used it for statistical analysis, but it seems like it would be a major pain in the ass. Why not just learn Stata, R, or (god forbid) SPSS?
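The penalty is visible straight from the adjusted-R^2 formula, 1 - (1 - R^2)(n - 1)/(n - k); the numbers below are purely illustrative:

```python
# Raw R^2 can only rise when a variable is added, but adjusted R^2 falls
# unless the new variable pulls its weight. Illustrative numbers only.

def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k parameters (incl. intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k)

n = 30
before = adjusted_r2(0.800, n, k=3)  # model with 2 predictors
after = adjusted_r2(0.802, n, k=4)   # add a near-useless 3rd predictor

print(f"adjusted R^2 before: {before:.4f}, after: {after:.4f}")
# raw R^2 went up (0.800 -> 0.802) but adjusted R^2 went down
```

The tiny gain in raw fit is outweighed by the (n - 1)/(n - k) penalty for spending another degree of freedom.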
 
Last edited:

Nereis

Distinguished Member
Joined
Feb 12, 2009
Messages
1,373
Reaction score
44

John152 said:
Adjusted r^2 takes into account the adding in of more and more variables. [...]

Because people are cheap and don't want to learn how to use new/better/more powerful tools.

OP, I'm not sure how academically rigorous you want your model to be, because doing what you're doing is known in my circles as 'data mining', an offense subject to intellectual ridicule for the foreseeable future.

For every single new variable you add in, you need some sort of justification from prior evidence to back you up. Moreover, you may encounter collinearity seeing as your data does not seem to be continuous. In that case, you will need to use tetra/polychoric correlation scores if your data is integer based if you do want to test first for correlation before addition into your model.

Moreover, and this is the final nail in the coffin if you want to be 'serious business' about this, any violation of the linear model assumptions in your misspecification tests will result in throwing out your entire model. If your X matrix turns out to be correlated to your error term, tough luck broski. You're going to have to argue that your massive (>1000) sample size results in your estimators approaching consistency in that case, despite being biased.
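One cheap first screen for collinear predictors is a pairwise Pearson correlation (for continuous data only; the tetra-/polychoric scores mentioned above are a separate technique for binary/ordinal data). A sketch with made-up store attributes:

```python
# Pairwise Pearson correlation as a quick collinearity screen:
# two predictors with |r| near 1 carry largely redundant information.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

store_size = [10, 12, 8, 15, 9, 14]       # 000 sq ft, made up
parking_spots = [52, 60, 41, 77, 45, 70]  # tracks size almost linearly

r = pearson(store_size, parking_spots)
print(f"r = {r:.3f}")  # near 1 -> keep only one of the pair
```

A high pairwise r is only a hint; collinearity can also involve three or more variables jointly, which a pairwise screen will miss.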
 

John152

Well-Known Member
Joined
Aug 11, 2007
Messages
72
Reaction score
0

Nereis said:
OP, I'm not sure how academically rigorous you want your model to be, because doing what you're doing is known in my circles as 'data mining', an offense subject to intellectual ridicule for the foreseeable future. [...]

Data mining isn't necessarily a bad thing depending on the context, and what you really mean by "data mining." There is machine-learning type data mining, which attempts to pick up imperceptible patterns in the data, and then there is atheoretical trolling of the data to try to find a result (gotta find those stars in the output). The biggest problem I see with what he's doing is, again depending on what kind of data he is looking at, it probably needs a time-series model and probably even a time-series cross-sectional model if the observations are clustered by store.
 

Nereis

Distinguished Member
Joined
Feb 12, 2009
Messages
1,373
Reaction score
44

John152 said:
Data mining isn't necessarily a bad thing depending on the context, and what you really mean by "data mining." [...]

The data mining I'm talking about is simply adding things into the model to try and make it fit the sample better, with no real regard as to the causal effect of the additional explanatory variable.

In this manner you can 'demonstrate' that the growing size of chickens is linked to global warming.

OP, don't even focus on R squared. It doesn't tell you anything of real importance unless you're choosing between two equally well specified models. If you can rule out misspecification and can provide sensitivity analysis, then you're already ahead of the game so far as corporate use of statistics is concerned.
 

Coburn

Senior Member
Joined
Apr 29, 2009
Messages
627
Reaction score
47

jgold47 said:
I have a table of stores and their respective sales for a measuring period. I have 6-8 variables (demographics, store size, etc.) that I want to 'test' and see what the effect each of the variables has on the sales of the stores. [...]

You want regression. The response variable data points should be the sales for each quarter (Q1, Q2, Q3, etc.). The X variables would be the demographics, store size, etc. You are comparing the sales averaged over several quarters: Y = A*x1 + B*x2 + C*x3 + ..., where Y is the sales and x1 is store size.

For categorical variables such as demographics, you need categorical regression.
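One common way to feed categorical variables into a linear regression is dummy (one-hot) coding, dropping one level as the baseline to avoid perfect collinearity with the intercept. A sketch with invented region labels:

```python
# Dummy-code a categorical column: each level except the baseline gets
# its own 0/1 indicator column.

def dummy_encode(values):
    """One-hot encode a categorical column, dropping the first sorted level."""
    levels = sorted(set(values))
    baseline, rest = levels[0], levels[1:]
    rows = [[1 if v == lvl else 0 for lvl in rest] for v in values]
    return rest, rows

regions = ["urban", "suburban", "rural", "urban", "rural"]
columns, encoded = dummy_encode(regions)

print(columns)  # ['suburban', 'urban']  (baseline: 'rural')
for row in encoded:
    print(row)
```

Each dummy's coefficient is then read as the difference in the DV relative to the baseline level, holding the other predictors fixed.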
 

patrickBOOTH

Stylish Dinosaur
Dubiously Honored
Joined
Oct 16, 2006
Messages
36,170
Reaction score
10,941

John152 said:
[...] So, that's a basic crash course in regression analysis. A very, very basic crash course. I glossed over a LOT of the nuance and some pretty important points. There's also debate about how to interpret these statistics (not even accounting for the Bayesians!). I also don't know that I'd call a simple OLS regression a forecasting model.

This is pretty on point, but there are still many more things to consider, such as getting the data specification right. For example, a lot of these kinds of variables have the common issue of being tied to time. Over time these types of variables just go up, so if you regress all of them together you are going to have insane multicollinearity and an inflated R^2, regardless of using adjusted or not. You have to break the trends by differencing, % change, logs and so on. Even with that you have to test your VARs and so on. It is a much more intense process than I feel you can get through an online forum if you want to do it right.

Also, that last paragraph is important: a model is not a forecast. A model is an explanation of what happened at a point in time, which may or may not be applicable to the future.
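The trend-breaking transformations mentioned above (differencing, % change, logs) look like this in a quick Python sketch with a made-up, steadily trending sales series:

```python
# Three standard de-trending transforms applied to a trending series.
import math

sales = [100, 110, 121, 133, 146]  # made-up sales, trending upward

# first differences: level change per period
diffs = [b - a for a, b in zip(sales, sales[1:])]

# percent change: proportional growth per period
pct_change = [(b - a) / a for a, b in zip(sales, sales[1:])]

# log differences: approximately equal to percent change for small moves
log_diffs = [math.log(b) - math.log(a) for a, b in zip(sales, sales[1:])]

print("differences:", diffs)
print("pct change: ", [round(p, 3) for p in pct_change])
print("log diffs:  ", [round(d, 3) for d in log_diffs])
```

The raw levels all march upward together; the transformed series hover around a constant, which is what lets a regression on them escape the spurious shared-trend fit.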
 
Last edited:
