
Statistics, Data Science, and Data Mining Discussion Thread (Business Intelligence, Analytics, etc)

Discussion in 'Business, Careers & Education' started by amathew, Mar 8, 2014.

  1. otc

    otc Stylish Dinosaur

    Messages:
    16,789
    Likes Received:
    6,318
    Joined:
    Aug 15, 2008
    Sounds about right.

    Our SAS server forces a directory structure on users.

    Client
    -Raw Data (stuff in here gets auto-zipped if not accessed in a certain amount of time)
    -SAS Data (only sas datasets, default is read/write access for everyone)
    ---subfolders here for sub-projects
    ---different sources
    ---personal interim data/etc
    -SAS Programs (Subfolders created for any user who logs into that client with read only access for other users)
    ---My folder
    ----Programs sit here
    ----Optional subfolders (if the project gets long and I want to archive stuff away/keep a specific version/etc.)
    ----Output (where I can freely dump temporary PDF/excel type outputs for viewing/emailing without caring about overwriting anything)
    -Stata Data
    -Stata Programs (similar structure for Stata, but I don't use it so I don't really pay attention)


    Beyond that, organization is left up to the user...I typically create my output folder to dump stuff in. Raw Data doesn't always hold the raw data...we have a staff dedicated to converting data, so sometimes they have the originals (but then usually the SAS datasets can be considered exact transcriptions of the raw data). If data cleaning leads to a new dataset (rather than just a block of code that does the cleaning when reading it in), that will usually end up in a different folder. I often end up with my own folder in the SAS data directory for stuff I am playing with where I might want to write out a permanent dataset.

    The system keeps people well enough organized. It's not perfect (organization within a user's folder might be terrible, and even if I create my "own" data folder, people might end up saving their datasets in there), but when you are talking about people who aren't formally trained programmers, and when there might be 40 different people who have logged in to a client and created a program, it at least means you can tell who did what and have a decent chance of finding someone else's code if you need it.

    Unfortunately, it is not conducive to version control, which I would prefer. By default, the server cleans up files it doesn't recognize (often zipping unused non-sas-program files to save space) and fiddles with permissions, which makes dropping something like a mercurial repository on the server impossible.

    Edit: and I guess I should say, I usually use SAS through Enterprise Guide (although I don't use any of its automated features...just use it to edit code). EG uses project files, so instead of storing a lot of random programs in the directory, I store a project file that contains the code. I might have more than one of these for unrelated tasks that don't share data...and I might make a one-off version when a report goes out that contains only the code needed to produce the numbers in the report (a sort of ghetto version control).
     
    Last edited: Oct 24, 2014

  2. amathew

    amathew Distinguished Member

    Messages:
    1,570
    Likes Received:
    232
    Joined:
    Nov 4, 2011
    Location:
    KS => CO => MN => CA
    ^ Wow, thanks for sharing. Out of curiosity, do most people who use SAS perform tasks using the SAS syntax, or do they use the point-and-click variation? I have a copy of SAS Enterprise on my work computer but I've never really bothered to look into it.
     

  3. otc

    otc Stylish Dinosaur

    Messages:
    16,789
    Likes Received:
    6,318
    Joined:
    Aug 15, 2008
    I don't know anyone in our office who regularly uses the point-and-click stuff in EG.

    I think it is the kind of thing that is not so simple that someone with no knowledge can just use it (like Tableau)...but not powerful enough that anyone who can actually write code would use it. I am also not sure how much data cleaning and manipulation it is capable of...so it is not useful on random outside data (I could see it being useful if you had clean data that was maintained by another department...and you wanted to do a bit of point-and-click analysis on it).

    I've only seen it used or tried to use it myself a couple of times. The one nice thing is that it generates the underlying SAS code. So if I am going to use some graphical or analysis procedure I have never seen before, I could set up the data with code, but then use the point-and-click tool to build up the bones of the procedure. This would tell me the proper syntax, possibly show me options I was not aware of, and structure it in a decent way.
    I don't know that that is much of an improvement over just googling something though...

    edit: And FWIW, I don't think EG is very good...but, when it comes to interacting with a remote SAS server, I think my options are either EG, a slow/laggy/goofy X-forwarded version of interactive SAS, or using the command line to run stuff in batch mode.

    Batch mode is OK (and I use it for huge programs), although I don't have any good program editors that do SAS syntax highlighting. But if I want to be able to do things like scroll through a data set in tabular form, look at intermediate datasets without writing them to a file, or run only select lines of a program...EG or Interactive SAS are my only options. And the x-forwarded version of Unix Interactive SAS is really lacking...so I use EG.
     
    Last edited: Oct 29, 2014

  4. clee1982

    clee1982 Stylish Dinosaur

    Messages:
    11,852
    Likes Received:
    1,539
    Joined:
    Feb 22, 2009
    Location:
    New York City, NY, USA
    To mythikl: is your data structured? If so, better to stick with SQL; if it's unstructured, like data that is all tagging, then maybe a non-relational DB is better for you.
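    To illustrate the structured case: fixed-schema tabular data sits naturally in SQL. A minimal sketch using Python's built-in sqlite3 module, with a made-up toy table (the table name and values are just for illustration):

```python
import sqlite3

# In-memory database; a table with fixed, typed columns is what SQL handles best.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0)],
)
# Aggregation over a fixed schema is a one-liner in SQL.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()
```

    For free-form tag bags with no fixed schema, forcing this kind of column structure is exactly where it starts to hurt, which is where document stores come in.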
     

  5. amathew

    amathew Distinguished Member

    Messages:
    1,570
    Likes Received:
    232
    Joined:
    Nov 4, 2011
    Location:
    KS => CO => MN => CA
    One thing I always "struggle" with is nonlinear regression in which the response variable is continuous. I have a rough process in place which I go through, but I'm always looking for better ways. In general, my 'philosophy' is to avoid relying on manual parameter transformations, so I end up using generalized additive models and/or smoothing techniques like splines. Usually, I use a general linear model as a baseline, then test out several generalized linear models, generalized additive models, multivariate adaptive regression splines, etc. A lot of this is fairly new to me, as most of my professional career (all two years of it) has been spent on classification problems, so dealing with a continuous response variable is a lot more challenging than I thought it would be.

    So...for nonlinear regression in which the response variable is continuous, how do you approach those problems? Any models or smoothing techniques you're partial to?

    EDIT:
    a. Decision trees are also a godsend when dealing with non-linear interactions.
    b. Support vector regression has been blowing my mind recently
    http://www.cvip.uofl.edu/wwwcvip/research/publications/TechReport/SVMRegressionTR.pdf
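    A minimal sketch of the spline route in Python (toy simulated data, and a hand-rolled cubic regression spline via a truncated power basis rather than a full GAM, so this is only an illustration of the idea):

```python
import numpy as np

def spline_basis(x, knots):
    # Truncated power basis for a cubic regression spline:
    # [1, x, x^2, x^3] plus (x - k)_+^3 for each interior knot k.
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.2, 200)       # nonlinear signal, continuous response

knots = np.quantile(x, [0.25, 0.5, 0.75])     # interior knots at the quartiles
X = spline_basis(x, knots)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares on the basis
fitted = X @ beta
```

    The nice part is that no manual transformation of x is chosen by hand: the basis expansion plus OLS lets the fit bend where the data demand it, which is the same trick GAM smoothers automate with penalization.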
     
    Last edited: Dec 6, 2014

  6. DaveDr89

    DaveDr89 Senior Member

    Messages:
    264
    Likes Received:
    0
    Joined:
    Oct 25, 2007
    Great thread.

    I agree with one of the previous posts RE the lack of fundamental statistical training among many folks working on data science problems. This also intersects with the generational aspect of how many people use Wikipedia as a principal source for learning. While it is definitely a good source, it should not be the only source. Along those lines, when I interview folks for statistician positions at my company, I often ask them the following question: "If you could only bring 3 stat books to your next job, what would they be?" Perhaps the question is a bit dated in a digital world, but nevertheless it is surprising how many candidates cannot name 3 stat (or machine learning, etc.) books. For anyone wanting to get into data science, I would recommend a heavy emphasis on statistical training so that one can distinguish oneself from the crowd. Classics like Frank Harrell's Regression Modeling Strategies should be high on the list. Finally, this post on Rob Hyndman's web site is relevant to the thread:

    http://robjhyndman.com/hyndsight/am-i-a-data-scientist/
     

  7. otc

    otc Stylish Dinosaur

    Messages:
    16,789
    Likes Received:
    6,318
    Joined:
    Aug 15, 2008
    What are these book things you are talking about, and why do you need three of them?
     

  8. DaveDr89

    DaveDr89 Senior Member

    Messages:
    264
    Likes Received:
    0
    Joined:
    Oct 25, 2007
    There is no right answer to the book question; it's just to see if they can recall any books. If they provide an eclectic list, then all the better. In any event, here are a few books I'd put on the list:

    Casella & Berger, Statistical Inference
    Harrell, Regression Modelling Strategies
    Wilcox, Fundamentals of Modern Statistical Methods: Substantially Improving Power and Accuracy
    Gelman, Data Analysis Using Regression and Multilevel/Hierarchical Models
    Hastie, Tibshirani, Friedman; Elements of Statistical Learning
    Ruppert, Statistics and Data Analysis for Financial Engineering

    Of course, there are tons of other books that could be on the list. The last book, although tilted toward finance, is a very comprehensive statistics book in its own right and also has lots of R code.
     

  9. amathew

    amathew Distinguished Member

    Messages:
    1,570
    Likes Received:
    232
    Joined:
    Nov 4, 2011
    Location:
    KS => CO => MN => CA
    Other books (newer ones) worth mentioning...

    Categorical Data Analysis - Agresti

    Bayesian Data Analysis - Gelman and others

    Also, John Fox wrote a good book on Generalized Linear Models, but I forgot the name and actually threw it out when I moved (regret it now)

    Then there's Max Kuhn's Applied Predictive Modeling, which is a must-have R/stats book for me as I use the caret package a lot.
     
    Last edited: Dec 27, 2014

  10. fuji

    fuji Distinguished Member

    Messages:
    7,062
    Likes Received:
    1,436
    Joined:
    Sep 5, 2008
    Location:
    London
    



    Statistics book of the gods.


    Did my undergrad in statistics with finance, going to be doing my masters in statistics next year. Focus will be on stochastic calculus, machine learning and time series.


    I agree that a lot of people doing statistics don't really seem to understand the underlying principles of what they're doing and just know how to analyse data with R or something. My undergrad didn't have me using a computer until the final year; it was pretty much just probability and distribution theory, a lot of maths, and some Markov chain stochastic process kind of stuff.
     
    Last edited: Dec 27, 2014

  11. DaveDr89

    DaveDr89 Senior Member

    Messages:
    264
    Likes Received:
    0
    Joined:
    Oct 25, 2007
    Good luck in grad school. The interesting thing about grad programs in stats is that one can obtain completely different training depending on where one goes (probably more so nowadays, as programs broaden their offerings). E.g., RE the books above, if the authors were each to give a short course on statistics, the courses would not have a great deal in common. Speaking of short courses, I would add this one to any short list:

    https://users.soe.ucsc.edu/~draper/eBay-Google-2013.html
     

  12. fuji

    fuji Distinguished Member

    Messages:
    7,062
    Likes Received:
    1,436
    Joined:
    Sep 5, 2008
    Location:
    London
    


    I suppose the same thing applies to undergrad. After reading this thread I googled principal component analysis, and it seems quite important. It's not covered in any undergrad course at my uni, and the only masters course that covers it is a course in analysing social science data. We do have to take tonnes of linear algebra though, so it's a pretty easy-to-understand concept.
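    And the linear algebra really is all there is to it. A minimal sketch of PCA in Python with simulated data (two correlated variables, so nearly all the variance lies along one direction): center, take the covariance matrix, eigendecompose, and project.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
# Two correlated variables: the second is mostly a rescaling of the first.
X = np.column_stack([x1, 2 * x1 + rng.normal(0, 0.5, size=300)])

Xc = X - X.mean(axis=0)                # center each column
C = Xc.T @ Xc / (len(Xc) - 1)          # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)   # eigendecomposition (symmetric matrix)
order = np.argsort(eigvals)[::-1]      # sort components by variance explained
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()    # proportion of variance per component
scores = Xc @ eigvecs                  # data projected onto the components
```

    With data this correlated, the first component soaks up nearly all the variance, which is exactly the dimension-reduction story PCA is usually sold on.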


    Anyone here work in finance? Doing this MSc most likely, and I'd like to know which courses have the most real-life applications.

    http://www.lse.ac.uk/statistics/study/prospective/mscstatistics.aspx
     

  13. amathew

    amathew Distinguished Member

    Messages:
    1,570
    Likes Received:
    232
    Joined:
    Nov 4, 2011
    Location:
    KS => CO => MN => CA
    I'm sure there's an undergrad course that teaches exploratory factor analysis, and for many instances that could be enough. Both EFA and PCA are geared towards a similar "type" of problem, after all.
     

  14. amathew

    amathew Distinguished Member

    Messages:
    1,570
    Likes Received:
    232
    Joined:
    Nov 4, 2011
    Location:
    KS => CO => MN => CA

  15. VinnyMac

    VinnyMac Distinguished Member

    Messages:
    1,868
    Likes Received:
    140
    Joined:
    Sep 15, 2012
    Great thread guys. I just came across it. The type of topics discussed on SF never stop surprising me.

    In response to the above, how do you differentiate between EFA and PCA? I hear people reference PCA as something different from Factor Analysis quite a bit; my understanding is that it's incorrect to do so, but I'm curious to see what you think.

    My understanding is that Factor Analysis (whether Exploratory or Confirmatory) is the general technique. Component Analysis (also PCA) and Common Factor Analysis are two methods of extracting factors for Factor Analysis, not separate techniques.

    Let's compare that to Multiple Regression analysis. MR is the analytical technique. Stepwise Estimation and Forward Addition/Backwards Elimination are model estimation methods (similar to PCA's role in Factor Analysis). No one refers to Stepwise Estimation as its own technique; it's one option that you can use to create a MR model, but people (erroneously) refer to PCA as a separate technique from Factor Analysis, rather than one possible extraction method that can be used for Factor Analysis.
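    As an aside, that stepwise machinery itself is easy to sketch. A toy backward elimination in Python with simulated data (my assumption here is adjusted R^2 as the drop criterion; in practice p-value or AIC thresholds are more common):

```python
import numpy as np

def adj_r2(X, y):
    # Adjusted R^2 for an OLS fit; X already includes the intercept column.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, p = X.shape
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1 - (ss_res / (n - p)) / (ss_tot / (n - 1))

rng = np.random.default_rng(2)
n = 200
X_full = rng.normal(size=(n, 4))
# Only the first two predictors actually matter.
y = 3 * X_full[:, 0] - 2 * X_full[:, 1] + rng.normal(0, 1, n)

keep = list(range(4))
while len(keep) > 1:
    base = np.column_stack([np.ones(n), X_full[:, keep]])
    current = adj_r2(base, y)
    # Try dropping each remaining predictor; accept the best drop if it helps.
    candidates = []
    for j in keep:
        trial = [k for k in keep if k != j]
        Xt = np.column_stack([np.ones(n), X_full[:, trial]])
        candidates.append((adj_r2(Xt, y), j))
    best, drop = max(candidates)
    if best <= current:
        break
    keep.remove(drop)
```

    The point of the analogy stands: this loop is a model-building procedure sitting on top of OLS, not a rival technique to OLS, just as an extraction method sits on top of factor analysis.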

    Exploratory and Confirmatory Factor Analysis are uses of Factor Analysis for certain ends. PCA is an extraction method for Factor Analysis, not a separate technique "geared towards a similar 'type' of problem."

    Thoughts?
     
