
Statistics, Data Science, and Data Mining Discussion Thread (Business Intelligence, Analytics, etc)

amathew

Distinguished Member
Joined
Nov 4, 2011
Messages
1,501
Reaction score
228
Let's talk statistics, 'big data', and data mining in here.

What kind of work do you do? What problems do you work on? What tools do you use? Random thoughts? Book suggestions or blog posts? etc
Whatever, as long as it's related to statistics or statistical computing.

I'm currently working on forecasting leads and sales for an automobile manufacturer, and also on applying association rule algorithms to clickstream data to identify common trends in consumer browsing behavior on a website. Besides that, I do a lot of natural language processing of survey verbatims for the purpose of classification and extracting common themes from those classifications.

By and large, I use R, MySQL, and Python for all my analysis. Occasionally, I'll use Tableau for creating visualizations. In my old job, I had some Hadoop and NoSQL exposure, but I'm now working with much smaller data sets (3 to 5 GB data files). I'd much rather work with 'small data' than 'big data.'

Blog posts I'm enjoying:
http://prdeepakbabu.wordpress.com/2010/02/24/association-rule-mining/
http://blog.revolutionanalytics.com/2014/03/r-and-hidden-markov-models.html
 

Reggs

Distinguished Member
Joined
Mar 11, 2006
Messages
6,219
Reaction score
698
I work in marketing and have an old stats textbook on my desk. I tried to look up something Friday but was not able to find it because I forgot the name.

Basically, it's a way to analyze a queue. I remember in college this was tested by telling you that there were 4 ticket counters. Each counter could process X number of people over a given time. Then you figure out how many ticket counters are needed for a given number of people.

The teacher said they use this for stuff like scheduling workers for checkout lines at grocery stores to handle peak hours and such.

If anyone could just tell me the name of what can be used for this, I'll look it up in my textbook Monday.
 

amathew

Distinguished Member
Joined
Nov 4, 2011
Messages
1,501
Reaction score
228
Simulation

In the context of checkout lines, let's say that you wanted to know how long it should take you to get 'served.' If you take the number of available tellers and the average session length for an average teller, you can come up with an average wait time per person. In reality, the number of available tellers is probably small, so simulation can be used to run the calculations numerous times, say 1,000, and grab the average session length from each iteration.
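
For looking it up in a textbook, this kind of problem is usually indexed under queueing (or queuing) theory. As a rough illustration of the simulation idea, here is a minimal Python sketch. It assumes customers arrive at random with exponentially distributed gaps, that service times are also exponential, and that everyone waits in one shared line; the function names and the rates are made up for the example, not taken from any library or real data.

```python
import heapq
import random


def simulate_average_wait(num_counters, arrival_rate, service_rate,
                          num_customers=1000, seed=0):
    """Average time a customer waits in line before reaching a counter.

    Assumption: exponential gaps between arrivals and exponential
    service times; both rates are expressed per unit of time.
    """
    rng = random.Random(seed)
    # Min-heap holding the time at which each counter next becomes free.
    free_at = [0.0] * num_counters
    heapq.heapify(free_at)
    clock = 0.0
    total_wait = 0.0
    for _ in range(num_customers):
        clock += rng.expovariate(arrival_rate)   # time of the next arrival
        earliest_free = heapq.heappop(free_at)   # first counter to open up
        start = max(clock, earliest_free)        # wait only if all are busy
        total_wait += start - clock
        # The counter stays busy until this customer's service ends.
        heapq.heappush(free_at, start + rng.expovariate(service_rate))
    return total_wait / num_customers


def counters_needed(arrival_rate, service_rate, max_wait, max_counters=50):
    """Smallest counter count whose simulated average wait is <= max_wait."""
    for counters in range(1, max_counters + 1):
        wait = simulate_average_wait(counters, arrival_rate, service_rate)
        if wait <= max_wait:
            return counters
    return None  # no count up to max_counters met the target
```

For example, `counters_needed(2.0, 1.0, 0.5)` asks how many counters keep the average wait under half a time unit when two people arrive per unit of time and each counter serves one person per unit of time. Changing `seed` and averaging several runs gives a steadier estimate.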
 

gettoasty

Stylish Dinosaur
Joined
Feb 8, 2010
Messages
16,199
Reaction score
10,429
I wish I used my statistics degree more post-college... seems like a waste taking all those classes and getting the 2nd degree.

What are the job prospects like nowadays for a statistics major? Is an MS/PhD still required? I worked in the statistics department for a bit and most grad students were PhD candidates, but I only witnessed two people leave and work in industry. Seems like the majority want to get into the teaching track/associate professor positions.
 

amathew

Distinguished Member
Joined
Nov 4, 2011
Messages
1,501
Reaction score
228
1. It depends on the job. The more emphasis the position places on advanced regression analysis, classification models, or machine learning, the more likely an MS/PhD is going to be necessary. So unfortunately the positions where people are working on fun problems do require more education. Of course, if you just want to be an analyst at a random company doing hypothesis testing and linear regression, those jobs are certainly there, but there's also plenty of competition from non-stats majors in everything from economics to physics to the social sciences.

2. Job prospects are pretty good. With that said, the 'big thing' right now is big data so what companies want is both the statistical knowledge along with expertise in programming with big data technologies like Hadoop, NoSQL, etc. Most of the data scientists I've met that work on big data have had backgrounds in computer engineering or physics, so statisticians aren't necessarily benefiting from the increased demand for statistical knowledge.

3. Demand for statisticians varies by industry. A lot of people end up working in finance or at tech companies. However, you're seeing more people go to marketing companies and ad agencies as they want to make smarter data-driven decisions. I started my career at a tech company and now I'm at a digital ad agency; best move ever. So many incredibly challenging problems (attribution modeling, click path models, etc.), though there are issues as well, namely poor data warehousing methods.

4. Statistical software is an important variable with regard to job market prospects. Given that R and Python have emerged as the most 'in demand' tools in most industries (minus big pharma, which still uses SAS) over the past decades, people with those skills are wanted. When I send out my resume, the callbacks usually involve some mention of my experience with classification models and my knowledge of R, MySQL, and Python. Knowing those three is important in today's job market. A big reason I was hired at my current position was my knowledge of R, so don't overlook the fact that technology and statistical software are very important. Of course, ten years from now, R may be replaced with Julia and Python could be replaced by Clojure as the 'in demand' technologies to know.
 

otc

Stylish Dinosaur
Joined
Aug 15, 2008
Messages
24,529
Reaction score
19,184
FYI, the Coursera Data Mining course is starting up again:
https://www.coursera.org/course/ml

It says it started on March 3rd, but that's not really true... the first videos have been up since then, but the real first week technically starts today (with the first review quiz due on Sunday). Also, the first week is pretty basic, and if you bothered to click on this thread, it is probably just review and definition of terminology.
 

amathew

Distinguished Member
Joined
Nov 4, 2011
Messages
1,501
Reaction score
228
There's also an 'Exploratory Data Analysis' course being offered by Udacity. It's good for beginners and those looking to learn R.

https://www.udacity.com/course/ud651



I just hope the stats and data science job market doesn't get flooded with a bunch of people who took part in a few webinars and now think they're experts in statistical modeling, Bayesian stats, etc. There are already too many of those hacks.
 

capnMURPHY2021

Senior Member
Joined
Jan 27, 2014
Messages
180
Reaction score
56
The vast majority of those webinar graduates could not interpret a regression output to save their lives, so you need not worry.

Speaking of webinars, the Stanford class on convex optimization was very interesting and useful. Great lectures, too.
 

clee1982

Stylish Dinosaur
Joined
Feb 22, 2009
Messages
28,968
Reaction score
24,803
I am sure a lot of people are jumping in from communication and signal processing...
 

mrscrouge

Active Member
Joined
Jan 29, 2011
Messages
25
Reaction score
0
I graduated last summer with a BBA and decided to take another 4 classes in order to get a BI certificate. Got an introduction to SQL, database structures, a little bit of data mining, and also worked with Tableau. I just got a job in the BI department of a large corporation and I feel like I don't know anything! I'm looking to find some helpful MOOCs on relevant statistics and intro to R, as it seems to be a requirement.
 

amathew

Distinguished Member
Joined
Nov 4, 2011
Messages
1,501
Reaction score
228
- In every professional job I've had (only two), the initial month has involved me feeling like I don't know anything. So is it really just 'new job jitters,' or do you really feel that there are deficiencies in your understanding of how to examine and analyze data?

- A BI department that requires knowledge of R? That seems odd. BI is much more about reporting, so BI tools like Tableau should be more important. R has a set of great visualization tools, but I'd choose Tableau over it if the purpose is presenting pretty graphics to business people.

- For a basic intro to statistics, try the following book:
Statistics in Plain English

- To learn about hypothesis testing and regression analysis, try the following book:
Data Analysis Using Regression and Multilevel/Hierarchical Models by Gelman and Hill

- To learn the basics of R, start with the Intro to R pdf made available by CRAN.
http://cran.r-project.org/doc/manuals/R-intro.pdf

- After reading the R manual, read the following.
Using R for Introductory Statistics by Verzani
R Cookbook

- Also, start looking at R questions on Stack Overflow and R bloggers
http://stackoverflow.com/questions/tagged/r
http://www.r-bloggers.com/

If you're interested, I might even be able to do some one-on-one tutoring on R. I've done that in the past and can put together some introductory code and tricks/tasks that I commonly use. I also have a meeting at work on Tuesday where I'll be doing an informal presentation to our interns on how I use R at work, and I could share some of that information with you. I'd charge a small fee, but if you need something like that, let me know.
 

amathew

Distinguished Member
Joined
Nov 4, 2011
Messages
1,501
Reaction score
228
My job has turned into doing a lot of sentiment analysis of social media postings. It's a lot of cluster analysis (k-means) and information retrieval. At the end of the day, I feel that sentiment analysis of social media data is bullshit, or at least unnecessarily hyped, so I'm not sold on any of it.
 

Reggs

Distinguished Member
Joined
Mar 11, 2006
Messages
6,219
Reaction score
698
I'm at a new company and need to work with data, but the customer information is all out of order. I have 1K+ "active" customers; only 377 have all their cells filled in, and even with those 377 customers it's all over the place. The customers in England are listed as UK, the UK, United Kingdom, the united kingdom. I can get that sorted out, but that's such a small % of the list.

Most entries have an address field that has all the information dumped into it, but nothing in the city or country cell. It's just a mess. I need it to be in order, but I don't have time to deal with it. Anyone know any services that take care of stuff like this?
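
For the country-name part specifically (UK, the UK, United Kingdom, the united kingdom), a lookup table is usually enough. This is only a sketch: the alias table holds just the four spellings listed above, and `COUNTRY_ALIASES` and `normalize_country` are names made up for the example.

```python
# Known spellings, keyed in lowercase, mapped to one canonical form.
# Only the variants mentioned in the post are listed; extend as needed.
COUNTRY_ALIASES = {
    "uk": "United Kingdom",
    "the uk": "United Kingdom",
    "united kingdom": "United Kingdom",
    "the united kingdom": "United Kingdom",
}


def normalize_country(raw):
    """Return the canonical spelling for a known variant, else the cleaned input."""
    cleaned = " ".join(raw.split())  # trim and collapse stray whitespace
    return COUNTRY_ALIASES.get(cleaned.lower(), cleaned)
```

Anything not in the table passes through unchanged (apart from whitespace cleanup), so unknown spellings stay visible and can be added to the table later.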
 

otc

Stylish Dinosaur
Joined
Aug 15, 2008
Messages
24,529
Reaction score
19,184
I'd try dumping it through a geocoder.

With a limited number of addresses, Google's free API should work:

https://developers.google.com/maps/documentation/geocoding/

You don't actually care about the latitude and longitude, but it has the nice side effect of returning an address broken up into its component parts.

Just concatenate everything into one string like 123 Street St, City, Country, Zip, Whatever and Google should be able to figure it out.
I'd probably use Python with the requests and json libraries... but there are about a billion ways to do this (and it could be done straight from R or SAS too).
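
A rough sketch of the geocoder approach in Python. It assumes the JSON endpoint and the `address_components` field described in Google's Geocoding API documentation, and that you have an API key. It uses the standard library's urllib in place of requests so that nothing needs installing. `split_components`, `geocode`, and the `api_key` argument are illustrative names, not part of any library.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint, per Google's Geocoding API documentation.
GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def split_components(geocode_response):
    """Map the first result's address components to {type: long_name}."""
    results = geocode_response.get("results", [])
    if not results:
        return {}  # nothing matched, or the request failed
    parts = {}
    for component in results[0].get("address_components", []):
        for kind in component.get("types", []):
            # Keep the first component seen for each type.
            parts.setdefault(kind, component.get("long_name"))
    return parts


def geocode(address, api_key):
    """Send one free-form address string to the geocoder and split the reply."""
    query = urllib.parse.urlencode({"address": address, "key": api_key})
    with urllib.request.urlopen(f"{GEOCODE_URL}?{query}") as reply:
        return split_components(json.load(reply))
```

Calling `geocode("123 Street St, City, Country, Zip", api_key)` should then give back a dictionary with keys such as `country` and `postal_code` when the geocoder recognizes the address, which can be written straight into the empty city and country cells.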
 

clee1982

Stylish Dinosaur
Joined
Feb 22, 2009
Messages
28,968
Reaction score
24,803
What kind of database do you guys use if I want to interact with R or Python? I am thinking of some personal project for fun, and the database doesn't have to be relational. What would be the most intuitive way to load data, parse it, and do computation on the fly? Query speed just has to be OK; I value flexibility and the ability to calculate and manipulate data on the fly more than anything else. Oh, and the data is not necessarily static in the sense that I upload once and am done with it, but updates to the data will be relatively infrequent.
 
