
Statistics, Data Science, and Data Mining Discussion Thread (Business Intelligence, Analytics, etc)

post #1 of 30
Thread Starter 

Let's talk statistics, 'big data', and data mining in here.

 

What kind of work do you do? What problems do you work on? What tools do you use? Random thoughts? Book suggestions or blog posts? etc

Whatever, as long as it's related to statistics or statistical computing.

 

I'm currently working on forecasting leads and sales for an automobile manufacturer, and also on applying association rule algorithms to clickstream data to identify common trends in consumer browsing behavior on a website. Besides that, I do a lot of natural language processing of survey verbatims, both to classify them and to extract common themes within each class.

 
By and large, I use R, MySQL, and Python for all my analysis. Occasionally, I'll use Tableau for creating visualizations. In my old job, I had some Hadoop and NoSQL exposure, but I'm now working with much smaller data sets (3 to 5 GB data files). I'd much rather work with 'small data' than 'big data.'
 

Blog posts I'm enjoying:

http://prdeepakbabu.wordpress.com/2010/02/24/association-rule-mining/

http://blog.revolutionanalytics.com/2014/03/r-and-hidden-markov-models.html


Edited by amathew - 3/12/14 at 1:14pm
post #2 of 30
I work in marketing and have an old stats textbook on my desk. I tried to look up something Friday but was not able to find it because I forgot the name.

Basically, it's a way to analyze a queue. I remember in college this was tested by telling you that there were 4 ticket counters. Each counter could process X number of people over a given time. Then you figure out how many ticket counters are needed for a given amount of people.

The teacher said they use this for stuff like scheduling workers for checkout lines at grocery stores to handle peak hours and such.

If anyone could just tell me the name of what can be used for this, I'll look it up in my textbook Monday.
post #3 of 30
Thread Starter 
Quote:
Originally Posted by Reggs

I work in marketing and have an old stats textbook on my desk. I tried to look up something Friday but was not able to find it because I forgot the name.

Basically, it's a way to analyze a queue. I remember in college this was tested by telling you that there were 4 ticket counters. Each counter could process X number of people over a given time. Then you figure out how many ticket counters are needed for a given amount of people.

The teacher said they use this for stuff like scheduling workers for checkout lines at grocery stores to handle peak hours and such.

If anyone could just tell me the name of what can be used for this, I'll look it up in my textbook Monday.

 

Simulation

 

In the context of checkout lines, let's say that you wanted to know how long it should take you to get 'served.' If you took the number of available tellers and the average session length for a teller, you could come up with an average wait time per person. In reality, the number of available tellers is probably small, so simulation can be used to run the calculation many times, say 1,000, and grab the average session length from each iteration.
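A minimal sketch of that kind of simulation in Python, not a definitive model: it assumes customers arrive at random (exponential gaps between arrivals), service times are also exponential, and everyone waits in a single first-come, first-served line. The function names, the rates, and the wait-time target are all hypothetical.

```python
import heapq
import random


def simulate_average_wait(num_counters, arrival_rate, service_rate,
                          num_customers=1000, seed=None):
    """Simulate one run and return the mean time customers spend waiting.

    Assumption: exponential inter-arrival and service times, with rates
    given per unit of time, and one shared first-come, first-served line.
    """
    rng = random.Random(seed)
    # Each heap entry is the time at which a counter next becomes free.
    free_at = [0.0] * num_counters
    heapq.heapify(free_at)

    clock = 0.0
    total_wait = 0.0
    for _ in range(num_customers):
        clock += rng.expovariate(arrival_rate)   # time of the next arrival
        earliest_free = heapq.heappop(free_at)   # soonest available counter
        start = max(clock, earliest_free)        # wait only if all are busy
        total_wait += start - clock
        heapq.heappush(free_at, start + rng.expovariate(service_rate))
    return total_wait / num_customers


def counters_needed(arrival_rate, service_rate, max_wait,
                    runs=1000, max_counters=50):
    """Smallest number of counters whose average simulated wait <= max_wait."""
    for counters in range(1, max_counters + 1):
        waits = [
            simulate_average_wait(counters, arrival_rate, service_rate, seed=run)
            for run in range(runs)
        ]
        if sum(waits) / runs <= max_wait:
            return counters
    return None  # no counter count up to max_counters met the target
```

For example, `counters_needed(arrival_rate=3.0, service_rate=1.0, max_wait=1.0)` asks how many counters keep the average wait under one time unit when three people arrive per unit of time and each counter serves one person per unit of time. For peak hours you would rerun it with the peak arrival rate.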

post #4 of 30
I wish I used my statistics degree more post-college... seems like a waste taking all those classes and getting the 2nd degree.

What are the job prospects like nowadays for a statistics major? Is an MS/PhD still required? I worked in the statistics department for a bit and most grad students were PhD candidates, but I only witnessed two people leave and work in industry. Seems like the majority want to get into the teaching track/associate professor positions.
post #5 of 30
Thread Starter 
Quote:
Originally Posted by gettoasty

I wish I used my statistics degree more post-college... seems like a waste taking all those classes and getting the 2nd degree.

What are the job prospects like nowadays for a statistics major? Is an MS/PhD still required? I worked in the statistics department for a bit and most grad students were PhD candidates, but I only witnessed two people leave and work in industry. Seems like the majority want to get into the teaching track/associate professor positions.

 

1. It depends on the job. The more emphasis the position places on advanced regression analysis, classification models, or machine learning, the more likely it is that an MS/PhD will be necessary. So unfortunately the positions where people are working on fun problems do require more education. Of course, if you just want to be an analyst at a random company doing hypothesis testing and linear regression, those jobs are certainly there, but there's also plenty of competition from non-stats majors in everything from economics to physics to social sciences.

 

2. Job prospects are pretty good. With that said, the 'big thing' right now is big data, so what companies want is both statistical knowledge and expertise in programming with big data technologies like Hadoop, NoSQL, etc. Most of the data scientists I've met who work on big data have had backgrounds in computer engineering or physics, so statisticians aren't necessarily benefiting from the increased demand for statistical knowledge.

 

3. Demand for statisticians varies by industry. A lot of people end up working in finance or at tech companies. However, you're seeing more people go to marketing companies and ad agencies, as those companies want to make smarter data-driven decisions. I started my career at a tech company and now I'm at a digital ad agency; best move ever. So many incredibly challenging problems (attribution modeling, click path models, etc.), though there are issues as well, namely poor data warehousing methods.

 

4. Statistical software is an important variable with regard to job market prospects. Given that R and Python have emerged as the most 'in demand' tools in most industries (minus big pharma, which still uses SAS) over the past decade, people with those skills are wanted. When I send out my resume, the callbacks usually involve some mention of my experience with classification models and my knowledge of R, MySQL, and Python. Knowing those three is important in today's job market. A big reason why I was hired at my current position was my knowledge of R, so don't overlook the fact that technology and statistical software are very important. Of course, ten years from now, R may be replaced with Julia and Python could be replaced by Clojure as the 'in demand' technologies to know.


Edited by amathew - 3/8/14 at 6:48pm
post #6 of 30
FYI, the Coursera Data Mining course is starting up again:
https://www.coursera.org/course/ml

It says it started on March 3rd, but that's not really true...the first videos have been up since then, but the real first week technically starts today (with the first review quiz due on Sunday). Also, the first week is pretty basic, and if you bothered to click on this thread, it is probably just review and definition of terminology.
post #7 of 30
Thread Starter 

There's also an 'Exploratory Data Analysis' course being offered by Udacity. It's good for beginners and those looking to learn R.

 

https://www.udacity.com/course/ud651

 

 

 

I just hope the stats and data science job market doesn't get flooded with a bunch of people who took part in a few webinars and now think they're experts in statistical modeling, Bayesian stats, etc. There are already too many of those hacks.


Edited by amathew - 3/24/14 at 10:22am
post #8 of 30

The vast majority of those webinar graduates could not interpret a regression output to save their lives, so you need not worry.

 

Speaking of webinars, the Stanford class on convex optimization was very interesting and useful. Great lectures, too.

post #9 of 30
I am sure a lot of people are jumping in from communication and signal processing...
post #10 of 30
I graduated last summer with a BBA and decided to take another 4 classes in order to get a BI certificate. I got an introduction to SQL, database structures, and a little bit of data mining, and also worked with Tableau. I just got a job in the BI department of a large corporation and I feel like I don't know anything! I'm looking to find some helpful MOOCs on relevant statistics and an intro to R, as it seems to be a requirement.
post #11 of 30
Thread Starter 
Quote:
Originally Posted by mrscrouge

I graduated last summer with a BBA and decided to take another 4 classes in order to get a BI certificate. I got an introduction to SQL, database structures, and a little bit of data mining, and also worked with Tableau. I just got a job in the BI department of a large corporation and I feel like I don't know anything! I'm looking to find some helpful MOOCs on relevant statistics and an intro to R, as it seems to be a requirement.

 

- In every professional job I've had (only two), the initial month has involved me feeling like I don't know anything. So is it really just 'new job jitters', or do you really feel that there are deficiencies in your understanding of how to examine and analyze data?

 

- A BI department that requires knowledge of R seems odd. BI is much more about reporting, so BI tools like Tableau should be more important. R has a set of great visualization tools, but I'd choose Tableau over it if the purpose is presenting pretty graphics to business people.

 

- For a basic intro to statistics, try the following book:

Statistics in Plain English 

 

- To learn about hypothesis testing and regression analysis, try the following book:

Data Analysis Using Regression and Multilevel/Hierarchical Models by Gelman and Hill

 

- To learn the basics of R, start with the Intro to R pdf made available by CRAN.

http://cran.r-project.org/doc/manuals/R-intro.pdf

 

- After reading the R manual, read the following.

Using R for Introductory Statistics by Verzani

R Cookbook 

 

- Also, start looking at R questions on Stack Overflow and R bloggers

http://stackoverflow.com/questions/tagged/r

http://www.r-bloggers.com/

 

If you're interested, I might even be able to do some one on one tutoring on R. I've done that in the past and can put together some introductory code and tricks/tasks that I commonly use. I also have a meeting at work on Tuesday where I'll be doing an informal presentation to our interns on how I use R at work. Could also share some of that information with you. I'd have a small fee but if you needed something like that, let me know.

post #12 of 30
Thread Starter 

My job has turned into doing a lot of sentiment analysis of social media postings. It's a lot of cluster analysis (k-means) and information retrieval. At the end of the day, I feel that sentiment analysis of social media data is bullshit, or at least unnecessarily hyped, so I'm not sold on any of it.


Edited by amathew - 5/18/14 at 8:25pm
post #13 of 30
I'm at a new company and need to work with data, but the customer information is all out of order. I have 1K+ "active" customers; 377 have all their cells filled in, and even with those 377 customers it's all over the place. The customers in England are listed as UK, the UK, United Kingdom, the united kingdom. I can get that sorted out, but that's such a small % of the list.

Most entries have an address field that has all the information dumped into it, but nothing in the city or country cell. It's just a mess. I need it to be in order, but I don't have time to deal with it. Anyone know any services that take care of stuff like this?
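For the country-name variants listed above, a small lookup table is probably enough. A rough sketch in Python; the alias table and function name are hypothetical and only cover the four spellings mentioned:

```python
# Hypothetical alias table: lower-cased variants mapped to one canonical name.
COUNTRY_ALIASES = {
    "uk": "United Kingdom",
    "united kingdom": "United Kingdom",
}


def normalize_country(raw):
    """Return a canonical country name, or the trimmed input if unknown."""
    key = raw.strip().lower()
    if key.startswith("the "):
        key = key[len("the "):]   # treat "the UK" the same as "UK"
    return COUNTRY_ALIASES.get(key, raw.strip())


variants = ["UK", "the UK", "United Kingdom", "the united kingdom"]
cleaned = [normalize_country(v) for v in variants]
```

Anything not in the table comes back unchanged, so unknown spellings are easy to spot and add later.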
post #14 of 30
I'd try dumping it through a geocoder.

With a limited number of addresses, the free API to google should work:

https://developers.google.com/maps/documentation/geocoding/

You don't actually care about the latitude and longitude, but it has the nice side effect of returning an address broken up into its component parts.

Just concatenate everything into one string like 123 Street St, City, Country, Zip, Whatever and Google should be able to figure it out.
I'd probably use Python with the requests and json libraries...but there are about a billion ways to do this (and it could be done straight from R or SAS too)
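A rough sketch of that approach in Python. It uses the standard library's urllib in place of requests so it runs without extra installs. The endpoint and the `address_components` / `long_name` / `types` field names follow the Geocoding API's JSON response format as I understand it; the API key parameter and the function names are assumptions, so check the documentation linked above for current key and quota requirements.

```python
import json
import urllib.parse
import urllib.request

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def split_components(result):
    """Flatten one geocoder result into {component type: long name}."""
    parts = {}
    for component in result.get("address_components", []):
        for kind in component.get("types", []):
            # Keep the first component seen for each type.
            parts.setdefault(kind, component.get("long_name"))
    return parts


def geocode(raw_address, api_key):
    """Send one free-form address string and return its component parts.

    Returns None when the service reports no match or an error status.
    """
    query = urllib.parse.urlencode({"address": raw_address, "key": api_key})
    with urllib.request.urlopen(f"{GEOCODE_URL}?{query}") as response:
        payload = json.load(response)
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    return split_components(payload["results"][0])
```

With a valid key, `geocode("123 Street St, City, Country", api_key)` should return a dict whose `locality` and `country` entries can be written back into the empty city and country cells.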
Edited by otc - 5/20/14 at 8:05am
post #15 of 30
What kind of database do you guys use if I want to interact with R or Python? I am thinking of a personal project for fun. The database doesn't have to be relational. What would be the most intuitive way to load data, parse it, and do computation on the fly? Query speed just has to be OK; I value flexibility and the ability to calculate and manipulate data on the fly more than anything else. Oh, and the data is not necessarily static in the sense that I upload once and am done with it, but updates to the data will be relatively infrequent.