Originally Posted by Pink Socks
@crdb - all very interesting stuff from what I can follow. Any recommendations for a real beginner to look at just to get an overview and understanding - website, book, video. Just enough so I can keep up with this thread? (Seriously, although I feel like I should add a menswear comment too - what is appropriate attire for analysing big data and computer programming - Cucinelli cashmere hoodie
I think the two most important things are:
- understanding the relational model, and applying it in the best open source database available today, PostgreSQL (the most relational of the lot by far thanks to decades of academic research in Ingres and elsewhere);
- understanding statistical learning and if you have time, statistics itself.
The first allows you to reason about data declaratively - that is, you state what you want without specifying how it is computed. It's actually incredibly easy conceptually; if you understand Venn diagrams and logic, you can write correct, relational SQL. Which is why I am mystified that most CS courses teach it from a flawed POV that dates from before Codd's seminal paper (https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf).
So, you take your input data, and you literally "declare" what you want and voila, data cleaned and result obtained, provably correctly. Understand the concept of a transaction, of a logical unit of work, of a relation variable vs a relation (a variable vs values, if you will), domains, types, etc. and you're good to go.
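To make "declaring what you want" a bit more concrete, here is a minimal sketch. It uses Python's built-in sqlite3 purely so that it runs anywhere, not PostgreSQL, and the tables and member names are made up for illustration. The two queries are just the overlap and the left-only part of a Venn diagram, and the last bit shows a transaction as a logical unit of work: either both inserts happen or neither does.

```python
import sqlite3

# Hypothetical data: which members posted in two made-up threads.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posted_in_suits (member TEXT PRIMARY KEY);
    CREATE TABLE posted_in_shoes (member TEXT PRIMARY KEY);
    INSERT INTO posted_in_suits VALUES ('alice'), ('bob'), ('carol');
    INSERT INTO posted_in_shoes VALUES ('bob'), ('carol'), ('dave');
""")

# The overlap of the two circles: members who posted in both threads.
# We say what we want, not how to loop over rows to find it.
both = conn.execute("""
    SELECT member FROM posted_in_suits
    INTERSECT
    SELECT member FROM posted_in_shoes
    ORDER BY member
""").fetchall()

# The left circle minus the overlap: members who posted only about suits.
only_suits = conn.execute("""
    SELECT member FROM posted_in_suits
    EXCEPT
    SELECT member FROM posted_in_shoes
    ORDER BY member
""").fetchall()

# A transaction as a logical unit of work: the second insert violates the
# primary key, so the first insert is rolled back along with it.
try:
    with conn:  # commits on success, rolls back on an exception
        conn.execute("INSERT INTO posted_in_suits VALUES ('erin')")
        conn.execute("INSERT INTO posted_in_suits VALUES ('alice')")
except sqlite3.IntegrityError:
    pass

count = conn.execute("SELECT COUNT(*) FROM posted_in_suits").fetchone()[0]
print(both, only_suits, count)
```

After the failed transaction the table still holds the original three members, which is the whole point: you never see a half-applied unit of work.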
The second is about making sense of the data. Curve fitting, basically. Your brain does it every day with everything, processing GB of data per second from all your senses. Reading SF, you're looking at posts about clothes from dressers of varying ability; first you "learn" which ones are good, then you "learn" why what they are doing is good, and voila, you have learnt about fashion by abstracting from examples (which are your data). There might be obvious patterns (the X vs )( quarters, the jacket ending halfway down your silhouette, the famed "Northern Lights"), constraints (no open lacing with a suit) and non-obvious ones ("which skin type works best with which shirt pattern and colour palette").
Statistical learning is the formalisation of this. You have data, you fit a model to it (by minimising the error between the model and the data, usually) and you derive some kind of use from it (in the SF example: you learn to dress "better" although really you learn to buy expensive clothes that very few people will understand beyond "he looks nice"). You can use these models for intuition (e.g. aforementioned "obvious patterns" that "explain"; or the revenue equation mentioned before) or for prediction (try a bunch of new shirt and jacket patterns together and "feel" that they are wrong or right, i.e. the amount of "error" in what you just did vs what you think looks good, which you could call taste).
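As a rough illustration of "fit a model by minimising the error", here is a sketch of the simplest case I can think of: a straight line fitted by ordinary least squares, in plain Python with no libraries. The numbers and function names are invented for the example; real work would use R or a proper library, but the idea is the same.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimising the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares solution for a single predictor.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def squared_error(xs, ys, slope, intercept):
    """Total squared distance between the line and the data."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Made-up data that roughly follows y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
best = squared_error(xs, ys, slope, intercept)

# Use for intuition: the slope "explains" how y moves with x.
# Use for prediction: plug in an x the model has not seen.
prediction = slope * 6.0 + intercept
print(slope, intercept, best, prediction)
```

Nudging the slope or intercept away from the fitted values can only increase `best`, which is what "minimising the error" means here.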
And there we talk about the separation between model and implementation. The model is only concerned with how things are, defining your input and output, at a conceptual level. Implementation is about how you make it happen. This is a very important distinction. A constraint on an SQL column is a model consideration: this column can only take these values, how you implement it is not important but it has to happen. An index is an implementation consideration (although something like CREATE UNIQUE INDEX WHERE [logical statement]; in SQL straddles the two - it's an implementation trick used to implement a model-level constraint). From that point of view, statistical learning is about the model, and Spark/Hadoop/Redshift (yes you can)/R/whatever is about implementation (at different levels).
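Here is a small sketch of the constraint vs index distinction. Again it uses Python's built-in sqlite3 as a stand-in rather than PostgreSQL (SQLite also supports partial unique indexes), and the table, index name and rule ("at most one active subscription per member") are hypothetical. The CHECK is a pure model statement; the partial unique index is the implementation trick that enforces a model-level rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subscription (
        member TEXT NOT NULL,
        -- Model: this column can only take these values.
        status TEXT NOT NULL CHECK (status IN ('active', 'cancelled'))
    );
    -- Implementation trick for a model rule: uniqueness is enforced
    -- only over the rows that satisfy the WHERE clause.
    CREATE UNIQUE INDEX one_active_per_member
        ON subscription (member) WHERE status = 'active';
""")


def try_insert(member, status):
    """Return True if the row was accepted, False if a constraint refused it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO subscription VALUES (?, ?)", (member, status)
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    try_insert("alice", "cancelled"),  # accepted
    try_insert("alice", "cancelled"),  # accepted: index ignores cancelled rows
    try_insert("alice", "active"),     # accepted: first active row for alice
    try_insert("alice", "active"),     # refused by the partial unique index
    try_insert("bob", "paused"),       # refused by the CHECK constraint
]
print(results)
```

Whether the database enforces the rule with an index, a trigger or something else is not the model's concern; the model only says the rule must hold.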
I used to try and learn from MOOCs but in my experience you just pick up patterns of behaviour that you can then apply in a job without really understanding the fundamentals. That used to cut it in 2008, not so much today. For the same number of hours, read the textbooks and understand what they say, then be able to abstract from that to new situations, patterns and models, and you're a much better thinker for it. A CM equivalent might be the difference between understanding why a wool tie does not go with the finest worsted suit, or understanding why things work at different levels of formality, vs parroting "no brown in town".
And so I repeat my recommendations: ISLR for statistical learning (free on http://www-bcf.usc.edu/~gareth/ISL/ - although you can bump up to ESL, The Elements of Statistical Learning, if you feel comfortable with linear algebra) and Codd or Date for the relational model as per above post. Date is I think a bit more readable. They disagree on a few issues. Total reading time 20-50 hours depending on how comfortable you want to get with the material.