Peter Jamieson, a former student of mine who is now a data scientist at Pixability, a Boston-based video ad analytics startup, shared the following suggestions for getting started with data science:
Here's a list of some of the resources that have been helpful for me as I've gotten up to speed in data science.
My go-to tool for working with data these days is Python. Installing everything you need to get started can be tough. Fortunately, Anaconda and Enthought both offer free distributions that are nearly plug-and-play.
IPython is a tool, included in those downloads, that (among other features) lets you edit and run snippets of code in your browser. Becoming comfortable with launching and using IPython notebooks makes it easier to follow a number of online courses and to share code.
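Once a distribution is installed, a quick check like the one below can confirm that the usual data science packages are available. This is only a minimal sketch: the package list is my own assumption about what a typical setup includes, not something Peter specified, and the helper names are made up for illustration. It uses only the standard library, so it runs even if nothing else is installed yet.

```python
import importlib.util

# Hypothetical checklist of packages; adjust it to whatever you plan to use.
PACKAGES = ["numpy", "pandas", "matplotlib", "sklearn", "IPython"]


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


def check_packages(names):
    """Map each module name to whether it appears to be installed."""
    return {name: is_installed(name) for name in names}


if __name__ == "__main__":
    for name, ok in check_packages(PACKAGES).items():
        print(f"{name:12s} {'ok' if ok else 'MISSING'}")
```

You can paste this into a notebook cell or save it as a script; anything reported as MISSING can usually be added with the distribution's own package manager.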
There is an incredible amount of content out there and it can be hard to sift through it all. I've picked some highlights, some of which are focused on business understanding, some on implementation.
Books:
- Data Science for Business: Some math but no programming; a good resource for getting started that provides business use cases.
- Data Smart: Implements popular data science models in Excel and contains an intro to R. I'd recommend this for people without a programming background who are just starting out.
- Here's a list of more advanced machine learning titles from Quora -- very technical, not for the faint of heart.
- Other (free) books covering everything from coding to managing data science teams.
Blogs:
- The Yhat Blog
- Edwin Chen's Blog
- FastML
- Walking Randomly
- No Free Hunch -- Kaggle's competition and data science blog
- Revolutions -- Revolution Analytics' blog
- DataTau
- R-bloggers
- OkTrends -- This one's mostly for fun, but it also illustrates an interesting business use for data science at the dating website OkCupid.
Online Classes:
- Harvard's CS 109: Great hands-on intro done in Python, for those with some programming experience.
- Codecademy's Python Intro: Good beginner course in the language.
- Coursera's Intro to Machine Learning: Done in Octave/MATLAB, taught by Baidu's data science lead.
- Harvard's Advanced Machine Learning Class, CS281: I haven't gotten too far into this yet. The course is quite technical; programming challenges can be addressed in a number of different languages.
Lectures/Powerpoints:
- Slides and code samples from an NYU researcher's presentation on the popular machine-learning package scikit-learn -- good for people with programming experience in Python who want more depth on a widely used tool.
Other Lists -- places to look if you can't find what you want above:
- Boston Data Community's list of online courses
- The Open Source Data Science Masters, a full online curriculum
Thanks to Peter for sharing this list. Readers: if you have items to add, please use the comments section.
Addendum, Oct. 26, 2015: LearnDataSci.com has compiled a list of free ebooks about data science (thanks, Brendan!).