In less than one year of independent study, I advanced from data science student to practitioner and contributed over $7,500 worth of data solutions to research consulting projects.
Key Learnings and Recommendations
My value as an analyst increased with new capabilities in exploratory analysis, data visualization, and modeling/machine learning. This base layer of data science now acts as scaffolding for learning more advanced techniques related to my consulting work. For instance, segmenting users via cluster analysis is within reach. It is the main deliverable for consumer segmentation studies, which currently require professional statisticians.
The research consultancy that I work for benefits as well. Bringing data science capabilities in-house has reduced the cost of data products. Now data products are feasible for all of our client projects. Regularly delivering custom visualizations and models impresses clients and distinguishes the consultancy’s analytics offerings.
Data Scientist In Training
My studies were based entirely on two Coursera specializations: Statistics with R and Data Science. Completing them spanned an 11-month period (August 2017 to July 2018) and required roughly 500 learning hours. Note: The two specializations complement one another, but some of their content overlaps. Complete the bolded courses below for an accelerated, yet comprehensive learning path.
Statistics with R introduces the statistical theory that underlies data science. It also provides an introduction to R programming that gently steps through basic syntax and data-visualization exercises. This was a practical starting point for me, as I had no prior statistics or programming experience.
- Introduction to Probability and Data (30 hours)
- Inferential Statistics (35 hours)
- Linear Regression and Modeling (35 hours)
- Bayesian Statistics (40 hours)
- Multiple Linear Regression Capstone (60 hours)
After completing the above I began the Data Science specialization. This specialization focuses on programming and highlights many of R’s capabilities. Its topics span the data science lifecycle from getting and cleaning data to machine learning and app development.
- The Data Scientist’s Toolbox (6 hours)
- R Programming (20 hours)
- Getting and Cleaning Data (20 hours)
- Exploratory Data Analysis (20 hours)
- Reproducible Research (20 hours)
- Statistical Inference (20 hours)
- Regression (25 hours)
- Machine Learning (35 hours)
- Data Products (25 hours)
- Natural Language Processing Capstone (120 hours)
Both specializations feature final capstone projects that reflect data science “in the wild.” The capstones extend into unknown domains and require students to independently develop additional skills. The ability to learn data science on the fly is a meta-skill. Developing it was the single most important factor in transitioning from student to practitioner. Please find links to these capstones above.
Data Solutions For Consulting Projects
I began contributing data solutions to consulting projects after 7 months of learning. Over the next 4 months I generated $7,500 worth of data products and finished the rest of my coursework. These data products are listed below. Note: All data and graphics have been modified to ensure client privacy and comply with non-disclosure agreements.
A statistics contractor also produced both models (the regression trees and the multiple linear regression) to ensure accuracy; however, I generated them independently and reproduced his results exactly. Their value is the statistician's cost. All other data products resulted from exploratory analyses. Their value is calculated as my labor hours multiplied by $75/hour, an average hourly rate for a data scientist (1).
Regression Trees ($4,000)
Predicts the number of times a visitor goes to a particular humanities institution annually. Identified key variable splits that produce visitor groups with the highest and lowest average number of visits. An exploratory process that required 5 separate decision trees. Statistician’s cost: $1,000 for the first tree + $750 for each of the 4 additional trees = $4,000
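To make the idea of a "key variable split" concrete, here is a minimal sketch of how a regression tree chooses a single split: it tries each candidate threshold and keeps the one that minimizes the summed squared error of the two resulting groups. The sketch is in Python for illustration only (the coursework described here uses R), the function names `sse` and `best_split` are hypothetical, and the data are made up rather than taken from any client project.

```python
from statistics import mean


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) for the split with the lowest total SSE."""
    candidates = sorted(set(xs))
    if len(candidates) < 2:
        raise ValueError("need at least two distinct predictor values to split")
    best = None
    # Try a threshold halfway between each pair of adjacent distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, mean(left), mean(right))
    return best[1:]


# Hypothetical data: distance from the institution (miles) and annual visits.
distance = [1, 2, 3, 4, 10, 12, 15, 20]
visits = [12, 10, 11, 9, 2, 3, 1, 2]

threshold, near_mean, far_mean = best_split(distance, visits)
print(f"Split at {threshold} miles: {near_mean} vs {far_mean} average visits")
```

A full regression tree repeats this search across every predictor and then recursively within each resulting group; this sketch shows only one step on one variable.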
Sankey Diagram ($1,500)
An interactive visualization showing user groups and their corresponding responses generated from primary research data. At cost: 20 hours x $75/hour = $1,500
Multiple Linear Regression ($900)
Predicts a visitor’s satisfaction with their visit to a particular humanities institution. Visit satisfaction was regressed on 25 predictor variables to identify significant predictors and the direction (positive or negative) of their relationship with satisfaction. Statistician’s cost: $900
Visitation Distribution Analysis ($750)
Showed that visitation to a particular humanities institution follows a Pareto distribution, with roughly 20% of the institution’s visitors contributing 80% of its total yearly visits. At cost: 10 hours x $75/hour = $750
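One simple way to check an 80/20 pattern like this is to rank visitors by visit count and measure the share of all visits made by the top 20%. The sketch below is in Python for illustration only (the coursework described here uses R). The helper `top_share` is a hypothetical name, and the visit counts are invented, not the client's data.

```python
def top_share(visits, top_fraction=0.2):
    """Fraction of all visits made by the most frequent `top_fraction` of visitors."""
    if not visits:
        raise ValueError("visits must not be empty")
    ranked = sorted(visits, reverse=True)
    # Always include at least one visitor in the top group.
    top_n = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / sum(ranked)


# Hypothetical annual visit counts for ten visitors.
annual_visits = [40, 40, 3, 3, 3, 3, 2, 2, 2, 2]

share = top_share(annual_visits)
print(f"Top 20% of visitors account for {share:.0%} of visits")
```

This measures concentration only. Confirming that the data actually follow a Pareto distribution would take a further step, such as fitting the distribution and inspecting the fit.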
Visitor Segment Visualization ($375)
Shows an a priori segmentation of visitors to a particular humanities institution. Indexed these visitor segments on their responses to key variables. At cost: 5 hours x $75/hour = $375
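The valuations listed above can be tallied in a few lines as a sanity check on the $7,500 figure. This is a sketch in Python for illustration only. The dictionary keys mirror the product names above, and the dollar amounts and hours are the ones stated in each entry.

```python
HOURLY_RATE = 75  # dollars per hour, the data scientist rate cited in the text

# Models are valued at the statistician's cost; other products at labor hours x rate.
products = {
    "Regression Trees": 1000 + 4 * 750,  # first tree + four additional trees
    "Sankey Diagram": 20 * HOURLY_RATE,
    "Multiple Linear Regression": 900,
    "Visitation Distribution Analysis": 10 * HOURLY_RATE,
    "Visitor Segment Visualization": 5 * HOURLY_RATE,
}

total = sum(products.values())

for name, value in products.items():
    print(f"{name}: ${value:,}")
print(f"Total: ${total:,}")
```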
References
- Keenan, T. (2018, January 5). How Much Does It Cost to Hire a Data Scientist? Retrieved from https://www.upwork.com/hiring/data/how-much-hire-data-scientist/
Additional Resources
These are all the resources I used throughout my learning.
Other Learning Paths
Books
Note: All of these books were available at my public library.
Quick Cheat Sheets