6.3 Data
In an academic or research setting, you likely only work with clean, readily available datasets and therefore can afford to spend most of your time on modeling. In production, it’s likely that you will spend most of your time on the data pipeline. The ability to manage, process, and monitor data will make you attractive to potential employers.
In your interviews, you might be asked questions that evaluate how comfortable you are working with data. At a high level, you should be familiar with reading, writing, and serializing different types of data. You should have a go-to library for dataframe manipulation: pandas is popular for general data applications, and dask is a good option if you need to scale beyond a single core or machine (it can also be paired with GPU dataframe libraries such as cuDF). You should be comfortable with at least one visualization tool such as seaborn, matplotlib, Tableau, or ggplot.
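As a rough illustration of what "reading, writing, and serializing" can look like in practice, here is a minimal sketch using pandas. The column names (`user_id`, `score`) and the helper names are made up for this example, and in-memory buffers stand in for real files so the snippet is self-contained.

```python
import io
import pickle

import pandas as pd


def round_trip_csv(df):
    """Write a dataframe to CSV text and read it back."""
    buf = io.StringIO()
    df.to_csv(buf, index=False)  # index=False avoids an extra unnamed column
    buf.seek(0)
    return pd.read_csv(buf)


def round_trip_json(df):
    """Serialize a dataframe to a JSON list of records and parse it back."""
    text = df.to_json(orient="records")
    return pd.read_json(io.StringIO(text), orient="records")


# Hypothetical toy data, only for illustration.
df = pd.DataFrame({"user_id": [1, 2, 3], "score": [0.5, 0.75, 1.0]})

csv_df = round_trip_csv(df)
json_df = round_trip_json(df)

# pickle is a binary serialization format; only unpickle data you trust.
pickled_df = pickle.loads(pickle.dumps(df))
```

Text formats such as CSV and JSON are easy to inspect but do not store column types, so types are re-inferred on read. It's worth checking dtypes after loading real data.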
If you want to work with big data, it doesn’t hurt to familiarize yourself with distributed data management systems such as Spark and Hadoop.
Beyond Python, SQL is still ubiquitous for all applications that require persistent databases, and while R isn’t the sexiest language, it’s handy for quick data analytics.
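To give a sense of the kind of SQL you might be expected to write, here is a small sketch that runs an aggregation query from Python using the standard-library `sqlite3` module. The `events` table and its columns are hypothetical, and an in-memory database is used purely so the example runs anywhere.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table of purchase events.
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",  # placeholders avoid SQL injection
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)

# Total amount per user, ordered by user id.
rows = conn.execute(
    "SELECT user_id, SUM(amount) FROM events "
    "GROUP BY user_id ORDER BY user_id"
).fetchall()

conn.close()
```

Queries built on `GROUP BY`, joins, and filtering cover much of day-to-day data work, so they are a reasonable place to start practicing.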