6.3.1 Data structures
If our world is run by data, then data structures are what keep us from descending into chaos. From the dawn of the digital age, the best minds of computer science have kept themselves up at night thinking of efficient ways to store and manipulate data. Data structures are even more important in machine learning as the field is fueled by big data.
While there are classical data structures that have stood the test of time, developing new data structures and improving on the existing ones are never-ending battles as new formats are introduced and new data are generated at a scale never seen before. Your familiarity with existing data structures, understanding of how they are implemented, and intuition on what data structures to use and when to use them will be highly valuable.
Some of the data structures whose runtime complexities you should know and that you should be able to implement in at least one language:
- Trees: binary search tree, heap, trie (prefix and suffix tree)
- Queues, stacks, priority queues
- Linked lists
- HashMap and HashTable
We don’t have questions about those data structures here, but you should try to implement them yourself, either on a coding exercise website or locally, and compare with known implementation.
You should be comfortable with manipulating popular data formats such as the ubiquitous CSV format and web- and serialization-friendly JSON format. Both CSV and JSON are examples of the traditional row-based file formats: data is stored and often indexed row-by-row.
In recent years, the column-based format has become more and more common, as it allows big data applications to quickly extract one feature from all the data points by calling the column corresponding to that feature. Popular data frameworks for machine learning include pandas
and dask
are optimized for column-based operations. The two common column-based file formats are Parquet, championed by Apache Hadoop, and ORC, championed by Apache Hive.
Row-based data formats are more efficient for writing while column-based formats are more efficient for reading. If your data is write-once-read-many, use column-based. If it requires regular rewriting, opt for row-based.
For more detail on data engineering for machine learning, check out the lecture note on Data Engineering for the course Machine Learning Systems Design.