Johan van Doornik

Algorithms

Before designing a (machine learning) algorithm to solve a given problem, it is wise to start with a thorough understanding of the input data. Consider aspects such as the different types of noise, missing values, labels, potential bias, unexpected interdependencies, and over- or undersampling. Proper data cleanup is essential to make any machine learning algorithm work. This is where data science meets data engineering, especially when the data is not static but regularly refreshed, or even real-time, and requires a data ingestion architecture. Automated learning on previously unseen data is exciting, but it has a number of pitfalls and requires plenty of safety valves and preparation for the unexpected.
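As a rough illustration of such a first look at the data, here is a minimal sketch in plain Python that counts missing values per field and reports the share of each label. The names `is_missing` and `audit`, and the representation of the data as a list of dictionaries, are assumptions made for this example only; a real pipeline would likely use a dataframe library and check far more.

```python
import math
from collections import Counter


def is_missing(value):
    """Treat None, blank strings and NaN as missing (an assumed convention)."""
    if value is None:
        return True
    if isinstance(value, str) and value.strip() == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def audit(rows, label_key):
    """Summarise missing values per field and the share of each label.

    `rows` is a list of dicts. Fields that are absent from a row are not
    counted as missing here; only present-but-empty values are.
    """
    missing = Counter()
    labels = Counter()
    for row in rows:
        for key, value in row.items():
            if is_missing(value):
                missing[key] += 1
        label = row.get(label_key)
        if not is_missing(label):
            labels[label] += 1
    total = sum(labels.values())
    label_share = {k: v / total for k, v in labels.items()} if total else {}
    return {
        "rows": len(rows),
        "missing": dict(missing),
        "label_share": label_share,
    }
```

A strongly skewed `label_share` or a field with many missing values is a hint to look closer before any model is trained.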

As for algorithm design, I usually prefer to start as simple as possible and incrementally add complexity, while keeping fallback options for exceptions. It is often not necessary to optimize for all possible inputs right away; it is better to focus first on the important 80%. Still, every input must give a reasonable output. (The exception is non-ergodic systems with a risk of ruin, where the extremes are all-important.) By going from simple to complex I also mean gradually adding black box solutions, such as neural networks for function approximation or feature selection. But always be aware that while black box methods (deep learning) will in the end outperform systems based solely on domain knowledge, they may behave strangely in unexpected situations, simply because the training data contains fewer examples of them.
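One way to picture the fallback idea is a thin wrapper that only trusts the complex model when its output looks plausible. This is a minimal sketch under stated assumptions: the function name `predict_with_fallback`, the scalar output, and the plausibility bounds `low` and `high` are all hypothetical and chosen for illustration.

```python
import math


def predict_with_fallback(x, complex_model, simple_model, low, high):
    """Return (prediction, source), preferring the complex model.

    Falls back to the simple model when the complex model raises, or when
    its output is not a finite number within the assumed range [low, high].
    """
    try:
        y = complex_model(x)
    except Exception:
        # Any failure of the black box is treated as "use the baseline".
        return simple_model(x), "fallback"

    is_number = isinstance(y, (int, float))
    if not is_number or not math.isfinite(y) or not (low <= y <= high):
        return simple_model(x), "fallback"
    return y, "complex"
```

Returning the source alongside the prediction makes it easy to log how often the fallback is used, which is itself a useful signal that the inputs have drifted away from the training data.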

Below you can find some of my recent data science projects.

Relevant projects