A Comparative Study of Modern Feature Selection Strategies and Alternative Approaches
The steady growth in collected data, along with a growing awareness of the existence and power of machine learning algorithms, has led to a myriad of trained classification and regression models. Although computers are nowadays capable of handling very large data sets, using machine learning purely as a black box is not advisable.
Aside from a better understanding of the algorithm's internal mechanics, the set of features that is actually considered plays a crucial role in the model's performance and interpretability. Intuitively, models relying on only a handful of features are usually easier to grasp and likely also faster with regard to training and prediction times. On the other hand, models employing a larger feature set tend to achieve better predictive performance, but also come with the risk of overfitting and of including noisy or redundant features.
Therefore, finding a feature set that is highly informative yet manageable in size has a strong effect on a model's performance and interpretability.
Ideally, one would consider all possible combinations of features (i.e., perform exhaustive search) and afterwards pick the combination that performed best. However, this strategy requires $2^p$ function evaluations (model trainings), i.e., the cost of training the machine learning model grows exponentially with the number of features $p$ ("curse of dimensionality"). This makes exhaustive search very expensive for data sets with more than a few features and completely impracticable for high-dimensional data sets.
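To make the scaling concrete: excluding the empty set, the number of candidate feature subsets is
\[
\sum_{k=1}^{p} \binom{p}{k} = 2^p - 1,
\]
so $p = 30$ features already imply more than a billion model trainings, and $p = 100$ more than $10^{30}$.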
Over the years, alternative feature selection strategies have been proposed, usually relying on either greedy or stochastic principles. Although both approaches have already led to improved machine learning models, there remains potential for better strategies.
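As a point of reference, a greedy strategy such as forward selection can be prototyped in a few lines of R. The following is only a minimal sketch for a regression task, assuming a numeric response column and 5-fold cross-validated RMSE of a linear model as the selection criterion; all function and variable names are illustrative and not taken from an established package.

```r
# Minimal sketch of greedy forward selection (illustrative names, not a package API).
# `data` is a data.frame, `target` the name of a numeric response column.
greedy_forward_selection <- function(data, target, max_features = 10) {
  remaining <- setdiff(names(data), target)
  selected  <- character(0)
  best_rmse <- Inf
  folds     <- sample(rep(1:5, length.out = nrow(data)))  # 5-fold CV assignment

  # 5-fold cross-validated RMSE of a linear model using the given features
  cv_rmse <- function(features) {
    errs <- sapply(1:5, function(k) {
      fit  <- lm(reformulate(features, target), data = data[folds != k, ])
      pred <- predict(fit, newdata = data[folds == k, ])
      sqrt(mean((data[folds == k, target] - pred)^2))
    })
    mean(errs)
  }

  while (length(selected) < max_features && length(remaining) > 0) {
    scores <- sapply(remaining, function(f) cv_rmse(c(selected, f)))
    if (min(scores) >= best_rmse) break  # stop once no candidate improves the CV error
    best_rmse <- min(scores)
    best      <- remaining[which.min(scores)]
    selected  <- c(selected, best)
    remaining <- setdiff(remaining, best)
  }
  selected
}
```

Stochastic counterparts, such as genetic algorithms or simulated annealing, instead explore the space of feature subsets by randomly perturbing and recombining candidate solutions rather than adding one feature at a time.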
Here, strategies that actively respect the similarity among features, and thus avoid spending valuable evaluations on redundant features, could result in considerable performance improvements. Initial ideas for such improvements are (a) finding clusters of similar features and only picking features from separate clusters, or (b) reducing the feature set upfront to those features that are highly important to the most relevant principal components.
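Both ideas can be prototyped with standard R tooling. The snippet below is only a rough sketch: it clusters features based on their pairwise correlations via hclust and keeps one representative per cluster (idea a), and it retains only features with large absolute loadings on the leading principal components from prcomp (idea b). The data set (mtcars), the number of clusters, the variance threshold, and the loading cutoff are purely illustrative assumptions.

```r
# Illustrative predictors: built-in mtcars data with the response (mpg) dropped
X <- as.data.frame(scale(mtcars[, -1]))

## (a) feature clustering: distance = 1 - |correlation|
d  <- as.dist(1 - abs(cor(X)))
hc <- hclust(d, method = "average")
cl <- cutree(hc, k = 4)                              # assume 4 feature clusters
representatives <- tapply(names(cl), cl, function(feats) feats[1])  # first feature per cluster

## (b) PCA-based pre-filtering: keep features important to the leading PCs
pca      <- prcomp(X)                                # X was scaled above
n_pc     <- which(cumsum(pca$sdev^2) / sum(pca$sdev^2) >= 0.9)[1]   # PCs covering 90% variance
loadings <- abs(pca$rotation[, 1:n_pc, drop = FALSE])
keep     <- names(which(apply(loadings, 1, max) > 0.4))              # illustrative cutoff
```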
In this thesis, the aforementioned feature selection strategies (and alternative ideas) first have to be implemented (in R) to allow for comparisons against established feature selection strategies. Moreover, all considered strategies have to be carefully reviewed, including an analysis of their strengths and weaknesses. Complementing the theoretical investigations, they shall further be benchmarked against each other on various data sets and for a variety of machine learning algorithms.