- Study the learning curves
The first step in improving the results of your machine learning algorithms is to diagnose what is wrong with your model. You can do this by plotting learning curves: train the model on progressively larger subsets of the training data and measure the error on both the training set (in-sample) and a held-out test set (out-of-sample). The curves show immediately whether there is a gap between the in-sample and out-of-sample errors. If both errors are high and close to each other, you are working with a biased model. If the in-sample error is low but the out-of-sample error is much higher, the model suffers from high variance instead.
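As a rough illustration of the idea, the sketch below builds a learning curve by hand with NumPy. The synthetic quadratic data, the straight-line model, and the helper names (`design`, `mse`, `learning_curve`) are assumptions made for this example, not part of any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): quadratic signal plus noise.
X = rng.uniform(-3, 3, size=600)
y = 0.5 * X**2 + rng.normal(scale=0.3, size=X.shape)

X_train, y_train = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]


def design(x):
    """Design matrix for a straight line: intercept and slope columns."""
    return np.column_stack([np.ones_like(x), x])


def mse(x, target, coef):
    """Mean squared error of the fitted line on the given points."""
    return float(np.mean((design(x) @ coef - target) ** 2))


def learning_curve(sizes):
    """Fit on the first n training rows; return (n, train MSE, test MSE)."""
    rows = []
    for n in sizes:
        coef, *_ = np.linalg.lstsq(design(X_train[:n]), y_train[:n], rcond=None)
        rows.append((n,
                     mse(X_train[:n], y_train[:n], coef),
                     mse(X_test, y_test, coef)))
    return rows


curve = learning_curve([25, 50, 100, 200, 400])
for n, train_err, test_err in curve:
    print(f"n={n:3d}  in-sample MSE={train_err:.2f}  out-of-sample MSE={test_err:.2f}")
```

Because a straight line cannot represent a quadratic relationship, both errors should stay far above the noise variance (0.09 here) and close to each other as the training set grows, which is the high-bias pattern described above.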
- Use cross-validation (CV) correctly
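A large gap between your cross-validation estimate and the result you obtain on a test set of fresh data is a serious problem. It means that something has gone wrong in the cross-validation procedure. Cross-validation is normally a good predictor of performance on unseen data, so when the two disagree, the CV score has become a misleading indicator, and decisions based on it will produce unsatisfactory results. A common cause is information leaking from the validation folds into training, for example when preprocessing or feature selection is fitted on the full dataset before it is split.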
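One way cross-validation can mislead is sketched below, under stated assumptions: the data is pure noise, the classifier is a simple nearest-centroid rule, and the helpers (`top_features`, `predict_nearest_centroid`, `cv_accuracy`) are written for this example only. The sketch compares selecting features once on all the data with selecting them again inside every training fold.

```python
import numpy as np

rng = np.random.default_rng(42)

# Pure-noise data (an illustrative assumption): the labels are unrelated to
# the features, so an honest accuracy estimate should sit near 0.5.
n_samples, n_features, n_keep, n_folds = 60, 1000, 10, 5
X = rng.normal(size=(n_samples, n_features))
y = rng.permutation(np.repeat([0, 1], n_samples // 2))
folds = np.array_split(rng.permutation(n_samples), n_folds)


def top_features(X, y, k):
    """Indices of the k features most correlated (in absolute value) with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(corr))[:k]


def predict_nearest_centroid(X_train, y_train, X_test):
    """Assign each test row to the class whose training mean is closest."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(X_test - c0, axis=1)
    d1 = np.linalg.norm(X_test - c1, axis=1)
    return (d1 < d0).astype(int)


def cv_accuracy(select_inside_fold):
    """K-fold accuracy, with feature selection inside or outside the folds."""
    leaked_cols = top_features(X, y, n_keep)  # has seen every label
    scores = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n_samples), test_idx)
        if select_inside_fold:
            cols = top_features(X[train_idx], y[train_idx], n_keep)
        else:
            cols = leaked_cols
        pred = predict_nearest_centroid(
            X[train_idx][:, cols], y[train_idx], X[test_idx][:, cols])
        scores.append(np.mean(pred == y[test_idx]))
    return float(np.mean(scores))


leaky_acc = cv_accuracy(select_inside_fold=False)
honest_acc = cv_accuracy(select_inside_fold=True)
print(f"selection on all data: {leaky_acc:.2f}")
print(f"selection per fold:    {honest_acc:.2f}")
```

On noise like this, the leaky estimate typically comes out well above chance, while the per-fold version stays close to it. The same principle applies to scaling, imputation, and any other step that learns from the data: fit it on the training folds only.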
- Handle the missing values
One of the biggest challenges in building machine learning models is missing values and how people handle them. This is not necessarily their fault: much of the material on the web advocates replacing null values with the feature's mean, which is not always appropriate. The first question to ask is why the data is missing in the first place. After that, consider alternatives to mean/median imputation. These include building a model that predicts the missing feature from the other features, K-Nearest Neighbors (KNN) imputation, or deleting the affected rows, although deletion is not always advisable.
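To make KNN imputation concrete, here is a minimal NumPy sketch. The function `knn_impute`, its distance definition, and the toy array are assumptions chosen for illustration; library implementations differ in details such as weighting and scaling.

```python
import numpy as np


def knn_impute(X, k=2):
    """Fill each NaN with the mean of that feature over the k nearest rows.

    The distance between two rows is the root mean squared difference over
    the features both rows have observed. Only rows with the target feature
    observed are eligible as neighbors. Features are not rescaled here.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    for i in np.flatnonzero(missing.any(axis=1)):
        shared = ~missing[i] & ~missing          # observed in both rows
        diff = np.where(shared, X - X[i], 0.0)
        n_shared = shared.sum(axis=1)
        with np.errstate(divide="ignore", invalid="ignore"):
            dist = np.sqrt((diff ** 2).sum(axis=1) / n_shared)
        dist[n_shared == 0] = np.inf             # nothing to compare on
        dist[i] = np.inf                         # a row is not its own neighbor
        for j in np.flatnonzero(missing[i]):
            candidates = np.flatnonzero(~missing[:, j] & np.isfinite(dist))
            if candidates.size == 0:
                continue                         # leave the gap if no neighbor
            nearest = candidates[np.argsort(dist[candidates])[:k]]
            filled[i, j] = X[nearest, j].mean()
    return filled


# Toy data with two obvious clusters (hypothetical values).
data = np.array([
    [1.0, 2.0, 10.0],
    [1.2, 2.2, 12.0],
    [0.9, np.nan, 9.0],
    [5.0, 6.0, 50.0],
    [5.2, 6.2, 52.0],
    [5.1, 6.1, np.nan],
])
imputed = knn_impute(data, k=2)
print(imputed)
```

In this toy array, each gap is filled from the two rows in the same cluster, whereas a column mean would blend both clusters. In practice you would usually standardize the features first so that no single column dominates the distance.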
- Apply feature engineering
Bias may still affect your model even after trying the methods above. If so, try creating new features that make the target response easier to predict. You can generate such features automatically with polynomial expansion or with the support vector machine (SVM) class of algorithms. SVMs can automatically look for better feature spaces, through kernel functions, in a way that is both memory-efficient and computationally fast. Handy as these methods are, they cannot replace human expertise and an understanding of the problem that the algorithm is trying to learn. The most useful features often come from your own knowledge and ideas of how things work in the real world. Machines have improved significantly, but humans are still hard to beat in this area.
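Polynomial expansion can be sketched in a few lines of NumPy. The helper names `polynomial_expand` and `fit_mse`, and the synthetic target that depends only on an interaction between two features, are assumptions for this example.

```python
import numpy as np
from itertools import combinations_with_replacement


def polynomial_expand(X, degree=2):
    """Return a bias column plus all column products up to the given degree."""
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    columns = [np.ones(len(X))]  # bias term
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(n_features), d):
            columns.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(columns)


def fit_mse(features, target):
    """Least-squares fit followed by the in-sample mean squared error."""
    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    return float(np.mean((features @ coef - target) ** 2))


rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = X[:, 0] * X[:, 1]  # target driven purely by an interaction (assumed)

raw = np.column_stack([np.ones(len(X)), X])
mse_raw = fit_mse(raw, y)
mse_expanded = fit_mse(polynomial_expand(X, degree=2), y)
print(f"linear features:   MSE={mse_raw:.4f}")
print(f"expanded features: MSE={mse_expanded:.2e}")
```

A linear model on the raw columns cannot capture the interaction, but the same linear model fits it almost exactly once the product column is available. The cost is that the number of columns grows quickly with the degree and the number of input features.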
- Look for more data
After exploring all the previous options, your model may still show problems, in particular high variance. In that case, the remaining option is to increase the size of the training data, by adding either new cases or new features. To add cases, look carefully at your data and check whether you have more data of the same kind at hand. To add features, a good approach is to find an open data source and match its records to your own entries. You can also obtain both new cases and new features by scraping data from the web.
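Matching an external source to your own entries is essentially a join on a shared key. The sketch below uses only the Python standard library; the store records, the postal-code key, the `median_income` feature, and the helper `add_features` are all hypothetical and exist only to show the mechanics.

```python
# Hypothetical in-house records: one row per store.
cases = [
    {"store": "A", "postal_code": "10001", "sales": 120},
    {"store": "B", "postal_code": "94105", "sales": 95},
    {"store": "C", "postal_code": "60601", "sales": 80},
]

# Hypothetical open-data table: one extra feature per postal code.
open_data = {
    "10001": {"median_income": 72000},
    "94105": {"median_income": 98000},
}


def add_features(rows, lookup, key, defaults):
    """Left join: keep every row, adding looked-up features or defaults."""
    return [{**row, **lookup.get(row[key], defaults)} for row in rows]


enriched = add_features(cases, open_data, key="postal_code",
                        defaults={"median_income": None})
for row in enriched:
    print(row)
```

Rows with no match in the external source end up with a missing value for the new feature, so the earlier advice on handling missing values applies to the enriched data as well.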