Nine Things I Learned About Machine Learning During UmojaHack Africa 2021
For UmojaHack2021, I represented my university and country in the Intermediate Challenge (Sendy Delivery Rider Response Challenge), where the objective was to create a machine learning model that predicts whether a rider will accept, decline or ignore an order sent by a customer. The challenge took two days, and at the end I ranked 6th out of 266 participants on the final leader-board, taking 2nd prize for this challenge overall.
On 27 and 28 March 2021, Zindi organised UmojaHack Africa 2021, the biggest inter-university machine learning hackathon for African undergraduate and postgraduate students. The hackathon involved 126 universities from 21 countries, and students could choose from three challenges:
The Beginner challenge: Predicting Financial Resilience
The Intermediate challenge: Sendy Delivery Rider Response Challenge
The Advanced challenge: Deepchain Antibody Challenge
From UmojaHack2021, I was able to learn many things from other participants and from the challenge itself. I would love to share some of these things with others in the community. I hope they will be helpful for any machine learning challenges or the next UmojaHack hackathon.
1. Understand the problem (challenge) context and write down assumptions before you look at the datasets.
The best way to start any machine learning challenge is by understanding the problem you’re going to solve. You can approach this through its context (the domain, such as finance or health) or through the type of challenge (e.g. a supervised task such as classification or regression). This understanding helps you draw some initial assumptions, and during feature engineering it helps you create new features from the most important features, the ones that directly affect the result (target). I recommend doing this before you start any machine learning challenge. If the topic is complex, do some research online to make sure you understand it before you check out the datasets.
2. Make a baseline and submit as early as possible.
Once you are comfortable with the problem context, it’s time to gather the data and start working. Create your baseline notebook as soon as you can, and make your first submission. This will give you the confidence to improve your model performance and your position on the leader-board.
3. Understand the meaning of variables/features and their effect on the target.
You should make sure you understand all the features well, and how they relate to or affect the result/target. This will help you do well at feature engineering. You can get a clear understanding of your features by reading the variable definitions file and then doing Exploratory Data Analysis (EDA). For fast analysis you can use the pandas_profiling tool, which helps you gain a deep understanding of the features and the overall datasets quickly.
4. Use profiling for fast analysis.
Profiling lets you analyse the whole dataset with only a single line of Python code. After you import the pandas_profiling package, it gives you an easy way to view the relationships between features and a summary of the overall dataset. You can also spot insights and patterns in different features easily. Learn more about pandas_profiling here.
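As a minimal sketch of what this looks like in practice, the snippet below wraps the profiling call in a small helper. The helper name `build_profile_report`, the sample columns and the output file name are all made up for illustration. It also assumes the profiling package is installed; newer releases are published as ydata_profiling, so the import tries that name first.

```python
import pandas as pd


def build_profile_report(df, title="Dataset profile"):
    """Return a profiling report for df (hypothetical helper).

    The import lives inside the function so the rest of a notebook
    still runs when the profiling package is not installed.
    """
    try:
        from ydata_profiling import ProfileReport  # newer package name
    except ImportError:
        from pandas_profiling import ProfileReport  # older package name
    # minimal=True skips the most expensive computations on large data
    return ProfileReport(df, title=title, minimal=True)


# Made-up sample rows standing in for a challenge dataset
orders = pd.DataFrame({
    "rider_id": ["r1", "r2", "r1"],
    "distance_km": [2.5, 7.1, 3.3],
    "accepted": [1, 0, 1],
})
```

With the package installed, `build_profile_report(orders).to_file("orders_profile.html")` would write the report to an HTML file you can open in a browser.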
5. Organise your code well and follow the ML workflow.
Organizing your code in a simple, correct way that follows the ML workflow helps you trace your own work easily, and helps others read and understand it with less effort. You can achieve this by commenting your code, following the ML workflow, and defining functions named after the task they perform, such as a calculate_distance function specifically for calculating the distance from one point to another.
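For example, a calculate_distance function could look like the sketch below. The article does not say which distance formula was used, so the haversine (great-circle) formula here is an assumption, as are the argument names and the choice of kilometres.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def calculate_distance(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in kilometres between two points.

    All arguments are in decimal degrees.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

Because the name says what the function does, a reader can follow a line like `calculate_distance(pickup_lat, pickup_lon, drop_lat, drop_lon)` without opening the function body.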
6. Use variables/features that directly affect the target to generate possible new features.
This will depend on how well you understand the problem context and on the analysis you have done so far. This is feature engineering, which offers real potential for improving machine learning model performance. From existing features you can generate new ones, such as distance from latitudes and longitudes, or days, years, hours, minutes and seconds from date-time features.
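Here is a small sketch of the date-time part using pandas. The column name `dispatch_time`, the helper name and the sample rows are invented for illustration and are not taken from the challenge data.

```python
import pandas as pd

# Made-up sample rows; the real dataset columns may differ
orders = pd.DataFrame({
    "dispatch_time": ["2021-03-27 08:15:30", "2021-03-28 17:45:05"],
})
orders["dispatch_time"] = pd.to_datetime(orders["dispatch_time"])


def add_datetime_features(df, column):
    """Return a copy of df with calendar and clock parts of a datetime column."""
    out = df.copy()
    dt = out[column].dt
    out[f"{column}_year"] = dt.year
    out[f"{column}_month"] = dt.month
    out[f"{column}_day"] = dt.day
    out[f"{column}_dayofweek"] = dt.dayofweek  # Monday=0, Sunday=6
    out[f"{column}_hour"] = dt.hour
    out[f"{column}_minute"] = dt.minute
    out[f"{column}_second"] = dt.second
    return out


features = add_datetime_features(orders, "dispatch_time")
print(features.filter(like="dispatch_time_").to_string(index=False))
```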
7. Use ID and categorical features to create new features.
From UmojaHack2021 I discovered that IDs also have potential for improving ML models. Before encoding categorical features, try to generate some new features from them, such as value frequencies. You can check out a great article covering this technique here.
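One way to build such frequency features is sketched below with pandas. The column names, the helper name and the toy rows are hypothetical, and this is only one of several possible encodings.

```python
import pandas as pd

# Made-up sample rows; rider_id stands in for any ID or categorical column
train = pd.DataFrame({
    "rider_id": ["r1", "r2", "r1", "r3", "r1", "r2"],
    "vehicle_type": ["bike", "bike", "bike", "van", "bike", "bike"],
})


def add_frequency_feature(df, column):
    """Return a copy of df with count and relative-frequency columns."""
    out = df.copy()
    counts = out[column].value_counts()
    # How many rows share this value
    out[f"{column}_count"] = out[column].map(counts)
    # The same information as a share of all rows
    out[f"{column}_freq"] = out[f"{column}_count"] / len(out)
    return out


encoded = add_frequency_feature(train, "rider_id")
print(encoded)
```

In a competition setting you would usually compute the counts on the training data and map them onto the test data, so both sets share the same encoding.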
8. Train your first model with plain-parameter algorithms.
Don’t start by training with parameters borrowed from other models; this can lead you to misleading results. I recommend starting with plain-parameter algorithms (using the default parameters). After that you can use hyper-parameter tuning, such as Optuna, grid search or random search, to find the best parameters for your model, or you can tune the hyper-parameters manually.
9. Use parameters such as iterations, over-fitting detectors and learning rate.
When modelling, focus on important parameters such as the number of iterations, and use detectors or regularisation against over-fitting to help your model provide strong predictions. You can use strong tree-based algorithms such as CatBoost, XGBoost, LightGBM and gradient boosting, which offer built-in ways to detect over-fitting and tend to generalize well. You can read more in A Comprehensive Guide to Ensemble Learning with Python Codes.
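The sketch below illustrates that order of work with scikit-learn: first a model with default parameters, then a small grid search. The synthetic data, the random forest and the parameter grid are all assumptions chosen to keep the example short; they are not what was used in the challenge.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic three-class data standing in for accept / decline / ignore
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

# Step 1: default parameters give a reference score
baseline = RandomForestClassifier(random_state=0)
baseline_score = cross_val_score(baseline, X, y, cv=3, scoring="accuracy").mean()

# Step 2: search a small grid that also contains the defaults
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid, cv=3, scoring="accuracy",
)
search.fit(X, y)

print(f"baseline accuracy: {baseline_score:.3f}")
print(f"tuned accuracy:    {search.best_score_:.3f} with {search.best_params_}")
```

Keeping the default configuration inside the grid makes the comparison fair: the search can only report a score at least as good as the baseline on the same folds.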
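To illustrate the idea without depending on any one of those libraries, the sketch below uses scikit-learn's GradientBoostingClassifier, whose `n_iter_no_change` and `validation_fraction` parameters stop training when a held-out score stops improving. The synthetic data and the parameter values are assumptions; CatBoost, XGBoost and LightGBM offer comparable early-stopping options under their own parameter names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0,
)

model = GradientBoostingClassifier(
    n_estimators=500,         # upper limit on boosting iterations
    learning_rate=0.1,        # step size of each iteration
    validation_fraction=0.2,  # share of training data held out internally
    n_iter_no_change=10,      # stop after 10 iterations without improvement
    random_state=0,
)
model.fit(X_train, y_train)

print(f"iterations actually used: {model.n_estimators_}")
print(f"test accuracy: {model.score(X_test, y_test):.3f}")
```

A lower learning rate usually needs more iterations, so these two parameters are worth tuning together.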
Lastly, I would like to say thank you to Zindi, all the sponsors and all the participants of UmojaHack2021. In mastering data science and machine learning there is no single silver bullet, but practicing on different challenges such as hackathons and competitions can help you gain experience of working with data and achieve your career goals.
A big thank you to everyone for participating in UmojaHack2021. This competition was all about quick and structured thinking, coding, experimentation, and finding the one approach that got you up the leader-board. In short, what machine learning is all about!
Missed out this time? Don’t worry, you can check out all upcoming competitions and hackathons on the Zindi platform, and register today!