September 7, 2018 at 7:38 pm #85110
It gives me great pleasure to write about the journey I have been through over the last few months and how I landed here. I will try to keep it short.
“Who Am I?”
I have been working with Infosys for a decade and a half now. Mechanical Engineer by qualification, project manager by profession, and playing with numbers by passion. Like everybody else, it started with the buzzword of BIG data in 2016, and after the initial 'Analysis Paralysis', a lot of random walk, and some biases and variances, I finally hit the bull's eye named "Data Science" a year back. Since then I've spent a lot of time on self-learning.
1. I started by honing the basic skills – descriptive and inferential statistics, probability, linear/vector algebra, calculus, etc. – from various sources like Udacity, Khan Academy, edX and the beautiful 3Blue1Brown videos on YouTube. (And I strongly believe this helps a lot. I have seen people start with Andrew Ng's ML course, which I took almost a year after I started, once I was done with these essential basics.)
2. After that, I completed courses on R, EDA and Python programming from Coursera, Udacity and IIT Madras (NPTEL), followed by the Data Analytics course by IITM (NPTEL), which I would say is an excellent course for learning the mathematics and intuition behind Data Science/Analytics and ML algorithms. (They conduct an exam and provide a certificate too.)
3. Currently, I am going through multiple offerings in parallel, starting with Data Science A-Z on Udemy. Kirill Eremenko is one of the best trainers in this field I have come across! His Machine Learning A-Z is the next one I'm planning to take. For Python hands-on practice, the sentdex videos on YouTube are a great resource that I keep referring to occasionally.
How Did I Reach the Dream Spot?
Very first thing – I dreamt about it! I mean, I dreamt of being in the top 3. This was only my second attempt at any hackathon, the first being the Beer Challenge by MachineHack, where I finished 10th. That made me confident and overambitious. I really enjoyed every step and experiment I did to improve the score by every thousandth of a point. My own "Pursuit of Happyness"!
Coming to the tools and techniques:
- Multiple tools. Python (pandas, NumPy, SciPy), R and Excel. Excel came in handy for data cleaning, while R, my first love, helped with statistical analysis and visualizations. And the final product was an IPython/Jupyter notebook.
- Data augmentation. At first thought it sounded weird, and that tempted me to do it. The dataset was a bit too small, and after splitting it 80-20 for cross-validation, the model was missing a substantial set of examples that landed in the cross-validation set. So I repeated the full dataframe twice, thereby reducing the chances of the model losing some key samples. Not exactly the same, but something similar to what they call an "epoch" in neural network training. (A sketch of this trick follows this list.)
- Solve the business problem. While solving a machine learning problem, you first need to solve the business problem. Data is more crucial than the algorithm, and I spent more than 90% of my time here, on the data. Thinking about the features and assessing their usefulness is the key. So I decided to use as many features as possible – with a calculated risk of over-fitting! I started with the obvious set of features and went on creating every new feature that boosted the performance. There may be some collinearity in my model, but who cares if it is predicting well? I calculated location-wise rates and added them as a feature to every record. Society initially seemed an innocent feature, but I considered it too, computing a society-rate as a new feature. These two were calculated on the train data and included in both – train and test data. (See the feature sketch after this list.)
- The algorithm. Of course, the winner, the XGBoost Regressor, was the obvious choice. Though I tried the Random Forest Regressor and Recursive Feature Elimination, the results weren't impressive. Hyperparameter tuning also played a vital role in improving the performance (a tuning sketch follows this list). This good blog came in handy – http://blog.kaggle.com/2016/07/21/approaching-almost-any-machine-learning-problem-abhishek-thakur/
- Finally, ensemble methods. Ensembles are usually built from multiple models; I built mine from the same model trained on different sets of data. I don't know if this can be called bagging. The random seed you use for splitting the data at times decides your fate, as it decides what portion of the data your model is going to learn from. So, for the best generalization, multiple random seeds were used and the results were aggregated for a better score (see the seed-ensemble sketch after this list). This answer on the forum provided the hint – https://www.machinehack.com/forums/topic/tools-to-be-used/ Thanks, Kishan!!
- And a lot more… I don't want to disclose my secret recipe just yet 😉 Well, just kidding!!
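Here is a minimal sketch of the duplication trick described above. The data and column names (Location, Sqft, Price) are placeholders I've made up for illustration, not the actual dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the housing data; columns are illustrative only.
train = pd.DataFrame({
    "Location": ["A", "B", "A", "C", "B"],
    "Sqft": [1000, 1500, 1200, 900, 1100],
    "Price": [50.0, 80.0, 65.0, 40.0, 60.0],
})

# Stack the dataframe on top of itself so every row appears twice.
# After an 80-20 split, a sample is now far less likely to be
# entirely absent from the training side.
train_aug = pd.concat([train, train], ignore_index=True)

X = train_aug.drop(columns=["Price"])
y = train_aug["Price"]
X_tr, X_cv, y_tr, y_cv = train_test_split(X, y, test_size=0.2, random_state=42)
```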
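And a sketch of the location-rate and society-rate features – computed on the train data alone, then merged into both train and test (again, the column names are my own placeholders):

```python
import pandas as pd

# Placeholder frames; the real data had many more columns.
train = pd.DataFrame({
    "Location": ["A", "B", "A"],
    "Society": ["S1", "S2", "S1"],
    "Price": [50.0, 80.0, 65.0],
})
test = pd.DataFrame({"Location": ["A", "B"], "Society": ["S2", "S1"]})

# Mean price per location and per society, from train data only.
loc_rate = train.groupby("Location")["Price"].mean().rename("location_rate").reset_index()
soc_rate = train.groupby("Society")["Price"].mean().rename("society_rate").reset_index()

# Merge the rates into BOTH train and test, as described above.
train = train.merge(loc_rate, on="Location", how="left").merge(soc_rate, on="Society", how="left")
test = test.merge(loc_rate, on="Location", how="left").merge(soc_rate, on="Society", how="left")
```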
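For the hyperparameter tuning, a minimal sketch using a plain grid search; the grid values and the synthetic data here are purely illustrative, not the exact ones I used:

```python
import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data so the snippet runs on its own.
rng = np.random.default_rng(42)
X_tr, y_tr = rng.random((200, 5)), rng.random(200)

# An illustrative search space over common XGBoost knobs.
param_grid = {
    "max_depth": [4, 6, 8],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [300, 600],
    "subsample": [0.8, 1.0],
}

search = GridSearchCV(
    estimator=XGBRegressor(objective="reg:squarederror", random_state=42),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
    n_jobs=-1,
)
search.fit(X_tr, y_tr)
print(search.best_params_)
```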
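And finally the seed ensemble – the same model trained on different random splits, with the test predictions averaged (the seeds and data below are arbitrary examples):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Synthetic stand-ins; in the competition these were the engineered features.
rng = np.random.default_rng(0)
X, y = rng.random((200, 5)), rng.random(200)
X_test = rng.random((50, 5))

# Each seed exposes the model to a different 80% of the data;
# averaging the predictions smooths out the split-lottery luck.
preds = []
for seed in (0, 7, 42, 123, 2018):
    X_tr, _, y_tr, _ = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = XGBRegressor(objective="reg:squarederror", random_state=seed)
    model.fit(X_tr, y_tr)
    preds.append(model.predict(X_test))

final_pred = np.mean(preds, axis=0)
```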
What Do I Feel About MachineHack?
First of all – a BIG THANKS for giving us such a platform. I had never participated in any hackathon before I came across MachineHack. I am part of a Data Science and ML meetup group at my organization where we meet every week to discuss, learn and brainstorm (3 of us on the leaderboard belong to this group :)). This is where I heard about MachineHack's Beer Challenge and decided to get my hands dirty for the first time. And then there was no looking back. The first experience was marvelous. It gave me the much-needed break of solving my first real-world problem. And now I have the confidence to try my hand at bigger competitions like Kaggle. Kishan, Abhijeet and the team have always been very supportive and prompt in responding to queries. Kudos to you guys!! Looking forward to more such hackathons in the future.
Finally, thanks for giving me the opportunity to write this blog, and thanks to all of you for reading it!
Btw, the journey was more beautiful than the destination 🙂