duckladydinh's blog

By duckladydinh, history, 7 years ago, In English

Dear coders,

Last weekend, there was a Machine Learning contest on Hackerrank, namely GS CodeSprint 2018. It was my first time participating in a Machine Learning contest, so the result was horrible, but the important thing is that it was fun. During the contest, I only attempted the easy problem, Car Popularity Prediction, and failed to solve it perfectly. It is a simple multi-class classification problem: given m features, map each sample to one of 4 classes.

I saw that many people completed it perfectly, so I wonder if you could share your approaches. In my solution, I used a simple sklearn SVM with a grid search on C from 0.001 to 100. Those few lines of code got me 0.92; running it a few more times got me to 0.94, but it was not possible to reach 1.
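A minimal sketch of that baseline, assuming the training data sits in a CSV with the label in the last column (the path and column layout here are my assumptions, not the contest format):

```python
# Sketch of the SVM baseline described above: an SVC (RBF kernel by
# default) with a grid search over C only. Path and column layout
# are hypothetical.
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

train = pd.read_csv("train.csv")              # hypothetical path
X, y = train.iloc[:, :-1], train.iloc[:, -1]

grid = GridSearchCV(
    SVC(),
    param_grid={"C": np.logspace(-3, 2, 6)},  # 0.001, 0.01, ..., 100
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```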

I would be thankful if you could share what you did to solve this problem perfectly. Thank you. Besides, does anyone know how to resubmit the problem? I tried to resubmit, but the server did not accept it.

Thank you for your time and consideration. I am looking forward to hearing from you.



»
7 years ago, # |

Hi duckladydinh, sorry, I don't know the answer to your problem, but it seems like you have just started with machine learning.

Do you have any advice on how one can get started with machine learning well enough to participate in such contests? Basically, what path did you follow?

Thanks.

  • »
    »
    7 years ago, # ^ |

    As for advice, I would say it is only magic. Machine Learning to me is just like learning a lot of libraries. I once tried the course by Prof. Andrew Ng, but I do not even think it is necessary, since everything is just a few lines of code.

    For the problem I mentioned, excluding input and output handling, a simple 3-4 line SVM setup without any modification would get you more than 80%, and I did nothing else.

    • »
      »
      »
      7 years ago, # ^ |

      I pulled off some advanced distributed GAN code over the summer while knowing only a little of the math behind it, so yeah, unfortunately it is only magic, and pure engineering with some analysis included, IMO :)

»
7 years ago, # |

I used sklearn's RandomForestClassifier and was able to get a score of 1. I tuned the n_estimators and class_weight parameters: clf = RandomForestClassifier(n_estimators=98, class_weight={1: 1, 2: 20, 3: 40, 4: 40}).
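A runnable version of that setup might look like the following sketch; the data loading and column layout are assumptions, and only the classifier parameters come from the comment above.

```python
# Sketch of the random-forest setup quoted above. Path and column
# layout are hypothetical assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("train.csv")              # hypothetical path
X, y = train.iloc[:, :-1], train.iloc[:, -1]

# Heavier weights on the rarer classes counteract the skewed label
# distribution discussed elsewhere in this thread.
clf = RandomForestClassifier(
    n_estimators=98,
    class_weight={1: 1, 2: 20, 3: 40, 4: 40},
)
clf.fit(X, y)
```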

»
7 years ago, # |
Rev. 2

xgboost typically rules contests based on categorical, structured datasets, which was the case in this contest. I got a perfect score with gradient boosting and random forest with barely any tuning; a tuned SVM got a score of 97.
The reason it was difficult to score well with an SVM in this contest is that the class distribution in the train and test sets was skewed, and SVMs do not perform as well on skewed sets. Applying a data transformation to the dataset before fitting the SVM could, however, have fetched you a perfect score too. (Basic EDA revealed that the dataset was probably generated, not real.)
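For illustration, an untuned boosting baseline along those lines might look like the sketch below; this is not the commenter's actual code, and the data loading is an assumption.

```python
# Sketch of an untuned gradient-boosting baseline using xgboost's
# sklearn wrapper. Path and column layout are hypothetical; recent
# xgboost versions expect labels in 0..num_classes-1, hence the encoder.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

train = pd.read_csv("train.csv")              # hypothetical path
X, y = train.iloc[:, :-1], train.iloc[:, -1]
y = LabelEncoder().fit_transform(y)           # classes 1..4 -> 0..3

clf = XGBClassifier()                         # library defaults, no tuning
clf.fit(X, y)
```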

You can refer to my blog if you wish to get started with machine learning contests like this one.
https://threads-iiith.quora.com/Introduction-to-Competitive-Machine-Learning

  • »
    »
    7 years ago, # ^ |

    Your post is great. It is where I started, and it is what made me give these problems a try. Thank you for the great blog; I truly hope you can continue this great work.

    It is indeed unfortunate to hear that untuned xgboost still worked. I tried it at the end of the contest, but I also tried to tune its parameters with a grid search, and it was still running when the contest ended :'(.

»
7 years ago, # |

I didn't see the task or the data, but SVM is never a good first classifier. It requires a careful choice of kernel and then (in the RBF case) tuning both C and gamma simultaneously. A random forest is almost always a good baseline, and there is really no point in tuning n_estimators: just take as many trees as RAM allows.
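To make the joint search concrete, here is a sketch of an RBF grid over both C and gamma, run on synthetic stand-in data since the task's data is not available here:

```python
# Sketch: with an RBF kernel, C and gamma should be searched jointly,
# not one at a time. Synthetic data stands in for the real task.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=6, n_classes=4,
                           random_state=0)

param_grid = {
    "C": np.logspace(-3, 2, 6),
    "gamma": np.logspace(-4, 1, 6),
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```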

  • »
    »
    7 years ago, # ^ |
    Rev. 2

    Wow :O, that's new to me. I always thought that SVM was a good choice :o

    Now that I recall, it is true that SVM has "always" failed me in classification tasks :'(. Thank you.