Finally built a text classifier - Part 2
After applying Naive Bayes, I found that the accuracy was only 0.7739 (about 77%). This is the part where I was stuck for a long time. It took a lot of digging through old notes and random blogs to figure out how I could use a Support Vector Machine (SVM) based model to get better accuracy.
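For context, here is a minimal sketch of the kind of Naive Bayes pipeline I mean. I'm using the 20 newsgroups dataset as a stand-in for the actual data, and the step names ('vect', 'tfidf', 'clf') follow the usual scikit-learn tutorial convention; see the GitHub repo linked at the end for the real code.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
import numpy as np

# 20 newsgroups used here purely as a stand-in dataset
train = fetch_20newsgroups(subset='train', shuffle=True)
test = fetch_20newsgroups(subset='test', shuffle=True)

text_clf = Pipeline([
    ('vect', CountVectorizer()),    # raw text -> token counts
    ('tfidf', TfidfTransformer()),  # counts -> TF-IDF weights
    ('clf', MultinomialNB()),       # Naive Bayes classifier
])
text_clf.fit(train.data, train.target)

predicted = text_clf.predict(test.data)
print(np.mean(predicted == test.target))  # prints the test accuracy
```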
The advantages of support vector machines are:
- Effective in high dimensional spaces.
- Still effective in cases where the number of dimensions is greater than the number of samples.
- Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
- Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
The disadvantages of support vector machines include:
- If the number of features is much greater than the number of samples, avoiding over-fitting when choosing kernel functions and the regularization term is crucial.
- SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see the scikit-learn documentation on scores and probabilities).
These are some of the reasons that made me decide to go the SVM route.
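A sketch of the swap, assuming the same pipeline shape as above: scikit-learn's SGDClassifier with loss='hinge' trains a linear SVM, which is the usual choice for sparse text features. The step name 'clf-svm' and the alpha value here are illustrative, not my tuned settings.

```python
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

text_clf_svm = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    # hinge loss + L2 penalty = a linear SVM trained with SGD
    ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                              alpha=1e-3, random_state=42)),
])

# reuses the train/test split loaded in the sketch above
text_clf_svm.fit(train.data, train.target)
predicted_svm = text_clf_svm.predict(test.data)
```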
The accuracy did jump to 82%, but that still wasn't great. I was clearly missing something. That's when I came across grid search. Almost all classifiers have various parameters that can be tuned for optimal performance, and scikit-learn provides an extremely useful tool for this: GridSearchCV.
I started by creating a list of parameters to tune. Each parameter name starts with the name of the pipeline step it belongs to (remember the arbitrary names we gave them). E.g. vect__ngram_range: here we tell it to try both unigrams and bigrams and pick whichever is optimal.
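An illustrative grid, following the '<step name>__<parameter>' naming convention; vect__ngram_range is the one mentioned above, and the other two entries are typical additions rather than my exact grid:

```python
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
    'tfidf__use_idf': (True, False),         # with and without IDF weighting
    'clf__alpha': (1e-2, 1e-3),              # Naive Bayes smoothing parameter
}
```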
Next, I created an instance of the grid search by passing in the classifier, the parameters, and n_jobs=-1, which tells it to use all available cores on the machine.
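Sketched with the grid above (gs_clf is an illustrative name, and text_clf is the Naive Bayes pipeline from earlier):

```python
from sklearn.model_selection import GridSearchCV

# n_jobs=-1 spreads the search across all available cores
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(train.data, train.target)

print(gs_clf.best_score_)   # best cross-validated accuracy
print(gs_clf.best_params_)  # the winning parameter combination
```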
*Right about now would be a good time to learn how to embed Jupyter notebook code into a blog from GitHub, but I am going to save that for later! You can follow along in the GitHub repo linked below.*
I was able to increase the accuracy of the Naive Bayes classifier to 90.6% and the SVM to 89%. Ah, the sweet taste of success!
github: https://github.com/jayeetaroy/Text-analytics-Python