Finally built a text classifier - Part 2

After applying Naive Bayes, I found that the accuracy was only about 77.39% (0.7739). This is the part where I was stuck for a long time. It took a lot of digging through old notes and random blogs to figure out how I could use a Support Vector Machine based model to get better accuracy.
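Since this picks up from Part 1, here is a minimal sketch of the kind of pipeline that produced that number, assuming the 20 newsgroups dataset and the step names 'vect', 'tfidf' and 'clf' (the dataset and names are my assumptions here, not necessarily the exact ones from Part 1):

    # Sketch of a Part 1-style setup; dataset and step names are assumptions.
    import numpy as np
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    train = fetch_20newsgroups(subset='train', shuffle=True)
    test = fetch_20newsgroups(subset='test', shuffle=True)

    text_clf = Pipeline([
        ('vect', CountVectorizer()),    # raw text -> token counts
        ('tfidf', TfidfTransformer()),  # counts -> tf-idf weights
        ('clf', MultinomialNB()),       # the Naive Bayes classifier
    ])
    text_clf.fit(train.data, train.target)

    predicted = text_clf.predict(test.data)
    print(np.mean(predicted == test.target))  # mean accuracy on the test set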

The advantages of support vector machines are:
  1. Effective in high dimensional spaces.
  2. Still effective in cases where the number of dimensions is greater than the number of samples.
  3. Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
  4. Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.


The disadvantages of support vector machines include:
  1. If the number of features is much greater than the number of samples, avoiding over-fitting when choosing kernel functions and the regularization term is crucial.
  2. SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
These are some of the reasons that made me decide to go the SVM route.
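In scikit-learn, one common way to drop a linear SVM into this kind of pipeline is SGDClassifier with hinge loss. This is a sketch of that swap under the same assumptions as above; the hyperparameter values are illustrative, not necessarily what I used:

    # Swap the final step for a linear SVM trained with SGD (hinge loss).
    # train/test are the objects from the earlier sketch.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline

    text_clf_svm = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                  alpha=1e-3, random_state=42)),
    ])
    text_clf_svm.fit(train.data, train.target)

    predicted_svm = text_clf_svm.predict(test.data)
    print(np.mean(predicted_svm == test.target))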

The accuracy did jump to 82%, but it still wasn't that great. I was clearly missing something. That's when I came across grid search. Almost all classifiers have various parameters that can be tuned to obtain optimal performance, and scikit-learn provides an extremely useful tool for this: GridSearchCV.

I started by creating a list of parameters to tune. Each parameter name starts with the name of the pipeline step it belongs to (remember the arbitrary names we gave the steps), joined by a double underscore, e.g. vect__ngram_range; here we tell the search to try both unigrams and bigrams and keep whichever is optimal.
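As a sketch, a grid for the Naive Bayes pipeline above might look like this (the specific values are illustrative, not necessarily the ones I tuned):

    # Each key is '<step name>__<parameter>', reaching into that pipeline step.
    parameters = {
        'vect__ngram_range': [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
        'tfidf__use_idf': (True, False),        # with and without idf re-weighting
        'clf__alpha': (1e-2, 1e-3),             # MultinomialNB smoothing
    }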

Next, I created an instance of the grid search by passing it the classifier, the parameters, and n_jobs=-1, which tells scikit-learn to use all available cores on the machine.
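A sketch of that step, reusing the pipeline and grid from above:

    from sklearn.model_selection import GridSearchCV

    # n_jobs=-1 parallelises the search across all available cores.
    gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
    gs_clf = gs_clf.fit(train.data, train.target)

    print(gs_clf.best_score_)   # best cross-validated accuracy found
    print(gs_clf.best_params_)  # the parameter combination that won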

(Right about now would be a good time to learn how to embed Jupyter notebook code into a blog from GitHub, but I am going to save that for later! You can follow along at the GitHub link below.)

I was able to increase the accuracy of the Naive Bayes classifier to 90.6% and the SVM to 89%. Ah, the sweet taste of success!

GitHub: https://github.com/jayeetaroy/Text-analytics-Python
