Finally built a text classifier - Part 2

After applying Naive Bayes, I found that the accuracy was only about 77.39% (0.7739). This is the part where I was stuck for a long time. It took a lot of digging through old notes and random blogs to figure out how I could use a Support Vector Machine based model to get better accuracy.
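Since this picks up from Part 1, here is a minimal sketch of the kind of pipeline that produced that number, assuming the 20 newsgroups dataset and the step names 'vect', 'tfidf' and 'clf' (the dataset and names are my assumptions here, not necessarily the exact ones from Part 1):

    # Sketch of a Part 1-style setup; dataset and step names are assumptions.
    import numpy as np
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    train = fetch_20newsgroups(subset='train', shuffle=True)
    test = fetch_20newsgroups(subset='test', shuffle=True)

    text_clf = Pipeline([
        ('vect', CountVectorizer()),    # raw text -> token counts
        ('tfidf', TfidfTransformer()),  # counts -> tf-idf weights
        ('clf', MultinomialNB()),       # the Naive Bayes classifier
    ])
    text_clf.fit(train.data, train.target)

    predicted = text_clf.predict(test.data)
    print(np.mean(predicted == test.target))  # mean accuracy on the test set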

The advantages of support vector machines are:
  1. Effective in high dimensional spaces.
  2. Still effective in cases where the number of dimensions is greater than the number of samples.
  3. Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
  4. Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.


The disadvantages of support vector machines include:
  1. If the number of features is much greater than the number of samples, avoiding over-fitting when choosing kernel functions and the regularization term is crucial.
  2. SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
These are some of the reasons that made me decide to go the SVM route.
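In scikit-learn, one common way to drop a linear SVM into this kind of pipeline is SGDClassifier with hinge loss. This is a sketch of that swap under the same assumptions as above; the hyperparameter values are illustrative, not necessarily what I used:

    # Swap the final step for a linear SVM trained with SGD (hinge loss).
    # train/test are the objects from the earlier sketch.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline

    text_clf_svm = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                  alpha=1e-3, random_state=42)),
    ])
    text_clf_svm.fit(train.data, train.target)

    predicted_svm = text_clf_svm.predict(test.data)
    print(np.mean(predicted_svm == test.target))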

The accuracy did jump to 82%, but it still wasn't that great. I was clearly missing something. That's when I came across grid search. Almost all classifiers have various parameters that can be tuned to obtain optimal performance, and scikit-learn provides an extremely useful tool for this: GridSearchCV.

I started by creating a list of parameters to tune. Each parameter name starts with the name of the pipeline step it belongs to (remember the arbitrary names we gave the steps), joined by a double underscore, e.g. vect__ngram_range; here we tell the search to try both unigrams and bigrams and keep whichever is optimal.
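As a sketch, a grid for the Naive Bayes pipeline above might look like this (the specific values are illustrative, not necessarily the ones I tuned):

    # Each key is '<step name>__<parameter>', reaching into that pipeline step.
    parameters = {
        'vect__ngram_range': [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
        'tfidf__use_idf': (True, False),        # with and without idf re-weighting
        'clf__alpha': (1e-2, 1e-3),             # MultinomialNB smoothing
    }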

Next, I created an instance of the grid search by passing it the classifier, the parameters, and n_jobs=-1, which tells scikit-learn to use all available cores on the machine.
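A sketch of that step, reusing the pipeline and grid from above:

    from sklearn.model_selection import GridSearchCV

    # n_jobs=-1 parallelises the search across all available cores.
    gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
    gs_clf = gs_clf.fit(train.data, train.target)

    print(gs_clf.best_score_)   # best cross-validated accuracy found
    print(gs_clf.best_params_)  # the parameter combination that won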

(Right about now would be a good time to learn how to embed Jupyter notebook code into a blog from GitHub, but I am going to save that for later! You can follow along at the GitHub link below.)

I was able to increase the accuracy of the Naive Bayes classifier to 90.6% and the SVM to 89%. Ah, the sweet taste of success!

GitHub: https://github.com/jayeetaroy/Text-analytics-Python
