from collections import Counter
print(f"Training class distributions summary: {Counter(y_train)}")
print(f"Test class distributions summary: {Counter(y_test)}")
Training class distributions summary: Counter({2: 593, 1: 584, 0: 480, 3: 377})
Test class distributions summary: Counter({2: 394, 1: 389, 0: 319, 3: 251})
The usual scikit-learn pipeline
You would typically build a scikit-learn pipeline combining a TF-IDF vectorizer with a multinomial naive Bayes classifier. A classification report then summarizes the results on the test set.
As expected, the recall of class #3 is low, mainly due to the class imbalance.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
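A minimal sketch of the evaluation step, assuming scikit-learn's classification_report (classification_report_imbalanced from imblearn.metrics would work equally well):

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 on the held-out test set
print(classification_report(y_test, y_pred))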
To improve the prediction of class #3, it could be interesting to balance the classes before training the naive Bayes classifier. Therefore, we will use a RandomUnderSampler to equalize the number of samples in all the classes before training.
It is also important to note that we are using the make_pipeline function implemented in imbalanced-learn, which correctly handles the sampler inside the pipeline (applying it during fit only, not during predict); see the sketch below.
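A minimal sketch of the balanced pipeline, assuming the RandomUnderSampler defaults (the random_state value is set only for reproducibility):

from imblearn.pipeline import make_pipeline as make_pipeline_imb
from imblearn.under_sampling import RandomUnderSampler

# TF-IDF features are computed first; the sampler then under-samples
# every class down to the minority-class count before the classifier is fit
model = make_pipeline_imb(
    TfidfVectorizer(),
    RandomUnderSampler(random_state=0),
    MultinomialNB(),
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)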
Although the two reports are broadly similar, the resampling corrected the poor recall of class #3 at the cost of slightly lower metrics for the other classes. Overall, the results are slightly better.