您当前的位置：首页 > IT编程 > 自然语言处理
\| C语言 \| Java \| VB \| VC \| python \| Android \| TensorFlow \| C++ \| oracle \| 学术与代码 \| cnn卷积神经网络 \| gnn \| 图像修复 \| Keras \| 数据集 \| Neo4j \| 自然语言处理 \| 深度学习 \| 医学CAD \| 医学影像 \| 超参数 \| pointnet \| pytorch \| 异常检测 \| Transformers \|

自学教程：处理数据不平衡（文本分类）

51自学网 2023-10-31 14:45:59

自然语言处理

这篇教程处理数据不平衡（文本分类）写得很实用，希望能帮到您。

Example of topic classification in text documents

This example shows how to balance the text data before to train a classifier.

Note that for this example, the data are slightly imbalanced but it can happen that for some data sets, the imbalanced ratio is more significant.

# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>
# License: MIT

print(__doc__)

Setting the data set

We use a part of the 20 newsgroups data set by loading 4 topics. Using the scikit-learn loader, the data are split into a training and a testing set.

Note the class #3 is the minority class and has almost twice less samples than the majority class.

from sklearn.datasets import fetch_20newsgroups

categories = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space",
]
newsgroups_train = fetch_20newsgroups(subset="train", categories=categories)
newsgroups_test = fetch_20newsgroups(subset="test", categories=categories)

X_train = newsgroups_train.data
X_test = newsgroups_test.data

y_train = newsgroups_train.target
y_test = newsgroups_test.target

from collections import Counter

print(f"Training class distributions summary: {Counter(y_train)}")
print(f"Test class distributions summary: {Counter(y_test)}")

Training class distributions summary: Counter({2: 593, 1: 584, 0: 480, 3: 377})
Test class distributions summary: Counter({2: 394, 1: 389, 0: 319, 3: 251})

The usual scikit-learn pipeline

You might usually use scikit-learn pipeline by combining the TF-IDF vectorizer to feed a multinomial naive bayes classifier. A classification report summarized the results on the testing set.

As expected, the recall of the class #3 is low mainly due to the class imbalanced.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

from imblearn.metrics import classification_report_imbalanced

print(classification_report_imbalanced(y_test, y_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.67      0.94      0.86      0.79      0.90      0.82       319
          1       0.96      0.92      0.99      0.94      0.95      0.90       389
          2       0.87      0.98      0.94      0.92      0.96      0.92       394
          3       0.97      0.36      1.00      0.52      0.60      0.33       251

avg / total       0.87      0.84      0.94      0.82      0.88      0.78      1353

Balancing the class before classification

To improve the prediction of the class #3, it could be interesting to apply a balancing before to train the naive bayes classifier. Therefore, we will use a RandomUnderSampler to equalize the number of samples in all the classes before the training.

It is also important to note that we are using the make_pipeline function implemented in imbalanced-learn to properly handle the samplers.

from imblearn.pipeline import make_pipeline as make_pipeline_imb

from imblearn.under_sampling import RandomUnderSampler

model = make_pipeline_imb(TfidfVectorizer(), RandomUnderSampler(), MultinomialNB())

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Although the results are almost identical, it can be seen that the resampling allowed to correct the poor recall of the class #3 at the cost of reducing the other metrics for the other classes. However, the overall results are slightly better.

print(classification_report_imbalanced(y_test, y_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.70      0.91      0.88      0.79      0.89      0.80       319
          1       0.98      0.85      0.99      0.91      0.92      0.83       389
          2       0.94      0.89      0.98      0.91      0.93      0.86       394
          3       0.80      0.73      0.96      0.76      0.84      0.68       251

avg / total       0.87      0.85      0.95      0.85      0.90      0.80      1353

Total running time of the script: ( 0 minutes 9.356 seconds)

Estimated memory usage: 126 MB

Download Python source code: plot_topic_classication.py

返回列表
tokenizer.tokenize(), tokenizer.encode() , tokenizer.encode_plus() 方法介绍及其区别