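I recently needed to handle imbalanced data and found a Python toolkit for it online. Official site: https://imbalanced-learn.org/stable/index.html
imbalanced-learn (imported as imblearn) is a Python library dedicated to imbalanced data. It provides over-sampling, under-sampling, and combined sampling algorithms, including SMOTE, SMOTEENN, ADASYN, and KMeansSMOTE.
Step 1: check the environment requirements for installation.
Step 2: in Anaconda, create a virtual environment. The name is up to you (I used imbalance); select Python 3.6.
Step 3: install the imbalanced-learn package with either pip or conda. The official instructions: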
imbalanced-learn is currently available on the PyPi’s repositories and you can install it via pip:
pip install -U imbalanced-learn
The package is also released on the Anaconda Cloud platform:
conda install -c conda-forge imbalanced-learn
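Some dependency packages are matched and downloaded automatically. Anything still missing has to be downloaded or updated by hand (for example with Apply in Anaconda). Make sure every version meets the requirements of imbalanced-learn.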
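One way to check installed versions against the requirements is to compare them numerically rather than as text (as plain strings, "0.9" would sort after "0.10"). The sketch below is only an illustration: the helper names parse_version, meets_minimum and find_outdated are made up for this post, and the minimum versions in the usage lines are placeholders that should be replaced with the values listed on the official install page.

```python
import re


def parse_version(text):
    """Turn a version string such as '0.24.2' into a tuple of ints, (0, 24, 2)."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '0.24' and '0.24.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def find_outdated(installed, minimums):
    """Return the sorted names that are missing or below their minimum version."""
    return sorted(
        name
        for name, minimum in minimums.items()
        if name not in installed or not meets_minimum(installed[name], minimum)
    )


# Placeholder minimums (assumptions): replace with the official requirements.
minimums = {"numpy": "1.13.3", "scipy": "0.19.1", "scikit-learn": "0.24"}
installed = {"numpy": "1.15.4", "scipy": "1.1.0", "scikit-learn": "0.18.1"}
print("Needs updating:", find_outdated(installed, minimums))
```

The installed dictionary can be filled from the __version__ attributes of the imported packages.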
Step 4: test
I used the code from https://blog.csdn.net/u010654299/article/details/103980964:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=2, n_redundant=0, flip_y=0,
                           n_features=2, n_clusters_per_class=1, n_samples=100, random_state=10)
print('Original dataset shape %s' % Counter(y))
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
This raises an error, sad (⊙︿⊙)
AttributeError: 'SMOTE' object has no attribute 'fit_resample'
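Step 5: debug
To solve this, I ran each step separately to inspect its result and then searched for the cause based on the error message. After going through many proposed fixes, I found one that works at https://stackoverflow.com/questions/57466592/randomundersampler-object-has-no-attribute-fit-resample?noredirect=1
The fix is to either update the imbalanced-learn package or replace "fit_resample" with "fit_sample".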
#The method fit_resample was introduced lately to imbalanced-learn API. Either update imbalanced-learn or use fit_sample instead.
#updating the scikit-learn version to 0.23.1
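Since the quoted answer implies that the available method name depends on the installed version, one option is a small wrapper that calls whichever of the two exists. This is only a sketch: resample_compat is a hypothetical helper name, not part of imbalanced-learn, and it assumes the sampler exposes either fit_resample or fit_sample with an (X, y) signature.

```python
def resample_compat(sampler, X, y):
    """Resample with fit_resample if the sampler has it, else fall back to fit_sample."""
    method = getattr(sampler, "fit_resample", None)
    if method is None:
        # Older name mentioned in the Stack Overflow answer.
        method = getattr(sampler, "fit_sample", None)
    if method is None:
        raise AttributeError(
            "%s has neither fit_resample nor fit_sample" % type(sampler).__name__
        )
    return method(X, y)
```

With the test code above, the call would become X_res, y_res = resample_compat(sm, X, y). Updating the package is still the cleaner fix, because a wrapper cannot supply classes that the old version lacks.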
(1) I first tried replacing "fit_resample" with "fit_sample", but that does not fix everything:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=2, n_redundant=0, flip_y=0,
                           n_features=2, n_clusters_per_class=1, n_samples=100, random_state=9)
print('Original dataset shape %s' % Counter(y))
sm = BorderlineSMOTE(random_state=42,kind="borderline-1")
X_res, y_res = sm.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
This still fails, this time at the import:
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-25-4c5cdbf10c4f> in <module>
1 from collections import Counter
2 from sklearn.datasets import make_classification
----> 3 from imblearn.over_sampling import BorderlineSMOTE
4 X, y = make_classification(n_classes=2, class_sep=2,
5 weights=[0.1, 0.9], n_informative=2, n_redundant=0, flip_y=0,
ImportError: cannot import name 'BorderlineSMOTE'
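The traceback shows that the installed imblearn.over_sampling does not contain BorderlineSMOTE at all, so renaming the method cannot help here. A small check like the sketch below tells whether a name is available before it is used. module_has is a hypothetical helper name, not part of imbalanced-learn.

```python
import importlib


def module_has(module_name, attribute):
    """Return True if the module can be imported and defines the given attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False  # the module itself is not installed
    return hasattr(module, attribute)


# Check the sampler classes used in this post.
for name in ("SMOTE", "BorderlineSMOTE"):
    print(name, module_has("imblearn.over_sampling", name))
```

If a class is reported as missing, the installed package is too old for the code being run and needs an upgrade.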
(2) In Jupyter, check the versions of the packages installed in the imbalance virtual environment with the following code:
# python
import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import sklearn; print("Scikit-Learn", sklearn.__version__)
import imblearn; print("imblearn", imblearn.__version__)
Output:
Windows-10-10.0.18362-SP0
Python 3.6.13 |Anaconda, Inc.| (default, Mar 16 2021, 11:37:27) [MSC v.1916 64 bit (AMD64)]
NumPy 1.15.4
SciPy 1.1.0
Scikit-Learn 0.18.1
imblearn 0.3.1
It turned out that both scikit-learn and imbalanced-learn had newer releases available. scikit-learn could not be upgraded directly in Anaconda, so I updated both packages with pip:
pip install --upgrade scikit-learn
pip install --upgrade imbalanced-learn
After updating:
Windows-10-10.0.18362-SP0
Python 3.6.13 |Anaconda, Inc.| (default, Mar 16 2021, 11:37:27) [MSC v.1916 64 bit (AMD64)]
NumPy 1.15.4
SciPy 1.1.0
Scikit-Learn 0.24.2
imblearn 0.8.0
Remember to change every call that was switched to "fit_sample" back to "fit_resample".
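Once the test code runs again, a quick sanity check is to confirm that the resampled labels are actually balanced. The sketch below does this with Counter. class_counts_equal is a hypothetical helper name, and the check assumes the sampler is meant to bring every class to the same count, as SMOTE does with its default settings.

```python
from collections import Counter


def class_counts_equal(labels):
    """Return True if every class appears the same number of times."""
    counts = Counter(labels)
    return len(set(counts.values())) <= 1


# Toy labels with the same 10/90 split as the test data in this post.
before = [0] * 10 + [1] * 90
after = [0] * 90 + [1] * 90
print(Counter(before), class_counts_equal(before))
print(Counter(after), class_counts_equal(after))
```

In the test script this would be called as class_counts_equal(y_res) after fit_resample.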
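Update: I am collecting datasets and code related to imbalanced data here, for hands-on practice:
Python toolkit for imbalanced data, official site: https://imbalanced-learn.org/stable/index.html
SMOTE, Borderline SMOTE, and ADASYN for imbalanced data, explained with Python usage: https://blog.csdn.net/u010654299/article/details/103980964
Over-sampling and under-sampling with Python sklearn: https://blog.csdn.net/qq_27802435/article/details/81201357
UCI datasets, including imbalanced ones: http://archive.ics.uci.edu/ml/index.php
Computational intelligence code: https://www.is.ovgu.de/Research/Codes.html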