Quora Duplicate Question Pairs Classification
There are several ways to approach duplicate-question classification on the Quora dataset, such as TF-IDF features, deep learning models, and Transformers. I try two of them here, TF-IDF + XGBoost Classifier and Transformers + XGBoost Classifier, and select the model with the better accuracy and other metrics.
Method 1 : TF-IDF and XGBoost Classifier
Successfully installed anyascii-0.3.2 contractions-0.1.73 pyahocorasick-2.0.0 textsearch-0.0.24
Number of records: 404287
| | question1 | question2 | is_duplicate |
| 0 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |
| | question1 | question2 | is_duplicate |
| 0 | what is the step by step guide to invest in sh... | what is the step by step guide to invest in sh... | 0 |
| 1 | what is the story of kohinoor kohinoor diamond | what would happen if the indian government sto... | 0 |
| 2 | how can increase the speed of my internet conn... | how can internet speed be increased by hacking... | 0 |
| 3 | why am mentally very lonely how can solve it | find the remainder when mathmath is divided by | 0 |
| 4 | which one dissolve in water quikly sugar salt ... | which fish would survive in salt water | 0 |
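The cleaning code itself is not shown, but comparing the two tables suggests what it does: text is lowercased, anything that is not a letter or whitespace is deleted, and single-letter words such as "I" are dropped. The sketch below reproduces that behaviour under those assumptions. The function name clean_text and the small CONTRACTIONS map are hypothetical stand-ins; the notebook installs the contractions package, which presumably handles contraction expansion there.

```python
import re

# Hypothetical, deliberately tiny contraction map for illustration only.
# The notebook installs the `contractions` package, which covers far more cases.
CONTRACTIONS = {
    "what's": "what is",
    "can't": "cannot",
    "don't": "do not",
    "i'm": "i am",
}


def clean_text(text: str) -> str:
    """Lowercase, expand a few contractions, keep letters only, drop 1-letter words."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Delete every character that is not a lowercase letter or whitespace.
    # Characters are deleted, not replaced by a space, so "koh-i-noor" -> "kohinoor".
    text = re.sub(r"[^a-z\s]", "", text)
    # Drop single-letter tokens and collapse repeated whitespace.
    tokens = [tok for tok in text.split() if len(tok) > 1]
    return " ".join(tokens)


print(clean_text("Why am I mentally very lonely? How can I solve it?"))
print(clean_text("Which fish would survive in salt water?"))
```

Note that deleting digits and symbols turns [math]23^{24}[/math] into the token mathmath, which is visible in row 3 of the cleaned table.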
0 255024
1 149263
Name: is_duplicate, dtype: int64
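The label counts above show that the classes are imbalanced, which matters when reading the accuracy figures later. The short calculation below uses only the two counts printed above to work out the share of duplicates and the accuracy a trivial "always predict 0" model would reach. The variable names are illustrative.

```python
# Label counts as printed by value_counts() above.
counts = {0: 255024, 1: 149263}

total = sum(counts.values())
duplicate_share = counts[1] / total
# Accuracy of a model that always predicts the majority class (not duplicate).
majority_baseline = max(counts.values()) / total

print(f"total pairs:       {total}")
print(f"duplicate share:   {duplicate_share:.3f}")
print(f"majority baseline: {majority_baseline:.3f}")
```

Roughly 37% of the pairs are duplicates, so any model should be judged against a baseline accuracy of about 0.63 rather than 0.5.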
Training data shape: (283000, 3)
Dev data shape: (121287, 3)
((283000, 82854), (121287, 82854))
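Both matrices above have the same number of columns (82854) because the TF-IDF vocabulary is learned on the training data and then reused for the dev data. The notebook's code is not shown, so the following is only a minimal sketch of that pattern with scikit-learn's TfidfVectorizer. Joining the two questions into one string and the toy question pairs are assumptions made for illustration, not the author's confirmed method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy, hypothetical question pairs standing in for the cleaned train/dev data.
train_pairs = [
    ("how can find job", "what is the fastest way to find job"),
    ("which fish would survive in salt water", "which fish live in fresh water"),
    ("why am mentally very lonely how can solve it",
     "find the remainder when mathmath is divided by"),
]
dev_pairs = [
    ("what is the best way to find job", "how can find job quickly"),
]


def join_pair(q1: str, q2: str) -> str:
    # Assumption: both questions are concatenated into a single document.
    return q1 + " " + q2


vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights on the training pairs only.
X_train = vectorizer.fit_transform([join_pair(a, b) for a, b in train_pairs])
# Reuse the fitted vocabulary for the dev pairs; unseen words are ignored.
X_dev = vectorizer.transform([join_pair(a, b) for a, b in dev_pairs])

print(X_train.shape, X_dev.shape)
```

The resulting sparse matrices can be passed directly to a classifier such as XGBoost's XGBClassifier, which accepts SciPy sparse input.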
Accuracy score: 0.7493383462366123
Classification Report:
              precision    recall  f1-score   support
           0       0.76      0.89      0.82     76508
           1       0.73      0.52      0.60     44779
    accuracy                           0.75    121287
   macro avg       0.74      0.70      0.71    121287
weighted avg       0.75      0.75      0.74    121287
With TF-IDF features the accuracy is only about 0.75, and the F1 score for the duplicate class is just 0.60, largely because its recall is 0.52.
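To read the last two rows of the classification report, it helps to see how they follow from the per-class rows. The sketch below recomputes the macro and weighted average F1 from the rounded per-class F1 scores and supports printed in the report. Because the inputs are already rounded to two decimals, this is a consistency check rather than an exact recomputation.

```python
# Per-class F1 scores and supports copied from the TF-IDF classification report.
f1 = {0: 0.82, 1: 0.60}
support = {0: 76508, 1: 44779}

# Macro average: plain mean over classes, every class counts equally.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: mean over classes weighted by the number of true samples.
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.71 0.74
```

The weighted average is pulled toward the majority (non-duplicate) class, so the macro average is the more telling figure when the duplicate class is the one of interest.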
Method 2: Transformers and XGBoost Classifier
We will use a CrossEncoder to train on the question pairs. Unlike a Siamese (bi-encoder) BERT network, which embeds each sentence separately, a CrossEncoder feeds both sentences of a pair through the Transformer together and outputs a single value between 0 and 1 indicating how similar they are. A classifier can then be trained on these similarity scores to predict the labels.
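As a rough illustration of how question pairs are handed to a CrossEncoder, here is a minimal sketch using the sentence-transformers package. The helper names make_pairs and score_pairs are hypothetical, and the training step is omitted. The model name distilroberta-base is taken from the loading messages in the notebook output. With num_labels=1 the library applies a sigmoid by default, so scores fall between 0 and 1, but they are only meaningful once the model has been fine-tuned on labelled pairs.

```python
def make_pairs(rows):
    """Turn (question1, question2, ...) rows into the list of text pairs to score."""
    return [(row[0], row[1]) for row in rows]


def score_pairs(pairs, model_name="distilroberta-base"):
    """Score each pair with a CrossEncoder (downloads the checkpoint on first use)."""
    # Imported lazily so make_pairs can be used without the library installed.
    from sentence_transformers import CrossEncoder

    # num_labels=1 yields one output per pair. An untrained classification
    # head gives arbitrary scores, so fine-tune before trusting the values.
    model = CrossEncoder(model_name, num_labels=1)
    return model.predict(pairs)


rows = [
    ("How can I find a job?", "What is the fastest way to find a job?", 1),
    ("What are the best Photo startups?", "What is a good photo?", 0),
]
pairs = make_pairs(rows)
print(pairs)
# scores = score_pairs(pairs)  # uncomment to run the model
```

Because both questions pass through the model together, every pair needs its own forward pass. That is slower than comparing precomputed embeddings but usually more accurate for pair classification.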
Successfully installed sentence-transformers-2.2.2
Number of records: 404287
| | question1 | question2 | is_duplicate |
| 0 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |
Train data shape: (283000, 3)
Dev data shape: (121287, 3)
Number of train samples: 283000
Number of dev samples: 121287
Loading the distilroberta-base checkpoint into RobertaForSequenceClassification produces the usual warning: the language-model head and pooler weights (lm_head.*, roberta.pooler.*) are not used, and the classification head (classifier.dense, classifier.out_proj) is newly initialized, so the model has to be trained on the downstream task before its predictions are meaningful.
array([9.7388941e-01, 6.3137675e-04, 3.7562340e-01, ..., 8.5639851e-03,
4.2043296e-03, 9.1427000e-04], dtype=float32)
| | question1 | question2 | is_duplicate | cosine_sim |
| 0 | Which are the best books for iit jam mathematics? | What are the best books for IIT-JAM mathematics? | 1 | 0.973889 |
| 1 | What are the best Photo startups? | What is a good photo? | 0 | 0.000631 |
| 2 | How can I find a job? | What is the fastest way to find a job? | 1 | 0.375623 |
| 3 | Which water purifier company provide best afte... | I am planning to purchase water purifier, whic... | 1 | 0.896129 |
| 4 | Can I add an app to my Vizio smart TV? | Are there any YouTube like apps for sharing on... | 0 | 0.000396 |
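We can see from the plots that the majority of the duplicate questions have a similarity score in the range 0.8 to 0.99, while the majority of the non-duplicate questions have a score in the range 0 to 0.1. (The column is named cosine_sim, but the value is the CrossEncoder's output score rather than a cosine similarity between two embeddings.)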
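Since the scores separate the two classes this clearly, the final classifier mostly has to learn where to cut the score axis. The sketch below is a deliberately simple stand-in for that step, not the XGBoost classifier used in this notebook. It applies a fixed threshold to the five sample rows shown in the table above and reports the accuracy at two cut-offs. The function names are illustrative.

```python
# (score, is_duplicate) for the five sample rows in the table above.
rows = [
    (0.973889, 1),
    (0.000631, 0),
    (0.375623, 1),
    (0.896129, 1),
    (0.000396, 0),
]


def predict(score: float, threshold: float) -> int:
    """Label a pair as duplicate (1) when its score reaches the threshold."""
    return int(score >= threshold)


def accuracy(data, threshold: float) -> float:
    hits = sum(predict(score, threshold) == label for score, label in data)
    return hits / len(data)


print(accuracy(rows, 0.5))  # 0.8: row 2 (score 0.375623) is a missed duplicate
print(accuracy(rows, 0.3))  # 1.0 on these five rows
```

Five rows are far too few to choose a cut-off. The point is only that the best boundary need not be 0.5, which is one reason to fit a classifier on the scores instead of hard-coding a threshold.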
Accuracy Score: 0.8981918919587425
Classification Report:
              precision    recall  f1-score   support
           0       0.93      0.91      0.92     76508
           1       0.85      0.88      0.86     44779
    accuracy                           0.90    121287
   macro avg       0.89      0.89      0.89    121287
weighted avg       0.90      0.90      0.90    121287
Accuracy rises from about 0.75 to about 0.90, and the F1 score for the duplicate class rises from 0.60 to 0.86, so this approach is far better than the TF-IDF one.
| | test_id | question1 | question2 |
| 0 | 0 | How does the Surface Pro himself 4 compare wit... | Why did Microsoft choose core m3 and not core ... |
| 1 | 1 | Should I have a hair transplant at age 24? How... | How much cost does hair transplant require? |
| 2 | 2 | What but is the best way to send money from Ch... | What you send money to China? |
| 3 | 3 | Which food not emulsifiers? | What foods fibre? |
| 4 | 4 | How "aberystwyth" start reading? | How their can I start reading? |
array([2.5094922e-03, 1.6457531e-01, 5.1467196e-04, ..., 4.2800361e-04,
8.1884366e-04, 9.4690728e-01], dtype=float32)
| | test_id | is_duplicate |
| 0 | 0 | 0 |
| 1 | 1 | 0 |
| 2 | 2 | 0 |
| 3 | 3 | 0 |
| 4 | 4 | 0 |
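The submission table pairs each test_id with a 0/1 label derived from the test-set scores. The code for that step is not shown, so this is only a minimal sketch of producing such a file with the standard library. The 0.5 cut-off is an assumption standing in for the trained classifier, and only the first three scores printed in the array above are used, on the assumption that they correspond to test_id 0 to 2.

```python
import csv
import io

# First three test-set scores from the array printed above.
test_ids = [0, 1, 2]
scores = [2.5094922e-03, 1.6457531e-01, 5.1467196e-04]

# Assumption: a fixed 0.5 cut-off stands in for the trained classifier.
labels = [int(score >= 0.5) for score in scores]

# Write to an in-memory buffer; swap in open("submission.csv", "w", newline="")
# to create a real file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["test_id", "is_duplicate"])
writer.writerows(zip(test_ids, labels))

submission_csv = buffer.getvalue()
print(submission_csv)
```

All three of these scores are far below the cut-off, which is consistent with the zeros in the first rows of the submission table.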