
Self-study tutorial: a detailed look at BERT

51自学网 2023-12-25 20:18:35


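Author: TRiddle
Link: https://www.zhihu.com/question/510738704/answer/2671000185
Source: Zhihu
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source.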

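Programming environment

This article uses python3 and the deep learning library tensorflow2.4. Not familiar with python or tf2.4? No problem, now is a good time to learn them :). Once the imports below have run, the implementation begins.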

from typing import List, Optional, Tuple
import numpy as np
import random
import tensorflow as tf

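Overall framework

We first implement a few key components of BERT, then use them like Lego bricks to assemble the skeleton of BERT.

Implementing the model components

In this section we first implement the self-attention layer and the point-wise feed-forward network layer, then combine them into an encoder layer, and finally stack several encoder layers (for example, encoder #1 and #2 in the original figure) into the core structure of BERT, the Transformer Encoder.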

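Attention layer

Put simply, an attention mechanism computes a weighted sum of a set of vectors. The weight of each vector is determined by a "query vector": the better a vector matches the query, the larger its weight.

A special case is self-attention. For each token in a sentence, it scores every token of the sentence against that token, computes the weighted sum of all tokens' vector representations, and uses the result as the token's new representation.

The encoder in the transformer relies on self-attention so that every token in the input sequence can see the other tokens and combine its own meaning with theirs into a new meaning.

For example, by taking a weighted sum of the vector representations of Thinking and Machines, Thinking becomes aware of Machines and merges the two meanings into the meaning of the phrase Thinking Machines.

Below we first implement the general attention mechanism and then the special case of self-attention.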

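The key inputs of attention are three tensors: query, key and value. query holds the query vectors mentioned above; key and value hold, for the vectors being queried, their matching representation (scored against the query) and their value representation (used in the weighted sum).

First we compute how well each query matches each queried vector; attention uses the dot product for this. Suppose query contains m vectors of dimension f, key contains n vectors of dimension f, and value contains n vectors of dimension f. The pairwise dot products between query and key form an m-by-n matrix (the attention matrix). Taking the weighted sum of value according to the attention matrix then yields m vectors of dimension f.

One detail that cannot be skipped: attention uses a tensor called the mask to cover parts of the attention matrix. It is needed when sequences are padded or when building a unidirectional language model; we come back to it later.

All of the above can be written as one formula: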

Attention(Q, K, V, M) = softmax_k(\frac{QK^T + M}{\sqrt{d_k}})V

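Here Q, K and V are three-dimensional tensors (the description above omits the first dimension), i.e. the query, key and value tensors; their three dimensions are batch, sequence and feature (or hidden_state). M is a three-dimensional tensor made of mask matrices: the first dimension is batch and the last two form the mask matrix. $d_k$ is the vector dimension f mentioned earlier, and $softmax_k$ is the softmax taken over the key dimension, which normalizes the matching scores. In the code below the mask is passed as a 0/1 tensor and multiplied by -1e9 after the scaling step, so masked positions end up with a weight of practically zero.

A heads-up: the multi-head attention introduced later uses four-dimensional Q, K, V and M.

Code: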

def scaled_dot_product_attention(q, k, v, mask=None):
    # Compute the matching scores
    matmul_qk = tf.matmul(q, k, transpose_b=True)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_dot = matmul_qk / tf.math.sqrt(dk)

    # Push masked positions towards -inf so that softmax gives them ~0 weight
    if mask is not None:
        scaled_attention_dot += (mask * -1e9)

    # Normalize, then aggregate
    attention_weights = tf.nn.softmax(scaled_attention_dot, axis=-1)
    output = tf.matmul(attention_weights, v)

    # Return the attended output and the attention matrix
    return output, attention_weights
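
To check the shapes described above without TensorFlow, here is a minimal NumPy sketch of the same computation for a single sample (no batch axis). The function name attention_np and the toy sizes m, n, f are assumptions made for illustration; the mask follows the 0/1 convention of the code above.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_np(q, k, v, mask=None):
    # q: (m, f), k: (n, f), v: (n, f); mask: (m, n) with 1 = "hide this key".
    scores = q @ k.T / np.sqrt(k.shape[-1])
    if mask is not None:
        scores = scores + mask * -1e9
    weights = softmax(scores, axis=-1)   # (m, n), each row sums to 1
    return weights @ v, weights          # (m, f), (m, n)


rng = np.random.default_rng(0)
m, n, f = 2, 3, 4
q = rng.normal(size=(m, f))
k = rng.normal(size=(n, f))
v = rng.normal(size=(n, f))
mask = np.array([[0.0, 0.0, 1.0]] * m)   # hide the last key for every query

out, w = attention_np(q, k, v, mask)
print(out.shape, w.shape)   # (2, 4) (2, 3)
print(w.sum(axis=-1))       # every row sums to 1
print(w[:, -1])             # the masked column is numerically zero
```

The m queries produce m output vectors of dimension f, and the masked key contributes nothing to any of them.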

In practice we want the attention layer to learn its own representations of Q, K and V. So inside the attention layer, Q, K and V each pass through a dense layer before the attention computation. During training, backpropagation adjusts the parameters of these dense layers, and the attention layer gradually learns how to represent Q, K and V.

The transformer also uses multi-head attention to increase the expressive power of the attention layer. Each vector in Q, K and V is split into h smaller vectors called attention heads, so every f-dimensional vector becomes h heads of dimension $\frac{f}{h}$.

Let the heads of Q, K and V be $Q_0, K_0, V_0$, then $Q_1, K_1, V_1$, and so on up to $Q_{h-1}, K_{h-1}, V_{h-1}$. Multi-head attention computes attention for every triple $Q_i, K_i, V_i$ to get $Z_i$, concatenates the results of all heads, and mixes them with one more dense layer to produce the final output Z.

Code:

def get_initializer(seed):
    return tf.keras.initializers.truncated_normal(mean=0., stddev=0.02, seed=seed)


class MultiHeadAttention(tf.keras.layers.Layer):

    def __init__(self, d_model, num_heads, seed=233):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model

        assert d_model % self.num_heads == 0
        self.depth = d_model // self.num_heads

        # Dense layers applied to Q, K and V
        self.wq = tf.keras.layers.Dense(d_model, kernel_initializer=get_initializer(seed), name='wq')
        self.wk = tf.keras.layers.Dense(d_model, kernel_initializer=get_initializer(seed), name='wk')
        self.wv = tf.keras.layers.Dense(d_model, kernel_initializer=get_initializer(seed), name='wv')
        
        # Dense layer that mixes the concatenated heads
        self.dense = tf.keras.layers.Dense(d_model, kernel_initializer=get_initializer(seed))

    def split_heads(self, x, batch_size):
        # Add an attention-head axis: (batch, seq, d_model) -> (batch, heads, seq, depth)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]

        # Dense projections
        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)

        # Split into attention heads
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        # Multi-head attention computation
        scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v, mask)
        
        # Concatenate the heads, then mix them
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))
        output = self.dense(concat_attention)

        return output, attention_weights

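The MultiHeadAttention class above is the complete attention layer. Its constructor takes three parameters:

d_model: the main dimension of the model, which reflects its capacity. It is the size of the last dimension of every tensor in the attention layer, and of most tensors in BERT
num_heads: the number of attention heads
seed: the random seed used for weight initialization

Note that the weight matrices in this implementation are all sampled from a truncated normal distribution, as the get_initializer function shows.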
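
The reshape and transpose in split_heads, and their inverse at the end of call, are easy to get wrong, so here is a small NumPy sketch of just that step. The sizes are made up for illustration; the axis order follows the class above.

```python
import numpy as np

batch, seq_len, d_model, num_heads = 2, 5, 8, 4
depth = d_model // num_heads

x = np.arange(batch * seq_len * d_model, dtype=np.float32)
x = x.reshape(batch, seq_len, d_model)

# Split: (batch, seq, d_model) -> (batch, heads, seq, depth)
heads = x.reshape(batch, seq_len, num_heads, depth).transpose(0, 2, 1, 3)

# Merge: (batch, heads, seq, depth) -> (batch, seq, d_model)
merged = heads.transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)

print(heads.shape)                  # (2, 4, 5, 2)
print(np.array_equal(merged, x))    # True
# Head 1 of token 3 is the second block of `depth` features of that token.
print(np.array_equal(heads[0, 1, 3], x[0, 3, depth:2 * depth]))   # True
```

Splitting only regroups features, so merging recovers the original tensor exactly; each head sees a contiguous slice of depth features of every token.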

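Activation function

BERT uses GELU as its activation function. We skip the details and go straight to the code, which uses the tanh approximation of GELU: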

def gelu(x):
    cdf = 0.5 * (1.0 + tf.tanh(
        (np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
    return x * cdf
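
For readers who wonder how close the tanh form is to the exact definition GELU(x) = x * Phi(x), where Phi is the standard normal CDF, the following standard-library sketch compares the two on a small grid. The helper names gelu_tanh and gelu_exact are assumptions for illustration.

```python
import math


def gelu_tanh(x: float) -> float:
    # Same tanh approximation as the gelu function above, for a single float.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu_exact(x: float) -> float:
    # GELU(x) = x * Phi(x), with Phi written through the error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


xs = [i / 2 for i in range(-8, 9)]   # -4.0, -3.5, ..., 4.0
max_gap = max(abs(gelu_tanh(x) - gelu_exact(x)) for x in xs)
print(max_gap < 1e-3)   # True on this grid
```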

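Point-wise feed-forward network layer

The point-wise feed-forward network layer takes the three-dimensional output of the attention layer, passes it through two dense layers, and outputs another three-dimensional tensor. The first dense layer uses GELU and widens the last dimension (from d_model to dff). The second has no activation and restores the last dimension to d_model.

As I understand it, "point-wise" means that the ffn is applied independently to every token of every sample in the batch, as if a force were acting on individual points.

There are many guesses about the intuition behind the point-wise feed-forward network, but nobody has explained it fully yet. If that does not bother you, simply think of it as a way to add a deeper ffn and a non-linear activation, which increases the capacity of the network.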

def point_wise_feed_forward_network(d_model, dff, seed=233):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(dff, activation=gelu, kernel_initializer=get_initializer(seed), name='ffn_1'),
        tf.keras.layers.Dense(d_model, kernel_initializer=get_initializer(seed), name='ffn_2'),
    ])
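
The "point-wise" property can be verified directly: applying the network to a whole batch gives the same result as applying it to each token on its own. The NumPy sketch below uses random weights and, to stay short, ReLU as a stand-in for GELU; both choices are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, d_model, dff = 2, 3, 4, 16

w1, b1 = rng.normal(size=(d_model, dff)), np.zeros(dff)
w2, b2 = rng.normal(size=(dff, d_model)), np.zeros(d_model)


def ffn(x):
    # Two dense layers acting on the last axis only.
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2


x = rng.normal(size=(batch, seq_len, d_model))
whole = ffn(x)                                    # all tokens at once
one_by_one = np.array(
    [[ffn(x[b, t]) for t in range(seq_len)] for b in range(batch)]
)                                                 # one token at a time

print(whole.shape)                       # (2, 3, 4)
print(np.allclose(whole, one_by_one))    # True
```

No information flows between tokens in this layer; mixing across positions happens only in the attention layer.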

encoder layer

Now that the attention layer and the point-wise feed-forward network layer are done, we can combine them into an encoder layer. Its input is a three-dimensional tensor x and a four-dimensional mask; its output is again a three-dimensional tensor.

First we use the attention layer to implement self-attention by feeding x as Q, K and V at the same time:

attention\_output = Attention(x, x, x, mask)

Then we apply the residual connection known from residual networks, followed by a LayerNormalization layer:

output_1 = LayerNorm(x + attention\_output)

Both residual connections and LayerNormalization help keep gradients stable in networks with many layers.

Finally, $output_1$ goes through the point-wise feed-forward network layer, followed by another residual connection and LayerNormalization, which completes the encoder layer:

feed\_forward\_output = FeedForward(output_1)
output_2 = LayerNorm(output_1 + feed\_forward\_output)

In the code, dropout is also applied to the output of each sub-layer before the residual addition.

Code:

def get_layer_norm():
    return tf.keras.layers.LayerNormalization(
        epsilon=1e-6,
        beta_initializer=tf.keras.initializers.zeros(),
        gamma_initializer=tf.keras.initializers.ones(),
    )


class EncoderLayer(tf.keras.layers.Layer):

    def __init__(self, d_model, num_heads, dff, name, rate=0.1, seed=233):
        super(EncoderLayer, self).__init__(name=name)

        self.mha = MultiHeadAttention(d_model, num_heads, seed=seed)
        self.ffn = point_wise_feed_forward_network(d_model, dff, seed=seed)

        self.layer_norm_1 = get_layer_norm()
        self.layer_norm_2 = get_layer_norm()

        self.dropout_1 = tf.keras.layers.Dropout(rate, seed=seed)
        self.dropout_2 = tf.keras.layers.Dropout(rate, seed=seed)

    def call(self, x, mask):
        # Self-attention + residual connection + layer-norm
        attn_output, _ = self.mha(x, x, x, mask)
        attn_output = self.dropout_1(attn_output)
        out1 = self.layer_norm_1(x + attn_output)

        # Point-wise feed-forward network + residual connection + layer-norm
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout_2(ffn_output)
        out2 = self.layer_norm_2(out1 + ffn_output)

        return out2
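
What "residual connection followed by layer-norm" does to a tensor can be seen in a few lines of NumPy. This is a sketch with gamma = 1 and beta = 0 (their initial values in get_layer_norm), and a random tensor standing in for the sub-layer output; the function name layer_norm is an assumption for illustration.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    # Normalize every token vector over the feature axis.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(2, 3, 8))   # (batch, seq, d_model)
sublayer_out = rng.normal(size=x.shape)              # stand-in for attention or ffn output

out = layer_norm(x + sublayer_out)
print(out.shape)                                        # (2, 3, 8)
print(np.allclose(out.mean(axis=-1), 0.0, atol=1e-6))   # True
print(np.allclose(out.std(axis=-1), 1.0, atol=1e-3))    # True
```

Whatever the scale of the residual sum, every token vector leaves the block with zero mean and unit variance over its features, which is one reason deep stacks of such layers stay trainable.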

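encoder

With all components in place, it is time to implement the core of BERT, the transformer encoder. Its job is to extract feature representations of text. Put simply, it assigns each word a vector that very likely encodes:

the semantics of the word itself: "cat" and "dog" have different features
the semantics of the word in context: "dog" has different features when it describes an animal and when it describes a person
the semantics of the word combined with other words: "cute dog" and "fierce dog" have different features

It has two parts. The first part is three embedding layers; the second is a stack of encoder layers. The former turns a sequence of token ids (a two-dimensional tensor) into a sequence of embeddings (a three-dimensional tensor). The latter encodes the embedding sequence layer by layer to extract the feature representation of the text.

Before implementing the embedding layers, recall what the input of BERT looks like. Because one of BERT's training tasks is Next Sentence Prediction (NSP), the encoder input is a sentence pair made by concatenating two text sequences. If the pair is "my dog is cute" and "he likes playing", the encoder input looks like this: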

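[cls] my dog is cute [sep] he likes play ##ing [sep] [pad] [pad] [pad] [pad] [pad]

The special tokens [cls] and [sep] mark the beginning and the end of a sequence. There are two sentences, hence two [sep] tokens. [pad] is a meaningless token appended at the end (padding) so that the concatenated pair always has a fixed length, 16 in the example above.

To keep the implementation simple, I make two simplifications. First, each sentence (rather than the whole sentence pair) is limited to a fixed length. Second, padding is added at the end of each sentence (rather than at the end of the whole input). With a per-sentence length of 7, the encoder input becomes:

[cls] my dog is cute [pad] [pad] [pad] [sep] he likes play ##ing [pad] [pad] [pad] [sep]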

In the embedding layers every token receives three embeddings:

$E_{Token}$: one embedding per distinct word; think of it as the word vector
$E_{Segment}$: only two distinct embeddings, one for the first sentence A and one for the second sentence B
$E_{Position}$: one embedding per position, which tells the encoder where the token sits

Take the token "cute": it receives $E_{Token}(cute)$, $E_{Segment}(A)$ and $E_{Position}(4)$. Their sum is used as the overall embedding of the token:

E_{BERT}(cute) = E_{Token}(cute) + E_{Segment}(A) + E_{Position}(4)
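
The sum above can be sketched in NumPy, where fancy indexing plays the role of the embedding lookup. The table sizes and the random token ids are assumptions for illustration; the fixed position and segment id sequences follow the simplified input layout described earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
segment_max_len, d_model, vocab_size = 4, 8, 20
max_len = 2 * segment_max_len + 3          # [cls] + A + [sep] + B + [sep]

token_table = rng.normal(size=(vocab_size, d_model))
segment_table = rng.normal(size=(2, d_model))
position_table = rng.normal(size=(max_len, d_model))

# Fixed ids: [cls] + A + [sep] belong to segment 0, B + [sep] to segment 1.
position_ids = np.arange(max_len)
segment_ids = np.array([0] * (segment_max_len + 2) + [1] * (segment_max_len + 1))

# Hypothetical batch of two already-tokenized sentence pairs.
token_ids = rng.integers(1, vocab_size, size=(2, max_len))

# The (max_len, d_model) segment and position terms broadcast over the batch axis.
emb = token_table[token_ids] + segment_table[segment_ids] + position_table[position_ids]
print(emb.shape)   # (2, 11, 8)
```

The token at position 4 of the first segment gets exactly the three-term sum from the formula: its token row, segment row 0 and position row 4.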

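The three-dimensional output of the embedding layers first passes through a Dropout layer and a LayerNormalization layer, and then through the stacked encoder layers, which gives the output of the whole encoder.

Code: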

class TransformerEncoder(tf.keras.layers.Layer):

    def __init__(self, num_layers, d_model, num_heads, dff, segment_max_len, input_vocab_size, rate=0.1, seed=233):
        super(TransformerEncoder, self).__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        # Max pair length = two sentences of max length plus three special tokens
        max_len = 2 * segment_max_len + 3

        # The three embeddings
        self.embedding = tf.keras.layers.Embedding(
            input_dim=input_vocab_size, output_dim=d_model, embeddings_initializer=get_initializer(seed), trainable=True,
        )
        self.position_embedding = tf.keras.layers.Embedding(
            input_dim=max_len, output_dim=d_model, embeddings_initializer=get_initializer(seed), trainable=True,
        )
        self.segment_embedding = tf.keras.layers.Embedding(
            input_dim=2, output_dim=d_model, embeddings_initializer=get_initializer(seed), trainable=True,
        )

        # The position id sequence is the same for every pair: [0, 1, 2, 3, ...]
        position_ids = range(max_len)
        self.position_ids = tf.cast(np.array(position_ids).reshape((1, -1)), dtype=tf.float32)

        # With the simplification above, the segment ids are fixed: [0, 0, 0, ..., 1, 1, 1, ...]
        segment_ids = [0 for _ in range(2 + segment_max_len)] + [1 for _ in range(1 + segment_max_len)]
        self.segment_ids = tf.cast(np.array(segment_ids).reshape((1, -1)), dtype=tf.float32)

        # Stack the encoder layers
        self.enc_layers = []
        for i in range(num_layers):
            self.enc_layers.append(EncoderLayer(d_model, num_heads, dff, 'encoder_layer_{}'.format(i), rate, seed))

        self.layer_norm = get_layer_norm()
        self.dropout = tf.keras.layers.Dropout(rate, seed=seed)

    def call(self, x, mask):
        # The embedding output is the sum of the three embeddings
        x = self.embedding(x) + self.position_embedding(self.position_ids) + self.segment_embedding(self.segment_ids)
        x = self.layer_norm(self.dropout(x))

        # Call the encoder layers one after another
        for i in range(self.num_layers):
            x = self.enc_layers[i](x, mask)

        return x

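That is the whole encoder. A quick tour of the constructor parameters: num_layers: the number of encoder layers; d_model: the d_model of each encoder layer; num_heads: the number of attention heads; dff: the number of units in the first dense layer of the point-wise feed-forward network; segment_max_len: the maximum length of each sentence; input_vocab_size: the size of the tokenizer vocabulary; rate: the drop probability of the Dropout layers; seed: the random seed used for initialization.

Building the model

In this part we build the complete BERT model on top of the Transformer Encoder, including the model structure and the losses of the two pre-training tasks.

padding mask

So that the attention layer ignores the padding positions of a sentence pair, we build a mask tensor from the padding information.

Code: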

def create_padding_mask(seq):
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
    return seq[:, tf.newaxis, tf.newaxis, :]

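create_padding_mask takes the two-dimensional input tensor of BERT and returns a four-dimensional tensor, the mask used by the attention mechanism. Its shape is (batch, 1, 1, sequence): the two middle axes of size 1 broadcast over the attention heads and the query positions, and the last axis marks which key positions are padding.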
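
How the four-dimensional mask lines up with the attention scores is easiest to see with concrete shapes. The NumPy sketch below mirrors create_padding_mask; the toy ids and the all-zero score tensor are assumptions for illustration.

```python
import numpy as np

token_ids = np.array([[5, 7, 0, 0],
                      [3, 0, 0, 0]])   # 0 is the [pad] id

# (batch, seq) -> (batch, 1, 1, seq): 1.0 marks a padded key position.
mask = (token_ids == 0).astype(np.float32)[:, np.newaxis, np.newaxis, :]

num_heads, seq_len = 2, token_ids.shape[1]
# Attention scores have shape (batch, heads, query position, key position).
scores = np.zeros((2, num_heads, seq_len, seq_len), dtype=np.float32)
masked_scores = scores + mask * -1e9   # broadcasts over heads and query positions

print(mask.shape)               # (2, 1, 1, 4)
print(masked_scores.shape)      # (2, 2, 4, 4)
print(masked_scores[0, 0, 0])   # the last two entries are -1e9
```

Every head and every query position of a sample gets the same penalty on the padded keys, so padding never receives attention weight.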

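Completing the model structure

The data pipeline produces four tensors: token_ids, nsp_labels, mask_token_ids and is_masked. Of these, nsp_labels is supplied as the training target of the NSP output rather than as a model input. The masked token ids of the sentence pair, i.e. mask_token_ids, are fed into the Transformer Encoder to obtain the three-dimensional tensor encoder_output, which is the shared input of BERT's two pre-training tasks: Next Sentence Prediction (NSP) and Masked Language Model (MLM).

NSP is the task of deciding whether the two sentences fed to BERT are consecutive in the source document. The input of this head is the encoder output at the [cls] position, the first position of encoder_output. In the code it passes through a tanh dense layer and then a single-unit dense layer with a sigmoid activation (a two-way softmax would work as well) to give the NSP probability. The ground truth comes from nsp_labels, and the loss is binary cross-entropy.

MLM is a cloze task: some tokens of the sentence pair are hidden, and the model predicts what they were. The input of this head is the whole encoder_output, and it is computed as follows:

mlm\_prediction = Softmax(LayerNormalization(Dense(encoder\_output)) \times E_{Token}^T + bias)

That is, encoder_output goes through a dense layer and a LayerNormalization layer to give the intermediate result mlm_norm_output. Multiplying it by the transpose of the token embedding matrix and adding a bias gives mlm_output; a softmax then turns it into, for every position, a probability distribution over the vocabulary. The ground truth is given by token_ids, the loss is categorical cross-entropy, and is_masked selects the positions that contribute to the loss. Two details of the code below differ slightly from the formula: mlm_output passes through one extra linear dense layer (mlm_activation), and the softmax is not applied explicitly but folded into the loss through from_logits=True.

Code: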

def build_bert(segment_max_len: int, input_vocab_size: int, num_layers: int, d_model: int, num_heads: int, d_ff: int,
               rate: float, seed: int):

    # Max pair length = two sentences of max length plus three special tokens
    max_len = 2 * segment_max_len + 3

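    # Input layers. In the dataset built later, 'token_ids' carries the original ids
    # (the MLM labels) and 'target_token_ids' carries the masked ids (mask_token_ids).
    token_ids = tf.keras.layers.Input(shape=(max_len, ), name='token_ids')
    target_token_ids = tf.keras.layers.Input(shape=(max_len,), name='target_token_ids')
    is_masked = tf.keras.layers.Input(shape=(max_len,), name='is_masked')

    # Build the padding mask from the sequence the encoder actually sees
    mask = create_padding_mask(target_token_ids)

    # Run the transformer encoder on the masked ids; feeding the original ids here
    # would let the model read the answers of the MLM task
    encoder = TransformerEncoder(
        num_layers=num_layers, d_model=d_model, num_heads=num_heads, dff=d_ff,
        segment_max_len=segment_max_len, input_vocab_size=input_vocab_size, rate=rate, seed=seed,
    )
    encoder_output = encoder(target_token_ids, mask=mask)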

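    # Network for the NSP objective. Index (not slice) the [cls] position so that the
    # output has shape (batch, 1) and lines up with the (batch,) nsp_labels
    pooler = encoder_output[:, 0, :]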
    pool_dense = tf.keras.layers.Dense(
        d_model, activation='tanh', kernel_initializer=get_initializer(seed), name='pooler_dense'
    )
    nsp_prob = tf.keras.layers.Dense(1, activation='sigmoid', kernel_initializer=get_initializer(seed), name='nsp_prob')
    nsp_output = nsp_prob(pool_dense(pooler))

    # Network for the MLM objective
    mlm_dense = tf.keras.layers.Dense(
        d_model, activation=gelu, kernel_initializer=get_initializer(seed), name='mlm_dense'
    )
    mlm_norm = get_layer_norm()
    mlm_activation = tf.keras.layers.Dense(
        input_vocab_size, activation='linear', kernel_initializer=get_initializer(seed), name='mlm_activation'
    )
    mlm_norm_output = mlm_norm(mlm_dense(encoder_output))
    embedding_token = mlm_norm_output @ tf.transpose(encoder.embedding.embeddings)

    bias = tf.keras.initializers.Zeros()(shape=(input_vocab_size,))
    mlm_bias = tf.Variable(tf.cast(bias, tf.float32), name="mlm_bias")
    mlm_output = mlm_activation(tf.nn.bias_add(embedding_token, mlm_bias))

    # MLM loss and accuracy
    def mlm_loss(inputs):
        y_true, y_pred, mask_ = inputs
        loss = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred, from_logits=True)
        loss = tf.reduce_sum(loss * mask_) / (tf.reduce_sum(mask_) + tf.keras.backend.epsilon())
        return loss

    def mlm_acc(inputs):
        y_true, y_pred, mask_ = inputs
        acc = tf.keras.metrics.sparse_categorical_accuracy(y_true, y_pred)
        acc = tf.reduce_sum(acc * mask_) / (tf.reduce_sum(mask_) + tf.keras.backend.epsilon())
        return acc

    mlm_loss = tf.keras.layers.Lambda(mlm_loss, name='mlm_loss')([token_ids, mlm_output, is_masked])
    mlm_acc = tf.keras.layers.Lambda(mlm_acc, name='mlm_acc')([token_ids, mlm_output, is_masked])

    # Define and return the model object
    model = tf.keras.models.Model(
        inputs=[token_ids, target_token_ids, is_masked],
        outputs=[nsp_output, mlm_loss, mlm_acc],
    )

    return model
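
The masked averaging used by mlm_loss above can be checked in isolation. The NumPy sketch below computes a cross-entropy from logits and averages it only over the positions selected by is_masked; the function name masked_mean_cross_entropy and the toy tensors are assumptions for illustration.

```python
import numpy as np


def masked_mean_cross_entropy(logits, targets, is_masked):
    # logits: (batch, seq, vocab); targets and is_masked: (batch, seq)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the target id at every position.
    token_loss = -np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    # Average only over the positions selected by is_masked.
    return (token_loss * is_masked).sum() / (is_masked.sum() + 1e-7)


logits = np.zeros((1, 3, 4))              # uniform prediction over a 4-token vocabulary
logits[0, 1, 2] = 50.0                    # position 1 confidently predicts token 2
targets = np.array([[0, 2, 3]])
is_masked = np.array([[0.0, 1.0, 0.0]])   # only position 1 counts

only_masked = masked_mean_cross_entropy(logits, targets, is_masked)
all_positions = masked_mean_cross_entropy(logits, targets, np.ones_like(is_masked))
print(only_masked)      # close to 0
print(all_positions)    # about 2 * ln(4) / 3
```

Positions that were not selected for masking contribute nothing, so the uncertain predictions at positions 0 and 2 do not affect the first value.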

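Compiling the model

Next we compile BERT. For simplicity only the plain Adam optimizer is used; if you want AdamW or LAMB instead, you can swap it in yourself. Because the MLM loss is computed inside the model and exposed as an output, compiling only needs to pass that output through as the loss. The NSP loss still uses tf.keras.losses.BinaryCrossentropy(). The metrics argument additionally tracks the MLM accuracy and the NSP AUC.

Code: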

def compile_bert(model: tf.keras.models.Model, learning_rate: float) -> tf.keras.models.Model:
    model.compile(
        loss={
            'mlm_loss': lambda y_true, y_pred: y_pred,
            'nsp_prob': tf.keras.losses.BinaryCrossentropy(),
        },
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-6),
        metrics={
            'mlm_acc': lambda y_true, y_pred: y_pred,
            'nsp_prob': tf.keras.metrics.AUC()
        },
    )
    return model

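tokenize

The raw corpus is text, but by the time it reaches BERT it has become a two-dimensional tensor; this conversion is tokenization. First we "train" a tokenizer on the whole corpus, i.e. split the corpus according to the tokenizer's rules to obtain a vocabulary. Then every token in the vocabulary gets an id. Finally each sentence is split into tokens and every token is replaced by its id.

Implementing the tokenizer

BERT uses the WordPiece tokenizer, but since that is not the focus here, and Chinese does not need subword units, we simply use the Tokenizer that ships with tensorflow and treat every character as one token.

Code: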

# '今天天气不错' -> '今 天 天 气 不 错'
def segment(raw_texts: List[str]) -> List[str]:
    res = []
    for text in raw_texts:
        res.append(' '.join([c for c in text]))
    return res


# "Train" the tokenizer
def build_tokenizer(texts: List[str]) -> tf.keras.preprocessing.text.Tokenizer:
    tokenizer = tf.keras.preprocessing.text.Tokenizer(lower=True)
    tokenizer.fit_on_texts(texts)
    return tokenizer


# Tokenize sentences with the tokenizer, truncating or zero-padding to max_len
def tokenize(tokenizer: tf.keras.preprocessing.text.Tokenizer, texts: List[str],
             max_len: Optional[int] = None) -> List[List[int]]:
    seqs = tokenizer.texts_to_sequences(texts=texts)
    if max_len is not None:
        for i in range(len(seqs)):
            seqs[i] = seqs[i][0:max_len]
            while len(seqs[i]) < max_len:
                seqs[i].append(0)
    return seqs
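
The truncate-then-pad loop at the end of tokenize can be written as a small pure function, which makes the fixed-length guarantee easy to test. The name pad_or_truncate is an assumption for illustration; pad id 0 matches the convention used throughout this article.

```python
from typing import List


def pad_or_truncate(seq: List[int], max_len: int, pad_id: int = 0) -> List[int]:
    # Cut to max_len, then right-pad with pad_id up to max_len.
    seq = seq[:max_len]
    return seq + [pad_id] * (max_len - len(seq))


print(pad_or_truncate([4, 8, 15], 5))                # [4, 8, 15, 0, 0]
print(pad_or_truncate([4, 8, 15, 16, 23, 42], 5))    # [4, 8, 15, 16, 23]
```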

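mask

The MLM task of BERT hides part of the tokens in a sentence, which is what the get_mask_token_ids function below does. It takes a tokenized id sequence, token_ids, and returns the masked id sequence, mask_token_ids, together with a 0/1 sequence, is_masked, that marks which tokens were selected for masking.

Note that not every selected token is replaced by [mask]: that happens only 80% of the time. In 10% of the cases the token is replaced by a random token, and in the remaining 10% it is left unchanged, so the model predicts a token it can already see. This makes BERT less dependent on [mask], which it never encounters during fine-tuning or inference.

Code: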

# Replace a token that was selected for masking
def mask_replace(token_id: int, mask_token_id: int, vocab_size: int) -> int:
    rand = random.random()
    if rand <= 0.8:
        return mask_token_id
    elif rand <= 0.9:
        return token_id
    else:
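        # randint is inclusive on both ends: stay inside [1, vocab_size - 1] so the
        # result is a valid embedding row and never the [pad] id 0
        return random.randint(1, vocab_size - 1)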


# Apply the replacement to a whole sequence
def get_mask_token_ids(token_ids: List[int], mask_token_id: int, mask_rate: float,
                       vocab_size: int) -> Tuple[List[int], List[int]]:
    mask_token_ids, is_masked = [], []
    for token_id in token_ids:
        if token_id == 0 or random.random() > mask_rate:
            mask_token_ids.append(token_id)
            is_masked.append(0)
        else:
            mask_token_ids.append(mask_replace(token_id, mask_token_id, vocab_size))
            is_masked.append(1)
    return mask_token_ids, is_masked

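The mask_token_id parameter of get_mask_token_ids is the id of the [mask] token. mask_rate is the probability of selecting a token for masking; BERT typically uses 15%. vocab_size is the size of the tokenizer vocabulary, which is needed to replace a selected token with a random one.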
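
A quick simulation shows that the thresholds in mask_replace really produce the 80% / 10% / 10% split. The sketch below is standard-library only; pick_replacement is a hypothetical helper that returns a label instead of a token id so the three branches can be counted.

```python
import random
from collections import Counter


def pick_replacement(rng: random.Random) -> str:
    # Same thresholds as mask_replace above.
    r = rng.random()
    if r <= 0.8:
        return 'mask'
    elif r <= 0.9:
        return 'keep'
    return 'random'


rng = random.Random(233)
n = 100_000
counts = Counter(pick_replacement(rng) for _ in range(n))
shares = {name: counts[name] / n for name in ('mask', 'keep', 'random')}
print(shares)   # roughly 0.8 / 0.1 / 0.1
```

Combined with mask_rate = 0.15, about 12% of all non-padding tokens end up as [mask], 1.5% stay unchanged and 1.5% become a random token.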

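Building the input

Congratulations, this is the last function needed for BERT: make_feature, which turns a corpus into BERT inputs.

The steps are: iterate over the documents and over the sentences of each document. Pair every sentence with the sentence that follows it as a positive NSP example, and with a randomly chosen sentence as a negative example. Finally mask every sentence pair and wrap it with [cls] and [sep].

Code: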

def make_feature(tokenizer: tf.keras.preprocessing.text.Tokenizer, raw_docs: List[List[str]], segment_max_len: int,
                 beg_token_id: int, sep_token_id: int, mask_token_id: int, mask_rate: float,
                 random_seed: int) -> Tuple[List[List[int]], List[List[int]], List[List[int]], List[int]]:

    random.seed(random_seed)

    docs = []
    tot_texts = []

    for raw_doc in raw_docs:
        doc = segment(raw_doc)
        docs.append(list(range(len(tot_texts), len(tot_texts) + len(doc))))
        tot_texts.extend(doc)

    tot_token_ids = tokenize(tokenizer, tot_texts, segment_max_len)
    tot_text_idx = list(range(len(tot_texts)))

    tmp_token_ids = []
    nsp_labels = []

    for doc in docs:
        for i in range(1, len(doc)):
            pre_doc_id = doc[i - 1]
            cur_doc_ids = [random.choice(tot_text_idx), doc[i]]
            for j in range(2):
                cur_doc_id = cur_doc_ids[j]
                tmp_token_ids.append([tot_token_ids[pre_doc_id], tot_token_ids[cur_doc_id]])
                nsp_labels.append(j)

    token_ids = []
    mask_token_ids = []
    is_masked = []

    vocab_size = len(tokenizer.word_index) + 4

    for p in tmp_token_ids:
        first_cur_mask_token_ids, first_cur_is_masked = get_mask_token_ids(p[0], mask_token_id, mask_rate, vocab_size)
        second_cur_mask_token_ids, second_cur_is_masked = get_mask_token_ids(p[1], mask_token_id, mask_rate, vocab_size)
        token_ids.append([beg_token_id] + p[0] + [sep_token_id] + p[1] + [sep_token_id])
        mask_token_ids.append(
            [beg_token_id] + first_cur_mask_token_ids + [sep_token_id] + second_cur_mask_token_ids + [sep_token_id]
        )
        is_masked.append([0] + first_cur_is_masked + [0] + second_cur_is_masked + [0])

    return token_ids, mask_token_ids, is_masked, nsp_labels

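A quick word on the parameters not covered yet: raw_docs is the corpus; a sentence is stored as a string, a document as List[str], and the corpus as List[List[str]]. beg_token_id, sep_token_id and mask_token_id are the token ids of [beg] (which plays the role of [cls]), [sep] and [mask].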
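
The pairing logic inside make_feature is easier to follow on plain strings. The sketch below keeps the same order as the loop above (negative example first, then positive) but leaves out tokenization and masking; make_nsp_pairs and the toy documents are assumptions for illustration. As in make_feature, the random pick is not filtered, so a "negative" can occasionally coincide with the true next sentence.

```python
import random
from typing import List, Tuple


def make_nsp_pairs(docs: List[List[str]], seed: int = 0) -> List[Tuple[str, str, int]]:
    # For every adjacent sentence pair in a document, emit one negative example
    # (random second sentence, label 0) and one positive example (label 1).
    rng = random.Random(seed)
    all_sentences = [s for doc in docs for s in doc]
    pairs = []
    for doc in docs:
        for i in range(1, len(doc)):
            pairs.append((doc[i - 1], rng.choice(all_sentences), 0))
            pairs.append((doc[i - 1], doc[i], 1))
    return pairs


toy_docs = [['a1', 'a2', 'a3'], ['b1', 'b2']]
pairs = make_nsp_pairs(toy_docs)
print(len(pairs))                           # 6
print([label for _, _, label in pairs])     # [0, 1, 0, 1, 0, 1]
```

A document with k sentences yields 2 * (k - 1) examples, so the two classes are always balanced.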

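Running it

Sample data

Let us make up a small corpus. The first document is about the Mid-Autumn Festival, the second about the Queen of the United Kingdom.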

data = [
    [
        '“时逢三五便团圆，满把晴光护玉栏。”',
        '10日是一年一度的中秋节。',
        '据天文专家介绍，今年中秋节是十五的月亮十五圆，预计最圆时刻将出现在17时59分左右。',
        '那么佳节赏月天气如何呢',
        '能看到的是皓月当空还是彩云伴月？',
        '是适合登高望月、还是泛舟赏月？',
    ],
    [
        '中新网9月10日电 综合英媒报道，当地时间9月10日，英国国王查尔斯三世已经批准，英国女王伊丽莎白二世葬礼当天将是英国的公共假日。',
        '英女王的葬礼日期目前暂未确定，但《卫报》称，葬礼预计为9月19日。',
        '9月8日，英国女王伊丽莎白二世去世，终年96岁。',
        '9月10日，查尔斯三世被英国王位授权理事会正式授权成为英国新君主。',
    ],
]

tokenize

Now we can "train" the tokenizer on all sentences of the corpus.

tokenizer.word_index is a dict that maps tokens to ids, so ordinary token ids lie in [1, len(tokenizer.word_index)]. We still need four special tokens: let 0 be the id of [pad], len(tokenizer.word_index) + 1 the id of [beg], len(tokenizer.word_index) + 2 the id of [sep], and len(tokenizer.word_index) + 3 the id of [mask]. The whole vocabulary therefore has size vocab_size = len(tokenizer.word_index) + 4.

raw_texts = [text for doc in data for text in doc]
texts = segment(raw_texts)
tokenizer = build_tokenizer(texts)
vocab_size = len(tokenizer.word_index) + 4
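
The id layout can be spelled out on a tiny vocabulary. In the sketch below, word_index is a hypothetical stand-in for tokenizer.word_index with three characters; the special ids follow the convention used by make_feature and decode in this article.

```python
word_index = {'今': 1, '天': 2, '气': 3}   # hypothetical tokenizer.word_index

n_words = len(word_index)
special_ids = {
    '[pad]': 0,
    '[beg]': n_words + 1,
    '[sep]': n_words + 2,
    '[mask]': n_words + 3,
}
vocab_size = n_words + 4

print(special_ids)   # {'[pad]': 0, '[beg]': 4, '[sep]': 5, '[mask]': 6}
print(vocab_size)    # 7

# Every id, ordinary or special, is a valid row of a vocab_size-row embedding table.
all_ids = set(word_index.values()) | set(special_ids.values())
print(all_ids == set(range(vocab_size)))   # True
```

This is why the later call passes vocab_size - 3, vocab_size - 2 and vocab_size - 1 as the ids of [beg], [sep] and [mask].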

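Building the model

Any set of hyperparameters will do to create and compile the model; change them as you like. Finding the best hyperparameters by experiment is fun, too.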

# Random seed
SEED = 233

# Define the model
model = build_bert(
    segment_max_len=64, input_vocab_size=vocab_size,
    num_layers=4, d_model=64, num_heads=2, d_ff=256,
    rate=0.1, seed=SEED
)

# Compile the model
model = compile_bert(model, 0.00176)
model.summary()

Building the data

Call the make_feature function to turn the corpus into the tensors BERT consumes, then pack them into a dataset.

token_ids, mask_token_ids, is_masked, nsp_labels = make_feature(
    tokenizer, data, 64, vocab_size - 3, vocab_size - 2, vocab_size - 1, 0.15, SEED
)

fake_labels = [0 for _ in range(len(nsp_labels))]

ds = tf.data.Dataset.from_tensor_slices((
    {'token_ids': token_ids, 'target_token_ids': mask_token_ids, 'is_masked': is_masked},
    {'nsp_prob': nsp_labels, 'mlm_loss':  fake_labels, 'mlm_acc': fake_labels},
)).batch(4)

You can print the intermediate results to get a feel for the tokenization output:

def decode(tokenizer_: tf.keras.preprocessing.text.Tokenizer, token_ids_: List[int], vocab_size_: int):
    res = []
    for token_id in token_ids_:
        if token_id == 0:
            res.append('[pad]')
        elif token_id == vocab_size_ - 3:
            res.append('[beg]')
        elif token_id == vocab_size_ - 2:
            res.append('[sep]')
        elif token_id == vocab_size_ - 1:
            res.append('[mask]')
        elif token_id in tokenizer_.index_word:
            res.append(tokenizer_.index_word[token_id])
        else:
            res.append('[unk]')
    return ''.join(res)


for i in range(len(token_ids)):
    print(token_ids[i])
    print(decode(tokenizer, token_ids[i], vocab_size))
    print(mask_token_ids[i])
    print(decode(tokenizer, mask_token_ids[i], vocab_size))
    print(is_masked[i])
    print(nsp_labels[i])
    print('')

Training

All done! Go train your model. Do not stop at the made-up corpus; be sure to try it on your own.

model.fit(ds, shuffle=False, epochs=1)

References

Official tensorflow transformer tutorial: https://www.tensorflow.org/text/tutorials/transformer
The Illustrated Transformer
The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
Official BERT source code: https://github.com/google-research/bert
