.. _sec_bert-pretraining:

Pretraining BERT
================


In this section, we pretrain BERT on the WikiText-2 dataset, using the
BERT model implemented in :numref:`sec_bert` and the pretraining examples
generated from WikiText-2 in :numref:`sec_bert-dataset`.

.. code:: python

    from d2l import mxnet as d2l
    from mxnet import autograd, gluon, init, np, npx
    
    npx.set_np()

To start, we load the WikiText-2 dataset as minibatches of pretraining
examples for two tasks: masked language modeling and next sentence
prediction. The batch size is 512 and the maximum length of a BERT input
sequence is 64. Note that in the original BERT model, the maximum length
is 512.

.. code:: python

    batch_size, max_len = 512, 64
    train_iter, vocab = d2l.load_data_wiki(batch_size, max_len)

Pretraining BERT
----------------

The original BERT comes in two versions of different model sizes
:cite:`Devlin.Chang.Lee.ea.2018`. The base model
(:math:`\text{BERT}_{\text{BASE}}`) uses 12 layers (Transformer encoder
blocks) with 768 hidden units (hidden size) and 12 self-attention heads.
The large model (:math:`\text{BERT}_{\text{LARGE}}`) uses 24 layers with
1024 hidden units and 16 self-attention heads. Notably, the former has
110 million parameters while the latter has 340 million parameters. To
keep the demonstration lightweight, we define a small BERT below, using
2 layers, 128 hidden units, and 2 self-attention heads.

.. code:: python

    net = d2l.BERTModel(len(vocab), num_hiddens=128, ffn_num_hiddens=256,
                        num_heads=2, num_layers=2, dropout=0.2)
    devices = d2l.try_all_gpus()
    net.initialize(init.Xavier(), ctx=devices)
    loss = gluon.loss.SoftmaxCELoss()
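
The parameter counts quoted above can be sanity-checked with a
back-of-the-envelope calculation. The sketch below is not part of
``d2l``; the function ``transformer_encoder_params`` is a hypothetical
helper, and it assumes a vocabulary of 30522 tokens, a maximum length of
512, and a feed-forward layer four times as wide as the hidden size,
none of which is stated in this section. It ignores the pretraining
heads, so it is only a rough estimate.

.. code:: python

    def transformer_encoder_params(vocab_size, num_hiddens, num_layers,
                                   max_len=512, num_segments=2, ffn_ratio=4):
        """Roughly count the parameters of a BERT-style encoder."""
        h, ffn = num_hiddens, ffn_ratio * num_hiddens
        layer_norm = 2 * h  # Scale and shift vectors
        # Token, positional, and segment embeddings plus one layer norm
        embeddings = (vocab_size + max_len + num_segments) * h + layer_norm
        # Query, key, value, and output projections, each with a bias
        attention = 4 * (h * h + h)
        # Two dense layers of the position-wise feed-forward network
        ffn_params = (h * ffn + ffn) + (ffn * h + h)
        block = attention + ffn_params + 2 * layer_norm
        # Dense layer applied to the '<cls>' representation
        pooler = h * h + h
        return embeddings + num_layers * block + pooler

    # Assumed vocabulary size (not given in this section)
    bert_vocab_size = 30522
    base = transformer_encoder_params(bert_vocab_size, 768, 12)
    large = transformer_encoder_params(bert_vocab_size, 1024, 24)
    print(f'BASE: {base / 1e6:.1f}M, LARGE: {large / 1e6:.1f}M')

Under these assumptions the estimates land close to the 110 million and
340 million figures. The number of attention heads does not appear in
the count because the heads only partition the same projection matrices.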

Before defining the training loop, we define a helper function
``_get_batch_loss_bert``. Given a minibatch of training examples split
into per-device shards, this function computes the loss for both the
masked language modeling and next sentence prediction tasks. Note that
the final loss of BERT pretraining is simply the sum of the masked
language modeling loss and the next sentence prediction loss.

.. code:: python

    #@save
    def _get_batch_loss_bert(net, loss, vocab_size, tokens_X_shards,
                             segments_X_shards, valid_lens_x_shards,
                             pred_positions_X_shards, mlm_weights_X_shards,
                             mlm_Y_shards, nsp_y_shards):
        mlm_ls, nsp_ls, ls = [], [], []
        for (tokens_X_shard, segments_X_shard, valid_lens_x_shard,
             pred_positions_X_shard, mlm_weights_X_shard, mlm_Y_shard,
             nsp_y_shard) in zip(
            tokens_X_shards, segments_X_shards, valid_lens_x_shards,
            pred_positions_X_shards, mlm_weights_X_shards, mlm_Y_shards,
            nsp_y_shards):
            # Forward pass
            _, mlm_Y_hat, nsp_Y_hat = net(
                tokens_X_shard, segments_X_shard, valid_lens_x_shard.reshape(-1),
                pred_positions_X_shard)
            # Compute masked language model loss
            mlm_l = loss(
                mlm_Y_hat.reshape((-1, vocab_size)), mlm_Y_shard.reshape(-1),
                mlm_weights_X_shard.reshape((-1, 1)))
            mlm_l = mlm_l.sum() / (mlm_weights_X_shard.sum() + 1e-8)
            # Compute next sentence prediction loss
            nsp_l = loss(nsp_Y_hat, nsp_y_shard)
            nsp_l = nsp_l.mean()
            mlm_ls.append(mlm_l)
            nsp_ls.append(nsp_l)
            ls.append(mlm_l + nsp_l)
            npx.waitall()
        return mlm_ls, nsp_ls, ls
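
To make the arithmetic in ``_get_batch_loss_bert`` concrete, here is a
minimal plain-Python sketch of the same computation on a toy example. It
is an illustration rather than the implementation used for training: the
names ``cross_entropy`` and ``bert_pretraining_loss`` and the toy numbers
are made up for this sketch.

.. code:: python

    import math

    def cross_entropy(logits, label):
        """Softmax cross-entropy for a single prediction."""
        m = max(logits)  # Subtract the maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        return log_sum_exp - logits[label]

    def bert_pretraining_loss(mlm_logits, mlm_labels, mlm_weights,
                              nsp_logits, nsp_labels):
        # Weighted sum over prediction positions; padded positions have
        # weight 0 and therefore do not contribute
        mlm_sum = sum(w * cross_entropy(z, y)
                      for z, y, w in zip(mlm_logits, mlm_labels, mlm_weights))
        mlm_l = mlm_sum / (sum(mlm_weights) + 1e-8)
        # Plain mean over sentence pairs
        nsp_l = sum(cross_entropy(z, y)
                    for z, y in zip(nsp_logits, nsp_labels)) / len(nsp_labels)
        return mlm_l, nsp_l, mlm_l + nsp_l

    # Toy vocabulary of 4 tokens and 3 prediction slots; the last is padding
    mlm_logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0],
                  [5.0, 5.0, 5.0, 5.0]]
    mlm_labels = [0, 2, 0]
    mlm_weights = [1.0, 1.0, 0.0]
    nsp_logits, nsp_labels = [[0.0, 1.0]], [1]
    mlm_l, nsp_l, total = bert_pretraining_loss(
        mlm_logits, mlm_labels, mlm_weights, nsp_logits, nsp_labels)
    print(f'mlm {mlm_l:.3f}, nsp {nsp_l:.3f}, total {total:.3f}')

Dividing by the sum of the weights, rather than by the number of
prediction slots, means the masked language modeling loss is averaged
over real masked tokens only.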

Using the helper function defined above, the following ``train_bert``
function defines the procedure to pretrain BERT (``net``) on the
WikiText-2 (``train_iter``) dataset. Training BERT can take a very long
time. Therefore, instead of specifying the number of training epochs as
in the ``train_ch13`` function (see :numref:`sec_image_augmentation`),
the ``num_steps`` argument of the following function specifies the
number of training iterations.

.. code:: python

    #@save
    def train_bert(train_iter, net, loss, vocab_size, devices, log_interval,
                   num_steps):
        trainer = gluon.Trainer(net.collect_params(), 'adam',
                                {'learning_rate': 1e-3})
        step, timer = 0, d2l.Timer()
        animator = d2l.Animator(xlabel='step', ylabel='loss',
                                xlim=[1, num_steps], legend=['mlm', 'nsp'])
        # Sum of masked language modeling losses, sum of next sentence prediction
        # losses, no. of sentence pairs, count
        metric = d2l.Accumulator(4)
        num_steps_reached = False
        while step < num_steps and not num_steps_reached:
            for batch in train_iter:
                (tokens_X_shards, segments_X_shards, valid_lens_x_shards,
                 pred_positions_X_shards, mlm_weights_X_shards,
                 mlm_Y_shards, nsp_y_shards) = [gluon.utils.split_and_load(
                    elem, devices, even_split=False) for elem in batch]
                timer.start()
                with autograd.record():
                    mlm_ls, nsp_ls, ls = _get_batch_loss_bert(
                        net, loss, vocab_size, tokens_X_shards, segments_X_shards,
                        valid_lens_x_shards, pred_positions_X_shards,
                        mlm_weights_X_shards, mlm_Y_shards, nsp_y_shards)
                for l in ls:
                    l.backward()
                trainer.step(1)
                mlm_l_mean = sum([float(l) for l in mlm_ls]) / len(mlm_ls)
                nsp_l_mean = sum([float(l) for l in nsp_ls]) / len(nsp_ls)
                metric.add(mlm_l_mean, nsp_l_mean, batch[0].shape[0], 1)
                timer.stop()
                if (step + 1) % log_interval == 0:
                    animator.add(step + 1,
                                 (metric[0] / metric[3], metric[1] / metric[3]))
                step += 1
                if step == num_steps:
                    num_steps_reached = True
                    break
    
        print(f'MLM loss {metric[0] / metric[3]:.3f}, '
              f'NSP loss {metric[1] / metric[3]:.3f}')
        print(f'{metric[2] / timer.sum():.1f} sentence pairs/sec on '
              f'{str(devices)}')
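
The step-based stopping logic can be isolated from the model code. The
following is a minimal sketch, assuming the data source can be iterated
over repeatedly (as a list or a data loader can); ``iterate_for_steps``
is a hypothetical name introduced only for this illustration.

.. code:: python

    def iterate_for_steps(data_iter, num_steps):
        """Yield (step, batch), restarting data_iter until num_steps."""
        step = 0
        while step < num_steps:
            made_progress = False
            for batch in data_iter:
                made_progress = True
                step += 1
                yield step, batch
                if step == num_steps:
                    return
            if not made_progress:
                # Guard against an empty iterable, which would loop forever
                return

    batches = ['b0', 'b1', 'b2']  # Stand-in for the minibatches of one epoch
    visited = [batch for _, batch in iterate_for_steps(batches, 7)]
    print(visited)  # ['b0', 'b1', 'b2', 'b0', 'b1', 'b2', 'b0']

Seven steps over a three-batch epoch wrap around the data twice and stop
partway through the third pass, which is the behavior ``train_bert``
obtains with its ``num_steps_reached`` flag.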

We can plot both the masked language modeling loss and the next sentence
prediction loss during BERT pretraining.

.. code:: python

    train_bert(train_iter, net, loss, len(vocab), devices, 1, 50)


.. parsed-literal::
    :class: output

    MLM loss 7.901, NSP loss 0.740
    21269.1 sentence pairs/sec on [gpu(0)]



.. figure:: output_bert-pretraining_vn_e425f8_11_1.svg


Representing Text with BERT
---------------------------

After pretraining BERT, we can use it to represent a single text, a text
pair, or any token in them. The following function returns the BERT
(``net``) representations for all tokens in ``tokens_a`` and
``tokens_b``.

.. code:: python

    def get_bert_encoding(net, tokens_a, tokens_b=None):
        tokens, segments = d2l.get_tokens_and_segments(tokens_a, tokens_b)
        token_ids = np.expand_dims(np.array(vocab[tokens], ctx=devices[0]),
                                   axis=0)
        segments = np.expand_dims(np.array(segments, ctx=devices[0]), axis=0)
        valid_len = np.expand_dims(np.array(len(tokens), ctx=devices[0]), axis=0)
        encoded_X, _, _ = net(token_ids, segments, valid_len)
        return encoded_X

Consider the sentence "a crane is flying". Recall the input
representation of BERT discussed in :numref:`subsec_bert_input_rep`.
After inserting the special tokens “<cls>” (used for classification) and
“<sep>” (used for separation), the BERT input sequence has a length of
six. Since zero is the index of the “<cls>” token,
``encoded_text[:, 0, :]`` is the BERT representation of the entire input
sentence. To evaluate the polysemous token "crane", we also print the
first three elements of the BERT representation of this token.

.. code:: python

    tokens_a = ['a', 'crane', 'is', 'flying']
    encoded_text = get_bert_encoding(net, tokens_a)
    # Tokens: '<cls>', 'a', 'crane', 'is', 'flying', '<sep>'
    encoded_text_cls = encoded_text[:, 0, :]
    encoded_text_crane = encoded_text[:, 2, :]
    encoded_text.shape, encoded_text_cls.shape, encoded_text_crane[0][:3]




.. parsed-literal::
    :class: output

    ((1, 6, 128),
     (1, 128),
     array([ 0.6976905 ,  0.98500854, -0.7272007 ], ctx=gpu(0)))
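
The sequence length of six follows from how the special tokens are
inserted. The sketch below mirrors what ``d2l.get_tokens_and_segments``
is expected to return; it is a stand-alone illustration under that
assumption, and the name ``tokens_and_segments`` and the short example
pair are hypothetical.

.. code:: python

    def tokens_and_segments(tokens_a, tokens_b=None):
        """Build a BERT input sequence and its segment IDs."""
        tokens = ['<cls>'] + tokens_a + ['<sep>']
        # Segment 0 covers '<cls>', the first text, and its '<sep>'
        segments = [0] * len(tokens)
        if tokens_b is not None:
            tokens += tokens_b + ['<sep>']
            # Segment 1 covers the second text and its '<sep>'
            segments += [1] * (len(tokens_b) + 1)
        return tokens, segments

    tokens, segments = tokens_and_segments(['a', 'crane', 'is', 'flying'])
    print(len(tokens), tokens.index('crane'))  # 6 2

    pair_tokens, pair_segments = tokens_and_segments(['hello', 'world'],
                                                     ['bye'])
    print(pair_tokens, pair_segments)

The token "crane" sits at index 2 because “<cls>” occupies index 0,
which is why the code above reads ``encoded_text[:, 2, :]``.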



Now consider the sentence pair "a crane driver came" and "he just left".
Similarly, ``encoded_pair[:, 0, :]`` is the encoded result of the entire
sentence pair from the pretrained BERT. Note that the first three
elements of the representation of the polysemous token "crane" differ
from those obtained in the previous context. This supports the claim
that BERT representations are context-sensitive.

.. code:: python

    tokens_a, tokens_b = ['a', 'crane', 'driver', 'came'], ['he', 'just', 'left']
    encoded_pair = get_bert_encoding(net, tokens_a, tokens_b)
    # Tokens: '<cls>', 'a', 'crane', 'driver', 'came', '<sep>', 'he', 'just',
    # 'left', '<sep>'
    encoded_pair_cls = encoded_pair[:, 0, :]
    encoded_pair_crane = encoded_pair[:, 2, :]
    encoded_pair.shape, encoded_pair_cls.shape, encoded_pair_crane[0][:3]




.. parsed-literal::
    :class: output

    ((1, 10, 128),
     (1, 128),
     array([ 0.6613879,  1.0305922, -0.6988825], ctx=gpu(0)))
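
One way to quantify how far the two "crane" representations differ is to
compare them numerically. The sketch below copies only the three
elements printed in the two outputs of this section, so it says nothing
about the remaining 125 dimensions; ``cosine_similarity`` is a helper
written for this illustration.

.. code:: python

    import math

    def cosine_similarity(x, y):
        """Cosine of the angle between two equal-length vectors."""
        dot = sum(a * b for a, b in zip(x, y))
        norm_x = math.sqrt(sum(a * a for a in x))
        norm_y = math.sqrt(sum(b * b for b in y))
        return dot / (norm_x * norm_y)

    # First three elements of the 'crane' representation in each context,
    # copied from the printed outputs
    crane_single = [0.6976905, 0.98500854, -0.7272007]
    crane_pair = [0.6613879, 1.0305922, -0.6988825]
    max_abs_diff = max(abs(a - b) for a, b in zip(crane_single, crane_pair))
    similarity = cosine_similarity(crane_single, crane_pair)
    print(f'max abs diff {max_abs_diff:.4f}, '
          f'cosine similarity {similarity:.4f}')

With the full vectors available, the same comparison could be applied to
``encoded_text_crane[0]`` and ``encoded_pair_crane[0]`` after converting
them to Python lists.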



In :numref:`chap_nlp_app`, we will fine-tune a pretrained BERT model for
downstream natural language processing applications.

Summary
-------

-  The original BERT has two versions, where the base model has 110
   million parameters and the large model has 340 million parameters.
-  After pretraining BERT, we can use it to represent a single text, a
   text pair, or any token in them.
-  In the experiment, the same token has different BERT representations
   when its contexts are different. This supports the claim that BERT
   representations are context-sensitive.

Exercises
---------

1. In the experiment, the masked language modeling loss is significantly
   higher than the next sentence prediction loss. Why?
2. Set the maximum length of a BERT input sequence to 512 (the same as in
   the original BERT model) and use a configuration of the original BERT
   model such as :math:`\text{BERT}_{\text{LARGE}}`. Do you encounter
   any error when rerunning this section? Why?

Discussions
-----------

-  English: `MXNet <https://discuss.d2l.ai/t/390>`__
-  Vietnamese: `Diễn đàn Machine Learning Cơ
   Bản <https://forum.machinelearningcoban.com/c/d2l>`__

Contributors
------------

The Vietnamese translation of this page was contributed by:

-  Đoàn Võ Duy Thanh
-  Bùi Thị Cẩm Nhung
-  Nguyễn Văn Quang
-  Phạm Minh Đức
-  Nguyễn Văn Cường

*Last updated: 12/09/2020. (Last synchronized with the original content:
21/07/2020)*