Roman Kyslyi: Large Language Models: Overview, Challenges and Solutions
AI & BigData Online Day 2023 Spring
Website - www.aiconf.com.ua
Youtube - https://www.youtube.com/startuplviv
FB - https://www.facebook.com/aiconf
BERT (Bidirectional Encoder Representations from Transformers)
• BERT is pre-trained using a masked language modelling (MLM) task.
• GPT is pre-trained using a language modelling task, where the model is trained to predict the next word in a sequence of text.
• Bi-directional vs. uni-directional
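A minimal sketch of the masked-language-modelling objective, using the Hugging Face `fill-mask` pipeline with the public `bert-base-uncased` checkpoint (the example sentence is my own, not from the slides). BERT sees context on both sides of the masked token.

```python
# Bidirectional prediction: BERT fills a masked token using left AND right context.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK]."))  # top candidate tokens with scores
```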
GPT (Generative Pre-trained Transformer)
• GPT is a large-scale language model that is pre-trained on a massive amount of text data using a transformer-based architecture.
• The transformer architecture allows it to model complex relationships between words and capture long-term dependencies in a sequence of text.
• Despite its impressive performance, GPT still has some limitations, such as a tendency to generate biased or offensive text when trained on biased or offensive data.
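A minimal sketch of the uni-directional, next-token objective, using the small open GPT-2 checkpoint as a stand-in (assumed example, not from the slides): the model continues the prompt left-to-right, one token at a time.

```python
# Unidirectional generation: the model predicts the next token repeatedly.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator(
    "Large language models are",
    max_new_tokens=20,   # generate 20 tokens, one after another
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.8,
)
print(out[0]["generated_text"])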
GPT-J
• GPT-J is one of the large language models, with 6 billion parameters.
• GPT-J is pre-trained on a massive amount of text data (about 820 GB), similar to other GPT models.
• GPT-J is open-source, which means that the code and pre-trained weights are freely available.
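A minimal sketch of loading the open GPT-J weights from the Hugging Face Hub (assuming the checkpoint id `EleutherAI/gpt-j-6B`; the full-precision model needs roughly 24 GB of memory, so in practice it is usually loaded quantized, see the quantization slide below).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("GPT-J is an open-source model that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```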
LLaMA (Large Language Model Meta AI)
• The LLaMA framework breaks down a large language model into smaller components, which are optimized separately and then combined to create a larger model.
• The LLaMA framework supports per-channel quantization.
• "Layer dropping"
• "Knowledge distillation"
LLaMA in details
• In LLaMA, the model is divided into multiple sub-models, which are trained independently using different optimization techniques. Each sub-model is designed to capture specific aspects of language, such as grammar, syntax, or semantics. Once the sub-models are trained, they are combined into a larger model by "associating" them together using a set of weights.
• The weights in the LLaMA model determine how much each sub-model contributes to the overall prediction.
• One approach to creating sub-models is to use a technique called "layer dropping", where individual layers in the model are dropped during training to create smaller sub-models.
• A common approach is to use a weighted average of the outputs of the sub-models.
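A minimal PyTorch sketch of the weighted-average combination described on this slide (an illustration of the idea only, not the actual LLaMA code; `submodels` is assumed to be a list of causal LMs that return `.logits` for the same input).

```python
import torch
import torch.nn.functional as F

def combine_submodels(submodels, weights, input_ids):
    """Weighted average of the sub-models' next-token probability distributions."""
    weights = torch.tensor(weights, dtype=torch.float32)
    weights = weights / weights.sum()              # normalize the association weights
    combined = None
    for w, model in zip(weights, submodels):
        logits = model(input_ids).logits           # each sub-model scores the same input
        probs = w * F.softmax(logits, dim=-1)      # weight its probability distribution
        combined = probs if combined is None else combined + probs
    return combined                                # combined prediction over the vocabulary
```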
Fine-tuning
• Fine-tuning for LLMs involves adapting the pre-trained model to a specific task or domain by training it on a new dataset.
• Fine-tuning of GPT-J
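A minimal sketch of fine-tuning a causal LM on a new text dataset with the Hugging Face Trainer (an assumed setup, not the speaker's exact recipe; GPT-2 and a slice of WikiText-2 stand in for GPT-J and the target domain so the example fits on a single small GPU).

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_id)

# Any domain-specific text corpus works; WikiText-2 is a placeholder dataset here.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
).filter(lambda row: len(row["input_ids"]) > 0)    # drop empty lines

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetune-demo", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM labels
)
trainer.train()
```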
Alpaca
• Fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations.
• Behaves qualitatively similarly to OpenAI's text-davinci-003.
• Cheap to reproduce (<$600).
Few-shot learning
• A machine learning framework that enables a pre-trained model to generalize over new categories of data (that the pre-trained model has not seen during training) using only a few labeled samples per class.
• Support set: the few labeled samples per novel category of data, which the pre-trained model uses to generalize to these new classes.
• The requirement for large volumes of costly labeled data is removed because, as the name suggests, the aim is to generalize using only a few labeled samples.
• Since a pre-trained model is extended to new categories of data, there is no need to re-train a model from scratch, which saves a lot of computational power.
• Even if the model has been pre-trained on a statistically different data distribution, it can be extended to other data domains as well, as long as the data in the support and query sets are coherent.
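A minimal sketch of few-shot prompting with an LLM (an assumed illustration, not from the slides): the support set of labeled examples is placed directly in the prompt, and the model labels a new query with no gradient updates. GPT-2 is used only because it is small and public; larger models follow such prompts much more reliably.

```python
from transformers import pipeline

# Support set: a few labeled examples per class.
support_set = [
    ("The delivery was late and the box was damaged.", "negative"),
    ("Great service, I will definitely order again!", "positive"),
    ("The product broke after two days.", "negative"),
]
query = "Fast shipping and excellent quality."

# Build an in-context prompt from the support set, then append the unlabeled query.
prompt = "Classify the sentiment of each review.\n\n"
for text, label in support_set:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

generator = pipeline("text-generation", model="gpt2")
print(generator(prompt, max_new_tokens=2)[0]["generated_text"])
```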
Quantization
• It is the process of mapping continuous, infinite values to a smaller set of discrete, finite values.
• In ML: converting the weights and activations of the model from their original high-precision format (such as 32-bit floating point numbers) to a lower-precision format (such as 8-bit integers).
• Bitsandbytes library
• Colab with GPT-J
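Two small sketches of the idea (assumed examples, not the original Colab): first the basic float-to-int8 mapping with a per-tensor scale, then loading GPT-J with 8-bit weights via the bitsandbytes integration in transformers (requires a GPU; install with `pip install transformers accelerate bitsandbytes`).

```python
import numpy as np

# The core idea: map float32 weights onto 256 int8 levels using one scale factor.
w = np.array([0.42, -1.30, 0.07, 2.15], dtype=np.float32)
scale = np.abs(w).max() / 127                 # per-tensor scale
q = np.round(w / scale).astype(np.int8)       # int8 representation
w_hat = q * scale                             # dequantized approximation of w

# Loading GPT-J in 8-bit so it fits in the memory of a single Colab GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,    # bitsandbytes int8 quantization of the linear layers
    device_map="auto",    # place layers on the available GPU(s) automatically
)

inputs = tokenizer("Quantization lets large models run", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```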
Future of LLM?
• Supervised learning (SL) requires large numbers of labeled samples.
• Reinforcement learning (RL) requires insane amounts of trials.
• Self-supervised learning (SSL) requires large numbers of unlabeled samples.
• LLMs:
• Output one text token after another
• They make stupid mistakes
• LLMs have no knowledge of the underlying reality
• Open letter calling for a pause on giant AI experiments