Transformers
- BERT (Bidirectional Encoder Representations from Transformers): Used for a variety of NLP tasks like sentiment analysis and question answering.
- GPT (Generative Pre-trained Transformer): Used for text generation and completion tasks.
- Vision Transformers (ViT): Adapted for image classification tasks, where the image is divided into patches that the transformer processes as a sequence.
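As a concrete illustration of these use cases, the sketch below assumes the Hugging Face transformers library is installed and uses its pipeline API to run a BERT-style sentiment classifier and a GPT-style text generator; the checkpoints, prompts, and parameters are illustrative defaults rather than anything prescribed here.

```python
# Minimal usage sketch, assuming the Hugging Face `transformers` package is installed.
# The pipeline API downloads default pretrained checkpoints; the prompts are placeholders.
from transformers import pipeline

# Encoder-style model (BERT family): classify the sentiment of a sentence.
sentiment = pipeline("sentiment-analysis")
print(sentiment("Transformers handle long-range context remarkably well."))

# Decoder-style model (GPT family): continue a text prompt.
generator = pipeline("text-generation", model="gpt2")
print(generator("The transformer architecture", max_new_tokens=20, num_return_sequences=1))
```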
Architecture:
- Self-Attention Mechanism: The core component is the self-attention mechanism, which lets the model weigh the importance of the other input tokens when computing each token's representation (see the sketch after this list).
- Layers: The model stacks multiple identical layers, each with a self-attention sub-layer and a feedforward sub-layer; each sub-layer is wrapped in a residual connection followed by layer normalization.
- Positional Encoding: Since transformers, unlike RNNs, have no built-in notion of token order, they add positional encodings to the input embeddings to convey the position of each token in the sequence.
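The following is a minimal NumPy sketch of scaled dot-product self-attention and the sinusoidal positional encoding described above; the dimensions, weight matrices, and helper names (self_attention, positional_encoding) are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for a single sequence.

    X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: (d_model, d_k) projection matrices.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) pairwise relevance
    weights = softmax(scores, axis=-1)        # how much each token attends to every other token
    return weights @ V                        # weighted sum of value vectors

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings added to the input embeddings."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

# Toy example: 5 tokens, model width 8, one attention head.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8)) + positional_encoding(5, 8)  # inject order information
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # -> (5, 8)
```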
Characteristics:
- Purpose: Originally designed for sequence-to-sequence tasks in NLP, such as translation and text generation, but now widely used in other domains.
- Parallelization: Self-attention processes all tokens in a sequence simultaneously rather than step by step as in RNNs, making transformers highly efficient to train on large datasets.
- Scalability: The architecture scales well with data size and model capacity, and it underlies some of the largest models in existence (e.g., GPT, BERT).
- Flexibility: Transformers can handle variable-length input sequences (typically via padding and attention masks), making them suitable for tasks involving sequential or structured data; see the sketch below.
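To make the parallelization and variable-length points concrete, here is a simplified NumPy sketch in which a padded batch of sequences is processed in one set of matrix multiplications while an attention mask hides the padding; it omits the learned query/key/value projections, and every shape and name is illustrative.

```python
import numpy as np

def masked_self_attention(X, mask):
    """Simplified attention over a padded batch (no learned projections).

    X: (batch, seq_len, d_model) padded embeddings; mask: (batch, seq_len) with 1 for
    real tokens and 0 for padding. All positions are computed in parallel.
    """
    d = X.shape[-1]
    scores = X @ X.transpose(0, 2, 1) / np.sqrt(d)           # (batch, seq, seq), all pairs at once
    scores = np.where(mask[:, None, :] == 1, scores, -1e9)   # padding receives ~zero attention weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ X

# Two sequences of different lengths (3 and 5 real tokens) padded to a common length of 5.
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 5, 4))
mask = np.array([[1, 1, 1, 0, 0],
                 [1, 1, 1, 1, 1]])
out = masked_self_attention(X, mask)
print(out.shape)  # -> (2, 5, 4); no position attends to the padded tokens
```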