{"id":6455,"date":"2025-10-13T23:14:51","date_gmt":"2025-10-13T23:14:51","guid":{"rendered":"https:\/\/cloudypos.com\/nbosi\/?post_type=articles&#038;p=6455"},"modified":"2025-10-13T23:18:37","modified_gmt":"2025-10-13T23:18:37","slug":"the-mathematics-behind-modern-ai","status":"publish","type":"articles","link":"https:\/\/cloudypos.com\/nbosi\/articles\/the-mathematics-behind-modern-ai\/","title":{"rendered":"The Mathematics Behind Modern AI"},"content":{"rendered":"<p>A Deep Dive into GPT, DeepSeek, Gemini, and Claude<\/p>\n<p>Hene Aku Kwapong, PhD, MBA, MIT Practice School Alumni<\/p>\n<p>Document prepared with the help of:<br \/>\nJeremie Kwapong, New England Innovation Academy<\/p>\n<p>Date:\u00a0October 9, 2025<\/p>\n<p>&nbsp;<\/p>\n<p>Decoding the Language of Intelligent Machines<\/p>\n<p>In 2017, a team of researchers at Google published a paper with a bold title: \u201cAttention Is All You Need.\u201d This work introduced the Transformer architecture, which would revolutionize artificial intelligence and give birth to the large language models that now shape our digital world. But what makes these systems work? What mathematical principles allow a computer to understand context, generate coherent text, and even reason about complex problems?<\/p>\n<p>This document explores the mathematical foundations of three groundbreaking AI architectures: OpenAI\u2019s GPT models, which demonstrated the power of scaling; DeepSeek\u2019s efficient design, which proved that smart engineering can rival brute force; and Google\u2019s Gemini, which extended AI\u2019s reach into the multimodal realm of images, audio, and video.<\/p>\n<p>We\u2019ll present the actual mathematical formulations that power these systems, but explain them in a way that reveals their elegant simplicity. 
Our goal is to make sophisticated concepts accessible without sacrificing accuracy or depth.<\/p>\n<p>&nbsp;<\/p>\n<p>Part 1: The Foundational Architecture &#8211; The Transformer<\/p>\n<p>Before diving into the specific innovations of GPT, DeepSeek, and Gemini, we must understand the Transformer architecture that underlies them all. Introduced by Vaswani et al.\u00a0in 2017, the Transformer solved a fundamental problem: how to process sequences of information in parallel while still capturing relationships between distant elements.<\/p>\n<p>&nbsp;<\/p>\n<h3>The Attention Mechanism: Learning What Matters<\/h3>\n<p>Imagine reading a detective novel. When you encounter the phrase \u201cthe butler did it,\u201d your mind instantly connects this to earlier mentions of the butler, the crime scene, and various clues scattered throughout hundreds of pages. You\u2019re not giving equal weight to every word you\u2019ve read\u2014you\u2019re selectively attending to relevant information.<\/p>\n<p>The Transformer does exactly this through its attention mechanism. The mathematics, while precise, capture this intuitive process beautifully.<\/p>\n<p>&nbsp;<\/p>\n<h4>Scaled Dot-Product Attention<\/h4>\n<p>The core attention formula is:<\/p>\n<p>Attention(Q, K, V) = softmax(QK^T \/ \u221ad_k) V<\/p>\n<p>Let\u2019s unpack this step by step:<\/p>\n<p>The Three Matrices:\u00a0&#8211; Q (Query): Represents \u201cwhat am I looking for?\u201d For each word, this encodes what information it needs from other words. &#8211; K (Key): Represents \u201cwhat information do I have?\u201d Each word advertises what it can offer. &#8211; V (Value): Represents \u201cwhat is my actual content?\u201d The information that will be retrieved.<\/p>\n<p>The Computation:<\/p>\n<ol start=\"1\">\n<li>QK^T: We multiply queries by keys (transposed). This creates a matrix where each entry measures how relevant each key is to each query. 
Think of it as computing compatibility scores.<\/li>\n<li>Division by \u221ad_k: This scaling factor is crucial. Without it, when d_k (the dimension of our keys) is large, the dot products can become very large, pushing the softmax function into regions where gradients become tiny. The square root scaling keeps the values in a reasonable range. It\u2019s like adjusting the volume on a stereo\u2014too loud and you get distortion, too quiet and you miss important details.<\/li>\n<li>softmax(\u2026): This converts our compatibility scores into probabilities that sum to 1. It\u2019s a way of saying \u201cdistribute your attention across all the words, but focus more on the relevant ones.\u201d<\/li>\n<li>Multiply by V: Finally, we use these attention weights to compute a weighted sum of the values. We\u2019re gathering information from all words, but in proportion to their relevance.<\/li>\n<\/ol>\n<p>Example in Action:<\/p>\n<p>Consider the sentence: \u201cThe cat sat on the mat because it was comfortable.\u201d<\/p>\n<p>When processing \u201cit,\u201d the attention mechanism computes: &#8211; High attention to \u201cmat\u201d (it could refer to the mat) &#8211; High attention to \u201ccat\u201d (it could refer to the cat) &#8211; Lower attention to \u201csat,\u201d \u201con,\u201d \u201cthe\u201d (less relevant for resolving the pronoun)<\/p>\n<p>The context (\u201cbecause it was comfortable\u201d) helps the model determine that \u201cit\u201d likely refers to the mat, not the cat.<\/p>\n<p>&nbsp;<\/p>\n<h4>Multi-Head Attention: Multiple Perspectives<\/h4>\n<p>A single attention mechanism is powerful, but multiple attention mechanisms working in parallel are even better. 
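<\/p>
<p>The computation just described can be sketched in a few lines of NumPy. This is a minimal illustration with random matrices, not any model\u2019s actual implementation:<\/p>

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # compatibility scores
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 64, 64                   # 5 tokens, toy head dimensions
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
out, weights = attention(Q, K, V)         # one context vector per token
```

<p>Each output row is a weighted average of the value vectors, with the rows of the weight matrix playing the role of the attention probabilities described above.<\/p>
<p>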
This is multi-head attention:<\/p>\n<p>MultiHead(Q, K, V) = Concat(head\u2081, &#8230;, head\u2095) W^O<\/p>\n<p>where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)<\/p>\n<p>Why Multiple Heads?<\/p>\n<p>Different heads can learn to attend to different types of relationships: &#8211; One head might focus on syntactic relationships (subject-verb agreement) &#8211; Another might capture semantic relationships (synonyms, antonyms) &#8211; Yet another might track long-range dependencies (pronouns to their antecedents)<\/p>\n<p>The Mathematics:<\/p>\n<p>Each head has its own learned projection matrices: &#8211; W_i^Q \u2208 \u211d^(d_model \u00d7 d_k): Projects the input into query space for head i &#8211; W_i^K \u2208 \u211d^(d_model \u00d7 d_k): Projects the input into key space for head i &#8211; W_i^V \u2208 \u211d^(d_model \u00d7 d_v): Projects the input into value space for head i<\/p>\n<p>After computing attention for each head independently, we concatenate all the outputs and project them back to the model dimension using W^O \u2208 \u211d^(hd_v \u00d7 d_model).<\/p>\n<p>Typical Configuration:\u00a0&#8211; Number of heads (h): 8 to 96 (depending on model size) &#8211; d_k = d_v = d_model \/ h (typically 64 for each head in a 512-dimensional model)<\/p>\n<p>This design ensures that the total computational cost is similar to single-head attention with full dimensionality, but we gain the benefit of multiple specialized attention patterns.<\/p>\n<p>&nbsp;<\/p>\n<h3>Positional Encoding: Understanding Order<\/h3>\n<p>Unlike recurrent neural networks that process sequences step by step, Transformers process all positions simultaneously. This parallelism is great for speed, but it creates a problem: the model has no inherent sense of word order. 
\u201cDog bites man\u201d and \u201cMan bites dog\u201d would look identical!<\/p>\n<p>The solution is positional encoding, which adds position information to each word\u2019s representation:<\/p>\n<p>PE(pos, 2i) = sin(pos \/ 10000^(2i\/d_model))<br \/>\nPE(pos, 2i+1) = cos(pos \/ 10000^(2i\/d_model))<\/p>\n<p>Understanding the Formula:<\/p>\n<ul>\n<li>pos: The position of the word in the sequence (0, 1, 2, \u2026)<\/li>\n<li>i: The dimension index (0, 1, 2, \u2026, d_model\/2)<\/li>\n<li>2i and 2i+1: Even and odd dimensions use sine and cosine respectively<\/li>\n<\/ul>\n<p>Why Sine and Cosine?<\/p>\n<p>This choice is brilliant for several reasons:<\/p>\n<ol start=\"1\">\n<li>Unique Fingerprints: Each position gets a unique pattern of sine and cosine values across all dimensions.<\/li>\n<li>Relative Positions: The model can learn to attend to relative positions. Due to trigonometric identities, PE(pos+k) can be expressed as a linear function of PE(pos), making it easy for the model to learn patterns like \u201cattend to the word 3 positions back.\u201d<\/li>\n<li>Extrapolation: The model can potentially handle sequences longer than those seen during training, since the sinusoidal pattern continues smoothly.<\/li>\n<li>No Learning Required: Unlike learned positional embeddings, these are fixed functions, reducing the number of parameters.<\/li>\n<\/ol>\n<p>Visualization:<\/p>\n<p>Imagine each position as a unique musical chord. Position 0 might be a C major chord, position 1 a slightly different chord, and so on. 
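<\/p>
<p>The formulas are easy to generate and verify directly. A minimal sketch (illustrative only):<\/p>

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]        # positions 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]     # dimension pair index
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions
    pe[:, 1::2] = np.cos(angles)             # odd dimensions
    return pe

pe = positional_encoding(seq_len=50, d_model=512)
```

<p>Every row of the resulting matrix is a distinct pattern of values between -1 and 1, which is exactly the \u201cunique chord\u201d idea: no two positions share a fingerprint.<\/p>
<p>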
Each dimension contributes a different frequency to this chord, creating a rich, unique signature for each position.<\/p>\n<p>&nbsp;<\/p>\n<h3>Position-Wise Feed-Forward Networks<\/h3>\n<p>After attention, each position\u2019s representation is processed through a simple two-layer network:<\/p>\n<p>FFN(x) = max(0, xW\u2081 + b\u2081)W\u2082 + b\u2082<\/p>\n<p>Breaking It Down:<\/p>\n<ol start=\"1\">\n<li>First Linear Layer (xW\u2081 + b\u2081): Expands the representation from d_model dimensions to d_ff dimensions (typically d_ff = 4 \u00d7 d_model). This expansion gives the network more capacity to process information.<\/li>\n<li>ReLU Activation (max(0, \u2026)): The Rectified Linear Unit acts as a gatekeeper, setting negative values to zero. This non-linearity is crucial\u2014without it, the entire network would collapse to a single linear transformation.<\/li>\n<li>Second Linear Layer (multiplying by W\u2082 and adding b\u2082): Projects back down to d_model dimensions.<\/li>\n<\/ol>\n<p>Why \u201cPosition-Wise\u201d?<\/p>\n<p>The same FFN is applied to each position independently. It\u2019s like having the same expert analyze each word separately, after the attention mechanism has already allowed words to share information with each other.<\/p>\n<p>Typical Dimensions:\u00a0&#8211; Input\/Output: d_model = 512 to 12,288 (depending on model size) &#8211; Hidden layer: d_ff = 2,048 to 49,152 (typically 4\u00d7 the model dimension)<\/p>\n<p>&nbsp;<\/p>\n<h3>Layer Normalization and Residual Connections<\/h3>\n<p>Two additional components stabilize training:<\/p>\n<p>Residual Connections:<\/p>\n<p>output = LayerNorm(x + Sublayer(x))<\/p>\n<p>The \u201c+x\u201d is a residual connection\u2014we add the input back to the output. 
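<\/p>
<p>Both components fit in a few lines. In the sketch below, layer normalization (standardize to mean 0 and variance 1, then scale and shift) wraps a residual sum around the feed-forward sub-layer from the previous section; the weights are random stand-ins with an assumed 0.02 initialization scale:<\/p>

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each position to mean 0, variance 1, then scale and shift
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def ffn(x, W1, b1, W2, b2):
    # Position-wise FFN: max(0, x W1 + b1) W2 + b2
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, n = 512, 2048, 10
x = rng.standard_normal((n, d_model))
W1 = 0.02 * rng.standard_normal((d_model, d_ff)); b1 = np.zeros(d_ff)
W2 = 0.02 * rng.standard_normal((d_ff, d_model)); b2 = np.zeros(d_model)

# The sub-layer pattern: output = LayerNorm(x + Sublayer(x))
out = layer_norm(x + ffn(x, W1, b1, W2, b2))
```

<p>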
This allows gradients to flow directly through the network during training, preventing the vanishing gradient problem in deep networks.<\/p>\n<p>Layer Normalization:<\/p>\n<p>Normalizes the activations to have mean 0 and variance 1, then applies learned scaling and shifting. This keeps the values in a reasonable range and stabilizes training.<\/p>\n<p>&nbsp;<\/p>\n<h3>The Complete Transformer Block<\/h3>\n<p>Putting it all together, a Transformer block consists of:<\/p>\n<ol start=\"1\">\n<li>Multi-head self-attention<\/li>\n<li>Residual connection and layer normalization<\/li>\n<li>Position-wise feed-forward network<\/li>\n<li>Another residual connection and layer normalization<\/li>\n<\/ol>\n<p>This block is stacked multiple times (6 to 96+ layers) to create the full model.<\/p>\n<p>&nbsp;<\/p>\n<hr style=\"page-break-before: always; display: none;\" \/>\n<p>Part 2: OpenAI GPT &#8211; Scaling the Transformer<\/p>\n<p>&nbsp;<\/p>\n<h3>The GPT Philosophy: Simplicity at Scale<\/h3>\n<p>OpenAI\u2019s approach with GPT (Generative Pre-trained Transformer) was elegantly simple: take the Transformer decoder, scale it up massively, train it on enormous amounts of text, and let capabilities emerge from scale.<\/p>\n<p>&nbsp;<\/p>\n<h3>Architecture Details<\/h3>\n<p>GPT uses the standard Transformer architecture with a few key choices:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/cloudypos.com\/nbosi\/wp-content\/uploads\/sites\/59\/2025\/10\/image1.png\" \/><\/p>\n<p>GPT Architecture Diagram<\/p>\n<p>Figure 1: OpenAI GPT Architecture.\u00a0The model consists of stacked Transformer blocks with causal self-attention, allowing each position to only attend to previous positions. Each block contains multi-head attention followed by a feed-forward network, with residual connections and layer normalization.<\/p>\n<p>Decoder-Only Architecture:<\/p>\n<p>Unlike the original Transformer (which had both encoder and decoder), GPT uses only the decoder. 
This means: &#8211; It processes text left-to-right &#8211; Each position can only attend to previous positions (causal masking) &#8211; This makes it naturally suited for text generation<\/p>\n<p>Causal Masking:<\/p>\n<p>The attention mechanism is modified to prevent \u201clooking ahead\u201d:<\/p>\n<p>Attention(Q, K, V) = softmax((QK^T + M) \/ \u221ad_k) V<\/p>\n<p>where M_ij = 0 if i \u2265 j, -\u221e if i &lt; j<\/p>\n<p>The mask M ensures that position i can only attend to positions j where j \u2264 i. The -\u221e values become 0 after softmax, effectively blocking attention to future positions.<\/p>\n<p>&nbsp;<\/p>\n<h3>The Mathematics of Scale<\/h3>\n<p>Model Dimensions (GPT-3 as example):\u00a0&#8211; Layers: 96 &#8211; d_model: 12,288 &#8211; Number of heads: 96 &#8211; d_k = d_v: 128 per head &#8211; d_ff: 49,152 &#8211; Context length: 2,048 tokens (later extended to 128K in GPT-4) &#8211; Total parameters: ~175 billion (GPT-3)<\/p>\n<p>Attention Complexity:<\/p>\n<p>For a sequence of length n: &#8211; Attention computation: O(n\u00b2 \u00d7 d_model) &#8211; Memory for attention scores: O(n\u00b2 \u00d7 h)<\/p>\n<p>This quadratic scaling with sequence length is why context length was initially limited.<\/p>\n<p>&nbsp;<\/p>\n<h3>Training Objective: Next-Token Prediction<\/h3>\n<p>GPT is trained with a simple but powerful objective:<\/p>\n<p>L = -\u2211_{t=1}^T log P(x_t | x_1, &#8230;, x_{t-1})<\/p>\n<p>What This Means:<\/p>\n<p>For each position t in the training text: 1. Given all previous tokens (x_1 through x_{t-1}) 2. Predict the probability distribution over the next token (x_t) 3. 
Maximize the log probability of the actual next token<\/p>\n<p>This seemingly simple task forces the model to learn: &#8211; Grammar and syntax &#8211; Factual knowledge &#8211; Reasoning patterns &#8211; Common sense &#8211; And much more<\/p>\n<p>Why It Works:<\/p>\n<p>The key insight is that predicting the next word requires understanding context, relationships, and patterns. To predict the next word in \u201cThe capital of France is \u2026\u201d, the model must know geography. To predict the next word in \u201cShe was happy because \u2026\u201d, the model must understand causation and emotion.<\/p>\n<p>&nbsp;<\/p>\n<h3>Computational Requirements<\/h3>\n<p>Training:\u00a0&#8211; GPT-3 training: ~3.14 \u00d7 10\u00b2\u00b3 FLOPs &#8211; Training time: Several months on thousands of GPUs &#8211; Training data: ~300 billion tokens<\/p>\n<p>Inference:\u00a0&#8211; For each token generated: ~2 \u00d7 (number of parameters) FLOPs &#8211; All parameters are active for every token &#8211; Memory: Must store all parameters plus KV cache<\/p>\n<p>&nbsp;<\/p>\n<h3>Strengths and Limitations<\/h3>\n<p>Strengths:\u00a0&#8211; Proven, reliable architecture &#8211; Excellent general-purpose performance &#8211; Strong few-shot learning capabilities &#8211; Predictable scaling behavior<\/p>\n<p>Limitations:\u00a0&#8211; High computational cost (all parameters always active) &#8211; Quadratic attention complexity limits context length &#8211; Large memory footprint &#8211; Expensive to train and run<\/p>\n<p>&nbsp;<\/p>\n<hr style=\"page-break-before: always; display: none;\" \/>\n<p>Part 3: DeepSeek &#8211; The Efficiency Revolution<\/p>\n<p>&nbsp;<\/p>\n<h3>The DeepSeek Philosophy: Matching Performance at a Fraction of the Cost<\/h3>\n<p>DeepSeek asked a provocative question: \u201cCan we achieve GPT-level performance while using only 5% of the computational resources per token?\u201d Their answer came through two major innovations: Multi-Head Latent Attention (MLA) for memory efficiency and 
DeepSeekMoE for computational efficiency.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/cloudypos.com\/nbosi\/wp-content\/uploads\/sites\/59\/2025\/10\/image2.png\" \/><br \/>\nDeepSeek Architecture Diagram<\/p>\n<p>Figure 2: DeepSeek Architecture.\u00a0The model features Multi-Head Latent Attention (MLA) with 98% KV cache reduction through compression, and DeepSeekMoE with sparse expert routing that activates only 37B out of 671B parameters per token. The architecture combines shared experts (always active) with routed experts (selectively activated based on token affinity).<\/p>\n<p>&nbsp;<\/p>\n<h3>Innovation 1: Multi-Head Latent Attention (MLA)<\/h3>\n<p>The KV cache problem is one of the biggest bottlenecks in LLM inference. For each token generated, the model must store the keys and values from all previous tokens. With long contexts and many attention heads, this memory requirement becomes enormous.<\/p>\n<p>&nbsp;<\/p>\n<h4>The Compression Strategy<\/h4>\n<p>MLA solves this through low-rank compression. 
Instead of storing full keys and values, it stores compressed representations:<\/p>\n<p>Step 1: Compress Keys and Values<\/p>\n<p>c_t^KV = W^DKV h_t<\/p>\n<p>Where:\u00a0&#8211; h_t \u2208 \u211d^d: The input hidden state for token t &#8211; W^DKV \u2208 \u211d^(d_c \u00d7 d): Down-projection matrix &#8211; c_t^KV \u2208 \u211d^d_c: Compressed latent vector (d_c \u226a d_h \u00d7 n_h)<\/p>\n<p>Typical Values:\u00a0&#8211; d (model dimension): 5,120 &#8211; n_h (number of heads): 128 &#8211; d_h (dimension per head): 128 &#8211; d_c (compressed dimension): 512<\/p>\n<p>This means instead of storing 128 \u00d7 128 = 16,384 values per token, we store only 512\u2014a 97% reduction!<\/p>\n<p>Step 2: Decompress Keys<\/p>\n<p>[k_{t,1}^C; k_{t,2}^C; &#8230;; k_{t,n_h}^C] = k_t^C = W^UK c_t^KV<\/p>\n<p>Where:\u00a0&#8211; W^UK \u2208 \u211d^(d_h\u00d7n_h \u00d7 d_c): Up-projection matrix for keys &#8211; k_t^C: Decompressed keys for all heads (concatenated)<\/p>\n<p>Step 3: Add Rotary Position Information<\/p>\n<p>k_t^R = RoPE(W^KR h_t)<br \/>\nk_{t,i} = [k_{t,i}^C; k_t^R]<\/p>\n<p>Where:\u00a0&#8211; W^KR \u2208 \u211d^(d_h^R \u00d7 d): Matrix for RoPE keys &#8211; RoPE(\u00b7): Rotary Position Embedding function &#8211; k_t^R: Position-dependent component (shared across heads) &#8211; [\u00b7; \u00b7]: Concatenation<\/p>\n<p>Why Separate Position Information?<\/p>\n<p>RoPE (Rotary Position Embedding) is more efficient than sinusoidal encoding. 
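<\/p>
<p>Steps 1\u20133 can be sketched with the dimensions quoted above to see where the saving comes from. The projection matrices below are random stand-ins for the learned ones, and the shared RoPE component is omitted for brevity:<\/p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_h, d_h, d_c = 5120, 128, 128, 512     # typical values from the text

# Random stand-ins for the learned projection matrices
W_DKV = 0.01 * rng.standard_normal((d_c, d))         # down-projection
W_UK = 0.01 * rng.standard_normal((n_h * d_h, d_c))  # key up-projection

h_t = rng.standard_normal(d)               # hidden state for one token

c_kv = W_DKV @ h_t      # compressed latent: the only part cached per token
k_all = W_UK @ c_kv     # keys for all 128 heads, reconstructed on the fly
```

<p>Only the 512-dimensional latent is cached; the 16,384 key values are reconstructed when needed, which is the 97% cache reduction claimed above.<\/p>
<p>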
By keeping position information separate and shared across heads, we save even more memory.<\/p>\n<p>Step 4: Decompress Values<\/p>\n<p>[v_{t,1}^C; v_{t,2}^C; &#8230;; v_{t,n_h}^C] = v_t^C = W^UV c_t^KV<\/p>\n<p>Where:\u00a0&#8211; W^UV \u2208 \u211d^(d_h\u00d7n_h \u00d7 d_c): Up-projection matrix for values<\/p>\n<p>&nbsp;<\/p>\n<h4>Query Compression<\/h4>\n<p>Queries are also compressed, but this is for reducing activation memory during training, not inference:<\/p>\n<p>c_t^Q = W^DQ h_t<br \/>\n[q_{t,1}^C; q_{t,2}^C; &#8230;; q_{t,n_h}^C] = q_t^C = W^UQ c_t^Q<br \/>\n[q_{t,1}^R; q_{t,2}^R; &#8230;; q_{t,n_h}^R] = q_t^R = RoPE(W^QR c_t^Q)<br \/>\nq_{t,i} = [q_{t,i}^C; q_{t,i}^R]<\/p>\n<p>Where:\u00a0&#8211; c_t^Q \u2208 \u211d^(d_c\u2019): Compressed query latent (d_c\u2019 \u226a d_h \u00d7 n_h) &#8211; W^DQ \u2208 \u211d^(d_c\u2019 \u00d7 d): Query down-projection &#8211; W^UQ \u2208 \u211d^(d_h\u00d7n_h \u00d7 d_c\u2019): Query up-projection &#8211; W^QR \u2208 \u211d^(d_h\u00d7n_h \u00d7 d_c\u2019): Query RoPE matrix<\/p>\n<p>&nbsp;<\/p>\n<h4>Final Attention Computation<\/h4>\n<p>o_{t,i} = \u2211_{j=1}^t softmax_j(q_{t,i}^T k_{j,i} \/ \u221a(d_h + d_h^R)) v_{j,i}^C<\/p>\n<p>u_t = W^O [o_{t,1}; o_{t,2}; &#8230;; o_{t,n_h}]<\/p>\n<p>Where:\u00a0&#8211; W^O \u2208 \u211d^(d \u00d7 d_h\u00d7n_h): Output projection matrix &#8211; o_{t,i}: Attention output for head i at position t<\/p>\n<p>Memory Savings:<\/p>\n<p>For DeepSeek-V3 with 128 heads: &#8211; Standard MHA: 128 \u00d7 128 \u00d7 2 = 32,768 values per token &#8211; MLA: 512 + 128 = 640 values per token &#8211; Reduction: 98%<\/p>\n<p>&nbsp;<\/p>\n<h3>Innovation 2: DeepSeekMoE Architecture<\/h3>\n<p>The second major innovation is the Mixture-of-Experts (MoE) architecture, which activates only a small fraction of parameters for each token.<\/p>\n<p>&nbsp;<\/p>\n<h4>The Basic MoE Formula<\/h4>\n<p>h_t&#8217; = u_t + \u2211_{i=1}^{N_s} FFN_i^{(s)}(u_t) + \u2211_{i=1}^{N_r} g_{i,t} 
FFN_i^{(r)}(u_t)<\/p>\n<p>Where:\u00a0&#8211; u_t: Input to the FFN layer &#8211; N_s: Number of shared experts (always activated) &#8211; N_r: Number of routed experts (selectively activated) &#8211; FFN_i^{(s)}(\u00b7): i-th shared expert &#8211; FFN_i^{(r)}(\u00b7): i-th routed expert &#8211; g_{i,t}: Gating value for routed expert i at token t &#8211; h_t\u2019: Output of the FFN layer<\/p>\n<p>DeepSeek-V3 Configuration:\u00a0&#8211; N_s = 2 shared experts &#8211; N_r = 256 routed experts &#8211; K_r = 8 routed experts activated per token &#8211; Expert size: Each expert has ~15M parameters<\/p>\n<p>Total Computation:\u00a0&#8211; Shared experts: 2 experts (always active) &#8211; Routed experts: 8 experts (selected from 256) &#8211; Total active: 10 experts per token &#8211; Percentage: 10\/258 \u2248 3.9% of FFN parameters<\/p>\n<p>&nbsp;<\/p>\n<h4>The Routing Mechanism<\/h4>\n<p>Step 1: Compute Affinity Scores<\/p>\n<p>s_{i,t} = Sigmoid(u_t^T e_i)<\/p>\n<p>Where:\u00a0&#8211; e_i \u2208 \u211d^d: Centroid vector for expert i (learned parameter) &#8211; s_{i,t}: Token-to-expert affinity score (between 0 and 1) &#8211; Sigmoid: Ensures scores are in (0, 1) range<\/p>\n<p>Intuition:<\/p>\n<p>Each expert has a \u201cspecialty\u201d represented by its centroid vector e_i. The affinity score measures how well the current token matches that specialty. 
Think of it as asking \u201cIs this expert relevant for processing this token?\u201d<\/p>\n<p>Step 2: Select Top-K Experts<\/p>\n<p>g_{i,t}&#8217; = s_{i,t} \u00a0 \u00a0if s_{i,t} \u2208\u00a0TopK({s_{j,t} | 1 \u2264 j \u2264 N_r}, K_r)<br \/>\ng_{i,t}&#8217; = 0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0otherwise<\/p>\n<p>Only the K_r experts with highest affinity scores are selected.<\/p>\n<p>Step 3: Normalize Gating Values<\/p>\n<p>g_{i,t} = g_{i,t}&#8217; \/ \u2211_{j=1}^{N_r} g_{j,t}&#8217;<\/p>\n<p>The gating values are normalized so they sum to 1, ensuring the output is a proper weighted combination.<\/p>\n<p>&nbsp;<\/p>\n<h3>Innovation 3: Auxiliary-Loss-Free Load Balancing<\/h3>\n<p>Traditional MoE models face a critical problem: some experts become overused while others are underutilized. This is typically solved with an auxiliary loss that penalizes imbalance, but this loss can hurt model performance.<\/p>\n<p>DeepSeek\u2019s solution is elegant: adjust the routing dynamically without modifying the training loss.<\/p>\n<p>&nbsp;<\/p>\n<h4>The Bias Adjustment Mechanism<\/h4>\n<p>g_{i,t}&#8217; = s_{i,t} \u00a0 \u00a0if s_{i,t} + b_i \u2208\u00a0TopK({s_{j,t} + b_j | 1 \u2264 j \u2264 N_r}, K_r)<br \/>\ng_{i,t}&#8217; = 0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0otherwise<\/p>\n<p>Where:\u00a0&#8211; b_i: Bias term for expert i (dynamically adjusted)<\/p>\n<p>Key Insight:<\/p>\n<p>The bias b_i is used only for routing decisions (TopK selection), not for computing the gating values. 
This means: &#8211; Routing is influenced by the bias (encouraging balance) &#8211; Gating values remain based on true affinity (preserving performance)<\/p>\n<p>&nbsp;<\/p>\n<h4>Dynamic Bias Update<\/h4>\n<p>At the end of each training step:<\/p>\n<p>b_i(step+1) = b_i(step) &#8211; \u03b3 \u00a0 \u00a0if expert i is overloaded<br \/>\nb_i(step+1) = b_i(step) + \u03b3 \u00a0 \u00a0if expert i is underloaded<br \/>\nb_i(step+1) = b_i(step) \u00a0 \u00a0 \u00a0 \u00a0if expert i is balanced<\/p>\n<p>Where:\u00a0&#8211; \u03b3: Bias update speed (hyperparameter, typically 0.01)<\/p>\n<p>Load Criteria:<\/p>\n<p>An expert is considered: &#8211; Overloaded\u00a0if it receives more than (1 + \u03b5) \u00d7 (K_r \/ N_r) \u00d7 batch_size tokens &#8211; Underloaded\u00a0if it receives fewer than (1 &#8211; \u03b5) \u00d7 (K_r \/ N_r) \u00d7 batch_size tokens &#8211; Balanced\u00a0otherwise<\/p>\n<p>Where \u03b5\u00a0is a tolerance parameter (typically 0.2)<\/p>\n<p>How It Works:<\/p>\n<ol start=\"1\">\n<li>If an expert is overloaded, we decrease its bias, making it less likely to be selected<\/li>\n<li>If an expert is underloaded, we increase its bias, making it more likely to be selected<\/li>\n<li>Over time, this naturally balances the load without forcing artificial constraints<\/li>\n<\/ol>\n<p>Analogy:<\/p>\n<p>Think of it like dynamic pricing: if a restaurant is too crowded, prices go up slightly (negative bias), discouraging some customers. If it\u2019s empty, prices go down (positive bias), attracting more customers. 
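<\/p>
<p>The full routing path, including the bias trick, fits in a short sketch. The centroids here are random stand-ins for learned parameters, and the sizes are toy values apart from N_r and K_r:<\/p>

```python
import numpy as np

def route(u, E, b, K_r):
    """Route one token: sigmoid affinities, biased TopK, unbiased gates."""
    s = 1.0 / (1.0 + np.exp(-E @ u))    # affinity scores s_{i,t}
    top = np.argsort(s + b)[-K_r:]      # TopK selection uses biased scores...
    g = np.zeros_like(s)
    g[top] = s[top]                     # ...but gates keep the true affinities
    return g / g.sum()                  # normalize so gates sum to 1

rng = np.random.default_rng(0)
N_r, K_r, d = 256, 8, 64                # 256 routed experts, 8 active; toy d
E = rng.standard_normal((N_r, d))       # expert centroid vectors e_i
bias = np.zeros(N_r)                    # nudged by +/- gamma after each step
u = rng.standard_normal(d)
gates = route(u, E, bias, K_r)
```

<p>Raising or lowering one entry of the bias vector changes which experts make the TopK cut without changing the gate values of those selected, which is exactly the separation the text describes.<\/p>
<p>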
The quality of the food (true affinity) doesn\u2019t change, but the selection behavior adjusts.<\/p>\n<p>&nbsp;<\/p>\n<h3>Complementary Sequence-Wise Balance Loss<\/h3>\n<p>While the auxiliary-loss-free strategy handles batch-level balance, a small auxiliary loss prevents extreme imbalance within individual sequences:<\/p>\n<p>L_Bal = \u03b1 \u2211_{i=1}^{N_r} f_i P_i<\/p>\n<p>Where:<\/p>\n<p>f_i = (N_r \/ (K_r T)) \u2211_{t=1}^T \ud835\udfd9[s_{i,t} \u2208\u00a0TopK({s_{j,t} | 1 \u2264 j \u2264 N_r}, K_r)]<\/p>\n<p>P_i = (1\/T) \u2211_{t=1}^T s_{i,t}&#8217;<\/p>\n<p>s_{i,t}&#8217; = s_{i,t} \/ \u2211_{j=1}^{N_r} s_{j,t}<\/p>\n<p>Breaking It Down:<\/p>\n<ul>\n<li>f_i: Fraction of tokens in the sequence routed to expert i<\/li>\n<li>P_i: Average routing probability for expert i<\/li>\n<li>\u03b1: Balance loss coefficient (typically 0.001-0.01)<\/li>\n<li>T: Sequence length<\/li>\n<li>\ud835\udfd9[\u00b7]: Indicator function (1 if condition is true, 0 otherwise)<\/li>\n<\/ul>\n<p>Why This Loss?<\/p>\n<p>The product f_i \u00d7 P_i is minimized when experts are balanced. If an expert has high routing probability (P_i) but low actual usage (f_i), or vice versa, the loss increases. 
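<\/p>
<p>The loss is straightforward to compute from a matrix of affinity scores. A sketch with toy sizes and random scores:<\/p>

```python
import numpy as np

def balance_loss(S, K_r, alpha=0.001):
    """Sequence-wise balance loss from affinity scores S of shape (T, N_r)."""
    T, N_r = S.shape
    # f_i: normalized fraction of tokens whose TopK includes expert i
    topk = np.argsort(S, axis=1)[:, -K_r:]
    counts = np.bincount(topk.ravel(), minlength=N_r)
    f = counts * N_r / (K_r * T)
    # P_i: average normalized routing probability for expert i
    P = (S / S.sum(axis=1, keepdims=True)).mean(axis=0)
    return alpha * np.sum(f * P)

rng = np.random.default_rng(0)
S = rng.uniform(size=(128, 16))     # 128 tokens, 16 experts (toy sizes)
loss = balance_loss(S, K_r=2)
```

<p>With roughly uniform scores the loss sits near \u03b1, and it grows as routing concentrates on a few experts\u2014the gentle pressure toward balance described above.<\/p>
<p>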
This gently encourages balance within each sequence without the strong performance penalty of traditional auxiliary losses.<\/p>\n<p>&nbsp;<\/p>\n<h3>Multi-Token Prediction<\/h3>\n<p>DeepSeek-V3 also employs a multi-token prediction objective:<\/p>\n<p>L = -\u2211_{t=1}^T \u2211_{k=1}^K \u03bb_k log P(x_{t+k} | x_{\u2264t})<\/p>\n<p>Where:\u00a0&#8211; K: Number of future tokens to predict (typically 2-4) &#8211; \u03bb_k: Weight for predicting k steps ahead (typically decreasing)<\/p>\n<p>Why Predict Multiple Tokens?<\/p>\n<ol start=\"1\">\n<li>Better Representations: Forces the model to learn representations that are useful for multiple future predictions<\/li>\n<li>Speculative Decoding: The additional prediction heads can be used for speculative decoding during inference, generating multiple tokens in parallel<\/li>\n<li>Improved Performance: Empirically shown to improve benchmark scores<\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n<h3>DeepSeek-V3 Complete Specifications<\/h3>\n<p>Model Architecture:\u00a0&#8211; Total parameters: 671B &#8211; Active parameters per token: 37B (5.5%) &#8211; Layers: 61 &#8211; Model dimension (d): 7,168 &#8211; Number of attention heads (n_h): 128 &#8211; FFN hidden dimension: 18,432 &#8211; Number of shared experts (N_s): 2 &#8211; Number of routed experts (N_r): 256 &#8211; Activated routed experts (K_r): 8 &#8211; Context length: 128K tokens &#8211; KV compression dimension (d_c): 512 &#8211; Query compression dimension (d_c\u2019): 1,536<\/p>\n<p>Training Details:\u00a0&#8211; Training tokens: 14.8 trillion &#8211; Training cost: $5.576 million (2.788M H800 GPU hours) &#8211; Training time: ~2 months &#8211; Batch size: 9,216 sequences &#8211; Learning rate: Peak 4.2 \u00d7 10\u207b\u2074, cosine decay &#8211; Precision: FP8 mixed precision<\/p>\n<p>Performance:\u00a0&#8211; MMLU: 88.5 &#8211; MMLU-Pro: 75.9 &#8211; GPQA: 59.1 &#8211; MATH-500: 90.2 &#8211; Codeforces: 78.3 percentile<\/p>\n<p>&nbsp;<\/p>\n<hr style=\"page-break-before: 
always; display: none;\" \/>\n<p>Part 4: Google Gemini &#8211; The Multimodal Frontier<\/p>\n<p>&nbsp;<\/p>\n<h3>The Gemini Philosophy: Unified Multimodal Understanding<\/h3>\n<p>While GPT and DeepSeek primarily process text, Gemini was designed from the ground up to understand multiple modalities\u2014text, images, audio, and video\u2014in a unified architecture. This isn\u2019t just about processing different types of data separately; it\u2019s about understanding relationships across modalities.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/cloudypos.com\/nbosi\/wp-content\/uploads\/sites\/59\/2025\/10\/image3.png\" alt=\"Gemini Architecture Diagram\" \/><\/p>\n<p>Gemini Architecture Diagram<\/p>\n<p>Figure 3: Google Gemini Multimodal Architecture.\u00a0The model processes multiple input modalities (text, images, audio, video) through specialized tokenizers, then combines them in a unified token sequence. Sparse MoE layers with cross-modal attention enable understanding across modalities. 
Optional extended reasoning provides step-by-step verification and refinement.<\/p>\n<p>&nbsp;<\/p>\n<h3>Sparse Mixture-of-Experts Foundation<\/h3>\n<p>Like DeepSeek, Gemini 2.5 uses a sparse MoE architecture, but optimized for multimodal processing:<\/p>\n<p>FFN_MoE(x) = \u2211_{i=1}^N w_i(x) Expert_i(x)<\/p>\n<p>Where:<\/p>\n<p>w_i(x) = softmax(Router(x))_i \u00a0 \u00a0if i \u2208\u00a0TopK(Router(x), K)<br \/>\nw_i(x) = 0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0otherwise<\/p>\n<p>Components:\u00a0&#8211; N: Total number of experts (not publicly disclosed, estimated 100-200) &#8211; K: Number of activated experts per token (estimated 10-20) &#8211; Router(x): Learned routing function &#8211; Expert_i(\u00b7): i-th expert FFN<\/p>\n<p>Routing Function:<\/p>\n<p>Router(x) = Softmax(W_r x + b_r)<\/p>\n<p>Where:\u00a0&#8211; W_r \u2208 \u211d^(N \u00d7 d): Routing weight matrix &#8211; b_r \u2208 \u211d^N: Routing bias vector<\/p>\n<p>&nbsp;<\/p>\n<h3>Multimodal Token Processing<\/h3>\n<p>The revolutionary aspect of Gemini is how it processes different modalities in a unified token space:<\/p>\n<p>h_multimodal = Transformer([TokenEmbed_text(x_text);<br \/>\nTokenEmbed_image(x_image);<br \/>\nTokenEmbed_audio(x_audio);<br \/>\nTokenEmbed_video(x_video)])<\/p>\n<p>Where [\u00b7; \u00b7] denotes concatenation in the sequence dimension.<\/p>\n<p>&nbsp;<\/p>\n<h4>Text Tokenization<\/h4>\n<p>Standard subword tokenization:<\/p>\n<p>tokens_text = BPE(text)<br \/>\nembeddings_text = E_text[tokens_text]<\/p>\n<p>Where:\u00a0&#8211; BPE: Byte-Pair Encoding algorithm &#8211; E_text \u2208 \u211d^(V \u00d7 d): Text embedding matrix &#8211; V: Vocabulary size (~256K tokens)<\/p>\n<p>&nbsp;<\/p>\n<h4>Image Tokenization<\/h4>\n<p>Images are divided into patches and embedded:<\/p>\n<p>patches = Reshape(image, patch_size)<br \/>\ntokens_image = Linear(Flatten(patches))<br \/>\nembeddings_image = tokens_image + 
PE_2D<\/p>\n<p>Where:\u00a0&#8211; patch_size: Typically 14\u00d714 or 16\u00d716 pixels &#8211; Linear: Learned linear projection to model dimension &#8211; PE_2D: 2D positional encoding (preserves spatial relationships)<\/p>\n<p>For a 224\u00d7224 image with 14\u00d714 patches:\u00a0&#8211; Number of patches: (224\/14)\u00b2 = 256 patches &#8211; Each patch becomes one token &#8211; Total: 256 image tokens<\/p>\n<p>&nbsp;<\/p>\n<h4>Audio Tokenization<\/h4>\n<p>Audio is processed in overlapping windows:<\/p>\n<p>spectrograms = STFT(audio)<br \/>\ntokens_audio = CNN(spectrograms)<br \/>\nembeddings_audio = tokens_audio + PE_temporal<\/p>\n<p>Where:\u00a0&#8211; STFT: Short-Time Fourier Transform (converts audio to frequency domain) &#8211; CNN: Convolutional neural network for feature extraction &#8211; PE_temporal: Temporal positional encoding<\/p>\n<p>Typical Configuration:\u00a0&#8211; Window size: 25ms &#8211; Hop length: 10ms &#8211; For 1 second of audio: ~100 audio tokens<\/p>\n<p>&nbsp;<\/p>\n<h4>Video Tokenization<\/h4>\n<p>Video combines spatial and temporal processing:<\/p>\n<p>frames = Sample(video, frame_rate)<br \/>\ntokens_video = [TokenEmbed_image(frame) for frame in frames]<br \/>\nembeddings_video = tokens_video + PE_spatiotemporal<\/p>\n<p>Where:\u00a0&#8211; Sample: Samples frames at specified rate (typically 1-4 fps) &#8211; PE_spatiotemporal: 3D positional encoding (spatial + temporal)<\/p>\n<p>For a 1-minute video at 1 fps:\u00a0&#8211; 60 frames &#8211; 256 tokens per frame &#8211; Total: 15,360 video tokens<\/p>\n<p>&nbsp;<\/p>\n<h3>Cross-Modal Attention<\/h3>\n<p>The key to multimodal understanding is allowing different modalities to attend to each other:<\/p>\n<p>Attention_multimodal(Q_modality_A, K_modality_B, V_modality_B) =<br \/>\nsoftmax(Q_modality_A K_modality_B^T \/ \u221ad_k) V_modality_B<\/p>\n<p>Example:<\/p>\n<p>When processing the query \u201cWhat color is the car in the image?\u201d: &#8211; Text tokens attend to image tokens 
to locate the car &#8211; Image tokens attend to text tokens to understand what\u2019s being asked &#8211; The model generates a response by integrating both modalities<\/p>\n<p>&nbsp;<\/p>\n<h3>Long Context Processing<\/h3>\n<p>Gemini 2.5 Pro supports context lengths exceeding 1 million tokens through several optimizations:<\/p>\n<p>&nbsp;<\/p>\n<h4>Efficient Attention Mechanisms<\/h4>\n<p>Sparse Attention Patterns:<\/p>\n<p>Instead of full O(n\u00b2) attention, use structured sparsity:<\/p>\n<p>Attention_sparse(Q, K, V) = Attention(Q, K_local \u222a\u00a0K_global, V_local \u222a\u00a0V_global)<\/p>\n<p>Where:\u00a0&#8211; K_local, V_local: Keys and values from nearby positions (e.g., within 512 tokens) &#8211; K_global, V_global: Keys and values from selected global positions (e.g., every 64th token)<\/p>\n<p>This reduces complexity from O(n\u00b2) to O(n \u00d7 \u221an) or better.<\/p>\n<p>&nbsp;<\/p>\n<h4>Memory-Efficient KV Cache<\/h4>\n<p>Quantization:<\/p>\n<p>K_quantized = Quantize(K, bits=8)<br \/>\nV_quantized = Quantize(V, bits=8)<\/p>\n<p>Storing keys and values in 8-bit precision instead of 16-bit reduces memory by 50% with minimal quality loss.<\/p>\n<p>Selective Caching:<\/p>\n<p>importance_i = \u2211_{t=recent} attention_weight_{t,i}<br \/>\ncache_positions = TopK(importance, cache_size)<\/p>\n<p>Only cache the most important positions based on recent attention patterns.<\/p>\n<p>&nbsp;<\/p>\n<h3>Extended Reasoning: The Thinking Process<\/h3>\n<p>Gemini 2.5 incorporates \u201cthinking\u201d capabilities through extended reasoning:<\/p>\n<p>Output = Chain(Input) \u2192 Verify \u2192 Refine \u2192 Answer<\/p>\n<p>Chain-of-Thought Generation:<\/p>\n<p>P(reasoning_step_i | Input, reasoning_step_{&lt;i})<\/p>\n<p>The model generates intermediate reasoning steps before the final answer.<\/p>\n<p>Verification:<\/p>\n<p>confidence = Evaluate(reasoning_chain)<br \/>\nif confidence &lt; threshold:<br \/>\nreasoning_chain = Regenerate(Input, 
reasoning_chain)<\/p>\n<p>The model evaluates its own reasoning and regenerates if confidence is low.<\/p>\n<p>Mathematical Formulation:<\/p>\n<p>L_reasoning = -\u2211_{i=1}^M log P(r_i | x, r_{&lt;i}) &#8211; log P(y | x, r_{1:M})<\/p>\n<p>Where:\u00a0&#8211; r_i: i-th reasoning step &#8211; M: Number of reasoning steps &#8211; y: Final answer &#8211; x: Input<\/p>\n<p>&nbsp;<\/p>\n<h3>Distillation for Smaller Models<\/h3>\n<p>Smaller Gemini models (Flash, Flash-Lite) learn from larger models through distillation, with the teacher\u2019s distribution as the target:<\/p>\n<p>L_distillation = KL(P_teacher(y|x) || P_student(y|x))<\/p>\n<p>Where:<\/p>\n<p>KL(P || Q) = \u2211_y P(y) log(P(y) \/ Q(y))<\/p>\n<p>K-Sparse Approximation:<\/p>\n<p>To reduce storage, only the top-K logits from the teacher are stored:<\/p>\n<p>P_teacher_sparse(y|x) = TopK(P_teacher(y|x), K)<br \/>\nP_teacher_sparse = Normalize(P_teacher_sparse)<\/p>\n<p>Typical K: 256-1024 (out of ~256K vocabulary)<\/p>\n<p>&nbsp;<\/p>\n<h3>Training Objectives<\/h3>\n<p>Gemini uses multiple training objectives:<\/p>\n<p>1. Next-Token Prediction (Primary):<\/p>\n<p>L_next = -\u2211_{t=1}^T log P(x_t | x_{&lt;t}, context_multimodal)<\/p>\n<p>2. Multimodal Alignment:<\/p>\n<p>L_align = -log P(text | image) &#8211; log P(image | text)<\/p>\n<p>Ensures text and image representations are aligned.<\/p>\n<p>3. Instruction Following:<\/p>\n<p>L_instruct = -log P(response | instruction, context)<\/p>\n<p>4. 
Preference Learning:<\/p>\n<p>L_preference = -log \u03c3(r(x, y_preferred) &#8211; r(x, y_rejected))<\/p>\n<p>Where:\u00a0&#8211; \u03c3: Sigmoid function &#8211; r(x, y): Reward model score<\/p>\n<p>&nbsp;<\/p>\n<h3>Gemini 2.5 Pro Specifications<\/h3>\n<p>Model Architecture:\u00a0&#8211; Architecture: Sparse MoE Transformer &#8211; Total parameters: Not disclosed (estimated 500B-1T) &#8211; Active parameters: Estimated 100-200B per token &#8211; Context length: 1M+ tokens (2M for some versions) &#8211; Supported modalities: Text, Image, Audio, Video &#8211; Output modalities: Text, Audio (experimental) &#8211; Output length: Up to 64K tokens<\/p>\n<p>Training Details:\u00a0&#8211; Training data cutoff: January 2025 &#8211; Training infrastructure: TPUv5p pods &#8211; Precision: Mixed precision (BF16\/FP8) &#8211; Training scale: Multiple datacenters, thousands of accelerators<\/p>\n<p>Performance Highlights:\u00a0&#8211; MMLU: ~90 (estimated) &#8211; Codeforces: High percentile &#8211; Multimodal benchmarks: State-of-the-art &#8211; Video understanding: Up to 3 hours of video &#8211; Long context: Excellent performance up to 1M tokens<\/p>\n<p>&nbsp;<\/p>\n<p>Part 5: Anthropic Claude &#8211; Constitutional AI and Safety-First Design<\/p>\n<p>&nbsp;<\/p>\n<h3>The Claude Philosophy: Helpfulness Through Constitutional Principles<\/h3>\n<p>Anthropic\u2019s Claude represents a different approach to LLM development, prioritizing safety and alignment with human values from the ground up. 
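<\/p>
<p>The pairwise preference loss introduced above for Gemini, and reused in the RLHF and RLAIF stages described in this part, reduces to a one-line computation. A minimal NumPy sketch with purely illustrative reward scores (the values and variable names are made up for demonstration, not drawn from any real reward model):<\/p>

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid, the sigma in the preference loss.
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative reward-model scores for a preferred and a rejected response.
r_preferred = 2.0
r_rejected = 0.5

# Pairwise preference loss: -log sigma(r(x, y_preferred) - r(x, y_rejected)).
loss = -np.log(sigmoid(r_preferred - r_rejected))
print(f"loss = {loss:.4f}")  # loss ~ 0.2014 for a reward gap of 1.5
```

<p>As the reward gap widens, \u03c3 approaches 1 and the loss approaches 0; a negative gap (the rejected response scored above the preferred one) is penalized sharply. The same objective reappears below in Claude\u2019s reward-model and RLAIF training.<\/p>
<p>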
While the exact architectural details remain proprietary, Claude\u2019s distinguishing feature is its training methodology: Constitutional AI (CAI), which uses AI feedback guided by explicit principles rather than relying solely on human feedback.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/cloudypos.com\/nbosi\/wp-content\/uploads\/sites\/59\/2025\/10\/image4.png\" \/><\/p>\n<p>Claude Architecture Diagram<\/p>\n<p>Figure 4: Anthropic Claude Architecture.\u00a0Built on a Transformer foundation, Claude employs Constitutional AI (CAI) for alignment, combining supervised learning, AI feedback (RLAIF), and reinforcement learning from human feedback (RLHF). The model processes multimodal inputs (text and images) and uses a constitution of ethical principles to guide outputs toward helpful, harmless, and honest responses.<\/p>\n<p>&nbsp;<\/p>\n<h3>Base Architecture: Transformer with Multimodal Capabilities<\/h3>\n<p>Claude 3 models (Opus, Sonnet, and Haiku) are built on the Transformer architecture with multimodal capabilities, similar to Gemini. The models can process both text and visual inputs.<\/p>\n<p>Model Family Specifications:<\/p>\n<table>\n<tbody>\n<tr>\n<td>Model<\/td>\n<td>Capability Level<\/td>\n<td>Speed<\/td>\n<td>Context Length<\/td>\n<td>Use Case<\/td>\n<\/tr>\n<tr>\n<td>Claude 3 Opus<\/td>\n<td>Highest intelligence<\/td>\n<td>Moderate<\/td>\n<td>200K tokens<\/td>\n<td>Complex reasoning, coding, research<\/td>\n<\/tr>\n<tr>\n<td>Claude 3 Sonnet<\/td>\n<td>Balanced<\/td>\n<td>Fast<\/td>\n<td>200K tokens<\/td>\n<td>General-purpose, balanced performance<\/td>\n<\/tr>\n<tr>\n<td>Claude 3 Haiku<\/td>\n<td>Fast and efficient<\/td>\n<td>Fastest<\/td>\n<td>200K tokens<\/td>\n<td>Quick tasks, high-volume processing<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Context Window:<\/p>\n<p>Claude 3 models support a 200,000 token context window, which is approximately 150,000 words or about 500 pages of text. 
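<\/p>
<p>The word and page figures above follow from rough conversion ratios; a quick sketch makes the assumed ratios explicit (roughly 0.75 words per token and 300 words per page are assumptions for illustration, not Anthropic figures):<\/p>

```python
# Back-of-envelope context-window arithmetic.
context_tokens = 200_000
words = int(context_tokens * 0.75)  # ~0.75 words per token (assumed ratio)
pages = words // 300                # ~300 words per page (assumed ratio)
print(words, pages)                 # 150000 500
```

<p>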
This enables:<\/p>\n<p>Context_capacity = 200K tokens \u2248 150K words<\/p>\n<p>&nbsp;<\/p>\n<h3>Constitutional AI: Training with Principles<\/h3>\n<p>The core innovation in Claude is Constitutional AI, which modifies the standard RLHF (Reinforcement Learning from Human Feedback) pipeline.<\/p>\n<p>&nbsp;<\/p>\n<h4>Standard RLHF Process<\/h4>\n<p>Traditional RLHF involves three stages:<\/p>\n<p>Stage 1: Supervised Fine-Tuning (SFT)<\/p>\n<p>L_SFT = -\u2211_{(x,y)\u2208D_SFT} log P_\u03b8(y | x)<\/p>\n<p>Where:\u00a0&#8211; D_SFT: Dataset of (prompt, response) pairs curated by humans &#8211; P_\u03b8(y | x): Probability of generating response y given prompt x &#8211; \u03b8: Model parameters<\/p>\n<p>Stage 2: Reward Model Training<\/p>\n<p>L_RM = -E_{(x,y_w,y_l)\u2208D_comp} [log \u03c3(r_\u03c6(x, y_w) &#8211; r_\u03c6(x, y_l))]<\/p>\n<p>Where:\u00a0&#8211; D_comp: Dataset of comparison pairs (prompt, winning response, losing response) &#8211; r_\u03c6: Reward model with parameters \u03c6 &#8211; \u03c3: Sigmoid function &#8211; y_w, y_l: Winning and losing responses as judged by humans<\/p>\n<p>Stage 3: Reinforcement Learning<\/p>\n<p>L_RL = E_{x\u223cD,y\u223c\u03c0_\u03b8} [r_\u03c6(x, y)] &#8211; \u03b2 KL(\u03c0_\u03b8 || \u03c0_ref)<\/p>\n<p>Where:\u00a0&#8211; \u03c0_\u03b8: Policy (the LLM being trained) &#8211; \u03c0_ref: Reference policy (initial supervised model) &#8211; \u03b2: KL divergence coefficient (prevents the model from deviating too far) &#8211; KL: Kullback-Leibler divergence<\/p>\n<p>&nbsp;<\/p>\n<h4>Constitutional AI: RLAIF (RL from AI Feedback)<\/h4>\n<p>Constitutional AI modifies this process by using AI-generated feedback based on a constitution:<\/p>\n<p>The Constitution:<\/p>\n<p>A set of principles that guide the model\u2019s behavior. 
Examples from Claude\u2019s constitution include:<\/p>\n<ol start=\"1\">\n<li>\u201cPlease choose the response that is the most helpful, honest, and harmless.\u201d<\/li>\n<li>\u201cPlease choose the response that is least intended to build a relationship with the user.\u201d<\/li>\n<li>\u201cPlease choose the response that has the least objectionable, offensive, unlawful, deceptive, inaccurate, or harmful content.\u201d<\/li>\n<li>Principles derived from the UN Declaration of Human Rights<\/li>\n<li>Principles for accessibility and disability rights (from Collective Constitutional AI)<\/li>\n<\/ol>\n<p>Stage 1: Supervised Learning (SL-CAI)<\/p>\n<p>For each harmful prompt x_h:<br \/>\n1. Generate initial response: y_0 = LLM(x_h)<br \/>\n2. Critique using constitution: c = LLM(&#8220;Critique: &#8221; + x_h + y_0 + constitution_i)<br \/>\n3. Revise response: y_revised = LLM(&#8220;Revise: &#8221; + x_h + y_0 + c + constitution_i)<br \/>\n4. Fine-tune on revised responses<\/p>\n<p>This creates a dataset of (harmful_prompt, safe_response) pairs without human labeling.<\/p>\n<p>Stage 2: AI Feedback for Preference Modeling<\/p>\n<p>Instead of human comparisons, the model generates its own preference data:<\/p>\n<p>For each prompt x:<br \/>\n1. Generate multiple responses: {y_1, y_2, &#8230;, y_n}<br \/>\n2. For each pair (y_i, y_j):<br \/>\nAsk LLM: &#8220;Which response better follows principle P from the constitution?&#8221;<br \/>\n3. Create preference dataset D_AI from AI judgments<br \/>\n4. 
Train reward model on D_AI<\/p>\n<p>Mathematical Formulation:<\/p>\n<p>L_RLAIF = -E_{(x,y_better,y_worse)\u2208D_AI} [log \u03c3(r_\u03c6(x, y_better) &#8211; r_\u03c6(x, y_worse))]<\/p>\n<p>Where:\u00a0&#8211; D_AI: Dataset of AI-generated preference comparisons &#8211; y_better, y_worse: Responses ranked by AI according to constitutional principles<\/p>\n<p>Stage 3: Reinforcement Learning<\/p>\n<p>Same as standard RLHF, but using the reward model trained on AI feedback:<\/p>\n<p>L_RL-CAI = E_{x\u223cD,y\u223c\u03c0_\u03b8} [r_\u03c6(x, y)] &#8211; \u03b2 KL(\u03c0_\u03b8 || \u03c0_ref)<\/p>\n<p>&nbsp;<\/p>\n<h3>Advantages of Constitutional AI<\/h3>\n<p>1. Scalability<\/p>\n<p>AI feedback is cheaper and faster than human feedback:<\/p>\n<p>Cost_human &gt;&gt; Cost_AI<br \/>\nTime_human &gt;&gt; Time_AI<\/p>\n<p>2. Consistency<\/p>\n<p>AI feedback based on explicit principles is more consistent than human judgments:<\/p>\n<p>Variance_AI &lt; Variance_human (for principle-based judgments)<\/p>\n<p>3. Transparency<\/p>\n<p>The constitution makes the training objectives explicit and auditable.<\/p>\n<p>4. Harmlessness Without Helpfulness Trade-off<\/p>\n<p>Traditional RLHF often creates a trade-off between helpfulness and harmlessness. 
Constitutional AI reduces this:<\/p>\n<p>Maximize: Helpfulness + Harmlessness + Honesty<br \/>\nSubject to: Constitutional principles<\/p>\n<p>&nbsp;<\/p>\n<h3>Multimodal Processing<\/h3>\n<p>Claude 3 models process images alongside text:<\/p>\n<p>Image Input Processing:<\/p>\n<p>image_tokens = VisionEncoder(image)<br \/>\ncombined_input = Concat([text_tokens, image_tokens])<br \/>\noutput = Transformer(combined_input)<\/p>\n<p>Where:\u00a0&#8211; VisionEncoder: Converts images to token representations &#8211; Transformer: Processes the combined multimodal sequence<\/p>\n<p>Supported Image Formats:\u00a0&#8211; JPEG, PNG, GIF, WebP &#8211; Up to 10MB per image &#8211; Maximum resolution: 8000\u00d78000 pixels<\/p>\n<p>Image Understanding Capabilities:\u00a0&#8211; Chart and graph interpretation &#8211; Document analysis (tables, diagrams) &#8211; Photo analysis and description &#8211; Visual reasoning<\/p>\n<p>&nbsp;<\/p>\n<h3>Training Infrastructure<\/h3>\n<p>Hardware:\u00a0&#8211; Amazon Web Services (AWS) &#8211; Google Cloud Platform (GCP)<\/p>\n<p>Frameworks:\u00a0&#8211; PyTorch &#8211; JAX &#8211; Triton (for optimized kernels)<\/p>\n<p>Training Data:\u00a0&#8211; Proprietary mix of public internet data (as of August 2023) &#8211; Third-party licensed data &#8211; Data from labeling services &#8211; Internally generated data &#8211; No user data\u00a0from Claude conversations<\/p>\n<p>&nbsp;<\/p>\n<h3>Performance Characteristics<\/h3>\n<p>Claude 3 Opus Benchmarks:<\/p>\n<table>\n<tbody>\n<tr>\n<td>Benchmark<\/td>\n<td>Score<\/td>\n<td>Comparison<\/td>\n<\/tr>\n<tr>\n<td>MMLU<\/td>\n<td>86.8%<\/td>\n<td>Comparable to GPT-4<\/td>\n<\/tr>\n<tr>\n<td>GPQA<\/td>\n<td>50.4%<\/td>\n<td>State-of-the-art<\/td>\n<\/tr>\n<tr>\n<td>MATH<\/td>\n<td>60.1%<\/td>\n<td>Strong mathematical reasoning<\/td>\n<\/tr>\n<tr>\n<td>HumanEval<\/td>\n<td>84.9%<\/td>\n<td>Excellent coding capability<\/td>\n<\/tr>\n<tr>\n<td>MMMU<\/td>\n<td>59.4%<\/td>\n<td>Multimodal 
understanding<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Context Length Performance:<\/p>\n<p>Claude maintains strong performance across its full 200K token context:<\/p>\n<p>Recall@200K &gt; 95% (on needle-in-haystack tests)<\/p>\n<p>&nbsp;<\/p>\n<h3>Safety and Alignment Features<\/h3>\n<p>1. Reduced False Refusals<\/p>\n<p>Claude 3 models are better at distinguishing between harmful and harmless requests:<\/p>\n<p>False_refusal_rate_Claude3 &lt; False_refusal_rate_Claude2<\/p>\n<p>2. Improved Honesty<\/p>\n<p>The models are more likely to express uncertainty when appropriate:<\/p>\n<p>P(admit_uncertainty | uncertain) &gt; P(confident_wrong_answer | uncertain)<\/p>\n<p>3. Steerable Personality<\/p>\n<p>Users can guide Claude\u2019s tone and style while maintaining safety:<\/p>\n<p>Output = f(Input, Personality_guidance, Constitutional_constraints)<\/p>\n<p>&nbsp;<\/p>\n<h3>Claude 3 Model Family Comparison<\/h3>\n<p>Computational Efficiency:<\/p>\n<p>Quality: Opus &gt; Sonnet &gt; Haiku<br \/>\nSpeed: Haiku &gt; Sonnet &gt; Opus<br \/>\nCost: Haiku &lt; Sonnet &lt; Opus<\/p>\n<p>Use Case Optimization:<\/p>\n<ul>\n<li>Opus: When quality is paramount (research, complex analysis, difficult coding)<\/li>\n<li>Sonnet: When balancing quality and speed (general-purpose applications)<\/li>\n<li>Haiku: When speed and cost matter most (high-volume processing, simple tasks)<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<h3>Key Innovations Summary<\/h3>\n<p>1. Constitutional AI (CAI)\u00a0&#8211; Uses AI feedback guided by explicit principles &#8211; Reduces reliance on expensive human labeling &#8211; Improves consistency and transparency &#8211; Mathematical formulation combines supervised learning, RLAIF, and RLHF<\/p>\n<p>2. Harmlessness Without Helpfulness Loss\u00a0&#8211; Maintains high capability while being safe &#8211; Reduces false refusals compared to earlier models &#8211; Explicit optimization for helpful + harmless + honest<\/p>\n<p>3. 
Long Context Understanding\u00a0&#8211; 200K token context window &#8211; Strong performance across full context length &#8211; Enables processing of entire books, codebases, or long documents<\/p>\n<p>4. Multimodal Capabilities\u00a0&#8211; Native image understanding &#8211; Vision-language reasoning &#8211; Document and chart analysis<\/p>\n<p>5. Model Family Design\u00a0&#8211; Three models (Opus, Sonnet, Haiku) for different use cases &#8211; Optimized for different points on the quality-speed-cost curve &#8211; Allows users to choose the right tool for their needs<\/p>\n<p>&nbsp;<\/p>\n<p>Part 6: Comparative Analysis<\/p>\n<p>&nbsp;<\/p>\n<h3>Mathematical Complexity Comparison<\/h3>\n<table>\n<tbody>\n<tr>\n<td>Operation<\/td>\n<td>GPT (Dense)<\/td>\n<td>DeepSeek (MoE + MLA)<\/td>\n<td>Gemini (MoE)<\/td>\n<td>Claude (Dense)<\/td>\n<\/tr>\n<tr>\n<td>Attention per Layer<\/td>\n<td>O(n\u00b2 \u00d7 d)<\/td>\n<td>O(n\u00b2 \u00d7 d_c), d_c \u226a d<\/td>\n<td>O(n\u00b2 \u00d7 d) sparse<\/td>\n<td>O(n\u00b2 \u00d7 d)<\/td>\n<\/tr>\n<tr>\n<td>FFN per Layer<\/td>\n<td>O(n \u00d7 d\u00b2)<\/td>\n<td>O(n \u00d7 d\u00b2 \u00d7 (N_s + K_r)\/N_total)<\/td>\n<td>O(n \u00d7 d\u00b2 \u00d7 K\/N)<\/td>\n<td>O(n \u00d7 d\u00b2)<\/td>\n<\/tr>\n<tr>\n<td>Parameters per Token<\/td>\n<td>100%<\/td>\n<td>5.5% (37B\/671B)<\/td>\n<td>~10-20%<\/td>\n<td>~100% (dense)<\/td>\n<\/tr>\n<tr>\n<td>KV Cache per Token<\/td>\n<td>O(n_h \u00d7 d_h) = O(d)<\/td>\n<td>O(d_c + d_h^R) \u2248 0.02 \u00d7 O(d)<\/td>\n<td>O(d) quantized<\/td>\n<td>O(d)<\/td>\n<\/tr>\n<tr>\n<td>Context Length<\/td>\n<td>128K<\/td>\n<td>128K<\/td>\n<td>1M+<\/td>\n<td>200K<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<h3>Training Efficiency Comparison<\/h3>\n<table>\n<tbody>\n<tr>\n<td>Metric<\/td>\n<td>GPT-3<\/td>\n<td>DeepSeek-V3<\/td>\n<td>Gemini 2.5 Pro<\/td>\n<td>Claude 3 Opus<\/td>\n<\/tr>\n<tr>\n<td>Total Parameters<\/td>\n<td>175B<\/td>\n<td>671B<\/td>\n<td>~500B-1T (est.)<\/td>\n<td>Not 
disclosed<\/td>\n<\/tr>\n<tr>\n<td>Active Parameters<\/td>\n<td>175B<\/td>\n<td>37B<\/td>\n<td>~100-200B (est.)<\/td>\n<td>Not disclosed (dense)<\/td>\n<\/tr>\n<tr>\n<td>Training Tokens<\/td>\n<td>~300B<\/td>\n<td>14.8T<\/td>\n<td>Not disclosed<\/td>\n<td>Not disclosed<\/td>\n<\/tr>\n<tr>\n<td>Training Cost<\/td>\n<td>Not disclosed<\/td>\n<td>$5.576M<\/td>\n<td>Not disclosed<\/td>\n<td>Not disclosed<\/td>\n<\/tr>\n<tr>\n<td>Training Method<\/td>\n<td>Standard RLHF<\/td>\n<td>MoE + RLAIF<\/td>\n<td>Distillation + RLHF<\/td>\n<td>Constitutional AI (RLAIF)<\/td>\n<\/tr>\n<tr>\n<td>FLOPs per Token (Training)<\/td>\n<td>High<\/td>\n<td>Low (MoE)<\/td>\n<td>Medium<\/td>\n<td>Medium-High (dense)<\/td>\n<\/tr>\n<tr>\n<td>FLOPs per Token (Inference)<\/td>\n<td>~2 \u00d7 175B<\/td>\n<td>~2 \u00d7 37B<\/td>\n<td>~2 \u00d7 100-200B (est.)<\/td>\n<td>Not disclosed<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<h3>Architectural Philosophy Comparison<\/h3>\n<p>OpenAI GPT:\u00a0&#8211; Philosophy: Simplicity and scale &#8211; Key Innovation: Demonstrating that scaled Transformers develop emergent capabilities &#8211; Trade-off: Maximum capability at maximum cost &#8211; Best For: Applications where cost is secondary to reliability and performance<\/p>\n<p>DeepSeek:\u00a0&#8211; Philosophy: Efficiency without compromise &#8211; Key Innovations: &#8211; MLA for 98% KV cache reduction &#8211; Auxiliary-loss-free load balancing &#8211; Multi-token prediction &#8211; Trade-off: Architectural complexity for computational efficiency &#8211; Best For: Cost-sensitive deployments, organizations with limited resources<\/p>\n<p>Gemini:\u00a0&#8211; Philosophy: Unified multimodal intelligence &#8211; Key Innovations: &#8211; Native multimodal processing &#8211; Extended reasoning capabilities &#8211; Distillation from thinking models &#8211; Trade-off: Complexity in multimodal integration for broader capability &#8211; Best For: Applications requiring cross-modal understanding, 
long-context processing<\/p>\n<p>Claude:\u00a0&#8211; Philosophy: Safety and alignment through constitutional principles &#8211; Key Innovations: &#8211; Constitutional AI (RLAIF) &#8211; Harmlessness without helpfulness loss &#8211; Reduced false refusals &#8211; Trade-off: Dense architecture (higher inference cost) for safety guarantees &#8211; Best For: Applications where safety, alignment, and ethical behavior are paramount<\/p>\n<p>&nbsp;<\/p>\n<h3>Performance Characteristics<\/h3>\n<p>Latency:\u00a0&#8211; GPT: Moderate (all parameters active) &#8211; DeepSeek: Low (fewer active parameters, efficient attention) &#8211; Gemini: Variable (depends on modality and context length) &#8211; Claude: Moderate (dense architecture; details not disclosed)<\/p>\n<p>Memory:\u00a0&#8211; GPT: High (full KV cache, all parameters) &#8211; DeepSeek: Low (compressed KV cache, sparse activation) &#8211; Gemini: High for long contexts (but optimized) &#8211; Claude: High (dense model, full KV cache)<\/p>\n<p>Throughput:\u00a0&#8211; GPT: Moderate (limited by memory bandwidth) &#8211; DeepSeek: High (sparse activation enables larger batch sizes) &#8211; Gemini: Moderate to high (depends on modality mix) &#8211; Claude: Moderate (dense inference)<\/p>\n<p>&nbsp;<\/p>\n<h3>Use Case Suitability<\/h3>\n<p>GPT is Ideal For:\u00a0&#8211; General-purpose text generation &#8211; Applications requiring proven reliability &#8211; Organizations with substantial computational resources &#8211; Use cases where consistency is critical<\/p>\n<p>DeepSeek is Ideal For:\u00a0&#8211; Cost-sensitive deployments &#8211; High-throughput applications &#8211; Organizations with limited GPU resources &#8211; Applications requiring long context at reasonable cost<\/p>\n<p>Gemini is Ideal For:\u00a0&#8211; Multimodal applications (image + text, video + text) &#8211; Applications requiring very long context (&gt;100K tokens) &#8211; Complex reasoning tasks &#8211; Agentic systems combining multiple capabilities<\/p>\n<p>Claude is Ideal For:\u00a0&#8211; Applications where safety, alignment, and ethical behavior are paramount &#8211; Long-document analysis within its 200K-token context &#8211; Document, chart, and image understanding<\/p>\n<p>&nbsp;<\/p>\n<p>Part 7: The Future of LLM Architectures<\/p>\n<p>&nbsp;<\/p>\n<h3>Emerging Trends<\/h3>\n<p>1. 
Convergence of Approaches<\/p>\n<p>We\u2019re seeing convergence toward hybrid architectures: &#8211; GPT models exploring sparse activation &#8211; DeepSeek expanding into multimodal &#8211; All models adopting MoE for efficiency<\/p>\n<p>2. Efficiency as a First-Class Concern<\/p>\n<p>The success of DeepSeek demonstrates that efficiency innovations can match or exceed brute-force scaling:<\/p>\n<p>Performance \u221d\u00a0(Scale \u00d7 Efficiency \u00d7 Data Quality)<\/p>\n<p>Not just:<\/p>\n<p>Performance \u221d\u00a0Scale<\/p>\n<p>3. Multimodal as Default<\/p>\n<p>Future models will likely be multimodal by default, as the world\u2019s information exists in multiple modalities.<\/p>\n<p>4. Extended Reasoning<\/p>\n<p>The \u201cthinking\u201d capabilities in Gemini 2.5 and similar models represent a shift toward more deliberate, verifiable reasoning:<\/p>\n<p>Answer = f(Input) \u00a0\u2192 \u00a0Answer = Reasoning(Input) \u2192 Verify \u2192 Answer<\/p>\n<p>&nbsp;<\/p>\n<h3>Open Research Questions<\/h3>\n<p>1. Optimal Sparsity Patterns<\/p>\n<p>What is the optimal balance between: &#8211; Number of experts &#8211; Expert size &#8211; Activation rate &#8211; Routing mechanism<\/p>\n<p>2. Scaling Laws for MoE<\/p>\n<p>Traditional scaling laws take an additive form:<\/p>\n<p>Loss \u2248 E + A \u00d7 (Parameters)^(-\u03b1) + B \u00d7 (Data)^(-\u03b2)<\/p>\n<p>where E is the irreducible loss and A, B are fitted constants. How do these change for MoE models where active parameters \u2260 total parameters?<\/p>\n<p>3. Cross-Modal Understanding<\/p>\n<p>How can we better measure and improve cross-modal reasoning? Current benchmarks focus on single-modality performance.<\/p>\n<p>4. Efficient Long Context<\/p>\n<p>Can we achieve O(n) or O(n log n) attention without sacrificing quality?<\/p>\n<p>Candidates: &#8211; Linear attention mechanisms &#8211; State space models &#8211; Hierarchical attention<\/p>\n<p>&nbsp;<\/p>\n<h3>Mathematical Frontiers<\/h3>\n<p>1. 
Better Attention Mechanisms<\/p>\n<p>Current research explores alternatives to softmax attention:<\/p>\n<p>Attention_linear(Q, K, V) = \u03c6(Q) (\u03c6(K)^T V)<\/p>\n<p>Where \u03c6 is a feature map that allows O(n) complexity.<\/p>\n<p>2. Adaptive Computation<\/p>\n<p>Allow models to use more computation for harder problems:<\/p>\n<p>Computation(x) = f(Difficulty(x))<\/p>\n<p>3. Continuous Learning<\/p>\n<p>Current models are static after training. Future models might continuously update:<\/p>\n<p>\u03b8_{t+1} = \u03b8_t &#8211; \u03b7 \u2207L(x_t; \u03b8_t)<\/p>\n<p>While maintaining stability and preventing catastrophic forgetting.<\/p>\n<p>&nbsp;<\/p>\n<p>Conclusion: The Elegant Mathematics of Machine Intelligence<\/p>\n<p>We\u2019ve journeyed through the mathematical foundations of four remarkable AI architectures. What emerges is a picture of elegant simplicity giving rise to extraordinary complexity.<\/p>\n<p>&nbsp;<\/p>\n<h3>The Core Insights<\/h3>\n<p>1. Attention is Universal<\/p>\n<p>The attention mechanism\u2014a simple weighted sum based on compatibility scores\u2014turns out to be sufficient for modeling complex relationships in language, vision, and beyond:<\/p>\n<p>Attention(Q, K, V) = softmax(QK^T \/ \u221ad_k) V<\/p>\n<p>This formula, simple enough to fit on a single line, powers systems that can write poetry, prove theorems, and understand images.<\/p>\n<p>2. Sparsity is Powerful<\/p>\n<p>DeepSeek demonstrated that we don\u2019t need to activate all parameters for every token. Selective activation through MoE:<\/p>\n<p>Output = \u2211_{i \u2208\u00a0TopK} w_i Expert_i(Input)<\/p>\n<p>This achieves comparable performance to dense models while using a fraction of the computation.<\/p>\n<p>3. 
Compression Preserves Information<\/p>\n<p>MLA showed that we can compress keys and values by 98%:<\/p>\n<p>c_t^KV = W^DKV h_t \u00a0(compression)<br \/>\nk_t^C = W^UK c_t^KV \u00a0(decompression)<\/p>\n<p>And still maintain performance, because the essential information is preserved in a low-dimensional subspace.<\/p>\n<p>4. Multimodality is Natural<\/p>\n<p>Gemini demonstrated that different modalities can be processed in a unified token space:<\/p>\n<p>h = Transformer([tokens_text; tokens_image; tokens_audio; tokens_video])<\/p>\n<p>The same attention mechanism that relates words to words can relate images to text, audio to video, and any modality to any other.<\/p>\n<p>&nbsp;<\/p>\n<h3>The Human Element<\/h3>\n<p>Despite the sophisticated mathematics, these systems ultimately serve human needs. They help us: &#8211; Communicate across language barriers &#8211; Access and synthesize information &#8211; Solve complex problems &#8211; Create new forms of art and expression &#8211; Understand the world in new ways<\/p>\n<p>&nbsp;<\/p>\n<h3>A Note of Wonder<\/h3>\n<p>There\u2019s something profound about the fact that intelligence\u2014whether natural or artificial\u2014can be approximated by mathematical functions. The same principles that describe the motion of planets and the behavior of particles also describe the processing of information and the generation of meaning.<\/p>\n<p>We\u2019ve seen that: &#8211; A simple attention formula captures the essence of focus and relevance &#8211; Sparse activation mirrors how human experts specialize &#8211; Compression reveals the low-dimensional structure of information &#8211; Unified processing across modalities reflects how humans integrate sensory information<\/p>\n<p>&nbsp;<\/p>\n<h3>Looking Forward<\/h3>\n<p>The mathematics we\u2019ve explored represents our current understanding, but the field continues to evolve rapidly. 
Future architectures will likely: &#8211; Be more efficient (doing more with less) &#8211; Be more capable (understanding deeper relationships) &#8211; Be more accessible (available to more people and organizations) &#8211; Be more aligned (better serving human values and needs)<\/p>\n<p>The journey from simple formulas to intelligent behavior is a testament to both the power of mathematics and the ingenuity of human creativity. As we continue to refine these architectures, we move closer to systems that can truly augment human intelligence and help us tackle the great challenges of our time.<\/p>\n<p>&nbsp;<\/p>\n<p>Acknowledgments<\/p>\n<p>This document was prepared to illuminate the mathematical foundations of modern AI for a broad audience. We are grateful to the research teams at OpenAI, DeepSeek, Google DeepMind, and Anthropic for their groundbreaking work and for publishing detailed technical reports that make this kind of analysis possible.<\/p>\n<p>Special thanks to the authors of \u201cAttention Is All You Need\u201d (Vaswani et al., 2017), whose elegant architecture laid the foundation for the current generation of AI systems.<\/p>\n<p>&nbsp;<\/p>\n<p>References<\/p>\n<ol start=\"1\">\n<li>Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, \u0141., &amp; Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998-6008. https:\/\/arxiv.org\/abs\/1706.03762<\/li>\n<li>DeepSeek-AI. (2024). DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437. https:\/\/arxiv.org\/abs\/2412.19437<\/li>\n<li>Dai, D., Deng, C., Zhao, C., Xu, R. X., Gao, H., Chen, D., Li, J., Ding, W., Li, M., Xiao, Y., Wang, P., Huang, K., Sui, Y., Ruan, C., Zheng, Z., Yu, K., Cheng, X., Guo, X., Gu, S., \u2026 &amp; Bi, J. (2024). DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066.<\/li>\n<li>Gemini Team, Google. (2023). 
Gemini: A Family of Highly Capable Multimodal Models. arXiv preprint arXiv:2312.11805. https:\/\/arxiv.org\/abs\/2312.11805<\/li>\n<li>Gemini Team, Google. (2025). Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. Technical Report. https:\/\/storage.googleapis.com\/deepmind-media\/gemini\/gemini_v2_5_report.pdf<\/li>\n<li>Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., &amp; Liu, Y. (2024). RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568, 127063.<\/li>\n<li>Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., &amp; Dean, J. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.<\/li>\n<li>Fedus, W., Zoph, B., &amp; Shazeer, N. (2021). Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120), 1-39.<\/li>\n<li>Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., &amp; Chen, Z. (2020). GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.<\/li>\n<li>Clark, A., de Las Casas, D., Guy, A., Mensch, A., Paganini, M., Hoffmann, J., Damoc, B., Hechtman, B., Cai, T., Borgeaud, S., et al.\u00a0(2022). Unified scaling laws for routed language models. 
arXiv preprint arXiv:2202.01169.<\/li>\n<li>Bai, Y., Kadavath, S., Kundu, S., Askell, A., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073. https:\/\/arxiv.org\/abs\/2212.08073<\/li>\n<\/ol>\n<p>Document prepared by:<\/p>\n<p>Hene Aku Kwapong, PhD, MBA<br \/>\nMIT Practice School Alumni<\/p>\n<p>With the assistance of:<\/p>\n<p>Jeremie Kwapong<br \/>\nNew England Innovation Academy<\/p>\n<p>October 9, 2025<\/p>\n","protected":false}}