ConvGPT - New Language Model Compression Method
Today, I present another model from my “weird” series - models that utilize two-dimensional convolutional networks in natural language modeling. This time, I’m introducing ConvGPT - an architecture created for mobile/edge devices (so-called SLMs). In this architecture, the convolutional network does not act as a separate layer (as it did in the NeuroBLAST model); instead, it is responsible (along with an average pooling layer) for reducing the size (compression) of the input hidden state vector, which then passes through subsequent layers in the residual stream.
Compression in the World of LLMs
In the LLM world, there are many techniques for compressing large language models into smaller counterparts to reduce the resources (GPU/TPU, energy) required to run them, while trying to ensure the smaller version maintains sufficient text generation quality (coherence, credibility, high benchmark scores). These techniques include quantization, knowledge distillation, pruning, and LoRA (the latter doesn’t necessarily reduce model size per se, but modifies it without training the full large model, reducing resources during fine-tuning).
However, all these techniques apply to ready-made, large models that we want to shrink. This isn’t entirely efficient because maintaining quality usually involves significant compromises, such as weaker/unpredictable text generation behavior or worse recall of domain knowledge.
In my experiments, I start from the basics: I introduce architectural modifications at the initial stage, meaning we train a compressed model from scratch. While ambitious, this approach has a major downside: training from scratch requires hardware resources I cannot provide on my own. Therefore, I try to take every available opportunity; currently, I am experimenting with JAX and training models on TPUs through the Google TPU Research Cloud program. A second downside is that we don’t know what to expect from such a model - specifically, whether it will be competitive against its larger counterparts.
But let’s get back to the compression method I’m using in this architecture, which is over 4 years old (the current version has only been updated with a newer attention module).
Why?
My experiments show that we don’t need such a large amount of information (a large vector) in the decoder layer (attention + MLP), whereas we do need it in the embedding layer and the prediction head. Passing information through a convolutional layer + average pooling shrinks this vector by a factor of 9 in the ConvGPT checkpoint I have shared.
How does it work?
First, we initialize the model with an embedding dimension (hidden size in the architecture) of 1296, while the rest of the network operates on an input vector of size 144 (the transformer dimension). The transformer dimension is calculated as follows:
self.emb_dim_factor = int(math.sqrt(self.hidden_size))
if self.emb_dim_factor**2 != self.hidden_size:
    raise ValueError(
        f"hidden_size ({self.hidden_size}) must be a perfect square. Got {math.sqrt(self.hidden_size)} as sqrt."
    )
self.emb_factor = int(math.sqrt(self.emb_dim_factor)) // 2
self.transformer_dim = int(
    self.hidden_size / (self.emb_factor * self.emb_factor)
)
Think of this process as “zipping” a file or creating a thumbnail from a high-resolution image before doing the heavy work.
Here is the step-by-step breakdown of what the math is actually doing conceptually:
1. The “Square” Logic (Reshaping)
The code insists that hidden_size (1296) must be a perfect square.
Intuition: It treats the data not as a long list of numbers (a vector), but as a 2D Grid (an image).
The Math: \(\sqrt{1296} = 36\).
The Visual: Instead of a line of 1,296 numbers, the model sees a 36x36 pixel image for every single word (token).
2. The Scaling Factor (The Compression Ratio)
The code calculates emb_factor.
The Math: It takes the square root of 36 (which is 6) and divides by 2, resulting in 3.
Intuition: This 3 is your compression block size. It means the model is going to look at 3x3 blocks of the grid and summarize each one into a single point.
3. The Result (Transformer Dim)
Finally, it calculates the transformer_dim.
The Math: \(1296 / (3 \times 3) = 144\)
The Visual:
You start with a 36x36 grid (1296 points).
You compress every 3x3 block into 1 point.
You end up with a 12x12 grid.
\(12 \times 12 = 144\)
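The arithmetic above can be reproduced in a few lines of plain Python. This is only a sketch that repeats the calculation from the configuration snippet outside the model class; the function name compression_dims and its return tuple are hypothetical and not part of the ConvGPT code.

```python
import math


def compression_dims(hidden_size: int) -> tuple[int, int, int, int]:
    """Return (grid_side, block_size, transformer_dim, pooled_side).

    Hypothetical helper that mirrors the arithmetic of the config snippet.
    """
    grid_side = math.isqrt(hidden_size)  # 36 for hidden_size = 1296
    if grid_side * grid_side != hidden_size:
        raise ValueError(f"hidden_size ({hidden_size}) must be a perfect square")
    block_size = math.isqrt(grid_side) // 2  # isqrt(36) // 2 = 3
    transformer_dim = hidden_size // (block_size * block_size)
    pooled_side = grid_side // block_size
    return grid_side, block_size, transformer_dim, pooled_side


print(compression_dims(1296))  # (36, 3, 144, 12)
print(compression_dims(4096))  # (64, 4, 256, 16)
```

The second call shows that the compression ratio is not fixed at 9: with a 64x64 grid the block size becomes 4, so the vector shrinks by a factor of 16.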
The intuition is “High-Resolution Perception, Low-Resolution Reasoning.”
The model “sees” the word in high fidelity (1296 dimensions) to capture every nuance of its meaning. However, carrying that heavy 1296-sized backpack through every layer of the network is too expensive.
So, it runs a mathematical “summarization” to shrink the vector to 144 dimensions. The internal brain of the model (the Transformer layers) does all its logic and reasoning on this lighter, compressed version (144), a vector 9 times smaller than the full size, which makes those layers much cheaper in both compute and memory.
However, using convolutional networks in generative language models carries the risk of “token leakage”. This happens when incorrect multidimensional transformations violate the causality principle in autoregressive models, essentially “cheating” the model: a current token “sees” future tokens, whereas it should only see previous ones. Consequently, you might achieve very good (low) loss values during training, but very high values during testing (resulting in the generation of gibberish).
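A common way to detect this kind of leakage is a perturbation test: change only the tokens after some position and check that the outputs up to that position stay identical. The sketch below runs that test on two toy 1D convolutions written in plain Python. The helpers causal_conv, centered_conv, and leaks_future are hypothetical names used for illustration; they are not ConvGPT code.

```python
def causal_conv(seq, kernel):
    """1D convolution with zeros added on the left only: output t uses inputs <= t."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(seq)
    return [sum(w * padded[t + i] for i, w in enumerate(kernel)) for t in range(len(seq))]


def centered_conv(seq, kernel):
    """1D convolution with symmetric padding: output t also uses input t + 1."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(seq) + [0.0] * half
    return [sum(w * padded[t + i] for i, w in enumerate(kernel)) for t in range(len(seq))]


def leaks_future(layer, seq, position):
    """Change every token after `position` and report whether earlier outputs move."""
    changed = list(seq)
    for t in range(position + 1, len(seq)):
        changed[t] += 100.0
    before = layer(seq)[: position + 1]
    after = layer(changed)[: position + 1]
    return before != after


tokens = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [0.25, 0.5, 0.25]
print(leaks_future(lambda s: causal_conv(s, kernel), tokens, position=2))    # False
print(leaks_future(lambda s: centered_conv(s, kernel), tokens, position=2))  # True
```

The same check works on a full model: feed two sequences that differ only after position t and compare the logits up to t.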
To maintain the autoregressive nature of the model during compression, ConvGPT performs the following transformations:
B, S, L = hidden_states.size()
# Reshape to 2D representation per token for convolution
hidden_states = hidden_states.view(
    B,
    S,
    self.embedding_dim_factor,
    self.embedding_dim_factor,
).transpose(1, 2).contiguous()
# Padding to prevent future token leakage
pad_w = (self.conv.kernel_size[1] - 1) * self.conv.dilation[1]
hidden_states = nn.functional.pad(hidden_states, (pad_w, 0, 0, 0))
hidden_dtype = hidden_states.dtype
hidden_states = self.conv(hidden_states)
hidden_states = hidden_states.transpose(1, 2).contiguous()
hidden_states = self.pool(hidden_states)
# Flatten back to vector form
hidden_states = hidden_states.view(
    B,
    S,
    hidden_states.shape[-2] * hidden_states.shape[-1],
)
hidden_states = self.norm_emb(hidden_states)
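The padding amount in the snippet above, (kernel_size - 1) * dilation, is exactly what a stride-1 convolution needs to keep the padded axis at its original length. The sketch below checks this with the standard output-length formula for convolution and pooling layers. The kernel width of 3, dilation of 1, and 3x3 pooling with stride 3 are assumptions chosen for illustration; the actual values in the checkpoint may differ.

```python
def conv_out_len(n, kernel, dilation=1, stride=1, pad_left=0, pad_right=0):
    """Output length along one axis (standard convolution/pooling formula)."""
    return (n + pad_left + pad_right - dilation * (kernel - 1) - 1) // stride + 1


grid_side = 36                 # sqrt(1296), from the text
kernel_w, dilation_w = 3, 1    # assumed values, not read from the checkpoint
pad_w = (kernel_w - 1) * dilation_w  # same formula as in the snippet above

with_padding = conv_out_len(grid_side, kernel_w, dilation_w, pad_left=pad_w)
without_padding = conv_out_len(grid_side, kernel_w, dilation_w)
pooled = conv_out_len(with_padding, kernel=3, stride=3)  # assumed 3x3 pooling

print(with_padding)     # 36
print(without_padding)  # 34
print(pooled)           # 12
```

Without the one-sided padding the axis would shrink to 34, which no longer divides evenly into 3x3 blocks.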
Think of this part as “Reading with Blinders”.
1. The Transformation (Pop-up Book Effect)
Code: hidden_states.view(...).transpose(...)
Intuition: The model takes the flat list of numbers and “inflates” it into a 2D grid (the image concept we established earlier).
Why? A standard Transformer sees a word as a single point. This model wants to see the word as a “texture” or a surface to find patterns that a simple point would miss.
2. The “Don’t Look Ahead” Rule (Causal Padding)
Code: nn.functional.pad(hidden_states, (pad_w, 0, 0, 0))
The Problem: In a standard image convolution (like Photoshop filters), the computer analyzes a pixel by looking at its neighbors to the left, right, up, and down.
The Danger: In language, “right” means “future.” If the model sees the word to the right, it is “cheating” (seeing the answer before it predicts it). This is called Token Leakage.
The Solution: The code adds Padding only to one side (the past).
The Visual: Imagine sliding a window over a sentence. This code shoves a bunch of blank space into the start of the sentence. This forces the “window” to be shifted so that it can only see the current word and the previous words. It physically blocks the model from peeking at the future.
3. The Crunch (Convolution & Pooling)
Code: self.conv(...) and self.pool(...)
Intuition: Now that we are safe from cheating, the Convolution scans the 2D grid to extract features (shapes, patterns), and the Pooling acts like a trash compactor: average pooling replaces each 3x3 block with its mean, so a strong signal inside a block survives in the summary while small fluctuations are smoothed out.
4. Back to Reality (Flattening)
Code: hidden_states.view(...)
Intuition: The Transformer layers (the brain) don’t understand 2D grids; they only understand 1D lines (vectors).
The Result: This step takes that compressed, high-value 2D square and unrolls it back into a flat line. It is now much smaller (compressed), but dense with meaning, ready for the rest of the network to process efficiently.
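To make steps 3 and 4 concrete, here is a minimal plain-Python sketch of the pooling and flattening for a single token. It deliberately skips the convolution and the batch and sequence dimensions, and the input values are a stand-in rather than real hidden states. avg_pool_2d is a hypothetical helper, not the model's pooling layer.

```python
def avg_pool_2d(grid, block):
    """Average each non-overlapping block x block patch of a square grid."""
    side = len(grid) // block
    return [
        [
            sum(grid[r * block + i][c * block + j] for i in range(block) for j in range(block))
            / (block * block)
            for c in range(side)
        ]
        for r in range(side)
    ]


hidden_size, grid_side, block = 1296, 36, 3
vector = [float(i) for i in range(hidden_size)]  # stand-in for one token's hidden state
grid = [vector[r * grid_side:(r + 1) * grid_side] for r in range(grid_side)]  # 36 x 36
pooled = avg_pool_2d(grid, block)                   # 12 x 12
flat = [value for row in pooled for value in row]   # unrolled back into a vector

print(len(grid), len(grid[0]))  # 36 36
print(len(flat))                # 144
print(flat[0])                  # 37.0
```

The first output value, 37.0, is the mean of the top-left 3x3 block (0, 1, 2, 36, 37, 38, 72, 73, 74).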
Summary
If the first part was “Zipping” the file, this part ensures you don’t zip tomorrow’s newspaper inside it. It forces the compression to happen strictly in chronological order.
PS. As you can see, I add normalization at the end because, after passing through the convolution layer, the values in the hidden vector can become very large.
What does this give us?
In my experiment, the model used has 164M parameters. A model with the same configuration but without convolution + pooling would have 722M parameters. We therefore achieve a size reduction of more than 4x, allowing us to train smaller SLMs while maintaining high token separation in the embedding layer and the prediction head.
Furthermore, if we set the hidden size to 2048 and the intermediate size to 6144 (it is 3072 in the current model), such a variant of ConvGPT would have only 266M parameters, while the model without compression would be 1.7B parameters - a massive 6.5x reduction! (Just think of how much VRAM is saved).
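For a rough sense of the VRAM involved, the weights alone take about 2 bytes per parameter in bf16 or fp16. The sketch below applies that rule to the parameter counts quoted above. It is an estimate for weights only and ignores activations, the KV cache, and optimizer state; weight_memory_gb is a hypothetical helper.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory for the weights alone; 2 bytes per parameter corresponds to bf16/fp16."""
    return num_params * bytes_per_param / 1e9


variants = [
    ("current checkpoint", 164e6, 722e6),
    ("larger variant", 266e6, 1.7e9),
]
for name, compressed, uncompressed in variants:
    small = weight_memory_gb(compressed)
    large = weight_memory_gb(uncompressed)
    print(f"{name}: {small:.2f} GB instead of {large:.2f} GB of weights")
```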
I trained the shared checkpoint (link) on over 250B tokens from the PleIAs/SYNTH dataset. It is the first model this small to achieve a score exceeding 30% on the GPQA-Diamond benchmark, which is designed to test the reasoning skills of large language models on difficult multiple-choice questions (where 25% represents random guessing).
I hope my method appeals to you and sparks a discussion about compression possibilities and their impact on how models function during the pre-training stage.
Happy experimenting!
Links
Hugging Face: ConvGPT 250B EC checkpoint
GitHub: vLLM and SGLang support

