A bunch of people have asked me how some of the concepts I explain are coded.
I’ve been researching LLMs for half a year now, and it’s been extremely interesting. But every time I come up with an idea, getting it into code isn’t always straightforward.
In this post, we’ll understand what “activations” are in LLMs and how to access them.
If you have access to GPUs, I’d recommend coding along. If not, check out JarvisLabs.ai — they give access to pretty cheap GPUs (you pay per hour).
How LLMs work
An LLM goes through a few steps to process a prompt:
An input prompt is converted to a sequence of tokens using the tokenizer.
The sequence of tokens is embedded into a set of vectors (one vector per token).
These vectors (called the residual stream) pass through the model’s transformer layers, each of which updates them with attention and MLP operations.
The vectors at the final layer are unembedded into logits over the vocabulary, which a softmax turns into a probability distribution over possible next tokens.
The highest-probability token from the distribution at the last token position is chosen as the next token (under greedy decoding); the shapes at each stage are sketched below.
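If it helps to see those steps as tensor shapes, here’s a toy sketch with made-up sizes (5 tokens, hidden size 16, a 100-word vocabulary), not the real Llama dimensions:
import torch

# Toy walk-through of the shapes at each stage (hypothetical sizes, not Llama's real ones)
num_tokens, hidden_size, vocab_size = 5, 16, 100

token_ids = torch.randint(0, vocab_size, (1, num_tokens))        # 1. tokenize: (batch, num_tokens)
embed = torch.nn.Embedding(vocab_size, hidden_size)
residual_stream = embed(token_ids)                               # 2. embed: (batch, num_tokens, hidden_size)
# 3. each transformer layer reads and updates this (batch, num_tokens, hidden_size) tensor
unembed = torch.nn.Linear(hidden_size, vocab_size)
logits = unembed(residual_stream)                                # 4. unembed: (batch, num_tokens, vocab_size)
next_token_id = logits[0, -1, :].softmax(dim=-1).argmax().item() # 5. pick the most probable next token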
Getting the predicted next token through code
Now let’s see how we can input a prompt into an LLM and get the next token.
Choose an LLM. We need an open-source LLM, since we’ll be messing with its internal states. Llama-3.1-8B seems like a great choice for now.
Set up the dependencies. Make sure you have a HuggingFace account and have requested access to the gated model you want (Llama-3.1-8B). Then install transformers and a few other related libraries with pip before you start coding.
pip install transformers accelerate
Import dependencies. We’re using the PyTorch and transformers libraries today; transformers is what downloads the LLM models and tokenizers we want to use.
import torch.nn.functional as F
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
Load up the model and tokenizer. First, set MODEL_ID to the model we want to use. Then we use the AutoModelForCausalLM class to load the pretrained Llama-3.1-8B weights into the ‘model’ variable (in float16, to roughly halve the memory footprint) and load the corresponding tokenizer. Finally, we move the model to the “cuda” device so that it uses the available GPU for its computations.
MODEL_ID = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
device = torch.device("cuda")
model.to(device)
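This is optional, but you can sanity-check the load by peeking at the model config (the layer count also comes in handy later when we register hooks). The numbers in the comments are what I’d expect for Llama-3.1-8B:
print(model.config.num_hidden_layers)  # number of transformer layers (32 for Llama-3.1-8B)
print(model.config.hidden_size)        # width of the residual stream (4096 for Llama-3.1-8B)
print(next(model.parameters()).device) # should print cuda:0 if the move to the GPU worked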
Create the input prompt and tokenize it. Before we feed a prompt into an LLM, it needs to be tokenized, i.e., converted into a list of token IDs, where each substring of the prompt maps to a number.
input_prompt = "Trump works at McDonald's. Trump works at"
input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids.to(device)
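If you’re curious what the tokenizer actually produced, you can print the IDs and the substrings they map to (the exact IDs depend on the Llama-3.1 tokenizer):
print(input_ids)                                               # tensor of token IDs, shape (1, num_tokens)
print(tokenizer.convert_ids_to_tokens(input_ids[0].tolist()))  # the substring each ID maps to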
Feed the tokens into the model and extract the output logits. The logits we obtain have a shape of (BATCH_SIZE, NUM_TOKENS, VOCAB_SIZE). Since we have only one batch here, and we only want the logits at the last token position, we can index the logits with [0, -1, :] to get the required vector. Wrapping the forward pass in torch.no_grad() skips gradient tracking and saves memory.
with torch.no_grad():  # inference only, no gradients needed
    logits = model(input_ids).logits
print(logits.shape)  # prints (BATCH_SIZE, NUM_TOKENS, VOCAB_SIZE)
last_token_logits = logits[0, -1, :]
Get the most probable next token. Applying a softmax on these logits converts them into a probability distribution, and finding the token index with the maximum probability allows us to get the most probable next token. We can convert the index to a token string by decoding it using the tokenizer.
last_token_probs = F.softmax(last_token_logits, dim=-1)
max_index = torch.argmax(last_token_probs).item()
next_token = tokenizer.decode([max_index])
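Optionally, you can look at more than just the single best token, for example the top 5 candidates and their probabilities:
# Optional: inspect the top-5 candidate next tokens and their probabilities
top_probs, top_ids = torch.topk(last_token_probs, k=5)
for prob, token_id in zip(top_probs.tolist(), top_ids.tolist()):
    print(f"{tokenizer.decode([token_id])!r}: {prob:.3f}")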
Getting the internal activations
Great — we’ve figured out how to predict the next token from an open-source LLM given an input prompt.
Now, how do we extract the internal activations while doing so?
We use “hooks”. Here’s a breakdown of what we do:
Initialise a dictionary called “activations”. We’ll be storing our internal activations in this dictionary.
activations = {}
Define a “get_hook” function. get_hook returns a function, the hook, which is called every time a forward pass goes through the layer it is registered on. In this case, the hook stores that layer’s output in the activations dictionary. For Llama-style decoder layers the output is a tuple whose first element is the hidden-state tensor, which is why we index output[0]. We call detach() so the stored tensor is cut off from the autograd graph (if GPU memory is tight, you could also move it to the CPU with .cpu() before storing it).
def get_hook(layer_num):
    def hook(module, input, output):
        # output is a tuple for Llama decoder layers; output[0] holds the hidden states
        # for the entire sequence, not just the last token
        activations[layer_num] = output[0].detach()
    return hook
Define the “register_hooks” function and run it. We first pick which layers we want to hook (here, every decoder layer, keyed by 1-indexed layer numbers), then register a forward hook on each of them before performing a forward pass. A layer’s hook is just a function that’s triggered every time a forward pass goes through that layer, so each forward pass updates the activations dictionary. register_forward_hook returns a handle, which we keep so the hooks can be removed later.
# Hook every decoder layer; we use 1-indexed layer numbers as dictionary keys
layer_list = range(1, model.config.num_hidden_layers + 1)

def register_hooks():
    list_of_hooks = []
    for i in layer_list:
        list_of_hooks.append(model.model.layers[i - 1].register_forward_hook(get_hook(i)))
    return list_of_hooks

hook_handles = register_hooks()  # keep the handles so we can remove the hooks later
Perform a forward pass like we did in the previous section. The activations dictionary should automatically get updated.
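Putting it together, a forward pass like the following should fill the dictionary with one entry per hooked layer (the keys are the 1-indexed layer numbers from register_hooks); remember to remove the hooks when you’re done:
with torch.no_grad():
    model(input_ids)         # forward pass; the hooks fire at every registered layer

print(len(activations))      # one entry per hooked layer
print(activations[1].shape)  # layer 1's output: (batch, num_tokens, hidden_size)

for handle in hook_handles:  # remove the hooks once you're done with them
    handle.remove()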
Conclusion
So we figured out how to access the activations. But why access them at all?
Because there’s so much you can do with them.
For example, some recent research modifies the decoding mechanism of LLMs (like the DoLa contrastive-decoding work linked below) to make their outputs more truthful. Accessing these activations is a prerequisite for getting that to work.
I’ve written a few blogs about this sort of decoding mechanism, so if you want to read more, you can check these out:
Making LLMs more Truthful with DoLa: A Contrastive Decoding Approach (Part I)
Making LLMs more Truthful with DoLa: The Math Stuff (Part II)
Usually, when I start a research project, it’s about the concepts and theory first — I get to the code only later.
Translating these concepts into code isn’t always straightforward, but nowadays, LLMs are getting quite powerful. I like to use Claude to help me translate concepts in my head into code. While you can’t always trust them, LLMs can be super helpful to write code for you if you know how to verify the code is correct.
Acknowledgements
All the diagrams were made by me on Canva.
Follow me: LinkedIn | X (Twitter) | Website