Hindi and Tamil Question Answer / RAG
In this notebook, we use the new Navarasa LLMs from Telugu-LLM-Labs to build a Hindi and Tamil Question Answering system. Since we're using a 7B model with PEFT, this notebook is run on Google Colab with an A100. If you're working with a smaller machine, I'd encourage you to try the 2B model instead.
| Time | Level | Author |
|---|---|---|
| 25 min | Beginner | Nirant Kasliwal |
!pip install -U fastembed datasets qdrant-client peft transformers accelerate bitsandbytes -qq
from typing import List
import numpy as np
from datasets import load_dataset
from peft import AutoPeftModelForCausalLM
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, VectorParams, Distance
from transformers import AutoTokenizer
from fastembed import TextEmbedding
hf_token = "<your_hf_token_here>"  # Get your token from https://huggingface.co/settings/token, needed for Gemma weights
Setting Up
Next, we'll download the dataset, our LLM weights, and the embedding model weights.
embedding_model = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
model_id = "Telugu-LLM-Labs/Indic-gemma-2b-finetuned-sft-Navarasa"
ds = load_dataset("nirantk/chaii-hindi-and-tamil-question-answering", split="train")
ds
This dataset pairs each question with a context passage that contains the answer; the LLM must extract the answer from that context. This is an extractive Question Answering problem.
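To get a feel for the data, you can peek at one row. This uses the question, context, and answer_text fields that appear later in this notebook:
sample = ds[0]  # a single row from the dataset
print(sample["question"])
print(sample["context"][:200])  # the passage the answer must be extracted from
print(sample["answer_text"])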
To do this, we'll set up an embedding model from FastEmbed, and then index the context embeddings into Qdrant's in-memory mode, which is powered by NumPy.
embedding_model = TextEmbedding(model_name=embedding_model)
We'll use the 7B model here; the 2B model isn't great and struggled with reading comprehension.
Downloading the Navarasa LLM
We'll download the Navarasa LLM from Telugu-LLM-Labs. This is a 7B model with PEFT.
model = AutoPeftModelForCausalLM.from_pretrained(
model_id,
load_in_4bit=False,
token=hf_token,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
Embed the Context into Vectors
questions, contexts = list(ds["question"]), list(ds["context"])
context_embeddings: List[np.ndarray] = list(
    embedding_model.embed(contexts)
)  # embed() returns a generator, so we materialize it with list()
len(context_embeddings[0])
def embed_text(text: str) -> np.ndarray:
    return list(embedding_model.embed(text))[0]
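As a quick check, embedding a single question should give a vector with the same dimensionality as the context embeddings (768 for paraphrase-multilingual-mpnet-base-v2):
len(embed_text(questions[0]))  # should match len(context_embeddings[0])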
context_points = [
    PointStruct(id=idx, vector=emb.tolist(), payload={"text": text})  # convert numpy arrays to plain lists for PointStruct
    for idx, (emb, text) in enumerate(zip(context_embeddings, contexts))
]
len(context_points[0].vector)
Insert into Qdrant
search_client = QdrantClient(":memory:")
search_client.create_collection(
collection_name="hindi_tamil_contexts",
vectors_config=VectorParams(size=len(context_points[0].vector), distance=Distance.COSINE),
)
search_client.upsert(collection_name="hindi_tamil_contexts", points=context_points)
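As an optional sanity check (not part of the original flow), you can confirm that every context made it into the collection:
search_client.count(collection_name="hindi_tamil_contexts")  # should equal len(contexts)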
Selecting a Question
I've selected a question here with a specific index, and we then find the answer to it. We have the correct answer for it too -- so we can compare the two when you run the code.
idx = 997
question = questions[idx]
print(question)
search_context = search_client.search(
query_vector=embed_text(question), collection_name="hindi_tamil_contexts", limit=2
)
search_context_text = search_context[0].payload["text"]
len(search_context_text)
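Before prompting the LLM, it's worth inspecting what was retrieved. This optional snippet prints the similarity score and a short preview of the top hit:
print(search_context[0].score)  # cosine similarity of the best-matching context
print(search_context_text[:200])  # preview of the retrieved passage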
Running the Model with a Question & Context
input_prompt = """
Answer the following question based on the context given after it in the same language as the question:
### Question:
{}
### Context:
{}
### Answer:
{}"""
input_text = input_prompt.format(
questions[idx], # question
search_context_text[:2000], # context
"", # output - leave this blank for generation!
)
inputs = tokenizer([input_text], return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, use_cache=True)
response = tokenizer.batch_decode(outputs)[0]
response.split(sep="### Answer:")[-1].replace("<eos>", "").strip()
ds[idx]["answer_text"]