Dataset Preparation

N = 3

API_REFERENCE_PATHS = [
    "**/*.txt",
]

QUESTION_GENERATION_SYSTEM_PROMPT = """You are Lumo, a helpful AI assistant. Your task is to help a user understand everything about Solana, from fundamentals to coding and anything in between. Carefully examine the function documentation snippet and generate {} questions an intermediate-to-experienced Solana user could ask. Questions must be answerable from the information in the snippet. Do not assume anything about Solana that is not discussed in the snippet, and include complete code contents in your answers when it adds value. If the snippet is too short or contains too little information, return an empty list of questions.""".format(
    N
)

QUESTION_ANSWERING_SYSTEM_PROMPT = """You are Lumo, a helpful AI assistant. Your task is to help a user understand everything about Solana, from fundamentals to coding and anything in between. Carefully examine the function documentation and generate an explanatory response to the user's question that showcases usage and examples. Do not assume anything about Solana that is not discussed in the reference documentation snippet, and include complete code contents in your answers when it adds value."""

def chunk_text(text, chunk_size=2000, overlap=200):
    """Split text into overlapping chunks.

    `overlap` must be smaller than `chunk_size`; otherwise the window
    cannot advance and the loop would never terminate.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end >= len(text):
            break
        # Step the window forward, keeping `overlap` characters of context
        # between consecutive chunks.
        start = end - overlap
    return chunks
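
As a quick illustration of the overlap behavior (illustrative values only): splitting a 4,500-character string with the defaults yields chunks of 2,000, 2,000, and 900 characters, with each consecutive pair sharing 200 characters.

# Sanity check of the overlap behavior (illustrative only).
sample = "x" * 4500
parts = chunk_text(sample)
print([len(p) for p in parts])  # -> [2000, 2000, 900]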

1. Data Loading and Preprocessing

  • Loading the Dataset: The generated question/answer pairs are loaded using the Hugging Face datasets library, which provides efficient, memory-mapped storage and convenient batch processing.

  • Tokenization: The text data (questions and answers) is tokenized using the Llama 3.1 tokenizer, which converts the text into a sequence of numerical tokens.

  • Chat Template Application: The apply_chat_template() function from the transformers library is used to format the input data according to the Llama 3.1 chat template. This involves creating a sequence of messages with roles: "system" (for the system prompt), "user" (for the question), and "assistant" (for the answer), as sketched below.
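
A minimal sketch of this step, assuming the meta-llama/Llama-3.1-8B-Instruct tokenizer and "question"/"answer" column names (both are assumptions, not confirmed above):

# Sketch only: model id and column names are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def to_chat_text(example):
    messages = [
        {"role": "system", "content": QUESTION_ANSWERING_SYSTEM_PROMPT},
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["answer"]},
    ]
    # apply_chat_template inserts the Llama 3.1 special tokens;
    # tokenize=False returns the formatted string for later tokenization.
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}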

2. Data Splitting

The dataset is split into three subsets:

  • Training set: Used to train the Lumo model. Typically constitutes the majority of the dataset.

  • Validation set: Used to monitor the model's performance during training and tune hyperparameters.

  • Test set: Used to evaluate the final performance of the trained model on unseen data.

import glob
import json
import os

from openai import OpenAI

# `args` is assumed to come from the script's argparse setup and to
# provide `args.input` (documentation root) and `args.model`.
api_reference_chunks = []
for wcpath in API_REFERENCE_PATHS:
    for path in glob.glob(os.path.join(args.input, wcpath), recursive=True):
        with open(path) as f:
            content = f.read()
            splitted_chunks = chunk_text(content, 2000, 200)
            api_reference_chunks.extend(splitted_chunks)
print(f"Found {len(api_reference_chunks)} chunks of documentation")

client = OpenAI()

def process_chunk(chunk, client, args):
    # First pass: ask the model to generate up to N candidate questions
    # about the chunk, constrained to a JSON object via structured outputs.
    completion = client.chat.completions.create(
        model=args.model,
        temperature=0.3,
        messages=[
            {"role": "system", "content": QUESTION_GENERATION_SYSTEM_PROMPT},
            {"role": "user", "content": chunk},
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "questions",
                "schema": {
                    "type": "object",
                    "required": ["questions"],
                    "properties": {
                        "questions": {
                            "type": "array",
                            "items": {"type": "string"},
                        }
                    },
                    "additionalProperties": False,
                },
                "strict": True,
            },
        },
    )
    questions = json.loads(completion.choices[0].message.content)["questions"]
    prompt_tokens_used = completion.usage.prompt_tokens
    completion_tokens_used = completion.usage.completion_tokens
    results = []
    # Second pass: answer each generated question (capped at N), grounding
    # the model in the original documentation chunk.
    for question in questions[:N]:
        completion = client.chat.completions.create(
            model=args.model,
            temperature=0.3,
            messages=[
                {
                    "role": "system",
                    "content": QUESTION_ANSWERING_SYSTEM_PROMPT,
                },
                {"role": "assistant", "content": chunk},
                {"role": "user", "content": question},
            ],
        )
        answer = completion.choices[0].message.content
        prompt_tokens_used += completion.usage.prompt_tokens
        completion_tokens_used += completion.usage.completion_tokens
        results.append({"question": question, "answer": answer, "chunk": chunk})
    return results, prompt_tokens_used, completion_tokens_used
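
A simple driver that ties these pieces together might look like the following; the sequential loop and the lumo_dataset.jsonl output filename are assumptions (the original script may batch or parallelize the API calls):

# Illustrative driver; filename and sequential processing are assumptions.
dataset_rows = []
total_prompt_tokens = total_completion_tokens = 0
for chunk in api_reference_chunks:
    rows, p_tok, c_tok = process_chunk(chunk, client, args)
    dataset_rows.extend(rows)
    total_prompt_tokens += p_tok
    total_completion_tokens += c_tok

with open("lumo_dataset.jsonl", "w") as f:
    for row in dataset_rows:
        f.write(json.dumps(row) + "\n")
print(f"Wrote {len(dataset_rows)} question/answer pairs "
      f"({total_prompt_tokens} prompt / {total_completion_tokens} completion tokens)")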

The dataset is split using the train_test_split() function from the datasets library, ensuring a random and representative distribution of data across the three subsets. Since each call produces only a train/test pair, the function is applied twice, as in the sketch below.
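
A sketch of the two-stage split; the 80/10/10 ratios, the seed, and the lumo_dataset.jsonl filename are illustrative assumptions:

# Sketch of a two-stage split; ratios, seed, and filename are assumptions.
from datasets import load_dataset

dataset = load_dataset("json", data_files="lumo_dataset.jsonl")["train"]

# First carve out 20% for evaluation, then halve it into validation/test.
split = dataset.train_test_split(test_size=0.2, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)

train_ds = split["train"]    # 80%
val_ds = holdout["train"]    # 10%
test_ds = holdout["test"]    # 10%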

3. Data Collation

  • Collation Function: A custom collation function is defined to handle the batching of data during training. It pads the examples in each batch to a common length so the model can process them efficiently; a sketch of such a collator follows.
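
A minimal sketch of such a collator, assuming each example holds an input_ids list from the tokenizer and that tokenizer.pad_token_id is set (e.g. to eos_token_id); the collator actually used for Lumo may differ:

# Minimal padding collator sketch; the real collator may differ.
import torch

def collate_fn(batch):
    # Pad every example in the batch to the longest sequence present.
    max_len = max(len(ex["input_ids"]) for ex in batch)
    input_ids, attention_mask, labels = [], [], []
    for ex in batch:
        pad = max_len - len(ex["input_ids"])
        # Assumes tokenizer.pad_token_id is set (e.g. to eos_token_id).
        input_ids.append(ex["input_ids"] + [tokenizer.pad_token_id] * pad)
        attention_mask.append([1] * len(ex["input_ids"]) + [0] * pad)
        # Padding positions are excluded from the loss via the -100 label.
        labels.append(ex["input_ids"] + [-100] * pad)
    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
        "labels": torch.tensor(labels),
    }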
