N = 3
API_REFERENCE_PATHS = [
    "**/*.txt",
]
QUESTION_GENERATION_SYSTEM_PROMPT = """You are Lumo, a helpful AI assistant. Your task is to help a user understand everything about Solana, from fundamentals to coding and anything in between. Carefully examine the function documentation snippet and generate {} questions a medium to experienced Solana user could ask. Questions must be answerable from the information in the snippet. Do not assume anything about Solana that is not discussed in the snippet, and make sure you include complete code contents in your answers when it might add value. If the snippet is too short or contains too little information, output an empty JSON array.""".format(
    N
)
QUESTION_ANSWERING_SYSTEM_PROMPT = """You are Lumo, a helpful AI assistant. Your task is to help a user understand everything about Solana, from fundamentals to coding and anything in between. Carefully examine the function documentation and generate an explanatory response based on the user's question which showcases usage and examples. Do not assume anything about Solana that is not discussed in the reference documentation snippet, and make sure you include complete code contents in your answers when it might add value."""
def chunk_text(text, chunk_size=2000, overlap=200):
    """Split text into overlapping chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunk = text[start:end]
        chunks.append(chunk)
        if end >= len(text):
            break
        # Step back by `overlap` so consecutive chunks share context.
        start = end - overlap
        if start < 0:
            start = 0
    return chunks
1. Data Loading and Preprocessing
Loading the Dataset: The dataset is loaded using the Hugging Face datasets library, providing a convenient and efficient way to handle and process the data.
Tokenization: The text data (questions and answers) is tokenized using the Llama 3.1 tokenizer, which converts the text into a sequence of numerical tokens.
Chat Template Application: The tokenizer's apply_chat_template() method (from the transformers library) is used to format the input data according to the Llama 3.1 chat template. This involves creating a sequence of messages with roles: "system" (for the system prompt), "user" (for the question), and "assistant" (for the answer).
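The role-tagged message layout described above can be sketched as follows. Note that build_chat_messages is a hypothetical helper and the prompt, question, and answer strings are placeholders, not the actual dataset contents:

```python
def build_chat_messages(system_prompt, question, answer):
    """Arrange one Q/A pair into the role-tagged message list
    expected by the Llama 3.1 chat template."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]

messages = build_chat_messages(
    "You are Lumo, a helpful AI assistant.",
    "How do I generate a Solana keypair?",
    "Use Keypair.generate() from @solana/web3.js ...",
)
# With a real tokenizer, this list would then be passed as:
#   tokenizer.apply_chat_template(messages, tokenize=True)
```

The list-of-dicts shape is what transformers chat templates consume; the template itself inserts the model-specific special tokens around each role.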
2. Data Splitting
The dataset is split into three subsets:
Training set: Used to train the Lumo model. Typically constitutes the majority of the dataset.
Validation set: Used to monitor the model's performance during training and tune hyperparameters.
Test set: Used to evaluate the final performance of the trained model on unseen data.
The dataset is split using the train_test_split() method from the datasets library, applied twice: once to hold out the test set, and once to carve a validation set from the remainder, ensuring a random and representative distribution of data across the three subsets.
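A minimal sketch of the three-way split, using plain-Python index lists in place of the datasets API (the 80/10/10 fractions and seed here are illustrative assumptions, not values from the source):

```python
import random

def three_way_split(indices, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle and split indices into train/validation/test subsets.
    Pure-Python stand-in for two chained train_test_split() calls:
    first hold out the test set, then carve validation from the rest."""
    rng = random.Random(seed)
    shuffled = indices[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = three_way_split(list(range(1000)))
print(len(train), len(val), len(test))  # 800 100 100
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing hyperparameter settings against the same validation set.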
3. Data Collation
Collation Function: A custom collation function is defined to handle the batching of data during training. This function pads the examples in each batch to a consistent length so they can be processed efficiently by the model.
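The padding step a collation function performs can be sketched in plain Python as below; collate_batch is a hypothetical stand-in (a real pipeline would return tensors and use the tokenizer's actual pad token id):

```python
def collate_batch(token_sequences, pad_token_id=0):
    """Pad variable-length token sequences to the batch maximum and
    build the matching attention mask (1 = real token, 0 = padding)."""
    max_len = max(len(seq) for seq in token_sequences)
    input_ids, attention_mask = [], []
    for seq in token_sequences:
        pad = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * pad)
        attention_mask.append([1] * len(seq) + [0] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = collate_batch([[5, 6, 7], [8, 9]])
# batch["input_ids"]      -> [[5, 6, 7], [8, 9, 0]]
# batch["attention_mask"] -> [[1, 1, 1], [1, 1, 0]]
```

Padding to the per-batch maximum (rather than a global maximum) keeps batches compact and avoids wasting compute on padding tokens.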