Dataset Preparation

The dataset is built by generating question/answer pairs from Solana reference documentation with the OpenAI API. The configuration and helper code below drive that generation pipeline.

import glob
import json
import os

from openai import OpenAI

# Questions to generate per documentation chunk.
N = 3

# Glob patterns selecting the documentation files to process.
API_REFERENCE_PATHS = [
    "**/*.txt",
]

QUESTION_GENERATION_SYSTEM_PROMPT = """You are Lumo, a helpful AI assistant. Your task is to help a user understand everything about Solana, from fundamentals, to coding, or anything at all. Carefully examine the function documentation snippet and generate {} questions a medium to experienced Solana user could ask. Questions must be answerable from the information in the snippet. Do not assume anything about Solana that is not discussed in the snippet, make sure you include complete code contents in your answers when it might add value. If the snippet is too short or contains too little information, output an empty JSON array.""".format(
    N
)

QUESTION_ANSWERING_SYSTEM_PROMPT = """You are Lumo, a helpful AI assistant. Your task is to help a user understand everything about Solana, from fundamentals, to coding, or anything at all. Carefully examine the function documentation and generate an explanatory response based on the user's question which showcases usage and examples. Do not assume anything about Solana that is not discussed in the reference documentation snippet, make sure you include complete code contents in your answers when it might add value."""

def chunk_text(text, chunk_size=2000, overlap=200):
    """Split text into overlapping chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunk = text[start:end]
        chunks.append(chunk)
        if end >= len(text):
            break
        # Step back by `overlap` characters so consecutive chunks share
        # context across the boundary.
        start = end - overlap
        if start < 0:
            start = 0
    return chunks
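
For instance (illustrative values), a 4,500-character input yields three chunks whose spans overlap by 200 characters:

chunks = chunk_text("a" * 4500, chunk_size=2000, overlap=200)
# Spans [0:2000], [1800:3800], [3600:4500]
print(len(chunks))  # 3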

1. Data Loading and Preprocessing

  • Loading the Dataset: The dataset is loaded using the Hugging Face datasets library, providing a convenient and efficient way to handle and process the data.

  • Tokenization: The text data (questions and answers) is tokenized using the Llama 3.1 tokenizer, which converts the text into a sequence of numerical tokens.

  • Chat Template Application: The apply_chat_template() function from the transformers library is used to format the input data according to the Llama 3.1 chat template. This involves creating a sequence of messages with roles: "system" (for the system prompt), "user" (for the question), and "assistant" (for the answer), as sketched below.
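
A minimal sketch of this formatting step, assuming a Llama 3.1 tokenizer (the checkpoint name and example row are illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

example = {
    "question": "How do I create a Solana keypair?",
    "answer": "...",
}

messages = [
    {"role": "system", "content": "You are Lumo, a helpful AI assistant."},
    {"role": "user", "content": example["question"]},
    {"role": "assistant", "content": example["answer"]},
]

# Renders the conversation with the model's built-in chat template and
# tokenizes it in a single call.
input_ids = tokenizer.apply_chat_template(messages, tokenize=True)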

2. Data Splitting

The dataset is split into three subsets:

  • Training set: Used to train the Lumo model. Typically constitutes the majority of the dataset.

  • Validation set: Used to monitor the model's performance during training and tune hyperparameters.

  • Test set: Used to evaluate the final performance of the trained model on unseen data.

The dataset is split using the train_test_split() function from the datasets library, ensuring a random and representative distribution of data across the three subsets.
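
A sketch of such a three-way split (the file name, ratios, and seed are assumptions for illustration):

from datasets import load_dataset

# Hypothetical file of generated question/answer pairs.
dataset = load_dataset("json", data_files="lumo_qa.jsonl", split="train")

# Hold out 10% of the rows, then divide the holdout evenly into
# validation and test sets.
split = dataset.train_test_split(test_size=0.1, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)

train_set = split["train"]
validation_set = holdout["train"]
test_set = holdout["test"]

The question/answer pairs themselves come from the generation pipeline below, which gathers the documentation chunks and queries the model once per chunk to propose questions, then once per question to answer it.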

api_reference_chunks = []
for wcpath in API_REFERENCE_PATHS:
    for path in glob.glob(os.path.join(args.input, wcpath), recursive=True):
        with open(path) as f:
            content = f.read()
            splitted_chunks = chunk_text(content, 2000, 200)
            api_reference_chunks.extend(splitted_chunks)
print(f"Found {len(api_reference_chunks)} chunks of documentation")

# The client reads the OPENAI_API_KEY environment variable by default.
client = OpenAI()

def process_chunk(chunk, client, args):
    # First call: ask the model to propose questions about the chunk,
    # constrained to a JSON schema so the reply parses reliably.
    completion = client.chat.completions.create(
        model=args.model,
        temperature=0.3,
        messages=[
            {"role": "system", "content": QUESTION_GENERATION_SYSTEM_PROMPT},
            {"role": "user", "content": chunk},
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "questions",
                "schema": {
                    "type": "object",
                    "required": ["questions"],
                    "properties": {
                        "questions": {
                            "type": "array",
                            "items": {"type": "string"},
                        }
                    },
                    "additionalProperties": False,
                },
                "strict": True,
            },
        },
    )
    questions = json.loads(completion.choices[0].message.content)["questions"]
    prompt_tokens_used = completion.usage.prompt_tokens
    completion_tokens_used = completion.usage.completion_tokens
    results = []
    for question in questions[:N]:
        # One call per question: answer it with the documentation chunk
        # supplied as prior assistant context.
        completion = client.chat.completions.create(
            model=args.model,
            temperature=0.3,
            messages=[
                {
                    "role": "system",
                    "content": QUESTION_ANSWERING_SYSTEM_PROMPT,
                },
                {"role": "assistant", "content": chunk},
                {"role": "user", "content": question},
            ],
        )
        answer = completion.choices[0].message.content
        prompt_tokens_used += completion.usage.prompt_tokens
        completion_tokens_used += completion.usage.completion_tokens
        results.append({"question": question, "answer": answer, "chunk": chunk})
    return results, prompt_tokens_used, completion_tokens_used
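
An illustrative driver loop (not part of the original listing) would then apply process_chunk() to every chunk and accumulate the generated rows along with token usage:

rows = []
prompt_tokens = completion_tokens = 0
for chunk in api_reference_chunks:
    chunk_rows, p_used, c_used = process_chunk(chunk, client, args)
    rows.extend(chunk_rows)
    prompt_tokens += p_used
    completion_tokens += c_used
print(f"Generated {len(rows)} Q&A pairs "
      f"({prompt_tokens} prompt / {completion_tokens} completion tokens)")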

3. Data Collation

  • Collation Function: A custom collation function is defined to handle the batching of data during training. This function ensures that all sequences in a batch have a consistent length and can be processed efficiently by the model; a sketch follows.
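
A minimal sketch of such a padding collator, assuming tokenized examples that carry "input_ids" (the function name and padding strategy are illustrative):

import torch

def collate_fn(batch, pad_token_id):
    max_len = max(len(ex["input_ids"]) for ex in batch)
    input_ids, attention_mask, labels = [], [], []
    for ex in batch:
        n_pad = max_len - len(ex["input_ids"])
        input_ids.append(ex["input_ids"] + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ex["input_ids"]) + [0] * n_pad)
        # -100 marks padded positions as ignored by the loss.
        labels.append(ex["input_ids"] + [-100] * n_pad)
    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
        "labels": torch.tensor(labels),
    }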
