
Copyright © 2025 Lumo. All Rights Reserved. This software is open-source and licensed under the GNU Affero General Public License (AGPL) v3.0. You are free to redistribute and modify it under the terms of this license.


About Lumo-8B


Last updated 4 months ago

The Lumo 8B Instruct dataset is the training foundation for the Lumo large language model, designed to give the model a deep understanding of the Solana ecosystem. This curated dataset underpins Lumo's ability to answer questions, generate code, and assist users within the Solana domain.

Data Sources

The dataset draws from a diverse range of authoritative sources within the Solana ecosystem:

  • Official Solana Documentation:

    • Comprehensive documentation covering Solana's core concepts, protocols, and development tools.

    • Includes sections on:

      • Fundamentals: Blockchain architecture, consensus mechanisms (Proof-of-History, Proof-of-Stake), tokenomics.

      • Development: Smart contract development (using languages like Rust, Solidity), interacting with the Solana RPC, using Solana developer tools.

      • Ecosystem: DeFi protocols, NFTs, dApps, governance, and the broader Solana ecosystem.

      • Terminology: Definitions of key terms and concepts within the Solana ecosystem.

  • Project-Specific Documentation:

    • Jito: Documentation for Jito's MEV infrastructure and liquid staking on Solana.

    • Raydium: Documentation for the Raydium decentralized exchange (DEX) on Solana.

    • Jupiter: Documentation for the Jupiter decentralized exchange aggregator.

    • Helius: Documentation for the Helius Solana developer tools.

    • QuickNode: Documentation for the QuickNode Solana infrastructure platform.

    • ChainStack: Documentation for the ChainStack Solana infrastructure platform.

    • Meteora: Documentation for the Meteora liquidity protocol on Solana.

    • PumpPortal: Documentation for the PumpPortal API for Pump.fun trading and data.

    • DexScreener: Documentation for the DexScreener decentralized exchange explorer.

    • MagicEden: Documentation for the MagicEden NFT marketplace.

Data Extraction and Processing

  • Data Extraction:

    • Data was meticulously extracted from the designated sources using a combination of manual curation and automated techniques.

    • Note: The dataset was compiled with a strong emphasis on data integrity and accuracy. To avoid potential biases or inaccuracies, no automated scraping techniques were employed.

  • Data Cleaning:

    • Removal of HTML/Markdown: HTML tags, Markdown formatting, and other irrelevant formatting elements were removed to ensure clean and consistent text.

    • Deduplication: Duplicate entries were identified and removed to prevent redundancy and ensure data quality.

    • Error Correction: Minor spelling and grammatical errors were corrected to improve data consistency.

    • Standardization: Terminology was standardized across different sources to maintain consistency and improve data coherence.

  • Text Chunking:

    • The extracted text was divided into smaller, manageable chunks of 2000 characters with an overlap of 200 characters. This approach ensures that each chunk contains sufficient information for generating meaningful questions and answers while maintaining context.

  • Question-Answer Pair Generation:

    • For each chunk, three high-quality question-answer pairs were generated using a powerful language model (e.g., GPT-4).

    • The model was instructed to:

      • Generate questions that are relevant to the provided text chunk.

      • Ensure that the questions are answerable based solely on the information within the chunk.

      • Generate concise and informative answers that accurately reflect the content of the chunk.
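
The cleaning and chunking steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline actually used to build the dataset: the function names (`strip_markup`, `deduplicate`, `chunk_text`) are hypothetical, the markup removal only handles a few simple cases, and deduplication is exact-match only. The chunk size of 2000 characters and the overlap of 200 characters are the values stated above.

```python
import html
import re

CHUNK_SIZE = 2000  # characters per chunk, as stated in the text
OVERLAP = 200      # characters shared by consecutive chunks, as stated in the text


def strip_markup(text: str) -> str:
    """Remove HTML tags and a few Markdown markers (hypothetical, illustrative only)."""
    text = re.sub(r"<[^>]+>", " ", text)                        # drop HTML tags
    text = html.unescape(text)                                  # decode entities such as &amp;
    text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)  # drop heading markers
    text = re.sub(r"[*`]", "", text)                            # drop emphasis and code markers
    return re.sub(r"\s+", " ", text).strip()                    # collapse whitespace


def deduplicate(docs: list[str]) -> list[str]:
    """Drop exact duplicates while keeping the first-seen order."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        if doc not in seen:
            seen.add(doc)
            unique.append(doc)
    return unique


def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = OVERLAP) -> list[str]:
    """Split text into chunks of at most `size` characters.

    Each chunk after the first starts `overlap` characters before the end
    of the previous chunk, so neighbouring chunks share that much context.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks
```

With these defaults each new chunk starts 1800 characters after the previous one, so a 5000-character document yields three chunks. Each of those chunks would then be passed to the language model with the instructions listed above to produce its three question-answer pairs.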

Dataset Structure

The Lumo 8B Instruct dataset is structured as a JSONL file, where each line represents a single question-answer pair. Each line contains the following fields:

  • question: The question generated from the given text chunk.

  • answer: The corresponding answer to the generated question.

  • chunk: The original text chunk from which the question-answer pair was derived.
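
As a rough sketch of how a file in this layout could be written and read back, the snippet below uses only the Python standard library. The helper names (`write_jsonl`, `read_jsonl`) are hypothetical and are not part of any Lumo tooling; only the three field names come from the description above.

```python
import json

# Field names taken from the dataset description above.
REQUIRED_FIELDS = ("question", "answer", "chunk")


def write_jsonl(records, path):
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file and check that every record has the expected fields."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [key for key in REQUIRED_FIELDS if key not in record]
            if missing:
                raise ValueError(f"line {line_number}: missing fields {missing}")
            records.append(record)
    return records
```

Because every line is an independent JSON object, the file can be streamed or filtered line by line without loading the whole dataset into memory.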