LLM learning/inference toolkit (code-only public snapshot)

Python 100%

Find a file

kannan b43fb4f1f3 Initial public snapshot (code only)		2026-06-25 20:40:55 +04:00
config	Initial public snapshot (code only)	2026-06-25 20:40:55 +04:00
data/documents	Initial public snapshot (code only)	2026-06-25 20:40:55 +04:00
notes	Initial public snapshot (code only)	2026-06-25 20:40:55 +04:00
src	Initial public snapshot (code only)	2026-06-25 20:40:55 +04:00
tests	Initial public snapshot (code only)	2026-06-25 20:40:55 +04:00
.gitignore	Initial public snapshot (code only)	2026-06-25 20:40:55 +04:00
Pipfile	Initial public snapshot (code only)	2026-06-25 20:40:55 +04:00
Pipfile.lock	Initial public snapshot (code only)	2026-06-25 20:40:55 +04:00
pyproject.toml	Initial public snapshot (code only)	2026-06-25 20:40:55 +04:00
README.md	Initial public snapshot (code only)	2026-06-25 20:40:55 +04:00
requirements.txt	Initial public snapshot (code only)	2026-06-25 20:40:55 +04:00

README.md

lollama

Local LLM Learning Environment - From fundamentals to fine-tuning

A comprehensive toolkit for learning and working with Large Language Models locally on CPU. Designed for systems with 32GB+ RAM without GPU.

Features

Interactive CLI - Chat with local LLMs via command line
RAG Pipeline - Document Q&A with retrieval-augmented generation
Benchmarking - Compare model performance and quantization levels
Fine-tuning Prep - Dataset preparation for LoRA/QLoRA training
CPU Optimized - Quantized models for efficient CPU inference

Quick Start

1. Install Ollama

# Linux/macOS
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama server
ollama serve

2. Download a Model

# Recommended for 32GB RAM (Q4 quantization)
ollama pull mistral:7b-instruct-q4_K_M

# Alternative: Smaller model for testing
ollama pull llama3.2:3b

3. Install lollama

# Clone and install
git clone https://github.com/krisk248/lollama.git
cd lollama

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/macOS
# or: venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

4. Start Chatting

# Interactive chat
python -m src.inference.cli chat

# Or with specific model
python -m src.inference.cli chat -m llama3.2:3b

Project Structure

lollama/
├── src/
│   ├── inference/         # LLM interaction
│   │   ├── cli.py         # Interactive chat CLI
│   │   └── api_explorer.py # Ollama API examples
│   ├── rag/               # RAG pipeline
│   │   ├── document_loader.py  # Load PDFs, TXT, MD
│   │   ├── vector_store.py     # ChromaDB operations
│   │   └── pipeline.py         # Complete RAG chain
│   ├── benchmarks/        # Performance testing
│   │   ├── quantization_benchmark.py
│   │   └── comprehensive_benchmark.py
│   ├── finetuning/        # Dataset preparation
│   │   └── dataset_prep.py
│   └── utils/             # Helpers
│       ├── config.py
│       └── helpers.py
├── config/
│   └── settings.yaml      # Configuration
├── data/
│   └── documents/         # Your documents for RAG
├── tests/
└── docs/

Usage Examples

Interactive Chat

# Basic chat
python -m src.inference.cli chat

# Custom model and temperature
python -m src.inference.cli chat -m mistral:7b-instruct-q4_K_M -t 0.5

# With system prompt
python -m src.inference.cli chat -s "You are a helpful coding assistant."

RAG Document Q&A

from src.rag.pipeline import RAGPipeline

# Create pipeline from documents
pipeline = RAGPipeline()
pipeline.create_from_directory("./data/documents")

# Ask questions
answer = pipeline.query("What is the main topic discussed?")
print(answer)

# With source attribution
response = pipeline.query_with_sources("Explain the key concepts")
print(f"Answer: {response.answer}")
print(f"Sources: {response.sources}")

Benchmarking

from src.benchmarks.comprehensive_benchmark import run_comprehensive_benchmark

# Compare models
suite = run_comprehensive_benchmark(
    models=["mistral:7b-instruct-q4_K_M", "llama3.2:3b"],
    runs_per_prompt=3,
    output_file="benchmark_results.json"
)

# Print report
from src.benchmarks.comprehensive_benchmark import print_benchmark_report
print_benchmark_report(suite)

Dataset Preparation (for Fine-tuning)

from src.finetuning.dataset_prep import (
    TrainingExample,
    create_instruction_dataset,
    save_dataset,
)

# Create training examples
examples = [
    TrainingExample(
        instruction="What is machine learning?",
        input="",
        output="Machine learning is a branch of AI..."
    ),
    # Add more examples...
]

# Create dataset in Alpaca format
dataset = create_instruction_dataset(examples)
save_dataset(dataset, "./data/training")

Recommended Models

Use Case	Model	RAM Required
General (Recommended)	`mistral:7b-instruct-q4_K_M`	~5 GB
Better Quality	`mistral:7b-instruct-q5_K_M`	~6 GB
Limited RAM	`llama3.2:3b`	~2 GB
Coding	`deepseek-coder:6.7b-instruct-q4_K_M`	~4 GB
RAG	`qwen2.5:7b-instruct-q4_K_M`	~5 GB

Quantization Guide

Level	Bits	Memory	Quality	Use Case
Q4_K_M	~4.5	Lowest	Good	Recommended for CPU
Q5_K_M	~5.5	Medium	Better	Balance quality/speed
Q6_K	~6	Higher	Great	Near-original
Q8_0	8	High	Best	Max quality

Configuration

Edit config/settings.yaml:

model:
  name: mistral:7b-instruct-q4_K_M
  temperature: 0.7
  max_tokens: 2048

rag:
  embedding_model: nomic-embed-text
  chunk_size: 500
  retrieval_k: 4

System Requirements

RAM: 32GB+ recommended (16GB minimum with smaller models)
CPU: Modern multi-core processor
Storage: 10GB+ for models
OS: Linux, macOS, or Windows

Learning Path

This project supports a structured learning path:

Week 1-2: LLM Fundamentals
- Understanding transformers and attention
- Tokenization concepts
- Explore src/inference/api_explorer.py
Week 2-3: Local Inference
- Ollama setup and model management
- Quantization comparison
- Run benchmarks in src/benchmarks/
Week 3-4: RAG Pipeline
- Document loading and chunking
- Vector stores with ChromaDB
- Build Q&A systems with src/rag/
Week 5-6: Fine-tuning Concepts
- Dataset preparation
- LoRA/QLoRA understanding
- Prepare data with src/finetuning/

Running Tests

# Install test dependencies
pip install pytest

# Run tests
pytest tests/ -v

# Run with coverage
pytest tests/ -v --cov=src

Troubleshooting

Ollama not connecting

# Check if Ollama is running
curl http://localhost:11434/api/tags

# Start Ollama
ollama serve

Out of memory

Try a smaller model: ollama pull llama3.2:3b
Use Q4 quantization: ollama pull mistral:7b-instruct-q4_K_M
Close other applications

Slow generation

Ensure using quantized model (Q4_K_M recommended)
Check CPU usage - model should use all cores
Consider smaller model for faster responses

License

MIT License

Contributing

Contributions welcome! Please read the contributing guidelines first.