Work in Progress

Nemotron Competition

A fine-tuning and prompt-engineering pipeline for the NVIDIA Nemotron competition. Benchmarks Claude and vLLM backends across six problem categories, with a Streamlit analytics dashboard tracking experiment runs and per-category accuracy.

LLM · fine-tuning · benchmarking · python
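The harness reduces to scoring each backend per category. A minimal sketch, assuming a backend is a callable from prompt to answer and problems are (category, prompt, expected) triples:

```python
from collections import defaultdict

def benchmark(backend, problems):
    """Score a backend on labelled problems, grouped by category.

    backend:  callable, prompt -> answer string (assumed interface)
    problems: iterable of (category, prompt, expected_answer) triples
    Returns {category: accuracy}.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, prompt, expected in problems:
        total[category] += 1
        # Exact-match scoring after whitespace stripping; real graders
        # per category would slot in here.
        if backend(prompt).strip() == expected:
            correct[category] += 1
    return {c: correct[c] / total[c] for c in total}
```

The same loop runs unchanged against any backend that satisfies the prompt-to-answer interface, which is what makes the Claude/vLLM comparison cheap.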

Token Efficiency in Self-Improving Agents

Research into reducing token consumption in agents that iteratively rewrite or extend their own prompts and code. Exploring compression strategies, prompt distillation, and selective context retention to make recursive self-improvement economically viable.

agents · LLM · self-improvement · research
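One of the strategies under exploration, selective context retention, can be sketched as a token-budgeted greedy filter. The token counter and relevance score below are assumptions supplied by the caller, not part of the project's actual interface:

```python
def retain_context(chunks, budget, count_tokens, score):
    """Keep the highest-scoring chunks whose combined size fits the budget.

    chunks:       list of strings (prior turns, code, notes)
    budget:       max tokens to carry into the next iteration
    count_tokens: callable str -> int (tokenizer-dependent; assumed)
    score:        callable str -> float (relevance heuristic; assumed)
    """
    kept, used = [], 0
    # Greedily admit chunks in descending relevance order.
    for chunk in sorted(chunks, key=score, reverse=True):
        cost = count_tokens(chunk)
        if used + cost <= budget:
            kept.append(chunk)
            used += cost
    # Restore original ordering so the retained context reads coherently.
    kept.sort(key=chunks.index)
    return kept
```

Re-sorting by original position keeps the surviving chunks in document order, which matters more for coherence than ranking order.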

Robustness in Long-Running Agents (20+ Hours)

Investigating failure modes that emerge only over extended autonomous runs: context drift, compounding errors, and resource leaks, together with strategies for recovering from them. Aims to produce a reliability benchmark suite for agents operating continuously beyond 20 hours.

agents · reliability · benchmarking · research
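One recovery strategy under evaluation is checkpoint-and-resume, so a crashed run restarts from its last persisted state rather than compounding errors from scratch. A minimal sketch, with the step/state shapes as assumptions:

```python
import json
import os
import tempfile

def run_with_checkpoints(steps, state, checkpoint_path, max_retries=3):
    """Run a sequence of step functions, persisting state after each one
    so an interrupted run can resume where it stopped.
    Illustrative pattern only: steps are assumed to be state -> state,
    and state is assumed JSON-serialisable.
    """
    # Resume from an existing checkpoint if one is present.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            saved = json.load(f)
        start, state = saved["step"], saved["state"]
    else:
        start = 0
    for i in range(start, len(steps)):
        retries = 0
        while True:
            try:
                state = steps[i](state)
                break
            except Exception:
                retries += 1
                if retries >= max_retries:
                    raise
        # Atomic write: dump to a temp file, then rename over the checkpoint,
        # so a crash mid-write never leaves a corrupt checkpoint behind.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(checkpoint_path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump({"step": i + 1, "state": state}, f)
        os.replace(tmp, checkpoint_path)
    return state
```

Calling it a second time with the same checkpoint path skips the completed steps entirely, which is the property a 20-hour run needs.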

Identifying and Sourcing Alternative Datasets

Systematic methods for discovering non-standard training and evaluation data — scraping pipelines, synthetic generation, domain-specific crawls, and quality filtering. Focus on filling gaps where canonical benchmarks are saturated or misaligned with real-world tasks.

datasets · data-engineering · research
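Quality filtering can start with cheap heuristics before any model-based scoring: length floors, symbol-ratio caps, and exact deduplication. A sketch with illustrative, untuned thresholds:

```python
import hashlib
import re

def filter_corpus(docs, min_words=20, max_symbol_ratio=0.3):
    """Cheap quality filters for scraped text.
    Thresholds here are illustrative assumptions, not tuned values.
    """
    seen = set()
    kept = []
    for doc in docs:
        text = doc.strip()
        words = text.split()
        if len(words) < min_words:
            continue  # too short to carry useful training signal
        symbols = len(re.findall(r"[^\w\s]", text))
        if symbols / max(len(text), 1) > max_symbol_ratio:
            continue  # likely markup debris or scrape noise
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:
            continue  # exact duplicate
        seen.add(digest)
        kept.append(text)
    return kept
```

Hashing full documents catches only exact duplicates; near-duplicate detection (e.g. MinHash) would be the natural next stage.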

Large Data Handling Within Context Limits

Techniques for processing datasets that far exceed a model's context window — hierarchical summarisation, retrieval-augmented chunking, streaming state machines, and lossy compression with bounded information loss. Targeting practical patterns for production agent pipelines.

LLM · RAG · data-engineering · research
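Hierarchical summarisation reduces an arbitrarily large set of chunks to one bounded summary by summarising small groups and recursing on the group summaries. A minimal sketch, where `summarise` stands in for an LLM call:

```python
def hierarchical_summarise(chunks, summarise, fan_in=4):
    """Fold many chunks into a single summary, fan_in at a time.

    chunks:    list of strings, collectively far larger than one context
    summarise: callable, list[str] -> str (assumed; e.g. an LLM call)
    fan_in:    how many chunks/summaries each call may combine
    """
    level = list(chunks)
    while len(level) > 1:
        # Summarise each group of fan_in items, producing the next level.
        level = [
            summarise(level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
    return level[0]
```

The fan-in bounds how much text any single summarisation call must fit in its context window, regardless of total corpus size.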