Sumit Yadav

aka rockerritesh

AI Safety & Mechanistic Interpretability Researcher

Computer Engineering graduate · Pulchowk Campus, Tribhuvan University, Nepal

Hello! I'm an AI researcher working on the interpretability and safety of language models, and on bringing language technology to Maithili and other under-served languages. I work on AI safety and agentic systems at Astha.ai, write here, and post short explainers on YouTube.

2019

Began B.E. Computer Engineering, Pulchowk Campus; started learning AI/ML

2020

First project — IRB image-recognition robotic arm; Deep Learning Specialization (DeepLearning.AI)

2021

GANs Specialization & DeepLearning.AI GAN mentor; LogPoint CTF win

2022

DELTA 3.0 & IT-Meet vision wins

2023

ML for Tirhuta Lipi; GritFeat AI Hackathon

2024

AI Engineer at AMNIL; banknote-security paper

2025

AI research at Astha.ai (safety & agents)

2026

ACL Main, GLOW @ IJCAI, maiBERT; CE graduate

Publications

See my Google Scholar for the full list.


SafeConstellations: Mitigating Over-Refusals in LLMs Through Task-Aware Representation Steering

Utsav Maskey, Sumit Yadav, Mark Dras, Usman Naseem

TL;DR: Steering only over-refusal-prone tasks in representation space cuts LLM over-refusals by up to 73% — no retraining, minimal utility loss.

ACL 2026 (Main) | Paper | arXiv | DOI | Topic: Interpretability & Safety

On the Relationship Between Representation Geometry and Generalization in Deep Neural Networks

Sumit Yadav (sole author)

TL;DR: A label-free geometric metric — effective dimension — predicts, and causally drives, generalization across vision and language models.

Preprint, 2026 | Paper | DOI | Topic: Interpretability & Safety

Geometric Phases of Mechanism Formation in Neural Networks

Sumit Yadav

TL;DR: Neural-network classification mechanisms form output-layer-first, within the first ~5% of training — from small models up to LLM pretraining.

GLOW 2026 | Preprint | DOI | Topic: Interpretability & Safety

MaiBERT: A Pre-training Corpus and Language Model for Low-Resourced Maithili Language

Sumit Yadav, Raju Kumar Yadav, Utsav Maskey, Gautam Siddharth Kashyap, Ganesh Gautam, Usman Naseem

TL;DR: The first BERT for Maithili (~50M speakers), reaching 87% news-classification accuracy and beating multilingual baselines like MuRIL and NepBERTa.

LoResLM @ ACL 2026 | Paper | arXiv | DOI | Model | Topic: Low-resource NLP

Revolutionizing Currency Security: A YOLOv8-Based Approach for Detecting Counterfeit Nepali Banknotes

Sumit Yadav et al.

TL;DR: A YOLOv8 detector for counterfeit Nepali banknotes, reaching 0.986 true-positive recall on the note's back face.

J. Bus. Econ. Stud., 2024 | Paper | Topic: Vision & Other

Evaluating Auto-Encoding Transformer Language Models for Maithili Text Classification

Sumit Yadav, Raju Kumar Yadav

TL;DR: A Maithili masked language model built via transfer learning — the precursor work to maiBERT.

B.E. Thesis (Electronics & Computer Engineering), Tribhuvan University, 2024 | Paper | Topic: Low-resource NLP

Machine Learning Analysis of Tirhuta Lipi

Sumit Yadav, Raju Kumar Yadav

TL;DR: Character recognition for the endangered Tirhuta script at 97% accuracy, enabling OCR for Maithili.

Technical Report, Tribhuvan University, 2023 | Paper | DOI | Topic: Low-resource NLP

Support Vectors Are a Better Way of Text Classification for Imbalanced Data

Sumit Yadav et al.

TL;DR: For 100+ imbalanced text classes, a TF-IDF support-vector pipeline beats neural baselines and retrains incrementally.

2023 | Paper | Topic: Text Classification

Writing

Notes and essays on AI, math, and a few things in between.

Where Does a Language Model Think? Finding and Removing the 'Workspace' Layers of Llama-3.1-8B
A hands-on interpretability walkthrough. We watch concepts form layer-by-layer inside Llama-3.1-8B-Instruct, measure precisely which layers carry meaning, then delete layers, one at a time and in whole bands, and…
AI & ML · 2026-07-16

World Models from Scratch: Two Toy Experiments (with an in-browser demo)
A world model has three pieces: an encoder, a state-space model, and a decoder. This post builds two of them from scratch, one for CartPole and one for a little visual counting + mass world, and at the end you can…
AI & ML · 2026-06-29

The Steerability Spectrum: When Reading a Feature Isn't Enough to Steer It
You can read an attribute off a vision model perfectly and still fail to steer it with a single vector. A number called $k^*$ says why, and prescribes the fix. Sumit Yadav · June 2026 · DINOv2 · CLIP · SigLIP ·…
Interpretability · 2026-06-23

SafeConstellations: Mitigating Over-Refusals in LLMs Through Task-Aware Representation Steering
LLMs refuse benign requests because the text looks dangerous. Our ACL 2026 paper finds that each NLP task traces a stable constellation in the residual stream, with refusal and non-refusal forming distinct sub-trajectories, so over-refusal can be fixed per task at inference time. Up to 73% less over-refusal, no fine-tuning, no utility loss, ~0.2s overhead.
Interpretability & Safety · 2026-06-22

Double Descent Explained: Why More Data Can Make a Model Worse
One of the strangest facts in deep learning: adding more data or more capacity can make a model worse before it gets better. Test error falls, then rises into a bump, then falls again: the famous double descent curve…
AI & ML · 2026-06-18

Why Does Attention Attend There? Tracing the Other Half of the Circuit
Attribution graphs gave us a way to see what information a model moves between tokens. But they froze one crucial thing: the attention pattern itself. They could tell you a head copied information from an earlier token…
AI & ML · 2026-06-17

How Does an LLM Know When to Start a New Line? The Geometry of Counting
Here's a deceptively simple puzzle. When a language model wraps text to a fixed width, it has to know how many characters still fit on the current line. But it has no eyes and no ruler, only a stream of tokens. So…
AI & ML · 2026-06-16

Towards Monosemanticity: Pulling Clean Features Out of a Messy Neuron
Inside a neural network, a single neuron fires for many unrelated things at once; it's polysemantic, which makes it almost impossible to read. So how do you pull out clean, single-meaning concepts? The 2023 paper…
AI & ML · 2026-06-15

All 74 posts →

Experience

AI Researcher — Safety & Agentic Systems, Astha.ai · 2025–present
Zero-Trust agent oversight, MCP-Scanner vulnerability platform, SAFE-MCP framework.
AI Engineer — RAG & Infrastructure, AMNIL Technologies · 2024–2025
Guardrails, LLM-as-a-Judge evaluation, self-hosted LLM serving with vLLM.
Data Team Lead, GradeUp Educations · 2022–2024
Learning agents/chatbots, an automated grade-evaluation system, and semantic-similarity matching.
GAN Specialization Mentor, DeepLearning.AI · 2021–present

Honors & Awards

Outstanding Reviewer, Mechanistic Interpretability Workshop (ICML 2026) (2026)
Recognized for high-quality, timely peer reviews.
Winner, GritFeat AI Hackathon (2023)
SWIFT — wearable LSTM fall-detection for the elderly (79.86%).
1st Runner-Up, Docsumo DataVerse — LOCUS 2023 (2023)
Team Deep Learners — NLP classification of imbalanced research-paper abstracts.
1st Runner-Up, Docsumo DataRush — LOCUS 2021 (2021)
Team Deep Learners — abstract classification into 158 classes (SVC + TF-IDF).
Best AI Project, DELTA 3.0 (2022)
Nepali Harvest — crop-disease prediction & harvest timing.
Winner, IT-Meet Image Challenge (2022)
Computer-vision classification of Nepali ballot-paper images.
Winner, LogPoint Capture The Flag (2021)
Binary exploitation & forensics.

Projects

maiBERT — First BERT for Maithili (demo)
Whisper-tiny Maithili (ASR) — Open Maithili speech-to-text — OpenAI Whisper-tiny fine-tuned on the IISc SYSPIN corpus (63.9% WER) (live demo)
MMS-TTS Maithili (TTS) — Maithili text-to-speech — VITS / Meta MMS-TTS fine-tuned on a SYSPIN male voice (live demo)
SAFE-MCP / SAF-MCP — Contributed detection techniques and mitigations to a community security framework for the Model Context Protocol (MCP) — an ATT&CK-style catalogue of agentic-system threats and defenses
AgentGuard — Zero-Trust protocol for AI agents: identity, policy, mTLS, audit (Python SDK + Go server)
spiffe-core · TraT — SPIFFE-based agent identity/attestation and Transaction Tokens for multi-agent workflows (TraT)
sumit-mcp-server — Federated memory MCP server (live on HF Spaces)
Vibe-Coder — An agent that builds Streamlit/FastAPI apps
IRB Robotics Arm — Open-source image-recognition robotic arm (UN SDG3)

Documents & Links

Curriculum Vitae · PDF
Résumé · PDF
Tensor Processing Units · PDF
YouTube: Audio Obsession · AI & Math
Photography · gallery
Drum Machine · interactive
Hugging Face · models
Deep Learning Specialization, DeepLearning.AI · certificate
GANs Specialization, DeepLearning.AI · certificate

Notes & Lab Reports

Singular Value Decomposition · PDF
Data Insights · slides
Artificial Neural Network · lab
Genetic Algorithm · lab
Constraint Programming · lab
Propositional Logic · lab
UNIX & Shell Programming · lab
CT/DT Functions · lab