Sumit Yadav

AI Safety & Interpretability Researcher

Computer Engineering · Pulchowk Campus, Tribhuvan University, Nepal

I am a Computer Engineering student at Pulchowk Campus, Tribhuvan University, working on the interpretability and safety of language models. My research examines how safety-aligned LLMs fail silently (through over-refusals, geometric misrepresentation, and surface-level triggers) and how the internal structure of a model's representations can be understood and steered to make the model safer and more reliable.

A recurring theme in my work is geometry: representations trace structured trajectories inside a model, and that structure turns out to predict both safety behavior and generalization. I am equally invested in extending these tools to low-resource and multilingual settings; as part of this work I built maiBERT, the first monolingual BERT model for Maithili (~50M speakers).

I currently lead AI-safety and agentic-systems research at Astha.ai. I am seeking a PhD position in mechanistic interpretability and AI alignment starting in Fall 2027, and I am actively looking for the right group to join.

Research Interests

Selected Publications

See my Google Scholar for the full list.

1. SafeConstellations: Mitigating Over-Refusals in LLMs Through Task-Aware Representation Steering. ACL 2026 (Main Conference). An inference-time, task-aware trajectory-shifting method that reduces over-refusals by up to 73% with minimal utility loss and requires no retraining. [paper]
2. On the Relationship Between Representation Geometry and Generalization in Deep Neural Networks. Preprint, 2026 (sole author). Shows that effective dimension, an unsupervised geometric metric, predicts generalization across vision and language models (partial r = 0.75 over 52 classifiers). [paper]
3. Geometric Phases of Mechanism Formation in Neural Networks. Working paper, 2026. Traces how internal mechanisms form across training using linear probes, CKA, and targeted ablations (CIFAR-10/100).
4. Can maiBERT Speak for Maithili? LoResLM @ ACL 2026. The first monolingual BERT for Maithili (~50M speakers); 87% accuracy on news classification, outperforming MuRIL and NepBERTa. [paper] [model]
5. Revolutionizing Currency Security: A YOLOv8-Based Approach for Detecting Counterfeit Nepali Banknotes. J. Bus. Econ. Stud., 2024. [paper]
6. Machine Learning Analysis of Tirhuta Lipi. 2023. Achieves 97% accuracy on Tirhuta script recognition, supporting OCR and translation for low-resource scripts. [paper]
7. Support Vectors Are a Better Way of Text Classification for Imbalanced Data. 2023. A robust SVC method for 100+ class text classification under severe imbalance. [paper]
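Entries 2 and 3 rely on representation-geometry measures computed from a model's activations. The exact definitions used in those papers are not given on this page, so the sketch below assumes two common ones: effective dimension as the participation ratio of the covariance eigenvalues, (Σλ)² / (Σλ²), and linear centered kernel alignment (CKA) for comparing two layers or checkpoints. Both functions take activation matrices of shape (n_examples, n_features); the function names are illustrative, not taken from the papers.

```python
import numpy as np


def participation_ratio(features: np.ndarray) -> float:
    """Effective dimension of an (n_examples, n_features) activation matrix.

    Assumed definition: (sum of covariance eigenvalues)^2 divided by the
    sum of squared eigenvalues. Other definitions of effective dimension exist.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (features.shape[0] - 1)
    eigvals = np.linalg.eigvalsh(cov)          # symmetric matrix -> real eigenvalues
    eigvals = np.clip(eigvals, 0.0, None)      # drop tiny negative round-off
    return float(eigvals.sum() ** 2 / np.square(eigvals).sum())


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two representations of the same n examples.

    After column-centering: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F).
    x and y may have different feature dimensions.
    """
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))
```

The participation ratio ranges from 1 (all variance along one axis) up to the number of features (equal variance in every direction), so a lower value indicates a more compressed representation. Linear CKA is unchanged by rotations and uniform rescaling of either representation, which makes it a convenient way to compare layers across training checkpoints.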
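Entry 1 describes inference-time representation steering. The task-aware trajectory-shifting method from the SafeConstellations paper is not reproduced here. As a minimal sketch of the general idea only, the code below uses the common difference-of-means approach: estimate a direction from hidden states of over-refused prompts toward those of answered prompts, then add it to hidden states at inference time, but only for tasks that have a stored vector. The function names, the per-task dictionary, the task labels, and the scale `alpha` are all hypothetical, and the activations are synthetic.

```python
import numpy as np


def steering_vector(refused: np.ndarray, answered: np.ndarray) -> np.ndarray:
    """Unit difference-of-means direction from over-refused toward answered.

    Both inputs are (n_examples, d_model) activations taken from one layer.
    """
    direction = answered.mean(axis=0) - refused.mean(axis=0)
    return direction / np.linalg.norm(direction)


def task_aware_steer(
    hidden: np.ndarray,
    task: str,
    vectors: dict[str, np.ndarray],
    alpha: float,
) -> np.ndarray:
    """Shift hidden states by alpha * vectors[task]; weights are never modified.

    Hypothetical gating: tasks with no stored vector pass through unchanged,
    so steering only affects the tasks it was fitted for.
    """
    direction = vectors.get(task)
    if direction is None:
        return hidden
    return hidden + alpha * direction  # broadcasts over the batch dimension


# Toy usage with synthetic activations (d_model = 8).
rng = np.random.default_rng(0)
refused_acts = rng.normal(size=(50, 8))
answered_acts = refused_acts + 1.0  # synthetic: shifted by +1 in every dimension
vectors = {"code_generation": steering_vector(refused_acts, answered_acts)}
hidden = rng.normal(size=(4, 8))
steered = task_aware_steer(hidden, "code_generation", vectors, alpha=2.0)
unchanged = task_aware_steer(hidden, "translation", vectors, alpha=2.0)
```

In a real model the shift would be applied to a chosen layer's activations during the forward pass; the sketch keeps everything as plain arrays so the arithmetic is easy to follow.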

Preprints & Work in Progress

News

Experience

Selected Projects

Honors & Awards

Active Kaggle competitor. Recent competitions: ARC-AGI / NeuroGolf 2026 (minimal-cost ONNX networks), BirdCLEF 2026, and Scientific Image Forgery Detection (SAM-based).

Documents & Links

Notes & Lab Reports

Contact

Academic: 076bct088.sumit@pcampus.edu.np
Personal: rockerritesh4@gmail.com · +977-9819856148
Kathmandu, Nepal