Sumit Yadav

Interpretability & AI-safety research

Computer Engineering · Pulchowk Campus, Tribhuvan University, Nepal

I'm a Computer Engineering graduate from Pulchowk Campus, Tribhuvan University. Most of my time goes into trying to understand how language models work on the inside — why a safety-aligned model sometimes refuses perfectly harmless questions, and how the geometry of a model's internal representations relates to what it has actually learned. A lot of it is simply curiosity, following a question until it starts to make sense.

I care about making these tools useful for languages that are often left out. My family speaks Maithili — around 50 million speakers, yet very little language technology exists for it — so building a small model for it felt as personal as it was technical.

Alongside research I work on AI safety and agentic systems at Astha.ai, and I hope to pursue a PhD in interpretability and alignment. I try to learn in the open: I write at Tatva, share short explainers on YouTube, and keep a camera and a fondness for Mithila's stories and festivals close by. If any of this overlaps with your own work, I'd be glad to hear from you.

Research Interests

Selected Publications

See my Google Scholar for the full list.

1. SafeConstellations: Mitigating Over-Refusals in LLMs Through Task-Aware Representation Steering. ACL 2026 (Main Conference). An inference-time, task-aware trajectory-shifting method that cuts over-refusals by up to 73% with minimal utility loss — no retraining required. [paper]
2. On the Relationship Between Representation Geometry and Generalization in Deep Neural Networks. Preprint, 2026 (sole author). Shows that effective dimension — an unsupervised geometric metric — predicts generalization across vision and language models (partial r = 0.75 over 52 classifiers). [paper]
3. Geometric Phases of Mechanism Formation in Neural Networks. GLOW 2026: Workshop on Generalizing from Limited Resources in the Open World @ IJCAI 2026 (Poster). Using linear probes and centered kernel alignment (CKA) across dense training checkpoints, this work finds that classification mechanisms form output-layer-first and within the first ~5% of training: output layers reach >70% probe accuracy by epoch 5 while input layers stay below 50% (Cohen's d = 3.68). The same deep-first pattern holds in the first ~200M tokens of from-scratch LLM pretraining (GPT-2 Small, SmolLM2-135M) and reproduces on public Pythia / OLMo-2 checkpoints. Gradient magnitude does not explain it: input layers receive up to 6.9× more gradient yet learn less. The result is a temporal map of when and where mechanisms emerge (CIFAR-10/100, decoder-only LLMs). arXiv & code coming soon.
4. Can maiBERT Speak for Maithili? LoResLM @ ACL 2026. The first monolingual BERT for Maithili (~50M speakers); 87% accuracy on news classification, outperforming MuRIL and NepBERTa. [paper] [model]
5. Revolutionizing Currency Security: A YOLOv8-Based Approach for Detecting Counterfeit Nepali Banknotes. J. Bus. Econ. Stud., 2024. [paper]
6. Machine Learning Analysis of Tirhuta Lipi. 2023. 0.97 accuracy in Tirhuta script recognition for OCR and translation of low-resource scripts. [paper]
7. Support Vectors Are a Better Way of Text Classification for Imbalanced Data. 2023. A robust SVC method for 100+ class text classification under severe imbalance. [paper]

Preprints & Work in Progress

News

Experience

Selected Projects

Honors & Awards

Active Kaggle competitor. Recent competitions: 🥉 BirdCLEF+ 2026 (bronze medal, rank 354 / 4084), ARC-AGI / NeuroGolf 2026 (minimal-cost ONNX networks), and Scientific Image Forgery Detection (SAM-based).

Documents & Links

Notes & Lab Reports

Contact

Academic: 076bct088.sumit@pcampus.edu.np
Personal: rockerritesh4@gmail.com · +977-9819856148
Kathmandu, Nepal