Twenty Questions, Interpreted
2026 · A mechanistic-interpretability study of whether an LLM truly commits to a secret in 20 Questions, using linear probes, activation patching, steering, and sparse autoencoders on Gemma-3.
Institute of Science Tokyo · AI / interpretability
Swiss-Rwandan master's student in AI at the Institute of Science Tokyo. I work on mechanistic interpretability of large language models from a linguistics-oriented perspective: grammatical generalization, syntactic structure, lexical frequency effects, and how linguistic knowledge is represented inside neural networks.
Lexical frequency and grammatical generalization in LLMs
Under review (ARR 2026 / EMNLP 2026)
Large Language Models Are Robust to Low-Frequency Words in Grammatical Evaluation
Association for Natural Language Processing (言語処理学会, NLP) 2026, poster
Bachelor's thesis (UZH). Tests whether gains in multi-agent LLM reasoning come from genuinely separate model instances or just role-based perspective diversity. It compares two DeepSeek-V3 instances against a single model alternating roles, across Debate / Cooperative / Teacher-Student strategies on AIME, GPQA Diamond, and LiveBench Reasoning. Model separation helped most in critique-oriented dialogue; cooperative settings didn't require true independence.
Classifies overlapping speech in spontaneous multi-party conversation (AMI Meeting Corpus) as cooperative (e.g. backchannels) or competitive (e.g. interruptions). Combines Wav2Vec audio embeddings with lexical sentence embeddings from noisy ASR, trained via a weakly-supervised labeling pipeline (heuristics + LLM-assisted annotation). Adding lexical features improved performance, though competitive overlaps stayed hard.