Aarush Sinha

About


I am an undergraduate student at Vellore Institute of Technology - Chennai. My research centers on improving the reasoning, retrieval, and overall performance of small language models while keeping them efficient. I also work on mitigating hallucinations in models across various modalities.

I am fortunate to collaborate with wonderful mentors across various labs. At the Stanford STAIR Lab, I'm working with Rylan Schaeffer and Prof. Sanmi Koyejo on model collapse. I'm also exploring agent evaluation with Prof. Anand Rao at Carnegie Mellon University, mitigating hallucinations in Text2Video models with Prof. Amitava Das at the AI Institute, UofSC, and building efficient dense retrievers with Prof. Nirav Bhatt at IIT-Madras.

Research Interests

Information Retrieval · Natural Language Processing · AI Safety · Reasoning in Language Models

Recent Updates

June 2025
Serving as a reviewer for the ACL 2025 Student Research Workshop (SRW).
May 2025
Started a new collaboration with the Stanford STAIR Lab on model collapse.
May 2025
New preprint on synthetic dataset generation for dense retrieval released on arXiv.
April 2025
Started working on Agentic Systems and Evaluation with Prof. Anand Rao at CMU.
March 2025
Released GMLM: Bridging Graph Neural Networks and Language Models for Heterophilic Node Classification on arXiv.

Publications

Vipula Rawte, Sarthak Jain, Aarush Sinha, Garv Kaushik, Aman Bansal, Prathiksha Rumale Vishwanath, Samyak Rajesh Jain, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha, Amit P Sheth, Amitava Das
The 5th Workshop on Trustworthy NLP @ NAACL 2025
Abstract

Recent advances in Large Multimodal Models (LMMs) have expanded their capabilities to video understanding, with Text-to-Video (T2V) models excelling in generating videos from textual prompts. However, they still frequently produce hallucinated content, revealing AI-generated inconsistencies. We introduce ViBe (https://huggingface.co/datasets/ViBe-T2V-Bench/ViBe): a large-scale dataset of hallucinated videos from open-source T2V models. We identify five major hallucination types: Vanishing Subject, Omission Error, Numeric Variability, Subject Dysmorphia, and Visual Incongruity. Using ten T2V models, we generated and manually annotated 3,782 videos from 837 diverse MS COCO captions. Our proposed benchmark includes a dataset of hallucinated videos and a classification framework using video embeddings. ViBe serves as a critical resource for evaluating T2V reliability and advancing hallucination detection. We establish classification as a baseline, with the TimeSFormer + CNN ensemble achieving the best performance (0.345 accuracy, 0.342 F1 score). While the proposed baselines achieve only modest accuracy, this highlights the difficulty of automated hallucination detection and the need for improved methods. Our research aims to drive the development of more robust T2V models and to evaluate their outputs based on user preferences.
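For a concrete picture of the classification framework mentioned in the abstract, here is a minimal, hypothetical sketch of a classifier head over precomputed video embeddings. The embedding backbone (e.g. a TimeSformer) is assumed to run upstream, and the embedding width, layer sizes, and usage example are illustrative assumptions rather than the paper's released code.

import torch
import torch.nn as nn

# The five hallucination types named in the abstract.
HALLUCINATION_TYPES = [
    "Vanishing Subject",
    "Omission Error",
    "Numeric Variability",
    "Subject Dysmorphia",
    "Visual Incongruity",
]
VIDEO_EMB_DIM = 768  # assumed width of the upstream video embedding

class HallucinationClassifier(nn.Module):
    """Small MLP head that scores a clip embedding against the five classes."""
    def __init__(self, emb_dim: int = VIDEO_EMB_DIM, n_classes: int = len(HALLUCINATION_TYPES)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, n_classes),
        )

    def forward(self, video_emb: torch.Tensor) -> torch.Tensor:
        return self.net(video_emb)  # unnormalised class logits

# Usage with a fake batch of four precomputed clip embeddings.
clf = HallucinationClassifier()
logits = clf(torch.randn(4, VIDEO_EMB_DIM))
predictions = [HALLUCINATION_TYPES[i] for i in logits.argmax(dim=-1).tolist()]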

Dense Retrieval paper thumbnail
Aarush Sinha
arXiv preprint arXiv:2504.21015
Abstract

Training effective dense retrieval models often relies on hard negative (HN) examples mined from the document corpus via methods like BM25 or cross-encoders (CE), processes that can be computationally demanding and require full corpus access. This paper introduces a different approach, an end-to-end pipeline where a Large Language Model (LLM) first generates a query from a passage, and then generates a hard negative example using only that query text. This corpus-free negative generation contrasts with standard mining techniques. We evaluated this LLM Query → LLM HN approach against traditional LLM Query → BM25 HN and LLM Query → CE HN pipelines using E5-Base and GTE-Base models on several BEIR benchmark datasets. Our results show the proposed all-LLM pipeline achieves performance identical to both the BM25 and the computationally intensive CE baselines across nDCG@10, Precision@10, and Recall@100 metrics. This demonstrates that our corpus-free negative generation method matches the effectiveness of complex, corpus-dependent mining techniques, offering a potentially simpler and more efficient pathway for training high-performance retrievers without sacrificing results. We make the dataset, including the queries and the hard negatives for all three methods, publicly available.
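The corpus-free pipeline described in the abstract is simple enough to sketch in a few lines. The snippet below is an illustrative reconstruction, not the paper's code: generate() stands in for whatever LLM call is available (API or local model), and the prompt wording and triplet format are assumptions.

from typing import Callable

def build_triplet(passage: str, generate: Callable[[str], str]) -> dict:
    """Build a (query, positive, hard-negative) training triplet from a single passage."""
    # Step 1: the LLM writes a query that the passage answers.
    query = generate(
        "Write a short search query that the following passage answers:\n\n" + passage
    )
    # Step 2: the LLM writes a hard negative from the query text alone --
    # no corpus access, no BM25 mining, no cross-encoder scoring.
    hard_negative = generate(
        "Write a passage that looks relevant to this query but does not actually answer it:\n\nQuery: " + query
    )
    return {"query": query, "positive": passage, "negative": hard_negative}

Triplets of this form could then be fed into a standard contrastive training loop for an E5-Base or GTE-Base retriever.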

GMLM paper thumbnail
Aarush Sinha, OM Kumar CU
arXiv preprint arXiv:2503.05763
Abstract

Integrating structured graph data with rich textual information from nodes poses a significant challenge, particularly for heterophilic node classification. Current approaches often struggle with computational costs or effective fusion of disparate modalities. We propose Graph Masked Language Model (GMLM), a novel architecture efficiently combining Graph Neural Networks (GNNs) with Pre-trained Language Models (PLMs). GMLM introduces three key innovations: (i) a dynamic active node selection strategy for scalable PLM text processing; (ii) a GNN-specific contrastive pretraining stage using soft masking with a learnable graph [MASK] token for robust structural representations; and (iii) a dedicated fusion module integrating RGCN-based GNN embeddings with PLM (GTE-Small & DistilBERT) embeddings. Extensive experiments on heterophilic benchmarks (Cornell, Wisconsin, Texas) demonstrate GMLM's superiority. Notably, GMLM(DistilBERT) achieves significant performance gains, improving accuracy by over 4.7% on Cornell and over 2.0% on Texas compared to the previous best-performing baselines. This work underscores the benefits of targeted PLM engagement and modality-specific pretraining for improved, efficient learning on text-rich graphs.
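As a rough illustration of the fusion idea described in the abstract, the sketch below concatenates a node's GNN embedding with its PLM text embedding and projects the result to class logits. The RGCN and text encoders are assumed to run upstream; all dimensions and layer sizes are assumptions, not the paper's configuration.

import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Fuses a structural (GNN) embedding with a textual (PLM) embedding per node."""
    def __init__(self, gnn_dim: int = 256, plm_dim: int = 384, n_classes: int = 5):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(gnn_dim + plm_dim, 256),
            nn.GELU(),
            nn.LayerNorm(256),
            nn.Linear(256, n_classes),
        )

    def forward(self, gnn_emb: torch.Tensor, plm_emb: torch.Tensor) -> torch.Tensor:
        # Concatenate the structural and textual views of each node, then classify.
        return self.fuse(torch.cat([gnn_emb, plm_emb], dim=-1))

# Usage: 8 nodes with 256-d GNN embeddings and 384-d text embeddings (GTE-Small-sized, assumed).
head = FusionHead()
logits = head(torch.randn(8, 256), torch.randn(8, 384))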

Bond Yields paper thumbnail
Jaskaran Singh Walia, Aarush Sinha, Srinitish Srinivasan, Srihari Unnikrishnan
arXiv preprint arXiv:2502.17011
Abstract

Financial bond yield forecasting is challenging due to data scarcity, nonlinear macroeconomic dependencies, and evolving market conditions. In this paper, we propose a novel framework that leverages Causal Generative Adversarial Networks (CausalGANs) and Soft Actor-Critic (SAC) reinforcement learning (RL) to generate high-fidelity synthetic bond yield data for four major bond categories (AAA, BAA, US10Y, Junk). By incorporating 12 key macroeconomic variables, we ensure statistical fidelity by preserving essential market properties. To transform this market-dependent synthetic data into actionable insights, we employ a fine-tuned Large Language Model (LLM), Qwen2.5-7B, that generates trading signals (BUY/HOLD/SELL), risk assessments, and volatility projections. We use automated, human, and LLM evaluations, all of which demonstrate that our framework improves forecasting performance over existing methods, with statistical validation via predictive accuracy, MAE evaluation (0.103%), profit/loss evaluation (60% profit rate), LLM evaluation (3.37/5), and expert assessments scoring 4.67 out of 5. The reinforcement-learning-enhanced synthetic data generation achieves the lowest Mean Absolute Error of 0.103, demonstrating its effectiveness in replicating real-world bond market dynamics. We not only enhance data-driven trading strategies but also provide a scalable, high-fidelity synthetic financial data pipeline for risk & volatility management and investment decision-making. This work establishes a bridge between synthetic data generation, LLM-driven financial forecasting, and language model evaluation, contributing to AI-driven financial decision-making.
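To make the last step of this pipeline concrete, here is a hypothetical sketch of turning a yield series into a BUY/HOLD/SELL signal by prompting a language model. The prompt wording, the generate() stand-in, and the fallback rule are assumptions for illustration; the CausalGAN/SAC data generation and the Qwen2.5-7B fine-tuning are not shown.

from typing import Callable, Sequence

def yield_to_signal(category: str, yields: Sequence[float], generate: Callable[[str], str]) -> str:
    """Ask an LLM for a BUY/HOLD/SELL call on a recent window of bond yields."""
    recent = ", ".join(f"{y:.2f}" for y in yields[-12:])  # last 12 observations
    prompt = (
        f"Recent {category} bond yields (%): {recent}\n"
        "Respond with exactly one word: BUY, HOLD, or SELL."
    )
    answer = generate(prompt).strip().upper()
    # Fall back to HOLD if the model replies with anything unexpected.
    return answer if answer in {"BUY", "HOLD", "SELL"} else "HOLD"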