I am a third-year PhD student in the Decision, Risk, and Operations (DRO) division at Columbia Business School, working with Prof. Assaf Zeevi and Prof. Kaizheng Wang (Columbia IEOR). I am broadly interested in AI and Operations Research (OR), with a focus on digital twin simulation and sequential decision making under uncertainty. Prior to joining the PhD program at DRO, I received my B.A. in Mathematics and Statistics, also from Columbia University.
Contact: yuhang.wu@columbia.edu
News
Apr 30, 2026
Paper “Adaptive Querying with AI Persona Priors” accepted at International Conference on Machine Learning (ICML), 2026.
Apr 10, 2026
New paper “SYN-DIGITS: A Synthetic Control Framework for Calibrated Digital Twin Simulation” posted on arXiv.
Nov 26, 2025
New paper “E-GEO: A Testbed for Generative Engine Optimization in E-Commerce” posted on arXiv.
Jun 02, 2025
Paper “Performance of LLMs on Stochastic Modeling Operations Research Problems: From Theory to Practice” accepted at Winter Simulation Conference (WSC), 2025.
May 01, 2025
Paper “Uncertainty Quantification for LLM-Based Survey Simulations” accepted at International Conference on Machine Learning (ICML), 2025.
We study adaptive querying for learning user-dependent quantities of interest, such as responses to held-out items and psychometric indicators, within tight question budgets. Classical Bayesian design and computerized adaptive testing typically rely on restrictive parametric assumptions or expensive posterior approximations, limiting their use in heterogeneous, high-dimensional, and cold-start settings. We introduce a persona-induced latent variable model that represents a user's state through membership in a finite dictionary of AI personas, each equipped with response distributions produced by a large language model. This yields expressive priors with closed-form posterior updates and efficient finite-mixture predictions, enabling scalable Bayesian design for sequential item selection. Experiments on synthetic data and WorldValuesBench demonstrate that persona-based posteriors deliver accurate probabilistic predictions and an interpretable adaptive elicitation pipeline.
@misc{wang2026adaptivequeryingaipersona,
title={Adaptive Querying with AI Persona Priors},
author={Kaizheng Wang and Yuhang Wu and Assaf Zeevi},
year={2026},
eprint={2605.00696},
archivePrefix={arXiv},
primaryClass={stat.ML},
url={https://arxiv.org/abs/2605.00696},
}
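The closed-form posterior updates and finite-mixture predictions described in the abstract can be illustrated with a toy sketch. Everything below is hypothetical: the persona probabilities, the binary-item setup, and the query-the-item-nearest-0.5 selection rule are illustrative stand-ins, not the paper's model or design criterion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: K personas, J binary items.
# p[k, j] = probability that persona k answers item j with "yes"
# (in the paper these distributions come from an LLM; here they are random).
K, J = 4, 10
p = rng.uniform(0.1, 0.9, size=(K, J))
prior = np.full(K, 1.0 / K)  # uniform prior over persona membership

def posterior_update(post, j, y):
    """Closed-form Bayes update after observing answer y in {0, 1} to item j."""
    like = p[:, j] ** y * (1 - p[:, j]) ** (1 - y)
    post = post * like
    return post / post.sum()

def predictive(post, j):
    """Finite-mixture predictive probability of a 'yes' on item j."""
    return float(post @ p[:, j])

# Toy sequential selection: query the unasked item whose predictive
# probability is closest to 0.5 (a simple uncertainty heuristic).
post = prior.copy()
asked = set()
true_k = 2  # simulated user drawn from persona 2
for _ in range(5):
    j = min(set(range(J)) - asked, key=lambda j: abs(predictive(post, j) - 0.5))
    asked.add(j)
    y = int(rng.random() < p[true_k, j])  # simulated user's answer
    post = posterior_update(post, j, y)

print(post)  # posterior over personas after five adaptive queries
```

Because the personas form a finite dictionary, the posterior is just a reweighted discrete distribution, so each update and each predictive evaluation is O(K): no sampling or variational approximation is needed.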
AI-based persona simulation, often referred to as digital twin simulation, is increasingly used for market research, recommender systems, and social sciences. Despite their flexibility, large language models (LLMs) often exhibit systematic bias and miscalibration relative to real human behavior, limiting their reliability. Inspired by synthetic control methods from causal inference, we propose SYN-DIGITS (SYNthetic Control Framework for Calibrated DIGItal Twin Simulation), a principled and lightweight calibration framework that learns latent structure from digital-twin responses and transfers it to align predictions with human ground truth. SYN-DIGITS operates as a post-processing layer on top of any LLM-based simulator and thus is model-agnostic. We develop a latent factor model that formalizes when and why calibration succeeds through latent space alignment conditions, and we systematically evaluate ten calibration methods across thirteen persona constructions, three LLMs, and two datasets. SYN-DIGITS supports both individual-level and distributional simulation for previously unseen questions and unobserved populations, with provable error guarantees. Experiments show that SYN-DIGITS achieves up to 50% relative improvements in individual-level correlation and 50--90% relative reductions in distributional discrepancy compared to uncalibrated baselines.
@misc{fan2026syndigits,
title={SYN-DIGITS: A Synthetic Control Framework for Calibrated Digital Twin Simulation},
author={Grace Jiarui Fan and Chengpiao Huang and Tianyi Peng and Kaizheng Wang and Yuhang Wu},
year={2026},
eprint={2604.07513},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2604.07513},
}
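The synthetic-control intuition behind the calibration layer can be sketched as follows. This is a minimal stand-in, not the paper's estimator: we learn simplex weights over digital twins by projected gradient descent so their weighted average matches human ground truth on calibration questions, then reuse those weights on a new question. All dimensions and the data-generating process are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: M digital twins answer Q calibration questions,
# for which a noisy human aggregate response is also observed.
M, Q = 20, 30
twin = rng.normal(size=(M, Q))                      # twin responses (M x Q)
w_true = rng.dirichlet(np.ones(M))                  # unknown mixing of twins
human = w_true @ twin + 0.05 * rng.normal(size=Q)   # noisy human ground truth

def project_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u > css / np.arange(1, len(v) + 1))[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

# Projected gradient descent on the squared calibration error.
w = np.full(M, 1.0 / M)
for _ in range(2000):
    grad = 2.0 * twin @ (w @ twin - human) / Q
    w = project_simplex(w - 0.1 * grad)

# Calibrated (post-processed) prediction for a previously unseen question:
new_twin = rng.normal(size=M)          # twins' answers to the new question
calibrated = float(w @ new_twin)
print(float(np.abs(w @ twin - human).mean()))  # in-sample calibration error
```

The post-processing character of the framework shows up here: the twins' raw responses are never modified, only reweighted, so the same layer can sit on top of any LLM-based simulator.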
With the rise of large language models (LLMs), generative engines are becoming powerful alternatives to traditional search, reshaping retrieval tasks. In e-commerce, for instance, conversational shopping agents now guide consumers to relevant products. This shift has created the need for generative engine optimization (GEO): improving content visibility and relevance for generative engines. Yet despite its growing importance, current GEO practices are ad hoc, and their impacts remain poorly understood, especially in e-commerce. We address this gap by introducing E-GEO, the first benchmark built specifically for e-commerce GEO. E-GEO contains over 7,000 realistic, multi-sentence consumer product queries paired with relevant listings, capturing rich intent, constraints, preferences, and shopping contexts that existing datasets largely miss. Using this benchmark, we conduct the first large-scale empirical study of e-commerce GEO, evaluating 15 common rewriting heuristics and comparing their performance. To move beyond heuristics, we further formulate GEO as a tractable optimization problem and develop a lightweight iterative prompt-optimization algorithm that can significantly outperform these baselines. Surprisingly, the optimized prompts reveal a stable, domain-agnostic pattern, suggesting the existence of a "universally effective" GEO strategy.
@misc{bagga2025egeo,
title={E-GEO: A Testbed for Generative Engine Optimization in E-Commerce},
author={Puneet S. Bagga and Vivek F. Farias and Tamar Korkotashvili and Tianyi Peng and Yuhang Wu},
year={2025},
eprint={2511.20867},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2511.20867},
}
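The shape of an iterative prompt-optimization loop can be sketched in a few lines. This is only a caricature of the idea, not the paper's algorithm: the visibility score below is a mock word-overlap objective (in practice one would query the generative engine on held-out shopping queries and measure how often the listing surfaces), and the mutation operator is a hypothetical stand-in.

```python
import random

random.seed(0)

TARGET_WORDS = ["mention", "key", "constraints", "and", "use-cases", "explicitly"]

def visibility_score(prompt):
    """Mock objective: fraction of target words present in the rewrite prompt.
    A real evaluation would call the generative engine and measure how often
    the rewritten listing is recommended for held-out consumer queries."""
    return sum(w in prompt for w in TARGET_WORDS) / len(TARGET_WORDS)

def mutate(prompt):
    """Hypothetical mutation: append a random candidate phrase."""
    return prompt + " " + random.choice(TARGET_WORDS)

# Hill-climbing over rewrite prompts: keep a candidate only if it scores higher.
best = "rewrite the product listing"
best_score = visibility_score(best)
for _ in range(200):
    cand = mutate(best)
    s = visibility_score(cand)
    if s > best_score:
        best, best_score = cand, s

print(best_score)  # score of the best rewrite prompt found
```

Even this toy loop reflects the paper's framing of GEO as an optimization problem: once a visibility objective is defined, prompt search becomes a black-box maximization that simple iterative schemes can attack.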
Large language models (LLMs) are increasingly used to simulate survey responses, but synthetic data can be misaligned with the human population, leading to unreliable inference. We develop a general framework that converts LLM-simulated responses into reliable confidence sets for population parameters of human responses, addressing the distribution shift between the simulated and real populations. The key design choice is the number of simulated responses: too many produce overly narrow sets with poor coverage, while too few yield excessively loose estimates. We propose a data-driven approach that adaptively selects the simulation sample size to achieve nominal average-case coverage, regardless of the LLM's simulation fidelity or the confidence set construction procedure. The selected sample size is further shown to reflect the effective human population size that the LLM can represent, providing a quantitative measure of its simulation fidelity. Experiments on real survey datasets reveal heterogeneous fidelity gaps across different LLMs and domains.
@misc{huang2025human,
title={How Many Human Survey Respondents is a Large Language Model Worth? An Uncertainty Quantification Perspective},
author={Chengpiao Huang and Yuhang Wu and Kaizheng Wang},
year={2025},
eprint={2502.17773},
archivePrefix={arXiv},
primaryClass={stat.ME},
url={https://arxiv.org/abs/2502.17773},
}
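The core tension the abstract describes, that too many simulated responses produce overly narrow sets with poor coverage, can be seen in a toy bias-versus-width experiment. This is only an illustrative picture of the effective-sample-size idea, not the paper's procedure; the bias, noise level, and normal-approximation interval are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical simulator: LLM responses carry a small systematic bias
# relative to the true human mean. A normal-approximation CI from n
# simulated draws shrinks like 1/sqrt(n), so for large n the bias
# dominates and coverage of the human mean collapses.
human_mean, bias, sigma = 0.0, 0.1, 1.0

def coverage(n, reps=4000, z=1.96):
    """Monte Carlo coverage of the human mean by a CI built from n
    simulated (biased) responses."""
    hits = 0
    for _ in range(reps):
        x = human_mean + bias + sigma * rng.normal(size=n)
        half = z * sigma / np.sqrt(n)
        hits += (x.mean() - half <= human_mean <= x.mean() + half)
    return hits / reps

cov_small, cov_large = coverage(50), coverage(5000)
print(cov_small, cov_large)  # coverage degrades as n grows past the bias scale
```

The sample size at which coverage starts to fail is governed by how the simulator's bias compares to the 1/sqrt(n) interval width, which is the sense in which a well-chosen n acts as the effective number of human respondents the simulator is worth.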