We discuss experiences from evaluating the learning performance of a user-adaptive personal assistant agent, focusing on the challenge of designing an adequate evaluation and the tension of collecting adequate data without a fully functional, deployed system. Reflections on negative and positive experiences point to the challenges of evaluating user-adaptive AI systems. Lessons learned concern early consideration of evaluation and deployment, characteristics of AI technology and domains that make controlled evaluations appropriate or not, holistic experimental design, implications of "in the wild" evaluation, and the impact of AI-enabled functionality on existing tools and work practices.