
Lessons on off-policy methods from a notification component of a chatbot. (English) Zbl 07465681

Summary: This work reviews our experience applying off-policy techniques to train and evaluate a contextual bandit model that powers a troubleshooting notification in a chatbot. First, we demonstrate the effectiveness of off-policy evaluation when the data volume is orders of magnitude smaller than is typical in the literature. We present our reward function and the choices behind its design, as well as how we construct our logging policy to balance exploration against performance on key metrics. Next, we present a guided framework, called Post-Hoc Reward Distribution Hacking, for updating a model after training, which we employed to improve model performance and correct deficiencies in trained models stemming from the existence of a null action and a noisy reward signal. Throughout the work, we discuss practical pitfalls encountered while using off-policy methods, in the hope of expediting other applications of these techniques.
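
The summary refers to off-policy evaluation of a contextual bandit on logged data but gives no formulas. As a rough illustration of that class of estimator (not the authors' implementation), the sketch below computes a clipped inverse propensity scoring (IPS) estimate and its self-normalized variant; the function names and the synthetic logged data are assumptions for demonstration only.

import numpy as np

def ips_estimate(rewards, logged_probs, target_probs, clip=None):
    # Inverse propensity scoring (IPS): reweight logged rewards by the ratio of
    # the target policy's action probability to the logging policy's probability.
    weights = target_probs / logged_probs
    if clip is not None:
        # Optional clipping (truncated IPS) to limit variance from large weights.
        weights = np.minimum(weights, clip)
    return np.mean(weights * rewards)

def snips_estimate(rewards, logged_probs, target_probs):
    # Self-normalized IPS: divide by the sum of weights instead of the sample
    # size, trading a small bias for lower variance on small logs.
    weights = target_probs / logged_probs
    return np.sum(weights * rewards) / np.sum(weights)

# Hypothetical logged bandit feedback: per-impression reward, probability the
# logging policy assigned to the shown action, and probability under the new policy.
rng = np.random.default_rng(0)
n = 500
logged_probs = rng.uniform(0.1, 0.9, size=n)
target_probs = rng.uniform(0.1, 0.9, size=n)
rewards = rng.binomial(1, 0.3, size=n).astype(float)

print("IPS  :", ips_estimate(rewards, logged_probs, target_probs, clip=10.0))
print("SNIPS:", snips_estimate(rewards, logged_probs, target_probs))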

MSC:

68T05 Learning and adaptive systems in artificial intelligence

Software:

Keras
Full Text: DOI
