
Stochastic Multi-path Routing Problem with Non-stationary Rewards: Building PayU's Dynamic Routing

Authors: Pankaj Trivedi, Arvind Singh

WWW '18: Companion Proceedings of the The Web Conference 2018

Pages 1707-1712

https://doi.org/10.1145/3184558.3191630

Published: 23 April 2018


Abstract

The payment transaction engine at PayU processes millions of transactions every day through multiple payment gateways. Routing each transaction through an appropriate payment gateway is crucial for optimizing both availability and cost. Every transaction must choose one of K available payment gateways, each characterized by an unknown reward probability distribution. The reward for a gateway combines its health and cost factors, and it is realized only when the gateway processes the transaction, i.e., by the transaction's success or failure. The objective of dynamic routing is to maximize the cumulative expected reward over a given horizon of transactions.

To do this, the dynamic switching system must acquire information about the gateways (exploration) while simultaneously optimizing immediate rewards by selecting the gateway that currently looks best (exploitation); the price paid for this trade-off is referred to as the regret. The main objective is to minimize the regret and thereby maximize the rewards. The basic idea is to choose a gateway according to its probability of being the best gateway.

The routing problem is a direct formulation of a reinforcement learning (RL) problem. In an RL problem, an agent interacts with a dynamic, stochastic, and incompletely known environment, with the goal of finding an action-selection strategy, or policy, that optimizes some long-term performance measure. The Thompson Sampling algorithm has been shown experimentally to be close to optimal.
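The abstract frames routing as a K-armed bandit in which each gateway is chosen according to its probability of being the best one, which is what Thompson Sampling does. The Python sketch below illustrates that idea under our own assumptions: the reward is binary success/failure only (the cost factor mentioned in the abstract is omitted), every gateway starts from a uniform Beta(1, 1) prior, and non-stationarity is handled by discounting old evidence with a factor `gamma`. Discounted Thompson Sampling is one common variant and is not necessarily what PayU deployed; `DiscountedThompsonRouter`, `simulate`, and the drifting success rates are hypothetical illustrations, not the authors' code or data.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class DiscountedThompsonRouter:
    """Beta-Bernoulli Thompson Sampling over K gateways with discounted evidence.

    Hypothetical sketch: a reward of 1 means the transaction succeeded on the
    chosen gateway, 0 means it failed. gamma < 1 forgets old outcomes so the
    router can follow gateways whose health drifts; gamma == 1 is standard TS.
    """
    n_gateways: int
    gamma: float = 0.99
    prior_alpha: float = 1.0  # Beta(1, 1) = uniform prior on each success rate
    prior_beta: float = 1.0
    successes: List[float] = field(init=False)
    failures: List[float] = field(init=False)

    def __post_init__(self) -> None:
        self.successes = [0.0] * self.n_gateways
        self.failures = [0.0] * self.n_gateways

    def choose(self, rng: random.Random) -> int:
        # One posterior draw per gateway; routing to the argmax selects each
        # gateway with its posterior probability of being the best one.
        draws = [
            rng.betavariate(self.prior_alpha + s, self.prior_beta + f)
            for s, f in zip(self.successes, self.failures)
        ]
        return max(range(self.n_gateways), key=lambda k: draws[k])

    def update(self, gateway: int, success: bool) -> None:
        # Shrink all past evidence by gamma, then record the new outcome.
        self.successes = [self.gamma * s for s in self.successes]
        self.failures = [self.gamma * f for f in self.failures]
        if success:
            self.successes[gateway] += 1.0
        else:
            self.failures[gateway] += 1.0


def simulate(
    rates_at: Callable[[int], Sequence[float]],
    n_gateways: int,
    horizon: int,
    gamma: float = 0.99,
    seed: int = 0,
) -> Tuple[float, List[int]]:
    """Route `horizon` simulated transactions; return (expected regret, choices)."""
    rng = random.Random(seed)
    router = DiscountedThompsonRouter(n_gateways, gamma=gamma)
    regret, choices = 0.0, []
    for t in range(horizon):
        rates = rates_at(t)  # true success probabilities, unknown to the router
        g = router.choose(rng)
        router.update(g, rng.random() < rates[g])
        regret += max(rates) - rates[g]  # gap to the best gateway at time t
        choices.append(g)
    return regret, choices


if __name__ == "__main__":
    # Hypothetical scenario: gateway 0 degrades at t = 5000 and gateway 1 becomes best.
    def drifting(t: int) -> List[float]:
        return [0.95, 0.80] if t < 5000 else [0.60, 0.90]

    regret, choices = simulate(drifting, n_gateways=2, horizon=10_000)
    late = choices[-2000:]
    print(f"regret={regret:.1f}, gateway 1 share in last 2000: {late.count(1) / len(late):.2f}")
```

Setting `gamma=1.0` recovers standard Thompson Sampling, which keeps all past evidence and can therefore be slow to move traffic away from a gateway that degrades after a long healthy run; a smaller `gamma` reacts faster at the cost of noisier estimates and more frequent re-exploration of weaker gateways.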


Cited By

Chaudhary A, Rai A, Gupta A (2023). Maximizing Success Rate of Payment Routing using Non-stationary Bandits. Proceedings of the Third International Conference on AI-ML Systems, pp. 1-7. DOI: 10.1145/3639856.3639883. Online publication date: 25-Oct-2023.
https://dl.acm.org/doi/10.1145/3639856.3639883

Bygari R, Gupta A, Raghuvanshi S, Bapna A, Sahu B (2021). An AI-powered Smart Routing Solution for Payment Systems. 2021 IEEE International Conference on Big Data (Big Data), pp. 2026-2033. DOI: 10.1109/BigData52589.2021.9671961. Online publication date: 15-Dec-2021.
https://doi.org/10.1109/BigData52589.2021.9671961

Cui H, Su X, Zeng J, Liu B (2018). Energy Efficient Based Splitting for MPTCP in Heterogeneous Networks. Ad Hoc Networks, pp. 105-114. DOI: 10.1007/978-3-030-05888-3_10. Online publication date: 19-Dec-2018.
https://doi.org/10.1007/978-3-030-05888-3_10

Index Terms

1. Theory of computation
  1. Theory and algorithms for application domains
    1. Machine learning theory

Recommendations

Stochastic multi-armed-bandit problem with non-stationary rewards
NIPS'14: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1

In a multi-armed bandit (MAB) problem a gambler needs to choose at each round of play one of K arms, each characterized by an unknown reward distribution. Reward realizations are only observed when an arm is selected, and the gambler's objective is to ...
On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems

We consider a totally asynchronous stochastic approximation algorithm, Q-learning, for solving finite space stochastic shortest path (SSP) problems, which are undiscounted, total cost Markov decision processes with an absorbing and cost-free state. For ...
Reinforcement learning approaches for the stochastic discrete lot-sizing problem on parallel machines
Abstract
This paper addresses the stochastic discrete lot-sizing problem on parallel machines, which is a computationally challenging problem also for relatively small instances. We propose two heuristics to deal with it by leveraging reinforcement ...
Highlights
- Addressing the Stochastic Discrete Lot-Sizing Problem (SDLSP).
- Open-source environment for SDLSP Reinforcement Learning (RL).
- Presenting LSCMA as our multi-agent RL-based proposed method for SDLSP.
- Introducing Branch and Bound ...


Information & Contributors

Information

Published In


WWW '18: Companion Proceedings of the The Web Conference 2018

April 2018

2023 pages

ISBN:9781450356404

General Chairs:
Pierre-Antoine Champin, Université Claude Bernard Lyon 1, France
Fabien Gandon, Inria, Université Côte d'Azur, CNRS, I3S, France
Lionel Médini, Université Claude Bernard Lyon 1, CNRS, LIRIS, France

Program Chairs:
Mounia Lalmas, Spotify, UK
Panagiotis G. Ipeirotis, New York University, USA

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

IW3C2: International World Wide Web Conference Committee

In-Cooperation

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

International World Wide Web Conferences Steering Committee

Republic and Canton of Geneva, Switzerland


Qualifiers

Research-article

Conference

WWW '18

Sponsor:

IW3C2

WWW '18: The Web Conference 2018

April 23 - 27, 2018

Lyon, France

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%


Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 3
Total Downloads: 1,199
Downloads (last 12 months): 237
Downloads (last 6 weeks): 14

Reflects downloads up to 30 Aug 2024

