In this work, we train a Q-learning agent in this MDP environment to solve the problem. The training goal is to maximize the cumulative reward collected by the agent. The algorithm has a function that ...