CSCA 5902: Mastering Classical Reinforcement Learning Algorithms
Preview this course in the non-credit experience today!
Start working toward program admission and requirements right away. Work you complete in the non-credit experience will transfer to the for-credit experience when you upgrade and pay tuition. See How It Works for details.
Course Type: MS-AI Breadth, MS-CS Elective
Specialization: Reinforcement Learning
Instructor: Dr. Ashutosh Trivedi, Associate Professor of Computer Science
Prior knowledge needed: TBD
Learning Outcomes
- Formulate sequential decision-making problems as deterministic decision processes, Markov chains, and finite Markov decision processes.
- Explain and apply core reinforcement learning concepts, including discounting, value functions, policies, Bellman equations, and optimality.
- Implement planning algorithms for finite Markov decision processes, including value iteration, policy iteration, and linear programming formulations.
- Implement and compare tabular reinforcement learning algorithms, including bandits, Monte Carlo methods, temporal-difference learning, SARSA, and Q-learning.
- Analyze the role of sampling, exploration, and convergence guarantees in classical reinforcement learning.
Course Grading Policy
| Assessment | Percentage of Grade | AI Usage Policy |
|---|---|---|
| Quizzes (5) | 70% (14% each) | Conditional |
| Final Exam | 30% | Conditional |
Course Content
Duration: 3 hours, 59 minutes
This module introduces the modeling and optimization foundations for sequential decision-making in their simplest form: deterministic decision processes with discounted rewards. We begin with states, actions, transitions, and rewards as a language for representing decision problems over time. We then develop value functions and discounted optimality equations as tools for optimizing long-term return. The goal is to build intuition for why dynamic programming is correct in the simpler setting of deterministic decision processes before introducing stochastic transitions, learning from sampled experience, and bootstrapping in later modules.
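As a small illustration of the discounted return described above, the following sketch sums a finite reward sequence with discount factor gamma. The function name `discounted_return` and the sample rewards are assumptions made for this example, not course material.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for a finite reward sequence."""
    total = 0.0
    # Work backward so each step applies G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Hypothetical rewards collected along one deterministic trajectory.
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

The backward recursion is the same one-step relationship that the discounted optimality equations build on: the value now equals the immediate reward plus the discounted value of what follows.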
Duration: 2 hours, 50 minutes
This module adds stochasticity to the deterministic picture developed in the previous module. Learners continue with the surprise-quiz example, now with uncertain outcomes: studying usually helps but may not always help, and relaxing may reduce preparation but may not always do so. The module first introduces stochastic transitions as probability distributions over next states, then studies Markov chains as stochastic systems without choices, and finally adds actions to obtain Markov decision processes. The goal is to make expected discounted reward, policies, and Bellman equations feel like natural extensions of the deterministic setting.
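A minimal sketch of expected discounted reward in a Markov chain (a stochastic system without choices) might look like the following. The two-state chain, its transition probabilities, and its rewards are hypothetical values chosen for illustration; they are not taken from the course's surprise-quiz example.

```python
def evaluate_chain(P, r, gamma, sweeps=1000):
    """Approximate V = r + gamma * P V by repeated Bellman backups.

    P[s][t] is the probability of moving from state s to state t,
    and r[s] is the reward received in state s.
    """
    n = len(r)
    values = [0.0] * n
    for _ in range(sweeps):
        values = [
            r[s] + gamma * sum(P[s][t] * values[t] for t in range(n))
            for s in range(n)
        ]
    return values


if __name__ == "__main__":
    # Hypothetical chain: state 0 pays reward 1, state 1 pays nothing.
    P = [[0.9, 0.1],
         [0.5, 0.5]]
    r = [1.0, 0.0]
    print(evaluate_chain(P, r, gamma=0.9))
```

Each backup replaces the single next state of the deterministic setting with an expectation over next states, which is the only change needed to move from deterministic values to expected discounted reward.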
Duration: 2 hours, 26 minutes
This module focuses on known-model optimization. Learners use Bellman equations as computational tools for policy evaluation, policy improvement, value iteration, policy iteration, and linear programming formulations of discounted MDPs.
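To make the known-model setting concrete, here is a hedged sketch of value iteration on a tiny finite MDP. The dictionary layout `mdp[state][action] = [(probability, next_state, reward), ...]`, the state and action names, and the helper `q_value` are all assumptions for this example rather than the course's own notation.

```python
def q_value(mdp, values, state, action, gamma):
    """Expected one-step reward plus discounted value of the next state."""
    return sum(
        prob * (reward + gamma * values[next_state])
        for prob, next_state, reward in mdp[state][action]
    )


def value_iteration(mdp, gamma, tol=1e-10):
    """Apply Bellman optimality backups in place until values stop changing."""
    values = {state: 0.0 for state in mdp}
    while True:
        delta = 0.0
        for state, actions in mdp.items():
            best = max(q_value(mdp, values, state, a, gamma) for a in actions)
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tol:
            break
    # Extract a policy that is greedy with respect to the final values.
    policy = {
        state: max(actions, key=lambda a: q_value(mdp, values, state, a, gamma))
        for state, actions in mdp.items()
    }
    return values, policy


if __name__ == "__main__":
    # Hypothetical MDP: staying in "goal" pays 1 per step, everything else pays 0.
    mdp = {
        "start": {"stay": [(1.0, "start", 0.0)], "go": [(1.0, "goal", 0.0)]},
        "goal": {"stay": [(1.0, "goal", 1.0)], "go": [(1.0, "start", 0.0)]},
    }
    print(value_iteration(mdp, gamma=0.5))
```

This sketch updates values in place during each sweep, which is one common variant; policy iteration and linear programming formulations reach the same optimal values by different routes.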
Duration: 2 hours, 21 minutes
This module begins the transition from planning to reinforcement learning. In planning, the MDP model is known and Bellman backups compute expectations exactly. In reinforcement learning, the model is replaced by sampled experience. Learners first view reinforcement learning as sample-based dynamic programming, then study rewards, uncertainty, agent-environment interaction, bandit estimation, exploration versus exploitation, Monte Carlo policy evaluation, and Monte Carlo control.
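The bandit estimation and exploration-versus-exploitation ideas can be sketched as follows, assuming Bernoulli-reward arms and an epsilon-greedy rule. The arm probabilities, the function names, and the choice of epsilon-greedy are illustrative assumptions; the course may present other exploration strategies.

```python
import random


def update_mean(estimate, count, reward):
    """Incremental sample average; count includes the new sample."""
    return estimate + (reward - estimate) / count


def epsilon_greedy(estimates, epsilon, rng):
    """Explore a uniformly random arm with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda arm: estimates[arm])


def run_bandit(arm_means, steps, epsilon, seed=0):
    """Estimate arm values purely from sampled rewards (no model access)."""
    rng = random.Random(seed)
    estimates = [0.0] * len(arm_means)
    counts = [0] * len(arm_means)
    for _ in range(steps):
        arm = epsilon_greedy(estimates, epsilon, rng)
        # Hypothetical Bernoulli arm: reward 1 with probability arm_means[arm].
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] = update_mean(estimates[arm], counts[arm], reward)
    return estimates, counts


if __name__ == "__main__":
    print(run_bandit([0.2, 0.8], steps=500, epsilon=0.1))
```

The learner never reads `arm_means` directly; it only sees sampled rewards, which is the sense in which sampled experience replaces the exact expectations used in planning.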
Duration: 1 hour, 42 minutes
This module completes the tabular reinforcement-learning part of Course 1. Module 4 introduced sample-based learning through bandits and Monte Carlo methods. Module 5 introduces temporal-difference learning: updating after one sampled transition by combining an observed reward with a bootstrapped value estimate. The module ends by summarizing tabular reinforcement learning and motivating the transition to function approximation and deep RL.
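The idea of updating after one sampled transition can be sketched with three standard tabular update rules: TD(0) for state values, SARSA, and Q-learning. The table layout (plain dictionaries), the state and action names, and the step size are assumptions for this example.

```python
def td0_update(V, state, reward, next_state, alpha, gamma):
    """TD(0): move V[state] toward reward + gamma * V[next_state]."""
    target = reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])


def sarsa_update(Q, state, action, reward, next_state, next_action, alpha, gamma):
    """SARSA: bootstrap from the action actually taken in the next state."""
    target = reward + gamma * Q[next_state][next_action]
    Q[state][action] += alpha * (target - Q[state][action])


def q_learning_update(Q, state, action, reward, next_state, alpha, gamma):
    """Q-learning: bootstrap from the best estimated action in the next state."""
    target = reward + gamma * max(Q[next_state].values())
    Q[state][action] += alpha * (target - Q[state][action])


if __name__ == "__main__":
    # Hypothetical Q-table with two states and two actions each.
    Q = {
        "s0": {"left": 0.0, "right": 0.0},
        "s1": {"left": 1.0, "right": 2.0},
    }
    q_learning_update(Q, "s0", "right", 1.0, "s1", alpha=0.5, gamma=0.9)
    sarsa_update(Q, "s0", "left", 1.0, "s1", "left", alpha=0.5, gamma=0.9)
    print(Q["s0"])
```

In each rule the target combines an observed reward with a bootstrapped estimate. SARSA and Q-learning differ only in which next-state estimate they bootstrap from.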
Duration: 2 hours, 12 minutes
- The exam is not proctored.
- It is a two-hour exam.
- You may submit your exam only once.
- The exam contains only multiple choice questions.
- You are not allowed to use any notes or access other websites when you take your exam.
Notes
- Cross-listed Courses: Courses that are offered under two or more programs. They are considered equivalent when evaluating progress toward degree requirements. You may not earn credit for more than one version of a cross-listed course.
- Page Updates: This page is periodically updated. Course information on the Coursera platform supersedes the information on this page. Click the View on Coursera button above for the most up-to-date information.