CSCA 5902: Mastering Classical Reinforcement Learning Algorithms

Get a head start on program admission

Preview this course in the non-credit experience today!
Start working toward program admission and requirements right away. Work you complete in the non-credit experience will transfer to the for-credit experience when you upgrade and pay tuition. See How It Works for details.

Course Type: MS-AI Breadth, MS-CS Elective

Specialization: Reinforcement Learning

Instructor: Dr. Ashutosh Trivedi, Associate Professor of Computer Science

Prior knowledge needed: TBD

Learning Outcomes

Formulate sequential decision-making problems as deterministic decision processes, Markov chains, and finite Markov decision processes.
Explain and apply core reinforcement learning concepts, including discounting, value functions, policies, Bellman equations, and optimality.
Implement planning algorithms for finite Markov decision processes, including value iteration, policy iteration, and linear programming formulations.
Implement and compare tabular reinforcement learning algorithms, including bandits, Monte Carlo methods, temporal-difference learning, SARSA, and Q-learning.
Analyze the role of sampling, exploration, and convergence guarantees in classical reinforcement learning.

Course Grading Policy

Assessment	Percentage of Grade	AI Usage Policy
Quizzes (5)	70% (14% each)	Conditional
Final Exam	30%	Conditional

Course Content

Module 1 | Deterministic Decision Processes

Duration: 3 hours, 59 minutes

This module introduces the modeling and optimization foundations for sequential decision-making in their simplest form: deterministic decision processes with discounted rewards. We begin with states, actions, transitions, and rewards as a language for representing decision problems over time. We then develop value functions and discounted optimality equations as tools for optimizing long-term return. The goal is to build intuition for why dynamic programming is correct in the simpler setting of deterministic decision processes before introducing stochastic transitions, learning from sampled experience, and bootstrapping in later modules.
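
As a rough illustration of the discounted rewards mentioned above, the following Python sketch computes the discounted return of a finite reward sequence. The function name, the reward values, and the discount factor are assumptions chosen for illustration and are not taken from the course materials.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for a finite reward sequence."""
    total = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical rewards: 1 + 0.5 * 1 + 0.25 * 1
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

The backward accumulation is the same recursion that dynamic programming exploits: the return from a step equals the immediate reward plus the discounted return from the next step.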

Module 2 | Markov Chains and Markov Decision Processes

Duration: 2 hours, 50 minutes

This module adds stochasticity to the deterministic picture developed in the previous module. Learners continue with the surprise-quiz example, now with uncertain outcomes: studying usually helps, though not always, and relaxing may reduce preparation, though not always. The module first introduces stochastic transitions as probability distributions over next states, then studies Markov chains as stochastic systems without choices, and finally adds actions to obtain Markov decision processes. The goal is to make expected discounted reward, policies, and Bellman equations feel like natural extensions of the deterministic setting.
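
One possible way to picture expected discounted reward in a Markov chain is the small Python sketch below. The two-state chain, its rewards, and the function name are hypothetical and are not the course's surprise-quiz example. Each row of P is a probability distribution over next states.

```python
def evaluate_chain(P, r, gamma, sweeps=200):
    """Approximate V = r + gamma * P V by repeated substitution."""
    n = len(r)
    V = [0.0] * n
    for _ in range(sweeps):
        # Expected discounted reward: immediate reward plus the
        # discounted average of next-state values.
        V = [r[s] + gamma * sum(P[s][t] * V[t] for t in range(n))
             for s in range(n)]
    return V


# Hypothetical chain: state 0 stays or moves to state 1 with equal
# probability; state 1 is absorbing and pays nothing.
P = [[0.5, 0.5],
     [0.0, 1.0]]
r = [1.0, 0.0]
V = evaluate_chain(P, r, gamma=0.9)
```

For this chain the value of state 0 satisfies V0 = 1 + 0.9 * 0.5 * V0, so the iteration should approach 1 / 0.55.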

Module 3 | Dynamic Programming in MDPs

Duration: 2 hours, 26 minutes

This module focuses on known-model optimization. Learners use Bellman equations as computational tools for policy evaluation, policy improvement, value iteration, policy iteration, and linear programming formulations of discounted MDPs.
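
A minimal sketch of value iteration, one of the planning algorithms listed above, might look like the following. The data layout (P[s][a] as a list of (probability, next state) pairs) and the two-state MDP are assumptions made for this example, not the course's own notation.

```python
def value_iteration(P, R, gamma, tol=1e-8):
    """P[s][a]: list of (prob, next_state); R[s][a]: immediate reward."""
    n = len(P)
    V = [0.0] * n
    while True:
        # Bellman optimality backup for every state.
        new_V = [
            max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s])))
            for s in range(n)
        ]
        if max(abs(x - y) for x, y in zip(new_V, V)) < tol:
            return new_V
        V = new_V


# Hypothetical MDP: in state 0, action 0 stays put for reward 1, and
# action 1 pays 2 but may move to the absorbing, zero-reward state 1.
P = [
    [[(1.0, 0)], [(0.5, 0), (0.5, 1)]],
    [[(1.0, 1)]],
]
R = [[1.0, 2.0], [0.0]]
V = value_iteration(P, R, gamma=0.9)
```

In this example staying in state 0 forever is optimal, so the value of state 0 should approach 1 / (1 - 0.9) = 10.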

Module 4 | Learning from Sampled Experience

Duration: 2 hours, 21 minutes

This module begins the transition from planning to reinforcement learning. In planning, the MDP model is known and Bellman backups compute expectations exactly. In reinforcement learning, the model is replaced by sampled experience. Learners first view reinforcement learning as sample-based dynamic programming, then study rewards, uncertainty, agent-environment interaction, bandit estimation, exploration versus exploitation, Monte Carlo policy evaluation, and Monte Carlo control.
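
To make bandit estimation and the exploration-versus-exploitation trade-off concrete, here is a hedged Python sketch of an epsilon-greedy agent with sample-average estimates. The Gaussian reward model, the arm means, and all names are assumptions for illustration only.

```python
import random


def run_epsilon_greedy(true_means, steps, epsilon, seed=0):
    """Estimate arm values from sampled rewards only."""
    rng = random.Random(seed)
    k = len(true_means)
    Q = [0.0] * k   # sample-average estimate for each arm
    N = [0] * k     # number of pulls for each arm
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                   # explore
        else:
            a = max(range(k), key=lambda i: Q[i])  # exploit
        reward = rng.gauss(true_means[a], 1.0)     # assumed noise model
        N[a] += 1
        Q[a] += (reward - Q[a]) / N[a]             # incremental mean
    return Q, N


Q, N = run_epsilon_greedy([0.1, 0.5, 0.9], steps=2000, epsilon=0.1)
```

The agent never sees true_means directly. It only observes sampled rewards, which is the sense in which the known model of planning is replaced by experience.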

Module 5 | Control, Exploration, and Tabular RL Algorithms

Duration: 1 hour, 42 minutes

This module completes the tabular reinforcement-learning part of Course 1. Module 4 introduced sample-based learning through bandits and Monte Carlo methods. Module 5 introduces temporal-difference learning: updating after one sampled transition by combining an observed reward with a bootstrapped value estimate. The module ends by summarizing tabular reinforcement learning and motivating the transition to function approximation and deep RL.
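
The one-step update described above can be sketched as a tabular TD(0) rule. The state names, step size, and discount factor below are hypothetical. SARSA and Q-learning apply the same idea to action values and are not shown here.

```python
def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """Update V[s] in place after one sampled transition (s, r, s_next)."""
    # Bootstrapped target: observed reward plus the discounted current
    # estimate of the next state's value (zero if the episode ended).
    target = r + (0.0 if terminal else gamma * V[s_next])
    V[s] += alpha * (target - V[s])
    return V


# Hypothetical value table and transition.
V = {"A": 0.0, "B": 1.0}
td0_update(V, "A", r=0.5, s_next="B", alpha=0.1, gamma=0.9)
```

Here the target is 0.5 + 0.9 * 1.0 = 1.4, so the estimate for state A moves one tenth of the way from 0 toward 1.4.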

Module 6 | Final Exam (For-Credit Experience Only)

Duration: 2 hours, 12 minutes

The exam is not proctored.
It is a two-hour exam.
You may submit your exam only once.
The exam contains only multiple choice questions.
You are not allowed to use any notes or access other websites when you take your exam.

Notes

Cross-listed Courses: Courses that are offered under two or more programs. Considered equivalent when evaluating progress toward degree requirements. You may not earn credit for more than one version of a cross-listed course.
Page Updates: This page is periodically updated. Course information on the Coursera platform supersedes the information on this page. Click the View on Coursera button above for the most up-to-date information.
