* Comment Edition
Some minor word corrections in the comments only.
Thanks for the notebook.
* Skipped this one
This typo was trying to escape, but I caught it on revision :-)
notebooks/Chapter01_Tic_Tac_Toe.jl
13 additions & 13 deletions
@@ -23,31 +23,31 @@ md"""
In the following notebooks, we'll mainly use the [ReinforcementLearning.jl](https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl) package to demonstrate how to generate figures in the book [Reinforcement Learning: An Introduction (2nd)](http://incompleteideas.net/book/the-book-2nd.html). In case you haven't used this package before, you can visit [juliareinforcementlearning.org](https://juliareinforcementlearning.org/) for a detailed introduction. Besides, we'll also explain some basic concepts gradually when we meet them for the first time.
- [ReinforcementLearning.jl](https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl) contains a collection of tools to describe and solve problems we want to handle in the reinforcement leanring field. Though we are mostly interested in traditional tabular methods here, it also contains many state-of-the-art algorithms. To use it, we can simply run the following code here:
+ [ReinforcementLearning.jl](https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl) contains a collection of tools to describe and solve problems we want to handle in the reinforcement learning field. Though we are mostly interested in traditional tabular methods here, it also contains many state-of-the-art algorithms. To use it, we can simply run the following code here:
"""
# ╔═╡ bfe260f4-481e-11eb-28e8-eb3cca823162
md"""
!!! note
- As you might have noticed, it takes quite a long time to load this package for the first time (the good news is that it will be largedly decreased after `Julia@v1.6`). Once loaded, things should be very fast.
+ As you might have noticed, it takes quite a long time to load this package for the first time (the good news is that the load time will be greatly reduced after `Julia@v1.6`). Once loaded, things should be very fast.
"""
# ╔═╡ 147a9204-4841-11eb-04fd-4f760dff4bc8
md"""
## Tic-Tac-Toe
- In chapter 1.5, a simple game of [Tic-Tac-Toe](https://en.wikipedia.org/wiki/Tic-tac-toe) is introduced to illustrate the general idea of reinforcement learning. Before looking into the details about how to implement the monte carlo based policy, let's take a look at how the Tic-Tac-Toe environment is defined in [ReinforcementLearning.jl](https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl).
+ In Chapter 1.5, a simple game of [Tic-Tac-Toe](https://en.wikipedia.org/wiki/Tic-tac-toe) is introduced to illustrate the general idea of reinforcement learning. Before looking into the details of how to implement the Monte Carlo based policy, let's take a look at how the Tic-Tac-Toe environment is defined in [ReinforcementLearning.jl](https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl).
"""
# ╔═╡ 46d5181a-4845-11eb-222c-3d8b78f89dc6
env = TicTacToeEnv()
# ╔═╡ d9a93b80-4827-11eb-39cf-c589ebcd092b
md"""
- There are many important information provided above. First, the `TicTacToeEnv` is a [zero-sum](https://en.wikipedia.org/wiki/Zero-sum_game), two player environment. `Sequential` means each player takes an action alternatively. Since all the players can observe the same information (the board), it is an environment of [perfect information](https://en.wikipedia.org/wiki/Perfect_information). Note that each player only get the reward at the end of the game (`-1`, `0`, or `1`). So we call `RewardStyle` of this kind of environments `TerminalReward()`. At each step, only part of the actions are valid (the position of the board), so we say the `ActionStyle` of this env is of `FullActionSet()`.
+ A lot of important information is provided above. First, the `TicTacToeEnv` is a [zero-sum](https://en.wikipedia.org/wiki/Zero-sum_game), two-player environment. `Sequential` means the players take actions alternately. Since all the players can observe the same information (the board), it is an environment of [perfect information](https://en.wikipedia.org/wiki/Perfect_information). Note that each player only gets a reward at the end of the game (`-1`, `0`, or `1`), so we say the `RewardStyle` of this kind of environment is `TerminalReward()`. At each step, only part of the actions are valid (the unoccupied positions on the board), so we say the `ActionStyle` of this env is `FullActionSet()`.
- All these information are provided by traits. In later chapters, we'll see how to define these traits for new customized environments. Now let's get familiar with some basic interfaces defined with the environment first.
+ All this information is provided by traits. In later chapters, we'll see how to define these traits for new customized environments. For now, let's get familiar with some basic interfaces defined on the environment first.
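As a rough illustration of those traits and interfaces, here is a hedged sketch assuming the `RLBase`-style query functions of that era; exact names may differ slightly between package versions:

```julia
# Trait queries on the environment (values as described above).
RewardStyle(env)          # expected to be TerminalReward()
ActionStyle(env)          # expected to be FullActionSet()

# Some basic interfaces defined for an environment.
state(env)                # the current board observation
legal_action_space(env)   # the currently valid actions (unoccupied positions)
is_terminated(env)        # whether the game is over
current_player(env)       # whose turn it is
```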
- Beside that we added a fourth argument in the `run`, another important difference is that we used a `MultiAgentManager` policy instead of a simple `RandomPolicy`. The reason is that we want different players use a separate policy and then we can collect different information of them separately.
+ Besides the fact that we added a fourth argument to `run`, another important difference is that we used a `MultiAgentManager` policy instead of a simple `RandomPolicy`. The reason is that we want each player to use a separate policy, so that we can collect information about them separately.
Now let's take a look at the total reward of each player in the above 10 episodes:
"""
@@ -189,9 +189,9 @@ end
md"""
## Tackling the Tic-Tac-Toe Problem with Monte Carlo Prediction
- Actually the Monte Carlo method mention in the first chapter to solve the Tic-Tac-Toe problem is explained in Chapter 5.1, so it's ok if you don't fully understand it right now for the first time. Just keep reading and turn back to this chapter once you finished chapter 5.
+ Actually, the Monte Carlo method mentioned in the first chapter to solve the Tic-Tac-Toe problem is explained in Chapter 5.1, so it's OK if you don't fully understand it right now. Just keep reading and come back to this chapter once you have finished Chapter 5.
- The intuition behind monte carlo prediction that, we use a table to record the estimated gain at each step. If such estimation is accurate, then we can simply look one step ahead and compare which state is of the largest gain. Then we just take the action which will lead to that state to maximize our reward.
+ The intuition behind Monte Carlo prediction is that we use a table to record the estimated gain of each state. If the estimates are accurate, we can simply look one step ahead, compare which reachable state has the largest estimated gain, and take the action that leads to that state to maximize our reward.
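To make that intuition concrete, here is a tiny plain-Julia toy (not the package's implementation): a table of estimated gains keyed by state (states represented as integers for simplicity), an every-visit-style update from a finished episode, and a one-step-lookahead action choice.

```julia
# Toy value table: state => estimated gain.
V = Dict{Int,Float64}()

# After an episode ends with final reward G, nudge the estimate of every
# visited state toward G (every-visit Monte Carlo update with step size α).
function update_table!(V, visited_states, G; α = 0.1)
    for s in visited_states
        V[s] = get(V, s, 0.0) + α * (G - get(V, s, 0.0))
    end
    return V
end

# Look one step ahead: pick the action whose successor state has the largest
# estimated gain. `successor` maps an action to the state it would lead to.
function greedy_action(V, legal_actions, successor)
    best_a, best_v = first(legal_actions), -Inf
    for a in legal_actions
        v = get(V, successor(a), 0.0)
        if v > best_v
            best_a, best_v = a, v
        end
    end
    return best_a
end
```

Roughly speaking, the `TabularVApproximator` and `MonteCarloLearner` introduced below play these two roles in the package.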
### TabularVApproximator
@@ -237,7 +237,7 @@ By using the same interfaces defined in `Flux.jl`, we get a bunch of different o
md"""
### MonteCarloLearner
- In `ReinforcementLearning.jl`, many different variants of monte carlo learners are supported. We'll skip the implementation detail for now. But to convince you that implementing such an algorithm is quite simple and straightforward in this package, we can take a look at the code snippet and compare it with the pseudocode on the book.
+ In `ReinforcementLearning.jl`, many different variants of Monte Carlo learners are supported. We'll skip the implementation details for now. But to convince you that implementing such an algorithm is quite simple and straightforward in this package, we can take a look at the code snippet below and compare it with the pseudocode in the book.
```julia
function _update!(
@@ -353,7 +353,7 @@ run(P, E, StopAfterEpisode(10))
md"""
## Training
- One main question we haven't answer is, **how to train the policy?**
+ One main question we haven't answered is: **how to train the policy?**
Well, the usage is similar to the above one; the only difference is that now we wrap the `policy` in an `Agent`, which is also an `AbstractPolicy`. An `Agent` is `policy + trajectory` (what people usually call an *experience replay buffer*).
"""
@@ -395,7 +395,7 @@ run(policies, E, StopAfterEpisode(100_000))
md"""
The interface is almost the same as the above one. Let me explain what's happening here.
- First, the `policies` is a `MultiAgentManager` and the environment is a `Sequential` one. So at each step, the agent manager forward the `env` to its inner policies. Here each inner policy is an `Agent`. Then it will collect necessary information at each step (here we used a `VectorSARTTrajectory` to tell the agent to collect `state`, `action`, `reward`, and `is_terminated` info). Finally, after each episode, the agent send the `trajectory` to the inner `VBasedPolicy`, and then forwarded to the `MonteCarloLearner` to update the `TabularVApproximator`. Thanks to **multiple dispatch** each step above is fully customizable.
+ First, `policies` is a `MultiAgentManager` and the environment is a `Sequential` one, so at each step the agent manager forwards the `env` to its inner policies. Here each inner policy is an `Agent`, which collects the necessary information at each step (we used a `VectorSARTTrajectory` to tell the agent to collect the `state`, `action`, `reward`, and `is_terminated` info). Finally, after each episode, the agent sends the `trajectory` to the inner `VBasedPolicy`, which forwards it to the `MonteCarloLearner` to update the `TabularVApproximator`. Thanks to **multiple dispatch**, each step above is fully customizable.
"""
# ╔═╡ 52392820-4a54-11eb-03eb-cf339de8e3dc
@@ -432,9 +432,9 @@ end
# ╔═╡ 696c441c-4a56-11eb-39c0-e1d1465ca4f7
md"""
- As you can see, in most cases, we reach a **tie**. But why there're still some cases that they don't reach a tie?
+ As you can see, in most cases, we reach a **tie**. But why are there still some cases that don't reach a tie?
- Is it because we're not training for enough time?
+ Is it because we have not trained for enough time?
*Maybe*. But a more likely reason is that we're still using the `EpsilonGreedyExplorer` in testing mode.
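A tiny plain-Julia sketch of ε-greedy selection (a toy, not the package's `EpsilonGreedyExplorer`) shows why a nonzero ε at test time still produces the occasional non-greedy move:

```julia
# With probability ε pick a random action, otherwise pick the best-valued one.
# Any ε > 0 during evaluation means some games contain deliberate random moves,
# which is enough to turn a would-be tie into a loss for one side.
function epsilon_greedy(action_values::Vector{Float64}, ε::Float64)
    rand() < ε ? rand(1:length(action_values)) : argmax(action_values)
end
```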