@@ -46,7 +46,7 @@ https://pytorch-optimizer.rtfd.io
4646
4747Citation
4848--------
49- Please cite original authors of optimization algorithms. If you like this
49+ Please cite the original authors of the optimization algorithms. If you like this
5050package::
5151
5252 @software{Novik_torchoptimizers,
@@ -57,7 +57,7 @@ package::
5757 version = {1.0.1}
5858 }
5959
60- Or use github feature: "cite this repository" button.
60+ Or use the GitHub "Cite this repository" button.
6161
6262
6363Supported Optimizers
@@ -155,29 +155,29 @@ Supported Optimizers
155155
156156Visualizations
157157--------------
158- Visualizations help us to see how different algorithms deal with simple
158+ Visualizations help us see how different algorithms deal with simple
159159situations like saddle points, local minima, valleys, etc., and may provide
160- interesting insights into inner workings of algorithm. Rosenbrock _ and Rastrigin _
161- benchmark _ functions was selected, because:
160+ interesting insights into the inner workings of an algorithm. The Rosenbrock_ and Rastrigin_
161+ benchmark_ functions were selected (both are sketched in code after the list below) because:
162162
163163* Rosenbrock_ (also known as the banana function) is a non-convex function that has
164- one global minima `(1.0. 1.0) `. The global minimum is inside a long,
165- narrow, parabolic shaped flat valley. To find the valley is trivial. To
166- converge to the global minima , however, is difficult. Optimization
167- algorithms might pay a lot of attention to one coordinate, and have
168- problems to follow valley which is relatively flat.
164+ one global minimum at `(1.0, 1.0) `. The global minimum is inside a long,
165+ narrow, parabolic-shaped flat valley. Finding the valley is trivial.
166+ Converging to the global minimum, however, is difficult. Optimization
167+ algorithms might pay a lot of attention to one coordinate and struggle
168+ to follow the valley, which is relatively flat.
169169
170170 .. image:: https://upload.wikimedia.org/wikipedia/commons/3/32/Rosenbrock_function.svg
171171
172- * Rastrigin _ function is a non-convex and has one global minima in `(0.0, 0.0) `.
172+ * Rastrigin_ is a non-convex function with one global minimum at `(0.0, 0.0) `.
173173 Finding the minimum of this function is a fairly difficult problem due to
174174 its large search space and its large number of local minima.
175175
176176 .. image:: https://upload.wikimedia.org/wikipedia/commons/8/8b/Rastrigin_function.png
177177
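For reference, the two benchmark functions above can be written in a few lines of
PyTorch. This is only an illustrative sketch of their 2-D forms (using the
conventional constants `a=1, b=100 ` for Rosenbrock_ and `A=10 ` for Rastrigin_),
not the exact code used by the example script:

.. code:: python

    import math
    import torch

    def rosenbrock(xy):
        # f(x, y) = (1 - x)^2 + 100 * (y - x^2)^2, global minimum at (1.0, 1.0)
        x, y = xy
        return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

    def rastrigin(xy, A=10.0):
        # f(x, y) = 2*A + sum_i (x_i^2 - A * cos(2 * pi * x_i)), global minimum at (0.0, 0.0)
        x, y = xy
        return (
            2 * A
            + (x ** 2 - A * torch.cos(2 * math.pi * x))
            + (y ** 2 - A * torch.cos(2 * math.pi * y))
        )
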
178- Each optimizer performs `501 ` optimization steps. Learning rate is best one found
179- by hyper parameter search algorithm, rest of tuning parameters are default. It
180- is very easy to extend script and tune other optimizer parameters.
178+ Each optimizer performs `501 ` optimization steps. The learning rate is the best one found
179+ by a hyperparameter search algorithm; the rest of the tuning parameters are defaults. It
180+ is very easy to extend the script and tune other optimizer parameters.
181181
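The benchmark loop itself is only a few lines. An illustrative sketch (the
optimizer, starting point, and learning rate below are placeholders; `rosenbrock `
is the function from the sketch above):

.. code:: python

    import torch

    # Placeholder values: the real script sweeps many optimizers and picks each
    # learning rate with a hyperparameter search.
    xy = torch.tensor([-2.0, 2.0], requires_grad=True)
    optimizer = torch.optim.SGD([xy], lr=1e-3)

    for _ in range(501):
        optimizer.zero_grad()
        loss = rosenbrock(xy)  # function defined in the sketch above
        loss.backward()
        optimizer.step()

    print(xy)  # the recorded trajectory of xy is what the visualizations plot
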
182182
183183.. code::
@@ -187,14 +187,14 @@ is very easy to extend script and tune other optimizer parameters.
187187
188188 Warning
189189-------
190- Do not pick optimizer based on visualizations, optimization approaches
190+ Do not pick an optimizer based on visualizations; optimization approaches
191191have unique properties and may be tailored for different purposes or may
192- require explicit learning rate schedule etc. Best way to find out, is to try one
193- on your particular problem and see if it improves scores.
192+ require an explicit learning rate schedule, etc. The best way to find out is to try
193+ one on your particular problem and see if it improves scores.
194194
195- If you do not know which optimizer to use start with built in SGD/Adam, once
196- training logic is ready and baseline scores are established, swap optimizer and
197- see if there is any improvement.
195+ If you do not know which optimizer to use, start with the built-in SGD/Adam. Once
196+ the training logic is ready and baseline scores are established, swap the optimizer
197+ and see if there is any improvement.
198198
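Swapping the optimizer is a one-line change, since the optimizers in this package
follow the standard `torch.optim ` interface. A small sketch (the toy model, the
learning rate, and the choice of `Yogi ` are illustrative only):

.. code:: python

    import torch
    import torch_optimizer as optim

    model = torch.nn.Linear(10, 2)  # stand-in for your real model

    # Baseline: optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    optimizer = optim.Yogi(model.parameters(), lr=1e-2)
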
199199
200200A2GradExp
@@ -366,7 +366,7 @@ AdaBound
366366
367367AdaMod
368368------
369- AdaMod method restricts the adaptive learning rates with adaptive and momental
369+ The AdaMod method restricts the adaptive learning rates with adaptive and momental
370370upper bounds. The dynamic learning rate bounds are based on the exponential
371371moving averages of the adaptive learning rates themselves, which smooth out
372372unexpected large learning rates and stabilize the training of deep neural networks.
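
A minimal usage sketch (the toy model and hyperparameter values are illustrative;
`beta3 ` is the smoothing factor for the exponential moving average that forms the
learning rate bound):

.. code:: python

    import torch
    import torch_optimizer as optim

    model = torch.nn.Linear(10, 2)  # toy model for illustration
    optimizer = optim.AdaMod(model.parameters(), lr=1e-3, beta3=0.999)

    optimizer.zero_grad()
    loss = model(torch.randn(4, 10)).sum()
    loss.backward()
    optimizer.step()
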
@@ -455,9 +455,9 @@ Adahessian
455455
456456AdamP
457457------
458- AdamP propose a simple and effective solution: at each iteration of Adam optimizer
458+ AdamP proposes a simple and effective solution: at each iteration of the Adam optimizer
459459applied on scale-invariant weights (e.g., Conv weights preceding a BN layer), AdamP
460- remove the radial component (i.e., parallel to the weight vector) from the update vector.
460+ removes the radial component (i.e., parallel to the weight vector) from the update vector.
461461Intuitively, this operation prevents the unnecessary update along the radial direction
462462that only increases the weight norm without contributing to the loss minimization.
463463
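A minimal usage sketch for the scale-invariant case described above (the
conv + batch-norm toy block and the hyperparameter values are illustrative only):

.. code:: python

    import torch
    import torch_optimizer as optim

    # The Conv2d weight is scale-invariant because a BatchNorm layer follows it,
    # which is the situation AdamP is designed for.
    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
        torch.nn.BatchNorm2d(8),
    )
    optimizer = optim.AdamP(model.parameters(), lr=1e-3, weight_decay=1e-2)

    optimizer.zero_grad()
    loss = model(torch.randn(2, 3, 16, 16)).mean()
    loss.backward()
    optimizer.step()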