Хувийн тэмдэглэл (Personal Notes) — Sharavsambuu Gunchinish, http://www.blogger.com/profile/06950810883056147179
Published 2023-11-17, updated 2024-02-29

Шенноны чөтгөр (Shannon's Demon)<p>The concept known in English as Shannon's Demon traces back to <a href="https://en.wikipedia.org/wiki/Claude_Shannon" target="_blank">Claude Shannon</a>'s <a href="https://en.wikipedia.org/wiki/Information_theory" target="_blank">Information Theory</a>, and it expresses the idea that rebalancing a financial portfolio at a regular frequency can be profitable in the long run.<br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4uMUjvhpot9stY8kr3jf_ywD2HtM0tXz4rxXB5m8rLYZ48NPLpU1AJPHRzdSz4pTvHViEeiSCEVZQ1EX-8nj2BWx5KW0rJ3UlYjwNgyX6XHY-CM5Vm9bxbvIy3au6ENuwE90O9ZIkbc2-RzmPJWD6HxDIu1FQy0AOODGOYZs5m3u2FkhYLVyzs16q/s1129/Screenshot_20231117_232523.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1129" data-original-width="855" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4uMUjvhpot9stY8kr3jf_ywD2HtM0tXz4rxXB5m8rLYZ48NPLpU1AJPHRzdSz4pTvHViEeiSCEVZQ1EX-8nj2BWx5KW0rJ3UlYjwNgyX6XHY-CM5Vm9bxbvIy3au6ENuwE90O9ZIkbc2-RzmPJWD6HxDIu1FQy0AOODGOYZs5m3u2FkhYLVyzs16q/w485-h640/Screenshot_20231117_232523.png" width="485" /></a></div><br /><p>It is sometimes called the rebalancing bonus.</p><p>The word "demon" is a metaphor for the curious phenomenon that simply redistributing the assets in a portfolio at a certain frequency generates profit.</p><p>If the strategies managing the capital move independently of one another, you can harvest this effect.</p><p>While some strategies are losing, other strategies make up for those losses, and because rebalancing keeps too much risk from being loaded onto any single strategy, it allows the total cash in the portfolio to keep growing.</p><p>Following this line of thought, an ordinary retail trader should not lean too heavily on a single strategy or a single pair: trade as many different kinds of strategies, instruments, and timeframes as possible, and redistribute the accumulated profit among them, say once a week or once a month, to capture the benefit of this Shannon's Demon effect.</p><p>It does not have to be weekly; any interval of days will do. Analyze your own trading history to determine which rebalancing interval suits you best.</p><p>Likewise, a trading house where several people trade together can build this edge by assembling traders with styles as different as possible and repeatedly redistributing the total profit among them.</p><p>To check that the phenomenon really exists, you can generate 50 strategies with random performance in Python and compare what happens when you rebalance them at a regular frequency against when you do not: the rebalanced portfolio consistently shows a smaller Maximum Drawdown and a higher Sharpe Ratio.
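</p><p>The two-asset version of the effect can be sketched numerically. The snippet below is mine, not from the post (the seed and step count are arbitrary): it simulates a coin-flip asset that doubles or halves each period, so on its own it has zero geometric growth, and shows that a 50/50 asset/cash mix rebalanced every period nevertheless compounds at roughly 6% per period.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
steps = 100_000

# A toy asset that doubles or halves with equal probability each period.
# Its expected arithmetic return is +25% per period, yet its geometric
# growth is zero: log-returns average 0.5*log(2) + 0.5*log(0.5) = 0.
moves = rng.choice([2.0, 0.5], size=steps)

# Buy and hold: average growth per period (computed in log space).
bh_growth = np.exp(np.mean(np.log(moves)))

# Shannon's demon: hold 50% asset / 50% cash and rebalance every period,
# so each period the portfolio value is multiplied by 0.5*move + 0.5*1.0.
reb_growth = np.exp(np.mean(np.log(0.5 * moves + 0.5)))

print(f"buy & hold growth per period: {bh_growth:.4f}")   # ~1.00
print(f"rebalanced growth per period: {reb_growth:.4f}")  # ~1.0607
```

<p>The ~1.0607 figure is just the geometric mean of the two possible one-period outcomes of the rebalanced mix, sqrt(1.5 × 0.75).</p><p>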
</p><p>Moreover, in some cases the unrebalanced portfolio is in a drawdown while the rebalanced one still manages to stay profitable.</p><p>The code is available at the following link:<br /><a href="https://shannon-demon.streamlit.app/">https://shannon-demon.streamlit.app/</a> </p><p><br /></p><pre class="code_syntax" style="background: rgb(255, 255, 255); margin-bottom: 0px; margin-left: 2em; margin-top: 0px;">import pandas as pd
import numpy as np
import quantstats as qs
import streamlit as st
import matplotlib.pyplot as plt
from datetime import timedelta


def simulate_portfolios(start_date, end_date, num_securities, num_days, initial_cash, outliers_percentage):
    # pick num_days random trade dates inside the simulation window
    trade_dates = pd.to_datetime(np.sort(np.random.choice(pd.date_range(start=start_date, end=end_date, periods=num_days + 1), num_days, replace=False)))

    df = pd.DataFrame({'datetime': trade_dates})
    df['datetime'] = pd.to_datetime(df['datetime'])
    df['datetime_'] = df['datetime']
    df = df.set_index('datetime')

    outliers_percentage = outliers_percentage/100.0  # fraction of all returns that are outliers
    outliers_count = int(num_days*outliers_percentage)

    allocated_cash = initial_cash/num_securities
    for idx in range(0, num_securities):
        # uniform daily returns, plus a handful of larger outlier moves
        percentage_changes = np.random.uniform(-0.05, 0.05, num_days).astype(float)
        extreme_returns = np.random.uniform(-0.09, 0.1, outliers_count).astype(float)
        outliers_date = df['datetime_'].sample(n=outliers_count).to_list()
        df[f"pct_change_{idx}"] = percentage_changes
        for outlier_dt, outlier_ret in zip(outliers_date, extreme_returns):
            df.loc[outlier_dt, f"pct_change_{idx}"] = outlier_ret

        df[f"ret_path_{idx}"] = df[f"pct_change_{idx}"].cumsum()
        df[f"cash_path_{idx}"] = (1 + df[f"pct_change_{idx}"]).cumprod() * allocated_cash

    # plot every simulated equity curve
    sim_cols = [col_name for col_name in df.columns if col_name.startswith("cash")]
    fig, ax = plt.subplots()
    for col_name in sim_cols:
        ax.plot(df[col_name])
    fig.autofmt_xdate()
    st.pyplot(fig)

    # buy-and-hold portfolio: just the sum of the individual cash paths
    df['raw_portfolio_cash_path'] = df[sim_cols].sum(axis=1)

    col11, col12 = st.columns(2)
    rebalancing_options = {
        '3D': "3 Day rebalance",
        '4D': "4 Day rebalance",
        '5D': "5 Day rebalance",
        'W' : "Weekly rebalance",
        'M' : "Monthly rebalance",
    }
    with col11:
        rebalancing_frequency = st.selectbox('Rebalancing period:', list(rebalancing_options.keys()),
                                             format_func=lambda option: rebalancing_options[option],
                                             index=list(rebalancing_options.keys()).index("4D"))

    # for simplicity let's do equally weighted allocation
    rebalanced_portfolio_values = []
    rebalanced_dates = []

    current_portfolio_value = initial_cash
    for date, group in df.groupby(pd.Grouper(freq=rebalancing_frequency)):
        group_df = group.copy()
        if group_df.empty:
            continue  # a grouper bin may contain no simulated trade dates
        allocated_cash = current_portfolio_value/num_securities  # equally weighted
        for idx in range(0, num_securities):
            group_df[f"rebalanced_cash_path_{idx}"] = (1 + group_df[f"pct_change_{idx}"]).cumprod() * allocated_cash

        rebalanced_cash_cols = [col_name for col_name in group_df.columns if col_name.startswith("rebalanced_cash_path")]
        current_portfolio_value = group_df.iloc[-1][rebalanced_cash_cols].sum()

        rebalanced_portfolio_values.append(current_portfolio_value)
        rebalanced_dates.append(date)

    col21, col22 = st.columns(2)
    with col21:
        st.markdown("##### Portfolio with rebalancing")
        rebalanced_df = pd.DataFrame(index=rebalanced_dates)
        df.index = pd.to_datetime(df.index)
        rebalanced_df['value'] = rebalanced_portfolio_values
        rebalanced_df['rebalanced_pct_change'] = rebalanced_df['value'].pct_change()
        rebalanced_sr = round(qs.stats.sharpe(returns=rebalanced_df['rebalanced_pct_change']), 2)
        max_dd = round(qs.stats.max_drawdown(rebalanced_df['rebalanced_pct_change']), 2)

        fig, ax = plt.subplots()
        ax.plot(rebalanced_df['value'])
        fig.autofmt_xdate()
        st.pyplot(fig)

        st.text(f"Sharpe Ratio : {rebalanced_sr}")
        st.text(f"Max DD : {max_dd}")
    with col22:
        st.markdown("##### Portfolio with no rebalance")
        fig, ax = plt.subplots()
        ax.plot(df['raw_portfolio_cash_path'])
        fig.autofmt_xdate()
        st.pyplot(fig)

        df['raw_portfolio_pct_change'] = df['raw_portfolio_cash_path'].pct_change()
        raw_sr = round(qs.stats.sharpe(returns=df['raw_portfolio_pct_change']), 2)
        max_dd = round(qs.stats.max_drawdown(df['raw_portfolio_pct_change']), 2)
<span class="line_wrapper" style="counter-increment: line 1;"> st<span style="color: #808030;">.</span>text<span style="color: #808030;">(</span>f<span style="color: #0000e6;">"Sharpe Ratio : {raw_sr}"</span><span style="color: #808030;">)</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> st<span style="color: #808030;">.</span>text<span style="color: #808030;">(</span>f<span style="color: #0000e6;">"Max DD : {max_dd}"</span><span style="color: #808030;">)</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">pass</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">pass</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: maroon; font-weight: bold;">def</span> main<span style="color: #808030;">(</span><span style="color: #808030;">)</span><span style="color: #808030;">:</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: dimgrey;">#st.set_page_config(layout="wide")</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> st<span style="color: #808030;">.</span>markdown<span style="color: #808030;">(</span><span style="color: #0000e6;">"### Demonstration of Shannon's demon"</span><span style="color: #808030;">)</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> col01<span style="color: #808030;">,</span> col02<span style="color: #808030;">,</span> col03<span style="color: #808030;">,</span> col04<span style="color: #808030;">,</span> col05 <span style="color: #808030;">=</span> st<span style="color: #808030;">.</span>columns<span style="color: #808030;">(</span><span style="color: #008c00;">5</span><span style="color: #808030;">)</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">with</span> col01<span style="color: #808030;">:</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> start_date <span style="color: #808030;">=</span> st<span style="color: #808030;">.</span>date_input<span style="color: #808030;">(</span><span style="color: #0000e6;">'Start Date'</span><span style="color: #808030;">,</span> min_value<span style="color: #808030;">=</span><span style="color: #074726;">None</span><span style="color: #808030;">,</span> max_value<span style="color: #808030;">=</span><span style="color: #074726;">None</span><span style="color: #808030;">,</span> key<span style="color: #808030;">=</span><span style="color: #074726;">None</span><span style="color: #808030;">)</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">with</span> col02<span style="color: #808030;">:</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> num_securities <span style="color: #808030;">=</span> st<span style="color: #808030;">.</span>number_input<span style="color: #808030;">(</span><span style="color: #0000e6;">'Number of securities'</span><span style="color: #808030;">,</span> min_value<span style="color: #808030;">=</span><span style="color: #008c00;">2</span><span style="color: #808030;">,</span> max_value<span style="color: #808030;">=</span><span style="color: #008c00;">50</span><span style="color: #808030;">,</span> step<span style="color: #808030;">=</span><span style="color: #008c00;">1</span><span style="color: #808030;">,</span> value<span style="color: #808030;">=</span><span style="color: #008c00;">50</span><span style="color: #808030;">)</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">with</span> col03<span style="color: #808030;">:</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> num_days <span style="color: #808030;">=</span> st<span style="color: #808030;">.</span>number_input<span style="color: #808030;">(</span><span style="color: #0000e6;">"Days"</span><span style="color: #808030;">,</span> min_value<span style="color: #808030;">=</span><span style="color: #008c00;">120</span><span style="color: #808030;">,</span> max_value<span style="color: #808030;">=</span><span style="color: #008c00;">1500</span><span style="color: #808030;">,</span> step<span style="color: #808030;">=</span><span style="color: #008c00;">30</span><span style="color: #808030;">,</span> value<span style="color: #808030;">=</span><span style="color: #008c00;">1000</span><span style="color: #808030;">)</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">with</span> col04<span style="color: #808030;">:</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> outliers_percentage <span style="color: #808030;">=</span> st<span style="color: #808030;">.</span>number_input<span style="color: #808030;">(</span><span style="color: #0000e6;">"Outliers percentage"</span><span style="color: #808030;">,</span> min_value<span style="color: #808030;">=</span><span style="color: #008c00;">1</span><span style="color: #808030;">,</span> max_value<span style="color: #808030;">=</span><span style="color: #008c00;">100</span><span style="color: #808030;">,</span> step<span style="color: #808030;">=</span><span style="color: #008c00;">1</span><span style="color: #808030;">,</span> value<span style="color: #808030;">=</span><span style="color: #008c00;">10</span><span style="color: #808030;">)</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">with</span> col05<span style="color: #808030;">:</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> initial_cash <span style="color: #808030;">=</span> st<span style="color: #808030;">.</span>number_input<span style="color: #808030;">(</span><span style="color: #0000e6;">"Initial cash $"</span><span style="color: #808030;">,</span> min_value<span style="color: #808030;">=</span><span style="color: #008c00;">10000</span><span style="color: #808030;">,</span> step<span style="color: #808030;">=</span><span style="color: #008c00;">100</span><span style="color: #808030;">,</span> value<span style="color: #808030;">=</span><span style="color: #008c00;">10000</span><span style="color: #808030;">)</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> end_date <span style="color: #808030;">=</span> <span style="color: #808030;">(</span>start_date <span style="color: #44aadd;">+</span> timedelta<span style="color: #808030;">(</span>days<span style="color: #808030;">=</span>num_days<span style="color: #808030;">)</span><span style="color: #808030;">)</span> <span style="color: maroon; font-weight: bold;">if</span> start_date <span style="color: maroon; font-weight: bold;">else</span> <span style="color: #074726;">None</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> simulate_portfolios<span style="color: #808030;">(</span>start_date<span style="color: #808030;">=</span>start_date<span style="color: #808030;">,</span> end_date<span style="color: #808030;">=</span>end_date<span style="color: #808030;">,</span> num_securities<span style="color: #808030;">=</span>num_securities<span style="color: #808030;">,</span> num_days<span style="color: #808030;">=</span>num_days<span style="color: #808030;">,</span> initial_cash<span style="color: #808030;">=</span>initial_cash<span style="color: #808030;">,</span> outliers_percentage<span style="color: #808030;">=</span>outliers_percentage<span style="color: #808030;">)</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: maroon; font-weight: bold;">if</span> <span style="color: #074726;">__name__</span> <span style="color: #44aadd;">==</span> <span style="color: #0000e6;">'__main__'</span><span style="color: #808030;">:</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> main<span style="color: #808030;">(</span><span style="color: #808030;">)</span></span></pre><p><br /></p><p>References</p><p>Rebalancing with Shannon's Demon<br /><a href="https://thepfengineer.com/2016/04/25/rebalancing-with-shannons-demon/">https://thepfengineer.com/2016/04/25/rebalancing-with-shannons-demon/</a></p><p>How returns can be created out of thin air<br /><a href="https://www.richmondquant.com/news/2021/9/21/shannons-demon-amp-how-portfolio-returns-can-be-created-out-of-thin-air">https://www.richmondquant.com/news/2021/9/21/shannons-demon-amp-how-portfolio-returns-can-be-created-out-of-thin-air</a></p><p><br /></p>
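The rebalancing bonus itself can be shown with a much smaller, self-contained sketch than the Streamlit demo above. The asset here is a hypothetical coin flip (double or halve each period, zero expected drift in the median path), not one of the simulated securities; with a 50/50 cash split rebalanced every period, the portfolio compounds positively anyway:

```python
import random

# Minimal Shannon's demon sketch (hypothetical coin-flip asset):
# buy-and-hold multiplies by 2 or 0.5 with equal probability, so its
# median growth is flat, while a 50% cash / 50% asset portfolio that
# rebalances every period has a per-period growth factor of 1.5 or 0.75,
# whose geometric mean sqrt(1.125) ~ 1.06 is greater than 1.
def simulate(periods=10_000, seed=42):
    rng = random.Random(seed)
    hold = 1.0         # 100% in the risky asset, never rebalanced
    rebalanced = 1.0   # 50% cash / 50% asset, rebalanced every period
    for _ in range(periods):
        r = 2.0 if rng.random() < 0.5 else 0.5
        hold *= r
        # half the portfolio sits in cash, half moves with the asset,
        # then the new total is split 50/50 again (the rebalance)
        rebalanced *= 0.5 + 0.5 * r
    return hold, rebalanced

hold, rebalanced = simulate()
print(f"buy & hold : {hold:.4e}")
print(f"rebalanced : {rebalanced:.4e}")
```

Run long enough, the rebalanced path dwarfs buy-and-hold, even though neither the asset nor cash has any edge on its own; the excess growth comes purely from the periodic reallocation, which is exactly the "demon".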
Sharavsambuu Gunchinishhttp://www.blogger.com/profile/06950810883056147179noreply@blogger.com0tag:blogger.com,1999:blog-1457877875009527488.post-11829503530064181122023-10-16T05:46:00.023+08:002023-10-21T17:51:20.670+08:00How to learn 3D graphics programming (2024 edition)<p>When I was in the 10th grade, there were computer newspapers and magazines such as "MN" and "Computer Times". </p><p>Back then, the internet was just a handful of computers connected through dial-up modems at the provincial post office.</p><p>In that environment, those newspapers and magazines were the only way to learn anything interesting about computers.</p><p>After reading in one article that the very best programmers are game developers, I started dreaming of becoming a game developer.</p><p>That interest never quite turned me into a game developer, but it did land me on this thing called 3D graphics programming, which I still poke at now and then as a hobby.</p><p>As a byproduct I picked up small pieces of knowledge, such as how to work with 2D matrices and how to project vectors from one space into another, and later all of this helped me a great deal in understanding the currently popular Deep Learning technologies, because Deep Learning is, through and through, operations on matrices (higher-order, multi-dimensional tensors).</p><p>So my conclusion is that you do not necessarily have to reach the goal itself: the byproducts of the attempts you lived through become a lever for acquiring new kinds of knowledge. 
</p><p><br /></p><p>Around 2013 I believed that to learn 3D graphics programming you had to learn some graphics API, be it OpenGL or DirectX.</p><p>Now, toward the end of 2023, I think none of that is strictly required: all you really need is a 2D array and a handful of matrix operations.</p><p>A 2D array can represent the pixels to be drawn, and you can display it on screen with SDL2 or any other windowing library.</p><p>Rendering 3D graphics ultimately comes down to a single question: how do you compute the color of each pixel on the screen?</p><p>When many pixels come together, our brain perceives shapes that look like 3D objects.</p><p>To compute a given pixel, you describe the objects in 3D space, their positions, their surfaces, the light sources, and so on, using appropriate data structures.</p><p>To run a lighting, i.e. rendering, algorithm, you can use a variety of illumination models. 
These include:</p><p>- Flat shading<br />- Phong lighting<br />- Blinn-Phong lighting<br />- Physically Based Rendering<br />...<br />and many more such models.</p><p>If you organize the per-pixel code and the data representations well, in whichever programming language you are comfortable with, implementing and evaluating all these different lighting models becomes quite easy.<br /><br />Learning 3D graphics programming is, in the end, learning how to implement these lighting algorithms.</p><p>Furthermore, rendering 3D graphics without any graphics card or accelerator device is called Software Rendering.</p><p>Once you have implemented Software Rendering, you will have no trouble working with the various graphics APIs, because all the fundamental concepts boil down to the same thing.</p><p>Following this thread, it quietly dawned on me that a 3D accelerator, or what we habitually call a graphics card, does not have to be used only for computing rendered pixels, <br /><br />that the 3D graphics pipeline can be implemented as an abstraction on other kinds of parallel computing accelerator devices as well,</p><p>that today's graphics cards likewise do not have to be used only for rendering and can serve other scientific computing problems, <br /><br />and that in general, whether it is graphics rendering, physics simulation, or machine learning, all of these are gradually converging on a single standard parallel computing API. 
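As a rough illustration of one model from that list, here is a minimal Blinn-Phong evaluation for a single surface point. It is written in Python for brevity rather than the C++ used later in this post, and all vectors and material constants are made-up illustrative values:

```python
import math

def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return tuple(x / length for x in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def blinn_phong(normal, to_light, to_eye, shininess=32.0):
    # Classic Blinn-Phong: the diffuse term comes from N.L, the specular
    # term from N.H, where H is the half-vector between the light and
    # view directions. Surfaces facing away from the light get nothing.
    n = normalize(normal)
    l = normalize(to_light)
    v = normalize(to_eye)
    diffuse = max(dot(n, l), 0.0)
    specular = 0.0
    if diffuse > 0.0:
        h = normalize(tuple(a + b for a, b in zip(l, v)))
        specular = max(dot(n, h), 0.0) ** shininess
    return diffuse, specular

# illustrative values only
d, s = blinn_phong(normal=(0, 0, 1), to_light=(0, 0, 1), to_eye=(0, 0, 1))
print(d, s)  # light and eye straight on: diffuse = 1.0, specular = 1.0
```

In a software renderer this function would run once per rasterized fragment, with the result scaled by light and material colors; the other models in the list differ mainly in how they compute these two terms.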
<br /><br />One striking example of this is the new kind of API called WebGPU.<br /><br />So if you first implement everything yourself in any programming language you like, rendering in software-rendering mode, doing the 3D lighting, and so on, you can later take the parallelizable parts of your algorithms and accelerate and optimize them step by step with WebGPU using a parallel programming model.<br /><br />Parts that can be accelerated include:<br />- Processing 3D vertices and mapping vertex vectors between spaces, for example<br /><span> the chain of successive transforms Model -> World -> View -> Projection -> NDC -> Screen -> Canvas<br /></span>- Gathering the many triangles that describe a 3D object's surface and computing each triangle's rasterization in parallel<br />- Running, in parallel, the shader that operates on every fragment, i.e. every surface pixel filled in by rasterization<br /><br />and so on.</p><p><br /></p><p>Once you have built your own parallel graphics pipeline this way, you will have a wonderful opportunity to implement and experiment with every kind of rendering technique used in modern games.</p><p>For example, around the 2010s a technique called Deferred Rendering was a near-standard in almost every video game, thanks to its strength of processing very many lights in parallel on the GPU; however, the algorithm does not work on light-transmitting surfaces such as windows and glass, and it requires processing very large Render Target frames, known as the G-Buffer, on the GPU.<br /><br />Today, with the arrival of new APIs such as Vulkan and DirectX12, draw calls can be issued in parallel, and issuing a draw call has become a very cheap operation. 
Moreover, the GPU's own internal bandwidth limit on data processing is becoming the bottleneck, i.e. the drag on rendering performance, for modern games.</p><p>As a result, the game industry is visibly moving back from Deferred Rendering toward hybrid Forward Rendering techniques. Hybrid Forward Rendering means dividing the screen into a grid and running traditional Forward Rendering only with the lights relevant to each grid cell, executing the lighting algorithm only where it is needed; as a bonus, rendering light-transmitting surfaces such as the glass mentioned earlier is no longer a problem.<br />
<br />If this interests you, look up Clustered Forward Rendering, Tiled Forward Rendering, Forward+, and the like.</p>
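The tiled idea just described can be sketched very roughly: split the screen into tiles, then record for each tile only the lights whose screen-space area of influence overlaps it, so shading later loops over short per-tile light lists instead of every light. The tile size and the light data below are illustrative assumptions, not code from any engine:

```python
# Rough sketch of tiled light culling. Each light is (x, y, radius) in
# screen space; its axis-aligned bounding square decides which tiles it
# is registered in. Real implementations do this per-tile on the GPU.
TILE = 16

def cull_lights(width, height, lights):
    tiles_x = (width + TILE - 1) // TILE
    tiles_y = (height + TILE - 1) // TILE
    tile_lights = [[[] for _ in range(tiles_x)] for _ in range(tiles_y)]
    for idx, (lx, ly, radius) in enumerate(lights):
        # tile range touched by the light's bounding square, clamped to screen
        x0 = max(int((lx - radius) // TILE), 0)
        x1 = min(int((lx + radius) // TILE), tiles_x - 1)
        y0 = max(int((ly - radius) // TILE), 0)
        y1 = min(int((ly + radius) // TILE), tiles_y - 1)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                tile_lights[ty][tx].append(idx)
    return tile_lights

# two hypothetical lights on a 320x320 canvas
lights = [(40.0, 40.0, 10.0), (300.0, 300.0, 8.0)]
grid = cull_lights(320, 320, lights)
print(grid[2][2], grid[18][18])  # each light lands only in its own corner's tiles
```

The forward shading pass then reads `grid[ty][tx]` for the fragment's tile and evaluates only those lights, which is why the approach scales to very large light counts.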
<p>
</p><center>
<iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="" frameborder="0" height="315" src="https://www.youtube.com/embed/n5OiqJP2f7w?si=DUsZV2aAaQ8BBesV" title="YouTube video player" width="560"></iframe>
</center>
<p></p>
<p>The biggest examples of this new technique can be seen in the id Tech 7 game engine and in the game Doom Eternal.</p>
<center>
<iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="" frameborder="0" height="315" src="https://www.youtube.com/embed/UsmqWSZpgJY?si=TBP6H3UoVX1n8U9L" title="YouTube video player" width="560"></iframe>
</center>
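Going back to the vertex-processing step from the acceleration list above, the Model -> World -> View -> Projection -> NDC -> Screen chain can be sketched in a few lines. The matrices and the toy perspective below are illustrative assumptions (a plain translation camera and a simplified divide-by-z projection, not a full frustum matrix):

```python
# Sketch of the vertex transform chain: Model -> World -> View ->
# Projection -> NDC (perspective divide) -> Screen. All values illustrative.

def mat_vec(m, v):
    # multiply a 4x4 row-major matrix by a 4-component column vector
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def translation(tx, ty, tz):
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

def perspective_simple(d=1.0):
    # toy perspective: w' = z / d, so after the divide x becomes x * d / z
    return [[1, 0, 0, 0],
            [0, 1, 0, 0],
            [0, 0, 1, 0],
            [0, 0, 1 / d, 0]]

def to_screen(ndc_x, ndc_y, width, height):
    # NDC [-1, 1] to pixel coordinates, y flipped so +y points up on screen
    sx = (ndc_x + 1) * 0.5 * width
    sy = (1 - ndc_y) * 0.5 * height
    return sx, sy

vertex_model = [0.5, 0.5, 0.0, 1.0]                   # vertex in model space
world = mat_vec(translation(0, 0, 0), vertex_model)   # model -> world
view  = mat_vec(translation(0, 0, 2), world)          # world -> view (camera)
clip  = mat_vec(perspective_simple(), view)           # view -> clip
ndc   = [c / clip[3] for c in clip[:3]]               # perspective divide
sx, sy = to_screen(ndc[0], ndc[1], 320, 320)          # NDC -> screen
print(ndc, (sx, sy))
```

Each arrow in the chain is just one of these matrix multiplies (plus the one divide), which is exactly why the whole stage parallelizes so naturally across vertices on a GPU or any other accelerator.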
<br />
<br />
Introduction to Vulkan
<br /><br />
<center>
<iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="" frameborder="0" height="315" src="https://www.youtube.com/embed/KN9nHo9kvZs?si=h1NI9phbBbbGr4p5" title="YouTube video player" width="560"></iframe>
</center>
<br />
<br />
Introduction to WebGPU
<br /><br />
<center>
<iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="" frameborder="0" height="315" src="https://www.youtube.com/embed/Hm2_bH_8j3k?si=mgKCvkdHWIFBnnma" title="YouTube video player" width="560"></iframe>
</center>
<br />
<br />What you can do with WebGPU? By Corentin Wallez, François Beaufort<div>
<br />
<center>
<iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="" frameborder="0" height="315" src="https://www.youtube.com/embed/RR4FZ9L4AF4?si=jQotyZGno-of9leB" title="YouTube video player" width="560"></iframe>
</center>
<br />
<br /><p><br /></p><p>As an example for this post, let me show how a simple shader program can be implemented purely in C++20, without using any graphics API such as OpenGL.<br /><br /><a href="https://github.com/sharavsambuu/leisure-software-renderer">https://github.com/sharavsambuu/leisure-software-renderer</a> <br /><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggNpyRqmTl79-RzelFCBDGLxbmsLp4Ar3KRCN0pIefp7wKkIvezPjA45At-W8tdJlNyGaagS_Y9mQGb87yx62Zg3wEOaLESdmERIdi_ggFbLDPRawBlAiOsZRx-Mo7H334l9HqjLGLPcOW3QdZPAoXblcOboloaG_JKzZtkjKLZpFfJsYG9k4vyftL/s616/Peek%202023-10-17%2016-30.gif" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="616" data-original-width="591" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggNpyRqmTl79-RzelFCBDGLxbmsLp4Ar3KRCN0pIefp7wKkIvezPjA45At-W8tdJlNyGaagS_Y9mQGb87yx62Zg3wEOaLESdmERIdi_ggFbLDPRawBlAiOsZRx-Mo7H334l9HqjLGLPcOW3QdZPAoXblcOboloaG_JKzZtkjKLZpFfJsYG9k4vyftL/s320/Peek%202023-10-17%2016-30.gif" width="307" /></a></div><br /><p><br /></p>
<div><br /></div><div><pre class="code_syntax" style="background: rgb(255, 255, 255); counter-reset: line 0; margin-bottom: 0px; margin-left: 2em; margin-top: 0px;"><span class="line_wrapper" style="counter-increment: line 1;"><span style="color: #004a43;">#</span><span style="color: #004a43;">include </span><span style="color: maroon;"><</span><span style="color: #40015a;">SDL2/SDL.h</span><span style="color: maroon;">></span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: #004a43;">#</span><span style="color: #004a43;">include </span><span style="color: maroon;"><</span><span style="color: #40015a;">glm/glm.hpp</span><span style="color: maroon;">></span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: #004a43;">#</span><span style="color: #004a43;">include </span><span style="color: maroon;"><</span><span style="color: #40015a;">glm/gtc/noise.hpp</span><span style="color: maroon;">></span><span style="color: #004a43;"> </span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: #004a43;">#</span><span style="color: #004a43;">include </span><span style="color: maroon;"><</span><span style="color: #40015a;">algorithm</span><span style="color: maroon;">></span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: #004a43;">#</span><span style="color: #004a43;">include </span><span style="color: maroon;"><</span><span style="color: #40015a;">string</span><span style="color: maroon;">></span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: #004a43;">#</span><span style="color: #004a43;">include </span><span style="color: maroon;"><</span><span style="color: #40015a;">vector</span><span style="color: maroon;">></span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: #004a43;">#</span><span style="color: #004a43;">include </span><span style="color: maroon;"><</span><span style="color: #40015a;">iostream</span><span style="color: maroon;">></span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: #004a43;">#</span><span style="color: #004a43;">include </span><span style="color: maroon;"><</span><span style="color: #40015a;">array</span><span style="color: maroon;">></span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: #004a43;">#</span><span style="color: #004a43;">include </span><span style="color: maroon;"><</span><span style="color: #40015a;">cstdlib</span><span style="color: maroon;">></span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: #004a43;">#</span><span style="color: #004a43;">include </span><span style="color: maroon;"><</span><span style="color: #40015a;">cmath</span><span style="color: maroon;">></span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: #004a43;">#</span><span style="color: #004a43;">include </span><span style="color: maroon;"><</span><span style="color: #40015a;">tuple</span><span style="color: maroon;">></span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: #004a43;">#</span><span style="color: #004a43;">include </span><span style="color: maroon;"><</span><span style="color: #40015a;">thread</span><span style="color: maroon;">></span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: #004a43;">#</span><span style="color: #004a43;">include </span><span style="color: maroon;"><</span><span style="color: #40015a;">mutex</span><span style="color: maroon;">></span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: #004a43;">#</span><span style="color: #004a43;">include </span><span style="color: maroon;">"</span><span style="color: #40015a;">shs_renderer.hpp</span><span style="color: maroon;">"</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: #004a43;">#</span><span style="color: #004a43;">define</span><span style="color: #004a43;"> FRAMES_PER_SECOND 60</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: #004a43;">#</span><span style="color: #004a43;">define</span><span style="color: #004a43;"> WINDOW_WIDTH 600</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: #004a43;">#</span><span style="color: #004a43;">define</span><span style="color: #004a43;"> WINDOW_HEIGHT 600</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: #004a43;">#</span><span style="color: #004a43;">define</span><span style="color: #004a43;"> CANVAS_WIDTH 320</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: #004a43;">#</span><span style="color: #004a43;">define</span><span style="color: #004a43;"> CANVAS_HEIGHT 320</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: #004a43;">#</span><span style="color: #004a43;">define</span><span style="color: #004a43;"> CONCURRENCY_COUNT 8</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: #004a43;">#</span><span style="color: #004a43;">define</span><span style="color: #004a43;"> NUM_OCTAVES 5</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;">glm<span style="color: purple;">::</span>vec4 rescale_vec4_1_255<span style="color: #808030;">(</span><span style="color: maroon; font-weight: bold;">const</span> glm<span style="color: purple;">::</span>vec4 <span style="color: #808030;">&</span>input_vec<span style="color: #808030;">)</span> <span style="color: purple;">{</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> glm<span style="color: purple;">::</span>vec4 clamped_value <span style="color: #808030;">=</span> glm<span style="color: purple;">::</span>clamp<span style="color: #808030;">(</span>input_vec<span style="color: #808030;">,</span> <span style="color: green;">0.0</span><span style="color: #006600;">f</span><span style="color: #808030;">,</span> <span style="color: green;">1.0</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> glm<span style="color: purple;">::</span>vec4 scaled_value <span style="color: #808030;">=</span> clamped_value <span style="color: #808030;">*</span> <span style="color: green;">255.0</span><span style="color: #006600;">f</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">return</span> scaled_value<span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: purple;">}</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: maroon; font-weight: bold;">float</span> random<span style="color: #808030;">(</span><span style="color: maroon; font-weight: bold;">const</span> glm<span style="color: purple;">::</span>vec2<span style="color: #808030;">&</span> _st<span style="color: #808030;">)</span> <span style="color: purple;">{</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">return</span> glm<span style="color: purple;">::</span>fract<span style="color: #808030;">(</span>glm<span style="color: purple;">::</span><span style="color: #603000;">sin</span><span style="color: #808030;">(</span>glm<span style="color: purple;">::</span>dot<span style="color: #808030;">(</span>_st<span style="color: #808030;">,</span> glm<span style="color: purple;">::</span>vec2<span style="color: #808030;">(</span><span style="color: green;">12.9898</span><span style="color: #006600;">f</span><span style="color: #808030;">,</span> <span style="color: green;">78.233</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: #808030;">)</span><span style="color: #808030;">)</span> <span style="color: #808030;">*</span> <span style="color: green;">43758.5453123</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: purple;">}</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: maroon; font-weight: bold;">float</span> noise<span style="color: #808030;">(</span><span style="color: maroon; font-weight: bold;">const</span> glm<span style="color: purple;">::</span>vec2<span style="color: #808030;">&</span> _st<span style="color: #808030;">)</span> <span style="color: purple;">{</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> glm<span style="color: purple;">::</span>vec2 i <span style="color: #808030;">=</span> glm<span style="color: purple;">::</span><span style="color: #603000;">floor</span><span style="color: #808030;">(</span>_st<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> glm<span style="color: purple;">::</span>vec2 f <span style="color: #808030;">=</span> glm<span style="color: purple;">::</span>fract<span style="color: #808030;">(</span>_st<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: dimgrey;">// Four corners in 2D of a tile</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">float</span> a <span style="color: #808030;">=</span> random<span style="color: #808030;">(</span>i<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">float</span> b <span style="color: #808030;">=</span> random<span style="color: #808030;">(</span>i <span style="color: #808030;">+</span> glm<span style="color: purple;">::</span>vec2<span style="color: #808030;">(</span><span style="color: green;">1.0</span><span style="color: #006600;">f</span><span style="color: #808030;">,</span> <span style="color: green;">0.0</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">float</span> c <span style="color: #808030;">=</span> random<span style="color: #808030;">(</span>i <span style="color: #808030;">+</span> glm<span style="color: purple;">::</span>vec2<span style="color: #808030;">(</span><span style="color: green;">0.0</span><span style="color: #006600;">f</span><span style="color: #808030;">,</span> <span style="color: green;">1.0</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">float</span> d <span style="color: #808030;">=</span> random<span style="color: #808030;">(</span>i <span style="color: #808030;">+</span> glm<span style="color: purple;">::</span>vec2<span style="color: #808030;">(</span><span style="color: green;">1.0</span><span style="color: #006600;">f</span><span style="color: #808030;">,</span> <span style="color: green;">1.0</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> glm<span style="color: purple;">::</span>vec2 u <span style="color: #808030;">=</span> f <span style="color: #808030;">*</span> f <span style="color: #808030;">*</span> <span style="color: #808030;">(</span><span style="color: green;">3.0</span><span style="color: #006600;">f</span> <span style="color: #808030;">-</span> <span style="color: green;">2.0</span><span style="color: #006600;">f</span> <span style="color: #808030;">*</span> f<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">return</span> glm<span style="color: purple;">::</span>mix<span style="color: #808030;">(</span>a<span style="color: #808030;">,</span> b<span style="color: #808030;">,</span> u<span style="color: #808030;">.</span>x<span style="color: #808030;">)</span> <span style="color: #808030;">+</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: #808030;">(</span>c <span style="color: #808030;">-</span> a<span style="color: #808030;">)</span> <span style="color: #808030;">*</span> u<span style="color: #808030;">.</span>y <span style="color: #808030;">*</span> <span style="color: #808030;">(</span><span style="color: green;">1.0</span><span style="color: #006600;">f</span> <span style="color: #808030;">-</span> u<span style="color: #808030;">.</span>x<span style="color: #808030;">)</span> <span style="color: #808030;">+</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: #808030;">(</span>d <span style="color: #808030;">-</span> b<span style="color: #808030;">)</span> <span style="color: #808030;">*</span> u<span style="color: #808030;">.</span>x <span style="color: #808030;">*</span> u<span style="color: #808030;">.</span>y<span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: purple;">}</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: maroon; font-weight: bold;">float</span> fbm<span style="color: #808030;">(</span><span style="color: maroon; font-weight: bold;">const</span> glm<span style="color: purple;">::</span>vec2<span style="color: #808030;">&</span> st<span style="color: #808030;">)</span> <span style="color: purple;">{</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> glm<span style="color: purple;">::</span>vec2 _st <span style="color: #808030;">=</span> st<span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">float</span> v <span style="color: #808030;">=</span> <span style="color: green;">0.0</span><span style="color: #006600;">f</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">float</span> a <span style="color: #808030;">=</span> <span style="color: green;">0.5</span><span style="color: #006600;">f</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> glm<span style="color: purple;">::</span>vec2 shift<span style="color: #808030;">(</span><span style="color: green;">100.0</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> </span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: dimgrey;">// Rotate to reduce axial bias</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> glm<span style="color: purple;">::</span>mat2 rot<span style="color: #808030;">(</span><span style="color: #603000;">cos</span><span style="color: #808030;">(</span><span style="color: green;">0.5</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: #808030;">,</span> <span style="color: #603000;">sin</span><span style="color: #808030;">(</span><span style="color: green;">0.5</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: #808030;">,</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: #808030;">-</span><span style="color: #603000;">sin</span><span style="color: #808030;">(</span><span style="color: green;">0.5</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: #808030;">,</span> <span style="color: #603000;">cos</span><span style="color: #808030;">(</span><span style="color: green;">0.5</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">for</span> <span style="color: #808030;">(</span><span style="color: maroon; font-weight: bold;">int</span> i <span style="color: #808030;">=</span> <span style="color: #008c00;">0</span><span style="color: purple;">;</span> i <span style="color: #808030;"><</span> NUM_OCTAVES<span style="color: purple;">;</span> <span style="color: #808030;">+</span><span style="color: #808030;">+</span>i<span style="color: #808030;">)</span> <span style="color: purple;">{</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> v <span style="color: #808030;">+</span><span style="color: #808030;">=</span> a <span style="color: #808030;">*</span> glm<span style="color: purple;">::</span>simplex<span style="color: #808030;">(</span>_st<span style="color: #808030;">)</span><span style="color: purple;">;</span> <span style="color: dimgrey;">// simplex noise from glm/gtc/noise.hpp (the value-noise helper above is an alternative)</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> _st <span style="color: #808030;">=</span> rot <span style="color: #808030;">*</span> _st <span style="color: #808030;">*</span> <span style="color: green;">2.0</span><span style="color: #006600;">f</span> <span style="color: #808030;">+</span> shift<span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> a <span style="color: #808030;">*</span><span style="color: #808030;">=</span> <span style="color: green;">0.5</span><span style="color: #006600;">f</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: purple;">}</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">return</span> v<span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: purple;">}</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;">glm<span style="color: purple;">::</span>vec4 fragment_shader<span style="color: #808030;">(</span>glm<span style="color: purple;">::</span>vec2 uniform_uv<span style="color: #808030;">,</span> <span style="color: maroon; font-weight: bold;">float</span> uniform_time<span style="color: #808030;">)</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: purple;">{</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> glm<span style="color: purple;">::</span>vec2 st <span style="color: #808030;">=</span> <span style="color: #808030;">(</span>uniform_uv<span style="color: #808030;">/</span>glm<span style="color: purple;">::</span>vec2<span style="color: #808030;">(</span>CANVAS_WIDTH<span style="color: #808030;">,</span> CANVAS_HEIGHT<span style="color: #808030;">)</span><span style="color: #808030;">)</span><span style="color: #808030;">*</span><span style="color: green;">3.0</span><span style="color: #006600;">f</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> st <span style="color: #808030;">+</span><span style="color: #808030;">=</span> <span style="color: maroon; font-weight: bold;">float</span><span style="color: #808030;">(</span>glm<span style="color: purple;">::</span><span style="color: #603000;">abs</span><span style="color: #808030;">(</span>glm<span style="color: purple;">::</span><span style="color: #603000;">sin</span><span style="color: #808030;">(</span>uniform_time<span style="color: #808030;">*</span><span style="color: green;">0.1</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: #808030;">*</span><span style="color: green;">3.0</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: #808030;">)</span><span style="color: #808030;">*</span>st<span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> glm<span style="color: purple;">::</span>vec3 color<span style="color: #808030;">(</span><span style="color: green;">0.0</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> glm<span style="color: purple;">::</span>vec2 q<span style="color: #808030;">(</span><span style="color: green;">0.0</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> q<span style="color: #808030;">.</span>x <span style="color: #808030;">=</span> fbm<span style="color: #808030;">(</span>st <span style="color: #808030;">+</span> <span style="color: green;">0.00</span><span style="color: #006600;">f</span> <span style="color: #808030;">*</span> uniform_time<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> q<span style="color: #808030;">.</span>y <span style="color: #808030;">=</span> fbm<span style="color: #808030;">(</span>st <span style="color: #808030;">+</span> glm<span style="color: purple;">::</span>vec2<span style="color: #808030;">(</span><span style="color: green;">1.0</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> glm<span style="color: purple;">::</span>vec2 r<span style="color: #808030;">(</span><span style="color: green;">0.0</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> r<span style="color: #808030;">.</span>x <span style="color: #808030;">=</span> fbm<span style="color: #808030;">(</span>st <span style="color: #808030;">+</span> <span style="color: green;">1.0</span><span style="color: #006600;">f</span> <span style="color: #808030;">*</span> q <span style="color: #808030;">+</span> glm<span style="color: purple;">::</span>vec2<span style="color: #808030;">(</span><span style="color: green;">1.7</span><span style="color: #006600;">f</span><span style="color: #808030;">,</span> <span style="color: green;">9.2</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span> <span style="color: #808030;">+</span> <span style="color: green;">0.15</span><span style="color: #006600;">f</span> <span style="color: #808030;">*</span> uniform_time<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> r<span style="color: #808030;">.</span>y <span style="color: #808030;">=</span> fbm<span style="color: #808030;">(</span>st <span style="color: #808030;">+</span> <span style="color: green;">1.0</span><span style="color: #006600;">f</span> <span style="color: #808030;">*</span> q <span style="color: #808030;">+</span> glm<span style="color: purple;">::</span>vec2<span style="color: #808030;">(</span><span style="color: green;">8.3</span><span style="color: #006600;">f</span><span style="color: #808030;">,</span> <span style="color: green;">2.8</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span> <span style="color: #808030;">+</span> <span style="color: green;">0.126</span><span style="color: #006600;">f</span> <span style="color: #808030;">*</span> uniform_time<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">float</span> f <span style="color: #808030;">=</span> fbm<span style="color: #808030;">(</span>st <span style="color: #808030;">+</span> r<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> color <span style="color: #808030;">=</span> glm<span style="color: purple;">::</span>mix<span style="color: #808030;">(</span>glm<span style="color: purple;">::</span>vec3<span style="color: #808030;">(</span><span style="color: green;">0.101961</span><span style="color: #006600;">f</span><span style="color: #808030;">,</span> <span style="color: green;">0.619608</span><span style="color: #006600;">f</span><span style="color: #808030;">,</span> <span style="color: green;">0.666667</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: #808030;">,</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> glm<span style="color: purple;">::</span>vec3<span style="color: #808030;">(</span><span style="color: green;">0.666667</span><span style="color: #006600;">f</span><span style="color: #808030;">,</span> <span style="color: green;">0.666667</span><span style="color: #006600;">f</span><span style="color: #808030;">,</span> <span style="color: green;">0.498039</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: #808030;">,</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> glm<span style="color: purple;">::</span>clamp<span style="color: #808030;">(</span><span style="color: #808030;">(</span>f <span style="color: #808030;">*</span> f<span style="color: #808030;">)</span> <span style="color: #808030;">*</span> <span style="color: green;">4.0</span><span style="color: #006600;">f</span><span style="color: #808030;">,</span> <span style="color: green;">0.0</span><span style="color: #006600;">f</span><span style="color: #808030;">,</span> <span style="color: green;">1.0</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> color <span style="color: #808030;">=</span> glm<span style="color: purple;">::</span>mix<span style="color: #808030;">(</span>color<span style="color: #808030;">,</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> glm<span style="color: purple;">::</span>vec3<span style="color: #808030;">(</span><span style="color: green;">0.0</span><span style="color: #006600;">f</span><span style="color: #808030;">,</span> <span style="color: green;">0.0</span><span style="color: #006600;">f</span><span style="color: #808030;">,</span> <span style="color: green;">0.164706</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: #808030;">,</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> glm<span style="color: purple;">::</span>clamp<span style="color: #808030;">(</span>glm<span style="color: purple;">::</span>length<span style="color: #808030;">(</span>q<span style="color: #808030;">)</span><span style="color: #808030;">,</span> <span style="color: green;">0.0</span><span style="color: #006600;">f</span><span style="color: #808030;">,</span> <span style="color: green;">1.0</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> color <span style="color: #808030;">=</span> glm<span style="color: purple;">::</span>mix<span style="color: #808030;">(</span>color<span style="color: #808030;">,</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> glm<span style="color: purple;">::</span>vec3<span style="color: #808030;">(</span><span style="color: green;">0.666667</span><span style="color: #006600;">f</span><span style="color: #808030;">,</span> <span style="color: green;">1.0</span><span style="color: #006600;">f</span><span style="color: #808030;">,</span> <span style="color: green;">1.0</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: #808030;">,</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> glm<span style="color: purple;">::</span>clamp<span style="color: #808030;">(</span>glm<span style="color: purple;">::</span>length<span style="color: #808030;">(</span>r<span style="color: #808030;">.</span>x<span style="color: #808030;">)</span><span style="color: #808030;">,</span> <span style="color: green;">0.0</span><span style="color: #006600;">f</span><span style="color: #808030;">,</span> <span style="color: green;">1.0</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> </span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> glm<span style="color: purple;">::</span>vec4 output_arr <span style="color: #808030;">=</span> glm<span style="color: purple;">::</span>vec4<span style="color: #808030;">(</span>color<span style="color: #808030;">*</span><span style="color: maroon; font-weight: bold;">float</span><span style="color: #808030;">(</span>f<span style="color: #808030;">*</span>f<span style="color: #808030;">*</span>f<span style="color: #808030;">+</span><span style="color: green;">0.6</span><span style="color: #006600;">f</span><span style="color: #808030;">*</span>f<span style="color: #808030;">*</span>f<span style="color: #808030;">+</span><span style="color: green;">0.5</span><span style="color: #006600;">f</span><span style="color: #808030;">*</span>f<span style="color: #808030;">)</span><span style="color: #808030;">,</span><span style="color: green;">1.0</span><span style="color: #006600;">f</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">return</span> rescale_vec4_1_255<span style="color: #808030;">(</span>output_arr<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: purple;">}</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: maroon; font-weight: bold;">int</span> <span style="color: #400000;">main</span><span style="color: #808030;">(</span><span style="color: #808030;">)</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: purple;">{</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> SDL_Window <span style="color: #808030;">*</span>window <span style="color: #808030;">=</span> <span style="color: maroon; font-weight: bold;">nullptr</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> SDL_Renderer <span style="color: #808030;">*</span>renderer <span style="color: #808030;">=</span> <span style="color: maroon; font-weight: bold;">nullptr</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> SDL_Init<span style="color: #808030;">(</span>SDL_INIT_VIDEO<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> SDL_CreateWindowAndRenderer<span style="color: #808030;">(</span>WINDOW_WIDTH<span style="color: #808030;">,</span> WINDOW_HEIGHT<span style="color: #808030;">,</span> <span style="color: #008c00;">0</span><span style="color: #808030;">,</span> <span style="color: #808030;">&</span>window<span style="color: #808030;">,</span> <span style="color: #808030;">&</span>renderer<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> SDL_RenderSetScale<span style="color: #808030;">(</span>renderer<span style="color: #808030;">,</span> <span style="color: #008c00;">1</span><span style="color: #808030;">,</span> <span style="color: #008c00;">1</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> shs<span style="color: purple;">::</span>Canvas <span style="color: #808030;">*</span>main_canvas <span style="color: #808030;">=</span> <span style="color: maroon; font-weight: bold;">new</span> shs<span style="color: purple;">::</span>Canvas<span style="color: #808030;">(</span>CANVAS_WIDTH<span style="color: #808030;">,</span> CANVAS_HEIGHT<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> SDL_Surface <span style="color: #808030;">*</span>main_sdlsurface <span style="color: #808030;">=</span> main_canvas<span style="color: #808030;">-</span><span style="color: #808030;">></span>create_sdl_surface<span style="color: #808030;">(</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> SDL_Texture <span style="color: #808030;">*</span>screen_texture <span style="color: #808030;">=</span> SDL_CreateTextureFromSurface<span style="color: #808030;">(</span>renderer<span style="color: #808030;">,</span> main_sdlsurface<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">bool</span> <span style="color: #603000;">exit</span> <span style="color: #808030;">=</span> <span style="color: maroon; font-weight: bold;">false</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> SDL_Event event_data<span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">int</span> frame_delay <span style="color: #808030;">=</span> <span style="color: #008c00;">1000</span> <span style="color: #808030;">/</span> FRAMES_PER_SECOND<span style="color: purple;">;</span> <span style="color: dimgrey;">// Target frame duration in milliseconds</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">float</span> frame_time_accumulator <span style="color: #808030;">=</span> <span style="color: #008c00;">0</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">int</span> frame_counter <span style="color: #808030;">=</span> <span style="color: #008c00;">0</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">int</span> fps <span style="color: #808030;">=</span> <span style="color: #008c00;">0</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">float</span> time_accumulator <span style="color: #808030;">=</span> <span style="color: green;">0.0</span><span style="color: #006600;">f</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">while</span> <span style="color: #808030;">(</span><span style="color: #808030;">!</span><span style="color: #603000;">exit</span><span style="color: #808030;">)</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: purple;">{</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> Uint32 frame_start_ticks <span style="color: #808030;">=</span> SDL_GetTicks<span style="color: #808030;">(</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: dimgrey;">// process input events queued by the OS</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">while</span> <span style="color: #808030;">(</span>SDL_PollEvent<span style="color: #808030;">(</span><span style="color: #808030;">&</span>event_data<span style="color: #808030;">)</span><span style="color: #808030;">)</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: purple;">{</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">switch</span> <span style="color: #808030;">(</span>event_data<span style="color: #808030;">.</span>type<span style="color: #808030;">)</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: purple;">{</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">case </span><span style="color: #7d0045;">SDL_QUIT</span><span style="color: #e34adc;">:</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: #603000;">exit</span> <span style="color: #808030;">=</span> <span style="color: maroon; font-weight: bold;">true</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">break</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">case </span><span style="color: #7d0045;">SDL_KEYDOWN</span><span style="color: #e34adc;">:</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">switch</span><span style="color: #808030;">(</span>event_data<span style="color: #808030;">.</span>key<span style="color: #808030;">.</span>keysym<span style="color: #808030;">.</span>sym<span style="color: #808030;">)</span> <span style="color: purple;">{</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">case </span><span style="color: #7d0045;">SDLK_ESCAPE</span><span style="color: #e34adc;">:</span> </span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: #603000;">exit</span> <span style="color: #808030;">=</span> <span style="color: maroon; font-weight: bold;">true</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">break</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: purple;">}</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">break</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: purple;">}</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: purple;">}</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: dimgrey;">// clear the SDL2 render target before drawing</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> SDL_SetRenderDrawColor<span style="color: #808030;">(</span>renderer<span style="color: #808030;">,</span> <span style="color: #008c00;">0</span><span style="color: #808030;">,</span> <span style="color: #008c00;">0</span><span style="color: #808030;">,</span> <span style="color: #008c00;">0</span><span style="color: #808030;">,</span> <span style="color: #008c00;">255</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> SDL_RenderClear<span style="color: #808030;">(</span>renderer<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: dimgrey;">// run the fragment shader across worker threads in parallel</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: #666616;">std</span><span style="color: purple;">::</span>mutex canvas_mutex<span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: #666616;">std</span><span style="color: purple;">::</span><span style="color: #603000;">vector</span><span style="color: purple;"><</span><span style="color: #666616;">std</span><span style="color: purple;">::</span>thread<span style="color: purple;">></span> thread_pool<span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">int</span> region_width <span style="color: #808030;">=</span> CANVAS_WIDTH <span style="color: #808030;">/</span> CONCURRENCY_COUNT<span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">int</span> region_height <span style="color: #808030;">=</span> CANVAS_HEIGHT <span style="color: #808030;">/</span> CONCURRENCY_COUNT<span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">for</span> <span style="color: #808030;">(</span><span style="color: maroon; font-weight: bold;">int</span> i <span style="color: #808030;">=</span> <span style="color: #008c00;">0</span><span style="color: purple;">;</span> i <span style="color: #808030;"><</span> CONCURRENCY_COUNT<span style="color: purple;">;</span> i<span style="color: #808030;">+</span><span style="color: #808030;">+</span><span style="color: #808030;">)</span> <span style="color: purple;">{</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">int</span> start_x <span style="color: #808030;">=</span> i <span style="color: #808030;">*</span> region_width<span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">int</span> end_x <span style="color: #808030;">=</span> <span style="color: #808030;">(</span>i <span style="color: #808030;">+</span> <span style="color: #008c00;">1</span><span style="color: #808030;">)</span> <span style="color: #808030;">*</span> region_width<span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">for</span> <span style="color: #808030;">(</span><span style="color: maroon; font-weight: bold;">int</span> j <span style="color: #808030;">=</span> <span style="color: #008c00;">0</span><span style="color: purple;">;</span> j <span style="color: #808030;"><</span> CONCURRENCY_COUNT<span style="color: purple;">;</span> j<span style="color: #808030;">+</span><span style="color: #808030;">+</span><span style="color: #808030;">)</span> <span style="color: purple;">{</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">int</span> start_y <span style="color: #808030;">=</span> j <span style="color: #808030;">*</span> region_height<span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">int</span> end_y <span style="color: #808030;">=</span> <span style="color: #808030;">(</span>j <span style="color: #808030;">+</span> <span style="color: #008c00;">1</span><span style="color: #808030;">)</span> <span style="color: #808030;">*</span> region_height<span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: #666616;">std</span><span style="color: purple;">::</span>thread task<span style="color: #808030;">(</span><span style="color: #808030;">[</span>start_x<span style="color: #808030;">,</span> end_x<span style="color: #808030;">,</span> start_y<span style="color: #808030;">,</span> end_y<span style="color: #808030;">,</span> time_accumulator<span style="color: #808030;">,</span> <span style="color: #808030;">&</span>main_canvas<span style="color: #808030;">,</span> <span style="color: #808030;">&</span>canvas_mutex<span style="color: #808030;">]</span><span style="color: #808030;">(</span><span style="color: #808030;">)</span> <span style="color: purple;">{</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">for</span> <span style="color: #808030;">(</span><span style="color: maroon; font-weight: bold;">int</span> x <span style="color: #808030;">=</span> start_x<span style="color: purple;">;</span> x <span style="color: #808030;"><</span> end_x<span style="color: purple;">;</span> x<span style="color: #808030;">+</span><span style="color: #808030;">+</span><span style="color: #808030;">)</span> <span style="color: purple;">{</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">for</span> <span style="color: #808030;">(</span><span style="color: maroon; font-weight: bold;">int</span> y <span style="color: #808030;">=</span> start_y<span style="color: purple;">;</span> y <span style="color: #808030;"><</span> end_y<span style="color: purple;">;</span> y<span style="color: #808030;">+</span><span style="color: #808030;">+</span><span style="color: #808030;">)</span> <span style="color: purple;">{</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> glm<span style="color: purple;">::</span>vec2 uv <span style="color: #808030;">=</span> <span style="color: purple;">{</span><span style="color: maroon; font-weight: bold;">float</span><span style="color: #808030;">(</span>x<span style="color: #808030;">)</span><span style="color: #808030;">,</span> <span style="color: maroon; font-weight: bold;">float</span><span style="color: #808030;">(</span>y<span style="color: #808030;">)</span><span style="color: purple;">}</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> glm<span style="color: purple;">::</span>vec4 shader_output <span style="color: #808030;">=</span> fragment_shader<span style="color: #808030;">(</span>uv<span style="color: #808030;">,</span> time_accumulator<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: purple;">{</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: dimgrey;">//std::lock_guard<std::mutex> lock(canvas_mutex);</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> shs<span style="color: purple;">::</span>Canvas<span style="color: purple;">::</span>draw_pixel<span style="color: #808030;">(</span><span style="color: #808030;">*</span>main_canvas<span style="color: #808030;">,</span> x<span style="color: #808030;">,</span> y<span style="color: #808030;">,</span> shs<span style="color: purple;">::</span>Color<span style="color: purple;">{</span>uint8_t<span style="color: #808030;">(</span>shader_output<span style="color: #808030;">[</span><span style="color: #008c00;">0</span><span style="color: #808030;">]</span><span style="color: #808030;">)</span><span style="color: #808030;">,</span> uint8_t<span style="color: #808030;">(</span>shader_output<span style="color: #808030;">[</span><span style="color: #008c00;">1</span><span style="color: #808030;">]</span><span style="color: #808030;">)</span><span style="color: #808030;">,</span> uint8_t<span style="color: #808030;">(</span>shader_output<span style="color: #808030;">[</span><span style="color: #008c00;">2</span><span style="color: #808030;">]</span><span style="color: #808030;">)</span><span style="color: #808030;">,</span> uint8_t<span style="color: #808030;">(</span>shader_output<span style="color: #808030;">[</span><span style="color: #008c00;">3</span><span style="color: #808030;">]</span><span style="color: #808030;">)</span><span style="color: purple;">}</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: purple;">}</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: purple;">}</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: purple;">}</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: purple;">}</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> thread_pool<span style="color: #808030;">.</span>emplace_back<span style="color: #808030;">(</span><span style="color: #666616;">std</span><span style="color: purple;">::</span><span style="color: #603000;">move</span><span style="color: #808030;">(</span>task<span style="color: #808030;">)</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: purple;">}</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: purple;">}</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: dimgrey;">// wait for all worker threads in the pool to finish their jobs</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">for</span> <span style="color: #808030;">(</span><span style="color: maroon; font-weight: bold;">auto</span> <span style="color: #808030;">&</span>thread <span style="color: purple;">:</span> thread_pool<span style="color: #808030;">)</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: purple;">{</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> thread<span style="color: #808030;">.</span>join<span style="color: #808030;">(</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: purple;">}</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: dimgrey;">// debug draw to verify that something is actually being rendered</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> shs<span style="color: purple;">::</span>Canvas<span style="color: purple;">::</span>fill_random_pixel<span style="color: #808030;">(</span><span style="color: #808030;">*</span>main_canvas<span style="color: #808030;">,</span> <span style="color: #008c00;">40</span><span style="color: #808030;">,</span> <span style="color: #008c00;">30</span><span style="color: #808030;">,</span> <span style="color: #008c00;">60</span><span style="color: #808030;">,</span> <span style="color: #008c00;">80</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: dimgrey;">// actually presenting canvas data on the hardware surface</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> shs<span style="color: purple;">::</span>Canvas<span style="color: purple;">::</span>copy_to_SDLSurface<span style="color: #808030;">(</span>main_sdlsurface<span style="color: #808030;">,</span> main_canvas<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> SDL_UpdateTexture<span style="color: #808030;">(</span>screen_texture<span style="color: #808030;">,</span> <span style="color: #7d0045;">NULL</span><span style="color: #808030;">,</span> main_sdlsurface<span style="color: #808030;">-</span><span style="color: #808030;">></span>pixels<span style="color: #808030;">,</span> main_sdlsurface<span style="color: #808030;">-</span><span style="color: #808030;">></span>pitch<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> SDL_Rect destination_rect<span style="color: purple;">{</span><span style="color: #008c00;">0</span><span style="color: #808030;">,</span> <span style="color: #008c00;">0</span><span style="color: #808030;">,</span> WINDOW_WIDTH<span style="color: #808030;">,</span> WINDOW_HEIGHT<span style="color: purple;">}</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> SDL_RenderCopy<span style="color: #808030;">(</span>renderer<span style="color: #808030;">,</span> screen_texture<span style="color: #808030;">,</span> <span style="color: #7d0045;">NULL</span><span style="color: #808030;">,</span> <span style="color: #808030;">&</span>destination_rect<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> SDL_RenderPresent<span style="color: #808030;">(</span>renderer<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> </span>
<span class="line_wrapper" style="counter-increment: line 1;"> frame_counter<span style="color: #808030;">+</span><span style="color: #808030;">+</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> Uint32 delta_frame_time <span style="color: #808030;">=</span> SDL_GetTicks<span style="color: #808030;">(</span><span style="color: #808030;">)</span> <span style="color: #808030;">-</span> frame_start_ticks<span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> frame_time_accumulator <span style="color: #808030;">+</span><span style="color: #808030;">=</span> delta_frame_time<span style="color: #808030;">/</span><span style="color: green;">1000.0</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> time_accumulator <span style="color: #808030;">+</span><span style="color: #808030;">=</span> delta_frame_time<span style="color: #808030;">/</span><span style="color: green;">1000.0</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">if</span> <span style="color: #808030;">(</span>delta_frame_time <span style="color: #808030;"><</span> frame_delay<span style="color: #808030;">)</span> <span style="color: purple;">{</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> SDL_Delay<span style="color: #808030;">(</span>frame_delay <span style="color: #808030;">-</span> delta_frame_time<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: purple;">}</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">if</span> <span style="color: #808030;">(</span>frame_time_accumulator <span style="color: #808030;">></span><span style="color: #808030;">=</span> <span style="color: green;">1.0</span><span style="color: #808030;">)</span> <span style="color: purple;">{</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: #666616;">std</span><span style="color: purple;">::</span><span style="color: #603000;">string</span> window_title <span style="color: #808030;">=</span> <span style="color: maroon;">"</span><span style="color: #0000e6;">FPS : </span><span style="color: maroon;">"</span><span style="color: #808030;">+</span><span style="color: #666616;">std</span><span style="color: purple;">::</span>to_string<span style="color: #808030;">(</span>frame_counter<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> frame_time_accumulator <span style="color: #808030;">=</span> <span style="color: green;">0.0</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> frame_counter <span style="color: #808030;">=</span> <span style="color: #008c00;">0</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> SDL_SetWindowTitle<span style="color: #808030;">(</span>window<span style="color: #808030;">,</span> window_title<span style="color: #808030;">.</span>c_str<span style="color: #808030;">(</span><span style="color: #808030;">)</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: purple;">}</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: purple;">}</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: dimgrey;">// free the memory</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">delete</span> main_canvas<span style="color: purple;">;</span> <span style="color: dimgrey;">// delete first, then null the pointer</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> main_canvas <span style="color: #808030;">=</span> <span style="color: maroon; font-weight: bold;">nullptr</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> SDL_DestroyTexture<span style="color: #808030;">(</span>screen_texture<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> SDL_FreeSurface<span style="color: #808030;">(</span>main_sdlsurface<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> SDL_DestroyRenderer<span style="color: #808030;">(</span>renderer<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> SDL_DestroyWindow<span style="color: #808030;">(</span>window<span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"> SDL_Quit<span style="color: #808030;">(</span><span style="color: #808030;">)</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"></span>
<span class="line_wrapper" style="counter-increment: line 1;"> <span style="color: maroon; font-weight: bold;">return</span> <span style="color: #008c00;">0</span><span style="color: purple;">;</span></span>
<span class="line_wrapper" style="counter-increment: line 1;"><span style="color: purple;">}</span></span></pre></div><div><br /></div><div><br /></div><div><br />
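<p>One subtlety in the threaded loop above: the region sizes come from integer division (<code>CANVAS_WIDTH / CONCURRENCY_COUNT</code>), so whenever the canvas size is not divisible by the thread count, the rightmost columns and bottom rows are silently never rendered. A minimal remainder-aware split is sketched below; the function name and sizes are illustrative, not taken from the renderer.</p>

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Split [0, size) into `parts` contiguous chunks that cover every index even
// when `size` is not divisible by `parts`: the first (size % parts) chunks
// simply get one extra element.
std::vector<std::pair<int, int>> split_axis(int size, int parts) {
    std::vector<std::pair<int, int>> chunks;
    int base  = size / parts;
    int extra = size % parts;
    int start = 0;
    for (int i = 0; i < parts; i++) {
        int len = base + (i < extra ? 1 : 0);
        chunks.push_back({start, start + len});
        start += len;
    }
    return chunks;
}
```

<p>Applying this to both axes before spawning the threads guarantees the worker regions tile the whole canvas; with the plain division in the listing, a 1000-pixel-wide canvas split across 3 workers would leave a 1-pixel column unrendered.</p>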
Reference links</div><div><br /></div><div><span> </span><a href="https://codeconfessions.substack.com/p/gpu-computing">https://codeconfessions.substack.com/p/gpu-computing</a> </div><div><br /></div></div>Sharavsambuu Gunchinishhttp://www.blogger.com/profile/06950810883056147179noreply@blogger.com0tag:blogger.com,1999:blog-1457877875009527488.post-58910455915916416932023-09-02T13:10:00.008+08:002023-09-03T21:03:39.222+08:00The flag of Mongolia on Shadertoy<p>I had long been thinking it would be nice to draw the flag of Mongolia as a GLSL shader.</p><span><a name='more'></a></span><p>Broadly speaking, the hardest part is obtaining the coordinates of the Soyombo symbol.</p><p>It turns out Wikipedia has the Soyombo in SVG format. </p><p>I converted the SVG with the <a href="https://zduny.github.io/shadertoy-svg/">https://zduny.github.io/shadertoy-svg/</a> tool to extract the coordinates, added a bit of shading, and arrived at a result like this.</p><p>Admittedly, this is quite a lazy approach. Properly, every shape (the circles, rectangles, curves and so on) should be derived from mathematical formulas and blended together at render time.</p><p>Done that way, it would run fast enough even in a mobile browser, and the fine details of the shapes would look much sharper when zoomed in.</p><p>So re-coding it with the approach just described makes a nice exercise for understanding various shader coding techniques and how algorithms run on a GPU, so do give it a try.</p><p>You can view the code in detail at the following link.</p>
<iframe allowfullscreen="" frameborder="0" height="360" src="https://www.shadertoy.com/embed/MsX3Wn?gui=true&t=10&paused=true&muted=false" width="640"></iframe><div><br /></div><div><br /></div><div>Lately I have also been writing an entirely CPU-based software renderer in C++, without any OpenGL, DirectX or the like. </div><div><br /></div><div>If you are interested, you can also find the C++ version of this shader <a href="https://github.com/sharavsambuu/leisure-software-renderer/blob/master/cpp-folders/src/hello-shaders/hello_mongolian_flag.cpp" target="_blank">here</a>.</div>
<p>
<iframe allowfullscreen="" frameborder="0" height="360" src="https://www.shadertoy.com/embed/wdKfRR?gui=true&t=10&paused=true&muted=false" width="640"></iframe>
</p>
<br />
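<p>The "shapes from formulas" idea mentioned above is usually implemented with signed distance functions (SDFs): each primitive is a formula giving the distance from a pixel to the shape's boundary, and shapes are combined by operating on those distances. A small sketch of the circle case follows, written per-pixel in the spirit of the software renderer's <code>fragment_shader</code> but using plain floats instead of glm so it stays self-contained; all names here are illustrative.</p>

```cpp
#include <cassert>
#include <cmath>

// Signed distance from point (px, py) to a circle of radius r centered at
// (cx, cy): negative inside, zero on the boundary, positive outside.
float sdf_circle(float px, float py, float cx, float cy, float r) {
    float dx = px - cx;
    float dy = py - cy;
    return std::sqrt(dx * dx + dy * dy) - r;
}

// Map a signed distance to a [0,1] coverage value over an `aa`-pixel-wide
// band, which anti-aliases the edge when blending shape colors.
float coverage(float d, float aa = 1.0f) {
    float t = 0.5f - d / aa;
    if (t < 0.0f) return 0.0f;
    if (t > 1.0f) return 1.0f;
    return t;
}
```

<p>A fragment shader would evaluate one such SDF per primitive of the Soyombo and the flag stripes, then mix the colors by their coverage values, instead of hard-coding traced SVG coordinates.</p>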
Sharavsambuu Gunchinishhttp://www.blogger.com/profile/06950810883056147179noreply@blogger.com0tag:blogger.com,1999:blog-1457877875009527488.post-60980967207702075182023-04-15T23:27:00.006+08:002023-04-16T00:02:18.741+08:00The process of discovering financial theories<p>In this note I want to write down how to apply machine learning tools correctly in finance.</p><span><a name='more'></a></span><p>Financial machine learning is a field large enough to be treated as a discipline of its own; the plug-and-play approach that works elsewhere in ML, where you simply feed in your data and stand up a model, will not succeed here.</p><p>So in this note I will try to highlight where the real value of ML can lie.</p><p>Naively, it seems plausible that you could train ML on the price history of some traded instrument, predict whether it will rise or fall, and take steps to profit from that.</p><p>But people have tried this approach a great many times and keep failing.</p><p>It turns out the soundest use of ML is in discovering financial theories.</p><p>Under this approach, it is not your black-box ML model that forecasts the price; rather, the theory you have discovered explains the price movement and forecasts it going forward. </p><p>In other words, ML is used as a tool for discovering theories.</p><p>To do that, you first use ML tools to sift out why the price moved and which variables contributed to that movement, and with what effect. 
</p><p>Those variables become your hypothesis, the basis on which you build a theory of your own.</p><p>Taken together, those variables should attempt to explain the cause and effect behind why the price moved.</p><p>If you succeed in discovering a theory, it should also work well out-of-sample, on data it has never seen.</p><p>From there, you need to try to refute the theory with counterexamples, to test whether it can keep succeeding.</p><p>If your theory survives those refutation attempts, then from that point on it is your theory, not the ML, that explains the market and makes the forecasts.</p><p>Negative reinforcement, sometimes called falsification, works like this: instead of building a trading system tuned to look pretty in backtests, you use those backtests to try to prove that your trading system cannot be consistently profitable. That is probably the soundest way to apply ML in finance.</p><p>If your theory is shown not to work, that is, shown to be a false positive, you should start everything over from the beginning rather than try to patch the theory.</p><p><br /></p><p>References:</p><p><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2460551" target="_blank">The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting and Non-Normality</a></p><p><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3221798">The False Strategy Theorem: A Financial Application of Experimental Mathematics</a></p><p><a href="https://twitter.com/lopezdeprado/status/1148534002883674112?lang=en" target="_blank">Backtesting is not research tool</a></p><p><a href="https://quantdare.com/probabilistic-sharpe-ratio/" target="_blank">Probabilistic Sharpe Ratio</a></p><p><a href="https://quantdare.com/deflated-sharpe-ratio-how-to-avoid-been-fooled-by-randomness/" target="_blank">Deflated Sharpe Ratio (how to avoid been fooled by randomness)</a></p><p><a href="https://www.youtube.com/watch?v=rdnSkDDDIgg" target="_blank">Deflated Sharpe ratio: Adjusting for multiple testing (Excel)</a></p><p><br /></p>Sharavsambuu Gunchinishhttp://www.blogger.com/profile/06950810883056147179noreply@blogger.com0tag:blogger.com,1999:blog-1457877875009527488.post-27044379242525982452023-04-02T06:32:00.024+08:002023-04-02T14:52:03.488+08:00The main reasons financial ML projects fail<p>This post is translated and adapted from the book <a href="https://www.amazon.com/Advances-Financial-Machine-Learning-Marcos/dp/1119482089" target="_blank">Advances In Financial Machine Learning</a>.</p><p><br /></p><p>Failures abound in quantitative finance, especially in financial ML. Only a few firms survive and grow their capital; success is a very rare phenomenon. </p><p>Over the past 20 years a great many people have entered this field, failed, and closed their doors. There is one major underlying reason for this.</p><span><a name='more'></a></span><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidlmbXbMDoqeZPb72A3QbKDHztpP19oAvE5DJ7WGvGGae5g18BYBGDe26RJIUo7G9jRyJJcjIZwFdYyIt6XKchdYb784QdQrefZZsvUJP06QVSyKPMcwAueiT9zI1rZTpTQYYNbmxXZwaFMFUoDVnxqkctDwCZlyJe1CeUnsqVXNI954LuX6rZ5A/s600/sisyphus.webp" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="407" data-original-width="600" height="217" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidlmbXbMDoqeZPb72A3QbKDHztpP19oAvE5DJ7WGvGGae5g18BYBGDe26RJIUo7G9jRyJJcjIZwFdYyIt6XKchdYb784QdQrefZZsvUJP06QVSyKPMcwAueiT9zI1rZTpTQYYNbmxXZwaFMFUoDVnxqkctDwCZlyJe1CeUnsqVXNI954LuX6rZ5A/s320/sisyphus.webp" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Sisyphus</td></tr></tbody></table><p><span style="font-size: x-large;">The Sisyphus Paradigm</span></p><p>Portfolio managers (PMs) do not base their investment decisions on any particular theory or rationale. They rely on news and all sorts of analyses, but mostly on their own judgment and intuition. </p><p>They try to justify those decisions with stories drawn from their past experience, and behind every investment decision they make there is always some story. </p><p>Since no one really understands the logic behind those bets, investment firms, in the name of diversifying their portfolios, usually have their managers work separately, in <b>silo</b>s.</p><p>If you attend a meeting of investment managers and listen to their conversation, you will likely notice just how much they are groping in the dark. </p><p>Every manager at the meeting is in the grip of some anecdote, and on top of that loudly proclaims sweeping conclusions without any supporting evidence. </p><p>This does not mean, of course, that these PMs cannot succeed; it is simply true that only a few of them do. </p><p>The key point is that they cannot be made to work as a team: gather 50 PMs together and they will influence one another's decisions, and you may find yourself paying 50 salaries for what amounts to one job.</p><p>So there may be a good reason to have them work in separate silos.</p><p>But whenever this approach is applied to quantitative or ML projects, it invariably fails. Hire 50 PhDs, task each of them with producing an investment strategy within six months, and the weakness of the approach shows itself. 
</p><p>Each PhD comes back with either an overfit, false-positive strategy that looks great in backtests, or a low-Sharpe-ratio strategy that everyone already uses and whose alpha has long since decayed. </p><p>Both outcomes are discouraging enough for the investment board that the project is halted and eventually abandoned.</p><p>Even if 5 of those PhDs produce true-positive strategies, the profit those strategies generate is not enough to cover the cost of employing all 50.</p><p>And before long, those 5 PhDs head off somewhere else, perhaps to another industry, to capture the rewards they feel they deserve.</p><p><br /></p><p><span style="font-size: x-large;">The meta-strategy paradigm</span></p><p>If you are tasked with producing your own ML-based investment strategy single-handedly, luck is not on your side.</p><p>Producing just one true-positive strategy takes effort equivalent to a hundred or more attempts, and the complexity of the methods required is enormous. </p><p>It spans sourcing and collecting the data, cleaning and processing it, infrastructure, software development, feature analysis, execution simulation, backtesting and so on.</p><p>Even if the firm you work for provides shared services in those areas, you are in the position of a worker at a BMW factory being asked to assemble an entire car alone, using the tools lying around them. 
</p><p>Долоо хоног өнгөрлөө, та магадгүй гагнуур дээр сайн ажиллаж байна, дараагийн долоо хоногт та цахилгаанчны үүрэг гүйцэтгэх хэрэгтэй болно, дахиад долоо хоногийн дараа механик инженерийн, дараа нь будагчны үүрэг гүйцэтгэх гэх мэтээр явж өгнө...</p><p>Хэдий та хүчлээд оролдсон ч буцаад гагнуураа хийх шаардлагатай тойрогтоо дахиад л ирнэ.</p><p>Амжилттай ажилладаг quant фирмүүд бүгд <b>мета-стратеги загвар</b>ыг хэрэглэдэг. </p><p>Үүний тулд тэд яг л BMW машин угсрах шугамтай адил судалгааны үйлдвэр барьсан байдаг.</p><p>Quant бүр нь нийт том зургийг харж мэдэрч ойлгодог байх мөртөө өөрсдийн гэсэн тусгайлсан даалгавар, үүрэг чиглэлдээ маш сайн мэргэшсэн байдаг.</p><p>Хэрэв ийм загварыг бүтээж чадвал ямар нэгэн азанд найдахын оронд таамаглаж болохуйцаар true positive стратеги бий болгох боломж ихээр нэмэгддэг.</p><p>Энэ бол <a href="https://www.lbl.gov/about/" target="_blank">Berkeley Lab</a> болон бусад Америкийн үндэсний лабораториудын шинжлэх ухааны нээлт хийхийн тулд өдөр болгон хэрэглэж байдаг үндсэн бүтэцтэй адилхан гэсэн үг. </p><p>Шинжлэх ухааны нээлт бүрийнх нь ард ямар нэг ганц хувь хүн байдаггүй, үүнд бүгд нийтээрээ багаараа хувь нэмэр оруулсан байдаг гэсэн үг.</p><p>Мэдээжийн хэрэг ийм бүтэцтэй financial лаборатори бүтээх нь маш их цаг орох ажил, дээрээс нь тухайн үүрэг чиглэл бүрд шаардлагатай, тэр чиглэлдээ туршлагатай хүмүүсийг олж ажиллуулах хэрэгтэй.</p><p>Гэсэн хэдий ч энэ бүтэц зохион байгуулалт нь практик дээр ажилладаг нь нотлогдсон, дээрээс нь quant бүрийн мөрөн дээр өндөр уул өөдөө мацахад тавьсан байдаг маш том хар бул чулууг нь аваад хаяж байгаа учраас нийтээрээ амжилт олох магадлал нь маш ихээр нэмэгддэг байна.</p><p><br /></p><p><span style="font-size: x-large;">Үйлдвэрийн дамжлагын шугам шиг бүтэц</span></p><p>16, 17-р зуунуудад алт, мөнгө олборлох үйл явц маш энгийн байжээ. 
</p><p>Зуухан жилийн дотор Испани улсын хөлөг онгоцууд Европ даяар арилжигддаг нийт үнэт металлуудын тоо хэмжээг бараг дөрөв дахин ихэсгэсэн байна. </p><p>Гэвч тэр сайхан цаг үе ард хоцроод тун уджээ. Өнөө үед хайгуулчид микроскопын хэмжээнд хүртэл ялгах хэмжээний нарийн төвөгтэй үйлдвэрлэлийн аргуудыг хэрэглэх шаардлагатай.</p><p>Гэсэн хэдий ч энэ нь алтны үйлдвэрлэл багасаж байгаа гэсэн үг биш. </p><p>Өнөө үеийн олборлогчид <a href="https://www.expensivity.com/worlds-gold/">2500 метрик тонн</a> микроскопын хэмжээнд шигшсэн алтыг жил бүр үйлдвэрлэн гаргаж байхад 16-р зуунд Испанийн байлдан дагуулагчдын олсон нийт алтны хэмжээ нь жилд ердөө 1.54 метрик тонн л байжээ.</p><p>Нүдэнд харагдахуйцаар байгальд оршдог алтны хэмжээ нь нийт дэлхий дээрх алтны хэмжээтэй харьцуулбал маш өчүүхэн бага хувийг эзэлдэг.</p><p>Хөрөнгө оруулалтын стратеги нээх үйл явц ч мөн үүнтэй адилхан замыг туулж байна.</p><p>Жишээлбэл 10-аад жилийн өмнө хувь хүмүүс макро альфа нээж олох нь их байжээ (эконометрик мэтийн энгийн математик хэрэгслүүдийн тусламжтайгаар).</p><p>Харин одоо бол тийм альфа стратеги нээж олох магадлал тэг рүү тэмүүлж байна. 
Хүмүүс хэдий өндөр туршлага, мэдлэгтэй байсан ч тэрнээс үл хамаараад макро альфа нээх нь бараг боломжгүй хэрэг болжээ.</p><p>Хамгийн үнэн зөв альфанууд зөвхөн микро түвшинд л үлдсэн байна, тэдгээрийг хайж олох үйл явц нь хөрөнгө ихээр шаарддаг үйлдвэрлэлийн аргуудыг хэрэглэхийг шаардаж байдаг.</p><p>Яг л алт шиг, альфа нь микро түвшинд байлаа гээд ашгийн хувьд бага хэмжээтэй гэсэн үг мэдээж биш.</p><p>Микро түвшний альфанууд макро альфануудтай харьцуулбал тоо хэмжээний хувьд маш элбэг оршин байдаг.</p><p>Хүмүүс микро альфаг хэрэглэн их хэмжээний мөнгө хийж байгаа, даан ч хүнд түвшний Machine Learning аргуудыг зөв хэрэглэх нь хамгийн том асуудал нь юм.</p>Sharavsambuu Gunchinishhttp://www.blogger.com/profile/06950810883056147179noreply@blogger.com0tag:blogger.com,1999:blog-1457877875009527488.post-19931046275672664832021-10-21T15:04:00.006+08:002021-10-21T15:15:41.228+08:00Тогтонги бүсFinancial Machine Learning-д хэрэглэх датасетэд memory буюу санамж хэрэгтэй байдаг. <span><a name='more'></a></span><div><br /></div><div>Өөрөөр хэлбэл одоо болж байгаа үзэгдлүүд буюу арилжааны ханшны датаг өмнөх түүхтэй нь харьцуулж үзэж болохуйц тархалтын бүс рүү хувиргах хэрэгтэй байдаг гэсэн үг. <div><br /></div><div>Энэ тогтонги бүсийг stationary distribution гээд байгаа юм. </div><div><br /></div><div>Ихэнх академик судлаачид ML-д хэрэглэх stationary дата гаргаж авах гээд ханшнаас нь return утгууд тооцож хэрэглэдэг. </div><div><br /></div><div>Энэ арга stationary цуваа өгдөг хэдий ч датасет дэх санамжийг нь арчаад хаячихдаг байна.
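Энэ санамж арчигдах үзэгдлийг зохиомол өгөгдөл дээр ойролцоогоор ингэж харж болох юм. Доорх жишээ нь random walk хэлбэрийн таамаг дата ашигласан зураглал төдий бөгөөд lag-1 корреляц бол санамжийн бүдүүвч хэмжүүр гэдгийг анхаарна уу:

```python
import numpy as np

# Таамаг жишээ: геометр random walk хэлбэрийн зохиомол ханшны цуваа
rng = np.random.default_rng(42)
close = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, 5000)))

# Ханшны түвшин өмнөх утгуудтайгаа өндөр корреляцтай буюу санамжтай
price_memory = np.corrcoef(close[1:], close[:-1])[0, 1]

# Харин return утгууд нь энэ санамжаа бараг бүрэн алддаг
returns = np.diff(close) / close[:-1]
return_memory = np.corrcoef(returns[1:], returns[:-1])[0, 1]

print(price_memory)   # 1-д маш ойрхон
print(return_memory)  # 0-д ойрхон
```

Өөрөөр хэлбэл ханшны түвшин stationary биш ч санамжтай, харин return нь stationary боловч санамжгүй болдог нь эндээс харагдана.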
Жишээлбэл return утгууд гаргаж авая.</div></div><div><pre style="background: rgb(0, 0, 0); color: #d1d1d1;">df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'returns'</span><span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'close'</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">.</span>pct_change<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
df <span style="color: #d2cd86;">=</span> df<span style="color: #d2cd86;">.</span>dropna<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'returns'</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">.</span>plot<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span></pre></div><div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjibrn7JhytW13znwf-xNjuVRsoMBemj5d2Mz1oUCguSad9W02LZ1fGmUVWssGxi-CRca0-u2UxlIZi-0lhpCLoSgJNqmnzyZcxU8ZWDOt7ZdPDAsGNGkAPOzUOP0U3ihbYGwAkKjRqA/s1166/non-stationary-prices.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="709" data-original-width="1166" height="390" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjibrn7JhytW13znwf-xNjuVRsoMBemj5d2Mz1oUCguSad9W02LZ1fGmUVWssGxi-CRca0-u2UxlIZi-0lhpCLoSgJNqmnzyZcxU8ZWDOt7ZdPDAsGNGkAPOzUOP0U3ihbYGwAkKjRqA/w640-h390/non-stationary-prices.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Энгийн ханшны цуваанд өмнөх түүхтэй харьцуулаад үзчихээр тархалтын бүс байхгүй байгааг эндээс харж болно</td></tr></tbody></table><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhB7-S3wfGm-mJzcUJ1tDuLJhbqOU-FfTgGapQjGz-_QV5uwhPtTEWxKE5F7XQloSHd4rzNoMtxpmxzKxjs8brEp6y6_yyWEwidIcTu16X-SI3C7CIDZ1_NQ92-zZOd_na3E2CBo4nW0g/s1156/non-stationary-distribution.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="683" data-original-width="1156" height="378" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhB7-S3wfGm-mJzcUJ1tDuLJhbqOU-FfTgGapQjGz-_QV5uwhPtTEWxKE5F7XQloSHd4rzNoMtxpmxzKxjs8brEp6y6_yyWEwidIcTu16X-SI3C7CIDZ1_NQ92-zZOd_na3E2CBo4nW0g/w640-h378/non-stationary-distribution.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Энгийн ханшын цувааны тархалтыг нь харвал иймэрхүү</td></tr></tbody></table><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUC7y4v1JX5oNK8_KveDzSRtmf4CwVkbCICs0g-wEGiARfId8XhK7fP3-SEkVpqqGOVTH-Sf9u16IKQxnL_sKgEK3FZrjYQAjjz8sFc60tWGQ1kGUiO-8pXiFvui_sk0AM-3iHWuxccg/s1180/stationary-but-memoryless-returns.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="709" data-original-width="1180" height="384" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUC7y4v1JX5oNK8_KveDzSRtmf4CwVkbCICs0g-wEGiARfId8XhK7fP3-SEkVpqqGOVTH-Sf9u16IKQxnL_sKgEK3FZrjYQAjjz8sFc60tWGQ1kGUiO-8pXiFvui_sk0AM-3iHWuxccg/w640-h384/stationary-but-memoryless-returns.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Хэрэв return утгууд тооцож харвал цуваа илүү stationary болох хэдий ч memory-г арчаад хаячихдаг байна. Энэ бол бөөн баахан noise-ууд л байна гэсэн үг.</td></tr></tbody></table><br /><div><br /></div><div>Marcos López de Prado гуай энэ асуудлыг шийдэх fractional differentiation гэх аргыг санал болгосон байна. </div><div><br /></div><div>Өөрөөр хэлбэл stationary бөгөөд memory бага арчигдсан дата хувиргалтын арга гэсэн үг. </div><div><br /></div><div>Иймэрхүү байдлаар гаргаж авч болно.</div><div><pre style="background: rgb(0, 0, 0); color: #d1d1d1;"><span style="color: #9999a9;">#!pip install fracdiff</span>
<span style="color: #e66170; font-weight: bold;">from</span> fracdiff <span style="color: #e66170; font-weight: bold;">import</span> fdiff
<span style="color: #e66170; font-weight: bold;">from</span> fracdiff <span style="color: #e66170; font-weight: bold;">import</span> FracdiffStat
precision <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">10e</span><span style="color: #00dddd;">-</span><span style="color: #00a800;">8</span>
window <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">60</span>
f <span style="color: #d2cd86;">=</span> FracdiffStat<span style="color: #d2cd86;">(</span>
window <span style="color: #d2cd86;">=</span> window <span style="color: #d2cd86;">,</span>
mode <span style="color: #d2cd86;">=</span> <span style="color: #00c4c4;">'valid'</span> <span style="color: #d2cd86;">,</span>
precision <span style="color: #d2cd86;">=</span> precision<span style="color: #d2cd86;">,</span>
<span style="color: #d2cd86;">)</span>
diff <span style="color: #d2cd86;">=</span> f<span style="color: #d2cd86;">.</span>fit_transform<span style="color: #d2cd86;">(</span>
df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'close'</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">.</span>values<span style="color: #d2cd86;">.</span>reshape<span style="color: #d2cd86;">(</span><span style="color: #00dddd;">-</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">,</span> <span style="color: #00a800;">1</span><span style="color: #d2cd86;">)</span>
<span style="color: #d2cd86;">)</span>
df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'close_fdiff'</span><span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> fdiff<span style="color: #d2cd86;">(</span>df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'close'</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">.</span>values<span style="color: #d2cd86;">,</span> n<span style="color: #d2cd86;">=</span>f<span style="color: #d2cd86;">.</span>d_<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> window<span style="color: #d2cd86;">=</span>window<span style="color: #d2cd86;">)</span>
df <span style="color: #d2cd86;">=</span> df<span style="color: #d2cd86;">.</span>iloc<span style="color: #d2cd86;">[</span>window<span style="color: #d2cd86;">:</span><span style="color: #d2cd86;">]</span>
df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'close_fdiff'</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">.</span>plot<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span></pre></div><div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3XK5Q3pbuoIEFqF2PUjzRMcjGfoa6VhHAQfIifDwb5AjTz-LTbJqYbM5ItX6QC469y5fLtnLWBbwKcUkddicaRA72PGBoqZxs2YiXa5NPLJLQyIcGHsaqXeGMImP0tWv9J0N4kleTxw/s1159/fracdiffed-memory-retained-prices.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="709" data-original-width="1159" height="392" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3XK5Q3pbuoIEFqF2PUjzRMcjGfoa6VhHAQfIifDwb5AjTz-LTbJqYbM5ItX6QC469y5fLtnLWBbwKcUkddicaRA72PGBoqZxs2YiXa5NPLJLQyIcGHsaqXeGMImP0tWv9J0N4kleTxw/w640-h392/fracdiffed-memory-retained-prices.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Fractional differentiation хувиргалтын дараа харвал өмнөх түүхтэй харьцуулаад үзчихүйц (memory-г бага арчсан) тархалтын бүс бүхий цуваанууд болж хувирсан байгааг харж болно.</td></tr></tbody></table><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhobePmTNsR96nb8qMqfnGenCKYr91q9ktd-w_NfgtAfOt4AjbizpFAcUyZVK4s-rHyDtLv906ooVTuCRi0hrcXlouaol1feDAc8MyaILHu1QkRTtOfsd3Jxs77wENO135zvnFdd18LIA/s1156/fracdiffed-distribution.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="683" data-original-width="1156" height="378" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhobePmTNsR96nb8qMqfnGenCKYr91q9ktd-w_NfgtAfOt4AjbizpFAcUyZVK4s-rHyDtLv906ooVTuCRi0hrcXlouaol1feDAc8MyaILHu1QkRTtOfsd3Jxs77wENO135zvnFdd18LIA/w640-h378/fracdiffed-distribution.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Fractional differentiated цувааны тархалт нь арай дээрдсэн байгааг бас харж болно.</td></tr></tbody></table><br /><div><br /></div><div><br /></div><div>Өөр нэг stationary дата гаргаж авах арга бол хурдтай moving average-үүдийн cumulative return-үүдээс difference тооцох арга. </div><div><br /></div><div>Иймэрхүү байдлаар бичиж болох байх.</div><div><pre style="background: rgb(0, 0, 0); color: #d1d1d1;">fast_ma_window <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">5</span>
df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'fast_ma'</span> <span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'close'</span> <span style="color: #d2cd86;">]</span>
<span style="color: #d2cd86;">.</span>rolling<span style="color: #d2cd86;">(</span>fast_ma_window<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">.</span>mean<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'fast_ret'</span> <span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'fast_ma'</span> <span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">.</span>pct_change<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'fast_cum'</span> <span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'fast_ret'</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">.</span>cumsum<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'stationary_ret'</span><span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'fast_cum'</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">.</span>diff<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span></pre></div><div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPdXbGMPOgk5wSVni0wyj9N4THv52Xbq-eR7gL4syZd00GAf4ujXvhFk-CGagBGCjvJoOUhzXJdJXJMFbPdzPNPpQcLnylgwf30ii1lFfQNCXT9pmqu8hjlqVA8a-LJzn-e3BwE2Gw2A/s1186/stationary-series.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="709" data-original-width="1186" height="382" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPdXbGMPOgk5wSVni0wyj9N4THv52Xbq-eR7gL4syZd00GAf4ujXvhFk-CGagBGCjvJoOUhzXJdJXJMFbPdzPNPpQcLnylgwf30ii1lFfQNCXT9pmqu8hjlqVA8a-LJzn-e3BwE2Gw2A/w640-h382/stationary-series.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Энэ аргаар хувиргасан датаг зурж харвал бас илүү дээрдсэн байгааг харж болно.</td></tr></tbody></table><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVtxvWbDs_jX7kX1sf_sM3KqHK3XGLzPBYsatbWwhuW3yyLd7MQMkURyJ0DcI9ejhUaRl6PfsRQW3lyfrxIOqjUD0zhMidUe9faDMW8RBYDdmK-owxCP_HTzKJj94JuZDO3pjDbhJTPg/s1162/stationary-distribution.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="683" data-original-width="1162" height="376" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVtxvWbDs_jX7kX1sf_sM3KqHK3XGLzPBYsatbWwhuW3yyLd7MQMkURyJ0DcI9ejhUaRl6PfsRQW3lyfrxIOqjUD0zhMidUe9faDMW8RBYDdmK-owxCP_HTzKJj94JuZDO3pjDbhJTPg/w640-h376/stationary-distribution.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">сүүлийн аргын тархалт нь</td></tr></tbody></table><br /><div><br /></div><div><br /></div><div>Цувааг stationary эсэхийг Augmented Dickey-Fuller (ADF) test гэдэг аргаар шалгаж болно.</div><div><pre style="background: rgb(0, 0, 0); color: #d1d1d1;"><span style="color: #e66170; font-weight: bold;">from</span> statsmodels<span style="color: #d2cd86;">.</span>tsa<span style="color: #d2cd86;">.</span>stattools <span style="color: #e66170; font-weight: bold;">import</span> adfuller
<span style="color: #e66170; font-weight: bold;">def</span> adf_test<span style="color: #d2cd86;">(</span>array<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
adf<span style="color: #d2cd86;">,</span> pvalue<span style="color: #d2cd86;">,</span> _<span style="color: #d2cd86;">,</span> _<span style="color: #d2cd86;">,</span> _<span style="color: #d2cd86;">,</span> _ <span style="color: #d2cd86;">=</span> adfuller<span style="color: #d2cd86;">(</span>array<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span>f<span style="color: #00c4c4;">"Ad-Fuller : {adf}"</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span>f<span style="color: #00c4c4;">"P-Value : {pvalue}"</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">if</span> pvalue<span style="color: #00dddd;">></span><span style="color: #009f00;">0.05</span><span style="color: #d2cd86;">:</span>
<span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span><span style="color: #9999a9;">"""Failed to reject null-hypothesis, </span>
<span style="color: #9999a9;"> not stationary."""</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">if</span> pvalue<span style="color: #00dddd;"><=</span><span style="color: #009f00;">0.05</span><span style="color: #d2cd86;">:</span>
<span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span><span style="color: #9999a9;">"""Rejected the null-hypothesis, </span>
<span style="color: #9999a9;"> it is stationary"""</span><span style="color: #d2cd86;">)</span>
<span style="color: #9999a9;"># stationary эсэхийг шалгахдаа </span>
adf_test<span style="color: #d2cd86;">(</span>df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'close'</span> <span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">.</span>values<span style="color: #d2cd86;">)</span>
adf_test<span style="color: #d2cd86;">(</span>df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'returns'</span> <span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">.</span>values<span style="color: #d2cd86;">)</span>
adf_test<span style="color: #d2cd86;">(</span>df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'close_fdiff'</span> <span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">.</span>values<span style="color: #d2cd86;">)</span>
adf_test<span style="color: #d2cd86;">(</span>df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'stationary_ret'</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">.</span>values<span style="color: #d2cd86;">)</span></pre></div><div><br /></div><div>Гаралтууд нь харгалзан</div><div><pre style="background: rgb(0, 0, 0); color: #d1d1d1;">Энгийн ханшын цуваа
Ad<span style="color: #d2cd86;">-</span>Fuller <span style="color: #d2cd86;">:</span> <span style="color: #d2cd86;">-</span><span style="color: #00a800;">1.</span><span style="color: #00a800;">2061369144376757</span>
P<span style="color: #d2cd86;">-</span>Value <span style="color: #d2cd86;">:</span> <span style="color: #00a800;">0.</span><span style="color: #00a800;">6709502103480609</span>
Failed to reject null<span style="color: #d2cd86;">-</span>hypothesis<span style="color: #d2cd86;">,</span> not stationary<span style="color: #d2cd86;">.</span>
Return утгууд
Ad<span style="color: #d2cd86;">-</span>Fuller <span style="color: #d2cd86;">:</span> <span style="color: #d2cd86;">-</span><span style="color: #00a800;">56.</span><span style="color: #00a800;">253940499175776</span>
P<span style="color: #d2cd86;">-</span>Value <span style="color: #d2cd86;">:</span> <span style="color: #00a800;">0.0</span>
Rejected the null<span style="color: #d2cd86;">-</span>hypothesis<span style="color: #d2cd86;">,</span> it is stationary
Fractional differentiation хийсэн утгууд
Ad<span style="color: #d2cd86;">-</span>Fuller <span style="color: #d2cd86;">:</span> <span style="color: #d2cd86;">-</span><span style="color: #00a800;">2.</span><span style="color: #00a800;">8566964987848777</span>
P<span style="color: #d2cd86;">-</span>Value <span style="color: #d2cd86;">:</span> <span style="color: #00a800;">0.</span><span style="color: #00a800;">05062031463227481</span>
Failed to reject null<span style="color: #d2cd86;">-</span>hypothesis<span style="color: #d2cd86;">,</span> not stationary<span style="color: #d2cd86;">.</span>
Сүүлийн аргаар гаргаж авсан утгууд
Ad<span style="color: #d2cd86;">-</span>Fuller <span style="color: #d2cd86;">:</span> <span style="color: #d2cd86;">-</span><span style="color: #00a800;">17.</span><span style="color: #00a800;">417089775860706</span>
P<span style="color: #d2cd86;">-</span>Value <span style="color: #d2cd86;">:</span> <span style="color: #00a800;">4.</span><span style="color: #00a800;">849275348297892e-30</span>
Rejected the null<span style="color: #d2cd86;">-</span>hypothesis<span style="color: #d2cd86;">,</span> it is stationary</pre></div><div><br /></div>Sharavsambuu Gunchinishhttp://www.blogger.com/profile/06950810883056147179noreply@blogger.com0tag:blogger.com,1999:blog-1457877875009527488.post-2446694920495146082021-09-19T05:05:00.017+08:002021-11-01T13:48:50.142+08:00Арилжааны захиалга хэрхэн биелдэг вэ?<p>Энэ тэмдэглэлээр авах, зарах захиалга болон Market Order, Limit Order мөн Order Book гэж юу вэ гэдэг талаар тэмдэглэнэ.</p><span><a name='more'></a></span><p><br /></p><p>Арилжааны захиалга хэрхэн биелдэгийг ойлгохын тулд эхлээд захиалга гэж юу вэ гэдгээ тодорхойлох хэрэгтэй байх.</p><p><br /></p><p><span style="font-size: x-large;"><b>Арилжааны захиалга</b></span></p><p>Арилжаанд оролцогчид ямар нэгэн хувьцааг худалдаж авах эсвэл зарах хэрэгтэй болвол брокерт энэ талаараа тодорхойлсон <b>захиалга</b> өгдөг.</p><p>Брокер энэ хүсэлтийг нь хүлээн авч таны өмнөөс захиалгыг биелүүлж өгдөг. </p><p>Market Exchange-д захиалга хийхэд дараах бүтэцтэйгээр үүснэ:</p><p> - Авах эсвэл зарах<br /> - Хувьцааны ялгах тэмдэг (Жишээ нь TSLA, GOOG, IBM гэх мэт)<br /> - Авах(эсвэл зарах) тоо ширхэг<br /> - Market Order эсвэл Limit Order<br /> - Ханш</p><p>Эндээс <b>"Market Order"</b> гэдэг нь market дээр одоогоор арилжаалагдаж байгаа ямар ч ханшын дагуу хувьцааг тэр үнээр нь шууд авах эсвэл зарахад бэлэн байгаа гэдгийг хэлж өгдөг.</p><p>Харин <b>"Limit Order"</b> гэдэг нь зөвхөн зааж өгсөн ханшын дагуу арилжаанд орно, бусад тохиолдолд тэр ханш дээр авах эсвэл зарах боломж гартал нь <b>Order Book</b> дотор хүлээнэ гэдгийг хэлж өгдөг.</p><p>Жишээлбэл :</p><p> <b>"BUY,IBM,100,LIMIT,99.95"</b> гэсэн захиалга байлаа гэж үзье. 
<br /> Энэ нь юу гэсэн үг вэ гэхээр би <b>100</b> ширхэг <b>IBM</b> хувьцааг <b>99.95</b>-аас<br /> <b>илүү үнээр авахгүй</b>, энэ үнэ болон түүнээс <b>доош үнэтэй л бол худалдаж авна</b> гэсэн үг</p><p> <b>"SELL,GOOG,150,MARKET"</b> гэсэн захиалгын хувьд би <b>150</b> ширхэг <b>Google</b>-н<br /> хувьцааг яг одоо байгаа <b>ямар ч хамаагүй үнэ</b> дээр зарна гэдгийг хэлж байна.</p><p><br /></p><p><span style="font-size: x-large;"><b>Order Book</b></span></p><p>Ихэнх арилжааны системүүд <b>Order Book</b> гэх бүтцийг арилжааны захиалга биелүүлэхдээ ашигладаг. </p><p>Энэ бүтэц дотор авах болон зарах бүх захиалгуудыг <b>тэмдэглэн хадгалдаг</b>.</p><p><b>New York Stock Exchange(NYSE)</b> нээгдэх үед <b>"BUY,IBM,100,LIMIT,99.95"</b> гэсэн захиалга өглөө гэж үзье. </p><p>Тухайн мөчид хэн ч ямар нэгэн захиалга нэмж өгөөгүй бол таны захиалга хамгийн эхэнд биелэгдэх захиалга болно.</p><p>Таны захиалга Order Book дотор <b>"BID 99.95 100"</b> хэлбэртэй болох бөгөөд нийтэд ил, хэн ч харах боломжтой мэдээлэл болж харагдана.</p><p>Хүмүүс үүнийг харахдаа аан за окэй энэ хувьцааг авах сонирхолтой захиалга байгаа юм байна гэж харна. Гэхдээ хэн энэ захиалгыг өгсөн бэ гэдгийг мэдэх боломжгүй.</p><p>Бусад хүмүүс ч гэсэн өөрийн захиалгыг өгөх боломжтой. 
Тухайн ханш дээр нэмж орж ирсэн захиалгын дагуу Order Book дотор <b>нэмж хөтөлдөг</b>.</p><p>Жишээлбэл өөр хэн нэгэн нэмж <b>99.95</b> ханш дээр <b>900</b> хувьцаа авах <b>Limit Order</b> өгвөл Order Book нь <b>"BID 99.95 1000"</b> болж нэмэгдэнэ гэсэн үг.</p><p>Дараагаар өөр нэг оролцогч <b>"SELL,IBM,1000,LIMIT,100"</b> захиалга өглөө гэж үзье.</p><p>Одоогоор хэн ч <b>1000</b>-н ширхэг <b>IBM</b>-н хувьцааг <b>100</b> ханш дээр авах захиалга өгөөгүй учраас энэ sell захиалга <b>биелэгдэхгүй</b> бөгөөд Order Book дотор <b>"ASK 100.00 1000"</b> хэлбэртэй болон хадгалагдана.</p><p>Тэгэхээр одоогийн Order Book маань дараах хэлбэртэйгээр хадгалагдаж байна гэсэн үг.</p><p> ASK 100.00 1000<br /> BID 99.95 100</p><p>Гэх мэтээр дараа дараачаар орж ирэх захиалгуудыг Order Book маань тэмдэглээд явна гэсэн үг.</p><p><br /></p><p>Нэмэгдэж явсаар Order Book маань дараах байдалтай болсон байна гэж үзье.</p><p> BID/ASK | Price | Size<br /> -----------------------------------<br /> ASK | 100.10 | 100<br /> ASK | 100.05 | 500<br /> ASK | 100.00 | 1000<br /> BID | 99.95 | 100<br /> BID | 99.90 | 50<br /> BID | 99.85 | 50</p><p><br /></p><p>Харин одоо шинээр <b>"BUY,IBM,100,MARKET"</b> гэсэн захиалга орж ирлээ гэж үзье.</p><p>Өөрөөр хэлбэл <b>Market Order</b> буюу <b>IBM</b>-н <b>100 ширхэг</b> хувьцааг одоо байгаа ямар ч хамаагүй ханшаар <b>худалдаж авахад бэлэн</b> захиалга гэсэн үг.</p><p>Энэ захиалгыг биелүүлэхийн тулд арилжааны систем эхлээд <b>Order Book</b>-ээсээ <b>шүүлт хийх</b> бөгөөд зарах боломжтой <b>хамгийн бага үнэ</b> дээр дээрх захиалгыг биелүүлж өгдөг.</p><p>Тэгэхээр сүүлийн захиалгыг биелүүлэх хамгийн <b>боломжит бага үнэ нь 100.00</b> тул тэнд байгаа <b>1000</b>-н ширхэгээс нь <b>100</b>-г сүүлийн <b>BUY</b> захиалга дээр <b>биелүүлж өгнө</b>, тиймээс Order Book маань дараах байдалтай болж хувирна.</p><p> BID/ASK | Price | Size<br /> -----------------------------------<br /> ASK | 100.10 | 100<br /> ASK | 100.05 | 500<br /> ASK | 100.00 | 900<br /> BID | 99.95 | 100<br /> 
BID | 99.90 | 50<br /> BID | 99.85 | 50</p><p>Өөрөөр хэлбэл одоо Order Book дотор <b>100.00</b> ханш дээр зарах <b>900-н ширхэг</b> хувьцаа <b>үлдсэн</b> байна гэсэн үг юм.<br />Энэ transaction-ий хувьд <b>execution ханш</b> нь 100.00 гэсэн үг.</p><p><br /></p><p>Дараа нь дахин <b>"BUY,100,LIMIT,100.02"</b> буюу <b>100-н ширхэг</b> хувьцааг <b>100.02</b>-оос <b>илүүгүй ханшаар худалдаж авна</b> гэсэн захиалга орж ирлээ гэж үзье. Order Book дотор энэ нөхцлийг хангах <b>100.00 ханш</b> дээр <b>зарахад бэлэн 900-н ширхэг</b> байгааг харж болно.</p><p>Тиймээс энэ <b>900</b>-аас захиалгыг биелүүлээд Order Book маань <b>100.00</b> дээр <b>800</b> болж хувирна</p><p> BID/ASK | Price | Size<br /> -----------------------------------<br /> ASK | 100.10 | 100<br /> ASK | 100.05 | 500<br /> ASK | 100.00 | 800<br /> BID | 99.95 | 100<br /> BID | 99.90 | 50<br /> BID | 99.85 | 50</p><p>Энэ transaction-ий хувьд <b>execution ханш</b> нь 100.00 гэсэн үг.</p><p><br /></p><p>Одоо нэмээд <b>"SELL,175,MARKET"</b> захиалга орж ирлээ гэж үзье. Одоо байгаа ямар ч хамаагүй ханшын дагуу <b>175</b> ширхэг хувьцааг зарна гэсэн үг.</p><p>Order Book дотор энэ сүүлийн захиалгын нөхцлийг хангах тоо ширхэгүүдийг шүүвэл</p><p> BID 100 99.95<br /> BID 50 99.90<br /> BID 25 99.85</p><p>болно. 
Тиймээс захиалгыг биелүүлсний дараа Order Book маань</p><p> BID/ASK | Price | Size<br /> -----------------------------------<br /> ASK | 100.10 | 100<br /> ASK | 100.05 | 500<br /> ASK | 100.00 | 800<br /> BID | 99.85 | 25</p><p>Execution ханш 99.95, 99.90, 99.85 болон буурч биелсэн гэсэн үг.</p><p>Иймэрхүү байдлаар захиалгууд тухайн хувьцааны ханшинд нөлөөлж байдаг байна.</p>Sharavsambuu Gunchinishhttp://www.blogger.com/profile/06950810883056147179noreply@blogger.com0tag:blogger.com,1999:blog-1457877875009527488.post-19593677276909201092021-09-11T17:45:00.008+08:002021-09-12T04:31:36.294+08:00Python хэлийг MetaTrader5 тай холбож ашиглах<p>Энэ тэмдэглэлээр Python хэлээр MetaTrader5-ийн датаг хэрхэн гаргаж авч ашиглах талаар тэмдэглэе гэж бодлоо.</p><a name='more'></a><p></p><p>Tick өгөгдлүүдийг нь цуглуулаад дуртай bar хэлбэрлүүгээ хөрвүүлж ашиглаж болно.</p>
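Tick-ээс bar руу хөрвүүлэх энгийн жишээг pandas-ийн resample-ээр ингэж төсөөлж болох юм. Доорх 'time' (unix секунд), 'bid' багана бүхий жижигхэн DataFrame бол зөвхөн таамаг зохиомол дата бөгөөд mt5.copy_ticks_from()-оос буцдаг бүтэцтэй төстэй гэж үзсэн болно:

```python
import pandas as pd

# Таамаг дата: mt5.copy_ticks_from()-ийн үр дүнтэй төстэй tick-үүд
ticks = pd.DataFrame({
    "time": [1631350800, 1631350815, 1631350830, 1631350870, 1631350930],
    "bid":  [1.1801, 1.1803, 1.1799, 1.1805, 1.1802],
})
ticks["time"] = pd.to_datetime(ticks["time"], unit="s")
ticks = ticks.set_index("time")

# Tick-үүдийг 1 минутын OHLC bar болгон хөрвүүлэх
bars = ticks["bid"].resample("1min").ohlc().dropna()
print(bars)
```

Үр дүн нь минут тутмын open, high, low, close багана бүхий DataFrame болох тул цаашид дурын давтамжийн bar-уудыг resample-ийн мөчлөгийг солиод л гаргаж авч болно.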
<pre style="background: rgb(0, 0, 0); color: #d1d1d1;"><span style="color: #e66170; font-weight: bold;">import</span> sys
<span style="color: #e66170; font-weight: bold;">import</span> time
<span style="color: #e66170; font-weight: bold;">import</span> requests
<span style="color: #e66170; font-weight: bold;">import</span> configparser
<span style="color: #e66170; font-weight: bold;">import</span> datetime
<span style="color: #e66170; font-weight: bold;">import</span> joblib
<span style="color: #e66170; font-weight: bold;">import</span> pytz
<span style="color: #e66170; font-weight: bold;">from</span> pytz <span style="color: #e66170; font-weight: bold;">import</span> timezone
<span style="color: #e66170; font-weight: bold;">import</span> matplotlib<span style="color: #d2cd86;">.</span>pyplot <span style="color: #e66170; font-weight: bold;">as</span> plt
<span style="color: #e66170; font-weight: bold;">import</span> matplotlib<span style="color: #d2cd86;">.</span>dates <span style="color: #e66170; font-weight: bold;">as</span> mdates
<span style="color: #e66170; font-weight: bold;">import</span> pandas <span style="color: #e66170; font-weight: bold;">as</span> pd
<span style="color: #e66170; font-weight: bold;">import</span> numpy <span style="color: #e66170; font-weight: bold;">as</span> np
<span style="color: #e66170; font-weight: bold;">import</span> MetaTrader5 <span style="color: #e66170; font-weight: bold;">as</span> mt5
utc_tz <span style="color: #d2cd86;">=</span> pytz<span style="color: #d2cd86;">.</span>timezone<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"Etc/UTC"</span><span style="color: #d2cd86;">)</span>
broker_shift <span style="color: #d2cd86;">=</span> datetime<span style="color: #d2cd86;">.</span>timedelta<span style="color: #d2cd86;">(</span>hours<span style="color: #d2cd86;">=</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">def</span> load_prev_ticks<span style="color: #d2cd86;">(</span>instrument<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
    <span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"loading previous ticks..."</span><span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">if</span> <span style="color: #e66170; font-weight: bold;">not</span> mt5<span style="color: #d2cd86;">.</span>initialize<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        mt5<span style="color: #d2cd86;">.</span>shutdown<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
    datetime_now <span style="color: #d2cd86;">=</span> datetime<span style="color: #d2cd86;">.</span>datetime<span style="color: #d2cd86;">.</span>now<span style="color: #d2cd86;">(</span>utc_tz<span style="color: #d2cd86;">)</span><span style="color: #00dddd;">+</span>broker_shift
    days <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">1</span>
    yesterday <span style="color: #d2cd86;">=</span> datetime_now <span style="color: #00dddd;">-</span> datetime<span style="color: #d2cd86;">.</span>timedelta<span style="color: #d2cd86;">(</span>days<span style="color: #d2cd86;">=</span>days<span style="color: #d2cd86;">)</span>
    start_datetime <span style="color: #d2cd86;">=</span> yesterday<span style="color: #d2cd86;">.</span>replace<span style="color: #d2cd86;">(</span>hour<span style="color: #d2cd86;">=</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">,</span> minute<span style="color: #d2cd86;">=</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">,</span> second<span style="color: #d2cd86;">=</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">,</span> microsecond<span style="color: #d2cd86;">=</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">)</span>
    end_datetime <span style="color: #d2cd86;">=</span> datetime_now
    previous_ticks <span style="color: #d2cd86;">=</span> mt5<span style="color: #d2cd86;">.</span>copy_ticks_range<span style="color: #d2cd86;">(</span>
        instrument <span style="color: #d2cd86;">,</span>
        start_datetime <span style="color: #d2cd86;">,</span>
        end_datetime <span style="color: #d2cd86;">,</span>
        mt5<span style="color: #d2cd86;">.</span>COPY_TICKS_ALL
    <span style="color: #d2cd86;">)</span>
    df <span style="color: #d2cd86;">=</span> pd<span style="color: #d2cd86;">.</span>DataFrame<span style="color: #d2cd86;">(</span>previous_ticks<span style="color: #d2cd86;">)</span>
    df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'datetime'</span><span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> pd<span style="color: #d2cd86;">.</span>to_datetime<span style="color: #d2cd86;">(</span>df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'time_msc'</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> unit<span style="color: #d2cd86;">=</span><span style="color: #00c4c4;">'ms'</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">.</span>dt<span style="color: #d2cd86;">.</span>tz_localize<span style="color: #d2cd86;">(</span>utc_tz<span style="color: #d2cd86;">)</span>
    df <span style="color: #d2cd86;">=</span> df<span style="color: #d2cd86;">.</span>set_index<span style="color: #d2cd86;">(</span>pd<span style="color: #d2cd86;">.</span>DatetimeIndex<span style="color: #d2cd86;">(</span>df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'datetime'</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
    df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'volume'</span> <span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">1</span>
    mt5<span style="color: #d2cd86;">.</span>shutdown<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"previous ticks are loaded now."</span><span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">return</span> df<span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'bid'</span><span style="color: #d2cd86;">,</span> <span style="color: #00c4c4;">'ask'</span><span style="color: #d2cd86;">,</span> <span style="color: #00c4c4;">'volume'</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">]</span>
<span style="color: #e66170; font-weight: bold;">def</span> latest_ticks<span style="color: #d2cd86;">(</span>instrument<span style="color: #d2cd86;">,</span> delta_seconds<span style="color: #d2cd86;">=</span><span style="color: #00a800;">40</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
    <span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"loading latest ticks..."</span><span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">if</span> <span style="color: #e66170; font-weight: bold;">not</span> mt5<span style="color: #d2cd86;">.</span>initialize<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        mt5<span style="color: #d2cd86;">.</span>shutdown<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
    utc_now <span style="color: #d2cd86;">=</span> datetime<span style="color: #d2cd86;">.</span>datetime<span style="color: #d2cd86;">.</span>now<span style="color: #d2cd86;">(</span>utc_tz<span style="color: #d2cd86;">)</span><span style="color: #00dddd;">+</span>broker_shift
    <span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span>f<span style="color: #00c4c4;">"datetime now : {utc_now}"</span><span style="color: #d2cd86;">)</span>
    utc_from <span style="color: #d2cd86;">=</span> utc_now<span style="color: #00dddd;">-</span>datetime<span style="color: #d2cd86;">.</span>timedelta<span style="color: #d2cd86;">(</span>seconds<span style="color: #d2cd86;">=</span>delta_seconds<span style="color: #d2cd86;">)</span>
    new_ticks <span style="color: #d2cd86;">=</span> mt5<span style="color: #d2cd86;">.</span>copy_ticks_range<span style="color: #d2cd86;">(</span>instrument<span style="color: #d2cd86;">,</span> utc_from<span style="color: #d2cd86;">,</span> utc_now<span style="color: #d2cd86;">,</span> mt5<span style="color: #d2cd86;">.</span>COPY_TICKS_ALL<span style="color: #d2cd86;">)</span>
    df <span style="color: #d2cd86;">=</span> pd<span style="color: #d2cd86;">.</span>DataFrame<span style="color: #d2cd86;">(</span>new_ticks<span style="color: #d2cd86;">)</span>
    df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'datetime'</span><span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> pd<span style="color: #d2cd86;">.</span>to_datetime<span style="color: #d2cd86;">(</span>df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'time_msc'</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> unit<span style="color: #d2cd86;">=</span><span style="color: #00c4c4;">'ms'</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">.</span>dt<span style="color: #d2cd86;">.</span>tz_localize<span style="color: #d2cd86;">(</span>utc_tz<span style="color: #d2cd86;">)</span>
    df <span style="color: #d2cd86;">=</span> df<span style="color: #d2cd86;">.</span>set_index<span style="color: #d2cd86;">(</span>pd<span style="color: #d2cd86;">.</span>DatetimeIndex<span style="color: #d2cd86;">(</span>df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'datetime'</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
    df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'volume'</span> <span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">1</span>
    mt5<span style="color: #d2cd86;">.</span>shutdown<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">return</span> df<span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'bid'</span><span style="color: #d2cd86;">,</span> <span style="color: #00c4c4;">'ask'</span><span style="color: #d2cd86;">,</span> <span style="color: #00c4c4;">'volume'</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">]</span>
</pre>
<div><br /></div>
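The bars built below use the ask side only. If you'd rather use the mid price, it is a small change. A sketch under the assumption that the tick frame has the `bid`/`ask`/`volume` columns that `load_prev_ticks()` above returns (the tick values here are made-up toy numbers):

```python
import pandas as pd

# Toy tick frame shaped like the one load_prev_ticks() returns
idx = pd.date_range("2021-09-11 00:00:00", periods=4, freq="20s", tz="UTC")
ticks = pd.DataFrame({"bid": [109.0, 109.2, 109.1, 109.3],
                      "ask": [109.1, 109.3, 109.2, 109.4],
                      "volume": 1}, index=idx)

def ticks_to_1m_mid(ticks):
    # Build OHLC bars from the mid price (average of bid and ask)
    mid = (ticks["bid"] + ticks["ask"]) / 2
    return pd.concat([mid.resample("1min").ohlc(),
                      ticks["volume"].resample("1min").sum()],
                     axis=1).dropna()

print(ticks_to_1m_mid(ticks))
```

The mid price avoids biasing the bars toward one side of the spread; everything downstream (resampling, feature building) stays the same.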
Building 1-minute bars in a live fashion and using them:<div><pre style="background: rgb(0, 0, 0); color: #d1d1d1;"><span style="color: #e66170; font-weight: bold;">def</span> ticks_to_1m<span style="color: #d2cd86;">(</span>ticks<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
    df_1m <span style="color: #d2cd86;">=</span> pd<span style="color: #d2cd86;">.</span>concat<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">[</span>
        ticks<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'ask'</span> <span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">.</span>resample<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">'1min'</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">.</span>ohlc<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span>
        ticks<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'volume'</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">.</span>resample<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">'1min'</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">.</span><span style="color: #e66170; font-weight: bold;">sum</span><span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
    <span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> axis<span style="color: #d2cd86;">=</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">)</span>
    df_1m <span style="color: #d2cd86;">=</span> df_1m<span style="color: #d2cd86;">.</span>dropna<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">return</span> df_1m
<span style="color: #9999a9;"># Live</span>
instrument <span style="color: #d2cd86;">=</span> <span style="color: #00c4c4;">"USDJPY"</span>
all_ticks <span style="color: #d2cd86;">=</span> load_prev_ticks<span style="color: #d2cd86;">(</span>instrument<span style="color: #d2cd86;">=</span>instrument<span style="color: #d2cd86;">)</span>
all_ticks <span style="color: #d2cd86;">=</span> all_ticks<span style="color: #d2cd86;">[</span><span style="color: #00dddd;">~</span>all_ticks<span style="color: #d2cd86;">.</span>index<span style="color: #d2cd86;">.</span>duplicated<span style="color: #d2cd86;">(</span>keep<span style="color: #d2cd86;">=</span><span style="color: #00c4c4;">'first'</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">]</span>
all_ticks <span style="color: #d2cd86;">=</span> all_ticks<span style="color: #d2cd86;">.</span>dropna<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
df_1m <span style="color: #d2cd86;">=</span> ticks_to_1m<span style="color: #d2cd86;">(</span>all_ticks<span style="color: #d2cd86;">)</span>
last_1m_bar_len <span style="color: #d2cd86;">=</span> <span style="color: #e66170; font-weight: bold;">len</span><span style="color: #d2cd86;">(</span>df_1m<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">try</span><span style="color: #d2cd86;">:</span>
    <span style="color: #e66170; font-weight: bold;">while</span> True<span style="color: #d2cd86;">:</span>
        new_ticks <span style="color: #d2cd86;">=</span> latest_ticks<span style="color: #d2cd86;">(</span>instrument<span style="color: #d2cd86;">=</span>instrument<span style="color: #d2cd86;">,</span> delta_seconds<span style="color: #d2cd86;">=</span><span style="color: #00a800;">80</span><span style="color: #d2cd86;">)</span>
        all_ticks <span style="color: #d2cd86;">=</span> pd<span style="color: #d2cd86;">.</span>concat<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">[</span>all_ticks<span style="color: #d2cd86;">,</span> new_ticks<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
        all_ticks <span style="color: #d2cd86;">=</span> all_ticks<span style="color: #d2cd86;">[</span><span style="color: #00dddd;">~</span>all_ticks<span style="color: #d2cd86;">.</span>index<span style="color: #d2cd86;">.</span>duplicated<span style="color: #d2cd86;">(</span>keep<span style="color: #d2cd86;">=</span><span style="color: #00c4c4;">'first'</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">]</span>
        all_ticks <span style="color: #d2cd86;">=</span> all_ticks<span style="color: #d2cd86;">.</span>dropna<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
        df_1m <span style="color: #d2cd86;">=</span> ticks_to_1m<span style="color: #d2cd86;">(</span>all_ticks<span style="color: #d2cd86;">)</span>
        <span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span>f<span style="color: #00c4c4;">"current bars length : {len(df_1m)}"</span><span style="color: #d2cd86;">)</span>
        <span style="color: #e66170; font-weight: bold;">if</span> <span style="color: #e66170; font-weight: bold;">len</span><span style="color: #d2cd86;">(</span>df_1m<span style="color: #d2cd86;">)</span><span style="color: #00dddd;">></span>last_1m_bar_len<span style="color: #d2cd86;">:</span>
            last_1m_bar_len <span style="color: #d2cd86;">=</span> <span style="color: #e66170; font-weight: bold;">len</span><span style="color: #d2cd86;">(</span>df_1m<span style="color: #d2cd86;">)</span>
            <span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"NEW BAR ENTERED..."</span><span style="color: #d2cd86;">)</span>
            <span style="color: #9999a9;"># A new bar has just arrived, so at this point you can</span>
            <span style="color: #9999a9;"># build your features, call your ML model,</span>
            <span style="color: #9999a9;"># generate signals, and so on...</span>
        time<span style="color: #d2cd86;">.</span>sleep<span style="color: #d2cd86;">(</span><span style="color: #00a800;">30</span><span style="color: #d2cd86;">)</span>
        <span style="color: #e66170; font-weight: bold;">pass</span>
<span style="color: #e66170; font-weight: bold;">except</span> KeyboardInterrupt<span style="color: #d2cd86;">:</span>
    <span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"Interrupted"</span><span style="color: #d2cd86;">)</span>
</pre>
</div>Sharavsambuu Gunchinishhttp://www.blogger.com/profile/06950810883056147179noreply@blogger.com0tag:blogger.com,1999:blog-1457877875009527488.post-29697223710458452802021-08-30T06:29:00.007+08:002021-09-12T04:32:36.341+08:00Low-lag smoothing functions<p>Moving average methods are used a lot in financial ML. But they have one drawback, and that is <b>signal lag</b>.<span></span></p><a name='more'></a><p>Moving averages are computed by sliding a window over the price series and keeping the average value of each window.</p><p>Inside that average, the influence of the current price is very weak.</p><p>If you compute signals from these averages, you get a lag on the order of the window length used for the averaging.</p><p>And of course nobody wants to fall behind everyone else when the market offers an opportunity.</p><p>On top of that, machine learning suffers from the nasty problem of feature noise. This problem can be softened by applying a smoothing function to the price series. </p><p>Even when using ML, it also helps if the features carry as little lag as possible.</p><p>To address this weakness, John Ehlers published new Triangle, Hamming, and Hann windowing methods in <a href="http://technical.traders.com/archive/volume-2014.asp?yr=2021">Technical Analysis of Stocks & Commodities</a> magazine.</p><p>These methods were first implemented in a C-like language at <a href="https://financial-hacker.com/better-indicators-with-windowing/">https://financial-hacker.com/better-indicators-with-windowing/</a>. </p><p>For future reference, I'm writing down my Python port of them in this post.</p>
<pre style="background: rgb(0, 0, 0); color: #d1d1d1;"><span style="color: #9999a9;"># Triangle factor Moving Average</span>
<span style="color: #e66170; font-weight: bold;">def</span> triangle_sma<span style="color: #d2cd86;">(</span>window<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
    length <span style="color: #d2cd86;">=</span> <span style="color: #e66170; font-weight: bold;">len</span><span style="color: #d2cd86;">(</span>window<span style="color: #d2cd86;">)</span>
    triangle <span style="color: #d2cd86;">=</span> <span style="color: #d2cd86;">[</span>a<span style="color: #00dddd;">+</span><span style="color: #00a800;">1</span> <span style="color: #e66170; font-weight: bold;">if</span> a<span style="color: #00dddd;"><</span>length<span style="color: #00dddd;">/</span><span style="color: #00a800;">2</span> <span style="color: #e66170; font-weight: bold;">else</span> length<span style="color: #00dddd;">-</span>a <span style="color: #e66170; font-weight: bold;">for</span> a <span style="color: #e66170; font-weight: bold;">in</span> <span style="color: #e66170; font-weight: bold;">range</span><span style="color: #d2cd86;">(</span>length<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">]</span>
    output <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>multiply<span style="color: #d2cd86;">(</span>triangle<span style="color: #d2cd86;">,</span> window<span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">return</span> np<span style="color: #d2cd86;">.</span><span style="color: #e66170; font-weight: bold;">sum</span><span style="color: #d2cd86;">(</span>output<span style="color: #d2cd86;">)</span><span style="color: #00dddd;">/</span>np<span style="color: #d2cd86;">.</span><span style="color: #e66170; font-weight: bold;">sum</span><span style="color: #d2cd86;">(</span>triangle<span style="color: #d2cd86;">)</span> <span style="color: #9999a9;"># normalize by the weight sum so the output stays on the price level</span>
df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">"TRIANGLE_50"</span> <span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'close'</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">.</span>rolling<span style="color: #d2cd86;">(</span><span style="color: #00a800;">50</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">.</span>apply<span style="color: #d2cd86;">(</span>
<span style="color: #e66170; font-weight: bold;">lambda</span> w<span style="color: #d2cd86;">:</span> triangle_sma<span style="color: #d2cd86;">(</span>w<span style="color: #d2cd86;">.</span>values<span style="color: #d2cd86;">)</span>
<span style="color: #d2cd86;">)</span>
<span style="color: #9999a9;"># Hamming factor Moving Average</span>
<span style="color: #e66170; font-weight: bold;">def</span> hamming_sma<span style="color: #d2cd86;">(</span>window<span style="color: #d2cd86;">,</span> pedestal<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
    length <span style="color: #d2cd86;">=</span> <span style="color: #e66170; font-weight: bold;">len</span><span style="color: #d2cd86;">(</span>window<span style="color: #d2cd86;">)</span>
    hamming <span style="color: #d2cd86;">=</span> <span style="color: #d2cd86;">[</span>np<span style="color: #d2cd86;">.</span>sin<span style="color: #d2cd86;">(</span>pedestal<span style="color: #00dddd;">+</span><span style="color: #d2cd86;">(</span>np<span style="color: #d2cd86;">.</span>pi<span style="color: #00dddd;">-</span><span style="color: #00a800;">2</span><span style="color: #00dddd;">*</span>pedestal<span style="color: #d2cd86;">)</span><span style="color: #00dddd;">*</span><span style="color: #d2cd86;">(</span>idx<span style="color: #00dddd;">+</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">)</span><span style="color: #00dddd;">/</span><span style="color: #d2cd86;">(</span>length<span style="color: #00dddd;">-</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
        <span style="color: #e66170; font-weight: bold;">for</span> idx<span style="color: #d2cd86;">,</span> a <span style="color: #e66170; font-weight: bold;">in</span> <span style="color: #e66170; font-weight: bold;">enumerate</span><span style="color: #d2cd86;">(</span><span style="color: #e66170; font-weight: bold;">range</span><span style="color: #d2cd86;">(</span>length<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">]</span>
    output <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>multiply<span style="color: #d2cd86;">(</span>hamming<span style="color: #d2cd86;">,</span> window<span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">return</span> np<span style="color: #d2cd86;">.</span><span style="color: #e66170; font-weight: bold;">sum</span><span style="color: #d2cd86;">(</span>output<span style="color: #d2cd86;">)</span><span style="color: #00dddd;">/</span>np<span style="color: #d2cd86;">.</span><span style="color: #e66170; font-weight: bold;">sum</span><span style="color: #d2cd86;">(</span>hamming<span style="color: #d2cd86;">)</span> <span style="color: #9999a9;"># normalize by the weight sum</span>
df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">"HAMMING_50"</span> <span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'close'</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">.</span>rolling<span style="color: #d2cd86;">(</span><span style="color: #00a800;">50</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">.</span>apply<span style="color: #d2cd86;">(</span>
<span style="color: #e66170; font-weight: bold;">lambda</span> w<span style="color: #d2cd86;">:</span> hamming_sma<span style="color: #d2cd86;">(</span>w<span style="color: #d2cd86;">.</span>values<span style="color: #d2cd86;">,</span> <span style="color: #00a800;">10</span><span style="color: #00dddd;">*</span>np<span style="color: #d2cd86;">.</span>pi<span style="color: #00dddd;">/</span><span style="color: #00a800;">360</span><span style="color: #d2cd86;">)</span>
<span style="color: #d2cd86;">)</span>
<span style="color: #9999a9;"># Hann factor Moving Average</span>
<span style="color: #e66170; font-weight: bold;">def</span> hann_sma<span style="color: #d2cd86;">(</span>window<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
    length <span style="color: #d2cd86;">=</span> <span style="color: #e66170; font-weight: bold;">len</span><span style="color: #d2cd86;">(</span>window<span style="color: #d2cd86;">)</span>
    hann <span style="color: #d2cd86;">=</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">(</span><span style="color: #00a800;">1</span><span style="color: #00dddd;">-</span>np<span style="color: #d2cd86;">.</span>cos<span style="color: #d2cd86;">(</span><span style="color: #00a800;">2</span><span style="color: #00dddd;">*</span>np<span style="color: #d2cd86;">.</span>pi<span style="color: #00dddd;">*</span><span style="color: #d2cd86;">(</span>idx<span style="color: #00dddd;">+</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">)</span><span style="color: #00dddd;">/</span><span style="color: #d2cd86;">(</span>length<span style="color: #00dddd;">+</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
        <span style="color: #e66170; font-weight: bold;">for</span> idx<span style="color: #d2cd86;">,</span> a <span style="color: #e66170; font-weight: bold;">in</span> <span style="color: #e66170; font-weight: bold;">enumerate</span><span style="color: #d2cd86;">(</span><span style="color: #e66170; font-weight: bold;">range</span><span style="color: #d2cd86;">(</span>length<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">]</span>
    output <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>multiply<span style="color: #d2cd86;">(</span>hann<span style="color: #d2cd86;">,</span> window<span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">return</span> np<span style="color: #d2cd86;">.</span><span style="color: #e66170; font-weight: bold;">sum</span><span style="color: #d2cd86;">(</span>output<span style="color: #d2cd86;">)</span><span style="color: #00dddd;">/</span>np<span style="color: #d2cd86;">.</span><span style="color: #e66170; font-weight: bold;">sum</span><span style="color: #d2cd86;">(</span>hann<span style="color: #d2cd86;">)</span> <span style="color: #9999a9;"># normalize by the weight sum</span>
df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">"HANN_50"</span> <span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> df<span style="color: #d2cd86;">[</span><span style="color: #00c4c4;">'close'</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">.</span>rolling<span style="color: #d2cd86;">(</span><span style="color: #00a800;">50</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">.</span>apply<span style="color: #d2cd86;">(</span>
<span style="color: #e66170; font-weight: bold;">lambda</span> w<span style="color: #d2cd86;">:</span> hann_sma<span style="color: #d2cd86;">(</span>w<span style="color: #d2cd86;">.</span>values<span style="color: #d2cd86;">)</span>
<span style="color: #d2cd86;">)</span>
</pre>
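A quick way to sanity-check windowed moving averages: when the weighted sum is divided by the sum of the weights, applying the filter to a constant series must return that constant unchanged. A minimal standalone sketch (the function names are mine; the weight formulas follow the ones used in this post):

```python
import numpy as np

# Standalone re-implementations used only for this check
def triangle_weighted_avg(window):
    length = len(window)
    # Triangle weights: ramp up to the middle of the window, then back down
    triangle = np.array([a+1 if a < length/2 else length-a for a in range(length)])
    return np.sum(triangle * window) / np.sum(triangle)

def hann_weighted_avg(window):
    length = len(window)
    hann = np.array([1 - np.cos(2*np.pi*(i+1)/(length+1)) for i in range(length)])
    return np.sum(hann * window) / np.sum(hann)

const = np.full(50, 101.25)              # a flat price series
print(triangle_weighted_avg(const))      # 101.25, the level is preserved
print(hann_weighted_avg(const))          # ≈ 101.25
```

If the normalization were wrong (for example, taking a plain mean of the weighted values), the output would sit at a different level than the input prices, which is easy to spot on a chart.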
Sharavsambuu Gunchinishhttp://www.blogger.com/profile/06950810883056147179noreply@blogger.com0tag:blogger.com,1999:blog-1457877875009527488.post-66723938179347361852020-08-03T08:27:00.005+08:002021-09-20T17:59:59.818+08:00Computing the entropy of a probability distributionIn the previous post I mentioned that by increasing the entropy of the probabilities with which the robot picks its actions, the robot adapts to a new environment by boldly trying different choices instead of just repeating what it had learned before.<span><a name='more'></a></span><div><br /></div><div>To understand this entropy better, let's implement it according to the formula it is computed with and compare the results.</div><div><br /></div><div>Suppose a set X holds the probability scores P(x<font size="1">i</font>) corresponding to the indices of the robot's possible actions. Then the entropy H(X) is computed with the following formula.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpy0h5FR6A4YQWyvcAFNvU80t_owQaqK45-lq-1lLhte44R7PZF8w7dUhH1cImkiAN7IJuHfbFrIOIr9iL_KnPCOwfL7u8LPVKaR2XFcVUNJzzoAK5I2c4GGtCMDE0FdvRTGNUD4n_aw/s388/entropy_with_probability.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="193" data-original-width="388" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpy0h5FR6A4YQWyvcAFNvU80t_owQaqK45-lq-1lLhte44R7PZF8w7dUhH1cImkiAN7IJuHfbFrIOIr9iL_KnPCOwfL7u8LPVKaR2XFcVUNJzzoAK5I2c4GGtCMDE0FdvRTGNUD4n_aw/d/entropy_with_probability.png" /></a></div>
<div>Following this formula, an entropy function in numpy can be written as</div>
<pre style="background: rgb(0, 0, 0); color: #d1d1d1;"><span style="color: #e66170; font-weight: bold;">def</span> entropy<span style="color: #d2cd86;">(</span>prob_distribution<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
    log_probabilities <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>log<span style="color: #d2cd86;">(</span>prob_distribution<span style="color: #d2cd86;">)</span>
    element_wise_mult <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>multiply<span style="color: #d2cd86;">(</span>prob_distribution<span style="color: #d2cd86;">,</span> log_probabilities<span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">return</span> <span style="color: #00dddd;">-</span>np<span style="color: #d2cd86;">.</span><span style="color: #e66170; font-weight: bold;">sum</span><span style="color: #d2cd86;">(</span>element_wise_mult<span style="color: #d2cd86;">)</span>
</pre>
<div><br /></div><div>The entropy value tells you how uncertain the probability distribution over the robot's action choices is.</div><div><br /></div><div>If the entropy is high, it indicates that there are many actions the robot may plausibly choose from.</div><div><br /></div><div>If the entropy is low, some actions have much higher probabilities than the rest, so it is largely certain which action the robot will pick.</div><div><br /></div><div><br /></div><div>Now let's run a few entropy experiments with the function defined above:</div><div><br /></div><div>Suppose the robot can choose among 5 possible actions.</div><div><br /></div><div>If the probabilities at indices 0, 1, and 2 are slightly higher than the others, in other words it is fairly clear the robot will mostly pick from these three, the entropy is</div><div><span style="font-family: monospace;"><span style="background-color: white;">>>> print(entropy([0.3, 0.3, 0.2, 0.1, 0.1])) </span><br />1.5047882836811908<br /></span></div><div><br /></div><div>Let's increase the certainty of picking action 2; since the choice is more certain, the entropy should drop</div><div><span style="font-family: monospace;"><span style="background-color: white;">>>> print(entropy([0.1, 0.1, 0.6, 0.1, 0.1])) </span><br />1.2275294114572126<br /></span></div><div><br /></div><div>Now let's make picking action 2 even more certain by raising its probability to 0.9. Because it is even more certain, the entropy should drop further compared to the previous example</div><div><span style="font-family: monospace;"><span style="background-color: white;">>>> print(entropy([0.025, 0.025, 0.9, 0.025, 0.025]))
</span><br />0.4637124095034373<br /></span></div><div><br /></div><div>Finally, let's make the probabilities of all 5 actions equal, so that any of the 5 may be chosen, that is, push the uncertainty of the choice to its maximum. In this case the entropy should of course increase.</div><div><span style="font-family: monospace;"><span style="background-color: white;">>>> print(entropy([0.2, 0.2, 0.2, 0.2, 0.2])) </span><br />1.6094379124341005<br /></span></div><div><br /></div><div>I hope these examples gave you a concrete intuition about entropy.</div>
<div><br /></div>
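One practical caveat (my addition, not from the examples above): if any probability is exactly 0, the `np.log` call inside the function above produces `-inf` and the product becomes `nan`, even though by convention 0·log 0 = 0. A guarded variant:

```python
import numpy as np

def entropy_safe(prob_distribution):
    p = np.asarray(prob_distribution, dtype=float)
    # By convention 0 * log(0) = 0, so zero-probability entries are dropped
    p = p[p > 0]
    return -np.sum(p * np.log(p))

print(entropy_safe([0.5, 0.5, 0.0]))  # 0.6931... == np.log(2)
```

The result is the same as before on strictly positive distributions, but it stays finite when some actions have zero probability.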
<div>References:</div><div><a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)">https://en.wikipedia.org/wiki/Entropy_(information_theory)</a></div>Sharavsambuu Gunchinishhttp://www.blogger.com/profile/06950810883056147179noreply@blogger.com0tag:blogger.com,1999:blog-1457877875009527488.post-4725974689305387562020-07-31T13:27:00.048+08:002021-03-07T06:29:41.355+08:00Deep Reinforcement Learning, Soft Actor Critic (SAC)Let's get to know SAC, an algorithm released jointly by UC Berkeley and Google.<span></span><div><br /></div><div><b>SAC</b> combines the advantages of two families of RL algorithms. </div><div><br /></div><div>The first family, <b>Trust Region Policy Optimization (TRPO)</b>, <b>Proximal Policy Optimization (PPO)</b>, and the previously mentioned <b>Asynchronous Advantage Actor-Critic (A3C)</b>, learns <b>on-policy</b>, which makes it weak in terms of <b>sample efficiency</b>. </div><div><br /></div><div>The second family, the <b>Q-Learning</b>-based <b>off-policy</b> algorithms <b>Deep Deterministic Policy Gradient (DDPG)</b> and <b>Twin Delayed Deep Deterministic Policy Gradient (TD3)</b>, uses a <b>replay buffer</b>, constantly recalling past experience and learning from it again and again, so it is much better in terms of sample efficiency. </div><div><br /></div><div>These methods, however, are overly sensitive to their <b>hyperparameters</b>: to make them <b>converge</b>, you end up re-tuning the hyperparameters and retraining many times over, which is their weakness.</div><div><br /></div><div>SAC, in contrast, is sample efficient and needs little hyperparameter tuning, and on top of that it works well <b>beyond simulation</b>, in the control of <b>real robots</b>.</div>
<div><br /></div>
<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" frameborder="0" height="315" src="https://www.youtube.com/embed/KOObeIjzXTY" width="560"></iframe>
<div><br /></div>
<div>This example shows that training a quadruped robot to walk with SAC not only <b>takes little time</b>, but the learned behavior also works well in <b>new environments</b> the robot has never seen before. </div><div><br /></div><div>Let's try to pin down the most important idea that SAC introduces.</div><div><br /></div><div><span style="font-size: xx-large;">The objective function</span></div><div><br /></div><div>The biggest feature distinguishing SAC from <b>other algorithms</b> is that it trains with an <b>augmented objective function</b>.</div><div><br /></div><div>While traditional RL algorithms aim to <b>maximize</b> the <b>cumulative reward</b> obtainable from a given <b>state</b>, SAC additionally aims to maximize the <b>entropy</b> of the <b>policy</b>.</div><div><br /></div><div><b>Entropy</b>, as I understand it, is a <b>measure of disorder</b>. If the <b>entropy is high</b>, the system is in a very <b>disordered</b> state. Conversely, if the <b>entropy is low</b>, the system can be considered well ordered.</div><div><br /></div><div>So the question arises: in the context of <b>reinforcement learning</b>, why would we want to <b>increase the entropy</b> of the <b>policy</b>? 
</div><div><br /></div><div>If a <b>policy</b>'s action entropy is high, i.e. it takes <b>random actions</b> <b>often</b>, and yet it <b>still maximizes the cumulative reward</b>, then intuitively we can conclude that this policy is <b>very good at exploration</b>.</div><div><br /></div><div>Likewise, if our policy manages to <b>maximize the cumulative reward</b> under <b>hard-to-predict conditions</b> like those of the real world, that is, in an <b>environment with high entropy</b> compared to a simulation, we can also expect it to be <b>equally well suited for training robots</b>.</div><div><br /></div><div>In other words, when real-world-like noise comes in, the model does not drift off and learn something else; it stays focused on its goal and remains more stable.</div><div><br /></div><div><br /></div><div>So, let's look at the formula of the objective function with the entropy factor brought in.</div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5wSr_zP4_yElKGzZuYvtHFQA7Wr51Z2-u9jkZaKZqfIUf_DUkb-t9_8iuJ1D7ebFwz9vMeMArAFTbvrMJS_g2-w23Mdv7uLTfYhjv1YJQdj45gSn3CIu8nwPnCZsLmMUn3EKGy8l5jg/s518/entropy_augmented_sac_objective.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="79" data-original-width="518" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5wSr_zP4_yElKGzZuYvtHFQA7Wr51Z2-u9jkZaKZqfIUf_DUkb-t9_8iuJ1D7ebFwz9vMeMArAFTbvrMJS_g2-w23Mdv7uLTfYhjv1YJQdj45gSn3CIu8nwPnCZsLmMUn3EKGy8l5jg/d/entropy_augmented_sac_objective.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The entropy-augmented <b>RL objective function</b></td></tr></tbody></table><div><br /></div><div>The first part of the objective function accounts for the <b>cumulative reward</b>, and the next part computes the <b>entropy of the policy</b>. 
</div><div><br /></div><div>Since we know that the goal of RL is to maximize the cumulative reward, I will skip that part.</div><div><br /></div><div>Let me try to explain the second part, the one that accounts for the <b>entropy</b>.</div><div><br /></div><div>The factor <b>α</b> is a positive number that adjusts the entropy temperature. If α=0, the objective function reduces to the plain cumulative reward. </div><div><br /></div><div>This temperature value is used to <b>tune</b> whether the <b>cumulative reward</b> or the <b>entropy value</b> contributes more to the final value of the <b>objective function</b>.</div><div><br /></div><div>In the part that computes the policy's entropy, α<b>log(π(action | state))</b>, the function π takes a <b>state</b> as input and returns the <b>probability score</b> of each <b>action</b> predicted under the <b>policy</b>.</div><div><br /></div><div>How does the <b>probability distribution</b> this function <b>outputs</b> help in computing the entropy, i.e. the <b>uncertainty score</b>? </div><div><ul style="text-align: left;"><li>If the <b>probabilities</b> of <b>all actions</b> are <b>close</b> to each other, then <b>any of them could be chosen</b>, so the <b>uncertainty is high</b>, i.e. the <b>entropy is high</b></li><li>If <b>one action's probability score is much higher</b> than the rest, it is <b>obvious that this action will be chosen</b>, so the <b>amount of entropy is low</b>.</li></ul><div><b>Keeping the entropy high</b> gives the agent an <b>exploration</b>-friendly trait: instead of getting stuck on the actions it has already learned, it <b>tries out many new actions</b> that might turn out to be useful.</div></div><div><br /></div><div>This entropy-augmented objective function is written in the general form of RL. 
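The two bullet points above can be made concrete with a small NumPy sketch; the logits and the temperature value here are made up purely for illustration.

```python
import numpy as np

def softmax(logits):
    # subtract the max before exponentiating for numerical stability
    z = np.exp(logits - np.max(logits))
    return z / np.sum(z)

def policy_entropy(action_probabilities):
    # H(pi(.|s)) = -sum_a pi(a|s) * log(pi(a|s))
    return -np.sum(action_probabilities * np.log(action_probabilities))

alpha = 0.4  # entropy temperature (illustrative value)

peaked  = softmax(np.array([5.0, 0.0, 0.0]))  # one action clearly dominates
uniform = softmax(np.array([1.0, 1.0, 1.0]))  # all actions equally likely

print(alpha * policy_entropy(peaked))   # small entropy bonus: policy is confident
print(alpha * policy_entropy(uniform))  # large bonus: alpha * ln(3)
```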
</div><div><br /></div><div><br /></div><div>In this post I will not implement Soft Actor Critic exactly as described in its paper, but let's try to bring its entropy term into a plain Actor Critic, i.e. A2C.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihoQkfo741Wy5zmAFjT2-CZ7NHlyc4_P2oZtYwf_LtShjsygHNAKyT6hSeNje3ZFiz0wwfBIpHlOIFtgcVO22I1xnufadwa27e5I18taxCeBhRMhLxKBWvqDxeX0tbX7DhOGYBWdAqHw/s1052/a3c_entropy_bonus.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="91" data-original-width="1052" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihoQkfo741Wy5zmAFjT2-CZ7NHlyc4_P2oZtYwf_LtShjsygHNAKyT6hSeNje3ZFiz0wwfBIpHlOIFtgcVO22I1xnufadwa27e5I18taxCeBhRMhLxKBWvqDxeX0tbX7DhOGYBWdAqHw/s640/a3c_entropy_bonus.png" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgP6yhWf9gTRbKj2ZmlDmzseQz-prXPgeqrKpkoNIgFlBe3w4mvXplGvBK7C7JmuWQ_D1FdRZB4VhcH4dpoiN2_dMDv9cSKkRmSl5t6pOkYI1wRFuOyECHekFg2aFmxgaH7XxaM512NvQ/s1090/a3c_entropy_bonus_2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="93" data-original-width="1090" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgP6yhWf9gTRbKj2ZmlDmzseQz-prXPgeqrKpkoNIgFlBe3w4mvXplGvBK7C7JmuWQ_D1FdRZB4VhcH4dpoiN2_dMDv9cSKkRmSl5t6pOkYI1wRFuOyECHekFg2aFmxgaH7XxaM512NvQ/s640/a3c_entropy_bonus_2.png" width="640" /></a></div><div><br /></div><div>After making an ad-hoc substitution that adds entropy to A2C, I ended up with something like this: essentially, an <b>entropy</b> bonus that encourages <b>exploration</b> is added onto the <b>advantage</b> part. 
Below I have posted the code, written in Jax and Flax.</div><div><br /></div><div>References:</div><div> - <a href="https://ai.googleblog.com/2019/01/soft-actor-critic-deep-reinforcement.html">https://ai.googleblog.com/2019/01/soft-actor-critic-deep-reinforcement.html</a></div><div> - <a href="https://bair.berkeley.edu/blog/2018/12/14/sac/">https://bair.berkeley.edu/blog/2018/12/14/sac/</a></div><div> - <a href="https://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/">https://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/</a></div><div> - <a href="https://medium.com/@awjuliani/maximum-entropy-policies-in-reinforcement-learning-everyday-life-f5a1cc18d32d">https://medium.com/@awjuliani/maximum-entropy-policies-in-reinforcement-learning-everyday-life-f5a1cc18d32d</a></div><div><br /></div>
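Before the full Flax listing, the core of the substitution, adding the α-weighted policy entropy onto the advantage inside the actor loss, can be distilled into a few NumPy lines; the batch values here are made up for illustration.

```python
import numpy as np

alpha = 0.4  # entropy temperature, same illustrative value as in the listing

# Made-up batch of two timesteps: actor probabilities over 2 actions,
# discounted returns, critic value estimates, and the chosen action indices.
action_probs       = np.array([[0.7, 0.3], [0.5, 0.5]])
discounted_rewards = np.array([1.2, 0.8])
values             = np.array([1.0, 1.0])
actions            = np.array([0, 1])

advantages = discounted_rewards - values
# per-timestep policy entropy H = -sum_a p(a) * log(p(a))
entropies  = -np.sum(action_probs * np.log(action_probs), axis=1)
# the entropy bonus is added onto the advantage, as in the modified objective
advantages_with_entropies = advantages + alpha * entropies

# negative log-probability of the actions that were actually taken
chosen_probs = action_probs[np.arange(len(actions)), actions]
log_probs    = -np.log(chosen_probs)
actor_loss   = np.mean(log_probs * advantages_with_entropies)
```

This mirrors the `loss_fn` inside `backpropagate_actor` in the listing below, with the JAX ops replaced by their NumPy equivalents.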
<div><br /></div><div><b><span style="font-size: x-large;">Implementation</span></b></div><div><br /></div>
A first attempt: a variant of plain A2C with an entropy term brought in<div>
<pre style="background: rgb(0, 0, 0); color: #d1d1d1;"><span style="color: #e66170; font-weight: bold;">import</span> os
<span style="color: #e66170; font-weight: bold;">import</span> random
<span style="color: #e66170; font-weight: bold;">import</span> math
<span style="color: #e66170; font-weight: bold;">import</span> gym
<span style="color: #e66170; font-weight: bold;">import</span> flax
<span style="color: #e66170; font-weight: bold;">import</span> jax
<span style="color: #e66170; font-weight: bold;">from</span> jax <span style="color: #e66170; font-weight: bold;">import</span> numpy <span style="color: #e66170; font-weight: bold;">as</span> jnp
<span style="color: #e66170; font-weight: bold;">import</span> numpy <span style="color: #e66170; font-weight: bold;">as</span> np
<span style="color: #e66170; font-weight: bold;">import</span> numpy
debug_render <span style="color: #d2cd86;">=</span> True
debug <span style="color: #d2cd86;">=</span> False
num_episodes <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">1500</span>
learning_rate <span style="color: #d2cd86;">=</span> <span style="color: #009f00;">0.001</span>
gamma <span style="color: #d2cd86;">=</span> <span style="color: #009f00;">0.99</span>
<span style="color: #e66170; font-weight: bold;">class</span> ActorNetwork<span style="color: #d2cd86;">(</span>flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Module<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
<span style="color: #e66170; font-weight: bold;">def</span> apply<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> x<span style="color: #d2cd86;">,</span> n_actions<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
dense_layer_1 <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Dense<span style="color: #d2cd86;">(</span>x<span style="color: #d2cd86;">,</span> <span style="color: #00a800;">64</span><span style="color: #d2cd86;">)</span>
activation_layer_1 <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>relu<span style="color: #d2cd86;">(</span>dense_layer_1<span style="color: #d2cd86;">)</span>
dense_layer_2 <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Dense<span style="color: #d2cd86;">(</span>activation_layer_1<span style="color: #d2cd86;">,</span> <span style="color: #00a800;">32</span><span style="color: #d2cd86;">)</span>
activation_layer_2 <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>relu<span style="color: #d2cd86;">(</span>dense_layer_2<span style="color: #d2cd86;">)</span>
output_dense_layer <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Dense<span style="color: #d2cd86;">(</span>activation_layer_2<span style="color: #d2cd86;">,</span> n_actions<span style="color: #d2cd86;">)</span>
output_layer <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>softmax<span style="color: #d2cd86;">(</span>output_dense_layer<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">return</span> output_layer
<span style="color: #e66170; font-weight: bold;">class</span> CriticNetwork<span style="color: #d2cd86;">(</span>flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Module<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
<span style="color: #e66170; font-weight: bold;">def</span> apply<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> x<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
dense_layer_1 <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Dense<span style="color: #d2cd86;">(</span>x<span style="color: #d2cd86;">,</span> <span style="color: #00a800;">64</span><span style="color: #d2cd86;">)</span>
activation_layer_1 <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>relu<span style="color: #d2cd86;">(</span>dense_layer_1<span style="color: #d2cd86;">)</span>
dense_layer_2 <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Dense<span style="color: #d2cd86;">(</span>activation_layer_1<span style="color: #d2cd86;">,</span> <span style="color: #00a800;">32</span><span style="color: #d2cd86;">)</span>
activation_layer_2 <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>relu<span style="color: #d2cd86;">(</span>dense_layer_2<span style="color: #d2cd86;">)</span>
output_dense_layer <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Dense<span style="color: #d2cd86;">(</span>activation_layer_2<span style="color: #d2cd86;">,</span> <span style="color: #00a800;">1</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">return</span> output_dense_layer
env <span style="color: #d2cd86;">=</span> gym<span style="color: #d2cd86;">.</span>make<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">'CartPole-v1'</span><span style="color: #d2cd86;">)</span>
state <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>reset<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
n_actions <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>action_space<span style="color: #d2cd86;">.</span>n
actor_module <span style="color: #d2cd86;">=</span> ActorNetwork<span style="color: #d2cd86;">.</span>partial<span style="color: #d2cd86;">(</span>n_actions<span style="color: #d2cd86;">=</span>n_actions<span style="color: #d2cd86;">)</span>
_<span style="color: #d2cd86;">,</span> actor_params <span style="color: #d2cd86;">=</span> actor_module<span style="color: #d2cd86;">.</span>init_by_shape<span style="color: #d2cd86;">(</span>jax<span style="color: #d2cd86;">.</span>random<span style="color: #d2cd86;">.</span>PRNGKey<span style="color: #d2cd86;">(</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">[</span>state<span style="color: #d2cd86;">.</span>shape<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
actor_model <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Model<span style="color: #d2cd86;">(</span>actor_module<span style="color: #d2cd86;">,</span> actor_params<span style="color: #d2cd86;">)</span>
critic_module <span style="color: #d2cd86;">=</span> CriticNetwork<span style="color: #d2cd86;">.</span>partial<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
_<span style="color: #d2cd86;">,</span> critic_params <span style="color: #d2cd86;">=</span> critic_module<span style="color: #d2cd86;">.</span>init_by_shape<span style="color: #d2cd86;">(</span>jax<span style="color: #d2cd86;">.</span>random<span style="color: #d2cd86;">.</span>PRNGKey<span style="color: #d2cd86;">(</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">[</span>state<span style="color: #d2cd86;">.</span>shape<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
critic_model <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Model<span style="color: #d2cd86;">(</span>critic_module<span style="color: #d2cd86;">,</span> critic_params<span style="color: #d2cd86;">)</span>
actor_optimizer <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>optim<span style="color: #d2cd86;">.</span>Adam<span style="color: #d2cd86;">(</span>learning_rate<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">.</span>create<span style="color: #d2cd86;">(</span>actor_model<span style="color: #d2cd86;">)</span>
critic_optimizer <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>optim<span style="color: #d2cd86;">.</span>Adam<span style="color: #d2cd86;">(</span>learning_rate<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">.</span>create<span style="color: #d2cd86;">(</span>critic_model<span style="color: #d2cd86;">)</span>
<span style="color: #d2cd86;">@</span>jax<span style="color: #d2cd86;">.</span>jit
<span style="color: #e66170; font-weight: bold;">def</span> actor_inference<span style="color: #d2cd86;">(</span>model<span style="color: #d2cd86;">,</span> x<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
<span style="color: #e66170; font-weight: bold;">return</span> model<span style="color: #d2cd86;">(</span>x<span style="color: #d2cd86;">)</span>
<span style="color: #d2cd86;">@</span>jax<span style="color: #d2cd86;">.</span>jit
<span style="color: #e66170; font-weight: bold;">def</span> critic_inference<span style="color: #d2cd86;">(</span>model<span style="color: #d2cd86;">,</span> x<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
<span style="color: #e66170; font-weight: bold;">return</span> model<span style="color: #d2cd86;">(</span>x<span style="color: #d2cd86;">)</span>
<span style="color: #d2cd86;">@</span>jax<span style="color: #d2cd86;">.</span>jit
<span style="color: #e66170; font-weight: bold;">def</span> backpropagate_critic<span style="color: #d2cd86;">(</span>optimizer<span style="color: #d2cd86;">,</span> props<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
<span style="color: #9999a9;"># props[0] - states</span>
<span style="color: #9999a9;"># props[1] - discounted_rewards</span>
<span style="color: #e66170; font-weight: bold;">def</span> loss_fn<span style="color: #d2cd86;">(</span>model<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
values <span style="color: #d2cd86;">=</span> model<span style="color: #d2cd86;">(</span>props<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
values <span style="color: #d2cd86;">=</span> jnp<span style="color: #d2cd86;">.</span>reshape<span style="color: #d2cd86;">(</span>values<span style="color: #d2cd86;">,</span><span style="color: #d2cd86;">(</span>values<span style="color: #d2cd86;">.</span>shape<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
advantages <span style="color: #d2cd86;">=</span> props<span style="color: #d2cd86;">[</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">]</span> <span style="color: #00dddd;">-</span> values
<span style="color: #e66170; font-weight: bold;">return</span> jnp<span style="color: #d2cd86;">.</span>mean<span style="color: #d2cd86;">(</span>jnp<span style="color: #d2cd86;">.</span>square<span style="color: #d2cd86;">(</span>advantages<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
loss<span style="color: #d2cd86;">,</span> gradients <span style="color: #d2cd86;">=</span> jax<span style="color: #d2cd86;">.</span>value_and_grad<span style="color: #d2cd86;">(</span>loss_fn<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">(</span>optimizer<span style="color: #d2cd86;">.</span>target<span style="color: #d2cd86;">)</span>
optimizer <span style="color: #d2cd86;">=</span> optimizer<span style="color: #d2cd86;">.</span>apply_gradient<span style="color: #d2cd86;">(</span>gradients<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">return</span> optimizer<span style="color: #d2cd86;">,</span> loss
<span style="color: #d2cd86;">@</span>jax<span style="color: #d2cd86;">.</span>vmap
<span style="color: #e66170; font-weight: bold;">def</span> gather<span style="color: #d2cd86;">(</span>probability_vec<span style="color: #d2cd86;">,</span> action_index<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
<span style="color: #e66170; font-weight: bold;">return</span> probability_vec<span style="color: #d2cd86;">[</span>action_index<span style="color: #d2cd86;">]</span>
<span style="color: #d2cd86;">@</span>jax<span style="color: #d2cd86;">.</span>jit
<span style="color: #e66170; font-weight: bold;">def</span> backpropagate_actor<span style="color: #d2cd86;">(</span>optimizer<span style="color: #d2cd86;">,</span> critic_model<span style="color: #d2cd86;">,</span> props<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
<span style="color: #9999a9;"># props[0] - states</span>
<span style="color: #9999a9;"># props[1] - discounted_rewards</span>
<span style="color: #9999a9;"># props[2] - actions</span>
values <span style="color: #d2cd86;">=</span> jax<span style="color: #d2cd86;">.</span>lax<span style="color: #d2cd86;">.</span>stop_gradient<span style="color: #d2cd86;">(</span>critic_model<span style="color: #d2cd86;">(</span>props<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
values <span style="color: #d2cd86;">=</span> jnp<span style="color: #d2cd86;">.</span>reshape<span style="color: #d2cd86;">(</span>values<span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">(</span>values<span style="color: #d2cd86;">.</span>shape<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
advantages <span style="color: #d2cd86;">=</span> jnp<span style="color: #d2cd86;">.</span>subtract<span style="color: #d2cd86;">(</span>props<span style="color: #d2cd86;">[</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> values<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">def</span> loss_fn<span style="color: #d2cd86;">(</span>model<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
action_probabilities <span style="color: #d2cd86;">=</span> model<span style="color: #d2cd86;">(</span>props<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
probabilities <span style="color: #d2cd86;">=</span> gather<span style="color: #d2cd86;">(</span>action_probabilities<span style="color: #d2cd86;">,</span> props<span style="color: #d2cd86;">[</span><span style="color: #00a800;">2</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
log_probabilities <span style="color: #d2cd86;">=</span> <span style="color: #00dddd;">-</span>jnp<span style="color: #d2cd86;">.</span>log<span style="color: #d2cd86;">(</span>probabilities<span style="color: #d2cd86;">)</span>
alpha <span style="color: #d2cd86;">=</span> <span style="color: #009f00;">0.4</span> <span style="color: #9999a9;"># Entropy temperature</span>
entropies <span style="color: #d2cd86;">=</span> <span style="color: #00dddd;">-</span>jnp<span style="color: #d2cd86;">.</span><span style="color: #e66170; font-weight: bold;">sum</span><span style="color: #d2cd86;">(</span>
jnp<span style="color: #d2cd86;">.</span>multiply<span style="color: #d2cd86;">(</span>
action_probabilities<span style="color: #d2cd86;">,</span>
jnp<span style="color: #d2cd86;">.</span>log<span style="color: #d2cd86;">(</span>action_probabilities<span style="color: #d2cd86;">)</span>
<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span>
axis <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">1</span>
<span style="color: #d2cd86;">)</span><span style="color: #00dddd;">*</span>alpha
advantages_with_entropies <span style="color: #d2cd86;">=</span> jnp<span style="color: #d2cd86;">.</span>add<span style="color: #d2cd86;">(</span>advantages<span style="color: #d2cd86;">,</span> entropies<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">return</span> jnp<span style="color: #d2cd86;">.</span>mean<span style="color: #d2cd86;">(</span>jnp<span style="color: #d2cd86;">.</span>multiply<span style="color: #d2cd86;">(</span>log_probabilities<span style="color: #d2cd86;">,</span> advantages_with_entropies<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
loss<span style="color: #d2cd86;">,</span> gradients <span style="color: #d2cd86;">=</span> jax<span style="color: #d2cd86;">.</span>value_and_grad<span style="color: #d2cd86;">(</span>loss_fn<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">(</span>optimizer<span style="color: #d2cd86;">.</span>target<span style="color: #d2cd86;">)</span>
optimizer <span style="color: #d2cd86;">=</span> optimizer<span style="color: #d2cd86;">.</span>apply_gradient<span style="color: #d2cd86;">(</span>gradients<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">return</span> optimizer<span style="color: #d2cd86;">,</span> loss
global_step <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">0</span>
<span style="color: #e66170; font-weight: bold;">try</span><span style="color: #d2cd86;">:</span>
<span style="color: #e66170; font-weight: bold;">for</span> episode <span style="color: #e66170; font-weight: bold;">in</span> <span style="color: #e66170; font-weight: bold;">range</span><span style="color: #d2cd86;">(</span>num_episodes<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
state <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>reset<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
states<span style="color: #d2cd86;">,</span> actions<span style="color: #d2cd86;">,</span> rewards<span style="color: #d2cd86;">,</span> dones <span style="color: #d2cd86;">=</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span>
<span style="color: #e66170; font-weight: bold;">while</span> True<span style="color: #d2cd86;">:</span>
global_step <span style="color: #d2cd86;">=</span> global_step<span style="color: #00dddd;">+</span><span style="color: #00a800;">1</span>
action_probabilities <span style="color: #d2cd86;">=</span> actor_inference<span style="color: #d2cd86;">(</span>actor_optimizer<span style="color: #d2cd86;">.</span>target<span style="color: #d2cd86;">,</span> jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">[</span>state<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
action_probabilities <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>array<span style="color: #d2cd86;">(</span>action_probabilities<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
action <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>random<span style="color: #d2cd86;">.</span>choice<span style="color: #d2cd86;">(</span>n_actions<span style="color: #d2cd86;">,</span> p<span style="color: #d2cd86;">=</span>action_probabilities<span style="color: #d2cd86;">)</span>
next_state<span style="color: #d2cd86;">,</span> reward<span style="color: #d2cd86;">,</span> done<span style="color: #d2cd86;">,</span> _ <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>step<span style="color: #d2cd86;">(</span><span style="color: #e66170; font-weight: bold;">int</span><span style="color: #d2cd86;">(</span>action<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
states<span style="color: #d2cd86;">.</span>append<span style="color: #d2cd86;">(</span>state<span style="color: #d2cd86;">)</span>
actions<span style="color: #d2cd86;">.</span>append<span style="color: #d2cd86;">(</span>action<span style="color: #d2cd86;">)</span>
rewards<span style="color: #d2cd86;">.</span>append<span style="color: #d2cd86;">(</span>reward<span style="color: #d2cd86;">)</span>
dones<span style="color: #d2cd86;">.</span>append<span style="color: #d2cd86;">(</span><span style="color: #e66170; font-weight: bold;">int</span><span style="color: #d2cd86;">(</span>done<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
state <span style="color: #d2cd86;">=</span> next_state
<span style="color: #e66170; font-weight: bold;">if</span> debug_render<span style="color: #d2cd86;">:</span>
env<span style="color: #d2cd86;">.</span>render<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">if</span> done<span style="color: #d2cd86;">:</span>
<span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span>episode<span style="color: #d2cd86;">,</span> <span style="color: #00c4c4;">" - reward :"</span><span style="color: #d2cd86;">,</span> <span style="color: #e66170; font-weight: bold;">sum</span><span style="color: #d2cd86;">(</span>rewards<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
episode_length <span style="color: #d2cd86;">=</span> <span style="color: #e66170; font-weight: bold;">len</span><span style="color: #d2cd86;">(</span>rewards<span style="color: #d2cd86;">)</span>
discounted_rewards <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>zeros_like<span style="color: #d2cd86;">(</span>rewards<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">for</span> t <span style="color: #e66170; font-weight: bold;">in</span> <span style="color: #e66170; font-weight: bold;">range</span><span style="color: #d2cd86;">(</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">,</span> episode_length<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
G_t <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">0</span>
<span style="color: #e66170; font-weight: bold;">for</span> idx<span style="color: #d2cd86;">,</span> j <span style="color: #e66170; font-weight: bold;">in</span> <span style="color: #e66170; font-weight: bold;">enumerate</span><span style="color: #d2cd86;">(</span><span style="color: #e66170; font-weight: bold;">range</span><span style="color: #d2cd86;">(</span>t<span style="color: #d2cd86;">,</span> episode_length<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
G_t <span style="color: #d2cd86;">=</span> G_t <span style="color: #00dddd;">+</span> <span style="color: #d2cd86;">(</span>gamma<span style="color: #00dddd;">**</span>idx<span style="color: #d2cd86;">)</span><span style="color: #00dddd;">*</span>rewards<span style="color: #d2cd86;">[</span>j<span style="color: #d2cd86;">]</span><span style="color: #00dddd;">*</span><span style="color: #d2cd86;">(</span><span style="color: #00a800;">1</span><span style="color: #00dddd;">-</span>dones<span style="color: #d2cd86;">[</span>j<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
discounted_rewards<span style="color: #d2cd86;">[</span>t<span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> G_t
discounted_rewards <span style="color: #d2cd86;">=</span> discounted_rewards <span style="color: #00dddd;">-</span> np<span style="color: #d2cd86;">.</span>mean<span style="color: #d2cd86;">(</span>discounted_rewards<span style="color: #d2cd86;">)</span>
discounted_rewards <span style="color: #d2cd86;">=</span> discounted_rewards <span style="color: #00dddd;">/</span> <span style="color: #d2cd86;">(</span>np<span style="color: #d2cd86;">.</span>std<span style="color: #d2cd86;">(</span>discounted_rewards<span style="color: #d2cd86;">)</span><span style="color: #00dddd;">+</span><span style="color: #00a800;">1e-10</span><span style="color: #d2cd86;">)</span>
actor_optimizer<span style="color: #d2cd86;">,</span> _ <span style="color: #d2cd86;">=</span> backpropagate_actor<span style="color: #d2cd86;">(</span>
actor_optimizer<span style="color: #d2cd86;">,</span>
critic_optimizer<span style="color: #d2cd86;">.</span>target<span style="color: #d2cd86;">,</span>
<span style="color: #d2cd86;">(</span>
jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span>states<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span>
jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span>discounted_rewards<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span>
jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span>actions<span style="color: #d2cd86;">)</span>
<span style="color: #d2cd86;">)</span>
<span style="color: #d2cd86;">)</span>
critic_optimizer<span style="color: #d2cd86;">,</span> _ <span style="color: #d2cd86;">=</span> backpropagate_critic<span style="color: #d2cd86;">(</span>
critic_optimizer<span style="color: #d2cd86;">,</span>
<span style="color: #d2cd86;">(</span>
jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span>states<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span>
jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span>discounted_rewards<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span>
<span style="color: #d2cd86;">)</span>
<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">break</span>
<span style="color: #e66170; font-weight: bold;">finally</span><span style="color: #d2cd86;">:</span>
env<span style="color: #d2cd86;">.</span>close<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
</pre>
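The double loop in the listing above recomputes every discounted return G_t from scratch, which costs O(T²) per episode. A minimal sketch of the equivalent O(T) backward recursion, using a made-up reward list and ignoring the (1 - done) masking (i.e. assuming the episode terminates only at its final step):

```python
import numpy as np

gamma   = 0.99
rewards = [1.0, 1.0, 1.0, 1.0]   # made-up rewards from one episode

# Backward recursion G_t = r_t + gamma * G_{t+1}: a single O(T) pass
# that yields the same values as the nested loop above.
returns = np.zeros(len(rewards))
running = 0.0
for t in reversed(range(len(rewards))):
    running    = rewards[t] + gamma * running
    returns[t] = running
print(returns)   # G_0 = 1 + 0.99 + 0.99**2 + 0.99**3 ≈ 3.9404

# Standardize, as the listing does, to keep gradient magnitudes stable.
normalized = (returns - np.mean(returns)) / (np.std(returns) + 1e-10)
```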
<!--Created using ToHtml.com on 2020-08-17 00:58:30 UTC--></div>Sharavsambuu Gunchinishhttp://www.blogger.com/profile/06950810883056147179noreply@blogger.com1tag:blogger.com,1999:blog-1457877875009527488.post-5490700166542591042020-07-12T18:28:00.015+08:002020-08-17T10:02:06.344+08:00Deep Reinforcement Learning, Advantage Actor Critic - A2C, A3C with Jax and Flax<div>The previous posts in this series covered the Deep Q Learning and Policy Gradient algorithms. </div><div><br /></div><div>In this post, let's get acquainted with the next important algorithm, the <b>Actor Critic</b>.</div><span><a name='more'></a></span><div><br /></div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivRi4jw-QtqkkSXulcxHEXgqKylieNnVV4OWWH-Nc9wigwQcnNYKmFVerC7vaDRyFNCzoql1r9KgU1HNqSZFb4gnCCe4XwHUDIGHgluGtMDy0Y-w_dZ3N-oJqgzRhUg_MPLSP-zNMy_g/s432/categories_of_rl.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="250" data-original-width="432" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivRi4jw-QtqkkSXulcxHEXgqKylieNnVV4OWWH-Nc9wigwQcnNYKmFVerC7vaDRyFNCzoql1r9KgU1HNqSZFb4gnCCe4XwHUDIGHgluGtMDy0Y-w_dZ3N-oJqgzRhUg_MPLSP-zNMy_g/d/categories_of_rl.png" /></a></div><div><br /></div><div>To appreciate what this algorithm brings, we should start from how to further improve the Policy Gradient of the previous post, namely from its high variance (noisy gradient) problem.</div><div><br /></div><div>So what is variance? It measures how spread out random values are. If most of the values lie close to their mean, the distribution is concentrated in a small range and we say it has low variance. Conversely, if most of the distribution's mass lies far from the mean, it is considered high variance.</div><div><br /></div><div>What does it mean that the weakness of Policy Gradient is its high variance? 
</div><div><br /></div><div>Training a neural network, i.e. updating its parameters, requires gradient values that point toward the objective. As mentioned before, with noisy gradients, i.e. high variance, a neural network takes a long time to converge.</div><div><br /></div><div>Why does Policy Gradient produce high variance? Sampled trajectories can vary in length. Because of this, the reward sums come in widely varying magnitudes, and so, consequently, do the gradient values. As a result, the updates to the network parameters become unstable.</div><div><br /></div><div>For example, if the r(τ) values of three sampled trajectories are [1000, 1001, 1002] and the corresponding gradients ∇<font size="1">θ</font>logπ<font size="1">θ</font>(τ) are [0.5, 0.2, 0.3], then the variance of their products is <br /><br /></div><div> Var(0.5*1000, 0.2*1001, 0.3*1002) = 23286.8</div><div><br /></div><div>Now what if we shrink all the r(τ) values by some constant, say 1001? </div><div><br /></div><div> Var(0.5*1, 0.2*0, 0.3*(−1)) = 0.1633, a much lower variance.</div><div><br /></div><div>In this way we can introduce a value called a <b>baseline</b> to reduce the variance of the Policy Gradient. 
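The variance figures above can be checked with a few lines of NumPy (using the sample variance, ddof=1):

```python
import numpy as np

# r(τ) for three sampled trajectories and the corresponding
# ∇θ log πθ(τ) gradient magnitudes from the example above.
returns   = np.array([1000.0, 1001.0, 1002.0])
gradients = np.array([0.5, 0.2, 0.3])

# Without a baseline the products are huge and spread out.
var_raw = np.var(gradients * returns, ddof=1)
print(round(var_raw, 1))   # 23286.8

# Subtracting a constant baseline (1001) from every return leaves the
# expected policy gradient unchanged but shrinks the variance sharply.
var_baselined = np.var(gradients * (returns - 1001.0), ddof=1)
print(round(var_baselined, 4))   # 0.1633
```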
</div><div><br /></div><div>Updating with small-magnitude gradients also makes training, i.e. the parameter updates, more stable.</div><div><br /></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFhQyUGSJZeUMuH9i-0CGQYwboBEjqAtKbd2Xbh6pD4OcBhmk1Zs7IYVJOU7axooenXGvdXFc957Y7exbhguCIrwo9wfgC0JSYGDyRRVga61NsBgchLpfVg8zEoLdzDxes5ws2qCl0iw/s430/policy_gradient_with_baseline.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="98" data-original-width="430" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFhQyUGSJZeUMuH9i-0CGQYwboBEjqAtKbd2Xbh6pD4OcBhmk1Zs7IYVJOU7axooenXGvdXFc957Y7exbhguCIrwo9wfgC0JSYGDyRRVga61NsBgchLpfVg8zEoLdzDxes5ws2qCl0iw/d/policy_gradient_with_baseline.png" /></a></div><div><br /></div>In the same way, various baseline functions can be brought in and used. Broadly speaking, this is what distinguishes the algorithms of the PG family from one another.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjd8e5fnx-tn_kfyU-IXckez0XbNfGD9RRtf0dI7lX3j7t8Zy833L-f8z2lD8Y0jvtg8BYTxOsBiMUeLXZxsgGQmnE0xRnvcHbpdvRpqsMRgK6MCXAY_DZpFr32pSzHH72g6g1yKDHXAw/s1000/various_baselines.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="215" data-original-width="1000" height="138" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjd8e5fnx-tn_kfyU-IXckez0XbNfGD9RRtf0dI7lX3j7t8Zy833L-f8z2lD8Y0jvtg8BYTxOsBiMUeLXZxsgGQmnE0xRnvcHbpdvRpqsMRgK6MCXAY_DZpFr32pSzHH72g6g1yKDHXAw/w640-h138/various_baselines.png" width="640" /></a></div><div><br /></div><div>Now let's look at where Actor Critic, the main topic of this post, comes from.</div><div><br /></div><div>First, let's revisit the vanilla policy gradient formula we saw earlier. 
Note that here the time index starts at 0 rather than 1; otherwise it is the same as the earlier formula.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBhjz1qrniQjguPZZAuP4L7e8R2wlkz8nM2qXKBQ7a_wVlnQFY2GJ57131pBMHb-3-JfPZZsU923WFhKoJ5phyZaeL-fVbIEA1SCvZbM-exYCEPNf1R1lWdzhH7cVr3oLn5jlo8f7eAg/s453/vanilla_policy_gradient.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="82" data-original-width="453" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBhjz1qrniQjguPZZAuP4L7e8R2wlkz8nM2qXKBQ7a_wVlnQFY2GJ57131pBMHb-3-JfPZZsU923WFhKoJ5phyZaeL-fVbIEA1SCvZbM-exYCEPNf1R1lWdzhH7cVr3oLn5jlo8f7eAg/d/vanilla_policy_gradient.png" /></a></div><div>Expanding the expectation part,</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8Q_Iplrx5RtJZzKmV2tJkuZdKT4PXhW8BCjC1wmpfN2Z0hxkGaQ5VUcOqAvEQusGP9nZim5-mitp3UK_kcaO_zSWySJ52U0YdIbz_AUOrwfHV1sanPjEVIe3aQ_hkywALhIh32s5KWA/s714/vanilla_policy_gradient_expection.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="84" data-original-width="714" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8Q_Iplrx5RtJZzKmV2tJkuZdKT4PXhW8BCjC1wmpfN2Z0hxkGaQ5VUcOqAvEQusGP9nZim5-mitp3UK_kcaO_zSWySJ52U0YdIbz_AUOrwfHV1sanPjEVIe3aQ_hkywALhIh32s5KWA/d/vanilla_policy_gradient_expection.png" /></a></div><div>Looking at this, we can notice that the second expectation term is in fact the Q function.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0067IeDjB4SNOECGm-I9aN_biNzocnQNzDlxZGAzCINWF7EUpWDPSdFr-2_tSUdCbvbf19_mJXsOXsSyMLauiEGejeOZCv2wkUw7KD5VBlo1CH9TsvavuIHOehyphenhyphengmWj84NuAJShFP5Q/s339/expectation_q_value.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="54" data-original-width="339" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0067IeDjB4SNOECGm-I9aN_biNzocnQNzDlxZGAzCINWF7EUpWDPSdFr-2_tSUdCbvbf19_mJXsOXsSyMLauiEGejeOZCv2wkUw7KD5VBlo1CH9TsvavuIHOehyphenhyphengmWj84NuAJShFP5Q/d/expectation_q_value.png" /></a></div><div>Updating the vanilla policy gradient formula with a Q function that learns parameters w,</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQe4trQEFv4p32QCRQ5d1Ahlm0Kg_OIfo_IIwv4rWxM8Fm5bwKLGbeAWI4HW5l9mvSPw9N2T9SBWqclR8NqmKS5qzGrch8hFTxsz2F2gXQ_kZ7Uxl7o5SiIoHoBgS72NZ7A-kGY_uK4Q/s631/vanilla_policy_gradient_with_q_function.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="159" data-original-width="631" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQe4trQEFv4p32QCRQ5d1Ahlm0Kg_OIfo_IIwv4rWxM8Fm5bwKLGbeAWI4HW5l9mvSPw9N2T9SBWqclR8NqmKS5qzGrch8hFTxsz2F2gXQ_kZ7Uxl7o5SiIoHoBgS72NZ7A-kGY_uK4Q/d/vanilla_policy_gradient_with_q_function.png" /></a></div><div><br /></div><div>From this point we can define the Actor Critic method. </div><div><br /></div><div><b>Actor Critic</b> is a method that combines the strengths of both the <b>value-iteration</b> and <b>policy-iteration</b> approaches we already know. </div><div><ul style="text-align: left;"><li>The part that learns to approximate the Q or V function, i.e. the value part, is called the <b>Critic</b>.</li><li>The part that learns the policy is called the <b>Actor</b>.</li></ul></div><div>As mentioned in the previous post, the Advantage function scores how much better taking a particular action in a given state can be compared to the overall average return. 
</div><div>To continue, let's recall the Advantage function.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVfDfJUEGli-OduUKTsbW3gbIhepIDcK5FvntYx68MxAWlX4l7ea7VISBH3jda1MGMyPxltE1-_x6AwsdchCFHDRtjm0Srs02ODFcICQduY8FSq0pn1dCSxrE477V41nb7ygc8aQ4vZg/s355/advantage_function.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="57" data-original-width="355" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVfDfJUEGli-OduUKTsbW3gbIhepIDcK5FvntYx68MxAWlX4l7ea7VISBH3jda1MGMyPxltE1-_x6AwsdchCFHDRtjm0Srs02ODFcICQduY8FSq0pn1dCSxrE477V41nb7ygc8aQ4vZg/d/advantage_function.png" /></a></div><div>From the relationship between the Q and V functions, the Advantage function can be expressed through the V function alone.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyVqfoJGyHu_ofy7fqzhZ7amlP5xFh1e-4JgqLBI69rJTXyBW7eXgEyjf1aQdVSBRP0Og30SLEgidYP1VT8OeSNOrcseJz_7WKMn1WY1axqs1MW0SmafzHn4xqyBW3y1NQcYSXJfAScg/s351/q_expectation.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="53" data-original-width="351" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyVqfoJGyHu_ofy7fqzhZ7amlP5xFh1e-4JgqLBI69rJTXyBW7eXgEyjf1aQdVSBRP0Og30SLEgidYP1VT8OeSNOrcseJz_7WKMn1WY1axqs1MW0SmafzHn4xqyBW3y1NQcYSXJfAScg/d/q_expectation.png" /></a></div><div>Expressing the Advantage formula with only a V function that learns parameters v,</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJE1h_EC-5SXMk4roYasrRvUbZGo0WlQkjO8yYYVjb5cgZNnivL_kM_drSvbaDyeWwmCCBSxM_GZfLlfhxkwzbLcV1Slj6yEODYbvZOGLNBRtVyS3r_qv-Z8HY0wyPo8C21dVuhZBReA/s413/advantage_as_value.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="52" data-original-width="413" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJE1h_EC-5SXMk4roYasrRvUbZGo0WlQkjO8yYYVjb5cgZNnivL_kM_drSvbaDyeWwmCCBSxM_GZfLlfhxkwzbLcV1Slj6yEODYbvZOGLNBRtVyS3r_qv-Z8HY0wyPo8C21dVuhZBReA/d/advantage_as_value.png" /></a></div><div><br /></div><div>From here, to address the biggest problem of PG, reducing its high variance, we substitute the advantage expression into the policy gradient formula, and the <b>Advantage Actor Critic</b> formula emerges.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2jxc92EHAREKztXhsTEk-1n4w8VS_XmYg1Q9Xqa_Ogjk5U_PrNlSmdoJ1os3QVl63DdHqFchSNW3OXjBhHUcbnqDxDVlRL7QHJuqLTTMXED4RydQrI6JEOF-zDmbhtjZzzEUCZaSheg/s676/advantage_actor_critic.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="199" data-original-width="676" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2jxc92EHAREKztXhsTEk-1n4w8VS_XmYg1Q9Xqa_Ogjk5U_PrNlSmdoJ1os3QVl63DdHqFchSNW3OXjBhHUcbnqDxDVlRL7QHJuqLTTMXED4RydQrI6JEOF-zDmbhtjZzzEUCZaSheg/d/advantage_actor_critic.png" /></a></div><div><br /></div><div>In textbooks here and there you will often see the <b>Advantage Actor Critic</b> algorithm referred to as <b>A2C</b>; with improvements such as bringing in parallel environment computation, the <b>Asynchronous Advantage Actor Critic</b>, or <b>A3C</b>, algorithm arises. Since its foundations are the same, I won't write about <b>A3C</b> in this post, but at the very bottom there is code implemented as a threaded variant.</div><div><br /></div><div><br /></div><div><font size="6">References</font></div><div><br /></div><div><a href="https://www.quora.com/Why-does-the-policy-gradient-method-have-a-high-variance">https://www.quora.com/Why-does-the-policy-gradient-method-have-a-high-variance</a></div>
<br />
<div><br /></div>
<div><br /></div>
<div><font size="6">Implementation</font></div>
<div><br /></div>
<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9PjY7wSNycPyXGe5p-jcRCTMkRNXz0MRJ0AByOIBHuEhk4xMI3FhHeyV4NJbj0iboIWpIVv17Pvywp8U5gVJ6TmPbt9jbxi8vRRGCntiukfKZhq8WDBdLQRGQeFA8N1YdvXX9Jz_LWg/s568/episodic_a2c.gif" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="382" data-original-width="568" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9PjY7wSNycPyXGe5p-jcRCTMkRNXz0MRJ0AByOIBHuEhk4xMI3FhHeyV4NJbj0iboIWpIVv17Pvywp8U5gVJ6TmPbt9jbxi8vRRGCntiukfKZhq8WDBdLQRGQeFA8N1YdvXX9Jz_LWg/d/episodic_a2c.gif" /></a></div><div><br /></div>
<div><br /></div>
<div>
An online <b>Advantage Actor-Critic</b> implementation in Jax and Flax
</div>
<div><br /></div>
<div><br /></div>
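Both loss functions in the listing below revolve around the one-step temporal-difference advantage estimate A(s, a) = r + γ·V(s′)·(1 − done) − V(s). A tiny sketch with made-up critic outputs (not values from the actual model):

```python
gamma = 0.99

# Made-up critic outputs for a single transition.
value      = 1.20   # V(s)
next_value = 1.00   # V(s')
reward     = 1.0
done       = 0      # 1 when the next state is terminal

# One-step TD advantage: A(s, a) = r + gamma * V(s') * (1 - done) - V(s)
advantage = reward + gamma * next_value * (1 - done) - value
print(round(advantage, 2))   # 0.79
```

In the listing, the critic is trained to minimize the square of this quantity (the TD error), while the actor scales −log π(a|s) by it, so actions that did better than the critic's estimate become more probable.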
<pre style="background: rgb(0, 0, 0); color: #d1d1d1;"><span style="color: #e66170; font-weight: bold;">import</span> os
<span style="color: #e66170; font-weight: bold;">import</span> random
<span style="color: #e66170; font-weight: bold;">import</span> math
<span style="color: #e66170; font-weight: bold;">import</span> gym
<span style="color: #e66170; font-weight: bold;">import</span> flax
<span style="color: #e66170; font-weight: bold;">import</span> jax
<span style="color: #e66170; font-weight: bold;">from</span> jax <span style="color: #e66170; font-weight: bold;">import</span> numpy <span style="color: #e66170; font-weight: bold;">as</span> jnp
<span style="color: #e66170; font-weight: bold;">import</span> numpy <span style="color: #e66170; font-weight: bold;">as</span> np
<span style="color: #e66170; font-weight: bold;">import</span> numpy
debug_render <span style="color: #d2cd86;">=</span> True
debug <span style="color: #d2cd86;">=</span> False
num_episodes <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">1500</span>
learning_rate <span style="color: #d2cd86;">=</span> <span style="color: #009f00;">0.001</span>
gamma <span style="color: #d2cd86;">=</span> <span style="color: #009f00;">0.99</span>
<span style="color: #e66170; font-weight: bold;">class</span> ActorNetwork<span style="color: #d2cd86;">(</span>flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Module<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
<span style="color: #e66170; font-weight: bold;">def</span> apply<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> x<span style="color: #d2cd86;">,</span> n_actions<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
dense_layer_1 <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Dense<span style="color: #d2cd86;">(</span>x<span style="color: #d2cd86;">,</span> <span style="color: #00a800;">64</span><span style="color: #d2cd86;">)</span>
activation_layer_1 <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>relu<span style="color: #d2cd86;">(</span>dense_layer_1<span style="color: #d2cd86;">)</span>
dense_layer_2 <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Dense<span style="color: #d2cd86;">(</span>activation_layer_1<span style="color: #d2cd86;">,</span> <span style="color: #00a800;">32</span><span style="color: #d2cd86;">)</span>
activation_layer_2 <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>relu<span style="color: #d2cd86;">(</span>dense_layer_2<span style="color: #d2cd86;">)</span>
output_dense_layer <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Dense<span style="color: #d2cd86;">(</span>activation_layer_2<span style="color: #d2cd86;">,</span> n_actions<span style="color: #d2cd86;">)</span>
output_layer <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>softmax<span style="color: #d2cd86;">(</span>output_dense_layer<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">return</span> output_layer
<span style="color: #e66170; font-weight: bold;">class</span> CriticNetwork<span style="color: #d2cd86;">(</span>flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Module<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
<span style="color: #e66170; font-weight: bold;">def</span> apply<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> x<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
dense_layer_1 <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Dense<span style="color: #d2cd86;">(</span>x<span style="color: #d2cd86;">,</span> <span style="color: #00a800;">64</span><span style="color: #d2cd86;">)</span>
activation_layer_1 <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>relu<span style="color: #d2cd86;">(</span>dense_layer_1<span style="color: #d2cd86;">)</span>
dense_layer_2 <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Dense<span style="color: #d2cd86;">(</span>activation_layer_1<span style="color: #d2cd86;">,</span> <span style="color: #00a800;">32</span><span style="color: #d2cd86;">)</span>
activation_layer_2 <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>relu<span style="color: #d2cd86;">(</span>dense_layer_2<span style="color: #d2cd86;">)</span>
output_dense_layer <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Dense<span style="color: #d2cd86;">(</span>activation_layer_2<span style="color: #d2cd86;">,</span> <span style="color: #00a800;">1</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">return</span> output_dense_layer
env <span style="color: #d2cd86;">=</span> gym<span style="color: #d2cd86;">.</span>make<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">'CartPole-v1'</span><span style="color: #d2cd86;">)</span>
state <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>reset<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
n_actions <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>action_space<span style="color: #d2cd86;">.</span>n
actor_module <span style="color: #d2cd86;">=</span> ActorNetwork<span style="color: #d2cd86;">.</span>partial<span style="color: #d2cd86;">(</span>n_actions<span style="color: #d2cd86;">=</span>n_actions<span style="color: #d2cd86;">)</span>
_<span style="color: #d2cd86;">,</span> actor_params <span style="color: #d2cd86;">=</span> actor_module<span style="color: #d2cd86;">.</span>init_by_shape<span style="color: #d2cd86;">(</span>jax<span style="color: #d2cd86;">.</span>random<span style="color: #d2cd86;">.</span>PRNGKey<span style="color: #d2cd86;">(</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">[</span>state<span style="color: #d2cd86;">.</span>shape<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
actor_model <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Model<span style="color: #d2cd86;">(</span>actor_module<span style="color: #d2cd86;">,</span> actor_params<span style="color: #d2cd86;">)</span>
critic_module <span style="color: #d2cd86;">=</span> CriticNetwork<span style="color: #d2cd86;">.</span>partial<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
_<span style="color: #d2cd86;">,</span> critic_params <span style="color: #d2cd86;">=</span> critic_module<span style="color: #d2cd86;">.</span>init_by_shape<span style="color: #d2cd86;">(</span>jax<span style="color: #d2cd86;">.</span>random<span style="color: #d2cd86;">.</span>PRNGKey<span style="color: #d2cd86;">(</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">[</span>state<span style="color: #d2cd86;">.</span>shape<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
critic_model <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Model<span style="color: #d2cd86;">(</span>critic_module<span style="color: #d2cd86;">,</span> critic_params<span style="color: #d2cd86;">)</span>
actor_optimizer <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>optim<span style="color: #d2cd86;">.</span>Adam<span style="color: #d2cd86;">(</span>learning_rate<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">.</span>create<span style="color: #d2cd86;">(</span>actor_model<span style="color: #d2cd86;">)</span>
critic_optimizer <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>optim<span style="color: #d2cd86;">.</span>Adam<span style="color: #d2cd86;">(</span>learning_rate<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">.</span>create<span style="color: #d2cd86;">(</span>critic_model<span style="color: #d2cd86;">)</span>
<span style="color: #d2cd86;">@</span>jax<span style="color: #d2cd86;">.</span>jit
<span style="color: #e66170; font-weight: bold;">def</span> actor_inference<span style="color: #d2cd86;">(</span>model<span style="color: #d2cd86;">,</span> x<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
<span style="color: #e66170; font-weight: bold;">return</span> model<span style="color: #d2cd86;">(</span>x<span style="color: #d2cd86;">)</span>
<span style="color: #d2cd86;">@</span>jax<span style="color: #d2cd86;">.</span>jit
<span style="color: #e66170; font-weight: bold;">def</span> backpropagate_critic<span style="color: #d2cd86;">(</span>optimizer<span style="color: #d2cd86;">,</span> props<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
<span style="color: #9999a9;"># props[0] - state</span>
<span style="color: #9999a9;"># props[1] - next_state</span>
<span style="color: #9999a9;"># props[2] - reward</span>
<span style="color: #9999a9;"># props[3] - done</span>
next_value <span style="color: #d2cd86;">=</span> jax<span style="color: #d2cd86;">.</span>lax<span style="color: #d2cd86;">.</span>stop_gradient<span style="color: #d2cd86;">(</span>optimizer<span style="color: #d2cd86;">.</span>target<span style="color: #d2cd86;">(</span>jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">[</span>props<span style="color: #d2cd86;">[</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">def</span> loss_fn<span style="color: #d2cd86;">(</span>model<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
value <span style="color: #d2cd86;">=</span> model<span style="color: #d2cd86;">(</span>jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">[</span>props<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span>
advantage <span style="color: #d2cd86;">=</span> props<span style="color: #d2cd86;">[</span><span style="color: #00a800;">2</span><span style="color: #d2cd86;">]</span><span style="color: #00dddd;">+</span><span style="color: #d2cd86;">(</span>gamma<span style="color: #00dddd;">*</span>next_value<span style="color: #d2cd86;">)</span><span style="color: #00dddd;">*</span><span style="color: #d2cd86;">(</span><span style="color: #00a800;">1</span><span style="color: #00dddd;">-</span>props<span style="color: #d2cd86;">[</span><span style="color: #00a800;">3</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span> <span style="color: #00dddd;">-</span> value
<span style="color: #e66170; font-weight: bold;">return</span> jnp<span style="color: #d2cd86;">.</span>square<span style="color: #d2cd86;">(</span>advantage<span style="color: #d2cd86;">)</span>
loss<span style="color: #d2cd86;">,</span> gradients <span style="color: #d2cd86;">=</span> jax<span style="color: #d2cd86;">.</span>value_and_grad<span style="color: #d2cd86;">(</span>loss_fn<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">(</span>optimizer<span style="color: #d2cd86;">.</span>target<span style="color: #d2cd86;">)</span>
optimizer <span style="color: #d2cd86;">=</span> optimizer<span style="color: #d2cd86;">.</span>apply_gradient<span style="color: #d2cd86;">(</span>gradients<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">return</span> optimizer<span style="color: #d2cd86;">,</span> loss
<span style="color: #d2cd86;">@</span>jax<span style="color: #d2cd86;">.</span>jit
<span style="color: #e66170; font-weight: bold;">def</span> backpropagate_actor<span style="color: #d2cd86;">(</span>optimizer<span style="color: #d2cd86;">,</span> critic_model<span style="color: #d2cd86;">,</span> props<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
    <span style="color: #9999a9;"># props[0] - state</span>
    <span style="color: #9999a9;"># props[1] - next_state</span>
    <span style="color: #9999a9;"># props[2] - reward</span>
    <span style="color: #9999a9;"># props[3] - done</span>
    <span style="color: #9999a9;"># props[4] - action</span>
    value <span style="color: #d2cd86;">=</span> jax<span style="color: #d2cd86;">.</span>lax<span style="color: #d2cd86;">.</span>stop_gradient<span style="color: #d2cd86;">(</span>critic_model<span style="color: #d2cd86;">(</span>jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">[</span>props<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
    next_value <span style="color: #d2cd86;">=</span> jax<span style="color: #d2cd86;">.</span>lax<span style="color: #d2cd86;">.</span>stop_gradient<span style="color: #d2cd86;">(</span>critic_model<span style="color: #d2cd86;">(</span>jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">[</span>props<span style="color: #d2cd86;">[</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
    advantage <span style="color: #d2cd86;">=</span> props<span style="color: #d2cd86;">[</span><span style="color: #00a800;">2</span><span style="color: #d2cd86;">]</span><span style="color: #00dddd;">+</span><span style="color: #d2cd86;">(</span>gamma<span style="color: #00dddd;">*</span>next_value<span style="color: #d2cd86;">)</span><span style="color: #00dddd;">*</span><span style="color: #d2cd86;">(</span><span style="color: #00a800;">1</span><span style="color: #00dddd;">-</span>props<span style="color: #d2cd86;">[</span><span style="color: #00a800;">3</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span> <span style="color: #00dddd;">-</span> value
    <span style="color: #e66170; font-weight: bold;">def</span> loss_fn<span style="color: #d2cd86;">(</span>model<span style="color: #d2cd86;">,</span> advantage<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        action_probabilities <span style="color: #d2cd86;">=</span> model<span style="color: #d2cd86;">(</span>jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">[</span>props<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span>
        probability <span style="color: #d2cd86;">=</span> action_probabilities<span style="color: #d2cd86;">[</span>props<span style="color: #d2cd86;">[</span><span style="color: #00a800;">4</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">]</span>
        log_probability <span style="color: #d2cd86;">=</span> <span style="color: #00dddd;">-</span>jnp<span style="color: #d2cd86;">.</span>log<span style="color: #d2cd86;">(</span>probability<span style="color: #d2cd86;">)</span>
        <span style="color: #e66170; font-weight: bold;">return</span> log_probability<span style="color: #00dddd;">*</span>advantage
    loss<span style="color: #d2cd86;">,</span> gradients <span style="color: #d2cd86;">=</span> jax<span style="color: #d2cd86;">.</span>value_and_grad<span style="color: #d2cd86;">(</span>loss_fn<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">(</span>optimizer<span style="color: #d2cd86;">.</span>target<span style="color: #d2cd86;">,</span> advantage<span style="color: #d2cd86;">)</span>
    optimizer <span style="color: #d2cd86;">=</span> optimizer<span style="color: #d2cd86;">.</span>apply_gradient<span style="color: #d2cd86;">(</span>gradients<span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">return</span> optimizer<span style="color: #d2cd86;">,</span> loss
global_step <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">0</span>
<span style="color: #e66170; font-weight: bold;">try</span><span style="color: #d2cd86;">:</span>
    <span style="color: #e66170; font-weight: bold;">for</span> episode <span style="color: #e66170; font-weight: bold;">in</span> <span style="color: #e66170; font-weight: bold;">range</span><span style="color: #d2cd86;">(</span>num_episodes<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        state <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>reset<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
        episode_rewards <span style="color: #d2cd86;">=</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span>
        <span style="color: #e66170; font-weight: bold;">while</span> True<span style="color: #d2cd86;">:</span>
            global_step <span style="color: #d2cd86;">=</span> global_step<span style="color: #00dddd;">+</span><span style="color: #00a800;">1</span>
            action_probabilities <span style="color: #d2cd86;">=</span> actor_inference<span style="color: #d2cd86;">(</span>actor_optimizer<span style="color: #d2cd86;">.</span>target<span style="color: #d2cd86;">,</span> jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">[</span>state<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
            action_probabilities <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>array<span style="color: #d2cd86;">(</span>action_probabilities<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
            action <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>random<span style="color: #d2cd86;">.</span>choice<span style="color: #d2cd86;">(</span>n_actions<span style="color: #d2cd86;">,</span> p<span style="color: #d2cd86;">=</span>action_probabilities<span style="color: #d2cd86;">)</span>
            next_state<span style="color: #d2cd86;">,</span> reward<span style="color: #d2cd86;">,</span> done<span style="color: #d2cd86;">,</span> _ <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>step<span style="color: #d2cd86;">(</span><span style="color: #e66170; font-weight: bold;">int</span><span style="color: #d2cd86;">(</span>action<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
            episode_rewards<span style="color: #d2cd86;">.</span>append<span style="color: #d2cd86;">(</span>reward<span style="color: #d2cd86;">)</span>
            actor_optimizer<span style="color: #d2cd86;">,</span> _ <span style="color: #d2cd86;">=</span> backpropagate_actor<span style="color: #d2cd86;">(</span>
                actor_optimizer<span style="color: #d2cd86;">,</span>
                critic_optimizer<span style="color: #d2cd86;">.</span>target<span style="color: #d2cd86;">,</span>
                <span style="color: #d2cd86;">(</span>state<span style="color: #d2cd86;">,</span> next_state<span style="color: #d2cd86;">,</span> reward<span style="color: #d2cd86;">,</span> <span style="color: #e66170; font-weight: bold;">int</span><span style="color: #d2cd86;">(</span>done<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span> action<span style="color: #d2cd86;">)</span>
            <span style="color: #d2cd86;">)</span>
            critic_optimizer<span style="color: #d2cd86;">,</span> _ <span style="color: #d2cd86;">=</span> backpropagate_critic<span style="color: #d2cd86;">(</span>
                critic_optimizer<span style="color: #d2cd86;">,</span>
                <span style="color: #d2cd86;">(</span>state<span style="color: #d2cd86;">,</span> next_state<span style="color: #d2cd86;">,</span> reward<span style="color: #d2cd86;">,</span> <span style="color: #e66170; font-weight: bold;">int</span><span style="color: #d2cd86;">(</span>done<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
            <span style="color: #d2cd86;">)</span>
            state <span style="color: #d2cd86;">=</span> next_state
            <span style="color: #e66170; font-weight: bold;">if</span> debug_render<span style="color: #d2cd86;">:</span>
                env<span style="color: #d2cd86;">.</span>render<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
            <span style="color: #e66170; font-weight: bold;">if</span> done<span style="color: #d2cd86;">:</span>
                <span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span>episode<span style="color: #d2cd86;">,</span> <span style="color: #00c4c4;">" - reward :"</span><span style="color: #d2cd86;">,</span> <span style="color: #e66170; font-weight: bold;">sum</span><span style="color: #d2cd86;">(</span>episode_rewards<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
                <span style="color: #e66170; font-weight: bold;">break</span>
<span style="color: #e66170; font-weight: bold;">finally</span><span style="color: #d2cd86;">:</span>
    env<span style="color: #d2cd86;">.</span>close<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
</pre>
<div><br /></div>
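<div>The one-step (TD(0)) advantage used in <code>backpropagate_actor</code> above, advantage = r + &#947;&#183;V(s')&#183;(1 - done) - V(s), can be checked with a tiny numeric example. The values below are made up purely for illustration:</div>

```python
gamma = 0.99
reward, done = 1.0, 0          # illustrative transition
value, next_value = 0.5, 0.6   # pretend critic outputs V(s), V(s')

# advantage = r + gamma * V(s') * (1 - done) - V(s)
advantage = reward + (gamma * next_value) * (1 - done) - value
print(round(advantage, 4))  # 1.094
```

<div>When done is 1 the bootstrap term vanishes and the advantage reduces to r - V(s), exactly as in the JAX code above.</div>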
<div><br /></div>
An episodic <b>Advantage Actor-Critic</b> implementation in JAX and Flax
<div><br /></div>
<div><br /></div>
<pre style='color:#d1d1d1;background:#000000;'><span style='color:#e66170; font-weight:bold; '>import</span> os
<span style='color:#e66170; font-weight:bold; '>import</span> random
<span style='color:#e66170; font-weight:bold; '>import</span> math
<span style='color:#e66170; font-weight:bold; '>import</span> gym
<span style='color:#e66170; font-weight:bold; '>import</span> flax
<span style='color:#e66170; font-weight:bold; '>import</span> jax
<span style='color:#e66170; font-weight:bold; '>from</span> jax <span style='color:#e66170; font-weight:bold; '>import</span> numpy <span style='color:#e66170; font-weight:bold; '>as</span> jnp
<span style='color:#e66170; font-weight:bold; '>import</span> numpy <span style='color:#e66170; font-weight:bold; '>as</span> np
debug_render <span style='color:#d2cd86; '>=</span> True
debug <span style='color:#d2cd86; '>=</span> False
num_episodes <span style='color:#d2cd86; '>=</span> <span style='color:#00a800; '>1500</span>
learning_rate <span style='color:#d2cd86; '>=</span> <span style='color:#009f00; '>0.001</span>
gamma <span style='color:#d2cd86; '>=</span> <span style='color:#009f00; '>0.99</span>
<span style='color:#e66170; font-weight:bold; '>class</span> ActorNetwork<span style='color:#d2cd86; '>(</span>flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>Module<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
    <span style='color:#e66170; font-weight:bold; '>def</span> apply<span style='color:#d2cd86; '>(</span>self<span style='color:#d2cd86; '>,</span> x<span style='color:#d2cd86; '>,</span> n_actions<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
        dense_layer_1 <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>Dense<span style='color:#d2cd86; '>(</span>x<span style='color:#d2cd86; '>,</span> <span style='color:#00a800; '>64</span><span style='color:#d2cd86; '>)</span>
        activation_layer_1 <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>relu<span style='color:#d2cd86; '>(</span>dense_layer_1<span style='color:#d2cd86; '>)</span>
        dense_layer_2 <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>Dense<span style='color:#d2cd86; '>(</span>activation_layer_1<span style='color:#d2cd86; '>,</span> <span style='color:#00a800; '>32</span><span style='color:#d2cd86; '>)</span>
        activation_layer_2 <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>relu<span style='color:#d2cd86; '>(</span>dense_layer_2<span style='color:#d2cd86; '>)</span>
        output_dense_layer <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>Dense<span style='color:#d2cd86; '>(</span>activation_layer_2<span style='color:#d2cd86; '>,</span> n_actions<span style='color:#d2cd86; '>)</span>
        output_layer <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>softmax<span style='color:#d2cd86; '>(</span>output_dense_layer<span style='color:#d2cd86; '>)</span>
        <span style='color:#e66170; font-weight:bold; '>return</span> output_layer
<span style='color:#e66170; font-weight:bold; '>class</span> CriticNetwork<span style='color:#d2cd86; '>(</span>flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>Module<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
    <span style='color:#e66170; font-weight:bold; '>def</span> apply<span style='color:#d2cd86; '>(</span>self<span style='color:#d2cd86; '>,</span> x<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
        dense_layer_1 <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>Dense<span style='color:#d2cd86; '>(</span>x<span style='color:#d2cd86; '>,</span> <span style='color:#00a800; '>64</span><span style='color:#d2cd86; '>)</span>
        activation_layer_1 <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>relu<span style='color:#d2cd86; '>(</span>dense_layer_1<span style='color:#d2cd86; '>)</span>
        dense_layer_2 <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>Dense<span style='color:#d2cd86; '>(</span>activation_layer_1<span style='color:#d2cd86; '>,</span> <span style='color:#00a800; '>32</span><span style='color:#d2cd86; '>)</span>
        activation_layer_2 <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>relu<span style='color:#d2cd86; '>(</span>dense_layer_2<span style='color:#d2cd86; '>)</span>
        output_dense_layer <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>Dense<span style='color:#d2cd86; '>(</span>activation_layer_2<span style='color:#d2cd86; '>,</span> <span style='color:#00a800; '>1</span><span style='color:#d2cd86; '>)</span>
        <span style='color:#e66170; font-weight:bold; '>return</span> output_dense_layer
env <span style='color:#d2cd86; '>=</span> gym<span style='color:#d2cd86; '>.</span>make<span style='color:#d2cd86; '>(</span><span style='color:#00c4c4; '>'CartPole-v1'</span><span style='color:#d2cd86; '>)</span>
state <span style='color:#d2cd86; '>=</span> env<span style='color:#d2cd86; '>.</span>reset<span style='color:#d2cd86; '>(</span><span style='color:#d2cd86; '>)</span>
n_actions <span style='color:#d2cd86; '>=</span> env<span style='color:#d2cd86; '>.</span>action_space<span style='color:#d2cd86; '>.</span>n
actor_module <span style='color:#d2cd86; '>=</span> ActorNetwork<span style='color:#d2cd86; '>.</span>partial<span style='color:#d2cd86; '>(</span>n_actions<span style='color:#d2cd86; '>=</span>n_actions<span style='color:#d2cd86; '>)</span>
_<span style='color:#d2cd86; '>,</span> actor_params <span style='color:#d2cd86; '>=</span> actor_module<span style='color:#d2cd86; '>.</span>init_by_shape<span style='color:#d2cd86; '>(</span>jax<span style='color:#d2cd86; '>.</span>random<span style='color:#d2cd86; '>.</span>PRNGKey<span style='color:#d2cd86; '>(</span><span style='color:#00a800; '>0</span><span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>,</span> <span style='color:#d2cd86; '>[</span>state<span style='color:#d2cd86; '>.</span>shape<span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>)</span>
actor_model <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>Model<span style='color:#d2cd86; '>(</span>actor_module<span style='color:#d2cd86; '>,</span> actor_params<span style='color:#d2cd86; '>)</span>
critic_module <span style='color:#d2cd86; '>=</span> CriticNetwork<span style='color:#d2cd86; '>.</span>partial<span style='color:#d2cd86; '>(</span><span style='color:#d2cd86; '>)</span>
_<span style='color:#d2cd86; '>,</span> critic_params <span style='color:#d2cd86; '>=</span> critic_module<span style='color:#d2cd86; '>.</span>init_by_shape<span style='color:#d2cd86; '>(</span>jax<span style='color:#d2cd86; '>.</span>random<span style='color:#d2cd86; '>.</span>PRNGKey<span style='color:#d2cd86; '>(</span><span style='color:#00a800; '>0</span><span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>,</span> <span style='color:#d2cd86; '>[</span>state<span style='color:#d2cd86; '>.</span>shape<span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>)</span>
critic_model <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>Model<span style='color:#d2cd86; '>(</span>critic_module<span style='color:#d2cd86; '>,</span> critic_params<span style='color:#d2cd86; '>)</span>
actor_optimizer <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>optim<span style='color:#d2cd86; '>.</span>Adam<span style='color:#d2cd86; '>(</span>learning_rate<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>.</span>create<span style='color:#d2cd86; '>(</span>actor_model<span style='color:#d2cd86; '>)</span>
critic_optimizer <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>optim<span style='color:#d2cd86; '>.</span>Adam<span style='color:#d2cd86; '>(</span>learning_rate<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>.</span>create<span style='color:#d2cd86; '>(</span>critic_model<span style='color:#d2cd86; '>)</span>
<span style='color:#d2cd86; '>@</span>jax<span style='color:#d2cd86; '>.</span>jit
<span style='color:#e66170; font-weight:bold; '>def</span> actor_inference<span style='color:#d2cd86; '>(</span>model<span style='color:#d2cd86; '>,</span> x<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
    <span style='color:#e66170; font-weight:bold; '>return</span> model<span style='color:#d2cd86; '>(</span>x<span style='color:#d2cd86; '>)</span>
<span style='color:#d2cd86; '>@</span>jax<span style='color:#d2cd86; '>.</span>jit
<span style='color:#e66170; font-weight:bold; '>def</span> critic_inference<span style='color:#d2cd86; '>(</span>model<span style='color:#d2cd86; '>,</span> x<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
    <span style='color:#e66170; font-weight:bold; '>return</span> model<span style='color:#d2cd86; '>(</span>x<span style='color:#d2cd86; '>)</span>
<span style='color:#d2cd86; '>@</span>jax<span style='color:#d2cd86; '>.</span>jit
<span style='color:#e66170; font-weight:bold; '>def</span> backpropagate_critic<span style='color:#d2cd86; '>(</span>optimizer<span style='color:#d2cd86; '>,</span> props<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
    <span style='color:#9999a9; '># props[0] - states</span>
    <span style='color:#9999a9; '># props[1] - discounted_rewards</span>
    <span style='color:#e66170; font-weight:bold; '>def</span> loss_fn<span style='color:#d2cd86; '>(</span>model<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
        values <span style='color:#d2cd86; '>=</span> model<span style='color:#d2cd86; '>(</span>props<span style='color:#d2cd86; '>[</span><span style='color:#00a800; '>0</span><span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>)</span>
        values <span style='color:#d2cd86; '>=</span> jnp<span style='color:#d2cd86; '>.</span>reshape<span style='color:#d2cd86; '>(</span>values<span style='color:#d2cd86; '>,</span> <span style='color:#d2cd86; '>(</span>values<span style='color:#d2cd86; '>.</span>shape<span style='color:#d2cd86; '>[</span><span style='color:#00a800; '>0</span><span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>,</span><span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>)</span>
        advantages <span style='color:#d2cd86; '>=</span> props<span style='color:#d2cd86; '>[</span><span style='color:#00a800; '>1</span><span style='color:#d2cd86; '>]</span> <span style='color:#00dddd; '>-</span> values
        <span style='color:#e66170; font-weight:bold; '>return</span> jnp<span style='color:#d2cd86; '>.</span>mean<span style='color:#d2cd86; '>(</span>jnp<span style='color:#d2cd86; '>.</span>square<span style='color:#d2cd86; '>(</span>advantages<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>)</span>
    loss<span style='color:#d2cd86; '>,</span> gradients <span style='color:#d2cd86; '>=</span> jax<span style='color:#d2cd86; '>.</span>value_and_grad<span style='color:#d2cd86; '>(</span>loss_fn<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>(</span>optimizer<span style='color:#d2cd86; '>.</span>target<span style='color:#d2cd86; '>)</span>
    optimizer <span style='color:#d2cd86; '>=</span> optimizer<span style='color:#d2cd86; '>.</span>apply_gradient<span style='color:#d2cd86; '>(</span>gradients<span style='color:#d2cd86; '>)</span>
    <span style='color:#e66170; font-weight:bold; '>return</span> optimizer<span style='color:#d2cd86; '>,</span> loss
<span style='color:#d2cd86; '>@</span>jax<span style='color:#d2cd86; '>.</span>vmap
<span style='color:#e66170; font-weight:bold; '>def</span> gather<span style='color:#d2cd86; '>(</span>probability_vec<span style='color:#d2cd86; '>,</span> action_index<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
    <span style='color:#e66170; font-weight:bold; '>return</span> probability_vec<span style='color:#d2cd86; '>[</span>action_index<span style='color:#d2cd86; '>]</span>
<span style='color:#d2cd86; '>@</span>jax<span style='color:#d2cd86; '>.</span>jit
<span style='color:#e66170; font-weight:bold; '>def</span> backpropagate_actor<span style='color:#d2cd86; '>(</span>optimizer<span style='color:#d2cd86; '>,</span> critic_model<span style='color:#d2cd86; '>,</span> props<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
    <span style='color:#9999a9; '># props[0] - states</span>
    <span style='color:#9999a9; '># props[1] - discounted_rewards</span>
    <span style='color:#9999a9; '># props[2] - actions</span>
    values <span style='color:#d2cd86; '>=</span> jax<span style='color:#d2cd86; '>.</span>lax<span style='color:#d2cd86; '>.</span>stop_gradient<span style='color:#d2cd86; '>(</span>critic_model<span style='color:#d2cd86; '>(</span>props<span style='color:#d2cd86; '>[</span><span style='color:#00a800; '>0</span><span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>)</span>
    values <span style='color:#d2cd86; '>=</span> jnp<span style='color:#d2cd86; '>.</span>reshape<span style='color:#d2cd86; '>(</span>values<span style='color:#d2cd86; '>,</span> <span style='color:#d2cd86; '>(</span>values<span style='color:#d2cd86; '>.</span>shape<span style='color:#d2cd86; '>[</span><span style='color:#00a800; '>0</span><span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>,</span><span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>)</span>
    advantages <span style='color:#d2cd86; '>=</span> props<span style='color:#d2cd86; '>[</span><span style='color:#00a800; '>1</span><span style='color:#d2cd86; '>]</span> <span style='color:#00dddd; '>-</span> values
    <span style='color:#e66170; font-weight:bold; '>def</span> loss_fn<span style='color:#d2cd86; '>(</span>model<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
        action_probabilities <span style='color:#d2cd86; '>=</span> model<span style='color:#d2cd86; '>(</span>props<span style='color:#d2cd86; '>[</span><span style='color:#00a800; '>0</span><span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>)</span>
        probabilities <span style='color:#d2cd86; '>=</span> gather<span style='color:#d2cd86; '>(</span>action_probabilities<span style='color:#d2cd86; '>,</span> props<span style='color:#d2cd86; '>[</span><span style='color:#00a800; '>2</span><span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>)</span>
        log_probabilities <span style='color:#d2cd86; '>=</span> <span style='color:#00dddd; '>-</span>jnp<span style='color:#d2cd86; '>.</span>log<span style='color:#d2cd86; '>(</span>probabilities<span style='color:#d2cd86; '>)</span>
        <span style='color:#e66170; font-weight:bold; '>return</span> jnp<span style='color:#d2cd86; '>.</span>mean<span style='color:#d2cd86; '>(</span>jnp<span style='color:#d2cd86; '>.</span>multiply<span style='color:#d2cd86; '>(</span>log_probabilities<span style='color:#d2cd86; '>,</span> advantages<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>)</span>
    loss<span style='color:#d2cd86; '>,</span> gradients <span style='color:#d2cd86; '>=</span> jax<span style='color:#d2cd86; '>.</span>value_and_grad<span style='color:#d2cd86; '>(</span>loss_fn<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>(</span>optimizer<span style='color:#d2cd86; '>.</span>target<span style='color:#d2cd86; '>)</span>
    optimizer <span style='color:#d2cd86; '>=</span> optimizer<span style='color:#d2cd86; '>.</span>apply_gradient<span style='color:#d2cd86; '>(</span>gradients<span style='color:#d2cd86; '>)</span>
    <span style='color:#e66170; font-weight:bold; '>return</span> optimizer<span style='color:#d2cd86; '>,</span> loss
global_step <span style='color:#d2cd86; '>=</span> <span style='color:#00a800; '>0</span>
<span style='color:#e66170; font-weight:bold; '>try</span><span style='color:#d2cd86; '>:</span>
    <span style='color:#e66170; font-weight:bold; '>for</span> episode <span style='color:#e66170; font-weight:bold; '>in</span> <span style='color:#e66170; font-weight:bold; '>range</span><span style='color:#d2cd86; '>(</span>num_episodes<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
        state <span style='color:#d2cd86; '>=</span> env<span style='color:#d2cd86; '>.</span>reset<span style='color:#d2cd86; '>(</span><span style='color:#d2cd86; '>)</span>
        states<span style='color:#d2cd86; '>,</span> actions<span style='color:#d2cd86; '>,</span> rewards<span style='color:#d2cd86; '>,</span> dones <span style='color:#d2cd86; '>=</span> <span style='color:#d2cd86; '>[</span><span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>,</span> <span style='color:#d2cd86; '>[</span><span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>,</span> <span style='color:#d2cd86; '>[</span><span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>,</span> <span style='color:#d2cd86; '>[</span><span style='color:#d2cd86; '>]</span>
        <span style='color:#e66170; font-weight:bold; '>while</span> True<span style='color:#d2cd86; '>:</span>
            global_step <span style='color:#d2cd86; '>=</span> global_step<span style='color:#00dddd; '>+</span><span style='color:#00a800; '>1</span>
            action_probabilities <span style='color:#d2cd86; '>=</span> actor_inference<span style='color:#d2cd86; '>(</span>actor_optimizer<span style='color:#d2cd86; '>.</span>target<span style='color:#d2cd86; '>,</span> jnp<span style='color:#d2cd86; '>.</span>asarray<span style='color:#d2cd86; '>(</span><span style='color:#d2cd86; '>[</span>state<span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>)</span>
            action_probabilities <span style='color:#d2cd86; '>=</span> np<span style='color:#d2cd86; '>.</span>array<span style='color:#d2cd86; '>(</span>action_probabilities<span style='color:#d2cd86; '>[</span><span style='color:#00a800; '>0</span><span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>)</span>
            action <span style='color:#d2cd86; '>=</span> np<span style='color:#d2cd86; '>.</span>random<span style='color:#d2cd86; '>.</span>choice<span style='color:#d2cd86; '>(</span>n_actions<span style='color:#d2cd86; '>,</span> p<span style='color:#d2cd86; '>=</span>action_probabilities<span style='color:#d2cd86; '>)</span>
            next_state<span style='color:#d2cd86; '>,</span> reward<span style='color:#d2cd86; '>,</span> done<span style='color:#d2cd86; '>,</span> _ <span style='color:#d2cd86; '>=</span> env<span style='color:#d2cd86; '>.</span>step<span style='color:#d2cd86; '>(</span><span style='color:#e66170; font-weight:bold; '>int</span><span style='color:#d2cd86; '>(</span>action<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>)</span>
            states<span style='color:#d2cd86; '>.</span>append<span style='color:#d2cd86; '>(</span>state<span style='color:#d2cd86; '>)</span>
            actions<span style='color:#d2cd86; '>.</span>append<span style='color:#d2cd86; '>(</span>action<span style='color:#d2cd86; '>)</span>
            rewards<span style='color:#d2cd86; '>.</span>append<span style='color:#d2cd86; '>(</span>reward<span style='color:#d2cd86; '>)</span>
            dones<span style='color:#d2cd86; '>.</span>append<span style='color:#d2cd86; '>(</span><span style='color:#e66170; font-weight:bold; '>int</span><span style='color:#d2cd86; '>(</span>done<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>)</span>
            state <span style='color:#d2cd86; '>=</span> next_state
            <span style='color:#e66170; font-weight:bold; '>if</span> debug_render<span style='color:#d2cd86; '>:</span>
                env<span style='color:#d2cd86; '>.</span>render<span style='color:#d2cd86; '>(</span><span style='color:#d2cd86; '>)</span>
            <span style='color:#e66170; font-weight:bold; '>if</span> done<span style='color:#d2cd86; '>:</span>
                <span style='color:#e66170; font-weight:bold; '>print</span><span style='color:#d2cd86; '>(</span>episode<span style='color:#d2cd86; '>,</span> <span style='color:#00c4c4; '>" - reward :"</span><span style='color:#d2cd86; '>,</span> <span style='color:#e66170; font-weight:bold; '>sum</span><span style='color:#d2cd86; '>(</span>rewards<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>)</span>
                episode_length <span style='color:#d2cd86; '>=</span> <span style='color:#e66170; font-weight:bold; '>len</span><span style='color:#d2cd86; '>(</span>rewards<span style='color:#d2cd86; '>)</span>
                discounted_rewards <span style='color:#d2cd86; '>=</span> np<span style='color:#d2cd86; '>.</span>zeros_like<span style='color:#d2cd86; '>(</span>rewards<span style='color:#d2cd86; '>)</span>
                <span style='color:#e66170; font-weight:bold; '>for</span> t <span style='color:#e66170; font-weight:bold; '>in</span> <span style='color:#e66170; font-weight:bold; '>range</span><span style='color:#d2cd86; '>(</span><span style='color:#00a800; '>0</span><span style='color:#d2cd86; '>,</span> episode_length<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
                    G_t <span style='color:#d2cd86; '>=</span> <span style='color:#00a800; '>0</span>
                    <span style='color:#e66170; font-weight:bold; '>for</span> idx<span style='color:#d2cd86; '>,</span> j <span style='color:#e66170; font-weight:bold; '>in</span> <span style='color:#e66170; font-weight:bold; '>enumerate</span><span style='color:#d2cd86; '>(</span><span style='color:#e66170; font-weight:bold; '>range</span><span style='color:#d2cd86; '>(</span>t<span style='color:#d2cd86; '>,</span> episode_length<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
                        G_t <span style='color:#d2cd86; '>=</span> G_t <span style='color:#00dddd; '>+</span> <span style='color:#d2cd86; '>(</span>gamma<span style='color:#00dddd; '>**</span>idx<span style='color:#d2cd86; '>)</span><span style='color:#00dddd; '>*</span>rewards<span style='color:#d2cd86; '>[</span>j<span style='color:#d2cd86; '>]</span><span style='color:#00dddd; '>*</span><span style='color:#d2cd86; '>(</span><span style='color:#00a800; '>1</span><span style='color:#00dddd; '>-</span>dones<span style='color:#d2cd86; '>[</span>j<span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>)</span>
                    discounted_rewards<span style='color:#d2cd86; '>[</span>t<span style='color:#d2cd86; '>]</span> <span style='color:#d2cd86; '>=</span> G_t
                discounted_rewards <span style='color:#d2cd86; '>=</span> discounted_rewards <span style='color:#00dddd; '>-</span> np<span style='color:#d2cd86; '>.</span>mean<span style='color:#d2cd86; '>(</span>discounted_rewards<span style='color:#d2cd86; '>)</span>
                discounted_rewards <span style='color:#d2cd86; '>=</span> discounted_rewards <span style='color:#00dddd; '>/</span> <span style='color:#d2cd86; '>(</span>np<span style='color:#d2cd86; '>.</span>std<span style='color:#d2cd86; '>(</span>discounted_rewards<span style='color:#d2cd86; '>)</span><span style='color:#00dddd; '>+</span><span style='color:#00a800; '>1e-10</span><span style='color:#d2cd86; '>)</span>
                actor_optimizer<span style='color:#d2cd86; '>,</span> _ <span style='color:#d2cd86; '>=</span> backpropagate_actor<span style='color:#d2cd86; '>(</span>
                    actor_optimizer<span style='color:#d2cd86; '>,</span>
                    critic_optimizer<span style='color:#d2cd86; '>.</span>target<span style='color:#d2cd86; '>,</span>
                    <span style='color:#d2cd86; '>(</span>
                        jnp<span style='color:#d2cd86; '>.</span>asarray<span style='color:#d2cd86; '>(</span>states<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>,</span>
                        jnp<span style='color:#d2cd86; '>.</span>asarray<span style='color:#d2cd86; '>(</span>discounted_rewards<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>,</span>
                        jnp<span style='color:#d2cd86; '>.</span>asarray<span style='color:#d2cd86; '>(</span>actions<span style='color:#d2cd86; '>)</span>
                    <span style='color:#d2cd86; '>)</span>
                <span style='color:#d2cd86; '>)</span>
                critic_optimizer<span style='color:#d2cd86; '>,</span> _ <span style='color:#d2cd86; '>=</span> backpropagate_critic<span style='color:#d2cd86; '>(</span>
                    critic_optimizer<span style='color:#d2cd86; '>,</span>
                    <span style='color:#d2cd86; '>(</span>
                        jnp<span style='color:#d2cd86; '>.</span>asarray<span style='color:#d2cd86; '>(</span>states<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>,</span>
                        jnp<span style='color:#d2cd86; '>.</span>asarray<span style='color:#d2cd86; '>(</span>discounted_rewards<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>,</span>
                    <span style='color:#d2cd86; '>)</span>
                <span style='color:#d2cd86; '>)</span>
                <span style='color:#e66170; font-weight:bold; '>break</span>
<span style='color:#e66170; font-weight:bold; '>finally</span><span style='color:#d2cd86; '>:</span>
    env<span style='color:#d2cd86; '>.</span>close<span style='color:#d2cd86; '>(</span><span style='color:#d2cd86; '>)</span>
</pre>
<div><br /></div>
<div><br /></div>
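The return computation in the block above builds each G_t with two nested loops, which is O(n²) in the episode length. A sketch of an equivalent O(n) backward recursion (the helper name <code>discounted_returns</code> is hypothetical; note that this common variant still counts the reward received at a terminal step and only cuts off the tail after it, whereas the nested loop above also zeroes the terminal reward itself):

```python
import numpy as np

def discounted_returns(rewards, dones, gamma=0.99):
    # Backward pass: G_t = r_t + gamma * G_{t+1} * (1 - done_t).
    # One sweep from the end of the episode replaces the O(n^2) double loop.
    returns = np.zeros(len(rewards), dtype=np.float32)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g * (1 - dones[t])
        returns[t] = g
    return returns
```

For long CartPole episodes (hundreds of steps) this difference is already measurable, since the nested version touches every reward once per time step.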
An implementation of A3C, Asynchronous Advantage Actor-Critic. It differs from A2C only in that it creates multiple environments and runs them in parallel.
<div><br />Its advantage is that it uses the available resources as efficiently as possible, keeps the GPU or other accelerators busy, and as a result trains the model faster.</div>
<div><br /></div>
<pre style='color:#d1d1d1;background:#000000;'><span style='color:#e66170; font-weight:bold; '>import</span> os
<span style='color:#e66170; font-weight:bold; '>import</span> random
<span style='color:#e66170; font-weight:bold; '>import</span> math
<span style='color:#e66170; font-weight:bold; '>import</span> time
<span style='color:#e66170; font-weight:bold; '>import</span> threading
<span style='color:#e66170; font-weight:bold; '>import</span> gym
<span style='color:#e66170; font-weight:bold; '>import</span> flax
<span style='color:#e66170; font-weight:bold; '>import</span> jax
<span style='color:#e66170; font-weight:bold; '>from</span> jax <span style='color:#e66170; font-weight:bold; '>import</span> numpy <span style='color:#e66170; font-weight:bold; '>as</span> jnp
<span style='color:#e66170; font-weight:bold; '>import</span> numpy <span style='color:#e66170; font-weight:bold; '>as</span> np
debug_render <span style='color:#d2cd86; '>=</span> False
num_episodes <span style='color:#d2cd86; '>=</span> <span style='color:#00a800; '>1500</span>
learning_rate <span style='color:#d2cd86; '>=</span> <span style='color:#009f00; '>0.001</span>
gamma <span style='color:#d2cd86; '>=</span> <span style='color:#009f00; '>0.99</span>
env_name <span style='color:#d2cd86; '>=</span> <span style='color:#00c4c4; '>"CartPole-v1"</span>
n_workers <span style='color:#d2cd86; '>=</span> <span style='color:#00a800; '>8</span>
<span style='color:#e66170; font-weight:bold; '>class</span> ActorNetwork<span style='color:#d2cd86; '>(</span>flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>Module<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
    <span style='color:#e66170; font-weight:bold; '>def</span> apply<span style='color:#d2cd86; '>(</span>self<span style='color:#d2cd86; '>,</span> x<span style='color:#d2cd86; '>,</span> n_actions<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
        dense_layer_1 <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>Dense<span style='color:#d2cd86; '>(</span>x<span style='color:#d2cd86; '>,</span> <span style='color:#00a800; '>64</span><span style='color:#d2cd86; '>)</span>
        activation_layer_1 <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>relu<span style='color:#d2cd86; '>(</span>dense_layer_1<span style='color:#d2cd86; '>)</span>
        dense_layer_2 <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>Dense<span style='color:#d2cd86; '>(</span>activation_layer_1<span style='color:#d2cd86; '>,</span> <span style='color:#00a800; '>32</span><span style='color:#d2cd86; '>)</span>
        activation_layer_2 <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>relu<span style='color:#d2cd86; '>(</span>dense_layer_2<span style='color:#d2cd86; '>)</span>
        output_dense_layer <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>Dense<span style='color:#d2cd86; '>(</span>activation_layer_2<span style='color:#d2cd86; '>,</span> n_actions<span style='color:#d2cd86; '>)</span>
        output_layer <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>softmax<span style='color:#d2cd86; '>(</span>output_dense_layer<span style='color:#d2cd86; '>)</span>
        <span style='color:#e66170; font-weight:bold; '>return</span> output_layer
<span style='color:#e66170; font-weight:bold; '>class</span> CriticNetwork<span style='color:#d2cd86; '>(</span>flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>Module<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
    <span style='color:#e66170; font-weight:bold; '>def</span> apply<span style='color:#d2cd86; '>(</span>self<span style='color:#d2cd86; '>,</span> x<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
        dense_layer_1 <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>Dense<span style='color:#d2cd86; '>(</span>x<span style='color:#d2cd86; '>,</span> <span style='color:#00a800; '>64</span><span style='color:#d2cd86; '>)</span>
        activation_layer_1 <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>relu<span style='color:#d2cd86; '>(</span>dense_layer_1<span style='color:#d2cd86; '>)</span>
        dense_layer_2 <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>Dense<span style='color:#d2cd86; '>(</span>activation_layer_1<span style='color:#d2cd86; '>,</span> <span style='color:#00a800; '>32</span><span style='color:#d2cd86; '>)</span>
        activation_layer_2 <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>relu<span style='color:#d2cd86; '>(</span>dense_layer_2<span style='color:#d2cd86; '>)</span>
        output_dense_layer <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>Dense<span style='color:#d2cd86; '>(</span>activation_layer_2<span style='color:#d2cd86; '>,</span> <span style='color:#00a800; '>1</span><span style='color:#d2cd86; '>)</span>
        <span style='color:#e66170; font-weight:bold; '>return</span> output_dense_layer
env <span style='color:#d2cd86; '>=</span> gym<span style='color:#d2cd86; '>.</span>make<span style='color:#d2cd86; '>(</span>env_name<span style='color:#d2cd86; '>)</span>
state <span style='color:#d2cd86; '>=</span> env<span style='color:#d2cd86; '>.</span>reset<span style='color:#d2cd86; '>(</span><span style='color:#d2cd86; '>)</span>
n_actions <span style='color:#d2cd86; '>=</span> env<span style='color:#d2cd86; '>.</span>action_space<span style='color:#d2cd86; '>.</span>n
env<span style='color:#d2cd86; '>.</span>close<span style='color:#d2cd86; '>(</span><span style='color:#d2cd86; '>)</span>
actor_module <span style='color:#d2cd86; '>=</span> ActorNetwork<span style='color:#d2cd86; '>.</span>partial<span style='color:#d2cd86; '>(</span>n_actions<span style='color:#d2cd86; '>=</span>n_actions<span style='color:#d2cd86; '>)</span>
_<span style='color:#d2cd86; '>,</span> actor_params <span style='color:#d2cd86; '>=</span> actor_module<span style='color:#d2cd86; '>.</span>init_by_shape<span style='color:#d2cd86; '>(</span>jax<span style='color:#d2cd86; '>.</span>random<span style='color:#d2cd86; '>.</span>PRNGKey<span style='color:#d2cd86; '>(</span><span style='color:#00a800; '>0</span><span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>,</span> <span style='color:#d2cd86; '>[</span>state<span style='color:#d2cd86; '>.</span>shape<span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>)</span>
actor_model <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>Model<span style='color:#d2cd86; '>(</span>actor_module<span style='color:#d2cd86; '>,</span> actor_params<span style='color:#d2cd86; '>)</span>
critic_module <span style='color:#d2cd86; '>=</span> CriticNetwork<span style='color:#d2cd86; '>.</span>partial<span style='color:#d2cd86; '>(</span><span style='color:#d2cd86; '>)</span>
_<span style='color:#d2cd86; '>,</span> critic_params <span style='color:#d2cd86; '>=</span> critic_module<span style='color:#d2cd86; '>.</span>init_by_shape<span style='color:#d2cd86; '>(</span>jax<span style='color:#d2cd86; '>.</span>random<span style='color:#d2cd86; '>.</span>PRNGKey<span style='color:#d2cd86; '>(</span><span style='color:#00a800; '>0</span><span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>,</span> <span style='color:#d2cd86; '>[</span>state<span style='color:#d2cd86; '>.</span>shape<span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>)</span>
critic_model <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>nn<span style='color:#d2cd86; '>.</span>Model<span style='color:#d2cd86; '>(</span>critic_module<span style='color:#d2cd86; '>,</span> critic_params<span style='color:#d2cd86; '>)</span>
actor_optimizer <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>optim<span style='color:#d2cd86; '>.</span>Adam<span style='color:#d2cd86; '>(</span>learning_rate<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>.</span>create<span style='color:#d2cd86; '>(</span>actor_model<span style='color:#d2cd86; '>)</span>
critic_optimizer <span style='color:#d2cd86; '>=</span> flax<span style='color:#d2cd86; '>.</span>optim<span style='color:#d2cd86; '>.</span>Adam<span style='color:#d2cd86; '>(</span>learning_rate<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>.</span>create<span style='color:#d2cd86; '>(</span>critic_model<span style='color:#d2cd86; '>)</span>
<span style='color:#d2cd86; '>@</span>jax<span style='color:#d2cd86; '>.</span>jit
<span style='color:#e66170; font-weight:bold; '>def</span> actor_inference<span style='color:#d2cd86; '>(</span>model<span style='color:#d2cd86; '>,</span> x<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
    <span style='color:#e66170; font-weight:bold; '>return</span> model<span style='color:#d2cd86; '>(</span>x<span style='color:#d2cd86; '>)</span>
<span style='color:#d2cd86; '>@</span>jax<span style='color:#d2cd86; '>.</span>jit
<span style='color:#e66170; font-weight:bold; '>def</span> critic_inference<span style='color:#d2cd86; '>(</span>model<span style='color:#d2cd86; '>,</span> x<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
    <span style='color:#e66170; font-weight:bold; '>return</span> model<span style='color:#d2cd86; '>(</span>x<span style='color:#d2cd86; '>)</span>
<span style='color:#d2cd86; '>@</span>jax<span style='color:#d2cd86; '>.</span>jit
<span style='color:#e66170; font-weight:bold; '>def</span> backpropagate_critic<span style='color:#d2cd86; '>(</span>optimizer<span style='color:#d2cd86; '>,</span> props<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
    <span style='color:#9999a9; '># props[0] - states</span>
    <span style='color:#9999a9; '># props[1] - discounted_rewards</span>
    <span style='color:#e66170; font-weight:bold; '>def</span> loss_fn<span style='color:#d2cd86; '>(</span>model<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
        values <span style='color:#d2cd86; '>=</span> model<span style='color:#d2cd86; '>(</span>props<span style='color:#d2cd86; '>[</span><span style='color:#00a800; '>0</span><span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>)</span>
        values <span style='color:#d2cd86; '>=</span> jnp<span style='color:#d2cd86; '>.</span>reshape<span style='color:#d2cd86; '>(</span>values<span style='color:#d2cd86; '>,</span> <span style='color:#d2cd86; '>(</span>values<span style='color:#d2cd86; '>.</span>shape<span style='color:#d2cd86; '>[</span><span style='color:#00a800; '>0</span><span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>,</span><span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>)</span>
        advantages <span style='color:#d2cd86; '>=</span> props<span style='color:#d2cd86; '>[</span><span style='color:#00a800; '>1</span><span style='color:#d2cd86; '>]</span> <span style='color:#00dddd; '>-</span> values
        <span style='color:#e66170; font-weight:bold; '>return</span> jnp<span style='color:#d2cd86; '>.</span>mean<span style='color:#d2cd86; '>(</span>jnp<span style='color:#d2cd86; '>.</span>square<span style='color:#d2cd86; '>(</span>advantages<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>)</span>
    loss<span style='color:#d2cd86; '>,</span> gradients <span style='color:#d2cd86; '>=</span> jax<span style='color:#d2cd86; '>.</span>value_and_grad<span style='color:#d2cd86; '>(</span>loss_fn<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>(</span>optimizer<span style='color:#d2cd86; '>.</span>target<span style='color:#d2cd86; '>)</span>
    optimizer <span style='color:#d2cd86; '>=</span> optimizer<span style='color:#d2cd86; '>.</span>apply_gradient<span style='color:#d2cd86; '>(</span>gradients<span style='color:#d2cd86; '>)</span>
    <span style='color:#e66170; font-weight:bold; '>return</span> optimizer<span style='color:#d2cd86; '>,</span> loss
<span style='color:#d2cd86; '>@</span>jax<span style='color:#d2cd86; '>.</span>vmap
<span style='color:#e66170; font-weight:bold; '>def</span> gather<span style='color:#d2cd86; '>(</span>probability_vec<span style='color:#d2cd86; '>,</span> action_index<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
    <span style='color:#e66170; font-weight:bold; '>return</span> probability_vec<span style='color:#d2cd86; '>[</span>action_index<span style='color:#d2cd86; '>]</span>
<span style='color:#d2cd86; '>@</span>jax<span style='color:#d2cd86; '>.</span>jit
<span style='color:#e66170; font-weight:bold; '>def</span> backpropagate_actor<span style='color:#d2cd86; '>(</span>optimizer<span style='color:#d2cd86; '>,</span> critic_model<span style='color:#d2cd86; '>,</span> props<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
    <span style='color:#9999a9; '># props[0] - states</span>
    <span style='color:#9999a9; '># props[1] - discounted_rewards</span>
    <span style='color:#9999a9; '># props[2] - actions</span>
    values <span style='color:#d2cd86; '>=</span> jax<span style='color:#d2cd86; '>.</span>lax<span style='color:#d2cd86; '>.</span>stop_gradient<span style='color:#d2cd86; '>(</span>critic_model<span style='color:#d2cd86; '>(</span>props<span style='color:#d2cd86; '>[</span><span style='color:#00a800; '>0</span><span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>)</span>
    values <span style='color:#d2cd86; '>=</span> jnp<span style='color:#d2cd86; '>.</span>reshape<span style='color:#d2cd86; '>(</span>values<span style='color:#d2cd86; '>,</span> <span style='color:#d2cd86; '>(</span>values<span style='color:#d2cd86; '>.</span>shape<span style='color:#d2cd86; '>[</span><span style='color:#00a800; '>0</span><span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>,</span><span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>)</span>
    advantages <span style='color:#d2cd86; '>=</span> props<span style='color:#d2cd86; '>[</span><span style='color:#00a800; '>1</span><span style='color:#d2cd86; '>]</span> <span style='color:#00dddd; '>-</span> values
    <span style='color:#e66170; font-weight:bold; '>def</span> loss_fn<span style='color:#d2cd86; '>(</span>model<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
        action_probabilities <span style='color:#d2cd86; '>=</span> model<span style='color:#d2cd86; '>(</span>props<span style='color:#d2cd86; '>[</span><span style='color:#00a800; '>0</span><span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>)</span>
        probabilities <span style='color:#d2cd86; '>=</span> gather<span style='color:#d2cd86; '>(</span>action_probabilities<span style='color:#d2cd86; '>,</span> props<span style='color:#d2cd86; '>[</span><span style='color:#00a800; '>2</span><span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>)</span>
        log_probabilities <span style='color:#d2cd86; '>=</span> <span style='color:#00dddd; '>-</span>jnp<span style='color:#d2cd86; '>.</span>log<span style='color:#d2cd86; '>(</span>probabilities<span style='color:#d2cd86; '>)</span>
        <span style='color:#e66170; font-weight:bold; '>return</span> jnp<span style='color:#d2cd86; '>.</span>mean<span style='color:#d2cd86; '>(</span>jnp<span style='color:#d2cd86; '>.</span>multiply<span style='color:#d2cd86; '>(</span>log_probabilities<span style='color:#d2cd86; '>,</span> advantages<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>)</span>
    loss<span style='color:#d2cd86; '>,</span> gradients <span style='color:#d2cd86; '>=</span> jax<span style='color:#d2cd86; '>.</span>value_and_grad<span style='color:#d2cd86; '>(</span>loss_fn<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>(</span>optimizer<span style='color:#d2cd86; '>.</span>target<span style='color:#d2cd86; '>)</span>
    optimizer <span style='color:#d2cd86; '>=</span> optimizer<span style='color:#d2cd86; '>.</span>apply_gradient<span style='color:#d2cd86; '>(</span>gradients<span style='color:#d2cd86; '>)</span>
    <span style='color:#e66170; font-weight:bold; '>return</span> optimizer<span style='color:#d2cd86; '>,</span> loss
episode_count <span style='color:#d2cd86; '>=</span> <span style='color:#00a800; '>0</span>
global_step <span style='color:#d2cd86; '>=</span> <span style='color:#00a800; '>0</span>
lock <span style='color:#d2cd86; '>=</span> threading<span style='color:#d2cd86; '>.</span>Lock<span style='color:#d2cd86; '>(</span><span style='color:#d2cd86; '>)</span>
<span style='color:#e66170; font-weight:bold; '>def</span> training_worker<span style='color:#d2cd86; '>(</span>env<span style='color:#d2cd86; '>,</span> thread_index<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
    <span style='color:#e66170; font-weight:bold; '>global</span> actor_optimizer
    <span style='color:#e66170; font-weight:bold; '>global</span> critic_optimizer
    <span style='color:#e66170; font-weight:bold; '>global</span> episode_count
    <span style='color:#e66170; font-weight:bold; '>global</span> global_step
    <span style='color:#e66170; font-weight:bold; '>for</span> episode <span style='color:#e66170; font-weight:bold; '>in</span> <span style='color:#e66170; font-weight:bold; '>range</span><span style='color:#d2cd86; '>(</span>num_episodes<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
        state <span style='color:#d2cd86; '>=</span> env<span style='color:#d2cd86; '>.</span>reset<span style='color:#d2cd86; '>(</span><span style='color:#d2cd86; '>)</span>
        states<span style='color:#d2cd86; '>,</span> actions<span style='color:#d2cd86; '>,</span> rewards<span style='color:#d2cd86; '>,</span> dones <span style='color:#d2cd86; '>=</span> <span style='color:#d2cd86; '>[</span><span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>,</span> <span style='color:#d2cd86; '>[</span><span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>,</span> <span style='color:#d2cd86; '>[</span><span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>,</span> <span style='color:#d2cd86; '>[</span><span style='color:#d2cd86; '>]</span>
        <span style='color:#e66170; font-weight:bold; '>while</span> True<span style='color:#d2cd86; '>:</span>
            global_step <span style='color:#d2cd86; '>=</span> global_step <span style='color:#00dddd; '>+</span> <span style='color:#00a800; '>1</span>
            action_probabilities <span style='color:#d2cd86; '>=</span> actor_inference<span style='color:#d2cd86; '>(</span>actor_optimizer<span style='color:#d2cd86; '>.</span>target<span style='color:#d2cd86; '>,</span> jnp<span style='color:#d2cd86; '>.</span>asarray<span style='color:#d2cd86; '>(</span><span style='color:#d2cd86; '>[</span>state<span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>)</span>
            action_probabilities <span style='color:#d2cd86; '>=</span> np<span style='color:#d2cd86; '>.</span>array<span style='color:#d2cd86; '>(</span>action_probabilities<span style='color:#d2cd86; '>[</span><span style='color:#00a800; '>0</span><span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>)</span>
            action <span style='color:#d2cd86; '>=</span> np<span style='color:#d2cd86; '>.</span>random<span style='color:#d2cd86; '>.</span>choice<span style='color:#d2cd86; '>(</span>n_actions<span style='color:#d2cd86; '>,</span> p<span style='color:#d2cd86; '>=</span>action_probabilities<span style='color:#d2cd86; '>)</span>
            next_state<span style='color:#d2cd86; '>,</span> reward<span style='color:#d2cd86; '>,</span> done<span style='color:#d2cd86; '>,</span> _ <span style='color:#d2cd86; '>=</span> env<span style='color:#d2cd86; '>.</span>step<span style='color:#d2cd86; '>(</span><span style='color:#e66170; font-weight:bold; '>int</span><span style='color:#d2cd86; '>(</span>action<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>)</span>
            states<span style='color:#d2cd86; '>.</span>append<span style='color:#d2cd86; '>(</span>state<span style='color:#d2cd86; '>)</span>
            actions<span style='color:#d2cd86; '>.</span>append<span style='color:#d2cd86; '>(</span>action<span style='color:#d2cd86; '>)</span>
            rewards<span style='color:#d2cd86; '>.</span>append<span style='color:#d2cd86; '>(</span>reward<span style='color:#d2cd86; '>)</span>
            dones<span style='color:#d2cd86; '>.</span>append<span style='color:#d2cd86; '>(</span><span style='color:#e66170; font-weight:bold; '>int</span><span style='color:#d2cd86; '>(</span>done<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>)</span>
            state <span style='color:#d2cd86; '>=</span> next_state
            <span style='color:#e66170; font-weight:bold; '>if</span> debug_render <span style='color:#e66170; font-weight:bold; '>and</span> thread_index<span style='color:#00dddd; '>==</span><span style='color:#00a800; '>0</span><span style='color:#d2cd86; '>:</span>
                env<span style='color:#d2cd86; '>.</span>render<span style='color:#d2cd86; '>(</span><span style='color:#d2cd86; '>)</span>
            <span style='color:#e66170; font-weight:bold; '>if</span> done<span style='color:#d2cd86; '>:</span>
                <span style='color:#e66170; font-weight:bold; '>print</span><span style='color:#d2cd86; '>(</span><span style='color:#00c4c4; '>"{} step, {} worker, {} episode, reward : {}"</span><span style='color:#d2cd86; '>.</span>format<span style='color:#d2cd86; '>(</span>
                    global_step<span style='color:#d2cd86; '>,</span> thread_index<span style='color:#d2cd86; '>,</span> episode<span style='color:#d2cd86; '>,</span> <span style='color:#e66170; font-weight:bold; '>sum</span><span style='color:#d2cd86; '>(</span>rewards<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>)</span>
                episode_length <span style='color:#d2cd86; '>=</span> <span style='color:#e66170; font-weight:bold; '>len</span><span style='color:#d2cd86; '>(</span>rewards<span style='color:#d2cd86; '>)</span>
                discounted_rewards <span style='color:#d2cd86; '>=</span> np<span style='color:#d2cd86; '>.</span>zeros_like<span style='color:#d2cd86; '>(</span>rewards<span style='color:#d2cd86; '>)</span>
                <span style='color:#e66170; font-weight:bold; '>for</span> t <span style='color:#e66170; font-weight:bold; '>in</span> <span style='color:#e66170; font-weight:bold; '>range</span><span style='color:#d2cd86; '>(</span><span style='color:#00a800; '>0</span><span style='color:#d2cd86; '>,</span> episode_length<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
                    G_t <span style='color:#d2cd86; '>=</span> <span style='color:#00a800; '>0</span>
                    <span style='color:#e66170; font-weight:bold; '>for</span> idx<span style='color:#d2cd86; '>,</span> j <span style='color:#e66170; font-weight:bold; '>in</span> <span style='color:#e66170; font-weight:bold; '>enumerate</span><span style='color:#d2cd86; '>(</span><span style='color:#e66170; font-weight:bold; '>range</span><span style='color:#d2cd86; '>(</span>t<span style='color:#d2cd86; '>,</span> episode_length<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>:</span>
                        G_t <span style='color:#d2cd86; '>=</span> G_t <span style='color:#00dddd; '>+</span> <span style='color:#d2cd86; '>(</span>gamma<span style='color:#00dddd; '>**</span>idx<span style='color:#d2cd86; '>)</span><span style='color:#00dddd; '>*</span>rewards<span style='color:#d2cd86; '>[</span>j<span style='color:#d2cd86; '>]</span><span style='color:#00dddd; '>*</span><span style='color:#d2cd86; '>(</span><span style='color:#00a800; '>1</span><span style='color:#00dddd; '>-</span>dones<span style='color:#d2cd86; '>[</span>j<span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>)</span>
                    discounted_rewards<span style='color:#d2cd86; '>[</span>t<span style='color:#d2cd86; '>]</span> <span style='color:#d2cd86; '>=</span> G_t
                discounted_rewards <span style='color:#d2cd86; '>=</span> discounted_rewards <span style='color:#00dddd; '>-</span> np<span style='color:#d2cd86; '>.</span>mean<span style='color:#d2cd86; '>(</span>discounted_rewards<span style='color:#d2cd86; '>)</span>
                discounted_rewards <span style='color:#d2cd86; '>=</span> discounted_rewards <span style='color:#00dddd; '>/</span> <span style='color:#d2cd86; '>(</span>np<span style='color:#d2cd86; '>.</span>std<span style='color:#d2cd86; '>(</span>discounted_rewards<span style='color:#d2cd86; '>)</span><span style='color:#00dddd; '>+</span><span style='color:#00a800; '>1</span><span style='color:#00a800; '>e</span><span style='color:#00dddd; '>-</span><span style='color:#00a800; '>10</span><span style='color:#d2cd86; '>)</span>
                <span style='color:#e66170; font-weight:bold; '>with</span> lock<span style='color:#d2cd86; '>:</span>
                    actor_optimizer<span style='color:#d2cd86; '>,</span> _ <span style='color:#d2cd86; '>=</span> backpropagate_actor<span style='color:#d2cd86; '>(</span>
                        actor_optimizer<span style='color:#d2cd86; '>,</span>
                        critic_optimizer<span style='color:#d2cd86; '>.</span>target<span style='color:#d2cd86; '>,</span>
                        <span style='color:#d2cd86; '>(</span>
                            jnp<span style='color:#d2cd86; '>.</span>asarray<span style='color:#d2cd86; '>(</span>states<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>,</span>
                            jnp<span style='color:#d2cd86; '>.</span>asarray<span style='color:#d2cd86; '>(</span>discounted_rewards<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>,</span>
                            jnp<span style='color:#d2cd86; '>.</span>asarray<span style='color:#d2cd86; '>(</span>actions<span style='color:#d2cd86; '>)</span>
                        <span style='color:#d2cd86; '>)</span>
                    <span style='color:#d2cd86; '>)</span>
                    critic_optimizer<span style='color:#d2cd86; '>,</span> _ <span style='color:#d2cd86; '>=</span> backpropagate_critic<span style='color:#d2cd86; '>(</span>
                        critic_optimizer<span style='color:#d2cd86; '>,</span>
                        <span style='color:#d2cd86; '>(</span>
                            jnp<span style='color:#d2cd86; '>.</span>asarray<span style='color:#d2cd86; '>(</span>states<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>,</span>
                            jnp<span style='color:#d2cd86; '>.</span>asarray<span style='color:#d2cd86; '>(</span>discounted_rewards<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>,</span>
                        <span style='color:#d2cd86; '>)</span>
                    <span style='color:#d2cd86; '>)</span>
                episode_count <span style='color:#d2cd86; '>=</span> episode_count <span style='color:#00dddd; '>+</span> <span style='color:#00a800; '>1</span>
                <span style='color:#e66170; font-weight:bold; '>break</span>
    <span style='color:#e66170; font-weight:bold; '>print</span><span style='color:#d2cd86; '>(</span><span style='color:#00c4c4; '>"Thread {} finished running."</span><span style='color:#d2cd86; '>.</span>format<span style='color:#d2cd86; '>(</span>thread_index<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>)</span>
    <span style='color:#e66170; font-weight:bold; '>pass</span>
envs <span style='color:#d2cd86; '>=</span> <span style='color:#d2cd86; '>[</span>gym<span style='color:#d2cd86; '>.</span>make<span style='color:#d2cd86; '>(</span>env_name<span style='color:#d2cd86; '>)</span> <span style='color:#e66170; font-weight:bold; '>for</span> i <span style='color:#e66170; font-weight:bold; '>in</span> <span style='color:#e66170; font-weight:bold; '>range</span><span style='color:#d2cd86; '>(</span>n_workers<span style='color:#d2cd86; '>)</span><span style='color:#d2cd86; '>]</span>
<span style='color:#e66170; font-weight:bold; '>try</span><span style='color:#d2cd86; '>:</span>
    workers <span style='color:#d2cd86; '>=</span> <span style='color:#d2cd86; '>[</span>
        threading<span style='color:#d2cd86; '>.</span>Thread<span style='color:#d2cd86; '>(</span>
            target <span style='color:#d2cd86; '>=</span> training_worker<span style='color:#d2cd86; '>,</span>
            daemon <span style='color:#d2cd86; '>=</span> True<span style='color:#d2cd86; '>,</span>
            args <span style='color:#d2cd86; '>=</span> <span style='color:#d2cd86; '>(</span>envs<span style='color:#d2cd86; '>[</span>i<span style='color:#d2cd86; '>]</span><span style='color:#d2cd86; '>,</span> i<span style='color:#d2cd86; '>)</span>
        <span style='color:#d2cd86; '>)</span> <span style='color:#e66170; font-weight:bold; '>for</span> i <span style='color:#e66170; font-weight:bold; '>in</span> <span style='color:#e66170; font-weight:bold; '>range</span><span style='color:#d2cd86; '>(</span>n_workers<span style='color:#d2cd86; '>)</span>
    <span style='color:#d2cd86; '>]</span>
    <span style='color:#e66170; font-weight:bold; '>for</span> worker <span style='color:#e66170; font-weight:bold; '>in</span> workers<span style='color:#d2cd86; '>:</span>
        time<span style='color:#d2cd86; '>.</span>sleep<span style='color:#d2cd86; '>(</span><span style='color:#00a800; '>1</span><span style='color:#d2cd86; '>)</span>
        worker<span style='color:#d2cd86; '>.</span>start<span style='color:#d2cd86; '>(</span><span style='color:#d2cd86; '>)</span>
    <span style='color:#e66170; font-weight:bold; '>for</span> worker <span style='color:#e66170; font-weight:bold; '>in</span> workers<span style='color:#d2cd86; '>:</span>
        worker<span style='color:#d2cd86; '>.</span>join<span style='color:#d2cd86; '>(</span><span style='color:#d2cd86; '>)</span>
<span style='color:#e66170; font-weight:bold; '>finally</span><span style='color:#d2cd86; '>:</span>
<span style='color:#e66170; font-weight:bold; '>for</span> env <span style='color:#e66170; font-weight:bold; '>in</span> envs<span style='color:#d2cd86; '>:</span> env<span style='color:#d2cd86; '>.</span>close<span style='color:#d2cd86; '>(</span><span style='color:#d2cd86; '>)</span>
</pre>
<div><br /></div>
<div><br /></div>
Sharavsambuu Gunchinishhttp://www.blogger.com/profile/06950810883056147179noreply@blogger.com0tag:blogger.com,1999:blog-1457877875009527488.post-29946203307637872682020-07-06T22:07:00.109+08:002020-07-28T00:43:15.357+08:00Deep Reinforcement Learning, Policy GradientӨмнөх Q Learning алгоритмуудыг <b>value iteration</b> арга гэдэг бөгөөд ирээдүйд авах reward оноог ойролцоолон дөхүүлж олдог тэндээсээ тоглолт хийх <b>дүрмээ(policy)</b> бий болгодог арга.<div><br /></div><div>Энэ удаагийн тэмдэглэлээр value iteration хийхийн оронд <b>дүрмийн функц</b>ээ шууд өөрөө сурдаг <b>policy iteration</b> аргын талаар дурдана.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXvV9aLIau7Q7X5krRK1asbFp0RwOzSVdrj5wDM8xsn2614iihYRSk5g7uU_8rMV87xmkgOND8ttClzN82GZ6pPZdcUsuSP9LYaao4eSfm4x4DNaktIJfsMQdLOBu56-5QQATb6Tn_RQ/s1362/pg.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="523" data-original-width="1362" height="246" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXvV9aLIau7Q7X5krRK1asbFp0RwOzSVdrj5wDM8xsn2614iihYRSk5g7uU_8rMV87xmkgOND8ttClzN82GZ6pPZdcUsuSP9LYaao4eSfm4x4DNaktIJfsMQdLOBu56-5QQATb6Tn_RQ/w640-h246/pg.png" width="640" /></a></div><span><a name='more'></a></span><div><br /></div><div><font size="6">Intuition буюу цаад санаа</font></div><div><br /></div><div>Спортын тамирчдын хувьд тогтмол бэлтгэл сургуулилалт амжилт гаргах үндсэн нөхцөл болдог, учир нь тогтмол бэлтгэл хийснээр <b>muscle memory</b> буюу ямарч нөхцөл байдалд хамгийн сайн үйлдлийг түргэн үзүүлэх <b>чадвар</b> суудаг.</div><div><br /></div><div>Үүний нэгэн адилаар policy gradient алгоритм нь тухайн нөхцөл байдалд ажиглалт хийн таарсан үйлдлийг шууд хийх чадвартай <b>функцыг хайж олох</b>од оршдог.</div><div><br /></div><div>Функц хайж олох гэдэг нь функцын <b>параметерыг тохируулах</b> гэж ойлгож болно.</div><div><br /></div><div>Хэрэв тухайн төлөвт гаргасан шийдвэр 
нь өндөр оноо(нэмэх reward) авбал дараа төстэй төлөв дээр ирэх үед өмнө нь өндөр оноо авч байсан <b>үйлдлийн магадлал</b> нэмэгдэнэ. </div><div><br /></div><div>Хамгийн <b>оптимал функц</b>ыг хайж олохдоо <b>нийт авсан reward</b> онооны дагуу <b>градиент утгууд олж</b> тэрүүгээрээ функцын параметрүүдыг <b>тохируулан шинэчилдэг</b>.</div><div><br /></div><div><br /></div><div><br /></div><div>Өмнөх постуудын адил үндсэн ойлголтуудтай нь танилцах, тодорхой болгох замаар энэ алгоритмтэй танилцая.</div><div><br /></div><div><br /></div><div><font size="6">Policy функц</font></div><div><br /></div><div>Policy гэдгийг дүрэм гэж орчуулаад байгааг анзаарсан байх, энэ орчуулга буруу ч байж болно. Ер нь policy бол тухайн төлвийг үйлдэлрүү буулгах үүрэгтэй функц юм.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlQyNvE1ceBk4WW051vZab5URS_TODN-YE8z8xNet0_flFci31cnZMvML4_ICE2m2TvJMOLE3UfXTIJoI7NbWQ84Nx6eAkHX__3yger7RtVf_xSOfer8aBEtLOfs_k6wWzY2zU7u8Z9Q/s113/policy%252C+mapping+state+to+action.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="40" data-original-width="113" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlQyNvE1ceBk4WW051vZab5URS_TODN-YE8z8xNet0_flFci31cnZMvML4_ICE2m2TvJMOLE3UfXTIJoI7NbWQ84Nx6eAkHX__3yger7RtVf_xSOfer8aBEtLOfs_k6wWzY2zU7u8Z9Q/d/policy%252C+mapping+state+to+action.png" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div><b>Stochastic</b> болон <b>deterministic</b> гэсэн хоёр төрлийн policy байдаг.</div><div><br /></div><div>Deterministic policy нь тухайн төлөв дээр хийх үйлдлийг тодорхойлсон утгыг шууд бодож олдог. Өмнөх Q Learning алгоритм бол deterministic policy ангилалдаа орно. 
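For a concrete contrast between the two policy types, here is a minimal Python sketch using an assumed toy score table for a single state (the numbers and the softmax scoring are illustrative assumptions, not from the post):

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed toy action scores for one state (illustration only).
scores = np.array([1.0, 2.0, 0.5])

def deterministic_action(scores):
    # Deterministic policy: always pick the single best-scoring action,
    # as Q Learning does with argmax over Q values.
    return int(np.argmax(scores))

def stochastic_action(scores):
    # Stochastic policy: turn scores into a probability distribution
    # over ALL actions (softmax here) and sample from it.
    e = np.exp(scores - scores.max())
    probs = e / e.sum()
    return int(rng.choice(len(scores), p=probs))
```

The deterministic policy returns the same action every call; the stochastic one can return any action, with better-scoring actions sampled more often.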
</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_VA5Wt8IYJ7nVMFoExAcseFGHC5Gk0TwgXdguINoEabRnStJS9xUSRPnmrkjhC2KNqqWZpvImRIeZJ_I6OGJZDGWoECLECuDvYrnQKBjqeVpM05mGtP22TiQzSkHxDkqtre_TuU6_DA/s114/deterministic+policy.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="44" data-original-width="114" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_VA5Wt8IYJ7nVMFoExAcseFGHC5Gk0TwgXdguINoEabRnStJS9xUSRPnmrkjhC2KNqqWZpvImRIeZJ_I6OGJZDGWoECLECuDvYrnQKBjqeVpM05mGtP22TiQzSkHxDkqtre_TuU6_DA/s0/deterministic+policy.png" /></a></div><div>Stochastic policy-ны гаралт бүх үйлдлүүдтэй хамааралтай магадлалтын тархалт байдаг. Энэ постоор бий болгох policy энэ ангилалд орно.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiabgj8r5dlzZbXCzN-3Vb8r74jb96nDA85HWp1zjWNvQeKXF2Gnzj3_xWz1i_8qYvC1UAEkxtC6wXy0Mu0TgBd3uEadMkTF82olur1pgu-zwY9G2dUdUp8qhOsV4h1JVTFf6IWOtS7RQ/s300/stochastic+policy.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="55" data-original-width="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiabgj8r5dlzZbXCzN-3Vb8r74jb96nDA85HWp1zjWNvQeKXF2Gnzj3_xWz1i_8qYvC1UAEkxtC6wXy0Mu0TgBd3uEadMkTF82olur1pgu-zwY9G2dUdUp8qhOsV4h1JVTFf6IWOtS7RQ/s0/stochastic+policy.png" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizxq5pBsnYlAQm9QOHR2pSyYocdeoWP0MDr9G84uHP4M-eHd2SxuO56o_QmIWM5QEYX0uhUOk_GduVlJ-yhvsJfq8c9dAfAuZINuogUgVcN5n6XpSWsGbA5OI9mhDJyPqZeTEfiqMQog/s1065/policy+in+RL+environment.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="291" data-original-width="1065" height="175" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizxq5pBsnYlAQm9QOHR2pSyYocdeoWP0MDr9G84uHP4M-eHd2SxuO56o_QmIWM5QEYX0uhUOk_GduVlJ-yhvsJfq8c9dAfAuZINuogUgVcN5n6XpSWsGbA5OI9mhDJyPqZeTEfiqMQog/w640-h175/policy+in+RL+environment.png" width="640" /></a></div><div>π<font size="1">θ</font>(a|s) гэдэг бол <b>s</b> төлөвт байгаа агент <b>θ</b> параметрийн дагуу <b>a</b> үйлдлийг сонгох магадлал. θ параметер нь ерөнхийдөө policy-г төлөөлдөг.</div><div><br /></div><div><br /></div><div><br /></div><div><font size="6">Trajectory буюу замнал</font></div><div><br /></div><div><b>τ</b> буюу <b>tau</b> гэсэн грек үсгээр тэмдэглэдэг. Үүгээр тухайн policy-ийн дагуу environment дотор хийсэн үйлдлүүд, явж ирсэн төлвүүдийн өөрчлөлтүүд, цуглуулсан reward оноонуудын <b>түүх</b>ийг илэрхийлдэг. Энд H үсгээр horizon буюу нийт түүхийн хязгаарыг тэмдэглэв. Мэдээж хязгааргүй <b>олон янзын τ</b> байх боломжтой.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFQz-4wJ33AwlLT3dHeJbLNQZGvkDgnOu3CSI3su4YEGMNE8Wf4Ol7P2WMJhXg19Iq_anlvthMd-L3dSelxeQG9HR_eCSM90kuv1oP_R4cB6QUl_4WZCF7tSK5agvhihG1stC0z3Xuuw/s568/tau_trajectory.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="48" data-original-width="568" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFQz-4wJ33AwlLT3dHeJbLNQZGvkDgnOu3CSI3su4YEGMNE8Wf4Ol7P2WMJhXg19Iq_anlvthMd-L3dSelxeQG9HR_eCSM90kuv1oP_R4cB6QUl_4WZCF7tSK5agvhihG1stC0z3Xuuw/d/tau_trajectory.png" /></a></div><div><br /></div><div><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 16px;"><b>θ</b></span> параметр бүхий <b>policy</b>-ээр үүсгэсэн <b>τ </b>өөрөөр хэлбэл <b>τ төлөвт байх магадлал</b> бол эхний <b>s<font size="1">1</font></b> төлөвт байх магадлалыг үрждэг нь policy-ийн дагуу <b>s<font size="1">t</font></b> төлөвөөс <b>a<font size="1">t</font></b> үйлдэл үүсэх магадлал мөн дахин 
үрждэг нь environment-д <b>s<font size="1">t</font></b> төлөвт байхад нь <b>a<font size="1">t</font></b> үйлдэл хийхэд <b>s<font size="1">t+1</font></b> төлөв үүсэх магадлалуудын үржвэрүүд болно.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiw_a0mf-tdZlmfuEoANIQXRAIy6Y64hyphenhyphenfbZC87Nn3awYA_LiViAfw3Aku3IO28MVL9HAt23KDMYUj9TrR58YqsxchZ-5FvzNMoEPfY1Gz73jInQKJg-5lHF-PHkYJfr0i-KInFaPcCEg/s653/tau_of_policy.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="126" data-original-width="653" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiw_a0mf-tdZlmfuEoANIQXRAIy6Y64hyphenhyphenfbZC87Nn3awYA_LiViAfw3Aku3IO28MVL9HAt23KDMYUj9TrR58YqsxchZ-5FvzNMoEPfY1Gz73jInQKJg-5lHF-PHkYJfr0i-KInFaPcCEg/d/tau_of_policy.png" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div><br /></div><div><font size="6">Бодлогын зорилго</font></div><div><br /></div><div>Тухайн <span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 16px;"><b>τ</b></span> замнал <b>P<span style="background-color: white; color: #222222; font-family: arial, sans-serif;"><font size="1">θ</font></span><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 16px;">(</span><font color="#222222" face="">τ</font><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 16px;">)</span></b> магадлалд байгаа, reward оноо авах функ нь <b>R(s, a)</b> гэвэл <b>бодлогын зорилго</b> reward-уудын нийлбэр нь <b>хамгийн их</b> байдаг <span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 16px;"><b>θ</b></span> параметрийг <b>хайж олох</b> юм гэж тодорхойлогдоно. 
Үүнийг математик илэрхийллээр бичвэл.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheXFFUgfyRWGuHZ56U4E-EeWp4UJ3o4YUpxj_yEk2rnGx6euASLwLB1E63DqiLEdjkeN0kAG4clySpqM0lpO2c9HGe4ZQ-naNFdR6wIgKH1SFTXVX1wJcXWy_HyefB87ncmf4HoTMT-w/s410/optimal_theta.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="70" data-original-width="410" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheXFFUgfyRWGuHZ56U4E-EeWp4UJ3o4YUpxj_yEk2rnGx6euASLwLB1E63DqiLEdjkeN0kAG4clySpqM0lpO2c9HGe4ZQ-naNFdR6wIgKH1SFTXVX1wJcXWy_HyefB87ncmf4HoTMT-w/d/optimal_theta.png" /></a></div><div><br /></div><div><br /></div><div><font size="6">Objective функц</font></div><div><br /></div><div><b style="color: #222222; font-family: arial, sans-serif; font-size: 16px;">θ</b> параметрыг оптималчлахын тулд <b>objective функц</b> хэрэгтэй. Үүнийг дараах <b>J(θ)</b> функц байдлаар тодорхойлж болно. Давхар summa-ны эхнийх нь бүх N sample-үүдийг нэмээд дунджыг нь авч байгааг хэлж байгаа шүү.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgL5jVkg01bkwePXcnKcyXEqPX4EWzJIt_D7hYKoCJMqoHwj7z2XIJAkSpsv1SbOe8tsJ3BG19J2juvkMAxhfX_VLtr4i2URwMUe04ljkSPfOxWw47i-PSFAU7zmu2FtP5LYz3XqAKZrw/s607/objective_function_J_theta.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="228" data-original-width="607" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgL5jVkg01bkwePXcnKcyXEqPX4EWzJIt_D7hYKoCJMqoHwj7z2XIJAkSpsv1SbOe8tsJ3BG19J2juvkMAxhfX_VLtr4i2URwMUe04ljkSPfOxWw47i-PSFAU7zmu2FtP5LYz3XqAKZrw/d/objective_function_J_theta.png" /></a></div><div><br /></div><div>Гурван ялгаатай trajectory-г хооронд нь <b>J(</b><b style="color: #222222; font-family: arial, sans-serif; font-size: 16px;">θ</b><b>)</b> функцээр үнэлэн харьцуулж харвал. 
</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgi7pR7KV4w5DSDqur6NiRjk7P3EUnx8xVe8suLXdNxntTyOvFyfae1Z4oQ7LwFsXp_oafj_pF3uGNmCbqnjfe7A-X5xP1p36zxB3XTgaqQkm6zpiYHfMp02I_0lXtTSkEWex0VHqN2A/s724/trajectory_evaluation.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="380" data-original-width="724" height="168" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgi7pR7KV4w5DSDqur6NiRjk7P3EUnx8xVe8suLXdNxntTyOvFyfae1Z4oQ7LwFsXp_oafj_pF3uGNmCbqnjfe7A-X5xP1p36zxB3XTgaqQkm6zpiYHfMp02I_0lXtTSkEWex0VHqN2A/w320-h168/trajectory_evaluation.png" width="320" /></a></div><div><br /></div><div><br /></div><div><font size="6">Policy Gradient</font></div><div><br /></div><div>Дахин сануулая, бидний зорилго бол reward нийлбэр оноо нь хамгийн их байхаар θ параметрыг оптималчилах. Энэ өгүүлбэрийг сайн ойлгоорой!!!</div><div><br /></div><div>Үүний тулд objective функцыг θ параметрээс хамаатуулан дифференциалчлах шаардлагатай.</div><div><br /></div><div>Дифференциалчилсан функцээр θ параметрээс хамаарсан градиент утгууд олж авах боломжтой. 
</div><div><br /></div><div>Градиент утгуудтай байнаа гэдэг нь бодлогын зорилгод хүргэхээр манай тохиолдолд reward нийлбэрийг максимумчилахаар θ параметрыг алхам алхамаар шинэчлэх чиглэлтэй болж байна гэсэн үг юм.</div><div><br /></div><div>Өмнө дурдсан <b>P<span style="background-color: white; color: #222222; font-family: arial, sans-serif;"><font size="1">θ</font></span><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 16px;">(</span><font color="#222222" face="">τ</font><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 16px;">)</span></b> магадлалд байгаа trajectory-ийн objective функцны оронд <b>π(θ)</b> policy функцээр тодорхойлогдсон objective функцыг дараах интегралаар тодорхойлж болно.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidtY_olI1mqC-OQA3AnDd3RyyZeRa7WokN4aFZq0hiLojDVpdmDfdZPyVOo5zdsAc4U0D8dV5io9vdtVIhvO1XE1t35PINyU_yHwn26FjlSbFSVFwH7c9kpdZIhDXK2vjuzLgFbduz-g/s480/objective_function_defined_by_policy.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="153" data-original-width="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidtY_olI1mqC-OQA3AnDd3RyyZeRa7WokN4aFZq0hiLojDVpdmDfdZPyVOo5zdsAc4U0D8dV5io9vdtVIhvO1XE1t35PINyU_yHwn26FjlSbFSVFwH7c9kpdZIhDXK2vjuzLgFbduz-g/d/objective_function_defined_by_policy.png" /></a></div><div><br /></div><div>Одоо дифференциалчлах ажиллагааг эхлэе. 
Эхлээд ∇ оператороор хоёр талаас нь дифференциал авая </div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQuiwb_u7CixWvVk_i6hOkEAJkW_EWhBzLajJLF8fiLevkajcH-ML8SgR3urcTxJxA9iGO2lbD6hSOWClDdgWkLbwbV4l-nKnILiEFefRzn5BLc7rpe4llR9whe21soo0PrHqLILsLCw/s377/differentiation_J_theta_1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="50" data-original-width="377" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQuiwb_u7CixWvVk_i6hOkEAJkW_EWhBzLajJLF8fiLevkajcH-ML8SgR3urcTxJxA9iGO2lbD6hSOWClDdgWkLbwbV4l-nKnILiEFefRzn5BLc7rpe4llR9whe21soo0PrHqLILsLCw/d/differentiation_J_theta_1.png" /></a></div><div>Энэ ∇(nabla) операторыг гайхаж байвал зүгээр л уламжлал авах үйлдэл юм. Лейбницын уламжлал авах тэмдэглэгээр хөрвүүлж бичвэл иймэрхүү болно</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjq0gCU3e8QuuYgrCE2GFMHs7Qy8lIHJUZMgYgQS1yIjeMRt3H9M1QhOd9x_rOiXBO695dhrXuFwaAKtZCcL4ccmDMIOQgRLJm4dPZB_32HlNTvxHQOlmXO6lHDqzQJGiROnD_wZb08QQ/s222/nabla_as_leibniz_notation.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="79" data-original-width="222" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjq0gCU3e8QuuYgrCE2GFMHs7Qy8lIHJUZMgYgQS1yIjeMRt3H9M1QhOd9x_rOiXBO695dhrXuFwaAKtZCcL4ccmDMIOQgRLJm4dPZB_32HlNTvxHQOlmXO6lHDqzQJGiROnD_wZb08QQ/d/nabla_as_leibniz_notation.png" /></a></div><div>Цааш нь дифференциалчлахын тулд <a href="http://blog.shakirm.com/2015/11/machine-learning-trick-of-the-day-5-log-derivative-trick/">"log derivative trick"</a> гэх аргыг ашиглая</div><div class="separator" style="clear: both; text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfJb0Wp96rGzJPT7WtIV4rrToozA2ujpXRl1DwBwPsF1OF9r1Yj5iP8kIlIJpUejxidOvKSpMUOC_YIheARm4BHx5qZrFBFGpMSTBKAoZoP0ycoHJXThPUyXCVftBT57KfmjpxD6Wyww/s350/log_derivative_trick.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="89" data-original-width="350" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfJb0Wp96rGzJPT7WtIV4rrToozA2ujpXRl1DwBwPsF1OF9r1Yj5iP8kIlIJpUejxidOvKSpMUOC_YIheARm4BHx5qZrFBFGpMSTBKAoZoP0ycoHJXThPUyXCVftBT57KfmjpxD6Wyww/d/log_derivative_trick.png" /></a></div><div>Энэ аргыг ашиглахын тулд жижиг хувиргалт хийе</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimyrlhP1TBlgvEXcxzMhyphenhyphenGClbK9inhcOdtlWsVVOUxAeLf_C_UtIiqKGyA3nG1aWRrWJGObeSOqdc9oIhM12KPAVPcK_ucApSEmuNAlKo0yLjKNHspIOIEqcIUcRcE5xhOVmU9HbI5fg/s355/log_derivative_trick_J_theta_1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="163" data-original-width="355" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimyrlhP1TBlgvEXcxzMhyphenhyphenGClbK9inhcOdtlWsVVOUxAeLf_C_UtIiqKGyA3nG1aWRrWJGObeSOqdc9oIhM12KPAVPcK_ucApSEmuNAlKo0yLjKNHspIOIEqcIUcRcE5xhOVmU9HbI5fg/d/log_derivative_trick_J_theta_1.png" /></a></div><div>Log derivative trick ашиглаад хувиргавал</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguUeogaJpqFHnvMk85Z-VxSuFm0m1BlsI2nXTHrsBuAOmW3kYfb1DnLt8oCQIB1eLZ-MZFtM21yaK2HYwUSu8_zLae_K2PTmp0lj9DLbiabyOC3R_iNRulZbxyg4PtcnBtxRDhGsIHAA/s369/log_derivative_trick_J_theta_2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="126" data-original-width="369" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguUeogaJpqFHnvMk85Z-VxSuFm0m1BlsI2nXTHrsBuAOmW3kYfb1DnLt8oCQIB1eLZ-MZFtM21yaK2HYwUSu8_zLae_K2PTmp0lj9DLbiabyOC3R_iNRulZbxyg4PtcnBtxRDhGsIHAA/d/log_derivative_trick_J_theta_2.png" /></a></div><div>Хувиргалтын дараа objective функцын дифференциал ийм болно</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNgRWy74VRgEGcFLoHUC0I3rKzk19nLiTRFblJ1oc0tJgrxEBSWye_YSN76Y-b00pxTuOZ7mgtaxWJyRJunpPxgi8d9vbhtlgI1lj4T5NH1q-3HdT-OgPpZvHZWs4Gwqnyh5ActyzWww/s490/differentiation_J_theta_2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="144" data-original-width="490" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNgRWy74VRgEGcFLoHUC0I3rKzk19nLiTRFblJ1oc0tJgrxEBSWye_YSN76Y-b00pxTuOZ7mgtaxWJyRJunpPxgi8d9vbhtlgI1lj4T5NH1q-3HdT-OgPpZvHZWs4Gwqnyh5ActyzWww/d/differentiation_J_theta_2.png" /></a></div><div>Дээр тодорхойлсон π<font size="1">θ</font>(<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 16px;">τ</span>) функцын үржвэрүүдийг логарифмын хуулийн дагуу нэмэгдэхүүнүүд болгон задлая</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgioLOR1RDjYP9REBELp1xoWA314lBgw5mqCkWKDKB3byZh6D9Zig37bj99MqOz5LwEWlheQCPX_g7nAt-vOMQv9JSCwhnXYDRJBGSKr7zmbjyXuPt6nx74b_SMGrpqB8i1tQuza4Ddmg/s693/logarithm_extraction_into_sums.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="252" data-original-width="693" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgioLOR1RDjYP9REBELp1xoWA314lBgw5mqCkWKDKB3byZh6D9Zig37bj99MqOz5LwEWlheQCPX_g7nAt-vOMQv9JSCwhnXYDRJBGSKr7zmbjyXuPt6nx74b_SMGrpqB8i1tQuza4Ddmg/d/logarithm_extraction_into_sums.png" /></a></div><div><br /></div><div>∇<span style="font-size: x-small;">θ</span> оператороо 
цааш үргэлжлүүлэн дифференциалчилбал θ параметрээс хамааралгүй хэсгүүдийн градиент 0 тул орхиж болно</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEga90a2vVquctmGuSSuVH3jO2GYuU8JDex99IJxTQ8jU_oZrCZRrQPvxzD5_ubmIfFVB7uZS0294iy397YunUrJRE4RzSal0VZYYJb-wBIDfB0kShSE5rguH6xnkkhKwvDp-KZQa1ToWg/s790/derivate_cancellation_over_parameter_relations.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="165" data-original-width="790" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEga90a2vVquctmGuSSuVH3jO2GYuU8JDex99IJxTQ8jU_oZrCZRrQPvxzD5_ubmIfFVB7uZS0294iy397YunUrJRE4RzSal0VZYYJb-wBIDfB0kShSE5rguH6xnkkhKwvDp-KZQa1ToWg/d/derivate_cancellation_over_parameter_relations.png" /></a></div><div>Эцэстээ дифференциалчилсан objective функц ийм болно</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioPeGRjMEdvw4JAZCyPmd-PoUb3-edQ455WvUo5-4t8scI4mVEjcc8yG9wF2ZReqtw01XvAfN6MV__SMDBXgyq1JeBIKhuaY-sDQNsXnpJYKo316w-PcdcCv-mW99RKHo2xBAF42jNQA/s651/differentiation_J_theta_3.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="98" data-original-width="651" height="96" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioPeGRjMEdvw4JAZCyPmd-PoUb3-edQ455WvUo5-4t8scI4mVEjcc8yG9wF2ZReqtw01XvAfN6MV__SMDBXgyq1JeBIKhuaY-sDQNsXnpJYKo316w-PcdcCv-mW99RKHo2xBAF42jNQA/w640-h96/differentiation_J_theta_3.png" width="640" /></a></div><div>Expectation бол sample trajectory-уудын дундаж гэдгийг санавал</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBPvQSlI7Bkac3LtDuhlrD3uTExX_vZeD590aRjd8s7Hm3hZeusw1XnzHw3Z0tdwsySNo5vDHAnMg16FUxS5sdvrooyWV-Ae3oqWX43y0jkuzjFUhccM0__kWpTapTOqwvAnEGYtbnSA/s678/differentiation_J_theta_4.png" style="margin-left: 1em; 
margin-right: 1em;"><img border="0" data-original-height="206" data-original-width="678" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBPvQSlI7Bkac3LtDuhlrD3uTExX_vZeD590aRjd8s7Hm3hZeusw1XnzHw3Z0tdwsySNo5vDHAnMg16FUxS5sdvrooyWV-Ae3oqWX43y0jkuzjFUhccM0__kWpTapTOqwvAnEGYtbnSA/d/differentiation_J_theta_4.png" /></a></div><div><br /></div><div>Policy функцын түвшинд иртэл өөрөөр хэлбэл trajectory-оос state->action түвшинд иртэл нь дифференциалчихлаа гэдэг нь бид environment дээрээ trajectory sample цуглуулж байхдаа л policy параметрүүдээ оптималчилаад байх боломжтой болж байна гэсэн үг.</div><div><br /></div><div>Элдэв математикгүйгээр хар үгээр хэлбэл, <b>бид туршлага цуглуулах тоолондоо policy-гээ сайжруулаад байх боломжтой боллоо гэсэн үг юм.</b></div><div><br /></div><div><br /></div><div><font size="6">REINFORCE алгоритм</font></div><div><br /></div><div>Энэ алгоритмаар неорон сүлжээг манай тохиолдолд policy функцын параметрүүдийг алхам алхмаар сургадаг.<br /><br />REINFORCE алгоритмын өөр нэг нэрийг Monte-Carlo policy gradient гэдэг. 
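The Monte-Carlo estimator derived above, the average over sampled trajectories of the per-step ∇ log π terms scaled by the trajectory's total reward, can be sketched in plain NumPy. The tabular softmax policy and the toy shapes are assumptions for illustration, and, like the simplified formula in the text, no discount factor is applied:

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# Assumed toy setting: 4 discrete states, 2 actions,
# tabular softmax policy with parameters theta[s, a].
n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))

def grad_log_pi(theta, s, a):
    # For a tabular softmax policy:
    # d/d theta[s, b] of log pi(a|s) = 1{b == a} - pi(b|s).
    g = np.zeros_like(theta)
    g[s] = -softmax(theta[s])
    g[s, a] += 1.0
    return g

def policy_gradient(theta, trajectories):
    # Monte-Carlo estimator:
    # (1/N) * sum_i [ (sum_t grad log pi(a_t|s_t)) * R(tau_i) ]
    grads = []
    for states, actions, rewards in trajectories:
        g = sum(grad_log_pi(theta, s, a) for s, a in zip(states, actions))
        grads.append(g * sum(rewards))
    return np.mean(grads, axis=0)
```

A gradient ascent step is then `theta = theta + alpha * policy_gradient(...)`, adding the gradient because the objective is maximized, not minimized.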
</div><div><br /></div><div>Алгоритмын ажиллах дарааллыг дүрсэлбэл</div><div><ol style="text-align: left;"><li>{τ<font size="1">i</font>} sample-ийг π<font size="1">θ</font>(a<font size="1">t</font> | s<font size="1">t</font>) policy ашиглан үүсгэнэ</li><li><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgs0dEAn_wEFTpp44gIqKYSltZhBUy7vhxIUczCNQwwhze62BCw7mUrRTmZrbUHpdcf2QAfOuMvIdwNpfolKC3h1MuSqOK2oTPKP_3SVfTF4dOLcYjXhyphenhyphenxWBuN4qCXagiTQFiSpjJ53uw/s659/policy_gradient.png" style="clear: left; display: inline; margin-bottom: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="82" data-original-width="659" height="25" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgs0dEAn_wEFTpp44gIqKYSltZhBUy7vhxIUczCNQwwhze62BCw7mUrRTmZrbUHpdcf2QAfOuMvIdwNpfolKC3h1MuSqOK2oTPKP_3SVfTF4dOLcYjXhyphenhyphenxWBuN4qCXagiTQFiSpjJ53uw/w200-h25/policy_gradient.png" width="200" /></a> policy gradient олно</li><li>θ = θ + α∇<font size="1">θ</font>J(θ) олсон градиент утгаараа моделийн параметрүүдийг шинэчилнэ. Бидний зорилго утга максимумчилах тул градиент утгыг <b>нэмж</b> байна үүнийг <b>gradient ascent</b> алгоритм гэдэг. Энгийн неорон сүлжээнд бид label-үүдтэй илүү ойртуулах буюу loss-г багасгахын тулд <b>gradient descent</b> алгоритм хэрэглэдэг, ялгааг нь анзаараарай.</li><li>1-р алхамруу очих</li></ol><div>Энд хялбарчилах зорилгоор policy gradient-ийн хамгийн арын reward нийлбэрт discount factor ерөөсөө оруулж ирээгүй байгааг анхаараарай. 
Discount factor-оо эргэж санавал</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6DeQQuCVH1nhnf4QRueik-_8trQOaimDgWx_CTTltXyBGhrdqFRCMFypW7Up5uZHpKH_kcuze-fioobYeNvOFB7infz1MRvqSMzzOFMHEXUl9deyodjA7VdxIAD5cgcvddD6B8Ss9kA/s403/Total+discounted+reward.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="61" data-original-width="403" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6DeQQuCVH1nhnf4QRueik-_8trQOaimDgWx_CTTltXyBGhrdqFRCMFypW7Up5uZHpKH_kcuze-fioobYeNvOFB7infz1MRvqSMzzOFMHEXUl9deyodjA7VdxIAD5cgcvddD6B8Ss9kA/d/Total+discounted+reward.png" /></a></div><div>Discount factor оруулж ирэе</div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixNzjro3zUb1p9wNpwDoVCLnfh0DzO60ZZpgGo-ma7EOo1CI7vSbxLHbU-a7Ge2UJKqOtvnnavfD1siU4GFCItsjbGjb0Ra9Hj2LW42sghPMCsWP-SIUTk2qLLf0ls_LDvJzCL3BYePg/s564/policy_gradient_sutton.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="82" data-original-width="564" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixNzjro3zUb1p9wNpwDoVCLnfh0DzO60ZZpgGo-ma7EOo1CI7vSbxLHbU-a7Ge2UJKqOtvnnavfD1siU4GFCItsjbGjb0Ra9Hj2LW42sghPMCsWP-SIUTk2qLLf0ls_LDvJzCL3BYePg/d/policy_gradient_sutton.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Sutton-ий тэмдэглэгээгээр</td></tr></tbody></table><div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBUwi5d5hL3g7a0l0zZBPrXliCbHLlkIayJh6YUqpiSICmI7p9_2onIqg1F_chWT230aQ4pm1gYdZoYvj4fuvMPdTE8_kry7N-5EwB0SGqvleLLYVyyT-FvUJYcdtaJg4bt1XxhpJuVw/s516/policy_gradient_silver.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="94" data-original-width="516" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBUwi5d5hL3g7a0l0zZBPrXliCbHLlkIayJh6YUqpiSICmI7p9_2onIqg1F_chWT230aQ4pm1gYdZoYvj4fuvMPdTE8_kry7N-5EwB0SGqvleLLYVyyT-FvUJYcdtaJg4bt1XxhpJuVw/d/policy_gradient_silver.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Silver-ийн тэмдэглэгээгээр</td></tr></tbody></table><div><br /></div><div><a href="http://incompleteideas.net/sutton/book/the-book-2nd.html">Sutton</a> болон <a href="https://www.davidsilver.uk/teaching/">Silver</a>-ийн тэмдэглэгээнүүд хоёулаа ижилхэн тул будилах хэрэггүй.</div><div><br /></div><div><br /></div><div><br /></div><div><font size="6">Policy Gradient-ийг цааш нь сайжруулах</font></div><div><br /></div><div>Monte-Carlo ийн аргаар дөхөлт хийхэд гардаг сөрөг тал нь цуглуулж байгаа sample-үүд high-variance-тэй байгаад байдаг.</div><div><br /></div><div>High variance-тэй байна гэдэг нь неорон сүлжээг сургах градиентүүдийн чиглэл converge хийх чиглэлрүү биш харин будилуулах чиглэлүүд бий болгоод байна гэсэн үг.</div><div><br /></div><div>Нэг дээжилсэн(sample-дсэн) reward тухайн action-д харгалзах магадлалыг өсгөмөөр байхад өөр нэг sample reward эсрэгээрээ тэр action-ий магадлалыг бууруулах гээд байж болно. </div><div><br /></div><div>Ингэсээр байгаад байвал неорон сүлжээ маань converge хийгдэхгүй удна. 
Тиймээс неорон сүлжээг сургахын тулд дээжилсэн reward утгын variance-ийг багасгах арга хэмжээ авах хэрэгтэй.</div><div><br /></div><div>Batch-ийн хэмжээг нэмснээр variance багасах боловч эсрэгээрээ sample efficiency байдал нь эрс буурч эхэлдэг тиймээс batch-ийн хэмжээг хамаагүй өсгөөд байж болохгүй.</div><div><br /></div><div>Энэ variance-ийг бууруулж өгдөг нэг арга нь Advantage утгууд оруулж ирэх арга юм. </div><div><br /></div><div><b>Advantage</b> утга бол <b>Q</b> утга болон тухайн төлөв дэх онооны утга <b>V</b> утга хоорондын ялгаа юм. Тухайн action-д харгалзах <b>advantage</b> утга гэдэг бол ер нь энэ үйлдлийг энэ төлөв дээр хийвэл нийт дундаж онооноос <b>хэр ахиу байж чадах вэ</b>, илүү <b>дээр байж чадахуу</b> гэдэг <b>оноог</b> илэрхийлдэг.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibP8LC-pq3edRdgG3S0RxddlMGsLPxy4L_BVCf6STolgtWRrxayj6gW54s4jBl0-ynk1ycyRnQRIQV856LO_NGyjKh5xEnHJZiWP3kzhHLHFLIJPy2MzqfOeDoaAv4wsXLX2np0B3FcQ/s298/advantage_function.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="54" data-original-width="298" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibP8LC-pq3edRdgG3S0RxddlMGsLPxy4L_BVCf6STolgtWRrxayj6gW54s4jBl0-ynk1ycyRnQRIQV856LO_NGyjKh5xEnHJZiWP3kzhHLHFLIJPy2MzqfOeDoaAv4wsXLX2np0B3FcQ/s0/advantage_function.png" /></a></div><div>Policy gradient томъёоны reward нийлбэр бол Q(s,a) буюу <a href="https://spinningup.openai.com/en/latest/spinningup/extra_pg_proof2.html">Q функц юм</a> гэдэг нэг баталгаа бий. Мөн үүний оронд Advantage функцыг ч бас оруулж ирж болно. 
</div><div><br /></div><div>Тэхээр advantage фукцыг оруулж ирэн policy gradient томъёогоо шинэчилбэл</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrC5FqzVKd7-qHJo82z8lkCRogO3mc4LkuWygZwSL-dJU6zkkjax6bXr0B8H3Vovh1dAvs1nM0o_enlB-tvLUrHfmu-fOUhScSkn07VTJWsD635zRG_tJRURP7E7KhP0VDDDRtGRazQQ/s633/policy_gradient_with_advantage_function.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="96" data-original-width="633" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrC5FqzVKd7-qHJo82z8lkCRogO3mc4LkuWygZwSL-dJU6zkkjax6bXr0B8H3Vovh1dAvs1nM0o_enlB-tvLUrHfmu-fOUhScSkn07VTJWsD635zRG_tJRURP7E7KhP0VDDDRtGRazQQ/d/policy_gradient_with_advantage_function.png" /></a></div><div><b>Advantage</b> утга оруулж ирснээр градиентээр параметр шинэчлэх үед <b>өндөр reward-тэй, ач холбогдол сайтай</b> байж болохуйц <b>action</b>-д илүү <b>их жин</b> оноож өгч байна гэж бодож болно. 
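The advantage weighting just described, A(s, a) = Q(s, a) - V(s) replacing the whole-trajectory reward sum as a per-step weight, can be sketched with assumed toy Q estimates (the numbers and the action-average choice of V are illustrative assumptions):

```python
import numpy as np

# Assumed toy estimates for illustration: 2 states, 2 actions.
q = np.array([[1.0, 3.0],
              [0.5, 0.5]])
v = q.mean(axis=1)  # V(s), here taken as the action-average for simplicity

def advantage(s, a):
    # A(s, a) = Q(s, a) - V(s): how much better than the state's
    # average this action is expected to be.
    return q[s, a] - v[s]

def step_weights(states, actions):
    # Per-step weights for the policy gradient: each grad log pi(a_t|s_t)
    # gets scaled by A(s_t, a_t) instead of by the trajectory's total reward.
    return [advantage(s, a) for s, a in zip(states, actions)]
```

Better-than-average actions get positive weight (their probability is pushed up), worse-than-average ones get negative weight, which is what reduces the variance of the update direction.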
</div><div><br /></div><div>Ач холбогдол өндөртэй action advantage утгын нөлөөгөөр жин ихтэй байх тул хамаарал бүхий градиентүүдийг хаа хамаагүй чиглэлүүдрүү үсчүүлээд байлгүй нийт reward нийлбэрийг максимумчилах чиглэлрүүгээ илүү сайн тэмүүлж өгнө.</div><div><br /></div><div>Тиймээс өндөр variance-ийн асуудлыг шийдвэрлэхэд тодорхой хэмжээний хувь нэмэр оруулж байгаа хэрэг.</div><div><br /></div><div><br /></div></div><div><br /></div><div><font size="6">Лавлагаа</font></div><div><br /></div><div>Deep RL Bootcamp Lecture 4A: Policy Gradients</div><div><a href="https://www.youtube.com/watch?v=S_gwYj1Q-44">https://www.youtube.com/watch?v=S_gwYj1Q-44</a></div><div><br /></div><div>Reinforcement Learning | MIT 6.S191</div><div><a href="https://www.youtube.com/watch?v=nZfaHIxDD5w">https://www.youtube.com/watch?v=nZfaHIxDD5w</a></div><div><br /></div><div>Deep Reinforcement Learning without PhD</div><div><a href="https://www.youtube.com/watch?v=t1A3NTttvBA">https://www.youtube.com/watch?v=t1A3NTttvBA</a></div><div><br /></div><div>Log Derivate Trick</div><div><a href="http://blog.shakirm.com/2015/11/machine-learning-trick-of-the-day-5-log-derivative-trick/">http://blog.shakirm.com/2015/11/machine-learning-trick-of-the-day-5-log-derivative-trick/</a> </div><div><br /></div><div>Intro to policy optimization</div><div><a href="https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html">https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html</a></div><div><br /></div><div>Policy Gradient Methods for Reinforcement Learning with Function Approximation</div><div><a href="https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf">https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf</a></div><div><br /></div><div>An intuitive explanation of policy gradient</div><div><a 
href="https://towardsdatascience.com/an-intuitive-explanation-of-policy-gradient-part-1-reinforce-aa4392cbfd3c">https://towardsdatascience.com/an-intuitive-explanation-of-policy-gradient-part-1-reinforce-aa4392cbfd3c</a></div><div><br /></div><div><br /></div><div><br /></div><div><font size="6">Implementation</font></div>
<div>
<br />
This implementation does not include the advantage approximation factor; it follows the discounted-reward formulation by Sutton and by Silver discussed above.</div><div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqOy4XXx97H6XQxX2O2VEL7Oo7Z4Ux3_XYQKaCRjU1SKb7yddJZ7Cc31pfqITfkMoVKtbD1v-T1lrZfznRS49lUmSqNpIiUyJvkK2vliavq7-2g0oG7jEp5w1-VvI3Yt6UCYALjVQhPA/s907/action_probability_distribution.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="277" data-original-width="907" height="196" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqOy4XXx97H6XQxX2O2VEL7Oo7Z4Ux3_XYQKaCRjU1SKb7yddJZ7Cc31pfqITfkMoVKtbD1v-T1lrZfznRS49lUmSqNpIiUyJvkK2vliavq7-2g0oG7jEp5w1-VvI3Yt6UCYALjVQhPA/w640-h196/action_probability_distribution.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Each action is assigned a corresponding probability score.<br /></td></tr></tbody></table><div><br /></div><div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3YGZCxmXPR64DumqxvaZMD7zbXHmROam4MWUPHS1k32tdpCQXuIniI6oW-cFU4RdY1vm0fqkgYoHICXlHmfjWW6YUsVV8bp_ovam4e6Ycxa4XJTJzxVSPVdcGvy9-kwassltFlYHMMw/s578/trained_pg.gif" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="355" data-original-width="578" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3YGZCxmXPR64DumqxvaZMD7zbXHmROam4MWUPHS1k32tdpCQXuIniI6oW-cFU4RdY1vm0fqkgYoHICXlHmfjWW6YUsVV8bp_ovam4e6Ycxa4XJTJzxVSPVdcGvy9-kwassltFlYHMMw/d/trained_pg.gif" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">A fully trained Policy Gradient.<br /></td></tr></tbody></table><div><br /></div><div><br /></div><div><br /></div>
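The discounted return G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + … used by this formulation can be computed with a single backward pass over the episode; a minimal sketch (assuming a plain Python list of per-step rewards):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # Walk the episode backwards: G_t = r_t + gamma * G_{t+1}
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# With gamma=0.5 and rewards [1, 1, 1]:
# G_2 = 1.0, G_1 = 1 + 0.5*1 = 1.5, G_0 = 1 + 0.5*1.5 = 1.75
```

This is the same quantity the implementations below compute, just in O(n) instead of a nested O(n²) loop.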
<div>Implementation in Jax and Flax</div>
<div><br /></div>
<pre style="background: rgb(0, 0, 0); color: #d1d1d1;"><span style="color: #e66170; font-weight: bold;">import</span> os
<span style="color: #e66170; font-weight: bold;">import</span> random
<span style="color: #e66170; font-weight: bold;">import</span> math
<span style="color: #e66170; font-weight: bold;">import</span> gym
<span style="color: #e66170; font-weight: bold;">import</span> flax
<span style="color: #e66170; font-weight: bold;">import</span> jax
<span style="color: #e66170; font-weight: bold;">from</span> jax <span style="color: #e66170; font-weight: bold;">import</span> numpy <span style="color: #e66170; font-weight: bold;">as</span> jnp
<span style="color: #e66170; font-weight: bold;">import</span> numpy <span style="color: #e66170; font-weight: bold;">as</span> np
<span style="color: #e66170; font-weight: bold;">import</span> numpy
debug_render <span style="color: #d2cd86;">=</span> True
debug <span style="color: #d2cd86;">=</span> False
num_episodes <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">600</span>
learning_rate <span style="color: #d2cd86;">=</span> <span style="color: #009f00;">0.001</span>
gamma <span style="color: #d2cd86;">=</span> <span style="color: #009f00;">0.99</span> <span style="color: #9999a9;"># discount factor</span>
<span style="color: #e66170; font-weight: bold;">class</span> PolicyNetwork<span style="color: #d2cd86;">(</span>flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Module<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
<span style="color: #e66170; font-weight: bold;">def</span> apply<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> x<span style="color: #d2cd86;">,</span> n_actions<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
dense_layer_1 <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Dense<span style="color: #d2cd86;">(</span>x<span style="color: #d2cd86;">,</span> <span style="color: #00a800;">64</span><span style="color: #d2cd86;">)</span>
activation_layer_1 <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>relu<span style="color: #d2cd86;">(</span>dense_layer_1<span style="color: #d2cd86;">)</span>
dense_layer_2 <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Dense<span style="color: #d2cd86;">(</span>activation_layer_1<span style="color: #d2cd86;">,</span> <span style="color: #00a800;">32</span><span style="color: #d2cd86;">)</span>
activation_layer_2 <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>relu<span style="color: #d2cd86;">(</span>dense_layer_2<span style="color: #d2cd86;">)</span>
output_dense_layer <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Dense<span style="color: #d2cd86;">(</span>activation_layer_2<span style="color: #d2cd86;">,</span> n_actions<span style="color: #d2cd86;">)</span>
output_layer <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>softmax<span style="color: #d2cd86;">(</span>output_dense_layer<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">return</span> output_layer
env <span style="color: #d2cd86;">=</span> gym<span style="color: #d2cd86;">.</span>make<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">'CartPole-v0'</span><span style="color: #d2cd86;">)</span>
state <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>reset<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
n_actions <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>action_space<span style="color: #d2cd86;">.</span>n
pg_module <span style="color: #d2cd86;">=</span> PolicyNetwork<span style="color: #d2cd86;">.</span>partial<span style="color: #d2cd86;">(</span>n_actions<span style="color: #d2cd86;">=</span>n_actions<span style="color: #d2cd86;">)</span>
_<span style="color: #d2cd86;">,</span> params <span style="color: #d2cd86;">=</span> pg_module<span style="color: #d2cd86;">.</span>init_by_shape<span style="color: #d2cd86;">(</span>jax<span style="color: #d2cd86;">.</span>random<span style="color: #d2cd86;">.</span>PRNGKey<span style="color: #d2cd86;">(</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">[</span>state<span style="color: #d2cd86;">.</span>shape<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
policy_network <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Model<span style="color: #d2cd86;">(</span>pg_module<span style="color: #d2cd86;">,</span> params<span style="color: #d2cd86;">)</span>
optimizer <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>optim<span style="color: #d2cd86;">.</span>Adam<span style="color: #d2cd86;">(</span>learning_rate<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">.</span>create<span style="color: #d2cd86;">(</span>policy_network<span style="color: #d2cd86;">)</span>
<span style="color: #d2cd86;">@</span>jax<span style="color: #d2cd86;">.</span>jit
<span style="color: #e66170; font-weight: bold;">def</span> policy_inference<span style="color: #d2cd86;">(</span>model<span style="color: #d2cd86;">,</span> x<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
action_probabilities <span style="color: #d2cd86;">=</span> model<span style="color: #d2cd86;">(</span>x<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">return</span> action_probabilities
<span style="color: #d2cd86;">@</span>jax<span style="color: #d2cd86;">.</span>vmap
<span style="color: #e66170; font-weight: bold;">def</span> gather<span style="color: #d2cd86;">(</span>action_probabilities<span style="color: #d2cd86;">,</span> action_index<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
<span style="color: #e66170; font-weight: bold;">return</span> action_probabilities<span style="color: #d2cd86;">[</span>action_index<span style="color: #d2cd86;">]</span>
<span style="color: #d2cd86;">@</span>jax<span style="color: #d2cd86;">.</span>jit
<span style="color: #e66170; font-weight: bold;">def</span> train_step<span style="color: #d2cd86;">(</span>optimizer<span style="color: #d2cd86;">,</span> batch<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
<span style="color: #9999a9;"># batch[0] - states</span>
<span style="color: #9999a9;"># batch[1] - actions</span>
<span style="color: #9999a9;"># batch[2] - discounted rewards</span>
<span style="color: #e66170; font-weight: bold;">def</span> loss_fn<span style="color: #d2cd86;">(</span>model<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
action_probabilities_list <span style="color: #d2cd86;">=</span> model<span style="color: #d2cd86;">(</span>batch<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
picked_action_probabilities <span style="color: #d2cd86;">=</span> gather<span style="color: #d2cd86;">(</span>action_probabilities_list<span style="color: #d2cd86;">,</span> batch<span style="color: #d2cd86;">[</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
log_probabilities <span style="color: #d2cd86;">=</span> jnp<span style="color: #d2cd86;">.</span>log<span style="color: #d2cd86;">(</span>picked_action_probabilities<span style="color: #d2cd86;">)</span>
losses <span style="color: #d2cd86;">=</span> jnp<span style="color: #d2cd86;">.</span>multiply<span style="color: #d2cd86;">(</span>log_probabilities<span style="color: #d2cd86;">,</span> batch<span style="color: #d2cd86;">[</span><span style="color: #00a800;">2</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">return</span> <span style="color: #00dddd;">-</span>jnp<span style="color: #d2cd86;">.</span><span style="color: #e66170; font-weight: bold;">sum</span><span style="color: #d2cd86;">(</span>losses<span style="color: #d2cd86;">)</span>
loss<span style="color: #d2cd86;">,</span> gradients <span style="color: #d2cd86;">=</span> jax<span style="color: #d2cd86;">.</span>value_and_grad<span style="color: #d2cd86;">(</span>loss_fn<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">(</span>optimizer<span style="color: #d2cd86;">.</span>target<span style="color: #d2cd86;">)</span>
optimizer <span style="color: #d2cd86;">=</span> optimizer<span style="color: #d2cd86;">.</span>apply_gradient<span style="color: #d2cd86;">(</span>gradients<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">return</span> optimizer<span style="color: #d2cd86;">,</span> loss
global_steps <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">0</span>
<span style="color: #e66170; font-weight: bold;">try</span><span style="color: #d2cd86;">:</span>
<span style="color: #e66170; font-weight: bold;">for</span> episode <span style="color: #e66170; font-weight: bold;">in</span> <span style="color: #e66170; font-weight: bold;">range</span><span style="color: #d2cd86;">(</span>num_episodes<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
states<span style="color: #d2cd86;">,</span> actions<span style="color: #d2cd86;">,</span> rewards <span style="color: #d2cd86;">=</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span>
state <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>reset<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">while</span> True<span style="color: #d2cd86;">:</span>
global_steps <span style="color: #d2cd86;">=</span> global_steps<span style="color: #00dddd;">+</span><span style="color: #00a800;">1</span>
action_probabilities <span style="color: #d2cd86;">=</span> policy_inference<span style="color: #d2cd86;">(</span>optimizer<span style="color: #d2cd86;">.</span>target<span style="color: #d2cd86;">,</span> jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">[</span>state<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span>
action_probabilities <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>array<span style="color: #d2cd86;">(</span>action_probabilities<span style="color: #d2cd86;">)</span>
action_probabilities <span style="color: #00dddd;">/</span><span style="color: #d2cd86;">=</span> action_probabilities<span style="color: #d2cd86;">.</span><span style="color: #e66170; font-weight: bold;">sum</span><span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
action <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>random<span style="color: #d2cd86;">.</span>choice<span style="color: #d2cd86;">(</span>n_actions<span style="color: #d2cd86;">,</span> p<span style="color: #d2cd86;">=</span>action_probabilities<span style="color: #d2cd86;">)</span>
new_state<span style="color: #d2cd86;">,</span> reward<span style="color: #d2cd86;">,</span> done<span style="color: #d2cd86;">,</span> _ <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>step<span style="color: #d2cd86;">(</span><span style="color: #e66170; font-weight: bold;">int</span><span style="color: #d2cd86;">(</span>action<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
states<span style="color: #d2cd86;">.</span>append<span style="color: #d2cd86;">(</span>state<span style="color: #d2cd86;">)</span>
actions<span style="color: #d2cd86;">.</span>append<span style="color: #d2cd86;">(</span>action<span style="color: #d2cd86;">)</span>
rewards<span style="color: #d2cd86;">.</span>append<span style="color: #d2cd86;">(</span>reward<span style="color: #d2cd86;">)</span>
state <span style="color: #d2cd86;">=</span> new_state
<span style="color: #e66170; font-weight: bold;">if</span> debug_render<span style="color: #d2cd86;">:</span>
env<span style="color: #d2cd86;">.</span>render<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">if</span> done<span style="color: #d2cd86;">:</span>
<span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"{} - total reward : {}"</span><span style="color: #d2cd86;">.</span>format<span style="color: #d2cd86;">(</span>episode<span style="color: #d2cd86;">,</span> <span style="color: #e66170; font-weight: bold;">sum</span><span style="color: #d2cd86;">(</span>rewards<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
episode_length <span style="color: #d2cd86;">=</span> <span style="color: #e66170; font-weight: bold;">len</span><span style="color: #d2cd86;">(</span>rewards<span style="color: #d2cd86;">)</span>
discounted_rewards <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>zeros_like<span style="color: #d2cd86;">(</span>rewards<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">for</span> t <span style="color: #e66170; font-weight: bold;">in</span> <span style="color: #e66170; font-weight: bold;">range</span><span style="color: #d2cd86;">(</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">,</span> episode_length<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
G_t <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">0</span>
<span style="color: #e66170; font-weight: bold;">for</span> idx<span style="color: #d2cd86;">,</span> j <span style="color: #e66170; font-weight: bold;">in</span> <span style="color: #e66170; font-weight: bold;">enumerate</span><span style="color: #d2cd86;">(</span><span style="color: #e66170; font-weight: bold;">range</span><span style="color: #d2cd86;">(</span>t<span style="color: #d2cd86;">,</span> episode_length<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
G_t <span style="color: #d2cd86;">=</span> G_t <span style="color: #00dddd;">+</span> <span style="color: #d2cd86;">(</span>gamma<span style="color: #00dddd;">**</span>idx<span style="color: #d2cd86;">)</span><span style="color: #00dddd;">*</span>rewards<span style="color: #d2cd86;">[</span>j<span style="color: #d2cd86;">]</span>
discounted_rewards<span style="color: #d2cd86;">[</span>t<span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> G_t
discounted_rewards <span style="color: #d2cd86;">=</span> discounted_rewards <span style="color: #00dddd;">-</span> np<span style="color: #d2cd86;">.</span>mean<span style="color: #d2cd86;">(</span>discounted_rewards<span style="color: #d2cd86;">)</span>
discounted_rewards <span style="color: #d2cd86;">=</span> discounted_rewards <span style="color: #00dddd;">/</span> <span style="color: #d2cd86;">(</span>np<span style="color: #d2cd86;">.</span>std<span style="color: #d2cd86;">(</span>discounted_rewards<span style="color: #d2cd86;">)</span><span style="color: #00dddd;">+</span><span style="color: #00a800;">1e-10</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"Training..."</span><span style="color: #d2cd86;">)</span>
optimizer<span style="color: #d2cd86;">,</span> loss <span style="color: #d2cd86;">=</span> train_step<span style="color: #d2cd86;">(</span>optimizer<span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">(</span>
jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span>states<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span>
jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span>actions<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span>
jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span>discounted_rewards<span style="color: #d2cd86;">)</span>
<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">break</span>
<span style="color: #e66170; font-weight: bold;">finally</span><span style="color: #d2cd86;">:</span>
env<span style="color: #d2cd86;">.</span>close<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
</pre>
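As a side note, the <code>gather</code> helper above (vectorized with <code>jax.vmap</code>) simply picks out, for every timestep, the probability of the action that was actually taken. A plain-numpy equivalent of that indexing, with made-up toy values rather than data from the training run:

```python
import numpy as np

def gather(action_probabilities, action_indices):
    # For row t, pick column action_indices[t]:
    # the probability the policy assigned to the chosen action
    rows = np.arange(len(action_indices))
    return action_probabilities[rows, action_indices]

probs   = np.array([[0.7, 0.3],
                    [0.1, 0.9]])
actions = np.array([0, 1])
# gather(probs, actions) picks 0.7 from row 0 and 0.9 from row 1
```

These are exactly the πθ(a|s) values whose logs get multiplied by the discounted rewards inside `loss_fn`.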
<div><br /></div>
<div><br /></div>
Implementation in TensorFlow 2<br />
<div>
<pre style="background: rgb(0, 0, 0); color: #d1d1d1;"><span style="color: #e66170; font-weight: bold;">import</span> os
<span style="color: #e66170; font-weight: bold;">import</span> random
<span style="color: #e66170; font-weight: bold;">from</span> time <span style="color: #e66170; font-weight: bold;">import</span> sleep
<span style="color: #e66170; font-weight: bold;">from</span> collections <span style="color: #e66170; font-weight: bold;">import</span> deque
<span style="color: #e66170; font-weight: bold;">import</span> numpy <span style="color: #e66170; font-weight: bold;">as</span> np
<span style="color: #e66170; font-weight: bold;">import</span> gym
<span style="color: #e66170; font-weight: bold;">import</span> cv2
<span style="color: #e66170; font-weight: bold;">import</span> tkinter
<span style="color: #e66170; font-weight: bold;">import</span> matplotlib
matplotlib<span style="color: #d2cd86;">.</span>use<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">'TkAgg'</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">import</span> matplotlib<span style="color: #d2cd86;">.</span>pyplot <span style="color: #e66170; font-weight: bold;">as</span> plt
<span style="color: #e66170; font-weight: bold;">import</span> tensorflow <span style="color: #e66170; font-weight: bold;">as</span> tf
tf<span style="color: #d2cd86;">.</span>get_logger<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">.</span>setLevel<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">'ERROR'</span><span style="color: #d2cd86;">)</span>
debug_render <span style="color: #d2cd86;">=</span> False
num_episodes <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">2000</span>
save_per_step <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">1000</span> <span style="color: #9999a9;"># how many steps between saves of the trained model</span>
gamma <span style="color: #d2cd86;">=</span> <span style="color: #009f00;">0.99</span> <span style="color: #9999a9;"># discount factor</span>
<span style="color: #e66170; font-weight: bold;">class</span> PolicyNetwork<span style="color: #d2cd86;">(</span>tf<span style="color: #d2cd86;">.</span>keras<span style="color: #d2cd86;">.</span>Model<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
<span style="color: #e66170; font-weight: bold;">def</span> __init__<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> n_actions<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
<span style="color: #e66170; font-weight: bold;">super</span><span style="color: #d2cd86;">(</span>PolicyNetwork<span style="color: #d2cd86;">,</span> self<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">.</span>__init__<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
self<span style="color: #d2cd86;">.</span>dense_layer <span style="color: #d2cd86;">=</span> tf<span style="color: #d2cd86;">.</span>keras<span style="color: #d2cd86;">.</span>layers<span style="color: #d2cd86;">.</span>Dense<span style="color: #d2cd86;">(</span><span style="color: #00a800;">128</span><span style="color: #d2cd86;">,</span> activation<span style="color: #d2cd86;">=</span><span style="color: #00c4c4;">'relu'</span><span style="color: #d2cd86;">,</span> kernel_initializer<span style="color: #d2cd86;">=</span><span style="color: #00c4c4;">'glorot_uniform'</span><span style="color: #d2cd86;">)</span>
self<span style="color: #d2cd86;">.</span>output_layer <span style="color: #d2cd86;">=</span> tf<span style="color: #d2cd86;">.</span>keras<span style="color: #d2cd86;">.</span>layers<span style="color: #d2cd86;">.</span>Dense<span style="color: #d2cd86;">(</span>n_actions<span style="color: #d2cd86;">,</span> activation<span style="color: #d2cd86;">=</span><span style="color: #00c4c4;">'softmax'</span><span style="color: #d2cd86;">,</span> kernel_initializer<span style="color: #d2cd86;">=</span><span style="color: #00c4c4;">'glorot_uniform'</span><span style="color: #d2cd86;">)</span>
<span style="color: #d2cd86;">@</span>tf<span style="color: #d2cd86;">.</span>function<span style="color: #d2cd86;">(</span>experimental_relax_shapes<span style="color: #d2cd86;">=</span>True<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">def</span> call<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> inputs<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
dense_out <span style="color: #d2cd86;">=</span> self<span style="color: #d2cd86;">.</span>dense_layer<span style="color: #d2cd86;">(</span>inputs<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">return</span> self<span style="color: #d2cd86;">.</span>output_layer<span style="color: #d2cd86;">(</span>dense_out<span style="color: #d2cd86;">)</span>
env <span style="color: #d2cd86;">=</span> gym<span style="color: #d2cd86;">.</span>make<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">'CartPole-v0'</span><span style="color: #d2cd86;">)</span>
env<span style="color: #d2cd86;">.</span>reset<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
n_actions <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>action_space<span style="color: #d2cd86;">.</span>n
policy_optimizer <span style="color: #d2cd86;">=</span> tf<span style="color: #d2cd86;">.</span>keras<span style="color: #d2cd86;">.</span>optimizers<span style="color: #d2cd86;">.</span>RMSprop<span style="color: #d2cd86;">(</span>lr<span style="color: #d2cd86;">=</span><span style="color: #009f00;">0.0007</span><span style="color: #d2cd86;">)</span>
<span style="color: #9999a9;">#policy_optimizer = tf.keras.optimizers.Adam()</span>
policy <span style="color: #d2cd86;">=</span> PolicyNetwork<span style="color: #d2cd86;">(</span>n_actions<span style="color: #d2cd86;">)</span>
<span style="color: #d2cd86;">@</span>tf<span style="color: #d2cd86;">.</span>function<span style="color: #d2cd86;">(</span>experimental_relax_shapes<span style="color: #d2cd86;">=</span>True<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">def</span> policy_loss_fn<span style="color: #d2cd86;">(</span>action_logits<span style="color: #d2cd86;">,</span> actions<span style="color: #d2cd86;">,</span> targets<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
actions <span style="color: #d2cd86;">=</span> tf<span style="color: #d2cd86;">.</span>convert_to_tensor<span style="color: #d2cd86;">(</span>
<span style="color: #e66170; font-weight: bold;">list</span><span style="color: #d2cd86;">(</span><span style="color: #e66170; font-weight: bold;">zip</span><span style="color: #d2cd86;">(</span>np<span style="color: #d2cd86;">.</span>arange<span style="color: #d2cd86;">(</span><span style="color: #e66170; font-weight: bold;">len</span><span style="color: #d2cd86;">(</span>actions<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span> actions<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
<span style="color: #d2cd86;">)</span>
<span style="color: #9999a9;"># πθ(a|s)</span>
action_probabilities <span style="color: #d2cd86;">=</span> action_logits <span style="color: #9999a9;"># the output layer already applies softmax</span>
<span style="color: #9999a9;"># Take the probability score corresponding to each action index</span>
picked_action_probabilities <span style="color: #d2cd86;">=</span> tf<span style="color: #d2cd86;">.</span>gather_nd<span style="color: #d2cd86;">(</span>action_probabilities<span style="color: #d2cd86;">,</span> actions<span style="color: #d2cd86;">)</span>
<span style="color: #9999a9;"># logπθ(a|s)</span>
log_probabilities <span style="color: #d2cd86;">=</span> tf<span style="color: #d2cd86;">.</span>cast<span style="color: #d2cd86;">(</span>tf<span style="color: #d2cd86;">.</span>math<span style="color: #d2cd86;">.</span>log<span style="color: #d2cd86;">(</span>picked_action_probabilities<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span> dtype<span style="color: #d2cd86;">=</span>tf<span style="color: #d2cd86;">.</span>float64<span style="color: #d2cd86;">)</span>
<span style="color: #9999a9;"># logπθ(a|s)*G_t, multiply the action log-probability by the discounted reward</span>
loss <span style="color: #d2cd86;">=</span> tf<span style="color: #d2cd86;">.</span>multiply<span style="color: #d2cd86;">(</span>log_probabilities<span style="color: #d2cd86;">,</span> tf<span style="color: #d2cd86;">.</span>convert_to_tensor<span style="color: #d2cd86;">(</span>targets<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
<span style="color: #9999a9;"># negate the loss so that the optimizer's minimization maximizes the objective</span>
<span style="color: #e66170; font-weight: bold;">return</span> <span style="color: #00dddd;">-</span>tf<span style="color: #d2cd86;">.</span>reduce_sum<span style="color: #d2cd86;">(</span>loss<span style="color: #d2cd86;">)</span>
<span style="color: #d2cd86;">@</span>tf<span style="color: #d2cd86;">.</span>function<span style="color: #d2cd86;">(</span>experimental_relax_shapes<span style="color: #d2cd86;">=</span>True<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">def</span> train_policy_network<span style="color: #d2cd86;">(</span>inputs<span style="color: #d2cd86;">,</span> actions<span style="color: #d2cd86;">,</span> advantages<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
<span style="color: #e66170; font-weight: bold;">with</span> tf<span style="color: #d2cd86;">.</span>GradientTape<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span> <span style="color: #e66170; font-weight: bold;">as</span> tape<span style="color: #d2cd86;">:</span>
<span style="color: #9999a9;"># πθ(a|s)</span>
predictions <span style="color: #d2cd86;">=</span> policy<span style="color: #d2cd86;">(</span>inputs<span style="color: #d2cd86;">,</span> training<span style="color: #d2cd86;">=</span>True<span style="color: #d2cd86;">)</span>
loss <span style="color: #d2cd86;">=</span> policy_loss_fn<span style="color: #d2cd86;">(</span>predictions<span style="color: #d2cd86;">,</span> actions<span style="color: #d2cd86;">,</span> advantages<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">if</span> debug_render<span style="color: #d2cd86;">:</span>
tf<span style="color: #d2cd86;">.</span><span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"loss : "</span><span style="color: #d2cd86;">,</span> loss<span style="color: #d2cd86;">)</span>
gradients <span style="color: #d2cd86;">=</span> tape<span style="color: #d2cd86;">.</span>gradient<span style="color: #d2cd86;">(</span>loss<span style="color: #d2cd86;">,</span> policy<span style="color: #d2cd86;">.</span>trainable_variables<span style="color: #d2cd86;">)</span>
policy_optimizer<span style="color: #d2cd86;">.</span>apply_gradients<span style="color: #d2cd86;">(</span><span style="color: #e66170; font-weight: bold;">zip</span><span style="color: #d2cd86;">(</span>gradients<span style="color: #d2cd86;">,</span> policy<span style="color: #d2cd86;">.</span>trainable_variables<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">if</span> <span style="color: #e66170; font-weight: bold;">not</span> os<span style="color: #d2cd86;">.</span>path<span style="color: #d2cd86;">.</span>exists<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"model_weights"</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
os<span style="color: #d2cd86;">.</span>makedirs<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"model_weights"</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">if</span> os<span style="color: #d2cd86;">.</span>path<span style="color: #d2cd86;">.</span>exists<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">'model_weights/ReinforcePolicy'</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
policy <span style="color: #d2cd86;">=</span> tf<span style="color: #d2cd86;">.</span>keras<span style="color: #d2cd86;">.</span>models<span style="color: #d2cd86;">.</span>load_model<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"model_weights/ReinforcePolicy"</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"loaded the previously trained model"</span><span style="color: #d2cd86;">)</span>
global_steps <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">0</span>
rewards_history <span style="color: #d2cd86;">=</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span>
<span style="color: #e66170; font-weight: bold;">for</span> episode <span style="color: #e66170; font-weight: bold;">in</span> <span style="color: #e66170; font-weight: bold;">range</span><span style="color: #d2cd86;">(</span>num_episodes<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
done <span style="color: #d2cd86;">=</span> False
score <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">0</span>
states<span style="color: #d2cd86;">,</span> actions<span style="color: #d2cd86;">,</span> rewards <span style="color: #d2cd86;">=</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span>
state <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>reset<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">while</span> <span style="color: #e66170; font-weight: bold;">not</span> done<span style="color: #d2cd86;">:</span>
global_steps <span style="color: #d2cd86;">=</span> global_steps<span style="color: #00dddd;">+</span><span style="color: #00a800;">1</span>
logits <span style="color: #d2cd86;">=</span> policy<span style="color: #d2cd86;">(</span>np<span style="color: #d2cd86;">.</span>array<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">[</span>state<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> dtype<span style="color: #d2cd86;">=</span>np<span style="color: #d2cd86;">.</span>float32<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span> training<span style="color: #d2cd86;">=</span>False<span style="color: #d2cd86;">)</span> <span style="color: #9999a9;"># πθ(a|s)</span>
probabilities <span style="color: #d2cd86;">=</span> tf<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>softmax<span style="color: #d2cd86;">(</span>logits<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">.</span>numpy<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span>
action <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>random<span style="color: #d2cd86;">.</span>choice<span style="color: #d2cd86;">(</span>n_actions<span style="color: #d2cd86;">,</span> p<span style="color: #d2cd86;">=</span>probabilities<span style="color: #d2cd86;">)</span>
new_state<span style="color: #d2cd86;">,</span> reward<span style="color: #d2cd86;">,</span> done<span style="color: #d2cd86;">,</span> _ <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>step<span style="color: #d2cd86;">(</span>action<span style="color: #d2cd86;">)</span>
score <span style="color: #d2cd86;">=</span> score<span style="color: #00dddd;">+</span>reward
states <span style="color: #d2cd86;">.</span>append<span style="color: #d2cd86;">(</span>state<span style="color: #d2cd86;">)</span>
actions <span style="color: #d2cd86;">.</span>append<span style="color: #d2cd86;">(</span>action<span style="color: #d2cd86;">)</span>
rewards <span style="color: #d2cd86;">.</span>append<span style="color: #d2cd86;">(</span>reward<span style="color: #d2cd86;">)</span>
state <span style="color: #d2cd86;">=</span> new_state
<span style="color: #e66170; font-weight: bold;">if</span> debug_render<span style="color: #d2cd86;">:</span>
env<span style="color: #d2cd86;">.</span>render<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">if</span> global_steps<span style="color: #00dddd;">%</span>save_per_step<span style="color: #00dddd;">==</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">:</span>
policy<span style="color: #d2cd86;">.</span>save<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"model_weights/ReinforcePolicy"</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"saved the model"</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">if</span> done<span style="color: #d2cd86;">:</span>
episode_length <span style="color: #d2cd86;">=</span> <span style="color: #e66170; font-weight: bold;">len</span><span style="color: #d2cd86;">(</span>states<span style="color: #d2cd86;">)</span>
input_states <span style="color: #d2cd86;">=</span> tf<span style="color: #d2cd86;">.</span>convert_to_tensor<span style="color: #d2cd86;">(</span>states<span style="color: #d2cd86;">,</span> dtype<span style="color: #d2cd86;">=</span>tf<span style="color: #d2cd86;">.</span>float32<span style="color: #d2cd86;">)</span>
discounted_rewards <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>zeros_like<span style="color: #d2cd86;">(</span>rewards<span style="color: #d2cd86;">,</span> dtype<span style="color: #d2cd86;">=</span>np<span style="color: #d2cd86;">.</span>float32<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">for</span> t <span style="color: #e66170; font-weight: bold;">in</span> <span style="color: #e66170; font-weight: bold;">range</span><span style="color: #d2cd86;">(</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">,</span> episode_length<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
G_t <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">0</span>
<span style="color: #e66170; font-weight: bold;">for</span> idx<span style="color: #d2cd86;">,</span> j <span style="color: #e66170; font-weight: bold;">in</span> <span style="color: #e66170; font-weight: bold;">enumerate</span><span style="color: #d2cd86;">(</span><span style="color: #e66170; font-weight: bold;">range</span><span style="color: #d2cd86;">(</span>t<span style="color: #d2cd86;">,</span> episode_length<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
G_t <span style="color: #d2cd86;">=</span> G_t <span style="color: #00dddd;">+</span> <span style="color: #d2cd86;">(</span>gamma<span style="color: #00dddd;">**</span>idx<span style="color: #d2cd86;">)</span><span style="color: #00dddd;">*</span>rewards<span style="color: #d2cd86;">[</span>j<span style="color: #d2cd86;">]</span>
discounted_rewards<span style="color: #d2cd86;">[</span>t<span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> G_t
<span style="color: #9999a9;"># normalize rewards</span>
discounted_rewards <span style="color: #d2cd86;">=</span> <span style="color: #d2cd86;">(</span>discounted_rewards <span style="color: #00dddd;">-</span> np<span style="color: #d2cd86;">.</span>mean<span style="color: #d2cd86;">(</span>discounted_rewards<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span> <span style="color: #00dddd;">/</span> <span style="color: #d2cd86;">(</span>np<span style="color: #d2cd86;">.</span>std<span style="color: #d2cd86;">(</span>discounted_rewards<span style="color: #d2cd86;">)</span> <span style="color: #00dddd;">+</span> <span style="color: #00a800;">1e-10</span><span style="color: #d2cd86;">)</span>
train_policy_network<span style="color: #d2cd86;">(</span>input_states<span style="color: #d2cd86;">,</span> actions<span style="color: #d2cd86;">,</span> discounted_rewards<span style="color: #d2cd86;">)</span>
training_happened <span style="color: #d2cd86;">=</span> True
states<span style="color: #d2cd86;">,</span> rewards<span style="color: #d2cd86;">,</span> actions <span style="color: #d2cd86;">=</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span>
<span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"%s : episode of length %s finished with score %s"</span><span style="color: #00dddd;">%</span><span style="color: #d2cd86;">(</span>episode<span style="color: #d2cd86;">,</span> episode_length<span style="color: #d2cd86;">,</span> score<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
rewards_history<span style="color: #d2cd86;">.</span>append<span style="color: #d2cd86;">(</span>score<span style="color: #d2cd86;">)</span>
plt<span style="color: #d2cd86;">.</span>style<span style="color: #d2cd86;">.</span>use<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">'seaborn'</span><span style="color: #d2cd86;">)</span>
plt<span style="color: #d2cd86;">.</span>plot<span style="color: #d2cd86;">(</span>rewards_history<span style="color: #d2cd86;">)</span>
plt<span style="color: #d2cd86;">.</span>xlabel<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">'Episode'</span><span style="color: #d2cd86;">)</span>
plt<span style="color: #d2cd86;">.</span>ylabel<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">'Total Reward'</span><span style="color: #d2cd86;">)</span>
plt<span style="color: #d2cd86;">.</span>savefig<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">'reinforce_baseline.png'</span><span style="color: #d2cd86;">)</span>
<span style="color: #9999a9;">#plt.show()</span>
<span style="color: #e66170; font-weight: bold;">if</span> debug_render<span style="color: #d2cd86;">:</span>
plt<span style="color: #d2cd86;">.</span>close<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"all"</span><span style="color: #d2cd86;">)</span>
env<span style="color: #d2cd86;">.</span>close<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
</pre>
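The discounted return G_t in the listing above is computed with a nested loop, which is O(T²) in the episode length. The same values can be accumulated in a single backward pass; a minimal sketch (the function name is mine, not from the post):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # single backward pass: G_t = r_t + gamma * G_{t+1}
    returns = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running    = rewards[t] + gamma * running
        returns[t] = running
    return returns

# with gamma = 0.5 and three rewards of 1.0: G_2 = 1, G_1 = 1.5, G_0 = 1.75
assert np.allclose(discounted_returns([1.0, 1.0, 1.0], gamma=0.5), [1.75, 1.5, 1.0])
```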
</div>Sharavsambuu Gunchinishhttp://www.blogger.com/profile/06950810883056147179noreply@blogger.com0Ulaanbaatar, Mongoliatag:blogger.com,1999:blog-1457877875009527488.post-55950951154182185872020-06-22T21:23:00.115+08:002020-07-26T07:09:37.425+08:00Deep Reinforcement Learning, Deep Q Network буюу DQNThe previous post covered how <a href="https://sharavaa.blogspot.com/2020/06/deep-reinforcement-learning-q-learning.html">Q Learning</a> works. As a reminder, the <b>Q function</b> returns an approximation of the <b>total cumulative reward</b> the agent will receive in the future if it takes an <b>action</b> in a given <b>state</b> and then plays optimally from there on. The goal of <b>RL</b> is to <b>maximize the cumulative reward</b>, and this Q function is the guide toward that maximum; it is used to construct the <b>optimal policy</b>.<div><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhC-33rKG-LO2R2AZgyilm6QRmKh-cY-bb2RlQUNq775G8dlxOvMkGi8iijhiCc0jaC5ufXr6OF7fOTuNTZJEiW1RfJ4tkye4O1LFLbW-25LoZH6XtLBHffc887BhA_74_xaBpZ-1Rs_A/s735/Peek+2020-06-21+22-47.gif" style="margin-left: auto; margin-right: auto;"><img alt="An untrained rocket attempting to land." border="0" data-original-height="735" data-original-width="564" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhC-33rKG-LO2R2AZgyilm6QRmKh-cY-bb2RlQUNq775G8dlxOvMkGi8iijhiCc0jaC5ufXr6OF7fOTuNTZJEiW1RfJ4tkye4O1LFLbW-25LoZH6XtLBHffc887BhA_74_xaBpZ-1Rs_A/w246-h320/Peek+2020-06-21+22-47.gif" title="An untrained rocket attempting to land." width="246" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">An unsuccessful rocket landing.
<br />It will be trained properly in the upcoming, improved posts.</td></tr></tbody></table><div><br /></div><div><span><a name='more'></a></span><div><br /></div><div>This post is about pairing the Deep Learning approach with Reinforcement Learning. </div></div></div><div><br /></div><div>So let's start by stating the problem. Take a good look at the animation above. A rocket tries to land and crashes miserably.</div><div><br /></div><div>How can we make this rocket land like <a href="https://www.youtube.com/watch?v=IXYMbzV2DC0">SpaceX's Falcon 9</a>?</div><div><br /></div><div>We could try the Q Learning algorithm from the previous post, but there is a problem: registering a state made up of this many pixels in a Q table would require far too much memory. </div><div><br /></div><div>It is not that it cannot be done; it is simply far too impractical a solution.</div><div><br /></div><div>A neural network is a <a href="https://t8m8r.wordpress.com/2018/08/16/uat/">universal function approximator</a>, capable of efficiently approximating the behavior of virtually any function.</div><div><br /></div><div>In its hidden layers a neural network compresses many different states into compact representations, and subsequent layers use combinations of those representations to make decisions, which makes it a rather efficient algorithm.</div><div><br /></div><div>So a neural network turns out to be the best-fitting solution for the problem described above.</div><div><br /></div><div>The <a href="https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf">Deep Q Networks</a> algorithm works by feeding the RL environment's state through a neural network, which decomposes its features by pattern extraction, captures them in its internal activations, and produces the Q value at its output.</div><div><br /></div><div>In a sense, it is a regression that approximates the Q function. 
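To make the "regression that approximates Q" idea concrete, here is a minimal sketch in which a tiny linear model stands in for the neural network: a state vector goes in and one Q value per action comes out (all names and sizes here are illustrative, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
n_state_features, n_actions = 8, 4

# the "network": a single weight matrix standing in for Q(s, .; theta)
theta = rng.normal(size=(n_state_features, n_actions))

def q_values(state):
    # forward pass: state vector -> vector of Q values, one per action
    return state @ theta

state = rng.normal(size=n_state_features)
qs    = q_values(state)
assert qs.shape == (n_actions,)
action = int(np.argmax(qs))  # greedy action under the approximated Q
```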
</div><div><br /></div><div>Let's recall the Bellman equation.<br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQBecz3RsWyMTk_MlhLgokcMHDsZ4cfb7FJ7M0qXjdEryOnbYs_Dlu1q_ExuEJomukEhPhepg_nC9jXrczKLZeIVTZiAiQkC1w9K1Kz2d29GEeW-XkYC8VvcxV2NnuMR1F2-Uvnl5hoA/s337/Bellman+equation.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="63" data-original-width="337" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQBecz3RsWyMTk_MlhLgokcMHDsZ4cfb7FJ7M0qXjdEryOnbYs_Dlu1q_ExuEJomukEhPhepg_nC9jXrczKLZeIVTZiAiQkC1w9K1Kz2d29GEeW-XkYC8VvcxV2NnuMR1F2-Uvnl5hoA/d/Bellman+equation.png" /></a></div><div>The goal of DQN is to learn to approximate this Q(s, a) function. </div><div><br /></div><div>Let's write the neural network as a mathematical expression, a function that learns the parameter <b>theta</b>.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgXqtOmU04yu5C5RXf-mTa417tyt31KEGbDAof7T4z9FP-cUNtMC3g60mB6xUW0YTAyLRDMiM3_xx1Ka12r_C8J99y_75x-1aK5REgz3zGsqipLcDecqQF78kAnaB4VCRFEaF4Q7_8yog/s135/NN+as+theta+and+function.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="51" data-original-width="135" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgXqtOmU04yu5C5RXf-mTa417tyt31KEGbDAof7T4z9FP-cUNtMC3g60mB6xUW0YTAyLRDMiM3_xx1Ka12r_C8J99y_75x-1aK5REgz3zGsqipLcDecqQF78kAnaB4VCRFEaF4Q7_8yog/d/NN+as+theta+and+function.png" /></a></div><div>To train the neural network we first need to define a Loss, or Cost, function. 
In DQN this Loss is called the <b>Temporal Difference Error</b>.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgHPeSD8QOKlorJDPV7SsVMsbxciV1V_wqBT4OdN7s17iY3KHeuZIvDHLCSAuIBTuGDP18HbJEbbR_hgrLzZmEEsgiUKBwvsM8N0lIvKsdS4gg9LijG6lzI1lUk2YlOtnhqkLvhtVJzQ/s378/General+NN+Loss.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="80" data-original-width="378" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgHPeSD8QOKlorJDPV7SsVMsbxciV1V_wqBT4OdN7s17iY3KHeuZIvDHLCSAuIBTuGDP18HbJEbbR_hgrLzZmEEsgiUKBwvsM8N0lIvKsdS4gg9LijG6lzI1lUk2YlOtnhqkLvhtVJzQ/d/General+NN+Loss.png" /></a></div><div>Since our neural network has to learn the Q function, the <b>target</b> is given by the Bellman formula. The Loss with respect to the parameter theta is then</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiIQdhB9jPyXrBluME56zModH077wfWmmmtbtWmOH545uJH74Mn4BvJdoqXFT1K_QEx8GOug2aX4-DYvEaCuG4xLvxqIZIy4SWpYldhPdDMbLsht_x75W4pC7ENgVvpnvMFZvqatIx2Bw/s554/DQN+loss+with+bellman.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="301" data-original-width="554" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiIQdhB9jPyXrBluME56zModH077wfWmmmtbtWmOH545uJH74Mn4BvJdoqXFT1K_QEx8GOug2aX4-DYvEaCuG4xLvxqIZIy4SWpYldhPdDMbLsht_x75W4pC7ENgVvpnvMFZvqatIx2Bw/d/DQN+loss+with+bellman.png" /></a></div><div>With the Loss function defined, we can derive the formula that updates the parameter theta using gradient values, i.e. the formula for training the neural network.</div><div class="separator" style="clear: both; text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgS4GmmsgIEviOarIdu0cOjrCRPaIT3qmySAQuIvKePL6g_2dx0KOIaulId5dPK4q6lua9QAqcKOmakBVZBaou_ZclbfzPvzBInj7pf3uIP86vKqzyEN4G6k-J1MYk1ihagkLtEAy6WuA/s212/DQN+gradient+update.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="150" data-original-width="212" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgS4GmmsgIEviOarIdu0cOjrCRPaIT3qmySAQuIvKePL6g_2dx0KOIaulId5dPK4q6lua9QAqcKOmakBVZBaou_ZclbfzPvzBInj7pf3uIP86vKqzyEN4G6k-J1MYk1ihagkLtEAy6WuA/s0/DQN+gradient+update.png" /></a></div><div><br /></div><div>Here alpha is the learning rate.</div><div><br /></div><div><br /></div><div>Before the <a href="https://www.nature.com/articles/nature14236">DQN paper</a> that appeared in Nature, people reportedly had little success training DQN because a few tricks were still missing. </div><div><br /></div><div>One of these tricks is <b>Experience Replay</b>. To understand it, let's first go through the definitions.</div><div><br /></div><div>As the agent runs, at every time step it takes a record called an experience and appends it to a <b>replay memory</b> list. 
</div><div><br /></div><div>An agent's <b>experience</b> is a tuple bundling the <b>state</b>, <b>action</b>, <b>reward</b>, and <b>new state</b> values at a given time step (SARS).</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8w0TIzPCVS9eEEvU3bey3lYlTKNT_3dVpv3wrFvDInv-j_9MOLzic-0Im02l6cKxJfT7DmWDCVQRyo0SKIEK3cFweozIIrG2ZT7xzqj_rLURUiITxD4kcw8WizadYLPLifGF8zSu-zg/s187/Experience+tuple.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="54" data-original-width="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8w0TIzPCVS9eEEvU3bey3lYlTKNT_3dVpv3wrFvDInv-j_9MOLzic-0Im02l6cKxJfT7DmWDCVQRyo0SKIEK3cFweozIIrG2ZT7xzqj_rLURUiITxD4kcw8WizadYLPLifGF8zSu-zg/d/Experience+tuple.png" /></a></div><div>You can think of this as recording the history of changes: when some action was taken, this is how the state changed and this is the reward that was received.</div><div><br /></div><div>Since memory is of course limited, the replay memory is configured in advance with a fixed length N. When this memory fills up, the oldest experience is removed to make room for the new one.</div><div><br /></div><div>Random batches are drawn from the replay memory and used to train the agent. </div><div><br /></div><div>Using experience replay removes temporal correlation, saves computation, and makes data usage more efficient.</div><div><br /></div><div>If we kept training the agent on consecutive samples, our neural network would likely try to learn only those sequential patterns. 
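The replay memory described above is essentially a fixed-length FIFO buffer of (s, a, r, s', done) tuples sampled uniformly at random; a minimal sketch (the full listing further below uses a prioritized SumTree variant instead of this plain version):

```python
import random
from collections import deque

replay_memory = deque(maxlen=3)  # N = 3 only for illustration

# each experience is a (state, action, reward, new_state, done) tuple
for t in range(5):
    replay_memory.append((f"s{t}", t % 2, 1.0, f"s{t+1}", False))

# once full, the oldest experiences are dropped automatically
assert [e[0] for e in replay_memory] == ["s2", "s3", "s4"]

# training draws a random minibatch, breaking temporal correlation
batch = random.sample(list(replay_memory), 2)
assert len(batch) == 2
```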
That is why experience replay is used to break up this sequential character.</div><div><br /></div><div><br /></div><div><font size="6">Implementation</font></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3IoBzb2VudQAIU7e8Q9woRl7Ttih6VbJzqkR4MEImXf38Hhp3nSWtDPuk_so9nPe_tlUpziycZ-BZTn0OHvKh3ICOuZEHW_K9rCJT0s4hibbNqEKGhBVW7KJFr4g59c8oTBUDvzPWWQ/s960/DQN+training+chart.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="720" data-original-width="960" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3IoBzb2VudQAIU7e8Q9woRl7Ttih6VbJzqkR4MEImXf38Hhp3nSWtDPuk_so9nPe_tlUpziycZ-BZTn0OHvKh3ICOuZEHW_K9rCJT0s4hibbNqEKGhBVW7KJFr4g59c8oTBUDvzPWWQ/w640-h480/DQN+training+chart.jpg" width="640" /></a></div><div>The idea is to create two neural networks for the q function, both using one identical architecture. </div><div><br /></div><div>One serves as the target while the other is used to collect experience and train on it. </div><div><br /></div><div>From time to time the weights of the improving, trained q network are copied over to the target network. </div><div><br /></div><div>Using the target network, the predictions of the learning q network, i.e. its q values, are corrected and the training batch is formed. 
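A sketch of the two-network scheme just described, with plain arrays standing in for the networks (sizes and the "gradient step" are placeholders; only the copy-every-sync_steps pattern and the TD target mirror the description above):

```python
import numpy as np

rng            = np.random.default_rng(0)
q_weights      = rng.normal(size=(4, 2))   # learning network
target_weights = q_weights.copy()          # target network, same architecture

sync_steps = 100
for step in range(1, 301):
    q_weights += 0.01 * rng.normal(size=q_weights.shape)  # placeholder "training" update
    if step % sync_steps == 0:
        target_weights = q_weights.copy()  # periodic weight copy into the target net

# the TD target for a transition uses the (frozen) target network's estimate
gamma, reward = 0.99, 1.0
next_state    = rng.normal(size=4)
td_target     = reward + gamma * np.max(next_state @ target_weights)
assert np.isfinite(td_target)
```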
</div><div><br /></div>The input of the architecture is the state and the output is a <b>vector whose length equals the total number of actions</b>; each element of the vector, by its position, is the approximated q value corresponding to that action.<div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXInyGuuq-KIs7BvW9pct0sCalgCKk470-vd7r16pqbNWLSksSRDVog7VABgvLpTHeeTuDyHnP1wRPOUe3DfI1zdW-nX1K8819tRnia0Dns_pZr5OwnGcA0ID8tCNbRluVKTd0TFkD6w/s1016/Q+value+predictions+through+DQN.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="303" data-original-width="1016" height="191" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXInyGuuq-KIs7BvW9pct0sCalgCKk470-vd7r16pqbNWLSksSRDVog7VABgvLpTHeeTuDyHnP1wRPOUe3DfI1zdW-nX1K8819tRnia0Dns_pZr5OwnGcA0ID8tCNbRluVKTd0TFkD6w/w640-h191/Q+value+predictions+through+DQN.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Image from MIT's Deep RL course<br />DQN predicting the vector of Q values.<br /></td></tr></tbody></table><div class="separator" style="clear: both; text-align: center;"><br /></div><div><br /></div><div>For CartPole, the simplest environment and arguably the HelloWorld of RL, I have posted implementations together with Prioritized Experience Replay</div><div><br /></div><div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbj5twmpXpi2_mXKdzvckHISlts-nmiZr8Y2PDr7Yu-vKs-u8QbLoZKGrbRHTq6ztEog3kZp8j6PSmOP6cACgH7BujLvpy70qMU1MaggYUiDMbQoN2L_7o7jtzswqO6E9xrf8MKgQXOA/s496/Peek+2020-07-26+07-07.gif" imageanchor="1" style="margin-left: auto; 
margin-right: auto;"><img border="0" data-original-height="358" data-original-width="496" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbj5twmpXpi2_mXKdzvckHISlts-nmiZr8Y2PDr7Yu-vKs-u8QbLoZKGrbRHTq6ztEog3kZp8j6PSmOP6cACgH7BujLvpy70qMU1MaggYUiDMbQoN2L_7o7jtzswqO6E9xrf8MKgQXOA/d/Peek+2020-07-26+07-07.gif" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Training on my laptop.<br /></td></tr></tbody></table><div><br /></div><div><br /></div><div><br /></div>
<div>An implementation in Jax and Flax, which have been a big hit lately</div>
<div><br /></div>
<div><br /></div>
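The listing below declares epsilon, epsilon_decay, epsilon_max and epsilon_min; one common way those hyperparameters are combined is an exponentially decaying ε-greedy schedule (the exact schedule used further down in the full source may differ, this is only a sketch of the standard form):

```python
import math
import random

epsilon_max, epsilon_min, epsilon_decay = 1.0, 0.01, 0.001

def epsilon_at(step):
    # exploration rate decays exponentially from epsilon_max toward epsilon_min
    return epsilon_min + (epsilon_max - epsilon_min) * math.exp(-epsilon_decay * step)

def select_action(q_values, step):
    if random.random() < epsilon_at(step):
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

assert abs(epsilon_at(0) - 1.0) < 1e-9   # starts fully exploratory
assert epsilon_at(10_000) < 0.02         # nearly greedy after many steps
```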
<pre style="background: rgb(0, 0, 0); color: #d1d1d1;"><span style="color: #e66170; font-weight: bold;">import</span> os
<span style="color: #e66170; font-weight: bold;">import</span> random
<span style="color: #e66170; font-weight: bold;">import</span> math
<span style="color: #e66170; font-weight: bold;">import</span> gym
<span style="color: #e66170; font-weight: bold;">from</span> collections <span style="color: #e66170; font-weight: bold;">import</span> deque
<span style="color: #e66170; font-weight: bold;">import</span> flax
<span style="color: #e66170; font-weight: bold;">import</span> jax
<span style="color: #e66170; font-weight: bold;">from</span> jax <span style="color: #e66170; font-weight: bold;">import</span> numpy <span style="color: #e66170; font-weight: bold;">as</span> jnp
<span style="color: #e66170; font-weight: bold;">import</span> numpy <span style="color: #e66170; font-weight: bold;">as</span> np
debug_render <span style="color: #d2cd86;">=</span> True
debug <span style="color: #d2cd86;">=</span> False
num_episodes <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">500</span>
batch_size <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">64</span>
learning_rate <span style="color: #d2cd86;">=</span> <span style="color: #009f00;">0.001</span>
sync_steps <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">100</span>
memory_length <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">4000</span>
epsilon <span style="color: #d2cd86;">=</span> <span style="color: #009f00;">1.0</span>
epsilon_decay <span style="color: #d2cd86;">=</span> <span style="color: #009f00;">0.001</span>
epsilon_max <span style="color: #d2cd86;">=</span> <span style="color: #009f00;">1.0</span>
epsilon_min <span style="color: #d2cd86;">=</span> <span style="color: #009f00;">0.01</span>
gamma <span style="color: #d2cd86;">=</span> <span style="color: #009f00;">0.99</span> <span style="color: #9999a9;"># discount factor</span>
<span style="color: #e66170; font-weight: bold;">class</span> SumTree<span style="color: #d2cd86;">:</span>
write <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">0</span>
<span style="color: #e66170; font-weight: bold;">def</span> __init__<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> capacity<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
self<span style="color: #d2cd86;">.</span>capacity <span style="color: #d2cd86;">=</span> capacity
self<span style="color: #d2cd86;">.</span>tree <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>zeros<span style="color: #d2cd86;">(</span><span style="color: #00a800;">2</span><span style="color: #00dddd;">*</span>capacity <span style="color: #00dddd;">-</span> <span style="color: #00a800;">1</span><span style="color: #d2cd86;">)</span>
self<span style="color: #d2cd86;">.</span>data <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>zeros<span style="color: #d2cd86;">(</span>capacity<span style="color: #d2cd86;">,</span> dtype<span style="color: #d2cd86;">=</span><span style="color: #e66170; font-weight: bold;">object</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">def</span> _propagate<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> idx<span style="color: #d2cd86;">,</span> change<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
parent <span style="color: #d2cd86;">=</span> <span style="color: #d2cd86;">(</span>idx <span style="color: #00dddd;">-</span> <span style="color: #00a800;">1</span><span style="color: #d2cd86;">)</span> <span style="color: #00dddd;">//</span> <span style="color: #00a800;">2</span>
        self<span style="color: #d2cd86;">.</span>tree<span style="color: #d2cd86;">[</span>parent<span style="color: #d2cd86;">]</span> <span style="color: #00dddd;">+</span><span style="color: #d2cd86;">=</span> change
        <span style="color: #e66170; font-weight: bold;">if</span> parent <span style="color: #00dddd;">!=</span> <span style="color: #00a800;">0</span><span style="color: #d2cd86;">:</span>
            self<span style="color: #d2cd86;">.</span>_propagate<span style="color: #d2cd86;">(</span>parent<span style="color: #d2cd86;">,</span> change<span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">def</span> _retrieve<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> idx<span style="color: #d2cd86;">,</span> s<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        left <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">2</span> <span style="color: #00dddd;">*</span> idx <span style="color: #00dddd;">+</span> <span style="color: #00a800;">1</span>
        right <span style="color: #d2cd86;">=</span> left <span style="color: #00dddd;">+</span> <span style="color: #00a800;">1</span>
        <span style="color: #e66170; font-weight: bold;">if</span> left <span style="color: #00dddd;">>=</span> <span style="color: #e66170; font-weight: bold;">len</span><span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">.</span>tree<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
            <span style="color: #e66170; font-weight: bold;">return</span> idx
        <span style="color: #e66170; font-weight: bold;">if</span> s <span style="color: #00dddd;"><=</span> self<span style="color: #d2cd86;">.</span>tree<span style="color: #d2cd86;">[</span>left<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">:</span>
            <span style="color: #e66170; font-weight: bold;">return</span> self<span style="color: #d2cd86;">.</span>_retrieve<span style="color: #d2cd86;">(</span>left<span style="color: #d2cd86;">,</span> s<span style="color: #d2cd86;">)</span>
        <span style="color: #e66170; font-weight: bold;">else</span><span style="color: #d2cd86;">:</span>
            <span style="color: #e66170; font-weight: bold;">return</span> self<span style="color: #d2cd86;">.</span>_retrieve<span style="color: #d2cd86;">(</span>right<span style="color: #d2cd86;">,</span> s<span style="color: #00dddd;">-</span>self<span style="color: #d2cd86;">.</span>tree<span style="color: #d2cd86;">[</span>left<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">def</span> total<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        <span style="color: #e66170; font-weight: bold;">return</span> self<span style="color: #d2cd86;">.</span>tree<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span>
    <span style="color: #e66170; font-weight: bold;">def</span> add<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> p<span style="color: #d2cd86;">,</span> data<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        idx <span style="color: #d2cd86;">=</span> self<span style="color: #d2cd86;">.</span>write <span style="color: #00dddd;">+</span> self<span style="color: #d2cd86;">.</span>capacity <span style="color: #00dddd;">-</span> <span style="color: #00a800;">1</span>
        self<span style="color: #d2cd86;">.</span>data<span style="color: #d2cd86;">[</span>self<span style="color: #d2cd86;">.</span>write<span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> data
        self<span style="color: #d2cd86;">.</span>update<span style="color: #d2cd86;">(</span>idx<span style="color: #d2cd86;">,</span> p<span style="color: #d2cd86;">)</span>
        self<span style="color: #d2cd86;">.</span>write <span style="color: #00dddd;">+</span><span style="color: #d2cd86;">=</span> <span style="color: #00a800;">1</span>
        <span style="color: #e66170; font-weight: bold;">if</span> self<span style="color: #d2cd86;">.</span>write <span style="color: #00dddd;">>=</span> self<span style="color: #d2cd86;">.</span>capacity<span style="color: #d2cd86;">:</span>
            self<span style="color: #d2cd86;">.</span>write <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">0</span>
    <span style="color: #e66170; font-weight: bold;">def</span> update<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> idx<span style="color: #d2cd86;">,</span> p<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        change <span style="color: #d2cd86;">=</span> p <span style="color: #00dddd;">-</span> self<span style="color: #d2cd86;">.</span>tree<span style="color: #d2cd86;">[</span>idx<span style="color: #d2cd86;">]</span>
        self<span style="color: #d2cd86;">.</span>tree<span style="color: #d2cd86;">[</span>idx<span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> p
        self<span style="color: #d2cd86;">.</span>_propagate<span style="color: #d2cd86;">(</span>idx<span style="color: #d2cd86;">,</span> change<span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">def</span> get<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> s<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        idx <span style="color: #d2cd86;">=</span> self<span style="color: #d2cd86;">.</span>_retrieve<span style="color: #d2cd86;">(</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">,</span> s<span style="color: #d2cd86;">)</span>
        dataIdx <span style="color: #d2cd86;">=</span> idx <span style="color: #00dddd;">-</span> self<span style="color: #d2cd86;">.</span>capacity <span style="color: #00dddd;">+</span> <span style="color: #00a800;">1</span>
        <span style="color: #e66170; font-weight: bold;">return</span> <span style="color: #d2cd86;">(</span>idx<span style="color: #d2cd86;">,</span> self<span style="color: #d2cd86;">.</span>tree<span style="color: #d2cd86;">[</span>idx<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> self<span style="color: #d2cd86;">.</span>data<span style="color: #d2cd86;">[</span>dataIdx<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
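A short aside of mine, not part of the original listing: the sum tree above is just an O(log n) way to sample an index with probability proportional to its priority. A linear scan over the same priorities (my own hypothetical `retrieve_linear` helper) gives the identical distribution and is a handy way to sanity-check `SumTree._retrieve`.

```python
import random

def retrieve_linear(priorities, s):
    # walk the priority list exactly like SumTree._retrieve walks the leaves:
    # stop at the first item whose cumulative mass covers s
    for i, p in enumerate(priorities):
        if s <= p:
            return i
        s -= p
    return len(priorities) - 1

priorities = [1.0, 2.0, 3.0, 4.0]
total = sum(priorities)
counts = [0] * len(priorities)
random.seed(0)
for _ in range(10000):
    counts[retrieve_linear(priorities, random.uniform(0, total))] += 1
# higher-priority items end up sampled proportionally more often
```

Splitting `[0, total)` into equal segments and drawing one `s` per segment, as `PERMemory.sample` below does, is a stratified version of the same idea.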
<span style="color: #e66170; font-weight: bold;">class</span> PERMemory<span style="color: #d2cd86;">:</span>
    e <span style="color: #d2cd86;">=</span> <span style="color: #009f00;">0.01</span>
    a <span style="color: #d2cd86;">=</span> <span style="color: #009f00;">0.6</span>
    <span style="color: #e66170; font-weight: bold;">def</span> __init__<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> capacity<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        self<span style="color: #d2cd86;">.</span>tree <span style="color: #d2cd86;">=</span> SumTree<span style="color: #d2cd86;">(</span>capacity<span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">def</span> _get_priority<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> error<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        <span style="color: #e66170; font-weight: bold;">return</span> <span style="color: #d2cd86;">(</span>error<span style="color: #00dddd;">+</span>self<span style="color: #d2cd86;">.</span>e<span style="color: #d2cd86;">)</span><span style="color: #00dddd;">**</span>self<span style="color: #d2cd86;">.</span>a
    <span style="color: #e66170; font-weight: bold;">def</span> add<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> error<span style="color: #d2cd86;">,</span> sample<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        p <span style="color: #d2cd86;">=</span> self<span style="color: #d2cd86;">.</span>_get_priority<span style="color: #d2cd86;">(</span>error<span style="color: #d2cd86;">)</span>
        self<span style="color: #d2cd86;">.</span>tree<span style="color: #d2cd86;">.</span>add<span style="color: #d2cd86;">(</span>p<span style="color: #d2cd86;">,</span> sample<span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">def</span> sample<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> n<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        batch <span style="color: #d2cd86;">=</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span>
        segment <span style="color: #d2cd86;">=</span> self<span style="color: #d2cd86;">.</span>tree<span style="color: #d2cd86;">.</span>total<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span><span style="color: #00dddd;">/</span>n
        <span style="color: #e66170; font-weight: bold;">for</span> i <span style="color: #e66170; font-weight: bold;">in</span> <span style="color: #e66170; font-weight: bold;">range</span><span style="color: #d2cd86;">(</span>n<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
            a <span style="color: #d2cd86;">=</span> segment<span style="color: #00dddd;">*</span>i
            b <span style="color: #d2cd86;">=</span> segment<span style="color: #00dddd;">*</span><span style="color: #d2cd86;">(</span>i<span style="color: #00dddd;">+</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">)</span>
            s <span style="color: #d2cd86;">=</span> random<span style="color: #d2cd86;">.</span>uniform<span style="color: #d2cd86;">(</span>a<span style="color: #d2cd86;">,</span> b<span style="color: #d2cd86;">)</span>
            <span style="color: #d2cd86;">(</span>idx<span style="color: #d2cd86;">,</span> p<span style="color: #d2cd86;">,</span> data<span style="color: #d2cd86;">)</span> <span style="color: #d2cd86;">=</span> self<span style="color: #d2cd86;">.</span>tree<span style="color: #d2cd86;">.</span>get<span style="color: #d2cd86;">(</span>s<span style="color: #d2cd86;">)</span>
            batch<span style="color: #d2cd86;">.</span>append<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">(</span>idx<span style="color: #d2cd86;">,</span> data<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
        <span style="color: #e66170; font-weight: bold;">return</span> batch
    <span style="color: #e66170; font-weight: bold;">def</span> update<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> idx<span style="color: #d2cd86;">,</span> error<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        p <span style="color: #d2cd86;">=</span> self<span style="color: #d2cd86;">.</span>_get_priority<span style="color: #d2cd86;">(</span>error<span style="color: #d2cd86;">)</span>
        self<span style="color: #d2cd86;">.</span>tree<span style="color: #d2cd86;">.</span>update<span style="color: #d2cd86;">(</span>idx<span style="color: #d2cd86;">,</span> p<span style="color: #d2cd86;">)</span>
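Another brief aside of mine: the priority mapping `(error + e) ** a` in `_get_priority` does two things — the small constant `e` keeps transitions with zero TD error sampleable, and the exponent `a < 1` compresses the gap between large and small errors so a few outliers do not dominate the replay buffer. A standalone sketch with the same constants (the `get_priority` function name here is mine):

```python
e, a = 0.01, 0.6  # same constants as PERMemory above

def get_priority(error):
    # e > 0: zero-error transitions still get nonzero priority
    # a < 1: large errors are compressed, softening the sampling skew
    return (error + e) ** a

ratio = get_priority(10.0) / get_priority(1.0)
# a tenfold TD-error gap shrinks to roughly a fourfold priority gap
```

With `a = 0` every transition would be sampled uniformly; `a = 1` gives fully proportional prioritization.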
<span style="color: #e66170; font-weight: bold;">class</span> DeepQNetwork<span style="color: #d2cd86;">(</span>flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Module<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
    <span style="color: #e66170; font-weight: bold;">def</span> apply<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> x<span style="color: #d2cd86;">,</span> n_actions<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        dense_layer_1 <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Dense<span style="color: #d2cd86;">(</span>x<span style="color: #d2cd86;">,</span> <span style="color: #00a800;">64</span><span style="color: #d2cd86;">)</span>
        activation_layer_1 <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>relu<span style="color: #d2cd86;">(</span>dense_layer_1<span style="color: #d2cd86;">)</span>
        dense_layer_2 <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Dense<span style="color: #d2cd86;">(</span>activation_layer_1<span style="color: #d2cd86;">,</span> <span style="color: #00a800;">32</span><span style="color: #d2cd86;">)</span>
        activation_layer_2 <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>relu<span style="color: #d2cd86;">(</span>dense_layer_2<span style="color: #d2cd86;">)</span>
        output_layer <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Dense<span style="color: #d2cd86;">(</span>activation_layer_2<span style="color: #d2cd86;">,</span> n_actions<span style="color: #d2cd86;">)</span>
        <span style="color: #e66170; font-weight: bold;">return</span> output_layer
env <span style="color: #d2cd86;">=</span> gym<span style="color: #d2cd86;">.</span>make<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">'CartPole-v0'</span><span style="color: #d2cd86;">)</span>
state <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>reset<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
n_actions <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>action_space<span style="color: #d2cd86;">.</span>n
dqn_module <span style="color: #d2cd86;">=</span> DeepQNetwork<span style="color: #d2cd86;">.</span>partial<span style="color: #d2cd86;">(</span>n_actions<span style="color: #d2cd86;">=</span>n_actions<span style="color: #d2cd86;">)</span>
_<span style="color: #d2cd86;">,</span> params <span style="color: #d2cd86;">=</span> dqn_module<span style="color: #d2cd86;">.</span>init_by_shape<span style="color: #d2cd86;">(</span>jax<span style="color: #d2cd86;">.</span>random<span style="color: #d2cd86;">.</span>PRNGKey<span style="color: #d2cd86;">(</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">[</span>state<span style="color: #d2cd86;">.</span>shape<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
q_network <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Model<span style="color: #d2cd86;">(</span>dqn_module<span style="color: #d2cd86;">,</span> params<span style="color: #d2cd86;">)</span>
target_q_network <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>nn<span style="color: #d2cd86;">.</span>Model<span style="color: #d2cd86;">(</span>dqn_module<span style="color: #d2cd86;">,</span> params<span style="color: #d2cd86;">)</span>
optimizer <span style="color: #d2cd86;">=</span> flax<span style="color: #d2cd86;">.</span>optim<span style="color: #d2cd86;">.</span>Adam<span style="color: #d2cd86;">(</span>learning_rate<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">.</span>create<span style="color: #d2cd86;">(</span>q_network<span style="color: #d2cd86;">)</span>
per_memory <span style="color: #d2cd86;">=</span> PERMemory<span style="color: #d2cd86;">(</span>memory_length<span style="color: #d2cd86;">)</span>
<span style="color: #d2cd86;">@</span>jax<span style="color: #d2cd86;">.</span>jit
<span style="color: #e66170; font-weight: bold;">def</span> policy<span style="color: #d2cd86;">(</span>model<span style="color: #d2cd86;">,</span> x<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
    predicted_q_values <span style="color: #d2cd86;">=</span> model<span style="color: #d2cd86;">(</span>x<span style="color: #d2cd86;">)</span>
    max_q_action <span style="color: #d2cd86;">=</span> jnp<span style="color: #d2cd86;">.</span>argmax<span style="color: #d2cd86;">(</span>predicted_q_values<span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">return</span> max_q_action<span style="color: #d2cd86;">,</span> predicted_q_values
<span style="color: #d2cd86;">@</span>jax<span style="color: #d2cd86;">.</span>vmap
<span style="color: #e66170; font-weight: bold;">def</span> calculate_td_error<span style="color: #d2cd86;">(</span>q_value_vec<span style="color: #d2cd86;">,</span> target_q_value_vec<span style="color: #d2cd86;">,</span> action<span style="color: #d2cd86;">,</span> reward<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
    td_target <span style="color: #d2cd86;">=</span> reward <span style="color: #00dddd;">+</span> gamma<span style="color: #00dddd;">*</span>jnp<span style="color: #d2cd86;">.</span>amax<span style="color: #d2cd86;">(</span>target_q_value_vec<span style="color: #d2cd86;">)</span>
    td_error <span style="color: #d2cd86;">=</span> td_target <span style="color: #00dddd;">-</span> q_value_vec<span style="color: #d2cd86;">[</span>action<span style="color: #d2cd86;">]</span>
    <span style="color: #e66170; font-weight: bold;">return</span> jnp<span style="color: #d2cd86;">.</span><span style="color: #e66170; font-weight: bold;">abs</span><span style="color: #d2cd86;">(</span>td_error<span style="color: #d2cd86;">)</span>
<span style="color: #d2cd86;">@</span>jax<span style="color: #d2cd86;">.</span>jit
<span style="color: #e66170; font-weight: bold;">def</span> td_error<span style="color: #d2cd86;">(</span>model<span style="color: #d2cd86;">,</span> target_model<span style="color: #d2cd86;">,</span> batch<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
    <span style="color: #9999a9;"># batch[0] - states</span>
    <span style="color: #9999a9;"># batch[1] - actions</span>
    <span style="color: #9999a9;"># batch[2] - rewards</span>
    <span style="color: #9999a9;"># batch[3] - next_states</span>
    predicted_q_values <span style="color: #d2cd86;">=</span> model<span style="color: #d2cd86;">(</span>batch<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
    target_q_values <span style="color: #d2cd86;">=</span> target_model<span style="color: #d2cd86;">(</span>batch<span style="color: #d2cd86;">[</span><span style="color: #00a800;">3</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">return</span> calculate_td_error<span style="color: #d2cd86;">(</span>predicted_q_values<span style="color: #d2cd86;">,</span> target_q_values<span style="color: #d2cd86;">,</span> batch<span style="color: #d2cd86;">[</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> batch<span style="color: #d2cd86;">[</span><span style="color: #00a800;">2</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
<span style="color: #d2cd86;">@</span>jax<span style="color: #d2cd86;">.</span>vmap
<span style="color: #e66170; font-weight: bold;">def</span> q_learning_loss<span style="color: #d2cd86;">(</span>q_value_vec<span style="color: #d2cd86;">,</span> target_q_value_vec<span style="color: #d2cd86;">,</span> action<span style="color: #d2cd86;">,</span> reward<span style="color: #d2cd86;">,</span> done<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
    td_target <span style="color: #d2cd86;">=</span> reward <span style="color: #00dddd;">+</span> gamma<span style="color: #00dddd;">*</span>jnp<span style="color: #d2cd86;">.</span>amax<span style="color: #d2cd86;">(</span>target_q_value_vec<span style="color: #d2cd86;">)</span><span style="color: #00dddd;">*</span><span style="color: #d2cd86;">(</span><span style="color: #009f00;">1.</span><span style="color: #00dddd;">-</span>done<span style="color: #d2cd86;">)</span>
    td_error <span style="color: #d2cd86;">=</span> jax<span style="color: #d2cd86;">.</span>lax<span style="color: #d2cd86;">.</span>stop_gradient<span style="color: #d2cd86;">(</span>td_target<span style="color: #d2cd86;">)</span> <span style="color: #00dddd;">-</span> q_value_vec<span style="color: #d2cd86;">[</span>action<span style="color: #d2cd86;">]</span>
    <span style="color: #e66170; font-weight: bold;">return</span> jnp<span style="color: #d2cd86;">.</span>square<span style="color: #d2cd86;">(</span>td_error<span style="color: #d2cd86;">)</span>
<span style="color: #d2cd86;">@</span>jax<span style="color: #d2cd86;">.</span>jit
<span style="color: #e66170; font-weight: bold;">def</span> train_step<span style="color: #d2cd86;">(</span>optimizer<span style="color: #d2cd86;">,</span> target_model<span style="color: #d2cd86;">,</span> batch<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
    <span style="color: #9999a9;"># batch[0] - states</span>
    <span style="color: #9999a9;"># batch[1] - actions</span>
    <span style="color: #9999a9;"># batch[2] - rewards</span>
    <span style="color: #9999a9;"># batch[3] - next_states</span>
    <span style="color: #9999a9;"># batch[4] - dones</span>
    <span style="color: #e66170; font-weight: bold;">def</span> loss_fn<span style="color: #d2cd86;">(</span>model<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        predicted_q_values <span style="color: #d2cd86;">=</span> model<span style="color: #d2cd86;">(</span>batch<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
        target_q_values <span style="color: #d2cd86;">=</span> target_model<span style="color: #d2cd86;">(</span>batch<span style="color: #d2cd86;">[</span><span style="color: #00a800;">3</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
        <span style="color: #e66170; font-weight: bold;">return</span> jnp<span style="color: #d2cd86;">.</span>mean<span style="color: #d2cd86;">(</span>
            q_learning_loss<span style="color: #d2cd86;">(</span>
                predicted_q_values<span style="color: #d2cd86;">,</span>
                target_q_values<span style="color: #d2cd86;">,</span>
                batch<span style="color: #d2cd86;">[</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span>
                batch<span style="color: #d2cd86;">[</span><span style="color: #00a800;">2</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span>
                batch<span style="color: #d2cd86;">[</span><span style="color: #00a800;">4</span><span style="color: #d2cd86;">]</span>
            <span style="color: #d2cd86;">)</span>
        <span style="color: #d2cd86;">)</span>
    loss<span style="color: #d2cd86;">,</span> gradients <span style="color: #d2cd86;">=</span> jax<span style="color: #d2cd86;">.</span>value_and_grad<span style="color: #d2cd86;">(</span>loss_fn<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">(</span>optimizer<span style="color: #d2cd86;">.</span>target<span style="color: #d2cd86;">)</span>
    optimizer <span style="color: #d2cd86;">=</span> optimizer<span style="color: #d2cd86;">.</span>apply_gradient<span style="color: #d2cd86;">(</span>gradients<span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">return</span> optimizer<span style="color: #d2cd86;">,</span> loss<span style="color: #d2cd86;">,</span> td_error<span style="color: #d2cd86;">(</span>optimizer<span style="color: #d2cd86;">.</span>target<span style="color: #d2cd86;">,</span> target_model<span style="color: #d2cd86;">,</span> batch<span style="color: #d2cd86;">)</span>
global_steps <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">0</span>
<span style="color: #e66170; font-weight: bold;">try</span><span style="color: #d2cd86;">:</span>
    <span style="color: #e66170; font-weight: bold;">for</span> episode <span style="color: #e66170; font-weight: bold;">in</span> <span style="color: #e66170; font-weight: bold;">range</span><span style="color: #d2cd86;">(</span>num_episodes<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        episode_rewards <span style="color: #d2cd86;">=</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span>
        state <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>reset<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
        <span style="color: #e66170; font-weight: bold;">while</span> True<span style="color: #d2cd86;">:</span>
            global_steps <span style="color: #d2cd86;">=</span> global_steps<span style="color: #00dddd;">+</span><span style="color: #00a800;">1</span>
            <span style="color: #e66170; font-weight: bold;">if</span> np<span style="color: #d2cd86;">.</span>random<span style="color: #d2cd86;">.</span>rand<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span> <span style="color: #00dddd;"><=</span> epsilon<span style="color: #d2cd86;">:</span>
                action <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>action_space<span style="color: #d2cd86;">.</span>sample<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
            <span style="color: #e66170; font-weight: bold;">else</span><span style="color: #d2cd86;">:</span>
                action<span style="color: #d2cd86;">,</span> q_values <span style="color: #d2cd86;">=</span> policy<span style="color: #d2cd86;">(</span>optimizer<span style="color: #d2cd86;">.</span>target<span style="color: #d2cd86;">,</span> state<span style="color: #d2cd86;">)</span>
                <span style="color: #e66170; font-weight: bold;">if</span> debug<span style="color: #d2cd86;">:</span>
                    <span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"q values :"</span> <span style="color: #d2cd86;">,</span> q_values<span style="color: #d2cd86;">)</span>
                    <span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"chosen action :"</span><span style="color: #d2cd86;">,</span> action <span style="color: #d2cd86;">)</span>
            <span style="color: #e66170; font-weight: bold;">if</span> epsilon<span style="color: #00dddd;">></span>epsilon_min<span style="color: #d2cd86;">:</span>
                epsilon <span style="color: #d2cd86;">=</span> epsilon_min<span style="color: #00dddd;">+</span><span style="color: #d2cd86;">(</span>epsilon_max<span style="color: #00dddd;">-</span>epsilon_min<span style="color: #d2cd86;">)</span><span style="color: #00dddd;">*</span>math<span style="color: #d2cd86;">.</span>exp<span style="color: #d2cd86;">(</span><span style="color: #00dddd;">-</span>epsilon_decay<span style="color: #00dddd;">*</span>global_steps<span style="color: #d2cd86;">)</span>
                <span style="color: #e66170; font-weight: bold;">if</span> debug<span style="color: #d2cd86;">:</span>
                    <span style="color: #9999a9;">#print("epsilon :", epsilon)</span>
                    <span style="color: #e66170; font-weight: bold;">pass</span>
            new_state<span style="color: #d2cd86;">,</span> reward<span style="color: #d2cd86;">,</span> done<span style="color: #d2cd86;">,</span> _ <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>step<span style="color: #d2cd86;">(</span><span style="color: #e66170; font-weight: bold;">int</span><span style="color: #d2cd86;">(</span>action<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
            <span style="color: #9999a9;"># compute the temporal difference error before adding the sample to memory</span>
            temporal_difference <span style="color: #d2cd86;">=</span> <span style="color: #e66170; font-weight: bold;">float</span><span style="color: #d2cd86;">(</span>td_error<span style="color: #d2cd86;">(</span>optimizer<span style="color: #d2cd86;">.</span>target<span style="color: #d2cd86;">,</span> target_q_network<span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">(</span>
                jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">[</span>state<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span>
                jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">[</span>action<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span>
                jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">[</span>reward<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span>
                jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">[</span>new_state<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
            <span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
            per_memory<span style="color: #d2cd86;">.</span>add<span style="color: #d2cd86;">(</span>temporal_difference<span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">(</span>state<span style="color: #d2cd86;">,</span> action<span style="color: #d2cd86;">,</span> reward<span style="color: #d2cd86;">,</span> new_state<span style="color: #d2cd86;">,</span> <span style="color: #e66170; font-weight: bold;">int</span><span style="color: #d2cd86;">(</span>done<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
            <span style="color: #9999a9;"># sample a batch from the Prioritized Experience Replay memory and train the DQN</span>
            batch <span style="color: #d2cd86;">=</span> per_memory<span style="color: #d2cd86;">.</span>sample<span style="color: #d2cd86;">(</span>batch_size<span style="color: #d2cd86;">)</span>
            states<span style="color: #d2cd86;">,</span> actions<span style="color: #d2cd86;">,</span> rewards<span style="color: #d2cd86;">,</span> next_states<span style="color: #d2cd86;">,</span> dones <span style="color: #d2cd86;">=</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span>
            <span style="color: #e66170; font-weight: bold;">for</span> i <span style="color: #e66170; font-weight: bold;">in</span> <span style="color: #e66170; font-weight: bold;">range</span><span style="color: #d2cd86;">(</span>batch_size<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
                states<span style="color: #d2cd86;">.</span>append <span style="color: #d2cd86;">(</span>batch<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
                actions<span style="color: #d2cd86;">.</span>append <span style="color: #d2cd86;">(</span>batch<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
                rewards<span style="color: #d2cd86;">.</span>append <span style="color: #d2cd86;">(</span>batch<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">2</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
                next_states<span style="color: #d2cd86;">.</span>append<span style="color: #d2cd86;">(</span>batch<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">3</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
                dones<span style="color: #d2cd86;">.</span>append <span style="color: #d2cd86;">(</span>batch<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">4</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
            optimizer<span style="color: #d2cd86;">,</span> loss<span style="color: #d2cd86;">,</span> new_td_errors <span style="color: #d2cd86;">=</span> train_step<span style="color: #d2cd86;">(</span>
                optimizer<span style="color: #d2cd86;">,</span>
                target_q_network<span style="color: #d2cd86;">,</span>
                <span style="color: #d2cd86;">(</span> <span style="color: #9999a9;"># copy the sampled batch data to accelerator device memory</span>
                    jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span>states<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span>
                    jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span>actions<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span>
                    jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span>rewards<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span>
                    jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span>next_states<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span>
                    jnp<span style="color: #d2cd86;">.</span>asarray<span style="color: #d2cd86;">(</span>dones<span style="color: #d2cd86;">)</span>
                <span style="color: #d2cd86;">)</span>
            <span style="color: #d2cd86;">)</span>
            <span style="color: #9999a9;"># update the memory priorities with the temporal difference errors from this batch</span>
            new_td_errors <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>array<span style="color: #d2cd86;">(</span>new_td_errors<span style="color: #d2cd86;">)</span>
            <span style="color: #e66170; font-weight: bold;">for</span> i <span style="color: #e66170; font-weight: bold;">in</span> <span style="color: #e66170; font-weight: bold;">range</span><span style="color: #d2cd86;">(</span>batch_size<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
                idx <span style="color: #d2cd86;">=</span> batch<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span>
                per_memory<span style="color: #d2cd86;">.</span>update<span style="color: #d2cd86;">(</span>idx<span style="color: #d2cd86;">,</span> new_td_errors<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
            episode_rewards<span style="color: #d2cd86;">.</span>append<span style="color: #d2cd86;">(</span>reward<span style="color: #d2cd86;">)</span>
            state <span style="color: #d2cd86;">=</span> new_state
            <span style="color: #9999a9;"># every sync_steps steps, copy the improved weights into the target neural network</span>
            <span style="color: #e66170; font-weight: bold;">if</span> global_steps<span style="color: #00dddd;">%</span>sync_steps<span style="color: #00dddd;">==</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">:</span>
target_q_network <span style="color: #d2cd86;">=</span> target_q_network<span style="color: #d2cd86;">.</span>replace<span style="color: #d2cd86;">(</span>params<span style="color: #d2cd86;">=</span>optimizer<span style="color: #d2cd86;">.</span>target<span style="color: #d2cd86;">.</span>params<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">if</span> debug<span style="color: #d2cd86;">:</span>
<span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"сайжруулсан жингүүдийг target неорон сүлжээрүү хууллаа"</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">if</span> debug_render<span style="color: #d2cd86;">:</span>
env<span style="color: #d2cd86;">.</span>render<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">if</span> done<span style="color: #d2cd86;">:</span>
<span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"{} - нийт reward : {}"</span><span style="color: #d2cd86;">.</span>format<span style="color: #d2cd86;">(</span>episode<span style="color: #d2cd86;">,</span> <span style="color: #e66170; font-weight: bold;">sum</span><span style="color: #d2cd86;">(</span>episode_rewards<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">break</span>
<span style="color: #e66170; font-weight: bold;">finally</span><span style="color: #d2cd86;">:</span>
env<span style="color: #d2cd86;">.</span>close<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
</pre>
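<div>The loop above ends by writing the batch's fresh TD errors back into per_memory, which is backed by a sum tree so that both proportional sampling and priority updates cost O(log n). As a quick illustration of that data structure outside the training code, here is a minimal iterative sketch (the toy priorities and names are illustrative; the listings use an equivalent recursive version):</div>

```python
import numpy as np

class SumTree:
    """Leaves hold per-sample priorities; every parent stores the sum of
    its children, so the root is the total priority mass and drawing a
    uniform s in [0, total) selects leaf i with probability p_i / total."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree     = np.zeros(2 * capacity - 1)
        self.data     = [None] * capacity
        self.write    = 0
    def update(self, idx, p):
        change = p - self.tree[idx]
        self.tree[idx] = p
        while idx != 0:                       # propagate the change up to the root
            idx = (idx - 1) // 2
            self.tree[idx] += change
    def add(self, p, sample):
        idx = self.write + self.capacity - 1  # leaves start at capacity-1
        self.data[self.write] = sample
        self.update(idx, p)
        self.write = (self.write + 1) % self.capacity
    def get(self, s):
        idx = 0
        while 2 * idx + 1 < len(self.tree):   # descend until a leaf
            left = 2 * idx + 1
            if s <= self.tree[left]:
                idx = left
            else:
                s  -= self.tree[left]
                idx = left + 1
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]

tree = SumTree(4)
for priority, name in [(1.0, "a"), (3.0, "b"), (2.0, "c")]:
    tree.add(priority, name)
print(tree.tree[0])      # total priority mass: 6.0
print(tree.get(0.5)[2])  # s=0.5 lands in "a"'s segment [0, 1]
print(tree.get(3.5)[2])  # s=3.5 lands in "b"'s segment (1, 4]
```

<div>Drawing one sample per segment with random.uniform(segment*i, segment*(i+1)), as PERMemory.sample does in the listings, stratifies the draws so every batch covers the whole priority range.</div>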
<!--Created using ToHtml.com on 2020-07-25 17:47:11 UTC-->
<div><br /></div>
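<div>Both implementations anneal exploration with the same closed-form schedule, epsilon(t) = epsilon_min + (epsilon_max - epsilon_min) * exp(-epsilon_decay * t), rather than multiplying epsilon by a decay factor each step. A small sketch of how quickly it falls off, plugging in the constants from the TensorFlow listing below:</div>

```python
import math

eps_min, eps_max, decay = 0.01, 1.0, 0.001   # constants from the listing

def epsilon_at(step):
    # closed-form annealing: starts at eps_max, decays toward eps_min
    return eps_min + (eps_max - eps_min) * math.exp(-decay * step)

for step in (0, 1000, 5000):
    print(step, round(epsilon_at(step), 3))  # 1.0, 0.374, 0.017
```

<div>In the training loop this is evaluated against global_steps and only while epsilon is still above epsilon_min, so exploration never drops below the floor.</div>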
<div><br /></div>
<div>TensorFlow 2 implementation</div>
<div><br /></div>
<div><br /></div>
<pre style="background: rgb(0, 0, 0); color: #d1d1d1;"><span style="color: #e66170; font-weight: bold;">import</span> os
<span style="color: #e66170; font-weight: bold;">import</span> random
<span style="color: #e66170; font-weight: bold;">import</span> math
<span style="color: #e66170; font-weight: bold;">from</span> time <span style="color: #e66170; font-weight: bold;">import</span> sleep
<span style="color: #e66170; font-weight: bold;">from</span> collections <span style="color: #e66170; font-weight: bold;">import</span> deque
<span style="color: #e66170; font-weight: bold;">import</span> numpy <span style="color: #e66170; font-weight: bold;">as</span> np
<span style="color: #e66170; font-weight: bold;">import</span> gym
<span style="color: #e66170; font-weight: bold;">import</span> cv2
<span style="color: #e66170; font-weight: bold;">import</span> rocket_lander_gym
<span style="color: #e66170; font-weight: bold;">import</span> tkinter
<span style="color: #e66170; font-weight: bold;">import</span> matplotlib
matplotlib<span style="color: #d2cd86;">.</span>use<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">'TkAgg'</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">import</span> matplotlib<span style="color: #d2cd86;">.</span>pyplot <span style="color: #e66170; font-weight: bold;">as</span> plt
<span style="color: #e66170; font-weight: bold;">import</span> tensorflow <span style="color: #e66170; font-weight: bold;">as</span> tf
tf<span style="color: #d2cd86;">.</span>get_logger<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">.</span>setLevel<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">'ERROR'</span><span style="color: #d2cd86;">)</span>
debug_render <span style="color: #d2cd86;">=</span> False
num_episodes <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">100000</span>
train_start_count <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">1000</span> <span style="color: #9999a9;"># how many samples to collect before training can start</span>
train_per_step <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">1</span> <span style="color: #9999a9;"># train once every this many steps</span>
save_per_step <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">2500</span> <span style="color: #9999a9;"># save the trained model every this many steps</span>
training_happened <span style="color: #d2cd86;">=</span> False
sync_per_step <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">100</span> <span style="color: #9999a9;"># refresh the target_q network every this many steps</span>
train_count <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">1</span> <span style="color: #9999a9;"># how many training iterations to run each time</span>
batch_size <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">64</span>
gamma <span style="color: #d2cd86;">=</span> <span style="color: #009f00;">0.99</span> <span style="color: #9999a9;"># discount factor</span>
<span style="color: #9999a9;"># exploration vs exploitation</span>
epsilon <span style="color: #d2cd86;">=</span> <span style="color: #009f00;">1.0</span>
epsilon_decay <span style="color: #d2cd86;">=</span> <span style="color: #009f00;">0.001</span>
epsilon_max <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">1</span>
epsilon_min <span style="color: #d2cd86;">=</span> <span style="color: #009f00;">0.01</span>
<span style="color: #9999a9;"># replay memory</span>
memory_length <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">100000</span>
<span style="color: #e66170; font-weight: bold;">class</span> SumTree<span style="color: #d2cd86;">:</span>
    write <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">0</span>
    <span style="color: #e66170; font-weight: bold;">def</span> __init__<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> capacity<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        self<span style="color: #d2cd86;">.</span>capacity <span style="color: #d2cd86;">=</span> capacity
        self<span style="color: #d2cd86;">.</span>tree <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>zeros<span style="color: #d2cd86;">(</span><span style="color: #00a800;">2</span><span style="color: #00dddd;">*</span>capacity <span style="color: #00dddd;">-</span> <span style="color: #00a800;">1</span><span style="color: #d2cd86;">)</span>
        self<span style="color: #d2cd86;">.</span>data <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>zeros<span style="color: #d2cd86;">(</span>capacity<span style="color: #d2cd86;">,</span> dtype<span style="color: #d2cd86;">=</span><span style="color: #e66170; font-weight: bold;">object</span><span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">def</span> _propagate<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> idx<span style="color: #d2cd86;">,</span> change<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        parent <span style="color: #d2cd86;">=</span> <span style="color: #d2cd86;">(</span>idx <span style="color: #00dddd;">-</span> <span style="color: #00a800;">1</span><span style="color: #d2cd86;">)</span> <span style="color: #00dddd;">//</span> <span style="color: #00a800;">2</span>
        self<span style="color: #d2cd86;">.</span>tree<span style="color: #d2cd86;">[</span>parent<span style="color: #d2cd86;">]</span> <span style="color: #00dddd;">+</span><span style="color: #d2cd86;">=</span> change
        <span style="color: #e66170; font-weight: bold;">if</span> parent <span style="color: #00dddd;">!=</span> <span style="color: #00a800;">0</span><span style="color: #d2cd86;">:</span>
            self<span style="color: #d2cd86;">.</span>_propagate<span style="color: #d2cd86;">(</span>parent<span style="color: #d2cd86;">,</span> change<span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">def</span> _retrieve<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> idx<span style="color: #d2cd86;">,</span> s<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        left <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">2</span> <span style="color: #00dddd;">*</span> idx <span style="color: #00dddd;">+</span> <span style="color: #00a800;">1</span>
        right <span style="color: #d2cd86;">=</span> left <span style="color: #00dddd;">+</span> <span style="color: #00a800;">1</span>
        <span style="color: #e66170; font-weight: bold;">if</span> left <span style="color: #00dddd;">>=</span> <span style="color: #e66170; font-weight: bold;">len</span><span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">.</span>tree<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
            <span style="color: #e66170; font-weight: bold;">return</span> idx
        <span style="color: #e66170; font-weight: bold;">if</span> s <span style="color: #00dddd;"><=</span> self<span style="color: #d2cd86;">.</span>tree<span style="color: #d2cd86;">[</span>left<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">:</span>
            <span style="color: #e66170; font-weight: bold;">return</span> self<span style="color: #d2cd86;">.</span>_retrieve<span style="color: #d2cd86;">(</span>left<span style="color: #d2cd86;">,</span> s<span style="color: #d2cd86;">)</span>
        <span style="color: #e66170; font-weight: bold;">else</span><span style="color: #d2cd86;">:</span>
            <span style="color: #e66170; font-weight: bold;">return</span> self<span style="color: #d2cd86;">.</span>_retrieve<span style="color: #d2cd86;">(</span>right<span style="color: #d2cd86;">,</span> s<span style="color: #00dddd;">-</span>self<span style="color: #d2cd86;">.</span>tree<span style="color: #d2cd86;">[</span>left<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">def</span> total<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        <span style="color: #e66170; font-weight: bold;">return</span> self<span style="color: #d2cd86;">.</span>tree<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span>
    <span style="color: #e66170; font-weight: bold;">def</span> add<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> p<span style="color: #d2cd86;">,</span> data<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        idx <span style="color: #d2cd86;">=</span> self<span style="color: #d2cd86;">.</span>write <span style="color: #00dddd;">+</span> self<span style="color: #d2cd86;">.</span>capacity <span style="color: #00dddd;">-</span> <span style="color: #00a800;">1</span>
        self<span style="color: #d2cd86;">.</span>data<span style="color: #d2cd86;">[</span>self<span style="color: #d2cd86;">.</span>write<span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> data
        self<span style="color: #d2cd86;">.</span>update<span style="color: #d2cd86;">(</span>idx<span style="color: #d2cd86;">,</span> p<span style="color: #d2cd86;">)</span>
        self<span style="color: #d2cd86;">.</span>write <span style="color: #00dddd;">+</span><span style="color: #d2cd86;">=</span> <span style="color: #00a800;">1</span>
        <span style="color: #e66170; font-weight: bold;">if</span> self<span style="color: #d2cd86;">.</span>write <span style="color: #00dddd;">>=</span> self<span style="color: #d2cd86;">.</span>capacity<span style="color: #d2cd86;">:</span>
            self<span style="color: #d2cd86;">.</span>write <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">0</span>
    <span style="color: #e66170; font-weight: bold;">def</span> update<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> idx<span style="color: #d2cd86;">,</span> p<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        change <span style="color: #d2cd86;">=</span> p <span style="color: #00dddd;">-</span> self<span style="color: #d2cd86;">.</span>tree<span style="color: #d2cd86;">[</span>idx<span style="color: #d2cd86;">]</span>
        self<span style="color: #d2cd86;">.</span>tree<span style="color: #d2cd86;">[</span>idx<span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> p
        self<span style="color: #d2cd86;">.</span>_propagate<span style="color: #d2cd86;">(</span>idx<span style="color: #d2cd86;">,</span> change<span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">def</span> get<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> s<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        idx <span style="color: #d2cd86;">=</span> self<span style="color: #d2cd86;">.</span>_retrieve<span style="color: #d2cd86;">(</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">,</span> s<span style="color: #d2cd86;">)</span>
        dataIdx <span style="color: #d2cd86;">=</span> idx <span style="color: #00dddd;">-</span> self<span style="color: #d2cd86;">.</span>capacity <span style="color: #00dddd;">+</span> <span style="color: #00a800;">1</span>
        <span style="color: #e66170; font-weight: bold;">return</span> <span style="color: #d2cd86;">(</span>idx<span style="color: #d2cd86;">,</span> self<span style="color: #d2cd86;">.</span>tree<span style="color: #d2cd86;">[</span>idx<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> self<span style="color: #d2cd86;">.</span>data<span style="color: #d2cd86;">[</span>dataIdx<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">class</span> PERMemory<span style="color: #d2cd86;">:</span>
    e <span style="color: #d2cd86;">=</span> <span style="color: #009f00;">0.01</span>
    a <span style="color: #d2cd86;">=</span> <span style="color: #009f00;">0.6</span>
    <span style="color: #e66170; font-weight: bold;">def</span> __init__<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> capacity<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        self<span style="color: #d2cd86;">.</span>tree <span style="color: #d2cd86;">=</span> SumTree<span style="color: #d2cd86;">(</span>capacity<span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">def</span> _get_priority<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> error<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        <span style="color: #e66170; font-weight: bold;">return</span> <span style="color: #d2cd86;">(</span>error<span style="color: #00dddd;">+</span>self<span style="color: #d2cd86;">.</span>e<span style="color: #d2cd86;">)</span><span style="color: #00dddd;">**</span>self<span style="color: #d2cd86;">.</span>a
    <span style="color: #e66170; font-weight: bold;">def</span> add<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> error<span style="color: #d2cd86;">,</span> sample<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        p <span style="color: #d2cd86;">=</span> self<span style="color: #d2cd86;">.</span>_get_priority<span style="color: #d2cd86;">(</span>error<span style="color: #d2cd86;">)</span>
        self<span style="color: #d2cd86;">.</span>tree<span style="color: #d2cd86;">.</span>add<span style="color: #d2cd86;">(</span>p<span style="color: #d2cd86;">,</span> sample<span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">def</span> sample<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> n<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        batch <span style="color: #d2cd86;">=</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span>
        segment <span style="color: #d2cd86;">=</span> self<span style="color: #d2cd86;">.</span>tree<span style="color: #d2cd86;">.</span>total<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span><span style="color: #00dddd;">/</span>n
        <span style="color: #e66170; font-weight: bold;">for</span> i <span style="color: #e66170; font-weight: bold;">in</span> <span style="color: #e66170; font-weight: bold;">range</span><span style="color: #d2cd86;">(</span>n<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
            a <span style="color: #d2cd86;">=</span> segment<span style="color: #00dddd;">*</span>i
            b <span style="color: #d2cd86;">=</span> segment<span style="color: #00dddd;">*</span><span style="color: #d2cd86;">(</span>i<span style="color: #00dddd;">+</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">)</span>
            s <span style="color: #d2cd86;">=</span> random<span style="color: #d2cd86;">.</span>uniform<span style="color: #d2cd86;">(</span>a<span style="color: #d2cd86;">,</span> b<span style="color: #d2cd86;">)</span>
            <span style="color: #d2cd86;">(</span>idx<span style="color: #d2cd86;">,</span> p<span style="color: #d2cd86;">,</span> data<span style="color: #d2cd86;">)</span> <span style="color: #d2cd86;">=</span> self<span style="color: #d2cd86;">.</span>tree<span style="color: #d2cd86;">.</span>get<span style="color: #d2cd86;">(</span>s<span style="color: #d2cd86;">)</span>
            batch<span style="color: #d2cd86;">.</span>append<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">(</span>idx<span style="color: #d2cd86;">,</span> data<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
        <span style="color: #e66170; font-weight: bold;">return</span> batch
    <span style="color: #e66170; font-weight: bold;">def</span> update<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> idx<span style="color: #d2cd86;">,</span> error<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        p <span style="color: #d2cd86;">=</span> self<span style="color: #d2cd86;">.</span>_get_priority<span style="color: #d2cd86;">(</span>error<span style="color: #d2cd86;">)</span>
        self<span style="color: #d2cd86;">.</span>tree<span style="color: #d2cd86;">.</span>update<span style="color: #d2cd86;">(</span>idx<span style="color: #d2cd86;">,</span> p<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">class</span> DeepQNetwork<span style="color: #d2cd86;">(</span>tf<span style="color: #d2cd86;">.</span>keras<span style="color: #d2cd86;">.</span>Model<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
    <span style="color: #e66170; font-weight: bold;">def</span> __init__<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> n_actions<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        <span style="color: #e66170; font-weight: bold;">super</span><span style="color: #d2cd86;">(</span>DeepQNetwork<span style="color: #d2cd86;">,</span> self<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">.</span>__init__<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
        self<span style="color: #d2cd86;">.</span>dense_layer <span style="color: #d2cd86;">=</span> tf<span style="color: #d2cd86;">.</span>keras<span style="color: #d2cd86;">.</span>layers<span style="color: #d2cd86;">.</span>Dense<span style="color: #d2cd86;">(</span><span style="color: #00a800;">128</span><span style="color: #d2cd86;">,</span> activation<span style="color: #d2cd86;">=</span><span style="color: #00c4c4;">'relu'</span><span style="color: #d2cd86;">)</span>
        self<span style="color: #d2cd86;">.</span>mid_layer <span style="color: #d2cd86;">=</span> tf<span style="color: #d2cd86;">.</span>keras<span style="color: #d2cd86;">.</span>layers<span style="color: #d2cd86;">.</span>Dense<span style="color: #d2cd86;">(</span><span style="color: #00a800;">128</span><span style="color: #d2cd86;">,</span> activation<span style="color: #d2cd86;">=</span><span style="color: #00c4c4;">'relu'</span><span style="color: #d2cd86;">)</span>
        self<span style="color: #d2cd86;">.</span>output_layer <span style="color: #d2cd86;">=</span> tf<span style="color: #d2cd86;">.</span>keras<span style="color: #d2cd86;">.</span>layers<span style="color: #d2cd86;">.</span>Dense<span style="color: #d2cd86;">(</span>n_actions<span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">def</span> call<span style="color: #d2cd86;">(</span>self<span style="color: #d2cd86;">,</span> inputs<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
        dense_out <span style="color: #d2cd86;">=</span> self<span style="color: #d2cd86;">.</span>dense_layer<span style="color: #d2cd86;">(</span>inputs<span style="color: #d2cd86;">)</span>
        mid_out <span style="color: #d2cd86;">=</span> self<span style="color: #d2cd86;">.</span>mid_layer<span style="color: #d2cd86;">(</span>dense_out<span style="color: #d2cd86;">)</span>
        <span style="color: #e66170; font-weight: bold;">return</span> self<span style="color: #d2cd86;">.</span>output_layer<span style="color: #d2cd86;">(</span>mid_out<span style="color: #d2cd86;">)</span>
env <span style="color: #d2cd86;">=</span> gym<span style="color: #d2cd86;">.</span>make<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">'CartPole-v0'</span><span style="color: #d2cd86;">)</span>
env<span style="color: #d2cd86;">.</span>reset<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
n_actions <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>action_space<span style="color: #d2cd86;">.</span>n
optimizer <span style="color: #d2cd86;">=</span> tf<span style="color: #d2cd86;">.</span>keras<span style="color: #d2cd86;">.</span>optimizers<span style="color: #d2cd86;">.</span>Adam<span style="color: #d2cd86;">(</span>learning_rate<span style="color: #d2cd86;">=</span><span style="color: #009f00;">0.001</span><span style="color: #d2cd86;">)</span>
q_network <span style="color: #d2cd86;">=</span> DeepQNetwork<span style="color: #d2cd86;">(</span>n_actions<span style="color: #d2cd86;">)</span>
target_q_network <span style="color: #d2cd86;">=</span> DeepQNetwork<span style="color: #d2cd86;">(</span>n_actions<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">if</span> <span style="color: #e66170; font-weight: bold;">not</span> os<span style="color: #d2cd86;">.</span>path<span style="color: #d2cd86;">.</span>exists<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"model_weights"</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
    os<span style="color: #d2cd86;">.</span>makedirs<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"model_weights"</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">if</span> os<span style="color: #d2cd86;">.</span>path<span style="color: #d2cd86;">.</span>exists<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">'model_weights/dqn_per_q'</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
    q_network <span style="color: #d2cd86;">=</span> tf<span style="color: #d2cd86;">.</span>keras<span style="color: #d2cd86;">.</span>models<span style="color: #d2cd86;">.</span>load_model<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"model_weights/dqn_per_q"</span><span style="color: #d2cd86;">)</span>
    target_q_network <span style="color: #d2cd86;">=</span> tf<span style="color: #d2cd86;">.</span>keras<span style="color: #d2cd86;">.</span>models<span style="color: #d2cd86;">.</span>load_model<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"model_weights/dqn_per_q_target"</span><span style="color: #d2cd86;">)</span>
    <span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"loaded previously trained dqn model"</span><span style="color: #d2cd86;">)</span>
per_memory <span style="color: #d2cd86;">=</span> PERMemory<span style="color: #d2cd86;">(</span>memory_length<span style="color: #d2cd86;">)</span>
global_steps <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">0</span>
<span style="color: #e66170; font-weight: bold;">for</span> episode <span style="color: #e66170; font-weight: bold;">in</span> <span style="color: #e66170; font-weight: bold;">range</span><span style="color: #d2cd86;">(</span>num_episodes<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
<span style="color: #9999a9;">#print(episode, "р ажиллагаа эхэллээ")</span>
done <span style="color: #d2cd86;">=</span> False
state <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>reset<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
state_shape <span style="color: #d2cd86;">=</span> state<span style="color: #d2cd86;">.</span>shape
<span style="color: #e66170; font-weight: bold;">if</span> debug_render<span style="color: #d2cd86;">:</span>
env<span style="color: #d2cd86;">.</span>render<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
episode_rewards <span style="color: #d2cd86;">=</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span>
<span style="color: #e66170; font-weight: bold;">while</span> <span style="color: #e66170; font-weight: bold;">not</span> done<span style="color: #d2cd86;">:</span>
global_steps <span style="color: #d2cd86;">=</span> global_steps<span style="color: #00dddd;">+</span><span style="color: #00a800;">1</span>
<span style="color: #9999a9;"># exploration vs exploitation</span>
<span style="color: #e66170; font-weight: bold;">if</span> np<span style="color: #d2cd86;">.</span>random<span style="color: #d2cd86;">.</span>rand<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span> <span style="color: #00dddd;"><=</span> epsilon<span style="color: #d2cd86;">:</span>
action <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>action_space<span style="color: #d2cd86;">.</span>sample<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">else</span><span style="color: #d2cd86;">:</span>
q_value <span style="color: #d2cd86;">=</span> q_network<span style="color: #d2cd86;">(</span>np<span style="color: #d2cd86;">.</span>array<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">[</span>state<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> dtype<span style="color: #d2cd86;">=</span>np<span style="color: #d2cd86;">.</span>float32<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
action <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>argmax<span style="color: #d2cd86;">(</span>q_value<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
new_state<span style="color: #d2cd86;">,</span> reward<span style="color: #d2cd86;">,</span> done<span style="color: #d2cd86;">,</span> _ <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>step<span style="color: #d2cd86;">(</span>action<span style="color: #d2cd86;">)</span>
episode_rewards<span style="color: #d2cd86;">.</span>append<span style="color: #d2cd86;">(</span>reward<span style="color: #d2cd86;">)</span>
<span style="color: #9999a9;"># TD error-г тооцоолох, энэ алдааны утгаар sample-д priority утга өгнө</span>
<span style="color: #9999a9;"># алдааны утга нь их байх тусмаа сургах batch дээр гарч ирэх магадлал нь ихэснэ</span>
<span style="color: #9999a9;">#if epsilon == 1:</span>
<span style="color: #9999a9;"># done = True</span>
q_out <span style="color: #d2cd86;">=</span> q_network<span style="color: #d2cd86;">(</span>np<span style="color: #d2cd86;">.</span>array<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">[</span>state<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> dtype<span style="color: #d2cd86;">=</span>np<span style="color: #d2cd86;">.</span>float32<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">.</span>numpy<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
old_value <span style="color: #d2cd86;">=</span> q_out<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span>action<span style="color: #d2cd86;">]</span>
target_q_out <span style="color: #d2cd86;">=</span> target_q_network<span style="color: #d2cd86;">(</span>np<span style="color: #d2cd86;">.</span>array<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">[</span>new_state<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">,</span> dtype<span style="color: #d2cd86;">=</span>np<span style="color: #d2cd86;">.</span>float32<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">.</span>numpy<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">if</span> done<span style="color: #d2cd86;">:</span>
q_out<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span>action<span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> reward
<span style="color: #e66170; font-weight: bold;">else</span><span style="color: #d2cd86;">:</span>
q_out<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span>action<span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> reward <span style="color: #00dddd;">+</span> gamma<span style="color: #00dddd;">*</span>np<span style="color: #d2cd86;">.</span>amax<span style="color: #d2cd86;">(</span>target_q_out<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
td_error <span style="color: #d2cd86;">=</span> <span style="color: #e66170; font-weight: bold;">abs</span><span style="color: #d2cd86;">(</span>old_value<span style="color: #00dddd;">-</span>q_out<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span>action<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
per_memory<span style="color: #d2cd86;">.</span>add<span style="color: #d2cd86;">(</span>td_error<span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">(</span>state<span style="color: #d2cd86;">,</span> action<span style="color: #d2cd86;">,</span> reward<span style="color: #d2cd86;">,</span> new_state<span style="color: #d2cd86;">,</span> done<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
<span style="color: #9999a9;"># update the epsilon value used for exploration</span>
<span style="color: #e66170; font-weight: bold;">if</span> epsilon<span style="color: #00dddd;">></span>epsilon_min<span style="color: #d2cd86;">:</span>
epsilon <span style="color: #d2cd86;">=</span> epsilon_min <span style="color: #00dddd;">+</span> <span style="color: #d2cd86;">(</span>epsilon_max<span style="color: #00dddd;">-</span>epsilon_min<span style="color: #d2cd86;">)</span><span style="color: #00dddd;">*</span>math<span style="color: #d2cd86;">.</span>exp<span style="color: #d2cd86;">(</span><span style="color: #00dddd;">-</span>epsilon_decay<span style="color: #00dddd;">*</span>global_steps<span style="color: #d2cd86;">)</span>
<span style="color: #9999a9;">#print(epsilon)</span>
<span style="color: #9999a9;"># enough samples have been collected, so train the Q neural network</span>
<span style="color: #e66170; font-weight: bold;">if</span> <span style="color: #d2cd86;">(</span>global_steps<span style="color: #00dddd;">%</span>train_per_step<span style="color: #00dddd;">==</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
<span style="color: #9999a9;">#print("Training the Q network, please wait")</span>
<span style="color: #e66170; font-weight: bold;">for</span> train_step <span style="color: #e66170; font-weight: bold;">in</span> <span style="color: #e66170; font-weight: bold;">range</span><span style="color: #d2cd86;">(</span>train_count<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
<span style="color: #9999a9;"># first create a batch by sampling from the collected experiences</span>
sampled_batch <span style="color: #d2cd86;">=</span> per_memory<span style="color: #d2cd86;">.</span>sample<span style="color: #d2cd86;">(</span>batch_size<span style="color: #d2cd86;">)</span>
<span style="color: #9999a9;">#print(sampled_batch[0])  # debug output, disabled during training</span>
state_shape <span style="color: #d2cd86;">=</span> sampled_batch<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">.</span>shape
q_input <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>zeros<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">(</span>batch_size<span style="color: #d2cd86;">,</span> state_shape<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span> dtype<span style="color: #d2cd86;">=</span>np<span style="color: #d2cd86;">.</span>float32<span style="color: #d2cd86;">)</span>
target_q_input <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>zeros<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">(</span>batch_size<span style="color: #d2cd86;">,</span> state_shape<span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">,</span> dtype<span style="color: #d2cd86;">=</span>np<span style="color: #d2cd86;">.</span>float32<span style="color: #d2cd86;">)</span>
actions <span style="color: #d2cd86;">=</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span>
rewards <span style="color: #d2cd86;">=</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span>
dones <span style="color: #d2cd86;">=</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span>
td_errors <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>zeros<span style="color: #d2cd86;">(</span>batch_size<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">for</span> i <span style="color: #e66170; font-weight: bold;">in</span> <span style="color: #e66170; font-weight: bold;">range</span><span style="color: #d2cd86;">(</span>batch_size<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
q_input <span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> sampled_batch<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span> <span style="color: #9999a9;"># curr_state</span>
target_q_input<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> sampled_batch<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">3</span><span style="color: #d2cd86;">]</span> <span style="color: #9999a9;"># next_state</span>
actions<span style="color: #d2cd86;">.</span>append<span style="color: #d2cd86;">(</span>sampled_batch<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span> <span style="color: #9999a9;"># action</span>
rewards<span style="color: #d2cd86;">.</span>append<span style="color: #d2cd86;">(</span>sampled_batch<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">2</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span> <span style="color: #9999a9;"># reward</span>
dones <span style="color: #d2cd86;">.</span>append<span style="color: #d2cd86;">(</span>sampled_batch<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">4</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span> <span style="color: #9999a9;"># is done</span>
q_out <span style="color: #d2cd86;">=</span> q_network<span style="color: #d2cd86;">(</span>q_input<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">.</span>numpy<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
target_q_out <span style="color: #d2cd86;">=</span> target_q_network<span style="color: #d2cd86;">(</span>target_q_input<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">.</span>numpy<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
<span style="color: #9999a9;"># adjust the training batch targets toward the Bellman Q values</span>
<span style="color: #e66170; font-weight: bold;">for</span> i <span style="color: #e66170; font-weight: bold;">in</span> <span style="color: #e66170; font-weight: bold;">range</span><span style="color: #d2cd86;">(</span>batch_size<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
old_value <span style="color: #d2cd86;">=</span> q_out<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span>actions<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">]</span>
<span style="color: #e66170; font-weight: bold;">if</span> dones<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">:</span>
q_out<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span>actions<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> rewards<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span>
<span style="color: #e66170; font-weight: bold;">else</span><span style="color: #d2cd86;">:</span>
q_out<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span>actions<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> rewards<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span> <span style="color: #00dddd;">+</span> gamma<span style="color: #00dddd;">*</span>np<span style="color: #d2cd86;">.</span>amax<span style="color: #d2cd86;">(</span>target_q_out<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
<span style="color: #9999a9;"># compute the TD error from the new batch</span>
td_errors<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> <span style="color: #e66170; font-weight: bold;">abs</span><span style="color: #d2cd86;">(</span>old_value <span style="color: #00dddd;">-</span> q_out<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span>actions<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
<span style="color: #9999a9;"># update the td_error values stored in PER memory</span>
<span style="color: #9999a9;"># needed when sampling again later</span>
<span style="color: #e66170; font-weight: bold;">for</span> i <span style="color: #e66170; font-weight: bold;">in</span> <span style="color: #e66170; font-weight: bold;">range</span><span style="color: #d2cd86;">(</span>batch_size<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
idx <span style="color: #d2cd86;">=</span> sampled_batch<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">[</span><span style="color: #00a800;">0</span><span style="color: #d2cd86;">]</span>
per_memory<span style="color: #d2cd86;">.</span>update<span style="color: #d2cd86;">(</span>idx<span style="color: #d2cd86;">,</span> td_errors<span style="color: #d2cd86;">[</span>i<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
<span style="color: #9999a9;"># train the Q neural network</span>
<span style="color: #e66170; font-weight: bold;">with</span> tf<span style="color: #d2cd86;">.</span>GradientTape<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span> <span style="color: #e66170; font-weight: bold;">as</span> tape<span style="color: #d2cd86;">:</span>
prediction_q_out <span style="color: #d2cd86;">=</span> q_network<span style="color: #d2cd86;">(</span>q_input<span style="color: #d2cd86;">)</span>
loss <span style="color: #d2cd86;">=</span> tf<span style="color: #d2cd86;">.</span>keras<span style="color: #d2cd86;">.</span>losses<span style="color: #d2cd86;">.</span>Huber<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">(</span>q_out<span style="color: #d2cd86;">,</span> prediction_q_out<span style="color: #d2cd86;">)</span>
gradients <span style="color: #d2cd86;">=</span> tape<span style="color: #d2cd86;">.</span>gradient<span style="color: #d2cd86;">(</span>loss<span style="color: #d2cd86;">,</span> q_network<span style="color: #d2cd86;">.</span>trainable_variables<span style="color: #d2cd86;">)</span>
optimizer<span style="color: #d2cd86;">.</span>apply_gradients<span style="color: #d2cd86;">(</span><span style="color: #e66170; font-weight: bold;">zip</span><span style="color: #d2cd86;">(</span>gradients<span style="color: #d2cd86;">,</span> q_network<span style="color: #d2cd86;">.</span>trainable_variables<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
training_happened <span style="color: #d2cd86;">=</span> True
<span style="color: #9999a9;">#print("Finished training the Q network")</span>
<span style="color: #9999a9;"># time to update the target Q network</span>
<span style="color: #e66170; font-weight: bold;">if</span> global_steps<span style="color: #00dddd;">%</span>sync_per_step<span style="color: #00dddd;">==</span><span style="color: #00a800;">0</span> <span style="color: #e66170; font-weight: bold;">and</span> training_happened<span style="color: #00dddd;">==</span>True<span style="color: #d2cd86;">:</span>
target_q_network<span style="color: #d2cd86;">.</span>set_weights<span style="color: #d2cd86;">(</span>q_network<span style="color: #d2cd86;">.</span>get_weights<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
<span style="color: #9999a9;">#print("updated the target network weights")</span>
<span style="color: #e66170; font-weight: bold;">if</span> global_steps<span style="color: #00dddd;">%</span>save_per_step<span style="color: #00dddd;">==</span><span style="color: #00a800;">0</span> <span style="color: #e66170; font-weight: bold;">and</span> training_happened<span style="color: #00dddd;">==</span>True<span style="color: #d2cd86;">:</span>
q_network<span style="color: #d2cd86;">.</span>save<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"model_weights/dqn_per_q"</span><span style="color: #d2cd86;">)</span>
target_q_network<span style="color: #d2cd86;">.</span>save<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"model_weights/dqn_per_q_target"</span><span style="color: #d2cd86;">)</span>
<span style="color: #9999a9;">#print("saved the models to the model_weights/ folder")</span>
<span style="color: #e66170; font-weight: bold;">if</span> done<span style="color: #00dddd;">==</span>True<span style="color: #d2cd86;">:</span>
<span style="color: #9999a9;">#print(episode, "finished")</span>
<span style="color: #9999a9;">#print("{} - total reward : {}".format(episode, sum(episode_rewards)))</span>
<span style="color: #9999a9;">#print("average reward :", sum(episode_rewards)/len(episode_rewards))</span>
<span style="color: #e66170; font-weight: bold;">pass</span>
env<span style="color: #d2cd86;">.</span>close<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
</pre>
<div><br /></div>
<div><br /></div><div><b><span><font size="6">Deep Reinforcement Learning, Q Learning</font></span></b> <span>(2020-06-20)</span></div>To understand the <b>Q Learning</b> algorithm, let's first pin down its building blocks.<div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIzi6J4Y5waVDWkwwCv48LCpYcuprQTP2a-7Fb7h1StzfLyWeFeRXKbmYheFSjrIap90BtnsB7cDGIhA6XSYGUevIIKbU5TEUAxTdH81BFPElU7hcnMj1PCAQ7zVdDo8PCF54xM_nrgw/s510/Peek+2020-06-22+21-19.gif" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="426" data-original-width="510" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIzi6J4Y5waVDWkwwCv48LCpYcuprQTP2a-7Fb7h1StzfLyWeFeRXKbmYheFSjrIap90BtnsB7cDGIhA6XSYGUevIIKbU5TEUAxTdH81BFPElU7hcnMj1PCAQ7zVdDo8PCF54xM_nrgw/s320/Peek+2020-06-22+21-19.gif" width="320" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><span><a name='more'></a></span><div><b><span style="font-size: x-large;"><br /></span></b></div><div><b><span style="font-size: x-large;"><br /></span></b></div><div><b><span><font size="6">Discounted future reward</font></span></b><br />
<br />
Rendering <b>Discounted future reward</b> this loosely may not be exact, but the meaning should come across.<br />
<br />
Starting from the current moment, the total sum of rewards an agent can collect in the future can be expressed by the following formula.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjpjBs7hloNxKuoHsshQOkeDpUzs7E4tIE3N8elYCKXesWaA9-IpiMfVD2UpHYjAUJq71_3XGnKuGX4Mk3anCV9_c6Qir5x5ecUmeBfyltjBi4CGQjhsfcAK4tkDW8Jf1D7S2mnZp91Wg/s236/Total+reward.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="53" data-original-width="236" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjpjBs7hloNxKuoHsshQOkeDpUzs7E4tIE3N8elYCKXesWaA9-IpiMfVD2UpHYjAUJq71_3XGnKuGX4Mk3anCV9_c6Qir5x5ecUmeBfyltjBi4CGQjhsfcAK4tkDW8Jf1D7S2mnZp91Wg/d/Total+reward.png" /></a></div><div>There is one problem here: the world keeps changing, so outcomes in the distant future are inherently uncertain.</div><div>
<br />
From this we observe a pattern: reward earned in the near future carries more weight than reward in the distant future, whose contribution to the total is weakened by uncertainty.<br />
<br />
The future reward that captures this pattern is what we call the <b>discounted future reward</b>.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwzdJ1yDeOlbcJmDIXvanutNbPi_SBcR3YMcbJKAo5P8HFVVl0Q879HGhdYat6RPKuC3HQZN2uSBkwgp1GQyWvM42coPDk6Avu-jBFhdQ4zkFEf8LwZHVF0ZuSDnwE_7H2oOnSZ9yAxg/s403/Total+discounted+reward.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="61" data-original-width="403" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwzdJ1yDeOlbcJmDIXvanutNbPi_SBcR3YMcbJKAo5P8HFVVl0Q879HGhdYat6RPKuC3HQZN2uSBkwgp1GQyWvM42coPDk6Avu-jBFhdQ4zkFEf8LwZHVF0ZuSDnwE_7H2oOnSZ9yAxg/d/Total+discounted+reward.png" /></a></div><div>The damping factor <b>gamma</b> is a fraction between 0 and 1. The further into the future a reward lies, the smaller it becomes under the growing exponent.</div><div><br /></div><div>Using <b>gamma</b> has another advantage: as <b>n</b> tends to infinity, the total reward is guaranteed to converge to a finite number rather than diverge.</div><div>
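As a quick illustration (a sketch, not part of the original post), the discounted sum above can be computed by folding the reward sequence backwards; the names `discounted_return`, `rewards`, and `gamma` are illustrative:

```python
# Discounted return: G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    # iterating backwards turns the sum into repeated one-step folds
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With gamma below 1 the terms shrink geometrically, which is exactly why the infinite sum stays finite.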
<br />
<br />
<b><span><font size="6">Value function</font></span></b><br />
<br />
This is a loose rendering of <b>Value function</b>, but the idea should come through.<br />
<br />
If the agent knows how good it is to act in state <i><b>s</b></i> under policy <b><i>π</i></b>, i.e. how much reward it will collect in the future, it can use that knowledge to make sound decisions.<br />
<br />
The total future reward obtainable from a given state is expressed by the following function. Notice the discounted future reward from earlier; the π subscript indicates that this is the reward sum produced by acting according to policy π.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieJjydvzrOPUcoWHpEiptC3i5fS6ImJh-9mOu_mJuWBm-Nk8h1kPq5L-m-l_f3ZCV3LSisxllN9Q0QRpt_bVn4-A4Z0rQQooYu4T9y5PpqSeYhC1OzMRA74F10HJ1mMFYesLm58catdg/s1600/Value+function.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="67" data-original-width="246" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieJjydvzrOPUcoWHpEiptC3i5fS6ImJh-9mOu_mJuWBm-Nk8h1kPq5L-m-l_f3ZCV3LSisxllN9Q0QRpt_bVn4-A4Z0rQQooYu4T9y5PpqSeYhC1OzMRA74F10HJ1mMFYesLm58catdg/s1600/Value+function.png" /></a></div>
Then there can exist a policy <b><i>π</i></b> whose value function is maximal in that state. This is called the <b>optimal value function</b>. The * denotes optimality: V* is read as the value of the policy π with the highest V among all possible policies.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdulcbNTCGIpw_1wLH5n7hh0mEvELbrsWh6RLy9rmk8z4LZNm0qTjxMSQO5Y1C-jevdWvFBknQ_33MIcSGk-5VmsKLrZPx7JKsTfwxSfRsBQ-inXHUCb7QIa0UBaTZ0p78Elq1-W8Ajw/s1600/Optimal+value+function.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="69" data-original-width="233" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdulcbNTCGIpw_1wLH5n7hh0mEvELbrsWh6RLy9rmk8z4LZNm0qTjxMSQO5Y1C-jevdWvFBknQ_33MIcSGk-5VmsKLrZPx7JKsTfwxSfRsBQ-inXHUCb7QIa0UBaTZ0p78Elq1-W8Ajw/s1600/Optimal+value+function.png" /></a></div><div><br /></div>
<center>
<iframe width="560" height="315" src="https://www.youtube.com/embed/1p7Zgy79cSo" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</center>
</div>
<div><br /></div>
<div>
<b><font size="6">Temporal Difference</font></b></div><div><br /></div><div>There are several methods for approximating the <b>V(s)</b> function. One of them is the <b>TD(1)</b> algorithm. Starting from initial values, it keeps updating until it converges toward <b>Gt</b>, the discounted future reward we saw earlier.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxNilNg7bulHIQWVFWAF3Dkf52zb3UcNOGEXtcdM__91I_Lm8FFR1EFU9VEB5_kBnLgFcDnhAZaEix3TudtSRYMglGDD1CWyNjyEq0jJ_gzgd-vJKqJcAt8TDLhY8DBqzj4tfp6bA4pw/s353/TD1+update.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="62" data-original-width="353" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxNilNg7bulHIQWVFWAF3Dkf52zb3UcNOGEXtcdM__91I_Lm8FFR1EFU9VEB5_kBnLgFcDnhAZaEix3TudtSRYMglGDD1CWyNjyEq0jJ_gzgd-vJKqJcAt8TDLhY8DBqzj4tfp6bA4pw/d/TD1+update.png" /></a></div><div>The difference <b>G<font size="1">t</font>-V(s<font size="1">t</font>)</b> in this formula is called the <b>TD Error</b>, or <b>Temporal Difference Error</b>. 
The formula updating <b>V(s<font size="1">t</font>)</b> this way is, as just mentioned, the <b>TD(1) algorithm</b>.</div><div><br /></div><div>Another approach, instead of approximating toward the total reward <b>G<font size="2">t</font></b>, moves the estimate toward the reward of the next step, i.e. toward the value <b>r<font size="1">t+1</font>+γV(s<font size="1">t+1</font>)</b>.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHQnuJzIjHGqLijaw8BXphhVQW86fmRpfYNEh1WmxK1XOp5HCbp3p5RcEMVBpKjdgBextMXrnZnkVL1B3uGozMvGbFmq_f9S1USIVrxgrqZ7kr2qEtQFOgpDSCW-QsBNk3ONBmhZAUFw/s490/TD0+update.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="52" data-original-width="490" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHQnuJzIjHGqLijaw8BXphhVQW86fmRpfYNEh1WmxK1XOp5HCbp3p5RcEMVBpKjdgBextMXrnZnkVL1B3uGozMvGbFmq_f9S1USIVrxgrqZ7kr2qEtQFOgpDSCW-QsBNk3ONBmhZAUFw/d/TD0+update.png" /></a></div><div>This is called the <b>TD(0) algorithm</b>.</div><div><br /></div><div><br />
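The one-step TD(0) update can be sketched as follows (a minimal illustration with assumed toy values; `td0_update`, `alpha`, and the 5-state table are not from the original post):

```python
import numpy as np

alpha, gamma = 0.1, 0.9
V = np.zeros(5)  # value estimates for a toy 5-state problem

def td0_update(V, s, r, s_next):
    # V(s) <- V(s) + alpha * (r + gamma*V(s') - V(s)); the bracket is the TD error
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error

td0_update(V, 0, 1.0, 1)  # V[0] moves a fraction alpha toward the one-step target
```

Each call nudges the estimate toward r + γV(s') rather than waiting for the full return G, which is the whole point of TD(0).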
<b><span><font size="6">Q function</font></span></b><br />
<br />
This function takes two inputs, a <b>state</b> and an <b>action</b>, and returns the corresponding <b>total obtainable reward</b>. <br />
<br />
The letter <b>Q</b> comes from the English word <b>Quality</b>. It can thus be read as telling how good it is to take action <b><i>a</i></b> in a given state <b><i>s</i></b>.</div><div><br /></div><div>It is used to assign a score to each action available in a given state.<br />
<br />
The optimal value function and the Q function are related as follows: the value of the <b>optimal V function</b> in state <b>s</b> equals the <b>Q</b> value of the action <b>a</b> with the highest <b>Q</b> score among all actions available in that state.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtc0tB5ooT356TaVj_IfVoPRbIn67E5Ikxb9dQub6-3QTd7x11lBJAa3kpncvt_8fm62Xq4SzkhbRb3xU4pD65tCnhA8e0JUSyBELNMy8UAmCdAG08y01wqxbv_4hmKfi_cWXPpeuXvQ/s1600/Relation+between+value+and+q+functions.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="75" data-original-width="263" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtc0tB5ooT356TaVj_IfVoPRbIn67E5Ikxb9dQub6-3QTd7x11lBJAa3kpncvt_8fm62Xq4SzkhbRb3xU4pD65tCnhA8e0JUSyBELNMy8UAmCdAG08y01wqxbv_4hmKfi_cWXPpeuXvQ/s1600/Relation+between+value+and+q+functions.png" /></a></div>Since the goal of Reinforcement Learning is to keep the total obtainable reward as high as possible, from the relation above we can derive the optimal policy function.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6mLDpCLdRZnBZLf868GZIN_7bWyyOdK_TqdXB9XBU1R_TqxnayequgI6u-_rQWNXO54XrnHiCk9Np4hs3BoiDBQavLTcDkIFasSaPyGdHITcnV-RX2onMeVoTHwprUrNCLaztoQjGZQ/s1600/Optimal+policy+function.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="63" data-original-width="302" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6mLDpCLdRZnBZLf868GZIN_7bWyyOdK_TqdXB9XBU1R_TqxnayequgI6u-_rQWNXO54XrnHiCk9Np4hs3BoiDBQavLTcDkIFasSaPyGdHITcnV-RX2onMeVoTHwprUrNCLaztoQjGZQ/s1600/Optimal+policy+function.png" /></a></div>
This reads as follows: the <b>policy π</b> that yields the most future reward is to always choose the action <b><i>a</i></b> with the highest Q value.<br />
<br />
<br />
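Reading the greedy policy off a Q table is a one-liner; the following sketch uses made-up Q values for a single state with 4 actions (`q_table` and `greedy_action` are illustrative names, not from the original code):

```python
import numpy as np

q_table = np.array([[0.1, 0.5, 0.2, 0.0]])  # Q values of one state, 4 actions

def greedy_action(q_table, state):
    # pi*(s) = argmax_a Q(s, a): pick the action with the highest Q value
    return int(np.argmax(q_table[state]))

print(greedy_action(q_table, 0))  # 1, since action 1 has the largest Q value
```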
<b><span><font size="6">Q Learning: training the Q function</font></span></b><br />
<br />
The <b>Q value</b> of action <b>a</b> in state <b><i>s</i></b>, Q(s, a), is written as the following recursive expression, known as the <b>Bellman equation</b>.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggpQWV1J7-3nKzDO-JeuAloGFO3rVB4ux6hjuYaYEewlm0tsbvMiBWQKik6gNqBn287ZXhQR9n-yfsHPw5NP3dhCO6HFLyVEl4sKJODSf8W6O3weTVu4bcSJiBVlLs_LobOn6TzvQHPA/s1600/Bellman+equation.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="64" data-original-width="337" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggpQWV1J7-3nKzDO-JeuAloGFO3rVB4ux6hjuYaYEewlm0tsbvMiBWQKik6gNqBn287ZXhQR9n-yfsHPw5NP3dhCO6HFLyVEl4sKJODSf8W6O3weTVu4bcSJiBVlLs_LobOn6TzvQHPA/s1600/Bellman+equation.png" /></a></div>
How to read this equation: the reward resulting from action <b>a</b> is computed as the <b>immediate reward</b> plus the reward obtained from the next state <b>s'</b> under the optimal policy. </div><div><br /></div><div>Note again that because future outcomes are slightly uncertain, the discount factor <b>gamma</b> multiplies the future term.</div><div><br /></div><div>Here is how the Bellman equation is derived: </div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjUC-LsUl69G1hFINbFG8aFZefYSCxmRubp38l1ckwmeV_IoqBBth3-NKpP-JnCWdGQPVt-gR95eWdzlc6Ha0ta7b5503ewrr5kMhxPd7mDC_ktok2PJbL6Q5ZkWJbIZQik8Bvs19UDQ/s477/Bellman+equation+composition.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="477" data-original-width="461" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjUC-LsUl69G1hFINbFG8aFZefYSCxmRubp38l1ckwmeV_IoqBBth3-NKpP-JnCWdGQPVt-gR95eWdzlc6Ha0ta7b5503ewrr5kMhxPd7mDC_ktok2PJbL6Q5ZkWJbIZQik8Bvs19UDQ/d/Bellman+equation+composition.png" /></a></div><div><br /></div><div><br />Training the Q function by moving it toward the Bellman value with the <b>TD</b> algorithm gives<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDAGZ24RVn-KThUWX8FhQ0SY5ipEgJ5ryqBsvMCCklJLNaBZDJzkrGwLSF6s-TIkGc7sXPhMqimEKkkwjpuX4jp9MPyJMiQs87ICrIXb3PSzzqoUj6FcvMBjOH75zuJuW1_IYZUtRnUw/s681/Bellman+equation+practical.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="65" data-original-width="681" height="62" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDAGZ24RVn-KThUWX8FhQ0SY5ipEgJ5ryqBsvMCCklJLNaBZDJzkrGwLSF6s-TIkGc7sXPhMqimEKkkwjpuX4jp9MPyJMiQs87ICrIXb3PSzzqoUj6FcvMBjOH75zuJuW1_IYZUtRnUw/w640-h62/Bellman+equation+practical.png" width="640" /></a></div><div>Here alpha is the learning rate, the step size used when updating the Q function's value. It controls how strongly the previous Q value is changed at each update.</div><div><br /></div><div>After exploring the environment many times, the Q function begins to learn its approximately optimal values. </div><div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbQJ1-xte11wSZghS6AMiw_oebXgxOTH3nWULYszdBiKHpxCYM4ZuuOqiqTOiPiiTogV5L-3cu3H5Go_OYZ_6OJd_J99KztlskWMSCdbawgaRN-MReSHXRO9GMupbUnaIO4YOsbrB4CA/s1456/QTable.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="390" data-original-width="1456" height="172" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbQJ1-xte11wSZghS6AMiw_oebXgxOTH3nWULYszdBiKHpxCYM4ZuuOqiqTOiPiiTogV5L-3cu3H5Go_OYZ_6OJd_J99KztlskWMSCdbawgaRN-MReSHXRO9GMupbUnaIO4YOsbrB4CA/w640-h172/QTable.png" width="640" /></a></div><div><br /></div><div><br /></div><div><b><font size="6">Implementation</font></b></div><div><br /></div><div>
Дээрх томьёоны дагуу <a href="https://gym.openai.com/envs/FrozenLake-v0/">OpenAI-н FrozenLake environment</a> дээр Q Learning алгоритмыг хэрэгжүүлж харая.<br /><br />Энэ environment 4x4 хүснэгт буюу нийт 16 утга бүхий төлвүүдээс бүрдэнэ. Энэ хүснэгт дотор 4 зүгт хөдөлж болох тул нийт хийгдэх үйлдлийн тоо нь 4. </div><div><br /></div><div>Зорилго нь S байрлалаас хөдлөөд G байрлалруу үйлдэл хийн хүрч очих. Дундаа янз бүрийн нүхтэй тул түүгээр явж болохгүй, харин хөлдүү нуураар дамжин явах боломжтой.<br /><ul style="text-align: left;"><li>S - Эхлэлийн байршил</li><li>F - Хөлдүү нуур, үүгээр алхаж болно</li><li>H - Hole буюу явах боломжгүй нүх</li><li>G - Goal буюу хүрч очих ёстой байршил</li></ul><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0JbTzKJYKYCklDkPrLRIWjy3Jbcst4uhyphenhyphenbwebmlyidtb23apcL5zAQIkPt_jwbZiNCqMRASx8FeYwtGBhhq7M9gwUMQLDwyA6MHAfD-X9Vw_sWWOYe9kVHku3FjVGutEgzhHSJa_F7A/s374/FrozenLake+environment.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="232" data-original-width="374" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0JbTzKJYKYCklDkPrLRIWjy3Jbcst4uhyphenhyphenbwebmlyidtb23apcL5zAQIkPt_jwbZiNCqMRASx8FeYwtGBhhq7M9gwUMQLDwyA6MHAfD-X9Vw_sWWOYe9kVHku3FjVGutEgzhHSJa_F7A/s320/FrozenLake+environment.png" width="320" /></a></div></div><div><br /></div>
<pre style="background: rgb(0, 0, 0); color: #d1d1d1;"><span style="color: #e66170; font-weight: bold;">from</span> os <span style="color: #e66170; font-weight: bold;">import</span> system
<span style="color: #e66170; font-weight: bold;">from</span> time <span style="color: #e66170; font-weight: bold;">import</span> sleep
<span style="color: #e66170; font-weight: bold;">import</span> gym
<span style="color: #e66170; font-weight: bold;">import</span> numpy <span style="color: #e66170; font-weight: bold;">as</span> np
env <span style="color: #d2cd86;">=</span> gym<span style="color: #d2cd86;">.</span>make<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">'FrozenLake-v0'</span><span style="color: #d2cd86;">)</span>
system<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"clear"</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"Нийт төлөв :"</span><span style="color: #d2cd86;">,</span> env<span style="color: #d2cd86;">.</span>observation_space<span style="color: #d2cd86;">.</span>n<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"Үйлдлүүд :"</span><span style="color: #d2cd86;">,</span> env<span style="color: #d2cd86;">.</span>action_space<span style="color: #d2cd86;">.</span>n <span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"Удахгүй сургаж эхлэнэ түр хүлээнэ үү!"</span><span style="color: #d2cd86;">)</span>
sleep<span style="color: #d2cd86;">(</span><span style="color: #00a800;">3</span><span style="color: #d2cd86;">)</span>
system<span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"clear"</span><span style="color: #d2cd86;">)</span>
<span style="color: #9999a9;"># Q хүснэгтийг бүгдийг тэгээр дүүргэн үүсгэх</span>
Q <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>zeros<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">[</span>env<span style="color: #d2cd86;">.</span>observation_space<span style="color: #d2cd86;">.</span>n<span style="color: #d2cd86;">,</span> env<span style="color: #d2cd86;">.</span>action_space<span style="color: #d2cd86;">.</span>n<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
<span style="color: #9999a9;"># alpha буюу learning rate</span>
learning_rate <span style="color: #d2cd86;">=</span> <span style="color: #009f00;">0.8</span>
<span style="color: #9999a9;"># gamma буюу discount factor</span>
gamma <span style="color: #d2cd86;">=</span> <span style="color: #009f00;">0.95</span>
<span style="color: #9999a9;"># episode-ийн тоо, хичнээн удаа environment-ийг дахин эхлүүлж явуулах вэ</span>
episodes <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">4000</span>
<span style="color: #9999a9;"># episode бүрт цуглуулсан reward оноонуудыг хадгалах жагсаалт</span>
reward_list <span style="color: #d2cd86;">=</span> <span style="color: #d2cd86;">[</span><span style="color: #d2cd86;">]</span>
<span style="color: #e66170; font-weight: bold;">for</span> i <span style="color: #e66170; font-weight: bold;">in</span> <span style="color: #e66170; font-weight: bold;">range</span><span style="color: #d2cd86;">(</span>episodes<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">:</span>
    <span style="color: #9999a9;"># episode эхлэж байгаа тул environment тэр чигт нь шинэчлэх</span>
    state <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>reset<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
    total_reward <span style="color: #d2cd86;">=</span> <span style="color: #00a800;">0</span>
    done <span style="color: #d2cd86;">=</span> False
    <span style="color: #e66170; font-weight: bold;">while</span> True<span style="color: #d2cd86;">:</span>
        <span style="color: #9999a9;"># action буюу үйлдлийг сонгох, гэхдээ тодорхой хэмжээний noise-тойгоор</span>
        action <span style="color: #d2cd86;">=</span> np<span style="color: #d2cd86;">.</span>argmax<span style="color: #d2cd86;">(</span>Q<span style="color: #d2cd86;">[</span>state<span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">:</span><span style="color: #d2cd86;">]</span><span style="color: #00dddd;">+</span>np<span style="color: #d2cd86;">.</span>random<span style="color: #d2cd86;">.</span>randn<span style="color: #d2cd86;">(</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">,</span> env<span style="color: #d2cd86;">.</span>action_space<span style="color: #d2cd86;">.</span>n<span style="color: #d2cd86;">)</span><span style="color: #00dddd;">*</span><span style="color: #d2cd86;">(</span><span style="color: #009f00;">1.0</span><span style="color: #00dddd;">/</span><span style="color: #d2cd86;">(</span>i<span style="color: #00dddd;">+</span><span style="color: #00a800;">1</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
        <span style="color: #9999a9;"># үйлдэл хийж шинэ төлөв болон reward оноог авах</span>
        new_state<span style="color: #d2cd86;">,</span> new_reward<span style="color: #d2cd86;">,</span> done<span style="color: #d2cd86;">,</span> _ <span style="color: #d2cd86;">=</span> env<span style="color: #d2cd86;">.</span>step<span style="color: #d2cd86;">(</span>action<span style="color: #d2cd86;">)</span>
        <span style="color: #9999a9;"># TD(0) алгоритмын дагуу Q хүснэгтийг шинэчлэх</span>
        Q<span style="color: #d2cd86;">[</span>state<span style="color: #d2cd86;">,</span> action<span style="color: #d2cd86;">]</span> <span style="color: #d2cd86;">=</span> Q<span style="color: #d2cd86;">[</span>state<span style="color: #d2cd86;">,</span> action<span style="color: #d2cd86;">]</span> <span style="color: #00dddd;">+</span> \
            learning_rate<span style="color: #00dddd;">*</span><span style="color: #d2cd86;">(</span>new_reward <span style="color: #00dddd;">+</span> gamma<span style="color: #00dddd;">*</span>np<span style="color: #d2cd86;">.</span><span style="color: #e66170; font-weight: bold;">max</span><span style="color: #d2cd86;">(</span>Q<span style="color: #d2cd86;">[</span>new_state<span style="color: #d2cd86;">,</span> <span style="color: #d2cd86;">:</span><span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span> <span style="color: #00dddd;">-</span> Q<span style="color: #d2cd86;">[</span>state<span style="color: #d2cd86;">,</span> action<span style="color: #d2cd86;">]</span><span style="color: #d2cd86;">)</span>
        total_reward <span style="color: #d2cd86;">=</span> total_reward <span style="color: #00dddd;">+</span> new_reward
        state <span style="color: #d2cd86;">=</span> new_state
        env<span style="color: #d2cd86;">.</span>render<span style="color: #d2cd86;">(</span><span style="color: #d2cd86;">)</span>
        <span style="color: #e66170; font-weight: bold;">if</span> done <span style="color: #00dddd;">==</span> True<span style="color: #d2cd86;">:</span>
            <span style="color: #e66170; font-weight: bold;">break</span>
    reward_list<span style="color: #d2cd86;">.</span>append<span style="color: #d2cd86;">(</span>total_reward<span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"Сургаж дууслаа."</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"Нийт оноо :"</span><span style="color: #d2cd86;">,</span> <span style="color: #e66170; font-weight: bold;">str</span><span style="color: #d2cd86;">(</span><span style="color: #e66170; font-weight: bold;">sum</span><span style="color: #d2cd86;">(</span>reward_list<span style="color: #d2cd86;">)</span><span style="color: #00dddd;">/</span>episodes<span style="color: #d2cd86;">)</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span><span style="color: #00c4c4;">"Q хүснэгт :"</span><span style="color: #d2cd86;">)</span>
<span style="color: #e66170; font-weight: bold;">print</span><span style="color: #d2cd86;">(</span>Q<span style="color: #d2cd86;">)</span>
</pre>
<!--Created using ToHtml.com on 2020-06-19 19:08:24 UTC-->
<div><br /></div>
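Сургалт дууссаны дараа Q хүснэгтээс noise нэмэлгүйгээр шууд хамгийн өндөр утгатай үйлдлийг нь сонгож агентаа ажиллуулж болно. Доорх нь зөвхөн greedy сонголтын санааг харуулсан жижиг ноорог (Q хүснэгтийн утгууд зохиомол):

```python
import numpy as np

# Зохиомол жижиг Q хүснэгт: 2 төлөв, 4 үйлдэл
Q = np.array([[0.1, 0.5, 0.2, 0.0],
              [0.0, 0.0, 0.9, 0.1]])

def greedy_action(Q, state):
    # Тухайн төлөв дээрх хамгийн өндөр Q утгатай үйлдлийг буцаана
    return int(np.argmax(Q[state, :]))
```

Бодит хэрэглээнд дээрх FrozenLake environment дээр env.step(greedy_action(Q, state)) байдлаар дуудаж сурсан policy-гоо шалгаж болно.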
</div>
</div></div>Sharavsambuu Gunchinishhttp://www.blogger.com/profile/06950810883056147179noreply@blogger.com0Ulaanbaatar, Mongolia47.886398799999988 106.9057439-30.818544843446723 -33.719256099999996 90 -112.4692561tag:blogger.com,1999:blog-1457877875009527488.post-694867364293291892020-06-17T23:56:00.008+08:002020-07-02T22:24:53.014+08:00Deep Reinforcement Learning, үндсэн ойлголтуудМашин сургах алгоритмууд үндсэн гурван бүлэгт хуваагддаг бөгөөд эдгээр нь supervised learning, unsupervised learning, <b>reinforcement learning</b> болно.<br />
<br />
<a name='more'></a><br />
<br />
RL алгоритм нь бараг л компютерийн шинжлэх ухаан үүсч эхэлснээс хойш зэрэгцэн хөгжиж ирсэн олон жилийн түүхтэй алгоритм юм.<br />
<br />
Энэ алгоритмын үндсэн санаа нь ямар нэгэн <b>зорилго</b>д хүрэхэд шаардлагатай <b>шийдвэр гаргах</b> үйл явцыг <b>автоматжуулах</b> явдал юм.<br />
<br />
Бусад төрлийн машин сургах алгоритмуудаас ямар нэгэн сургагчийн хэлж өгсөн жишээнд найдахын оронд тухайн орчинтой шууд харилцан үйлчлэлд орж тэндээсээ суралцах явдлаараа ялгаатай.<br />
<br />
RL-д <b>агент</b> болон түүний байгаа <b>орчин</b> хоорондын харилцан үйлчлэлийг <b>төлөв</b>, <b>үйлдэл</b>, <b>шагнал</b> гэсэн ойлголтуудаар илэрхийлэн тодорхойлдог.<br />
<br />
Эдгээрийн тусламжтайгаар шалтгаан болон үр дагавар, тодорхой бус байдал, ямар нэгэн зорилгыг биелүүлэх ёстой гэдэг мэдрэхүйг загварчлах боломжтой болдог.<br />
<br />
Тухайн орчинд агент үйлдэл хийвэл үүний үр дүнд шинэ төлөв байдал бий болох бөгөөд бодлогын зорилгоос хамаарч шагналын оноо нь ч мөн адил бий болно.<br />
<br />
Өөрөөр хэлбэл RL нь тухайн нөхцөл байдалд ирээдүйд авч болох шагналыг хамгийн их байлгах үйлдлийг хэрхэн сонгох вэ, нөхцөл байдал болон үйлдэл хоорондын хамаарлыг хэрхэн бий болгох вэ гэдгийг сурдаг алгоритм гэж тодорхойлж болно.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg4SzAaREruxlJVWUpWyswr_qaBqxX3XLifDPqX9HZi75SLYoMFPyGI9m0bQ43zqqfqRzWU5-5X75fIHHmvQw3UugF-oVF-fHiGFjthYhAMcGsoaOLP2lJZvFHJ5fwq8QPmF493X88z_Q/s1600/Reinforcement+Learning%252C+simple+formulation.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="900" data-original-width="1600" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg4SzAaREruxlJVWUpWyswr_qaBqxX3XLifDPqX9HZi75SLYoMFPyGI9m0bQ43zqqfqRzWU5-5X75fIHHmvQw3UugF-oVF-fHiGFjthYhAMcGsoaOLP2lJZvFHJ5fwq8QPmF493X88z_Q/s400/Reinforcement+Learning%252C+simple+formulation.png" width="400" /></a></div>
<br />
<br />
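Ирээдүйд авч болох шагналын нийлбэрийг тооцохдоо ихэвчлэн discount factor (γ) хэрэглэн холын ирээдүйн шагналыг бага жинтэйгээр нэмдэг. Санааг харуулсан багахан ноорог (rewards болон gamma утгууд зохиомол):

```python
# Discount хийсэн нийт шагнал G = r0 + γ·r1 + γ²·r2 + ...
# утгыг ард талаас нь урагш нь эвхэж тооцох ноорог.
def discounted_return(rewards, gamma):
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# Жишээ: [1, 1, 1] шагнал, γ = 0.5 үед G = 1 + 0.5 + 0.25 = 1.75
```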
Мөн бусад төрлийн машин сурах алгоритмуудаас ялгаатай нь <b>exploration</b> болон <b>exploitation</b> хоорондын балансыг барих явдал байдаг.<br />
<br />
Урьд сурсан мэдлэг дээр суурилсан үйлдэл нь reward оноог ихэсгэхэд нилээн тустай боловч өөр оролдож үзэж байгаагүй үйлдэл сонгосноор илүү өндөр оноотой reward авах магадлалтай тул тэр үйлдлийг нээж олох шаардлагатай байдаг.<br />
<br />
Өмнөх сурсан мэдлэгээ хэрэглэхийг нь <b>exploitation</b> гээд байгаа харин шинээр мэдлэг олж авах, өөр хийж байгаагүй үйлдэл сонгох оролдож үзэх үйл явцыг нь <b>exploration</b> гэж нэрлэдэг.<br />
<br />
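Энэ балансыг барих хамгийн энгийн, түгээмэл арга бол ε-greedy стратеги: бага магадлалтайгаар санамсаргүй үйлдэл сонгож explore хийгээд, бусад үед нь сурсан мэдлэгээ exploit хийнэ. Доорх нь зөвхөн санааг харуулсан ноорог:

```python
import random

def epsilon_greedy(Q_values, epsilon, rng=random):
    # epsilon магадлалтайгаар санамсаргүй үйлдэл сонгоно (exploration),
    # бусад тохиолдолд хамгийн өндөр Q утгатай үйлдлийг сонгоно (exploitation)
    if rng.random() < epsilon:
        return rng.randrange(len(Q_values))
    return max(range(len(Q_values)), key=lambda a: Q_values[a])
```

Сургалтын явцад epsilon утгыг аажмаар бууруулж explore хийх хувийг багасгадаг практик түгээмэл байдаг.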
Тухайн төлөв, нөхцөл байдалд хийсэн ямар нэгэн үйлдэл нь эцсийн reward оноог өсгөхөд хэдий хэмжээний үр нөлөө үзүүлсэн бэ?<br />
Техникийн хэллэгээр энэ асуудлыг <b>credit assignment</b> гэж нэрлэдэг.<br />
<br />
RL алгоритм "credit assignment" асуудлыг тухайн орчин дэх төлөв бүрт кредит оноо(credit value) харгалзуулан оноож learning болон planning явцуудын тусламжтайгаар эдгээр оноог шинэчлэн тогтоох замаар шийдвэрлэдэг.<br />
<br />
Энэ оноог ойролцоолон өгч сурсан функцыг <b>Value function</b> гэдэг.<br />
<br />
Агент тухайн орчинд харилцан үйлчлэлд ороход төчнөөн хэмжээний шагнал авч байсан гэдэг мэдлэгийг суралцах явцыг нь <b>learning</b> гэдэг.<br />
<br />
Харин тухайн төлөв дээр ирэхэд олон боломжит үйлдлүүдээс аль нь нөгөөхөөсөө илүү дээр байсан бэ гэдэг кредит оноог тухайн төлөв дээр оноох үйл явцыг <b>planning</b> гэдэг.<br />
<br />
Тэхээр агентийн <b>эцсийн зорилго</b> бол авч болох шагналын оноог <b>хамгийн их байлгах</b> явдал юм.<br />
<br />
Агент тухайн нөхцөл байдалд шийдвэр гаргахдаа үргэлж хамгийн өндөр credit value-тэй үйлдлийг сонгох байдлаар ажилладаг. Учир нь урт хугацаандаа хамгийн их шагналын оноог цуглуулах болохоор тэр. Энэ шийдвэр гаргадаг функцыг <b>Policy function</b> гэж нэрлэдэг.<br />
<br />
<br />
<br />
<br />
Лавлагаа:<br />
<a href="http://incompleteideas.net/book/first/ebook/the-book.html">http://incompleteideas.net/book/first/ebook/the-book.html</a><br />
<br />
<br />Sharavsambuu Gunchinishhttp://www.blogger.com/profile/06950810883056147179noreply@blogger.com0Ulaanbaatar, Mongolia47.886398799999988 106.905743919.576164963821142 71.7494939 76.196632636178833 142.0619939tag:blogger.com,1999:blog-1457877875009527488.post-12211021844785777042020-02-09T03:25:00.003+08:002023-11-29T01:46:00.527+08:00Transformer, Self-attention механизмыг Tensorflow дээр хэрэгжүүлэх<a href="https://sharavaa.blogspot.com/2019/08/transformer.html">Өмнөх пост</a>од сүүлийн үед хүчээ аваад байгаа <a href="https://arxiv.org/abs/1706.03762">Transformer</a> хэмээх неорон сүлжээний архитектурын талаархи ойлголтоо тэмдэглэж авсан билээ. Тэгвэл одоо яг цаана нь юу болж байгааг илүү сайн ойлгохын тулд Tensorflow дээр алхам алхамаар нь хэрэгжүүлж харая.<br />
<br />
<a name='more'></a><br />
<br />
Эхлээд tensorflow-оо eager горимд ажиллахаар идэвхжүүлье.<br />
<pre class="prettyprint">import tensorflow as tf
tf.compat.v1.enable_eager_execution()
</pre>
<br />
Оролтын векторуудаа зарлая, NLP-ийн хувьд эдгээр векторууд нь үг тус бүрийг илэрхийлэх word embedding байж болно, starcraft тоглоомын хувьд нэг unit-ийнх нь skill-үүдийг илэрхийлсэн векторууд байж болно гэх мэтээр тухайн шийдэх гэж байгаа асуудлаасаа хамаараад юугаар ч төсөөлөх боломжтой.
<br />
<pre class="prettyprint">>>> inputs = tf.Variable([[1,0,1,0], [0,2,0,2], [1,1,1,1]], dtype=tf.float32)
>>> inputs
< tf.Variable 'Variable:0' shape=(3, 4) dtype=float32, numpy=
array([[1., 0., 1., 0.],
[0., 2., 0., 2.],
[1., 1., 1., 1.]], dtype=float32) >
</pre>
<br />
Өмнөх постыг хэрвээ уншсан бол энэ оролтын векторуудын цувааг query, key, value гэсэн гурван матрицуудын тусламжтайгаар өөр дундын dimension бүхий векторуудруу хувиргах шаардлагатай. Тэхээр эдгээр гурван хувиргалтын матрицуудаа тодорхойлоё.<br />
<pre class="prettyprint">>>> query_weights = tf.Variable([[1,0,1], [1,0,0], [0,0,1], [0,1,1]], dtype=tf.float32)
>>> query_weights
< tf.Variable 'Variable:0' shape=(4, 3) dtype=float32, numpy=
array([[1., 0., 1.],
[1., 0., 0.],
[0., 0., 1.],
[0., 1., 1.]], dtype=float32) >
</pre>
<pre class="prettyprint">>>> key_weights = tf.Variable([[0,0,1], [1,1,0], [0,1,0], [1,1,0]], dtype=tf.float32)
>>> key_weights
< tf.Variable 'Variable:0' shape=(4, 3) dtype=float32, numpy=
array([[0., 0., 1.],
[1., 1., 0.],
[0., 1., 0.],
[1., 1., 0.]], dtype=float32) >
</pre>
<pre class="prettyprint">>>> value_weights = tf.Variable([[0,2,0], [0,3,0], [1,0,3], [1,1,0]], dtype=tf.float32)
>>> value_weights
< tf.Variable 'Variable:0' shape=(4, 3) dtype=float32, numpy=
array([[0., 2., 0.],
[0., 3., 0.],
[1., 0., 3.],
[1., 1., 0.]], dtype=float32) >
</pre>
<br />
Оролтын векторуудаа энэ гурван матрицуудаар үржүүлж self-attention хийхэд бэлтгэе. <br />
<pre class="prettyprint">>>> queries = tf.matmul(inputs, query_weights)
>>> queries
< tf.Tensor: id=104, shape=(3, 3), dtype=float32, numpy=
array([[1., 0., 2.],
[2., 2., 2.],
[2., 1., 3.]], dtype=float32) >
</pre>
<pre class="prettyprint">>>> keys = tf.matmul(inputs, key_weights)
>>> keys
< tf.Tensor: id=100, shape=(3, 3), dtype=float32, numpy=
array([[0., 1., 1.],
[4., 4., 0.],
[2., 3., 1.]], dtype=float32) >
</pre>
<pre class="prettyprint">>>> values = tf.matmul(inputs, value_weights)
>>> values
< tf.Tensor: id=108, shape=(3, 3), dtype=float32, numpy=
array([[1., 2., 3.],
[2., 8., 0.],
[2., 6., 3.]], dtype=float32) >
</pre>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBi0aHeHcGMFdn_H_L_d7_Hc9PJP8HKU-oZqujE_WuICsC90XPXKwhfRX0Vpqd9WbOgYCPGJDiZG9mlyynR3Fz5gR2bd_DmcprOQuj_XrDNxoh0piPsMjWNTf1vqMJT6hTcHNxjSzukA/s1600/image-6.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="73" data-original-width="407" height="57" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBi0aHeHcGMFdn_H_L_d7_Hc9PJP8HKU-oZqujE_WuICsC90XPXKwhfRX0Vpqd9WbOgYCPGJDiZG9mlyynR3Fz5gR2bd_DmcprOQuj_XrDNxoh0piPsMjWNTf1vqMJT6hTcHNxjSzukA/s320/image-6.png" width="320" /></a></div>
Одоо self-attention хийх суурь нь бүрдэж байна. Self-attention бол дарааллын элемент бүр тэр дараалалд байгаа бусад элементүүдтэйгээ хэр хэмжээний хамааралтай байна вэ гэдэг оноонуудыг тооцсон матриц юм. Тиймээс эхлээд тэр хамаарлын оноонуудыг бий болгосон матриц үүсгэе. <br />
<pre class="prettyprint">>>> attention_scores = tf.matmul(queries, tf.transpose(keys))
>>> attention_scores
< tf.Tensor: id=112, shape=(3, 3), dtype=float32, numpy=
array([[ 2., 4., 4.],
[ 4., 16., 12.],
[ 4., 12., 10.]], dtype=float32) >
</pre>
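Нэг зүйлийг тэмдэглэхэд, эх Transformer paper-т эдгээр оноог key векторын dimension-ий квадрат язгуур болох √dₖ-д хувааж (scaled dot-product attention) softmax-ийг тогтворжуулдаг бол энд хялбарчлах үүднээс орхисон байгаа. Энэ scaling-ийг NumPy-гаар нооргилвол (утгууд нь дээрх queries, keys жишээнийх):

```python
import numpy as np

# Дээрх жишээний queries болон keys утгууд
queries = np.array([[1.0, 0.0, 2.0],
                    [2.0, 2.0, 2.0],
                    [2.0, 1.0, 3.0]])
keys    = np.array([[0.0, 1.0, 1.0],
                    [4.0, 4.0, 0.0],
                    [2.0, 3.0, 1.0]])

d_k = keys.shape[-1]
# Оноог sqrt(d_k)-д хуваавал dimension томрох үед softmax
# хэт ханаж gradient алдагдахаас сэргийлдэг
scaled_scores = queries @ keys.T / np.sqrt(d_k)
```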
Attention score матрицын мөр бүрээс нь softmax авж нийлбэр нь 1-тэй тэнцүү байдаг магадлалын тархалтууд бий болгоё.
<br />
<pre class="prettyprint">>>> softmaxed_attention_scores = tf.nn.softmax(attention_scores, axis=1)
>>> softmaxed_attention_scores
< tf.Tensor: id=114, shape=(3, 3), dtype=float32, numpy=
array([[6.3378938e-02, 4.6831051e-01, 4.6831051e-01],
[6.0336647e-06, 9.8200780e-01, 1.7986100e-02],
[2.9538720e-04, 8.8053685e-01, 1.1916770e-01]], dtype=float32) >
</pre>
Зиа эдгээр оноо юуг илэрхийлж байна вэ?<br />
Энэ softmaxed_attention_scores нь оролтын дарааллын урттай тэнцүү талтай квадрат матриц байна.<br />
(i, j)-р элемент дээрхи оноо нь дарааллын i-р элемент нь дарааллын j-р элементтэй хэр их хэмжээний хамааралтай, хэр хэмжээний анхаарал хандуулж байгааг илэрхийлсэн тоо байна.<br />
Одоо тэгэхээр энэ оноонуудын дагуу дарааллын value векторуудыг хувирган үржүүлж attend хийсэн шинэ векторуудын цувааг гаргаж авая.
<br />
<pre class="prettyprint">>>> attended_values = tf.matmul(softmaxed_attention_scores, values)
>>> attended_values
< tf.Tensor: id=116, shape=(3, 3), dtype=float32, numpy=
array([[1.936621 , 6.683105 , 1.5950683 ],
[1.9999939 , 7.963991 , 0.0539764 ],
[1.9997045 , 7.759892 , 0.35838926]], dtype=float32) >
</pre>
Энэ шинэ векторуудын цувааг янз бүрийн layer normalization хийх, residual layer нэмэх, дахин attend хийх, цаашлаад concat хийж dense layer-аар dimension буулгаад ангилалт хийх гэх мэтээр үргэлжлүүлээд ашиглах боломжтой.
<br />
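Энд дурдагдсан residual холболт болон layer normalization-ий санааг NumPy-гаар товчхон нооргилвол иймэрхүү харагдана (оролт болон attention-ий гаралтын утгууд зохиомол):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Мөр бүрийг дундаж нь 0, стандарт хазайлт нь 1 орчим болгож хэвийтгэнэ
    mean = x.mean(axis=-1, keepdims=True)
    std  = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

# Residual connection: давхаргын оролтыг гаралт дээр нь шууд нэмдэг
x        = np.array([[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]])  # оролт (зохиомол)
attended = np.array([[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]])  # attention гаралт (зохиомол)
out = layer_norm(x + attended)
```

Transformer дотор энэ хоёр техник sublayer бүрийн дараа давтагдан хэрэглэгддэг.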
<br />
Лавлагаа :<br />
<a href="https://www.youtube.com/watch?v=JIvx2k5dALc">Матрицын талаархи МУИС-ийн хичээл</a><br />
<a href="http://matrixmultiplication.xyz/">Матриц үржүүлэх</a><div><br /></div>Sharavsambuu Gunchinishhttp://www.blogger.com/profile/06950810883056147179noreply@blogger.com0tag:blogger.com,1999:blog-1457877875009527488.post-47042706034918657412019-08-19T09:52:00.004+08:002023-11-29T00:51:07.673+08:00Transformer архитектур гэж юу вэDeep Learning-н хувьд хийгдсэн хамгийн үндсэн чухал инновациуд гэвэл LSTM, CNN, Attention энэ гурав болно.<br />
<a name='more'></a><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLpABaUmoFjbfONOd4h_V-FYsK0SRhv74pfGz-ptdqrK-uzjnYfeDP9kSyYGIMqrRCXlUTH2SjBrKatJ7I5ubiGv8R200T9tUklwECMWUslgK1gBfA3cTczdtN0ItRPMMlTyJNnfG70w/s1600/a7bebda0a454359a18c239e212df778f.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLpABaUmoFjbfONOd4h_V-FYsK0SRhv74pfGz-ptdqrK-uzjnYfeDP9kSyYGIMqrRCXlUTH2SjBrKatJ7I5ubiGv8R200T9tUklwECMWUslgK1gBfA3cTczdtN0ItRPMMlTyJNnfG70w/s320/a7bebda0a454359a18c239e212df778f.jpg" width="320" /></a></div>
<br />
<br />
Сүүлийн үед параллельчилж бодуулж болдог hardware нөөцүүд дээр үр ашиг сайтай ажилладаг давуу талаас нь болж <a href="https://arxiv.org/abs/1706.03762">transformer архитектурын</a> хэрэглээ илт нэмэгдэж байна дээрээс нь NLP-ийн олон таскуудын оноог ахиулж мөн NLP-н хувьд pre-training хийж ашиглах өргөн боломжийг transformer олгож байна.<br />
<br />
Transformer нь тодорхой урттай векторуудын дарааллыг encode хийгээд түүнийг мөн тодорхой урттай векторуудын дараалалруу decode хийж хувиргадаг. Нэг дарааллыг нөгөө дараалалруу хувиргадаг архитектур юм.<br />
<br />
Дараалал хувиргадаг архитектурын хэрэглээ гэвэл хоёр хэлний хооронд машин орчуулга хийх, текст үүсгүүр бүтээх, NER олох, чат бот хийх гэх мэт төсөөлж болох юу л байна олон хэрэглээтэй.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0ej9ivmyqanH0r-fHpaj2n1Qxt97ubllf0JouJnovhUIIF-KSfduOiUkK1mjT82lgP0TAeJntQsFUCS5SSQBenWOQ8gYlHcpE-epU0UMs_5G06YWLvjBIuMabTgAHu2VZsy9OIyS7ng/s1600/transform20fps.gif" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="708" data-original-width="800" height="283" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0ej9ivmyqanH0r-fHpaj2n1Qxt97ubllf0JouJnovhUIIF-KSfduOiUkK1mjT82lgP0TAeJntQsFUCS5SSQBenWOQ8gYlHcpE-epU0UMs_5G06YWLvjBIuMabTgAHu2VZsy9OIyS7ng/s320/transform20fps.gif" width="320" /></a></div>
<br />
Transformer архитектурыг ойлгоё гэвэл <b>self-attention</b> хэмээх механизмийг ойлгох шаардлагатай. Ер нь үндсэн цөм нь гэхэд болно.<br />
<br />
<br />
Self-attention ийг тодорхой хувиргалтын дараалал байдлаар тайлбарлавал хэрэглээ нь тодорхой болно.<br />
<br />
<b>Хамгийн эхэнд</b> self-attention ажиллахдаа оролтын дарааллын гишүүн вектор бүрийг <b>query</b> вектор, <b>key</b> вектор, <b>value</b> вектор гурван векторуудруу тус бүрийн харгалзах <b>Wq</b>, <b>Wk</b>, <b>Wv</b> матрицуудын тусламжтайгаар хувиргаж буулгадаг.<br />
<br />
Машин орчуулга хийж байна гэж үзвэл оролтын дарааллын гишүүд нь тухайн үгийг илэрхийлэх <b>embedding вектор</b> байна.<br />
<br />
<b>Хоёр дахь алхам</b>, оролтын дарааллын гишүүн бүрээр гүйж байна гэж үзээд итераци бүрт тухайн гишүүний query векторийг дарааллын бусад гишүүдийнх нь key векторүүдтэй dot product буюу <b>скаляр үржвэр</b>ээр үржүүлж тус бүрийн оноог гаргана. Үүнийг өөрөөр query векторыг дарааллын бусад гишүүдийн key вектортой cosine similarity тооцож үржүүлж байна гэж бодож болох бөгөөд дарааллын гишүүд хоорондоо хэр хамааралтай вэ хэр зэрэг анхаарал хандуулах вэ гэдгийг мэдэх боломж олгодог. Нэг чиглэлрүү заагаад эхэлбэл илүү өндөр оноотой болж тухайн үг дарааллын тэр хэсэгт илүү их фокуслаж байна гэсэн үг юм.<br />
<br />
<b>Гурав дахь алхам</b> нь эдгээр оноонуудаас <b>softmax</b> авна. Өөрөөр хэлбэл нийлбэр нь 1-тэй тэнцүү магадлалын тархалтууд бий болно. Тухайн итерацийн гишүүн вектор нь бусад гишүүдтэйгээ хэр хамаарал холбогдолтой вэ гэдгийг заасан стандарт утгууд нь эндээс нь үүснэ гэсэн үг.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhqVxu1Eu4i3FtfWHjIX9rWXCPmnHWCbuGgzveWioNuKuh2UA4ii8zOZPn8wcCIpwoNSXwJH6IBEM-Ii01SQXkyIxtbeEwsZAL39nbLHutHJbOdLcwgYtKfqNQmdqhPoPexJwpn2y7VQ/s1600/image-6.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="73" data-original-width="407" height="57" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhqVxu1Eu4i3FtfWHjIX9rWXCPmnHWCbuGgzveWioNuKuh2UA4ii8zOZPn8wcCIpwoNSXwJH6IBEM-Ii01SQXkyIxtbeEwsZAL39nbLHutHJbOdLcwgYtKfqNQmdqhPoPexJwpn2y7VQ/s320/image-6.png" width="320" /></a></div>
<br />
<b>Дөрөв дэх алхам</b> нь энэ softmax-аас гарсан утгуудыг дарааллын бусад гишүүн бүрийн харгалзах <b>value вектор</b>тэй үржиж өгнө. Эндээс өндөр утгаар үржих тусам тухайн давталт хийж байгаа гишүүнтэй илүү их хамааралтай болж өндөр сигналтай болж байна гэж бодож болохоор. Хамааралгүй үгнүүд нь бага утгаар үржигдэж сигнал нь буурна өөрөөр хэлбэл бага анхаарал хандуулна гэсэн үг. Хэрвээ миний ойлгосон intuition буруу алдаатай бол засаж өгөөрэй please!<br />
<br />
<b>Тав дахь алхам</b> нь итерацийн үг болгоны хувьд өмнөх алхамд үржигдсэн value векторуудыг хооронд нь нэмнэ (softmax жингүүдийн нийлбэр 1 тул энэ нь жигнэсэн дундаж буюу weighted sum болно). Энэ гарсан вектор нь self-attention модулийн гаралт бөгөөд цааш нь тус бүрт нь feed-forward давхарга тавьж хэрэгцээтэй dimension бүхий гаралтаа үүсгээд явдаг.<br />
<br />
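Дээрх таван алхамыг нэг дор NumPy-гаар нооргилвол иймэрхүү харагдана (Wq, Wk, Wv болон оролтын векторуудын утгууд зохиомол, paper-т байдаг √dₖ scaling-ийг орхисон хялбарчилсан хувилбар):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv    # 1-р алхам: гурван проекц
    scores  = Q @ K.T                   # 2-р алхам: хамаарлын оноонууд
    weights = softmax(scores, axis=-1)  # 3-р алхам: softmax
    return weights @ V                  # 4, 5-р алхам: жигнэсэн нийлбэр

# Зохиомол оролт: 3 үгтэй дараалал, үг бүр 4 хэмжээст embedding
X  = np.array([[1.0, 0.0, 1.0, 0.0],
               [0.0, 2.0, 0.0, 2.0],
               [1.0, 1.0, 1.0, 1.0]])
Wq = np.eye(4, 3); Wk = np.eye(4, 3); Wv = np.eye(4, 3)  # зохиомол жингүүд
out = self_attention(X, Wq, Wk, Wv)  # (3, 3) хэлбэртэй attend хийсэн векторууд
```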
<br />
Компютерийн санах ойд програм хувьсагч хадгалахдаа тухайн хувьсагчийн хаяг мөн хувьсагчийн хадгалах утга гэсэн хаяг-утга хамааралтай хос дата хэрэглэдэг.<br />
<br />
Тэгвэл self-attention-ий хувьд дараалалд буй үгнүүдийг сая дурдсан шиг key-value гэсэн хослолоор хадгалсан байна гэж төсөөлж бас болно.<br />
<br />
Тухайн үг бусад үгнүүдтэйгээ хэр хамааралтай байна гэдэг оноог тооцохдоо үгэнд харгалзах query векторыг бусад үгнүүдийн key векторуудтэй cosine similarity тооцож оноожуулаад харгалзах value вектортой нь тухайн оноог үржүүлж дараа нь бүгдийг нь нэмэн weighted sum хийн attend хийсэн гаралтын вектороо гаргаж авдаг.<br />
<br />
Энэ процесс бүх үг бүрийн хувьд хийгдэх учраас дараалалд байгаа үгнүүд бүгдээрээ хоорондоо хэр зэрэг хамааралтай талаархи зүй тогтлыг машин сургалтын явцад олж чадна.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_DNulfs0i3JJS3tz2ivaDQEx20lhbnsAzunKYG3vKNzjTJLBVthG-AoH0gZeXC994BsiEWjnSqRbDmaWrsEqCNlV3ko9-FMKw0yfzroxMNtZAuT192-msA4fbYJw5sae8dR_p9J0uQA/s1600/1_5h3HHJh7kgezyOdTcRZc0A.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="378" data-original-width="437" height="276" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_DNulfs0i3JJS3tz2ivaDQEx20lhbnsAzunKYG3vKNzjTJLBVthG-AoH0gZeXC994BsiEWjnSqRbDmaWrsEqCNlV3ko9-FMKw0yfzroxMNtZAuT192-msA4fbYJw5sae8dR_p9J0uQA/s320/1_5h3HHJh7kgezyOdTcRZc0A.png" width="320" /></a></div>
<br />
Wq, Wk, Wv энэ гурван матрицаар хийгдэх үйлдлүүд дээрээс саяны attend хийх процессийг нийлүүлээд нэг head гэж нэрлэдэг. Үүн шиг 8 ширхэг эсвэл түүнээс олон head байж болно. Олон head хэрэглэвэл <b>multi-headed attention</b> гэж нэрлэгдэх бөгөөд гаралтуудыг нийлүүлээд томоохон хэмжээтэй матрицаар үр дүнг нь нэгтгэж болдог.<br />
<br />
Multi-headed attention хэрэглэхийн давуу тал нь тухайн үг дарааллын бусад гишүүдтэй олон ялгаатай байдлаар анхаарал хандуулж чаддаг болдог. Weight-үүдийг мэдээж эхний үед random-оор цэнэглэх бөгөөд сургалтын явцад өөр өөр behaviour үзүүлдэг болно.<br />
<br />
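Head-үүдийн гаралтыг нэгтгэдэг хэсгийг нооргилвол: head бүрийн гаралтыг сүүлийн тэнхлэгээр нь залгаад (concat) нэмэлт гаралтын матрицаар буулгадаг (энд W_o гэж нэрлэв; head-үүдийн гаралт болон W_o-ийн утгууд бүгд зохиомол):

```python
import numpy as np

# Хоёр head-ийн гаралт (зохиомол утгууд), тус бүр (3, 2) хэлбэртэй
head_1 = np.ones((3, 2))
head_2 = np.zeros((3, 2))

# Сүүлийн тэнхлэгээр нь залгаад (3, 4) болгоно
concat = np.concatenate([head_1, head_2], axis=-1)

# Гаралтын проекцын матрицаар хэрэгцээт dimension-руу буулгана
W_o    = np.ones((4, 3)) * 0.5
output = concat @ W_o  # (3, 3) — нэгтгэсэн эцсийн гаралт
```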
<br />
Self-attention ийн нэг онцлог гэвэл оролтын дараалал дахь векторууд ямар байрлалд байх нь эцсийн гаралтад нөлөөгүй.<br />
<br />
Хэлний модель мэтийн өмнөх үгнүүд нь дараагийн үгнүүддээ нөлөөлдөг нөхцөлт дарааллын хувьд байрлалын мэдээлэл маш чухал.<br />
<br />
Тиймээс байрлалын мэдээллийг энкодлосон векторыг дарааллын embedding вектор бүрт нь харгалзуулан бодож нэмдэг. Эндээс тухайн дараалал дахь үг байрлалын сигналтай болж авдаг. Өөрөөр хэлбэл дарааллаас хамаарсан нөхцөлдүүлэгүүд биелэх боломжтой болдог.<br />
<br />
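Эх paper-т энэ байрлалын мэдээллийг sin/cos функцууд дээр суурилсан positional encoding-оор үүсгэж embedding дээр нэмдэг. NumPy-гаар нооргилвол (seq_len, d_model утгууд зохиомол):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # pos — дараалал дахь байрлал, i — embedding-ийн dimension-ий индекс
    pos = np.arange(seq_len)[:, None]
    i   = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    # Тэгш индекст sin, сондгой индекст cos хэрэглэнэ
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

pe = positional_encoding(seq_len=5, d_model=8)  # (5, 8) хэлбэртэй матриц
```

Энэ матрицыг дарааллын embedding матриц дээр элементээр нь нэмснээр үг бүр байрлалын сигналтай болдог.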
Өмнөх layer-ийн weight-үүдийг одоогийн layer дээрээ нэмж хэрэглэдэг <b>residual connection</b>, мөн дээрээс нь <b>layer normalization</b> гэх мэтийн нарийн бусад техникүүд transformer дотор бий. Эдгээр нь мэдээж гүйцэтгэл сайжруулах чухал техникүүд, transformer болон self-attention лүү илүү төвлөрөх үүднээс түрдээ орхиж бичлээ.<br />
<br />
Саяны энэ бүх болсон процесс буюу encoding давхаргыг мэдээж хооронд нь олноор давхарлаж ашиглаж болно. Хэдийг давхарлах эсэх нь тухайн шийдэх гэж байгаа проблем болон архитектурын дизайныг гаргаж байгаа хөгжүүлэгчээс хамаарна. Хэд ч байж болно.<br />
<br />
<b>Encoder-decoder</b> проблемд encoder-оос гарсан гаралтын векторуудыг decoder-т хэрэглэхдээ тусдаа Wk болон Wv матрицуудаар үржиж харгалзан K болон V гэсэн attention векторуудын олонлог болгон хувиргаж хэрэглэдэг. decoder-т дараалал үүсгэхдээ өмнөх encoder layer-ийн элементүүдэд хэр их анхаарал хандуулах вэ гэдгийг эдгээр векторуудыг хэрэглэн гаралтын элемент тус бүрийг нь нэг нэгээр нь predict хийлгэн үүсгэдэг. Бүгдээрээ пир хийтэл нэг дор гараад ирдэггүй гэсэн үг.<br />
<br />
<b>Decoder</b> ажиллахдаа оролтондоо хангалттай хэмжээний дараалал аваад predict хийх байрлалуудаа бүгдийг нь mask-лаж тусгай түлхүүр үгнүүдээр дүүргэдэг.<br />
<br />
Decoder-ийн гаралт нь predict хийх үгэнд ашиглах хялбар softmax давхарга байна. Энэ predict хийсэн үгээ decoder-ийн оролтын дараалалдаа залгаж бөглөөд дараагийн үгээ predict хийгээд дахин залгаж бөглөөд гэх мэтээр үг үгээр циклэдэж зогсох нөхцөл заасан token үүсэх хүртэл нь decode хийнэ гэсэн үг.<br />
<br />
Товчхондоо ийм.<br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhrVoGTjs6gYjd7tfj4Qo7YtBe0aKgcUfzeebbNW9_TjQIjLfvori6F5L0xpPf_fTVBk4-nzR7-yAiS6lTLitVhTpD3LXNfYn9I3XbrNzrNnAK0z_1t5bjMAMaI1nWgZG3GMVMMyLbJQA/s1024/transformer.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="602" data-original-width="1024" height="376" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhrVoGTjs6gYjd7tfj4Qo7YtBe0aKgcUfzeebbNW9_TjQIjLfvori6F5L0xpPf_fTVBk4-nzR7-yAiS6lTLitVhTpD3LXNfYn9I3XbrNzrNnAK0z_1t5bjMAMaI1nWgZG3GMVMMyLbJQA/w640-h376/transformer.png" width="640" /></a><br /><div><br /></div><div>Transformer архитектур хэрэглэж латинаар бичсэн монгол үгнүүдийг крилл монгол үгрүү буулгадаг жишээ болгож бичсэн байгаа <a href="https://colab.research.google.com/drive/10Eq_VvR84oEOBUK5EflvAB35ZcrlQwGm">энэ холбоос</a>оос үзээрэй.<br />
<br />
Лавлагаа :<br />
<a href="http://jalammar.github.io/illustrated-transformer/">http://jalammar.github.io/illustrated-transformer/</a><br />
<a href="https://nostalgebraist.tumblr.com/post/185326092369/the-transformer-explained">https://nostalgebraist.tumblr.com/post/185326092369/the-transformer-explained</a><br />
<br />
<center>
<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" frameborder="0" height="315" src="https://www.youtube.com/embed/z1xs9jdZnuY" width="560"></iframe>
</center>
<br />
<br />
<br />
<br />
<br /></div><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Sharavsambuu Gunchinishhttp://www.blogger.com/profile/06950810883056147179noreply@blogger.com0tag:blogger.com,1999:blog-1457877875009527488.post-61938741204210243852019-08-08T04:16:00.003+08:002022-02-10T06:56:57.984+08:00What is Negative Sampling?Suppose we are training a word2vec model. To do that, we slide a window of fixed length over a large amount of text and pick out words.<br />
<a name='more'></a><br />
<br />
Of the words in the window, the middle word is used as the input and the surrounding words as labels, and a three-layer, autoencoder-like neural network performs a softmax regression.<br />
<br />
Three layers: input, hidden, and output. The size of the hidden layer becomes the word2vec dimension.<br />
<br />
The input and output layers each have a tensor shape of one by the number of unique words in the dataset.<br />
<br />
For training, words are represented with one-hot encoding; that is, the vector element at the word's position is 1 and all the others are 0.<br />
<br />
After training, the output layer is discarded and the hidden layer's output is used as the word2vec embedding.<br />
<br />
Now let's state the problem. If there are 10000 unique words and the hidden layer size is 300, the hidden layer weights are 10000x300 and the output layer weights are 300x10000.<br />
<br />
In other words, each layer has 3 million parameters to train, which makes backpropagation enormously expensive.<br />
<br />
One method that cuts down this computation is called <b>negative sampling</b>.<br />
<br />
If you look at the weights being trained in each layer, they are really just big lookup tables keyed by word.<br />
<br />
With negative sampling, only the weights tied to the current word are updated during training.<br />
<br />
That is, only the weights connected to the neuron marked 1 in the label's one-hot representation are updated. This label is called the "positive sample".<br />
<br />
On top of that, five words are chosen at random, i.e. their corresponding positions are chosen, and the weights tied to them are updated as well. Backpropagation then starts from the 0 values at those five positions. These are called "negative samples".<br />
<br />
So only the weights tied to five random 0 positions plus the single 1 position, six positions in total, get updated.<br />
<br />
If the hidden layer size is 300, then only 300x6=1800 weights are updated in the output layer.<br />
<br />
Instead of burning computation on 3 million parameters, we train 1800 of them, a mere 0.06 percent.<br />
<br />
For the hidden layer, only the weights tied to the input word's position are updated.<br />
<br />
So negative sampling is a mechanism that saves computation when training neural networks on data with a huge number of classes. It is by no means limited to word2vec.<br />
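The bookkeeping above can be checked with a small numpy sketch. The vocabulary size, hidden size, and the hard-coded sample indices are illustrative only, not tied to any particular word2vec implementation.

```python
import numpy as np

# Illustrative dimensions matching the example in the text
vocab_size, hidden = 10_000, 300
W_out = np.zeros((hidden, vocab_size))   # output-layer weights (a big lookup table)

positive = 42                            # position of the label word (the 1)
negatives = [7, 123, 4567, 8910, 9999]   # five randomly chosen words (the 0s)
touched = [positive] + negatives

# only the weight columns of these 6 words would be updated by backprop
n_updated = hidden * len(touched)
print(n_updated)                          # 1800
print(100 * n_updated / W_out.size)       # 0.06 (percent of all output weights)
```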
<br />
<br />
<br />
<br />Sharavsambuu Gunchinishhttp://www.blogger.com/profile/06950810883056147179noreply@blogger.com0tag:blogger.com,1999:blog-1457877875009527488.post-56183594671079788752019-01-31T05:03:00.002+08:002021-11-23T16:40:26.418+08:00Operations used in tradingI used to wonder about various financial expressions such as going long and going short. I decided to write down my understanding of them.<br />
<br />
<a name='more'></a><br />
<br />
So, people profit from stocks in two ways: longing and shorting.<br />
<br />
<span style="font-size: large;">What is going long?</span><br />
<br />
This means <b>buying</b> a stock while its price is low and <b>selling it back</b> when the price rises, taking the difference as profit. This outlook is called bullish; in other words, being highly confident that a stock's price will keep rising is called being bullish.<br />
<br />
Buy low at $29 (Buy to Open Position)<br />
Sell high at $36 (Sell to Close Position)<br />
<br />
- $29 (negative because you are buying the stock, a debit)<br />
+ $36 (positive because you are selling it back, a credit)<br />
-------<br />
= $7 (meaning a profit of 7 dollars per share)<br />
<br />
<br />
<span style="font-size: large;">What is going short?</span><br />
<br />
This is the process of <b>selling</b> a stock while its price is high and <b>buying it back</b> when the price falls.<br />
Which raises the question: how do you sell a stock you don't own?<br />
When you place the sell order you can borrow the shares from your broker, and selling shares you do not own is called <b>short selling</b>.<br />
This process is of course automated.<br />
Since the shares are borrowed, you naturally have to return them to your broker.<br />
To do that, you buy the shares back once the price has dropped.<br />
Being highly confident that a stock's price will keep falling is called being bearish.<br />
<br />
Sell high at $26 (Sell to Open Position)<br />
Buy low at $20 (Buy to Close Position)<br />
<br />
+ $26<br />
- $20<br />
-------<br />
= $6 (meaning a profit of 6 dollars per share)<br />
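The per-share arithmetic of the two examples above can be written as a tiny helper. The function name and sign convention (debits negative, credits positive) are illustrative, not a standard API.

```python
def pnl_per_share(open_price, close_price, side):
    """Per-share profit/loss for a round trip."""
    if side == "long":   # buy to open, sell to close
        return close_price - open_price
    if side == "short":  # sell to open, buy to close
        return open_price - close_price
    raise ValueError(side)

print(pnl_per_share(29, 36, "long"))   # 7
print(pnl_per_share(26, 20, "short"))  # 6
```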
<br />
To be able to short sell, you need to open a margin account with a broker.<br />
If you borrow shares from a broker, you have to pay interest in return.<br />
There is no fixed time limit, but the longer you hold, the more interest you pay.<br />
At low-cost brokers this interest runs around 2-3%.<br />
<br />
In practice, a short selling position rarely lasts longer than one or two weeks, so the broker's interest usually doesn't affect the account to any significant degree.<br />
<br />
<br />
In this way traders can profit in both directions, whether a stock's price goes up or down.<br />
<br />
<br />
[Added later]<br />
<br />
<span style="font-size: large;">ROI (Return On Investment), or calculating the profit on your invested capital</span><br />
<br />
ROI can differ depending on whether you bought and sold the stock with your own money or used margin.<br />
Using margin means making a profit with other people's money.<br />
<br />
For example, if you buy 100 shares while the price is $29, the total <b>investment</b> is $29x100=$2900.<br />
Going long, i.e. selling once the price has risen to $36, yields $36x100=$3600.<br />
The <b>profit</b> is therefore $3600-$2900=$700.<br />
Calculating the profit against the initial investment,<br />
the <b>ROI</b> is $700/$2900=<b>24%</b>.<br />
This is practically the same as the stock's price increase, ($36-$29)/$29=24%.<br />
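The ROI calculation above, as a small helper (the function name is illustrative):

```python
def roi(buy_price, sell_price, shares=100):
    """Return on investment for a simple unleveraged long trade."""
    invested = buy_price * shares
    profit = (sell_price - buy_price) * shares
    return profit / invested

print(f"{roi(29, 36):.0%}")  # 24%
```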
<br />
Yet some good traders manage to keep their ROI at 20%, 50%, or even 100% when the stock price has risen only 5%.<br />
To reach such a high ROI they use <b>leverage</b>.<br />
In other words, investing on margin, that is, investing other people's money to amplify your profit, is what using leverage means.<br />
<br />
<br />
<span style="font-size: large;">Magnifying your ROI with leverage</span><br />
<br />
There are two ways to magnify ROI using leverage.<br />
<br />
The first is to use a margin account.<br />
When opening an account with a broker you choose between a <b>cash account</b> and a <b>margin account</b>.<br />
A <b>cash account</b>, naturally, lets you buy stocks with your own money.<br />
A <b>margin account</b>, on the other hand, is an account in which every $1 you hold can buy $2 worth of stock. Some brokers even offer $5, meaning your buying power is multiplied five-fold.<br />
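As an illustrative sketch of how the buying-power multiple translates into ROI (reusing the $29 to $36 example above, and ignoring margin interest; `leveraged_roi` is a hypothetical helper, not a broker formula):

```python
def leveraged_roi(buy, sell, cash, leverage=2):
    """ROI on your own cash when buying power is cash * leverage."""
    shares = (cash * leverage) / buy   # buying power determines share count
    profit = (sell - buy) * shares
    return profit / cash               # measured against YOUR money only

print(f"{leveraged_roi(29, 36, 2900, 1):.0%}")  # 24% without leverage
print(f"{leveraged_roi(29, 36, 2900, 2):.0%}")  # 48% with 2x margin
```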
<br />
<br />
<br />
<br />Sharavsambuu Gunchinishhttp://www.blogger.com/profile/06950810883056147179noreply@blogger.com0tag:blogger.com,1999:blog-1457877875009527488.post-13854149246315986822017-08-23T18:01:00.002+08:002021-08-19T21:06:56.413+08:00Tensorflow, ConvolutionAbout the concept of convolution.<br />
<br />
<a name='more'></a><br />
In machine learning, convolution is the operation of filtering a matrix by applying a particular kernel to it, and it is used very widely in deep learning algorithms and architectures.
<br />
<br />
Understanding this operation also helps with understanding a famous deep learning algorithm, the Convolutional Neural Network (ConvNet).<br />
<br />
To understand convolution, you first need to know what a Gaussian kernel is.
<br />
<br />
A Gaussian kernel can be pictured as a magnifying glass sliding over an image, with an influence that fades outward from the center of the glass.<br />
<br />
After applying this kernel, the image comes out blurred. Try googling the keyword "blurred image".
<br />
<br />
Let's perform the convolution step by step.
<br />
<br />
First, load and display an image.
<br />
<br />
<pre class="prettyprint"><code class="language-python">from skimage import data
import numpy as np
from matplotlib import pyplot as plt
img = data.camera().astype(np.float32)
plt.imshow(img, cmap='gray')
plt.show()</code></pre>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhw5ZBZXwF9b6hAvBGFYXoGoHlbov-mcIupS8s8L5vlX0vjKDMskjSK46ivf1KXGISStKTLSXvkYq3-TLDY0UtyVBabKnTYT9f18JRvsJC9_y0qXJt6Y_PjpoJCmkMCzlxsDzomb38Vww/s1600/Camera-man.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="480" data-original-width="640" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhw5ZBZXwF9b6hAvBGFYXoGoHlbov-mcIupS8s8L5vlX0vjKDMskjSK46ivf1KXGISStKTLSXvkYq3-TLDY0UtyVBabKnTYT9f18JRvsJC9_y0qXJt6Y_PjpoJCmkMCzlxsDzomb38Vww/s640/Camera-man.png" width="640" /></a></div>
Image data has three dimensions: height, width, and channels. Channels here refers to RGB.<br />
<br />
To do a 2D convolution in Tensorflow, the image data has to be 4-dimensional. If you stack several images into a single numpy array, it has the 4 dimensions [number of images, height, width, channels].<br />
<br />
We will use a single image with a height, a width, and one channel, so the loaded image has to be brought into that shape. numpy's reshape function can do this.<br />
<pre class="prettyprint"><code class="language-python">img_4d = img.reshape([1, img.shape[0], img.shape[1], 1])</code></pre>
<br />
In Tensorflow it looks like
<br />
<pre class="prettyprint"><code class="language-python">>>> img_4d = tf.reshape(img, [1, img.shape[0], img.shape[1], 1])
>>> img_4d.get_shape().as_list()
[1, 512, 512, 1]
</code></pre>
<br />
The kernel is also 4-dimensional, but the order of the dimensions is different:<br />
[kernel height, kernel width, number of image channels, number of filters]<br />
<br />
Before creating a Gaussian kernel, let's create and plot a Gaussian curve.<br />
<pre class="prettyprint"><code class="language-python">import numpy as np
import tensorflow as tf
from matplotlib import pyplot as plt
sess = tf.InteractiveSession()
x = tf.linspace(-3.0, 3.0, 100)
# Gaussian curve
mean = 0.0
sigma = 1.0
z = (tf.exp(tf.negative(tf.pow(x - mean, 2.0) /
                        (2.0 * tf.pow(sigma, 2.0)))) *
     (1.0 / (sigma * tf.sqrt(2.0 * 3.1415))))
plt.plot(z.eval())
plt.show()
</code></pre>
<br />
It produces a curve like this; don't worry about the formula for now.
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzufLzVlEj1dV8d9hL54FoG28-rF9enai62ZEbQ1tSOcsRt7X3wcLMeEFF5EuNW9eRlcymaN_09krb_dJrlmymCwkUfj77ihaTgbB4MXfBh19HQGrClSW7o7YbeFLPQ4TZF9k-WxY8vQ/s1600/Gaussian-curve.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="480" data-original-width="640" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzufLzVlEj1dV8d9hL54FoG28-rF9enai62ZEbQ1tSOcsRt7X3wcLMeEFF5EuNW9eRlcymaN_09krb_dJrlmymCwkUfj77ihaTgbB4MXfBh19HQGrClSW7o7YbeFLPQ4TZF9k-WxY8vQ/s640/Gaussian-curve.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The Gaussian curve</td></tr>
</tbody></table>
<br />
Using this curve, let's build a two-dimensional Gaussian kernel.<br />
The curve's vector can be treated as a one-dimensional matrix; multiplying it by the matrix obtained from its own transpose produces a 2D Gaussian kernel.<br />
<pre class="prettyprint"><code class="language-python">import numpy as np
import tensorflow as tf
from matplotlib import pyplot as plt
sess = tf.InteractiveSession()
# generate 100 numbers between -3 and +3
x = tf.linspace(-3.0, 3.0, 100)
# Gaussian curve
mean = 0.0
sigma = 1.0
z = (tf.exp(tf.negative(tf.pow(x - mean, 2.0) /
                        (2.0 * tf.pow(sigma, 2.0)))) *
     (1.0 / (sigma * tf.sqrt(2.0 * 3.1415))))
# number of elements in the Gaussian curve
ksize = z.get_shape().as_list()[0]
# obtain the 2D Gaussian kernel
z_2d = tf.matmul(tf.reshape(z, [ksize, 1]), tf.reshape(z, [1, ksize]))
# plot the Gaussian kernel
plt.imshow(z_2d.eval())
plt.show()</code></pre>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiF3xV6H5n4yjXkPkQJSfwDPd8SgnMVUF7aI6O-FngjDtPO6pOoq-0VYU3mGO-AXMLCIwGekeeT6nm1ZbuFCFXZDlDEzQK_eNgQ9_aQNXFyx20FFpJf7TJNHud55dHC_lpv8JSxr0eJRQ/s1600/Gaussian-2d-kernel.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="480" data-original-width="640" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiF3xV6H5n4yjXkPkQJSfwDPd8SgnMVUF7aI6O-FngjDtPO6pOoq-0VYU3mGO-AXMLCIwGekeeT6nm1ZbuFCFXZDlDEzQK_eNgQ9_aQNXFyx20FFpJf7TJNHud55dHC_lpv8JSxr0eJRQ/s640/Gaussian-2d-kernel.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The 2D Gaussian kernel</td></tr>
</tbody></table>
<br />
Alright, now the 2D image is ready and so is the 2D Gaussian kernel, so everything is in place to run the convolution, i.e. the filtering operation.<br />
<pre class="prettyprint"><code class="language-python">from skimage import data
from matplotlib import pyplot as plt
import numpy as np
import tensorflow as tf
sess = tf.InteractiveSession()
# load the image
img = data.camera().astype(np.float32)
# reshape the image to 4 dimensions
# [#Images x H x W x #Channels]
img_4d = tf.reshape(img, [1, img.shape[0], img.shape[1], 1])
# generate 100 numbers spread between -3 and 3
x = tf.linspace(-3.0, 3.0, 100)
# Gaussian curve
mean = 0.0
sigma = 1.0
z = (tf.exp(tf.negative(tf.pow(x - mean, 2.0) /
                        (2.0 * tf.pow(sigma, 2.0)))) *
     (1.0 / (sigma * tf.sqrt(2.0 * 3.1415))))
# number of elements in the Gaussian curve
ksize = z.get_shape().as_list()[0]
# obtain the 2D Gaussian kernel
z_2d = tf.matmul(tf.reshape(z, [ksize, 1]), tf.reshape(z, [1, ksize]))
# reshape the kernel to 4 dimensions for the convolution
# [H x W x #Input channels x #Output channels]
z_4d = tf.reshape(z_2d, [ksize, ksize, 1, 1])
# apply the convolution filter
convolved = tf.nn.conv2d(img_4d, z_4d, strides=[1, 1, 1, 1], padding='SAME')
# run the computation graph in Tensorflow
res = convolved.eval()
# plot the result
plt.imshow(np.squeeze(res), cmap='gray')
plt.show()
</code></pre>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNDRgdLqyZzP2HoqkTPJTg23wDbs5ASAwkeAaCpvmr4DBJUouNZ9st15QB_O74mXrwrTsjlk-BUoZZTCy1Ccx_TqiNEmBYLRrANV4Yxbk8vAOXIZD6_PBxAZSOqXvvbnfNe0F8AaD5og/s1600/Convolved-by-Gaussian-kernel.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="480" data-original-width="640" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNDRgdLqyZzP2HoqkTPJTg23wDbs5ASAwkeAaCpvmr4DBJUouNZ9st15QB_O74mXrwrTsjlk-BUoZZTCy1Ccx_TqiNEmBYLRrANV4Yxbk8vAOXIZD6_PBxAZSOqXvvbnfNe0F8AaD5og/s640/Convolved-by-Gaussian-kernel.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The result after convolving the image</td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi00E4AnisepahaBseKlctY3ImWIWLSiHPgrt7lPbVBsaI3NwvHY08GsjBl954ni_Qr0hcsDxNkc4rO1O1ZdwXYf4kBU8WQJKhTe6R3YROxQitS6ZtvZE8YIBl_jfvEHtVKzRjh8X4eqw/s1600/convolving-animation.gif" imageanchor="1"><img border="0" data-original-height="272" data-original-width="480" height="362" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi00E4AnisepahaBseKlctY3ImWIWLSiHPgrt7lPbVBsaI3NwvHY08GsjBl954ni_Qr0hcsDxNkc4rO1O1ZdwXYf4kBU8WQJKhTe6R3YROxQitS6ZtvZE8YIBl_jfvEHtVKzRjh8X4eqw/s1600/convolving-animation.gif" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The convolution operation shown as an animation looks roughly like this</td></tr>
</tbody></table>
<br />
The operation shown in the animation above is performed by the line<br />
<pre class="prettyprint"><code class="language-python">convolved = tf.nn.conv2d(img_4d, z_4d, strides=[1, 1, 1, 1], padding='SAME')</code></pre>
where two new parameters, strides and padding, have been set.<br />
<br />
<b>Strides</b> controls by how many pixels the filter kernel is shifted as it moves over the image. If it shifts by 1, the output size is almost the same as the input's. If it shifts by 2, the output is half the size.<br />
<br />
<b>Padding</b> takes the value 'VALID' or 'SAME'. With 'VALID', the kernel never crosses the edge of the image as it slides, so the output shrinks slightly; with 'SAME', the kernel does cross the edge and the output size matches the input exactly. The following video makes this even clearer.<br />
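The output sizes described above follow TF's documented formulas for 'SAME' and 'VALID' padding, which can be checked with plain arithmetic (no TF needed; `conv_output_size` is a helper written for this sketch):

```python
import math

def conv_output_size(in_size, ksize, stride, padding):
    # TF's documented output-size formulas for 2D convolution
    if padding == "SAME":
        return math.ceil(in_size / stride)
    if padding == "VALID":
        return math.ceil((in_size - ksize + 1) / stride)
    raise ValueError(padding)

# 512x512 image, 100x100 Gaussian kernel, as in the example above
print(conv_output_size(512, 100, 1, "SAME"))   # 512 - matches the input
print(conv_output_size(512, 100, 1, "VALID"))  # 413 - slightly smaller
print(conv_output_size(512, 100, 2, "SAME"))   # 256 - halved by stride 2
```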
<br />
<br />
<center>
<iframe allowfullscreen="" frameborder="0" height="315" src="https://www.youtube.com/embed/jajksuQW4mc" width="560"></iframe>
</center>
<br />
For much more detail, see <a href="https://www.kadenze.com/courses/creative-applications-of-deep-learning-with-tensorflow-iv">Kadenze's Deep Learning course</a>.
Sharavsambuu Gunchinishhttp://www.blogger.com/profile/06950810883056147179noreply@blogger.com0tag:blogger.com,1999:blog-1457877875009527488.post-33516605911585256692017-08-14T03:03:00.000+08:002021-08-19T21:01:36.738+08:00Tensorflow notesRight now a new wave is sweeping through the world of computer science: computers used to have to be programmed by humans, but now, rather than being programmed, they are being taught. You could even say a whole new way of interacting with computers is emerging. Everyone who has sensed this is now studying the field of machine learning.<br />
<a name='more'></a><br />
<span style="font-size: large;">Tensorflow fundamentals</span><br />
<br />
The biggest difference between Tensorflow and numerical libraries like numpy is that Tensorflow's operations are symbolic. In traditional programming you declare a variable, store a value in it, apply some operation, and mutate the value directly; in symbolic programming you build a graph between operations, and that graph is compiled and stored so it can be run later. A compiled graph can be run on a CPU or a GPU interchangeably; the symbol is an abstraction that frees you from worrying about where it will run. Thanks to symbolic programming, Tensorflow can perform many operations that numpy and its kind cannot do directly, starting with <a href="https://en.wikipedia.org/wiki/Automatic_differentiation">automatic differentiation</a> (symbolic differentiation).<br />
<br />
Let's multiply two matrices filled with random values using the numpy library.<br />
<pre class="prettyprint"><code class="language-python">import numpy as np
x = np.random.normal(size=[10, 10])
y = np.random.normal(size=[10, 10])
z = np.dot(x, y)
print(z)
</code></pre>
<br />
Now let's do exactly the same operation in Tensorflow
<br />
<pre class="prettyprint"><code class="language-python">import tensorflow as tf
x = tf.random_normal([10, 10])
y = tf.random_normal([10, 10])
z = tf.matmul(x, y)
sess = tf.Session()
z_val = sess.run(z)
print(z_val)
</code></pre>
<br />
Instead of executing the computation immediately and copying the result into the output variable z, as numpy does, Tensorflow returns only a node of the graph (a Tensor) that represents the result; if you try to print z directly, it returns something like this.
<br />
<pre>Tensor("MatMul:0", shape=(10, 10), dtype=float32)
</pre>
<br />
So to actually evaluate the tensor's value, you have to create a session object and run the graph with its Session.run() function.<br />
<br />
Another example helps to see why symbolic computation is powerful.<br />
f(x) = 5x^2 + 3<br />
Suppose we have this curve; how can we approximate f(x) without knowing its parameters?<br />
g(x, w) = w0 x^2 + w1 x + w2<br />
Let's define this parametric function. It takes the input x and the hidden parameters w; our goal is to find hidden parameters such that g(x, w) ≈ f(x). We can do that by minimizing a loss function. Define the loss as:<br />
L(w) = (f(x) - g(x, w))^2<br />
We can minimize it by computing the average gradient of L(w) with respect to w at random points and moving the values in the opposite direction. Here is how to code this in Tensorflow<br />
<pre class="prettyprint"><code class="language-python">import numpy as np
import tensorflow as tf
# placeholders are used to pass values from the python script
# into the operations of the Tensorflow graph
# create two placeholders, for the input feature x and the output y
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
# assume the function we are looking for is a 2nd-degree polynomial
# allocate a 3-element vector to hold its coefficients
# this variable is automatically filled with random values
w = tf.get_variable("w", shape=[3, 1])
# define yhat to serve as our approximation of y
f = tf.stack([tf.square(x), x, tf.ones_like(x)], 1)
yhat = tf.squeeze(tf.matmul(f, w), 1)
# define the loss as the l2 distance between the approximation of y
# and its true value
loss = tf.nn.l2_loss(yhat - y) + 0.1 * tf.nn.l2_loss(w)
# use the Adam optimizer with a learning rate of 0.1
# to minimize the loss
train_op = tf.train.AdamOptimizer(0.1).minimize(loss)
def generate_data():
    x_val = np.random.uniform(-10.0, 10.0, size=100)
    y_val = 5 * np.square(x_val) + 3
    return x_val, y_val
sess = tf.Session()
# Since we are using several variables, we first need to initialize them
sess.run(tf.global_variables_initializer())
for _ in range(1000):
    x_val, y_val = generate_data()
    _, loss_val = sess.run([train_op, loss], {x: x_val, y: y_val})
    print(loss_val)
print(sess.run([w]))
</code></pre>
<br />
Running this code returns a vector very close to 5x^2 + 3, i.e. <b>5</b>x^2 + <b>0</b>x^1 + <b>3</b>x^0 (the constant term lands near 3.45 rather than 3 because of the small l2 regularizer on w).
<br />
<pre>[4.9924135, 0.00040895029, 3.4504161]
</pre>
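For comparison, the same toy quadratic can be recovered directly with numpy's least-squares polynomial fit, with no gradient descent at all (this is an alternative sketch, not part of the Tensorflow example above):

```python
import numpy as np

# same synthetic data as generate_data() in the example above
x_val = np.random.uniform(-10.0, 10.0, size=100)
y_val = 5 * np.square(x_val) + 3

# least-squares fit of a degree-2 polynomial; coefficients come back
# highest degree first
coeffs = np.polyfit(x_val, y_val, deg=2)
print(coeffs)  # close to [5, 0, 3]
```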
<br />
This is only a small part of what Tensorflow can do. Tensorflow can optimize a large neural network with millions of parameters in just a few lines of code. On top of that, it automatically takes care of distributing the computation across multiple devices, CPUs, threads, and platforms to run it fast.<br />
<br />
For further detail, see the link <a href="https://github.com/vahidk/EffectiveTensorflow">https://github.com/vahidk/EffectiveTensorflow</a>.Sharavsambuu Gunchinishhttp://www.blogger.com/profile/06950810883056147179noreply@blogger.comtag:blogger.com,1999:blog-1457877875009527488.post-76822521276363675882017-05-10T10:15:00.001+08:002021-08-19T21:00:49.001+08:00Naive Bayes ClassifierOn one classifier algorithm from supervised machine learning. Since I copied this straight from my FB Note, the font came out different.<br />
<br />
<a name='more'></a><br />
<h2 class="_2cuy _509y _2vxa" style="box-sizing: border-box; color: #1d2129; direction: ltr; font-family: Georgia, serif; font-size: 36px; font-weight: normal; line-height: 38px; margin: 0px auto 28px; padding: 0px; white-space: pre-wrap; width: 700px; word-wrap: break-word;">
Bayes' theorem</h2>
<div class="_2cuy _3dgx _2vxa" style="box-sizing: border-box; color: #1d2129; direction: ltr; font-family: Georgia, serif; font-size: 17px; margin: 0px auto 28px; white-space: pre-wrap; width: 700px; word-wrap: break-word;">
This theorem was first formulated by Thomas Bayes and rests on conditional probability. Conditional probability is the probability that something happens given that something else has already happened.
<span class="_4yxo" style="font-family: inherit; font-weight: bold;">Examples of conditional probability</span>
- The probability that someone who already uses an iPhone buys a macbook.
- The probability of buying a drink inside a movie theater.
- The probability of buying nuts right after buying a cold drink.
- The probability of watching a cartoon after sitting down on the sofa.</div>
<div class="_2cuy _3dgx _2vxa" style="box-sizing: border-box; color: #1d2129; direction: ltr; font-family: Georgia, serif; font-size: 17px; margin: 0px auto 28px; white-space: pre-wrap; width: 700px; word-wrap: break-word;">
<span class="_4yxo" style="font-family: inherit; font-weight: bold;">The conditional probability formula</span>
P(A|B) = P(B|A) * P(A) / P(B)
where
P(A) is the probability of buying a Macbook.
P(B) is the probability of buying an iPhone.
P(B|A) is the probability of buying an iPhone after buying a Macbook.
P(A|B) is the probability of buying a Macbook after having bought an iPhone.
in general terms
P(A) is the probability that hypothesis A is true. Called the prior probability.
P(B) is the probability of the evidence (independent of the hypothesis).
P(B|A) is the probability of the evidence given that the hypothesis is true.
P(A|B) is the probability of the hypothesis given the evidence.
<span class="_4yxo" style="font-family: inherit; font-weight: bold;">An example problem</span>
Suppose a lab runs a test for a disease named "D"; if the result is bad it reports "Positive", and if it is fine it reports "Negative". When a patient really does have disease D, the test correctly reports "Positive" with 99% probability; conversely, when the patient does not have it, the test correctly reports "Negative" with 99% probability. If 3% of all the people tested actually have this disease, what is the probability that you have it given a "Positive" result?
This problem can be solved by casting it into the conditional probability formula.
- The probability of having disease D: P(D) = 0.03 = 3%
- The probability of a "Positive" result when the person really does have
disease D: P(Pos | D) = 0.99 = 99%
- The probability of not having disease D: P(~D) = 0.97 = 97%
- The probability that a tested person receives a "Positive" result while
not actually having disease D: P(Pos | ~D) = 0.01 = 1%
Plugging the probability that a tested person has disease D, i.e. <span class="_4yxo" style="font-family: inherit; font-weight: bold;">P(D | Pos)</span>, into Bayes' formula gives
<span class="_4yxo" style="font-family: inherit; font-weight: bold;">P(D | Pos)</span> = P(Pos | D) * P(D) / P(Pos)
From here we need to compute the value of P(Pos).
<span class="_4yxo" style="font-family: inherit; font-weight: bold;">P(Pos)</span> = P(D, pos) + P(~D, pos)
= P(pos | D) * P(D) + P(pos | ~D) * P(~D)
= 0.99 * 0.03 + 0.01 * 0.97 = 0.0297 + 0.0097
= 0.0394
<span class="_4yxo" style="font-family: inherit; font-weight: bold;">P(D | Pos)</span> = P(Pos | D) * P(D) / P(Pos)
= 0.99 * 0.03 / 0.0394
= 0.753807107
So the answer comes out to roughly a 75% probability of having disease D.
<span class="_4yxo" style="font-family: inherit; font-weight: bold;">Another example</span>
What is the probability that today I lounge on my sofa and, right after settling in, put on a cartoon to watch?
Event A - watching the South Park cartoon today
Event B - lounging lazily on the sofa today
P(A|B) = ?
Let's break these two events into the components of Bayes' formula:
Over the past 2 months, i.e. 60 days, I watched the South Park cartoon 10 times.
P(A) = P(probability of watching South Park today) = 10 / 60 = ~0.17
I like lounging on my sofa pretty much every day, and over the past 2 months I spent 14 nights out partying away from home. So the probability of lounging on the sofa today is
P(B) = P(probability of lounging on the sofa today) = (60-14) / 60 = ~0.76
Although I watched South Park 10 times, 4 of those were at a friend's place, away from home. So over the past 2 months I spent 46 days at home on the sofa flipping channels, and watched South Park at home 6 out of 10 times.
P(B|A) = P(probability of having lounged on the sofa while watching South Park over the past 2 months) = 6/10 = 0.60
So all the components needed to calculate, with Bayes' method, the probability that I decide to sit on my sofa today and then put on the South Park cartoon are now in place.
P(A|B) = P(B|A) * P(A) / P(B) = (0.60 * 0.17) / 0.76 = 0.13
or
P(probability of deciding to sit on the sofa today and then watching South Park) = 0.13
= a probability of 13%.
</div>
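As a sanity check, both worked examples above can be recomputed in a few lines of Python (all values taken straight from the text; `p_a` and `p_b` are kept exact rather than rounded to 0.17 and 0.76):

```python
# Disease-test example
p_d, p_not_d = 0.03, 0.97
p_pos_given_d, p_pos_given_not_d = 0.99, 0.01

p_pos = p_pos_given_d * p_d + p_pos_given_not_d * p_not_d   # 0.0394
p_d_given_pos = p_pos_given_d * p_d / p_pos                 # ~0.7538

# Sofa / South Park example
p_a = 10 / 60          # watching South Park today
p_b = (60 - 14) / 60   # lounging on the sofa today
p_b_given_a = 6 / 10
p_a_given_b = p_b_given_a * p_a / p_b                       # ~0.13

print(round(p_d_given_pos, 4), round(p_a_given_b, 2))
```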
<h2 class="_2cuy _509y _2vxa" style="box-sizing: border-box; color: #1d2129; direction: ltr; font-family: Georgia, serif; font-size: 36px; font-weight: normal; line-height: 38px; margin: 52px auto 28px; padding: 0px; white-space: pre-wrap; width: 700px; word-wrap: break-word;">
<span class="_4yxo" style="font-family: inherit; font-weight: bold;">Naive Bayes Classifier</span></h2>
<div class="_2cuy _3dgx _2vxa" style="box-sizing: border-box; color: #1d2129; direction: ltr; font-family: Georgia, serif; font-size: 17px; margin: 0px auto 28px; white-space: pre-wrap; width: 700px; word-wrap: break-word;">
Энэ бол Bayes-ийн онолыг хэрэглэдэг classifier буюу Machine Learning-ийн Supervised Learning төрлийн алгоритм юм. Энэ алгоритмаар тухайн өгөгдөл ямар нэгэн class-д хамаарагдах магадлалыг олдог. Хамгийн өндөр магадлалтай class-д энэ өгөгдөл хамаарагдана гэж үзнэ. Үүнийг өөрөөр <span class="_4yxo" style="font-family: inherit; font-weight: bold;">Maximum A Posteriori (MAP)</span> гэдэг.
Таамганд зориулсан MAP :
MAP(H)
= max(P(H|E))
= max((P(E|H) * P(H)) / P(E))
= max(P(E|H) * P(H))
Энд P(E) бол баталгааны магадлал бөгөөд үр дүнг нормчилоход хэрэглэдэг. Утга нь өөрчлөгдөхгүй тул үүнийг хасахад нээх нөлөө байхгүй.
The Naive Bayes classifier assumes that all features are independent of one another. That is, the presence or absence of one feature has no influence on the presence or absence of any other feature.
For example:
A fruit that is red and about 4 inches in diameter could be identified as an apple. Even if these two features depend on each other, or on some other feature, the naive Bayes classifier treats each of them as contributing independently to the probability that this fruit is an apple.
With real data, a hypothesis is tested against many pieces of evidence, i.e. many features.
H - hypothesis
Multiple Evidences - the set of observed features
<span class="_4yxo" style="font-family: inherit; font-weight: bold;">P(H | Multiple Evidences)</span> = P(E1|H) * P(E2|H) * ... * P(En|H) * P(H) / P(Multiple Evidences)
<span class="_4yxo" style="font-family: inherit; font-weight: bold;">A worked example of the Naive Bayes classifier</span>
Suppose we have a training set with 1500 records and 3 classes.
The 3 classes, each identifying an animal type:
- Parrot
- Dog
- Fish
And suppose there are 4 predictor features:
- Swims
- Flies
- Green
- Dangerous teeth
Each of these features takes one of two values, <span class="_4yxo" style="font-family: inherit; font-weight: bold;">(T)True</span> or <span class="_4yxo" style="font-family: inherit; font-weight: bold;">(F)False</span>.
--------------------------------------------------------------------------------------
|  Swims  |  Flies  |  Green  |  Teeth  | Animal type |
--------------------------------------------------------------------------------------
| 50/500  | 500/500 | 400/500 |  0/500  | Parrot      |
--------------------------------------------------------------------------------------
| 450/500 |  0/500  |  0/500  | 500/500 | Dog         |
--------------------------------------------------------------------------------------
| 500/500 |  0/500  | 100/500 | 50/500  | Fish        |
--------------------------------------------------------------------------------------
This table shows frequency counts from the training data. For example:
- For parrots, the Swims value is 50/500 (10%): 10% of all parrots can swim; 500 out of 500 (100%) can fly; 400 of the 500 are green; and 0 (0%) means no parrot has dangerous teeth.
- For the Dog class, 450 of 500 (90%) can swim, 0% can fly, 0% are green, and 500 out of 500 (100%) have dangerous teeth.
- For the Fish class, all 500 of 500 (100%) can swim, 0% can fly, 100 of them (20%) are green, and 50 of 500 (10%) have dangerous teeth.
Now let us predict classes with a Naive Bayes model based on these training counts. Consider two new records and their feature values:
----------------------------------------------------------------------
|     |  Swims  |  Flies  |  Green  |  Teeth  |
----------------------------------------------------------------------
| 1.  |  True   |  False  |  True   |  False  |
----------------------------------------------------------------------
| 2.  |  True   |  False  |  True   |  True   |
----------------------------------------------------------------------
Using these features and values, let us predict each record's class, that is, which of the three animal types (dog, parrot, or fish) the record belongs to.
Recalling the Naive Bayes formula:
P(H|Multiple Evidences) = P(E1|H) * P(E2|H) * ... * P(En|H) * P(H) / P(Multiple Evidences)
Consider record 1.
Its features are Swims and Green, so the record could plausibly be any of dog, parrot, or fish.
Test the hypothesis that this record is in the Dog class:
<span class="_4yxo" style="font-family: inherit; font-weight: bold;">P(Dog | Swims, Green)</span>
= P(Swims | Dog) * P(Green | Dog) * P(Dog) / P(Swims, Green)
= 0.9 * 0 * 0.333 / P(Swims, Green)
= 0
Test the hypothesis that it is in the Parrot class:
<span class="_4yxo" style="font-family: inherit; font-weight: bold;">P(Parrot | Swims, Green)</span>
= P(Swims | Parrot) * P(Green | Parrot) * P(Parrot) / P(Swims, Green)
= 0.1 * 0.80 * 0.333 / P(Swims, Green)
= 0.0266 / P(Swims, Green)
And test the hypothesis that it is in the Fish class:
<span class="_4yxo" style="font-family: inherit; font-weight: bold;">P(Fish | Swims, Green)</span>
= P(Swims | Fish) * P(Green | Fish) * P(Fish) / P(Swims, Green)
= 1 * 0.2 * 0.333 / P(Swims, Green)
= 0.0666 / P(Swims, Green)
All three calculations share the same denominator P(Swims, Green).
The value of P(Fish | Swims, Green) is greater than that of P(Parrot | Swims, Green).
So, using Naive Bayes, we conclude that this record belongs to the Fish class.
Consider record 2.
Its True features are Swims, Green, and Teeth; again the record could be dog, parrot, or fish.
Test the Dog hypothesis:
<span class="_4yxo" style="font-family: inherit; font-weight: bold;">P(Dog | Swims, Green, Teeth)</span>
= P(Swims | Dog) * P(Green | Dog) * P(Teeth | Dog) * P(Dog) / P(Swims, Green, Teeth)
= 0.9 * 0 * 1 * 0.333 / P(Swims, Green, Teeth)
= 0
Test the Parrot hypothesis:
<span class="_4yxo" style="font-family: inherit; font-weight: bold;">P(Parrot | Swims, Green, Teeth)</span>
= P(Swims | Parrot) * P(Green | Parrot) * P(Teeth | Parrot) * P(Parrot) / P(Swims, Green, Teeth)
= 0.1 * 0.80 * 0 * 0.333 / P(Swims, Green, Teeth)
= 0
Test the Fish hypothesis:
<span class="_4yxo" style="font-family: inherit; font-weight: bold;">P(Fish | Swims, Green, Teeth)</span>
= P(Swims | Fish) * P(Green | Fish) * P(Teeth | Fish) * P(Fish) / P(Swims, Green, Teeth)
= 1 * 0.2 * 0.1 * 0.333 / P(Swims, Green, Teeth)
= 0.00666 / P(Swims, Green, Teeth)
All three share the same denominator P(Swims, Green, Teeth). Only P(Fish | Swims, Green, Teeth) is greater than zero, at 0.00666, so this record's class is Fish.
The probabilities computed this way are very small, which is why they usually need to be normalized; the shared denominator is used for that.
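The whole worked example can be reproduced with a short, self-contained Python sketch; the class names, the counts, and the 1/3 priors are taken from the tables above:

```python
# Naive Bayes by hand, using the frequency counts from the tables above.
counts = {  # class -> {feature: number of True records out of 500}
    "Parrot": {"Swims": 50,  "Flies": 500, "Green": 400, "Teeth": 0},
    "Dog":    {"Swims": 450, "Flies": 0,   "Green": 0,   "Teeth": 500},
    "Fish":   {"Swims": 500, "Flies": 0,   "Green": 100, "Teeth": 50},
}
prior = 1 / 3  # 500 of the 1500 training records per class

def unnormalized_posterior(cls, true_features):
    """P(E1|H) * ... * P(En|H) * P(H), skipping the shared denominator."""
    p = prior
    for f in true_features:
        p *= counts[cls][f] / 500
    return p

def predict(true_features):
    # Pick the class with the highest unnormalized posterior (MAP rule).
    return max(counts, key=lambda c: unnormalized_posterior(c, true_features))

print(predict(["Swims", "Green"]))           # record 1
print(predict(["Swims", "Green", "Teeth"]))  # record 2
```

Both records come out as Fish, matching the hand calculation.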
</div>
<div class="_2cuy _3dgx _2vxa" style="box-sizing: border-box; color: #1d2129; direction: ltr; font-family: Georgia, serif; font-size: 17px; margin: 0px auto 28px; white-space: pre-wrap; width: 700px; word-wrap: break-word;">
Advantages of the Naive Bayes algorithm:
- Fast and scales well.
- Usable for both binary and multi-class classification; there are NB variants such as GaussianNB, MultinomialNB, and BernoulliNB.
- A very simple algorithm: it is mostly counting operations.
- Fits text and document classification very well, and is widely used for e-mail classification and spam filtering.
- Easy to train even on a small amount of training data.
Disadvantage:
- All features are assumed to be mutually independent, so the algorithm cannot learn interactions between features. For example, suppose Bataa is going to a party and rummages through his wardrobe to pick clothes for it. He likes white shirts, and among jeans he likes brown, but he dislikes wearing a white shirt and brown jeans together. Naive Bayes can learn about each feature separately, but it cannot capture this kind of relationship between the features.</div>
Sharavsambuu Gunchinishhttp://www.blogger.com/profile/06950810883056147179noreply@blogger.comtag:blogger.com,1999:blog-1457877875009527488.post-9665504937959449992016-06-30T07:27:00.004+09:002021-09-16T03:15:33.527+08:00Listening to Pandora and Spotify radio on UbuntuPandora is a music streaming service built around the listener's tastes. Unfortunately, because of various music-licensing issues it is available in only a few countries. Below is a guide to accessing and listening to it from Mongolia on Ubuntu.<br />
<br />
<a name='more'></a><br />
<br />
$sudo apt install tor<br />
$sudo apt install polipo<br />
$sudo apt install pianobar<br />
<br />
<br />
$sudo vim /etc/polipo/config<br />
open it and add the content below:<br />
<br />
socksParentProxy = 127.0.0.1:9050<br />
diskCacheRoot = ""<br />
<br />
<br />
$vim ~/.config/pianobar/config<br />
open it and configure tor and your Pandora account:<br />
<br />
control_proxy = http://127.0.0.1:8123<br />
user = tanii@email.com<br />
password = taniipandorapassword<br />
<br />
<br />
$sudo service polipo restart<br />
$sudo service tor restart<br />
<br />
<br />
$pianobar<br />
Running this brings up your Pandora station list; choose a number to start playing.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVqIfMhMHAN6PZxHnjFvBruY1SEaBam7oO9HpHXafjqUVGi8o0GNpytZZ1cFge3vqMtqG4zac2hjnDbqBea9nwTW2Cxy8I9qab0iFut7rb9C3xLnPGoKazmNuAcZ-R6zXiQguPK7CIWQ/s1600/Screenshot+from+2016-06-30+07-24-46.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="359" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVqIfMhMHAN6PZxHnjFvBruY1SEaBam7oO9HpHXafjqUVGi8o0GNpytZZ1cFge3vqMtqG4zac2hjnDbqBea9nwTW2Cxy8I9qab0iFut7rb9C3xLnPGoKazmNuAcZ-R6zXiQguPK7CIWQ/s640/Screenshot+from+2016-06-30+07-24-46.png" width="640" /></a></div>
<br />
<br />
pianobar can be a bit awkward to use; in that case there is also a Pandora client application called pithos:<br />
<br />
$sudo apt install pithos<br />
$pithos<br />
<br />
After installing and launching it, set your account details under Preferences -&gt; Pandora<br />
and set the Proxy URL field to http://127.0.0.1:8123.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6Ef6H9kcQNJcVoyTJ0O6EZP0oQMt5Ge6P76nOpsbci72W_BvDkJttLyWn0Ey1_9pTlU4W5H4G6-cQPIKZJ9B_fR9OXJWk04jbwXeuNOg7I9hM8nJxuLuIZkOIbVunP8292nEZreu5Gw/s1600/Screenshot+from+2016-06-30+07-34-24.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="358" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6Ef6H9kcQNJcVoyTJ0O6EZP0oQMt5Ge6P76nOpsbci72W_BvDkJttLyWn0Ey1_9pTlU4W5H4G6-cQPIKZJ9B_fR9OXJWk04jbwXeuNOg7I9hM8nJxuLuIZkOIbVunP8292nEZreu5Gw/s640/Screenshot+from+2016-06-30+07-34-24.png" width="640" /></a></div>
<br />
If it does not work, try restarting the tor service. When connection problems with the Tor nodes arise it stops receiving traffic from outside, so restarting and carrying on works fine:<br />
<br />
$sudo service tor restart<br />
<br />
<br />
Spotify can be listened to as well: in its proxy settings choose HTTP and likewise enter 127.0.0.1 with port 8123.<br />
<br />
Install the Linux client from https://www.spotify.com/us/download/linux/<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVoGAjYcQ5ZrWXw2gSI5uGhDPfiobdkAdMyBhAe0P-zxeeW2l_V4NPX9nlimwkPz5x1_sS0ntkl8jwqMUvfsRvtCU69fcWTbLGlrGHFm1KjC4Rcllx_LHwhRGQhwYoPp-zMq0OYRhkYg/s1600/Screenshot+from+2016-06-30+08-16-40.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="358" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVoGAjYcQ5ZrWXw2gSI5uGhDPfiobdkAdMyBhAe0P-zxeeW2l_V4NPX9nlimwkPz5x1_sS0ntkl8jwqMUvfsRvtCU69fcWTbLGlrGHFm1KjC4Rcllx_LHwhRGQhwYoPp-zMq0OYRhkYg/s640/Screenshot+from+2016-06-30+08-16-40.png" width="640" /></a></div>
<br />
<br />
If you want the tor service to connect only to proxy nodes inside America:<br />
<br />
$sudo vim /etc/tor/torrc<br />
<br />
Add the following at the end of the file:<br />
<br />
<code>ExcludeNodes {be},{pl},{ca},{za},{vn},{uz},{ua},{tw},{tr},{th}, {sk},{sg},{se},{sd},{sa},{ru},{ro},{pt},{ph},{pa}, {nz},{np},{no},{my},{mx},{md},{lv},{lu},{kr},{jp}, {it},{ir},{il},{ie},{id},{hr},{hk},{gr},{gi},{gb}, {fi},{es},{ee},{dk},{cz},{cy},{cr},{co},{cn},{cl}, {ci},{ch},{by},{br},{bg},{au},{at},{ar},{aq},{ao}, {ae},{nl},{de},{fr}
</code>
<br />
<br />
$sudo service tor restart
<div><br /></div><div><br /></div><div><br /></div><div>Added on 2021-09-14.</div><div><br /></div><div>The settings above do not work on Ubuntu 21.04, so use the following instead.</div><div><br /></div><div>Install the required programs:</div><div>$sudo apt install tor privoxy</div><div><br /></div><div>$sudo vim /etc/privoxy/config</div><div>Add the following line at the very bottom of the file and save:</div><div>forward-socks5 / 127.0.0.1:9050 .</div><div><br /></div><div>$sudo vim /etc/tor/torrc</div><div>To make Tor connect only to American nodes:</div><div>ExitNodes {US} StrictNodes 1</div><div><br /></div><div>Restart the services:</div><div>$sudo service tor restart</div><div>$sudo service privoxy restart</div><div><br /></div><div>The proxy URL is:</div><div>http://127.0.0.1:8118/</div><div><br /></div><div><br /></div><div><br /></div><div><br /></div>Sharavsambuu Gunchinishhttp://www.blogger.com/profile/06950810883056147179noreply@blogger.comtag:blogger.com,1999:blog-1457877875009527488.post-87188686601611621782016-05-25T18:50:00.001+09:002016-05-25T18:59:50.417+09:00Deploy django project with gunicorn, nginx and ubuntuFinding and piecing together this configuration from the internet was a bit tedious, so I am noting it down on my blog.<br />
<br />
<a name='more'></a><br />
<br />
The Django app's root directory:<br />
/home/user/deploy/myproject<br />
<br />
<span style="font-size: large;"><b>Gunicorn</b></span><br />
<br />
First, write a gunicorn launcher script that starts with the system at boot and save it in the proper place.<br />
<br />
Create the file /etc/init.d/myproject with the following content:<br />
<br />
<pre class="prettyprint">#!/bin/bash
APPNAME=myproject
USER=user
PATH=/bin:/usr/bin:/sbin:/usr/sbin
ACTIVATE=/home/user/deploy/myproject/venv/bin/activate
APPMODULE=myproject.wsgi:application
SOCKFILE=/home/user/deploy/myproject/myproject.sock
DAEMON=gunicorn
PIDFILE=/var/run/gunicorn.pid
LOGFILE=/var/log/$DAEMON.log
WORKERS=2
. /lib/lsb/init-functions
if [ -e "/etc/default/$APPNAME" ]
then
. /etc/default/$APPNAME
fi
cd /home/user/deploy/myproject
RUNDIR=$(dirname $SOCKFILE)
test -d $RUNDIR || mkdir -p $RUNDIR
case "$1" in
start)
log_daemon_msg "Starting deferred execution scheduler" "$APPNAME"
source $ACTIVATE
$DAEMON --daemon --bind=unix:$SOCKFILE --pid=$PIDFILE --workers=$WORKERS --user=$USER --log-file=$LOGFILE $APPMODULE
log_end_msg $?
;;
stop)
log_daemon_msg "Stopping deferred execution scheduler" "$APPNAME"
killproc -p $PIDFILE $DAEMON
log_end_msg $?
;;
force-reload|restart)
$0 stop
$0 start
;;
status)
status_of_proc -p $PIDFILE $DAEMON && exit 0 || exit $?
;;
*)
echo "Usage: /etc/init.d/$APPNAME {start|stop|restart|force-reload|status}"
exit 1
;;
esac
exit 0
</pre>
<br />
<br />
After saving the script:<br />
sudo chmod +x /etc/init.d/myproject<br />
sudo chown root:root /etc/init.d/myproject<br />
sudo update-rc.d myproject defaults<br />
sudo update-rc.d myproject enable<br />
<br />
To start it:<br />
sudo /etc/init.d/myproject start<br />
<br />
To stop it:<br />
sudo /etc/init.d/myproject stop<br />
<br />
Its log file is:<br />
/var/log/gunicorn.log<br />
<br />
<br />
<b><span style="font-size: large;">NGINX</span></b><br />
<br />
Go to your project root and create the file myproject_nginx.conf:<br />
/home/user/deploy/myproject/myproject_nginx.conf<br />
<br />
<pre class="prettyprint">server {
listen 80;
server_name .my-domain.com;
charset utf-8;
client_max_body_size 75M;
location = /favicon.ico { access_log off; log_not_found off; }
location /media {
alias /home/user/deploy/myproject/design/media;
}
location /static {
alias /home/user/deploy/myproject/design/static_root;
}
location / {
include proxy_params;
proxy_pass http://unix:/home/user/deploy/myproject/myproject.sock;
}
}
</pre>
<br />
Adjust the media and static aliases according to your own project's layout.<br />
<br />
Register the config file with nginx by creating a symlink:<br />
sudo ln -s /home/user/deploy/myproject/myproject_nginx.conf /etc/nginx/sites-enabled
<br />
<br />
Now restart nginx; when you open the configured domain, everything should work:<br />
sudo /etc/init.d/myproject start<br />
sudo service nginx restart<br />
<br />
Nginx listens on port 80, the default web port. When a request arrives from outside, it proxies it through to gunicorn's socket. Gunicorn spawns several Django processes and supervises them, that is, if one of the processes dies it restarts it, and presumably it hands the data arriving over the socket to its processes in the proper manner via <a href="https://en.wikipedia.org/wiki/Web_Server_Gateway_Interface">wsgi</a>.Sharavsambuu Gunchinishhttp://www.blogger.com/profile/06950810883056147179noreply@blogger.com