<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page only be viewable by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup
-->

%TOC%

%ICON{arrowleft}% Go to [[CMSTier3LogXX][previous page]] / [[CMSTier3LogXX][next page]] of Tier3 site log %M%

---+ !!17. 03. 2010 Pool migration from t3fs05_cms to t3fs07_cms

%TOC%

---++ migration commands

Actually, I will carry out a pool copy, because I do not want to endanger our files in this first try. I set the pool to read-only and started the migration with

<pre>
[t3se01.psi.ch] (local) admin > cd t3fs05_cms
[t3se01.psi.ch] (t3fs05_cms) pool disable -rdonly
[t3se01.psi.ch] (t3fs05_cms) migration copy t3fs07_cms
</pre>

When the copy is done, the migration job does not exit. It stays in place (state SLEEPING) and keeps checking for new files to transfer:

<pre>
[t3se01.psi.ch] (t3fs05_cms) admin > migration info 1
Command    : migration copy t3fs07_cms
State      : SLEEPING
Queued     : 31
Attempts   : 4869
Targets    : t3fs07_cms
Concurrency: 2
Running tasks:
Most recent errors:
10:21:39 [4854] 000200000000000000935910: Transfer to [t3fs07_cms@local] failed (Not in trash: 000200000000000000935910)
10:22:08 [4855] 0002000000000000009358E0: Transfer to [t3fs07_cms@local] failed (Not in trash: 0002000000000000009358E0)
10:22:12 [4856] 000200000000000000935EE0: Transfer to [t3fs07_cms@local] failed (Not in trash: 000200000000000000935EE0)
10:22:37 [4857] 000200000000000000937460: Transfer to [t3fs07_cms@local] failed (Not in trash: 000200000000000000937460)
10:22:40 [4858] 000200000000000000935E40: Transfer to [t3fs07_cms@local] failed (Not in trash: 000200000000000000935E40)
10:23:07 [4860] 000200000000000000935D30: Transfer to [t3fs07_cms@local] failed (Not in trash: 000200000000000000935D30)
10:23:10 [4859] 000200000000000000936090: Transfer to [t3fs07_cms@local] failed (Not in trash: 000200000000000000936090)
10:23:39 [4862] 000200000000000000936540: Transfer to [t3fs07_cms@local] failed (Not in trash: 000200000000000000936540)
10:23:41 [4861] 000200000000000000935880: Transfer to [t3fs07_cms@local] failed (Not in trash: 000200000000000000935880)
10:24:12 [4863] 0002000000000000009366F0: Transfer to [t3fs07_cms@local] failed (Not in trash: 0002000000000000009366F0)
10:24:13 [4864] 0002000000000000009359D0: Transfer to [t3fs07_cms@local] failed (Not in trash: 0002000000000000009359D0)
10:24:44 [4866] 0002000000000000009372B0: Transfer to [t3fs07_cms@local] failed (Not in trash: 0002000000000000009372B0)
10:24:45 [4865] 000200000000000000936780: Transfer to [t3fs07_cms@local] failed (Not in trash: 000200000000000000936780)
10:25:15 [4867] 000200000000000000936E40: Transfer to [t3fs07_cms@local] failed (Not in trash: 000200000000000000936E40)
10:25:16 [4868] 0002000000000000009376B0: Transfer to [t3fs07_cms@local] failed (Not in trash: 0002000000000000009376B0)
</pre>

The pool seems to retry failed transfers indefinitely, so I cancelled the job:

<pre>
[t3se01.psi.ch] (t3fs05_cms) admin > migration cancel 1
[1] CANCELLING migration copy t3fs07_cms
</pre>

---++ some monitoring information

The throughput from t3fs05 to t3fs07 was roughly 100 MB/s. This was probably a network bandwidth limitation: even though both nodes have 4 trunked 1 Gb Ethernet links, only a single link was used, because the hash function that distributes traffic over the links is based only on IP and MAC and therefore gives the same result for all the connections.
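The single-link effect can be illustrated with a toy model. The sketch below is not the hash the trunk on these nodes actually uses; it is a hypothetical address-only hash (XOR of the two IPs, modulo the number of links) next to a variant that also mixes in the TCP ports. The addresses and ports are made-up example values. For reference, one 1 Gb/s link carries at most 125 MB/s, so the observed ~100 MB/s is consistent with a single saturated link.

```shell
#!/bin/sh
# Toy model only: real link-aggregation hashes differ in detail.
# The point is that an address-only hash never sees the port numbers.

# ip_to_int A.B.C.D -> 32-bit integer (runs in a subshell so IFS does not leak)
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# pick_link SRC_IP DST_IP SRC_PORT DST_PORT N_LINKS USE_PORTS(0|1)
# Prints the index (0 .. N_LINKS-1) of the link the connection is mapped to.
pick_link() {
    src=$(ip_to_int "$1")
    dst=$(ip_to_int "$2")
    h=$(( src ^ dst ))
    if [ "$6" -eq 1 ]; then
        # hypothetical port-aware variant
        h=$(( h ^ $3 ^ $4 ))
    fi
    echo $(( h % $5 ))
}

# Four parallel connections between the same two hosts (example values).
for port in 40001 40002 40003 40004; do
    echo "src port $port: address-only -> link $(pick_link 192.0.2.5 192.0.2.7 "$port" 22125 4 0)," \
         "with ports -> link $(pick_link 192.0.2.5 192.0.2.7 "$port" 22125 4 1)"
done
```

With the address-only variant every connection between the same pair of hosts lands on the same link, however many transfers run in parallel. Only a hash that includes the ports can spread them over the trunk.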
* t3fs05io.png: <br /> <img src="%ATTACHURLPATH%/t3fs05io.png" alt="t3fs05io.png" width='227' height='119' /> * t3fs07io.png: <br /> <img src="%ATTACHURLPATH%/t3fs07io.png" alt="t3fs07io.png" width='227' height='119' /> On t3fs05: <pre> fsstat /data1 new name name attr attr lookup rddir read read write write file remov chng get set ops ops ops bytes ops bytes 0 0 0 0 0 0 0 359 89.7M 0 0 /data1 0 0 0 8 1 18 0 364 91.0M 0 0 /data1 0 0 0 0 0 0 0 339 84.7M 0 0 /data1 0 0 0 0 0 0 0 373 93.2M 0 0 /data1 0 0 0 0 0 0 0 386 96.5M 0 0 /data1 0 0 0 0 0 0 0 393 98.2M 0 0 /data1 0 0 0 0 0 0 0 328 82.0M 0 0 /data1 0 0 0 0 0 0 0 357 89.2M 0 0 /data1 0 0 0 0 0 0 0 385 96.2M 0 0 /data1 0 0 0 0 0 0 0 365 91.2M 0 0 /data1 0 0 0 0 0 0 0 410 102M 0 0 /data1 0 0 0 0 0 0 0 402 100M 0 0 /data1 0 0 0 0 0 0 0 387 96.7M 0 0 /data1 bash-4.0# zpool iostat data1 5 capacity operations bandwidth pool used avail read write read write data1 7.25T 12.7T 705 0 87.5M 0 data1 7.25T 12.7T 553 0 68.7M 0 data1 7.25T 12.7T 669 0 83.0M 0 data1 7.25T 12.7T 617 39 76.7M 75.6K data1 7.25T 12.7T 669 0 83.0M 0 data1 7.25T 12.7T 751 0 93.2M 0 data1 7.25T 12.7T 711 0 88.4M 0 data1 7.25T 12.7T 617 0 76.7M 0 data1 7.25T 12.7T 707 0 87.8M 0 data1 7.25T 12.7T 760 27 94.2M 50.6K data1 7.25T 12.7T 617 0 76.7M 0 data1 7.25T 12.7T 748 0 92.8M 0 data1 7.25T 12.7T 759 0 94.2M 0 data1 7.25T 12.7T 614 0 76.2M 0 data1 7.25T 12.7T 788 0 97.8M 0 data1 7.25T 12.7T 772 39 95.8M 77.4K bash-4.0# /opt/csw/share/dtracetoolkit/Bin/bitesize.d # ca 10 sec Tracing... Hit Ctrl-C to end. 
^C PID CMD 8045 /usr/jdk1.6.0_17/bin/java -server -Xmx512m -XX:MaxDirectMemorySize=512m -Dsun.n\0 value ------------- Distribution ------------- count 4096 | 0 8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 35189 16384 | 0 32768 | 0 65536 | 223 131072 | 0 0 sched\0 value ------------- Distribution ------------- count 4096 | 0 8192 |@@@@@@@@@@@@@@@@@@@@@ 6338 16384 |@@@ 771 32768 |@@@@@ 1489 65536 |@@@@@@@@@@@ 3238 131072 | 0 bash-4.0# iopattern %RAN %SEQ COUNT MIN MAX AVG KR KW 85 15 3491 14336 118272 35335 120466 0 85 15 2653 14336 121344 31628 81943 0 84 16 2560 14336 118272 32658 81646 0 86 14 2862 14336 117248 28436 79478 0 84 16 4236 14336 118272 28817 119210 0 85 15 7101 14336 124928 27109 187990 0 85 15 2894 14336 124416 28249 79837 0 85 15 5592 14336 124928 29536 161297 0 84 16 2946 14336 126976 27911 80301 0 86 14 6439 14336 124928 24641 154949 0 81 19 3381 14336 125440 33700 111271 0 78 22 2504 4096 117248 35844 87651 0 83 17 5504 14336 117248 29391 157981 0 88 12 4689 14336 117248 25617 117303 0 83 17 2576 14336 117248 32142 80857 0 86 14 5250 14336 127488 29661 152075 0 </pre> On the target node t3fs07 <pre> fsstat /data1 1 new name name attr attr lookup rddir read read write write file remov chng get set ops ops ops bytes ops bytes 0 0 0 0 0 0 0 0 0 375 93.7M /data1 0 0 0 0 0 0 0 0 0 348 87.0M /data1 0 0 0 0 0 0 0 0 0 390 97.5M /data1 0 0 0 0 0 0 0 0 0 384 96.0M /data1 0 0 0 0 0 0 0 0 0 311 77.7M /data1 0 0 0 0 0 0 0 0 0 374 93.5M /data1 0 0 0 0 0 0 0 0 0 376 94.0M /data1 0 0 0 0 0 0 0 0 0 376 94.0M /data1 0 0 0 0 0 0 0 0 0 399 100M /data1 0 0 0 0 0 0 0 0 0 411 103M /data1 0 0 0 0 0 0 0 0 0 379 94.7M /data1 0 0 0 0 0 0 0 0 0 376 94.0M /data1 0 0 0 0 0 0 0 0 0 375 93.7M /data1 zpool iostat data1 5 ata1 1.86T 38.8T 0 0 0 0 data1 1.86T 38.8T 0 942 0 118M data1 1.86T 38.8T 0 2.81K 102 355M data1 1.87T 38.8T 0 163 0 9.75M data1 1.87T 38.8T 0 0 0 0 data1 1.87T 38.8T 0 0 0 0 data1 1.87T 38.8T 0 0 0 0 data1 1.87T 38.8T 0 1.48K 0 190M data1 1.87T 38.8T 0 1.57K 
0 201M data1 1.87T 38.8T 0 1.06K 0 120M data1 1.87T 38.8T 0 0 0 0 data1 1.87T 38.8T 0 0 0 0 data1 1.87T 38.8T 0 0 0 0 data1 1.87T 38.8T 0 0 0 0 data1 1.87T 38.8T 0 0 102 0 data1 1.87T 38.8T 0 4.15K 0 516M data1 1.87T 38.8T 0 479 0 60.0M data1 1.87T 38.8T 0 0 0 102K data1 1.87T 38.8T 0 0 0 0 data1 1.87T 38.8T 0 1.92K 0 245M data1 1.87T 38.8T 0 1.56K 0 200M data1 1.88T 38.7T 0 358 510 29.2M data1 1.88T 38.7T 0 0 0 0 data1 1.88T 38.7T 0 0 0 0 root@t3fs07 $ /opt/csw/share/dtracetoolkit/Bin/bitesize.d # ca 10 sec. Tracing... Hit Ctrl-C to end. ^C PID CMD 0 sched\0 value ------------- Distribution ------------- count 256 | 0 512 |@ 1074 1024 | 415 2048 | 183 4096 | 14 8192 | 1 16384 |@@@@@@@@@ 10214 32768 |@@ 2684 65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@ 30485 131072 |@ 1427 262144 | 0 root@t3fs07 $ iopattern %RAN %SEQ COUNT MIN MAX AVG KR KW 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 61 39 12027 18432 131072 52277 0 614000 60 40 12603 18432 131072 55695 0 685485 41 59 36 18432 18944 16696 0 587 0 0 0 0 0 0 0 0 63 37 1203 18432 131072 45644 0 53623 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 66 34 7472 18432 131072 49895 0 364050 60 40 20027 18432 131072 55019 0 1076054 43 57 12215 512 131072 55311 0 659792 %RAN %SEQ COUNT MIN MAX AVG KR KW 93 7 167 512 18944 4982 0 812 0 0 0 0 0 0 0 0 100 0 1 512 512 512 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 </pre> ---++ Possible Disk problem on t3fs07 There is one disk (c2t4)with a high *Reallocated sector count*. At the end of the migration the count is now 68, while at the start it was 36. Need to watch this disk and potentially remove it soon. 
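Watching the disk can be partly automated. The following is a minimal sketch, assuming that the reallocated sector count is the sixth whitespace-separated field of each row in the =hd -R= listing shown next (index, disk name, then the SMART raw values). That position is inferred from comparing the c2t4 row with its detail block, not taken from the tool's documentation. The function name =flag_realloc= and the threshold are made up for illustration.

```shell
#!/bin/sh
# flag_realloc THRESHOLD
# Reads "hd -R"-style rows on stdin and prints "<disk> <count>" for every
# disk whose assumed reallocated-sector field ($6) exceeds THRESHOLD.
flag_realloc() {
    # Strip the TWiki colour markup in case rows are pasted from this page.
    sed -e 's/%BLUE%//g' -e 's/%ENDCOLOR%//g' |
    awk -v thr="$1" '$2 ~ /^c[0-9]+t[0-9]+$/ && ($6 + 0) > (thr + 0) { print $2, $6 }'
}

# Possible use on the storage node (not run here):
#   hd -R | flag_realloc 10
```

A threshold slightly above the small counts seen on healthy disks keeps the output limited to the disks that need attention.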
<pre>
root@t3fs07 $ hd -R
0 c1t0 66283659 0 19 0 3342478 1068 0 19 0 0 0 0 571932703 31 0 20 66283659 0 0 0
1 c1t1 2930967 0 19 0 21478147817 1068 0 19 0 0 0 0 605552674 34 0 20 2930967 0 0 0
2 c1t2 5077966 0 19 0 3324067 1068 0 19 0 0 0 0 622329890 34 0 19 5077966 0 0 0
3 c1t3 242949449 0 19 4 3307947 1068 0 19 0 0 0 0 656080932 36 0 20 242949449 0 0 0
4 c1t4 242528275 0 19 0 3307695 1068 0 19 0 0 0 0 538312733 30 0 19 242528275 0 0 0
5 c1t5 3790302 0 19 0 3287320 1068 0 19 0 0 0 0 605552673 33 0 20 3790302 0 0 0
6 c1t6 5059578 0 19 0 3321877 1068 0 19 0 0 0 0 622329890 34 0 20 5059578 0 0 0
7 c1t7 2727873 0 19 0 3344553 1068 0 19 0 0 0 0 639238179 35 0 20 2727873 0 0 0
8 c2t0 223201082 0 19 0 3325788 1068 0 19 0 0 0 0 538378270 30 0 20 223201082 0 0 0
9 c2t1 232498400 0 18 0 3316399 1088 0 18 0 0 0 0 605552673 33 0 21 232498400 0 0 0
10 c2t2 228783842 0 19 0 3241594 1068 0 19 0 0 0 0 622395426 34 0 20 228783842 0 0 0
11 c2t3 231809539 0 19 1 3327058 1068 0 19 0 0 0 0 622395426 34 0 19 231809539 0 0 0
12 c2t4 232543595 0 19 %BLUE%68%ENDCOLOR% 3159134 1068 0 19 0 0 0 0 538312733 29 0 20 232543595 0 0 0
13 c2t5 228864647 0 18 0 3350283 1087 0 18 0 0 0 0 588775456 32 0 21 228864647 0 0 0
14 c2t6 229779066 0 19 0 3331560 1068 0 19 0 0 0 0 622395426 34 0 21 229779066 0 0 0
15 c2t7 223600250 0 19 0 3285839 1068 0 19 0 0 0 0 622395425 33 0 19 223600250 0 0 0
16 c3t0 6147152 0 19 0 3305037 1068 0 19 0 0 0 0 538247195 27 0 19 6147152 0 0 0
17 c3t1 235186313 0 19 4 3365681 1068 0 19 0 0 0 0 571932701 29 0 20 235186313 0 0 0
18 c3t2 2435812 0 19 0 3341089 1068 0 19 0 0 0 0 605487135 31 0 20 2435812 0 0 0
19 c3t3 1793250 0 19 0 3316889 1068 0 19 0 0 0 0 622395425 32 0 20 1793250 0 0 0
20 c3t4 4979150 0 19 0 3277658 1068 0 19 0 0 0 0 555089947 27 0 20 4979150 0 0 0
21 c3t5 5476969 0 19 0 4298296313 1068 0 19 0 0 0 0 605487134 30 0 20 5476969 0 0 0
22 c3t6 9745391 0 19 0 3296322 1068 0 19 0 0 0 0 605487135 31 0 19 9745391 0 0 0
23 c3t7 229252985 0 19 0 3364089 1068 0 19 0 0 0 0 656015393 33 0 21 229252985 0 0 0
24 c4t0 3458658 0 19 0 3324699 1068 0 19 0 0 0 0 588709915 27 0 21 3458658 0 0 0
25 c4t1 243539118 0 19 0 3286231 1068 0 19 0 0 0 0 605487133 29 0 20 243539118 0 0 0
26 c4t2 3525989 0 19 0 3344955 1068 0 19 0 0 0 0 622329887 31 0 20 3525989 0 0 0
27 c4t3 5895640 0 19 0 3351060 1068 0 19 0 0 0 0 639238176 31 0 20 5895640 0 0 0
28 c4t4 15660084 0 19 0 3313283 1068 0 19 0 0 0 0 588775451 27 0 21 15660084 0 0 0
29 c4t5 4040549 0 19 0 3312089 1068 0 19 0 0 0 0 622329885 29 0 21 4040549 0 0 0
30 c4t6 240433086 0 19 0 3291111 1068 0 19 0 0 0 0 622264350 30 0 20 240433086 0 0 0
31 c4t7 5114925 0 19 0 3308216 1068 0 19 0 0 0 0 656015391 31 0 20 5114925 0 0 0
32 c5t0 225485472 0 19 0 3326328 1068 0 19 0 0 0 0 571932697 25 0 20 225485472 0 0 0
33 c5t1 227343681 0 19 0 3345656 1068 0 19 0 0 0 0 588644379 27 0 18 227343681 0 0 0
34 c5t2 223864868 0 19 0 3295266 1068 0 19 0 0 0 0 639107102 30 0 20 223864868 0 0 0
35 c5t3 234352474 0 19 0 3258373 1068 0 19 0 0 0 0 655949855 31 0 19 234352474 0 0 0
36 c5t4 227198216 0 19 0 3337940 1068 0 19 0 0 0 0 588709914 26 0 20 227198216 0 0 0
37 c5t5 234294848 0 19 0 3312139 1068 0 19 0 0 0 0 622329884 28 0 20 234294848 0 0 0
38 c5t6 228479038 0 18 0 3348027 1087 0 18 0 0 0 0 639107102 30 0 20 228479038 0 0 0
39 c5t7 241004477 0 19 0 3275427 1068 0 19 0 0 0 0 655949855 31 0 19 241004477 0 0 0
40 c6t0 4102149 0 18 0 3283200 1068 0 18 0 0 0 0 605552667 27 0 21 4102149 0 0 0
41 c6t1 8468354 0 18 0 3314118 1068 0 18 0 0 0 0 622329885 29 0 21 8468354 0 0 0
42 c6t2 16718824 0 18 0 3298324 1068 0 18 0 0 0 0 639041565 29 0 19 16718824 0 0 0
43 c6t3 8598303 0 18 0 3323599 1068 0 18 0 0 0 0 655949855 31 0 19 8598303 0 0 0
44 c6t4 3008782 0 18 0 3348042 1068 0 18 0 0 0 0 588709914 26 0 20 3008782 0 0 0
45 c6t5 190674169 0 18 0 2441648 1068 0 18 0 0 0 0 639107101 29 0 21 190674169 0 0 0
46 c6t6 192396928 0 18 0 2410014 1068 0 18 0 0 0 0 655949854 30 0 20 192396928 0 0 0
47 c6t7 189479222 0 18 1 2435767 1068 0 18 0 0 0 0 672792607 31 0 20 189479222 0 0 0

#Details for this disk
12 c2t4
======
Revision: 10
Offline status 130
Selftest status 0
Seconds to collect 625
Time in minutes to run short selftest 1
Time in minutes to run extended selftest 230
Offline capability 123
SMART capability 3
Error logging capability 1
Checksum 0xfb

    Identification                  Status  Current  Worst  Raw data
  1 Raw read error rate             0xf          83     63  232543979
  3 Spin up time                    0x3          99     99  0
  4 Start/Stop count                0x32        100    100  19
  5 Reallocated sector count        0x33         97     97  68
  7 Seek error rate                 0xf          65     60  3160217
  9 Power on hours count            0x32         99     99  1068
 10 Spin retry count                0x13        100    100  0
 12 Device power cycle count        0x32        100     37  19
184 IOEDC Error Count               0x32        100    100  0
187 Uncorrectable Errors for Host   0x32        100    100  0
188 Command Timeout Count           0x32        100    100  0
189 High Fly Writes                 0x3a        100    100  0
190 Airflow Temperature (WDC)       0x22         71     68  538312733
194 Temperature                     0x22         29     40  29/ 0/ 20 (degrees C cur/min/max)
195 Hardware ECC Recovered          0x1a         38     33  232543979
197 Current pending sector count    0x12        100    100  0
198 Scan uncorrected sector count   0x10        100    100  0
199 Ultra DMA CRC error count       0x3e        200    200  0
</pre>

---++ Checking the migration by hand

<pre>
dc_get_rep_ls.sh t3fs05_cms > t3fs05_cms-20100318.lst
dc_get_rep_ls.sh t3fs07_cms > t3fs07_cms-20100318.lst

# Note: the loop below reads the -sorted.lst versions of the two listings;
# the step that produced them is not recorded in this log.
for n in $(cat t3fs05_cms-20100318-sorted.lst); do
    # collect every ID from the source pool that is not found on the target pool
    if ! grep -q "$n" t3fs07_cms-20100318-sorted.lst; then
        echo "$n" >> missing-on-t3fs07.lst
    fi
done

# This yields a list with most files missing from pnfs, except for two files
dc_get_pnfsname_from_IDlist.sh missing-on-t3fs07.lst
...
0002000000000000007CA2B8 Error:Missing
0002000000000000007EB130 /pnfs/psi.ch/cms/trivcat/store/user/pixel/PixelTree/CollisionsHFAPixel-minBias-mc-mc900gev-0025.csh
0002000000000000007EB820 Error:Missing
0002000000000000007F67E8 /pnfs/psi.ch/cms/trivcat/store/user/andis/TTbar-mcatnlo/TTbar-mcatnlo_Summer09-MC_31X_V3-v1_GEN-SIM-RECO/f6b51da284f946d445f4388d5114091f/output_46.root
...
</pre>

Most of the files on this list have no pnfs mapping. Part of this is due to files having been deleted while the migration was running. The two files that do have mappings had already been marked as error files by *dc_poolconsistency_checker.sh* yesterday. Listing the physical files on the source node t3fs05 reveals that they have size 0, so the problem already exists in the original files.

<pre>
bash-4.0# ls -l /data1/t3fs05_cms/data/0002000000000000007EB130
-rw-r--r-- 1 root root 0 Dec 8 05:31 /data1/t3fs05_cms/data/0002000000000000007EB130
bash-4.0# ls -l /data1/t3fs05_cms/data/0002000000000000007F67E8
-rw-r--r-- 1 root root 0 Dec 9 17:57 /data1/t3fs05_cms/data/0002000000000000007F67E8
</pre>

Bottom line: The pool copy has been successful.

-- Main.DerekFeichtinger - 2010-03-17
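As a postscript to the hand check above: the comparison of the two replica listings can also be done in one pass with =comm=. This is a sketch under the assumption that each listing holds exactly one ID per line. The function name =missing_ids= is made up for illustration. Unlike the =grep= loop, it compares whole lines rather than substrings.

```shell
#!/bin/sh
# missing_ids SRC_LIST DST_LIST
# Prints every ID that occurs in SRC_LIST but not in DST_LIST.
missing_ids() {
    a=$(mktemp)
    b=$(mktemp)
    # comm needs both inputs sorted in the same collation order
    LC_ALL=C sort -u "$1" > "$a"
    LC_ALL=C sort -u "$2" > "$b"
    # -2 drops lines only in DST, -3 drops lines common to both
    LC_ALL=C comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Possible use with the listings from this page (not run here):
#   missing_ids t3fs05_cms-20100318.lst t3fs07_cms-20100318.lst > missing-on-t3fs07.lst
```

For pools with many thousands of replicas this avoids rescanning the target listing once per source ID.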
---++ Topic attachments

| *Attachment* | *History* | *Size* | *Date* | *Who* |
| t3fs05io.png | r1 | 8.4 K | 2010-03-17 - 14:41 | DerekFeichtinger |
| t3fs07io.png | r1 | 8.1 K | 2010-03-17 - 14:42 | DerekFeichtinger |
Topic revision: r3 - 2010-04-07 - DerekFeichtinger