17. 03. 2010 Pool migration from t3fs05_cms to t3fs07_cms

migration commands

Actually, I will carry out a pool copy rather than a move, because I do not want to endanger our files in this first try.

I set the pool to read-only and started the migration with:

[t3se01.psi.ch] (local) admin > cd t3fs05_cms
[t3se01.psi.ch] (t3fs05_cms) admin > pool disable -rdonly
[t3se01.psi.ch] (t3fs05_cms) admin > migration copy t3fs07_cms

The migration does not terminate when the copy is done. The job stays in place (state SLEEPING) and keeps checking for new files to transfer:

[t3se01.psi.ch] (t3fs05_cms) admin > migration info 1                                                                         
Command    : migration copy t3fs07_cms                                                                                        
State      : SLEEPING                                                                                                         
Queued     : 31                                                                                                               
Attempts   : 4869                                                                                                             
Targets    : t3fs07_cms                                                                                                       
Concurrency: 2
Running tasks:
Most recent errors:
10:21:39 [4854] 000200000000000000935910: Transfer to [t3fs07_cms@local] failed (Not in trash: 000200000000000000935910)
10:22:08 [4855] 0002000000000000009358E0: Transfer to [t3fs07_cms@local] failed (Not in trash: 0002000000000000009358E0)
10:22:12 [4856] 000200000000000000935EE0: Transfer to [t3fs07_cms@local] failed (Not in trash: 000200000000000000935EE0)
10:22:37 [4857] 000200000000000000937460: Transfer to [t3fs07_cms@local] failed (Not in trash: 000200000000000000937460)
10:22:40 [4858] 000200000000000000935E40: Transfer to [t3fs07_cms@local] failed (Not in trash: 000200000000000000935E40)
10:23:07 [4860] 000200000000000000935D30: Transfer to [t3fs07_cms@local] failed (Not in trash: 000200000000000000935D30)
10:23:10 [4859] 000200000000000000936090: Transfer to [t3fs07_cms@local] failed (Not in trash: 000200000000000000936090)
10:23:39 [4862] 000200000000000000936540: Transfer to [t3fs07_cms@local] failed (Not in trash: 000200000000000000936540)
10:23:41 [4861] 000200000000000000935880: Transfer to [t3fs07_cms@local] failed (Not in trash: 000200000000000000935880)
10:24:12 [4863] 0002000000000000009366F0: Transfer to [t3fs07_cms@local] failed (Not in trash: 0002000000000000009366F0)
10:24:13 [4864] 0002000000000000009359D0: Transfer to [t3fs07_cms@local] failed (Not in trash: 0002000000000000009359D0)
10:24:44 [4866] 0002000000000000009372B0: Transfer to [t3fs07_cms@local] failed (Not in trash: 0002000000000000009372B0)
10:24:45 [4865] 000200000000000000936780: Transfer to [t3fs07_cms@local] failed (Not in trash: 000200000000000000936780)
10:25:15 [4867] 000200000000000000936E40: Transfer to [t3fs07_cms@local] failed (Not in trash: 000200000000000000936E40)
10:25:16 [4868] 0002000000000000009376B0: Transfer to [t3fs07_cms@local] failed (Not in trash: 0002000000000000009376B0)

The pool seems to retry failed transfers indefinitely.

[t3se01.psi.ch] (t3fs05_cms) admin > migration cancel 1
[1] CANCELLING   migration copy t3fs07_cms
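
To see which files the job kept retrying, the "Most recent errors" lines can be reduced to a list of unique IDs. The following is only a sketch: it assumes the output of "migration info" has been saved to a text file (the file name in the usage comment is hypothetical) and that the error lines keep the layout shown above, with the ID as the third whitespace-separated field.

```shell
# Read saved "migration info" output on stdin and print each failing ID once.
# Assumes the line layout "<time> [<task>] <ID>: Transfer to ... failed (...)".
failed_ids() {
    awk '/Transfer to .* failed/ { sub(/:$/, "", $3); print $3 }' | sort -u
}

# usage (hypothetical file name):
#   failed_ids < migration-info-1.txt
```

The resulting IDs can then be checked against pnfs in the same way as the list of missing files further down.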

some monitoring information

The throughput from t3fs05 to t3fs07 was roughly 100 MB/s. This was probably limited by network bandwidth: although both nodes have four trunked 1 Gb Ethernet links, only a single link was used, because the trunk's hash function is based only on IP and MAC addresses and therefore gives the same result for all connections between the two nodes.
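
A rough illustration of why the trunk does not help here. The hash below is a toy (XOR of the last IP octets, modulo the number of links), not the algorithm actually used by the switch or the NIC driver, and the IP addresses are made-up documentation addresses. The point is only that a hash which ignores the TCP port cannot spread several connections between the same two hosts over different links.

```shell
# Toy hash, NOT the real trunking algorithm: XOR the last octets of the two
# IP addresses and take the result modulo the number of links.
pick_link() {
    src_ip=$1
    dst_ip=$2
    nlinks=$3
    src_last=${src_ip##*.}
    dst_last=${dst_ip##*.}
    echo $(( (src_last ^ dst_last) % nlinks ))
}

# Hypothetical addresses for the two file servers.
for port in 33001 33002 33003; do
    # The TCP port is deliberately not an input to the hash.
    echo "connection from port $port -> link $(pick_link 192.0.2.5 192.0.2.7 4)"
done

# Theoretical ceiling of a single 1 Gb/s link in MB/s (8 bits per byte),
# before any protocol overhead.
echo "single link ceiling: $(( 1000 / 8 )) MB/s"
```

Every connection lands on the same link, and a ceiling of 125 MB/s on one link is consistent with the observed rate of roughly 100 MB/s.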

  • t3fs05io.png (attached image)
  • t3fs07io.png (attached image)

On t3fs05:

fsstat /data1
 new  name   name  attr  attr lookup rddir  read read  write write
 file remov  chng   get   set    ops   ops   ops bytes   ops bytes
    0     0     0     0     0      0     0   359 89.7M     0     0 /data1
    0     0     0     8     1     18     0   364 91.0M     0     0 /data1
    0     0     0     0     0      0     0   339 84.7M     0     0 /data1
    0     0     0     0     0      0     0   373 93.2M     0     0 /data1
    0     0     0     0     0      0     0   386 96.5M     0     0 /data1
    0     0     0     0     0      0     0   393 98.2M     0     0 /data1
    0     0     0     0     0      0     0   328 82.0M     0     0 /data1
    0     0     0     0     0      0     0   357 89.2M     0     0 /data1
    0     0     0     0     0      0     0   385 96.2M     0     0 /data1
    0     0     0     0     0      0     0   365 91.2M     0     0 /data1
    0     0     0     0     0      0     0   410  102M     0     0 /data1
    0     0     0     0     0      0     0   402  100M     0     0 /data1
    0     0     0     0     0      0     0   387 96.7M     0     0 /data1
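
To turn such a listing into a single number, the "read bytes" column can be averaged. This is a minimal sketch, assuming the fsstat output was saved to a file (the file name in the usage comment is hypothetical) and that all samples carry an M suffix as they do here; lines in other units are simply skipped.

```shell
# Average the "read bytes" column (9th field) of fsstat data lines.
# Only values with an M suffix are handled; header lines and other units
# do not match the pattern and are ignored.
avg_read_mb() {
    awk '$9 ~ /^[0-9.]+M$/ { sub(/M$/, "", $9); sum += $9; n++ }
         END { if (n) printf "%.1f\n", sum / n }'
}

# usage (hypothetical file name):
#   avg_read_mb < fsstat-t3fs05.txt
```

For the 13 samples listed above this works out to about 93 MB/s.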

bash-4.0# zpool iostat data1 5
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
data1       7.25T  12.7T    705      0  87.5M      0
data1       7.25T  12.7T    553      0  68.7M      0
data1       7.25T  12.7T    669      0  83.0M      0
data1       7.25T  12.7T    617     39  76.7M  75.6K
data1       7.25T  12.7T    669      0  83.0M      0
data1       7.25T  12.7T    751      0  93.2M      0
data1       7.25T  12.7T    711      0  88.4M      0
data1       7.25T  12.7T    617      0  76.7M      0
data1       7.25T  12.7T    707      0  87.8M      0
data1       7.25T  12.7T    760     27  94.2M  50.6K
data1       7.25T  12.7T    617      0  76.7M      0
data1       7.25T  12.7T    748      0  92.8M      0
data1       7.25T  12.7T    759      0  94.2M      0
data1       7.25T  12.7T    614      0  76.2M      0
data1       7.25T  12.7T    788      0  97.8M      0
data1       7.25T  12.7T    772     39  95.8M  77.4K


bash-4.0# /opt/csw/share/dtracetoolkit/Bin/bitesize.d    # ca 10 sec
Tracing... Hit Ctrl-C to end.
^C

     PID  CMD
    8045  /usr/jdk1.6.0_17/bin/java -server -Xmx512m -XX:MaxDirectMemorySize=512m -Dsun.n\0

           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 35189
           16384 |                                         0
           32768 |                                         0
           65536 |                                         223
          131072 |                                         0

       0  sched\0

           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |@@@@@@@@@@@@@@@@@@@@@                    6338
           16384 |@@@                                      771
           32768 |@@@@@                                    1489
           65536 |@@@@@@@@@@@                              3238
          131072 |                                         0


bash-4.0# iopattern
%RAN %SEQ  COUNT    MIN    MAX    AVG     KR     KW
  85   15   3491  14336 118272  35335 120466      0
  85   15   2653  14336 121344  31628  81943      0
  84   16   2560  14336 118272  32658  81646      0
  86   14   2862  14336 117248  28436  79478      0
  84   16   4236  14336 118272  28817 119210      0
  85   15   7101  14336 124928  27109 187990      0
  85   15   2894  14336 124416  28249  79837      0
  85   15   5592  14336 124928  29536 161297      0
  84   16   2946  14336 126976  27911  80301      0
  86   14   6439  14336 124928  24641 154949      0
  81   19   3381  14336 125440  33700 111271      0
  78   22   2504   4096 117248  35844  87651      0
  83   17   5504  14336 117248  29391 157981      0
  88   12   4689  14336 117248  25617 117303      0
  83   17   2576  14336 117248  32142  80857      0
  86   14   5250  14336 127488  29661 152075      0

On the target node t3fs07

fsstat /data1 1
 new  name   name  attr  attr lookup rddir  read read  write write
 file remov  chng   get   set    ops   ops   ops bytes   ops bytes
    0     0     0     0     0      0     0     0     0   375 93.7M /data1
    0     0     0     0     0      0     0     0     0   348 87.0M /data1
    0     0     0     0     0      0     0     0     0   390 97.5M /data1
    0     0     0     0     0      0     0     0     0   384 96.0M /data1
    0     0     0     0     0      0     0     0     0   311 77.7M /data1
    0     0     0     0     0      0     0     0     0   374 93.5M /data1
    0     0     0     0     0      0     0     0     0   376 94.0M /data1
    0     0     0     0     0      0     0     0     0   376 94.0M /data1
    0     0     0     0     0      0     0     0     0   399  100M /data1
    0     0     0     0     0      0     0     0     0   411  103M /data1
    0     0     0     0     0      0     0     0     0   379 94.7M /data1
    0     0     0     0     0      0     0     0     0   376 94.0M /data1
    0     0     0     0     0      0     0     0     0   375 93.7M /data1

zpool iostat data1 5
data1       1.86T  38.8T      0      0      0      0
data1       1.86T  38.8T      0    942      0   118M
data1       1.86T  38.8T      0  2.81K    102   355M
data1       1.87T  38.8T      0    163      0  9.75M
data1       1.87T  38.8T      0      0      0      0
data1       1.87T  38.8T      0      0      0      0
data1       1.87T  38.8T      0      0      0      0
data1       1.87T  38.8T      0  1.48K      0   190M
data1       1.87T  38.8T      0  1.57K      0   201M
data1       1.87T  38.8T      0  1.06K      0   120M
data1       1.87T  38.8T      0      0      0      0
data1       1.87T  38.8T      0      0      0      0
data1       1.87T  38.8T      0      0      0      0
data1       1.87T  38.8T      0      0      0      0
data1       1.87T  38.8T      0      0    102      0
data1       1.87T  38.8T      0  4.15K      0   516M
data1       1.87T  38.8T      0    479      0  60.0M
data1       1.87T  38.8T      0      0      0   102K
data1       1.87T  38.8T      0      0      0      0
data1       1.87T  38.8T      0  1.92K      0   245M
data1       1.87T  38.8T      0  1.56K      0   200M
data1       1.88T  38.7T      0    358    510  29.2M
data1       1.88T  38.7T      0      0      0      0
data1       1.88T  38.7T      0      0      0      0


root@t3fs07 $ /opt/csw/share/dtracetoolkit/Bin/bitesize.d    # ca 10 sec.
Tracing... Hit Ctrl-C to end.
^C

     PID  CMD
       0  sched\0

           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@                                        1074
            1024 |                                         415
            2048 |                                         183
            4096 |                                         14
            8192 |                                         1
           16384 |@@@@@@@@@                                10214
           32768 |@@                                       2684
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@               30485
          131072 |@                                        1427
          262144 |                                         0

root@t3fs07 $ iopattern
%RAN %SEQ  COUNT    MIN    MAX    AVG     KR     KW
   0    0      0      0      0      0      0      0
   0    0      0      0      0      0      0      0
   0    0      0      0      0      0      0      0
   0    0      0      0      0      0      0      0
  61   39  12027  18432 131072  52277      0 614000
  60   40  12603  18432 131072  55695      0 685485
  41   59     36  18432  18944  16696      0    587
   0    0      0      0      0      0      0      0
  63   37   1203  18432 131072  45644      0  53623
   0    0      0      0      0      0      0      0
   0    0      0      0      0      0      0      0
   0    0      0      0      0      0      0      0
   0    0      0      0      0      0      0      0
   0    0      0      0      0      0      0      0
   0    0      0      0      0      0      0      0
   0    0      0      0      0      0      0      0
   0    0      0      0      0      0      0      0
  66   34   7472  18432 131072  49895      0 364050
  60   40  20027  18432 131072  55019      0 1076054
  43   57  12215    512 131072  55311      0 659792
%RAN %SEQ  COUNT    MIN    MAX    AVG     KR     KW
  93    7    167    512  18944   4982      0    812
   0    0      0      0      0      0      0      0
 100    0      1    512    512    512      0      0
   0    0      0      0      0      0      0      0
   0    0      0      0      0      0      0      0

Possible Disk problem on t3fs07

There is one disk (c2t4) with a high reallocated sector count. The count was 36 at the start of the migration and is 68 at the end, so 32 sectors were reallocated during the copy. We need to watch this disk and potentially remove it soon.

root@t3fs07 $ hd -R
 0 c1t0  66283659            0 19 0 3342478 1068 0 19 0 0 0 0 571932703  31   0  20 66283659 0 0 0
 1 c1t1   2930967            0 19 0 21478147817 1068 0 19 0 0 0 0 605552674  34   0  20 2930967 0 0 0
 2 c1t2   5077966            0 19 0 3324067 1068 0 19 0 0 0 0 622329890  34   0  19 5077966 0 0 0
 3 c1t3  242949449            0 19 4 3307947 1068 0 19 0 0 0 0 656080932  36   0  20 242949449 0 0 0
 4 c1t4  242528275            0 19 0 3307695 1068 0 19 0 0 0 0 538312733  30   0  19 242528275 0 0 0
 5 c1t5   3790302            0 19 0 3287320 1068 0 19 0 0 0 0 605552673  33   0  20 3790302 0 0 0
 6 c1t6   5059578            0 19 0 3321877 1068 0 19 0 0 0 0 622329890  34   0  20 5059578 0 0 0
 7 c1t7   2727873            0 19 0 3344553 1068 0 19 0 0 0 0 639238179  35   0  20 2727873 0 0 0
 8 c2t0  223201082            0 19 0 3325788 1068 0 19 0 0 0 0 538378270  30   0  20 223201082 0 0 0
 9 c2t1  232498400            0 18 0 3316399 1088 0 18 0 0 0 0 605552673  33   0  21 232498400 0 0 0
10 c2t2  228783842            0 19 0 3241594 1068 0 19 0 0 0 0 622395426  34   0  20 228783842 0 0 0
11 c2t3  231809539            0 19 1 3327058 1068 0 19 0 0 0 0 622395426  34   0  19 231809539 0 0 0
12 c2t4  232543595            0 19 68 3159134 1068 0 19 0 0 0 0 538312733  29   0  20 232543595 0 0 0
13 c2t5  228864647            0 18 0 3350283 1087 0 18 0 0 0 0 588775456  32   0  21 228864647 0 0 0
14 c2t6  229779066            0 19 0 3331560 1068 0 19 0 0 0 0 622395426  34   0  21 229779066 0 0 0
15 c2t7  223600250            0 19 0 3285839 1068 0 19 0 0 0 0 622395425  33   0  19 223600250 0 0 0
16 c3t0   6147152            0 19 0 3305037 1068 0 19 0 0 0 0 538247195  27   0  19 6147152 0 0 0
17 c3t1  235186313            0 19 4 3365681 1068 0 19 0 0 0 0 571932701  29   0  20 235186313 0 0 0
18 c3t2   2435812            0 19 0 3341089 1068 0 19 0 0 0 0 605487135  31   0  20 2435812 0 0 0
19 c3t3   1793250            0 19 0 3316889 1068 0 19 0 0 0 0 622395425  32   0  20 1793250 0 0 0
20 c3t4   4979150            0 19 0 3277658 1068 0 19 0 0 0 0 555089947  27   0  20 4979150 0 0 0
21 c3t5   5476969            0 19 0 4298296313 1068 0 19 0 0 0 0 605487134  30   0  20 5476969 0 0 0
22 c3t6   9745391            0 19 0 3296322 1068 0 19 0 0 0 0 605487135  31   0  19 9745391 0 0 0
23 c3t7  229252985            0 19 0 3364089 1068 0 19 0 0 0 0 656015393  33   0  21 229252985 0 0 0
24 c4t0   3458658            0 19 0 3324699 1068 0 19 0 0 0 0 588709915  27   0  21 3458658 0 0 0
25 c4t1  243539118            0 19 0 3286231 1068 0 19 0 0 0 0 605487133  29   0  20 243539118 0 0 0
26 c4t2   3525989            0 19 0 3344955 1068 0 19 0 0 0 0 622329887  31   0  20 3525989 0 0 0
27 c4t3   5895640            0 19 0 3351060 1068 0 19 0 0 0 0 639238176  31   0  20 5895640 0 0 0
28 c4t4  15660084            0 19 0 3313283 1068 0 19 0 0 0 0 588775451  27   0  21 15660084 0 0 0
29 c4t5   4040549            0 19 0 3312089 1068 0 19 0 0 0 0 622329885  29   0  21 4040549 0 0 0
30 c4t6  240433086            0 19 0 3291111 1068 0 19 0 0 0 0 622264350  30   0  20 240433086 0 0 0
31 c4t7   5114925            0 19 0 3308216 1068 0 19 0 0 0 0 656015391  31   0  20 5114925 0 0 0
32 c5t0  225485472            0 19 0 3326328 1068 0 19 0 0 0 0 571932697  25   0  20 225485472 0 0 0
33 c5t1  227343681            0 19 0 3345656 1068 0 19 0 0 0 0 588644379  27   0  18 227343681 0 0 0
34 c5t2  223864868            0 19 0 3295266 1068 0 19 0 0 0 0 639107102  30   0  20 223864868 0 0 0
35 c5t3  234352474            0 19 0 3258373 1068 0 19 0 0 0 0 655949855  31   0  19 234352474 0 0 0
36 c5t4  227198216            0 19 0 3337940 1068 0 19 0 0 0 0 588709914  26   0  20 227198216 0 0 0
37 c5t5  234294848            0 19 0 3312139 1068 0 19 0 0 0 0 622329884  28   0  20 234294848 0 0 0
38 c5t6  228479038            0 18 0 3348027 1087 0 18 0 0 0 0 639107102  30   0  20 228479038 0 0 0
39 c5t7  241004477            0 19 0 3275427 1068 0 19 0 0 0 0 655949855  31   0  19 241004477 0 0 0
40 c6t0   4102149            0 18 0 3283200 1068 0 18 0 0 0 0 605552667  27   0  21 4102149 0 0 0
41 c6t1   8468354            0 18 0 3314118 1068 0 18 0 0 0 0 622329885  29   0  21 8468354 0 0 0
42 c6t2  16718824            0 18 0 3298324 1068 0 18 0 0 0 0 639041565  29   0  19 16718824 0 0 0
43 c6t3   8598303            0 18 0 3323599 1068 0 18 0 0 0 0 655949855  31   0  19 8598303 0 0 0
44 c6t4   3008782            0 18 0 3348042 1068 0 18 0 0 0 0 588709914  26   0  20 3008782 0 0 0
45 c6t5  190674169            0 18 0 2441648 1068 0 18 0 0 0 0 639107101  29   0  21 190674169 0 0 0
46 c6t6  192396928            0 18 0 2410014 1068 0 18 0 0 0 0 655949854  30   0  20 192396928 0 0 0
47 c6t7  189479222            0 18 1 2435767 1068 0 18 0 0 0 0 672792607  31   0  20 189479222 0 0 0


#Details for this disk
12 c2t4
======
Revision: 10
Offline status 130
Selftest status 0
Seconds to collect 625
Time in minutes to run short selftest 1
Time in minutes to run extended selftest 230
Offline capability 123
SMART capability 3
Error logging capability 1
Checksum 0xfb
Identification                     Status Current Worst         Raw data
  1 Raw read error rate            0xf         83    63        232543979
  3 Spin up time                   0x3         99    99                0
  4 Start/Stop count               0x32       100   100               19
  5 Reallocated sector count       0x33        97    97               68
  7 Seek error rate                0xf         65    60          3160217
  9 Power on hours count           0x32        99    99             1068
 10 Spin retry count               0x13       100   100                0
 12 Device power cycle count       0x32       100    37               19
184 IOEDC Error Count              0x32       100   100                0
187 Uncorrectable Errors for Host  0x32       100   100                0
188 Command Timeout Count          0x32       100   100                0
189 High Fly Writes                0x3a       100   100                0
190 Airflow Temperature (WDC)      0x22        71    68        538312733
194 Temperature                    0x22        29    40  29/  0/ 20 (degrees C cur/min/max)
195 Hardware ECC Recovered         0x1a        38    33        232543979
197 Current pending sector count   0x12       100   100                0
198 Scan uncorrected sector count  0x10       100   100                0
199 Ultra DMA CRC error count      0x3e       200   200                0
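
To keep an eye on this, the "hd -R" table can be filtered for disks above a threshold. The sketch below rests on an assumption: comparing the c2t4 row with the detailed listing (19 start/stop, 68 reallocated, 1068 power-on hours) suggests that the sixth column of "hd -R" is the reallocated sector count, but this is inferred from the numbers, not taken from the hd documentation.

```shell
# Print "<disk> <count>" for every disk whose reallocated sector count
# exceeds the given limit.
# Assumption: column 2 of "hd -R" is the disk name and column 6 is the raw
# value of SMART attribute 5 (reallocated sector count).
realloc_above() {
    awk -v limit="$1" '$6 + 0 > limit + 0 { print $2, $6 }'
}

# usage:
#   hd -R | realloc_above 10
```

With a limit of 10, only c2t4 would be reported for the table above; c1t3 and c3t1 (4 each) and c2t3 and c6t7 (1 each) stay below it.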

Checking the migration by hand

dc_get_rep_ls.sh t3fs05_cms > t3fs05_cms-20100318.lst
dc_get_rep_ls.sh t3fs07_cms > t3fs07_cms-20100318.lst

# the *-sorted.lst files are presumably sorted copies of the two listings above (that step is not shown here)
for n in $(cat t3fs05_cms-20100318-sorted.lst); do
    grep -q "$n" t3fs07_cms-20100318-sorted.lst || echo "$n" >> missing-on-t3fs07.lst
done
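
The loop above runs grep once per ID, which gets slow for large pools. A possible alternative is a single pass with comm. This is a sketch, assuming that each line of the dc_get_rep_ls.sh output is exactly one ID (comm compares whole lines, while grep matches substrings); the function name is made up for illustration.

```shell
# Print the lines that appear in the first list but not in the second.
# Both inputs are sorted inside the function (in the C locale, so that sort
# and comm agree on the ordering), so pre-sorted files are not required.
missing_ids() {
    src_sorted=$(mktemp)
    dst_sorted=$(mktemp)
    LC_ALL=C sort -u "$1" > "$src_sorted"
    LC_ALL=C sort -u "$2" > "$dst_sorted"
    # -2 drops lines only in the second file, -3 drops lines common to both
    LC_ALL=C comm -23 "$src_sorted" "$dst_sorted"
    rm -f "$src_sorted" "$dst_sorted"
}

# usage (file names from the listing commands above):
#   missing_ids t3fs05_cms-20100318.lst t3fs07_cms-20100318.lst > missing-on-t3fs07.lst
```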

# This yields a list with most files missing from pnfs, except for two files
dc_get_pnfsname_from_IDlist.sh missing-on-t3fs07.lst
...
0002000000000000007CA2B8 Error:Missing                                                                                                                  
0002000000000000007EB130 /pnfs/psi.ch/cms/trivcat/store/user/pixel/PixelTree/CollisionsHFAPixel-minBias-mc-mc900gev-0025.csh                            
0002000000000000007EB820 Error:Missing                                                                                                                  
0002000000000000007F67E8 /pnfs/psi.ch/cms/trivcat/store/user/andis/TTbar-mcatnlo/TTbar-mcatnlo_Summer09-MC_31X_V3-v1_GEN-SIM-RECO/f6b51da284f946d445f4388d5114091f/output_46.root                        
...

Most of these IDs have no pnfs mapping. Part of this is due to files having been deleted while the migration was running.

The two files with mappings had already been marked as error files by dc_poolconsistency_checker.sh yesterday. Listing the physical files on the source node t3fs05 shows that both have size 0, so the problem already exists in the original files:

bash-4.0# ls -l /data1/t3fs05_cms/data/0002000000000000007EB130
-rw-r--r--   1 root     root           0 Dec  8 05:31 /data1/t3fs05_cms/data/0002000000000000007EB130
bash-4.0# ls -l /data1/t3fs05_cms/data/0002000000000000007F67E8
-rw-r--r--   1 root     root           0 Dec  9 17:57 /data1/t3fs05_cms/data/0002000000000000007F67E8
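
To check whether more such empty replicas exist on the source pool, the data directory can be scanned for zero-length files. This is a minimal sketch using standard find options; the function name is made up, and the directory in the usage comment is the one from the ls commands above.

```shell
# List all regular files of size 0 below the given directory.
zero_length_files() {
    find "$1" -type f -size 0
}

# usage:
#   zero_length_files /data1/t3fs05_cms/data
```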

Bottom line: The pool copy has been successful.

-- DerekFeichtinger - 2010-03-17


Topic attachments
Attachment    History  Size   Date                Who
t3fs05io.png  r1       8.4 K  2010-03-17 - 14:41  DerekFeichtinger
t3fs07io.png  r1       8.1 K  2010-03-17 - 14:42  DerekFeichtinger
Topic revision: r3 - 2010-04-07 - DerekFeichtinger
 