Transferring data#
When working on gadi, often you’ll have to transfer large amounts of data (gigabytes or terabytes) between different disks to deal with changing project quotas.
Sometimes you may also need to upload data to, or download data from, gadi.
Managing terabytes of data between disks is not a trivial process, for several reasons:
- Networking issues during the copying process could cause the task to fail before completion.
- The task may exceed the gadi login node job limits (30 minutes).
- Excessive disk use can overload the gadi login nodes.
- Disk errors may occur, i.e. the copied file may be incomplete or corrupted.
To guard against these problems, NCI recommend the following steps.
- When uploading or downloading data, use a dedicated gadi 'data-mover' node (as opposed to a standard login node) for smaller, interactive tasks; see the example command after this list. Using these nodes reduces CPU and I/O loads on the login nodes. You cannot remotely login to a 'data-mover' node; it can only be used for securely copying files, e.g. using scp.
- Use the rsync command instead of the cp copy command. This gives you the ability to resume an interrupted transfer.
- For larger tasks that may take more than 30 minutes, use the copyq PBS queue and submit a batch job.
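For instance, a small interactive upload from your local machine could go through a data-mover node. The command below is illustrative only: gadi-dm.nci.org.au is the data-mover hostname documented by NCI at the time of writing, and the file, user ID and project are placeholders you should replace with your own.
# Illustrative example: copy a local archive to /scratch via a gadi data-mover node.
# Replace <userid> and <project> with your own values.
scp ./my_data.tar.gz <userid>@gadi-dm.nci.org.au:/scratch/<project>/<userid>/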
NCI have provided this link for extra information on the above points.
There is also this dedicated link with a guide on using the copyq job queue.
Example#
In this example, we will use the copyq queue to transfer about 7 Gigabytes of ERA5 anticyclone data from a su28 data disk to a gb02 scratch disk.
For transferring large amounts of data, we can use the ‘Message Digest Algorithm 5’ or MD5 checksum to check for data integrity.
First, I can use some bash logic to create an MD5 checksum for every file. I will specify the location of the data files, and where I want the MD5 checksum files to be written.
source_dir=/g/data/su28/weatherfeatures.era5/maxcl/cdf
md5_dir=/g/data/gb02/pag548/md5_keys
project=su28
Then I will loop over all the subdirectories in source_dir (there is one for every year from 1950 to 2022) and generate a checksum for each file in each yearly subdirectory.
cd ${source_dir}
for dir in *
do
cd ${source_dir}/${dir}
echo "$PWD find -type f -print0 | sort -z | xargs -r0 md5sum > ${md5_dir}/${dir}_${project}_md5sum.txt"
find -type f -print0 | sort -z | xargs -r0 md5sum > ${md5_dir}/${dir}_${project}_md5sum.txt
done
The bash logic does the following:
- uses find to grab every filename inside the subdirectory (the -print0 option produces a null-terminated list, so unusual filenames are handled safely)
- pipes this list of filenames, using the | operator, to the sort command to sort the filenames into alphabetical order
- then pipes the sorted list through xargs into md5sum
- then directs the output of md5sum into a text file ${md5_dir}/${dir}_${project}_md5sum.txt
Here is an example output file:
more 2021_su28_md5sum.txt
87d8d3e7f429b308d4d2a44553abfd77 ./A2021_01.nc
cca154c44d474aabda928ae26789ea9e ./A2021_02.nc
a95f322bbe0198265b78582f296166a7 ./A2021_03.nc
4d114b872b56188413c0229ec302d668 ./A2021_04.nc
621e9ddca68930440976279f585f54d5 ./A2021_05.nc
b2f8a29057b36b33e3f2ed4ef6baa884 ./A2021_06.nc
etc.
So the contents of every monthly netCDF file of ERA5 anticyclone data for a given year are now associated with a unique hash string.
Now, let’s create a bash script called copy_directory.sh to use rsync to transfer the dataset between our two disks.
#!/bin/bash
#PBS -l mem=2GB
#PBS -l ncpus=1
#PBS -q copyq
#PBS -l walltime=2:00:00
#PBS -l storage=gdata/su28+scratch/gb02
#PBS -P gb02
#PBS -o /home/548/pag548/code/software_engineering_demos/file_transfers/md5sum_copyq_rsync/copyq.out
#PBS -e /home/548/pag548/code/software_engineering_demos/file_transfers/md5sum_copyq_rsync/copyq.err
source_dir=/g/data/su28/weatherfeatures.era5/maxcl/cdf
target_dir=/scratch/gb02/pag548/
rsync -vP -r ${source_dir} ${target_dir}
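The rsync flags used here are -v (verbose output), -P (show transfer progress and keep partially transferred files, so an interrupted transfer can be resumed) and -r (recurse into subdirectories).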
I have to specify the filesystems I am transferring data between using the #PBS -l storage directive.
I can then submit this task to the copyq job queue using
$ qsub copy_directory.sh
Make sure copy_directory.sh has user executable permissions, i.e. it is coloured green in your terminal, or has the user permissions -rwx when you use a ‘long list’ (i.e. $ ls -l copy_directory.sh). To make a bash script executable by a user, type chmod u+x <filename>
You can check the progress of the job by typing
$ qstat
Job id Name User Time Use S Queue
--------------------- ---------------- ---------------- -------- - -----
137293930.gadi-pbs copy_directory.* <your userid> 0 Q copyq-exec
In this example the job is ‘queued’ in the copyq job queue and is waiting its turn to be executed.
The job will take just over a minute to finish. If you look at your ‘standard output’ file (defined using the #PBS -o flag) it should look something like
======================================================================================
Resource Usage on 2025-03-20 15:45:45:
Job Id: 137294139.gadi-pbs
Project: gb02
Exit Status: 0
Service Units: 0.12
NCPUs Requested: 1 NCPUs Used: 1
CPU Time Used: 00:00:48
Memory Requested: 10.0GB Memory Used: 6.49GB
Walltime requested: 02:00:00 Walltime Used: 00:01:25
JobFS requested: 100.0MB JobFS used: 0B
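An Exit Status of 0 indicates the job script completed without error.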
If we look at these files in their new location, their ownership will have changed to reflect our own user ID and project membership, e.g.
$ ls -lh 2020
total 101M
-rw-r--r-- 1 pag548 gb02 8.6M Mar 20 15:45 A2020_01.nc
-rw-r--r-- 1 pag548 gb02 7.6M Mar 20 15:45 A2020_02.nc
-rw-r--r-- 1 pag548 gb02 8.5M Mar 20 15:45 A2020_03.nc
-rw-r--r-- 1 pag548 gb02 9.1M Mar 20 15:45 A2020_04.nc
-rw-r--r-- 1 pag548 gb02 8.9M Mar 20 15:45 A2020_05.nc
etc
That is another advantage of using rsync over cp.
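If you do want the copies to keep the original timestamps and permissions, one variant (not used in the script above) is to add rsync's archive flag. A minimal sketch:
# Variant: preserve permissions, timestamps and symlinks with -a (which implies -r).
# Note that file ownership can still only be preserved when the transfer is run as root.
rsync -avP ${source_dir} ${target_dir}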
Now, to check the copied files, we can use the same checksum logic as before. Except this time, we point source_dir at the directory containing the copied files (e.g. under /scratch/gb02) and set the project variable to the new project (e.g. gb02).
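For example, assuming rsync placed the copies under /scratch/gb02/pag548/cdf (the source directory name appended to the target directory), the variables for the second checksum pass might look like this:
# Assumed location of the copied data; adjust to match your own transfer
source_dir=/scratch/gb02/pag548/cdf
md5_dir=/g/data/gb02/pag548/md5_keys
project=gb02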
We will now have two sets of MD5 checksum files, one for the originals and one for the copies. In this example, for each year you will have a pair of files such as
2022_su28_md5sum.txt
2022_gb02_md5sum.txt
in the directory where you created your MD5 files.
Now we can use some bash to compare the contents of each md5sum.txt file and check they are identical.
md5_dir=/g/data/gb02/pag548/md5_keys
cd ${md5_dir}
source_project=su28
target_project=gb02
for year in {1950..2022}
do
target_md5=${year}_${target_project}_md5sum.txt
source_md5=${year}_${source_project}_md5sum.txt
diff ${target_md5} ${source_md5}
done
This loops over every year and uses the diff command to check that the contents of each pair of md5sum.txt files are identical.
If the loop produces no output, the files have been copied correctly and you can be confident of their data integrity.
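An alternative (not part of the workflow above) is to skip the second checksum pass and verify the copies directly against the original checksum files using md5sum --check. A minimal sketch, assuming the copies live under /scratch/gb02/pag548/cdf:
# Sketch only: verify copied files against the original su28 checksum files.
# The paths are assumptions based on the transfer above; adjust them to your own layout.
md5_dir=/g/data/gb02/pag548/md5_keys
target_dir=/scratch/gb02/pag548/cdf
source_project=su28
for year in {1950..2022}
do
    # Run the check from inside each copied yearly subdirectory, because the
    # checksum files record relative paths such as ./A2021_01.nc
    (cd ${target_dir}/${year} && md5sum --check --quiet ${md5_dir}/${year}_${source_project}_md5sum.txt)
done
Only mismatched or missing files are reported; silence again means the copies are intact.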
The example scripts above are available here
You can alter them to reflect the directory structures of the data you are trying to copy.
You may have to alter the PBS parameters in your copyq job if you run out of memory or exceed walltime.