Using PDSF for ATLAS Work
From Atlas Wiki
If you find that any of these instructions are incomplete or obsolete, please send mail to Zach.
We have our own mailing list for ATLAS-specific PDSF issues and general requests for help, which you can sign up for through eGroups: atlas-lbl-pdsf-users. Please cc this list on support requests you think would be of interest to the general group.
PDSF maintains two mailing lists. The first, pdsf-users (@nersc.gov), automatically adds all users. The second, pdsf-status, is used to notify us about downtimes, meetings, etc. You are advised to join this list.
See this Tuesday meeting talk for some recent bits of information regarding PDSF.
PDSF has several operating systems installed, and you can switch between them using CHOS. The suggested OS is sl64 (Scientific Linux 6.4). sl53 is still available if you need it, but everyone is encouraged to move to SL6 as soon as they can, and for certain tasks (panda job submission, DQ2, using athena release 19 and higher) SL5 is no longer supported.
To use CHOS in (Ba)sh
To use CHOS in (t)csh
setenv CHOS sl53
chos
If you find that the default shell isn't in your preferred OS, then set the CHOS variable in your ~/.chos file.
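For bash users, a minimal sketch of what the equivalent of the tcsh lines might look like. The `export` form is an assumption, written by analogy with `setenv`, and sl64 is used because it is the OS suggested on this page. The `chos` call is guarded so that the snippet does nothing harmful on a machine where the command does not exist.

```shell
# Assumed bash analogue of "setenv CHOS ..." followed by "chos".
export CHOS=sl64    # sl64 is the OS suggested on this page

# Start a shell in the selected OS only where the chos command exists (i.e. on PDSF).
if command -v chos >/dev/null 2>&1; then
    chos
fi
```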
You can configure your login shell on PDSF at nim.nersc.gov. Log in with your PDSF username and password. The ATLAS-recommended shell is bash; csh is known to break in some cases because of the large amount of environment that cmt sets up.
Access ATLAS software releases
Last reviewed Feb. 2011 by Yushu Yao
All ATLAS releases are managed by CernVM-FS, a web-based, read-only file system. New releases are installed from the central CERN location and maintained by the ATLAS Release Management Team. No local installation or customization is needed.
To use ATLAS Software on PDSF, always do the following first:
source /common/atlas/scripts/setupATLAS.sh
setupATLAS
The above should be your first line in a batch job as well.
To show the available ATLAS software releases:
showVersions --show athena
To show the available ATLAS DB releases (Note: you should not need DBReleases any more!):
showVersions --show dbrelease
To set up an ATLAS release with a custom testarea:
asetup 184.108.40.206 --testarea=$HOME/mytestarea
The ATLAS software is managed by the ATLASLocalRootBase/ManageTier3SW package; please refer to https://twiki.atlas-canada.ca/bin/view/AtlasCanada/ATLASLocalRootBase for more features. The available ATLAS releases are listed on this website: http://atlas-computing.web.cern.ch/atlas-computing/projects/releases/status/ Please allow 3-5 days after the announcement for a new release to be installed and updated on CernVM-FS at PDSF.
Running Batch Jobs
PDSF uses the SGE batch system, which differs from the LSF batch system most of you are used to from lxplus at CERN.
Submit your jobs simply using
(there is no need to have $HOME/.sge_request any more!)
To submit to a test queue with a CPU limit of 10 minutes use:
qsub -l debug=1
Below is a sample job script to use CernVM-FS provided ATLAS Releases:
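% cat myscript.sh
shopt -s expand_aliases
source /common/atlas/scripts/setupATLAS.sh
setupATLAS
asetup 220.127.116.11,noTest,gcc48
athena.py blabla
Note that the line with "shopt" has to be there: non-interactive bash shells do not expand aliases by default, and the setup commands rely on them.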
When submitting use the "cvmfs=1" option in qsub:
qsub -l cvmfs=1 <normal options>
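The submission options above can be combined on one command line. The sketch below uses a hypothetical helper, `build_qsub_cmd`, that only prints the command it would run, so it can be tried anywhere. The resource requests `cvmfs=1` and `debug=1` are the ones given on this page; the script name is a placeholder.

```shell
# Hypothetical helper: print (do not run) a qsub command line for a job script.
# Usage: build_qsub_cmd <script> [debug]
build_qsub_cmd() {
    script=$1
    mode=${2:-normal}
    cmd="qsub -l cvmfs=1"            # CernVM-FS request, as described on this page
    if [ "$mode" = "debug" ]; then
        cmd="$cmd -l debug=1"        # test queue with the short CPU limit
    fi
    printf '%s %s\n' "$cmd" "$script"
}

build_qsub_cmd myscript.sh
build_qsub_cmd myscript.sh debug
```

On PDSF the printed line can be pasted into the shell, or the `printf` replaced by an actual call once the options look right.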
Interactive Batch Jobs
For interactive jobs that are a bit heavier than just editing files, you can run an interactive session on a batch node directly. The command
will open a new window with a terminal session on a batch node (it may take a few seconds to get an empty slot). If you're connected remotely, then you can use
instead, which basically just ssh's into a batch slot from your current session. Unfortunately the batch nodes do not have LDAP access, so you'll need to set up SSH keys on your PDSF account to gain access this way:
pdsf2 | ~ : ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/u/mhance/.ssh/id_rsa): # just hit enter
Enter passphrase (empty for no passphrase): # enter a passphrase if you want
Enter same passphrase again: # same passphrase again
Your identification has been saved in /u/mhance/.ssh/id_rsa.
Your public key has been saved in /u/mhance/.ssh/id_rsa.pub.
The key fingerprint is:
<numbers> mhance@pdsf2
pdsf2 | ~ : cat .ssh/id_rsa.pub >> .ssh/authorized_keys
That should do it.
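Running the `cat ... >> authorized_keys` step more than once appends the same key repeatedly. A minimal sketch of a repeat-safe variant follows, using a hypothetical helper `add_authorized_key`. The 700/600 permissions are the conventional ones for `~/.ssh` and are an assumption about what the PDSF batch nodes require.

```shell
# Hypothetical helper: append a public key to authorized_keys only if it is missing.
# Usage: add_authorized_key <public-key-file> [ssh-directory]
add_authorized_key() {
    pubkey=$1                        # e.g. ~/.ssh/id_rsa.pub
    sshdir=${2:-$HOME/.ssh}          # directory that holds authorized_keys
    mkdir -p "$sshdir" && chmod 700 "$sshdir"
    touch "$sshdir/authorized_keys"
    chmod 600 "$sshdir/authorized_keys"
    # -x: whole-line match, -F: fixed string, -f: take the pattern from the key file.
    if ! grep -qxF -f "$pubkey" "$sshdir/authorized_keys"; then
        cat "$pubkey" >> "$sshdir/authorized_keys"
    fi
}

# Example (on PDSF): add_authorized_key ~/.ssh/id_rsa.pub
```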
Once you're in the shell, you'll likely have to set up your environment again, including any "chos" commands you'd usually include in a batch submission script. Then you should be able to treat it like an interactive node. The session typically times out after 24 hours, just like a normal batch job.
Note also that if you normally need, e.g., 2GB of memory, or CVMFS, or anything like that, you should specify that in your login. For example:
qsh -l cvmfs -l h_vmem=2G
NX Server for PDSF Use
Connecting to PDSF using the NX server is highly recommended, particularly for those working at CERN. To set it up, follow the instructions here.
Setting Up SVN Access
For password-less access to svn, create a file called config in your ~/.ssh directory containing:
Host svn.cern.ch svn
    Protocol 2,1
    GSSAPIAuthentication yes
    GSSAPIDelegateCredentials yes
    ForwardX11Trusted yes
    ForwardX11 yes
If your username is different for svn.cern.ch, then add the following line under that host entry:
of course replacing "mhance" with the username that svn.cern.ch expects.
In your ~/.bashrc file add
For each PDSF session you have, do
then you should not have to type your password for every svn directory.
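The ssh configuration step can be scripted so that it is safe to repeat. The sketch below uses a hypothetical helper, `add_svn_host_block`, that writes the host block given above and, optionally, a `User` line. `User` is the standard ssh_config keyword for a differing remote username; that it is the line intended by this page is an assumption.

```shell
# Hypothetical helper: append the svn.cern.ch host block to an ssh config file once.
# Usage: add_svn_host_block <config-file> [cern-username]
add_svn_host_block() {
    config=$1                        # normally ~/.ssh/config
    user=${2:-}                      # optional username expected by svn.cern.ch
    if [ -f "$config" ] && grep -q '^Host svn.cern.ch' "$config"; then
        return 0                     # block already present, leave the file alone
    fi
    {
        printf 'Host svn.cern.ch svn\n'
        printf '    Protocol 2,1\n'
        printf '    GSSAPIAuthentication yes\n'
        printf '    GSSAPIDelegateCredentials yes\n'
        printf '    ForwardX11Trusted yes\n'
        printf '    ForwardX11 yes\n'
        if [ -n "$user" ]; then
            printf '    User %s\n' "$user"
        fi
    } >> "$config"
}

# Example (on PDSF): add_svn_host_block ~/.ssh/config yourCernUsername
```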
|Location||Filesystem||Quota or Size||Comments|
|/project/projectdirs/atlas||GPFS||100 TB||No backup (backup service in development). Can be used for data files or SW.|
|/common/atlas||GPFS||1.5 TB(?)||Nightly backup. This disk is for software, not for data files.|
|/eliza1/atlas||GPFS||28 TB||No backup. Scratch only, disappearing soon!|
|/eliza2/atlas||GPFS||35 TB||No backup. Scratch only, disappearing soon!|
|/eliza11/atlas||GPFS||110 TB||No backup. Free for data or code.|
|/eliza18/atlas||GPFS||350 TB||No backup. Put data files here. Also grid endpoint.|
|/oldeliza/scratch||GPFS||142 TB||No backup. Scratch only, will be around until disks fail.|
- A summary of the data disk space use by ATLAS is here: http://portal.nersc.gov/project/atlas/diskstat/index.py
- To check group disk usage and quotas (-G displays quotas in GB instead of MB):
myquota -G -g atlas
- To check your own disk usage and quota (-G displays quotas in GB instead of MB):
- In addition to a disk space quota, there is also a quota on the number of files (inodes). A kit for a single ATLAS release requires about 80k inodes, so please check the inode quota before installing a new kit.
- Note that when you are near your quota on /home (>90%), your batch jobs will not run. So please keep your home area clear.
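The 90% rule in the last item can be checked with simple integer arithmetic. This is a sketch with a hypothetical helper, `quota_status`; the usage and quota numbers are typed in by hand (for example, read off the myquota output), since the exact output format of myquota is not assumed here.

```shell
# Hypothetical helper: report whether usage exceeds a percentage of the quota.
# Usage: quota_status <used> <quota> [threshold-percent]   (same units for both numbers)
quota_status() {
    used=$1
    limit=$2
    threshold=${3:-90}               # batch jobs stop running above 90% on /home
    # Compare used/limit with threshold/100 without floating point.
    if [ $(( used * 100 )) -gt $(( limit * threshold )) ]; then
        echo "over"
    else
        echo "ok"
    fi
}

quota_status 19 20    # 95% of the quota used
quota_status 10 20    # 50% of the quota used
```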
Getting/Distributing data locally: using rucio-xxx on PDSF
You are free to use interactive PDSF nodes for small transfers, like getting single files to test jobs on. For any large transfers, please use DaTRI:
To get small files, set up the environment:
source /common/atlas/scripts/setupATLAS.sh
setupATLAS
localSetupDQ2Client --skipConfirm
voms-proxy-init --voms=atlas
Now you can use dq2-ls, dq2-get, etc. Documentation for the tools is here.
If you have problems with "certificate out of date" errors, please do the following and re-try the dq2-get:
If you want to create a new dataset visible to grid users from data that is local on PDSF you can use dq2-put:
dq2-put -L NERSC_SCRATCHDISK -s sourceDir datasetName
Please refer to the dq2 twiki page linked above for details and for the format of the dataset name (it must be of the form user.[UserName].*). On PDSF the above functionality should work. However, you may encounter errors when writing files to the disk, which look like:
>> Transfer of file MC11_7TeV.107499.singlepart_empty.pileup_Pythia8_A2M_noslim_2011BS.mu9.VTXD3PD.root to SE: FAILED
In this case, please open a NERSC ticket and ask for the destination directory to be made group-writable. If you are writing to NERSC_SCRATCHDISK, the destination directory will look like:
where [YourUserName] is your nickname on the grid and [XXX] is the part of the dataset name between dots following your nickname, e.g. user.[YourUserName].[XXX].someOtherInfo.v1.0/
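The naming rule can be checked before calling dq2-put. Below is a sketch with a hypothetical helper, `dataset_tag`, that rejects names not of the form user.[UserName].* and prints the [XXX] part, which is the piece needed when asking for the destination directory to be made group-writable. The nickname and dataset name in the example are made up.

```shell
# Hypothetical helper: validate a dataset name and print the [XXX] field,
# i.e. the part between the dots that follows the grid nickname.
dataset_tag() {
    name=$1
    case $name in
        user.*.*) ;;                 # required form: user.[UserName].*
        *) echo "invalid dataset name: $name" >&2
           return 1 ;;
    esac
    rest=${name#user.}               # drop the leading "user."
    rest=${rest#*.}                  # drop the nickname and its dot
    echo "${rest%%.*}"               # keep everything up to the next dot
}

dataset_tag user.jdoe.test123.someOtherInfo.v1.0/
```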
Using the PDSF grid server
To list datasets local to PDSF: set up the DQ2 environment, then
dq2-ls -s NERSC_LOCALGROUPDISK
The files are physically located on the /eliza18 disk, so you can also log in to PDSF and use "ls":
to find files, and use the files directly in your athena or ROOT jobs.
With the new Rucio system, finding files with "ls" is basically no longer possible; you need to get the physical file locations for a given dataset using dq2-ls, e.g.:
dq2-ls -pfL NERSC_LOCALGROUPDISK mc15_13TeV.my_dataset_name | grep "srm:" | sed 's_srm://pdsfdtn1.nersc.gov__g'
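To see what the grep and sed stages of that pipeline do without a grid setup, the filter can be wrapped in a function and fed sample text. This is only a sketch: `strip_srm_prefix` is a hypothetical name, and the input lines are made up for illustration rather than real dq2-ls output.

```shell
# Hypothetical helper: keep only lines containing an srm URL and strip the
# PDSF storage endpoint, leaving the local file path.
strip_srm_prefix() {
    grep 'srm:' | sed 's_srm://pdsfdtn1.nersc.gov__g'
}

# Made-up input: one line without an srm URL, one with.
printf '%s\n' \
    'some header line' \
    'srm://pdsfdtn1.nersc.gov/made/up/path/file.root' | strip_srm_prefix
```

On PDSF the function would be used as `dq2-ls -pfL NERSC_LOCALGROUPDISK <dataset> | strip_srm_prefix`.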
To request a dataset to PDSF, follow these instructions: http://panda.cern.ch/server/pandamon/query?mode=ddm_req
To get data to NERSC_LOCALGROUPDISK you may need explicit approval if the dataset is large (TBs). Ian can do that if you see your request has been awaiting approval for a long time.
If you copied data to LOCALGROUPDISK and don't need it any more, please use
dq2-delete-replicas -d dataset-name NERSC_LOCALGROUPDISK
to clean up. Everyone should be able to delete datasets they requested to PDSF. Otherwise ask Ian. The "-d" is necessary to actually delete the files, as opposed to just erasing the listing of the dataset at NERSC from the server.
If you get "permission denied" errors when trying to transfer or write data to our grid endpoints, then you may need to either (a) register with DaTRI, or (b) request the "usatlas" role for your grid certificate.
(a) To register with DaTRI, visit the following page and follow the instructions there:
(b) To get the "usatlas" role, follow the instructions on the following page, starting from "In addition, you should request to join the group associated to your country...."
Frontier DB access on pdsf
If you use CernVM-FS based ATLAS releases, the following frontier server is set up automatically.
To get very fast online DB access you should set up Frontier access. This will work for 15.5.X and on. Set up the ATLAS software, then do
then run athena and marvel at how fast it is.
Using Kerberos at PDSF
To be able to check out packages in CVS or use your CERN AFS space, you need to have Kerberos authentication. CERN has now switched to Kerberos v5. To be properly authenticated, you need to define a variable:
to authenticate yourself.
Using pAthena on PDSF: Running Grid Jobs
- Using pathena is an efficient way to do analysis over the grid. The documentation is quite good and should be able to guide you through using it. Here are some quick installation instructions.
You can use pathena through the AtlasLocalRootBase setup:
asetup 18.104.22.168 # or your favorite athena release
Now pathena is ready to use. If you're using prun instead, you likely do not need to set up athena:
Useful ATLAS Tools on PDSF
- Valgrind is a good way to debug code, find memory leaks and understand seg faults. To set it up on PDSF (SL4), set up your normal software and then do
to run valgrind do
valgrind --leak-check=yes --trace-children=yes --num-callers=8 --show-reachable=yes `which athena.py` jobOptions.py > valgrind.log 2>&1
You can decode the output using the ATLAS documentation
Other Tips for Easy Usage
Compiling is quite slow with the ATLAS software on PDSF. If you need to compile something and are not adding any new header or source files, you can do
You can't use this if you've just checked out a package and are compiling for the first time. This is only to be used when you've made relatively minor changes.
There are default soft limits on the address space users are allowed to consume on the interactive machines. This can, for instance, cause problems when opening large ROOT files for browsing or with a MakeClass script. To get around this, try executing this command in your bash shell:
ulimit -v 5242880
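In bash, `ulimit -v` takes its value in units of 1024 bytes, so 5242880 corresponds to 5 GB. A small sketch for working out the number for a different size, using a hypothetical helper `gb_to_kb`:

```shell
# Hypothetical helper: convert a size in GB to the 1024-byte units used by "ulimit -v".
gb_to_kb() {
    echo $(( $1 * 1024 * 1024 ))
}

gb_to_kb 5    # the value used on this page

# Example: raise the soft limit to 8 GB (only works up to the hard limit).
# ulimit -S -v "$(gb_to_kb 8)"
```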
Known Problems with using ATLAS specific software on PDSF
- 10% of BS->ESD->AOD->DPD _reco_ jobs fail without error messages. Hypothesis: this is due to high GPFS load on /common resulting from db-file copying and shared library loading. Iwona and Sven are investigating. Update as of Sep 5: After running a few hundred analysis jobs on AOD files (these jobs don't access the db a lot), I'm happy to report that I don't see this failing-without-error-message problem. This supports the GPFS load hypothesis and means that for the most common analysis use pattern things are OK.
- Initial package compilation is 3x slower than on lxplus. "cmt make -j4" helps a bit, but is still twice as slow as "cmt make" on lxplus. Someone should investigate what the limiting factor is and try to improve it.
- Kerberos against lxplus does not work on 64-bit interactive PDSF nodes (pdsf1 and pdsf2) with CHOS=sl44 --> Sven filed ticket: 080918-000102 --> fixed 9/19/08
- The only 64-bit interactive nodes, pdsf1 and pdsf2, freeze roughly once a week, leading to reboots. Memory has been exchanged and the BIOS upgraded, but the problem persists. NERSC is planning to exchange the processors next (Sven has filed a bug report). Update 9/19/08 (Sven): No more crashes on pdsf2 after CPU replacement last week, and today pdsf1 CPUs were also replaced. Let's keep our fingers crossed.