Using PDSF for ATLAS Work
From Atlas Wiki
If you find that any of these instructions are incomplete or obsolete, please send mail to Mike.
Mailing Lists
We have our own mailing list for ATLAS-specific PDSF issues and general requests for help, which you can sign up for through eGroups: atlas-lbl-pdsf-users. Please cc this list on support requests you think would be of interest to the general group.
PDSF maintains two mailing lists. The first, pdsf-users (@nersc.gov), includes all users automatically. The second, pdsf-status, is used to announce downtimes, meetings, etc.; you are advised to join this list.
Recent Information/News
See this Tuesday meeting talk for some recent bits of information regarding PDSF.
OS environment
PDSF has several operating systems installed. One can switch between the systems using CHOS. However, the only suggested OS is
- sl53 Scientific Linux 5.3 - 64 bit
To use CHOS in (Ba)sh
CHOS=sl53 chos
To use CHOS in (t)csh
setenv CHOS sl53
chos
Login shell
A user can configure the login shell on PDSF at nim.nersc.gov; log in with your PDSF username and password. The ATLAS-recommended shell is bash. Csh is known to break in some cases because of the huge amount of data that cmt puts into the environment.
Access ATLAS software releases
Last reviewed Feb. 2011 by Yushu Yao
All ATLAS releases are provided through CernVM-FS, a web-based, read-only file system. New releases are installed from the central CERN location and maintained by the ATLAS Release Management Team. No local installation or customization is needed.
To use ATLAS Software on PDSF, always do the following first:
source /common/atlas/scripts/setupATLAS.sh
setupATLAS
The above should be your first line in a batch job as well.
To show the available ATLAS software releases:
showVersions --show athena
To show the available ATLAS DB releases:
showVersions --show dbrelease
To set up an ATLAS release with a custom testarea:
asetup 17.2.0.2 --testarea=$HOME/mytestarea
The ATLAS software is managed by the ATLASLocalRootBase/ManageTier3SW package; please refer to https://twiki.atlas-canada.ca/bin/view/AtlasCanada/ATLASLocalRootBase for more features. The available ATLAS releases are listed on this website: http://atlas-computing.web.cern.ch/atlas-computing/projects/releases/status/ Please allow 3-5 days after the announcement for a new release to be installed and updated on CernVM-FS at PDSF.
Running Batch Jobs
PDSF uses the SGE batch system, which differs from the LSF batch system most of you are used to from lxplus at CERN.
Submit your jobs simply using
qsub
(there is no need to have $HOME/.sge_request any more!)
To submit to a test queue with a CPU limit of 10 minutes, use:
qsub -l debug=1
Below is a sample job script to use CernVM-FS provided ATLAS Releases:
% cat myscript.sh
shopt -s expand_aliases
source /common/atlas/scripts/setupATLAS.sh
setupATLAS
asetup 15.6.3
athena.py blabla
Note that the line with "shopt" has to be there: bash does not expand aliases in non-interactive shells unless this option is set.
When submitting use the "cvmfs=1" option in qsub:
qsub -l cvmfs=1 <normal options>
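As an alternative to passing resource requests on the command line, SGE also reads options from lines starting with "#$" inside the job script. The following is a minimal sketch, not a PDSF-provided tool: write_job_script is a hypothetical helper, the h_vmem=2G value and the -S and -cwd options are assumptions about what a typical job needs, and the job options file name is a placeholder.

```shell
#!/bin/bash
# Sketch: write a job script with the SGE resource requests embedded as
# "#$" directive lines, so that a plain "qsub myscript.sh" is enough.
# write_job_script is a hypothetical helper name.
write_job_script() {
    local outfile="$1" release="$2" joboptions="$3"
    cat > "$outfile" <<EOF
#!/bin/bash
#$ -S /bin/bash
#$ -l cvmfs=1
#$ -l h_vmem=2G
#$ -cwd
shopt -s expand_aliases
source /common/atlas/scripts/setupATLAS.sh
setupATLAS
asetup ${release}
athena.py ${joboptions}
EOF
}

# Example (file name and job options are placeholders):
#   write_job_script myscript.sh 15.6.3 myJobOptions.py
#   qsub myscript.sh
```

Options given on the qsub command line take precedence over the embedded ones, so the script can still be submitted to the debug queue with "qsub -l debug=1 myscript.sh".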
Interactive Batch Jobs
For interactive jobs that are a bit heavier than just editing files, you can run an interactive session on a batch node directly. The command
qsh
will open a new window with a terminal session on a batch node (it may take a few seconds to get an empty slot). If you're connected remotely, then you can use
qlogin
instead, which basically just ssh's into a batch slot from your current session. Unfortunately the batch nodes do not have LDAP access, so you'll need to set up SSH keys on your PDSF account to gain access this way:
pdsf2 | ~ [8]: ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/u/mhance/.ssh/id_rsa): # just hit enter
Enter passphrase (empty for no passphrase): # enter a passphrase if you want
Enter same passphrase again: # same passphrase again
Your identification has been saved in /u/mhance/.ssh/id_rsa.
Your public key has been saved in /u/mhance/.ssh/id_rsa.pub.
The key fingerprint is:
<numbers> mhance@pdsf2
pdsf2 | ~ [9]: cat .ssh/id_rsa.pub >> .ssh/authorized_keys
That should do it.
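If qlogin still asks for a password after this, one common cause (an assumption about the sshd configuration on the batch nodes, not something confirmed for PDSF) is that sshd ignores keys when ~/.ssh or authorized_keys is writable by other users. A minimal sketch for tightening the permissions, with fix_ssh_perms as a hypothetical helper name:

```shell
#!/bin/bash
# Sketch: restrict an SSH directory to its owner.
# fix_ssh_perms is a hypothetical helper name, not a PDSF tool.
fix_ssh_perms() {
    local dir="${1:-$HOME/.ssh}"
    # Directory: accessible by the owner only
    chmod 700 "$dir" || return 1
    # Key list: readable and writable by the owner only
    if [ -f "$dir/authorized_keys" ]; then
        chmod 600 "$dir/authorized_keys"
    fi
}

# Example:
#   fix_ssh_perms            # operates on ~/.ssh
```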
Once you're in the shell, you'll likely have to set up your environment again, including any "chos" commands you'd usually include in a batch submission script. Then you should be able to treat it like an interactive node. The session typically times out after 24 hours, just like a normal batch job.
Note also that if you normally need, e.g., 2GB of memory, or CVMFS, or anything like that, you should specify that in your login. For example:
qsh -l cvmfs -l h_vmem=2G
NX Server for PDSF Use
Connecting to PDSF using the NX server is highly recommended, particularly for those working at CERN. To set it up, follow the instructions here.
Setting Up SVN Access
For password-less access to svn, create a file called config in your ~/.ssh directory containing:
Host svn.cern.ch svn
Protocol 2,1
GSSAPIAuthentication yes
GSSAPIDelegateCredentials yes
ForwardX11Trusted yes
ForwardX11 yes
If your username is different for svn.cern.ch, then add the following line under that host entry:
User mhance
of course replacing "mhance" with the username that svn.cern.ch expects.
In your ~/.bashrc file add
export KRB5_CONFIG="/common/atlas/kits/setup_files/krb5.conf"
In each PDSF session, run
/usr/kerberos/bin/kinit username@CERN.CH
then you should not have to type your password for every svn directory.
Disk space
| Location | Filesystem | Quota or Size | Comments |
|---|---|---|---|
| /project/projectdirs/atlas | GPFS | 1 TB | No backup (backup service in development). Can be used for data files or SW. |
| /common/atlas | GPFS | 1.5 TB(?) | Nightly backup. This disk is for software, not for data files. |
| /eliza1/atlas | GPFS | 35 TB | No backup. Put data files here. |
| /eliza2/atlas | GPFS | 35 TB | No backup. Grid server storage - don't write to this disk. |
| /eliza4/atlas | GPFS | 12 TB | No backup. Put data files here (being decommissioned). |
| /eliza18/atlas | GPFS | 142 TB | No backup. Put data files here. |
- A summary of the data disk space use by ATLAS is here: http://portal.nersc.gov/project/atlas/diskstat/index.py
- To check group disk usage and quotas (-G displays quotas in GB instead of MB):
myquota -G -g atlas
- To check your own disk usage and quota (-G displays quotas in GB instead of MB):
myquota -G
- In addition to a disk space quota, there is also a quota on the number of files (inodes). A kit for a single ATLAS release requires about 80k inodes, so please check the inode quota before installing a new kit.
- Note that when you are near your quota on /home (>90%), your batch jobs will not run. So please keep your home area clear.
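The 90% rule for the home area can be checked with simple integer arithmetic. This is only a sketch: quota_percent and home_quota_ok are hypothetical helper names, the numbers in the example are made up, and extracting the two values from the myquota output is left out because its format is not shown here.

```shell
#!/bin/bash
# Sketch: flag a home area that is close to its quota.
# quota_percent and home_quota_ok are hypothetical helper names.
quota_percent() {
    # $1: space used, $2: quota, both in the same unit (e.g. GB)
    # Result is an integer percentage, rounded down.
    echo $(( 100 * $1 / $2 ))
}

home_quota_ok() {
    # Succeeds while usage is at or below the 90% threshold
    [ "$(quota_percent "$1" "$2")" -le 90 ]
}

# Example with made-up numbers: 4 GB used of a 5 GB quota
if home_quota_ok 4 5; then
    echo "home usage OK"
else
    echo "home usage above 90%: batch jobs may not run"
fi
```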
Getting/Distributing data locally: using DQ2-xxx on PDSF
There is a dedicated machine, pdsfdtn1.nersc.gov, for data transfers. This machine should not be used for other purposes (running jobs etc.). Conversely, do not use the interactive PDSF nodes for data transfers: it makes them slow and affects all users.
To use grid tools:
ssh pdsfdtn1.nersc.gov
Check that you have the right OS:
cat /etc/redhat-release
Scientific Linux SL release 5.3 (Boron)
If you see something else (e.g. 5.4), run chos by hand (type "CHOS=sl53 chos").
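The OS check above can be scripted so that it also works at the top of a transfer script. A minimal sketch, assuming that matching the string "release 5.3" in /etc/redhat-release is a sufficient test; is_sl53 is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: decide whether chos needs to be run by hand.
# is_sl53 is a hypothetical helper name, not a PDSF tool.
is_sl53() {
    # $1: contents of /etc/redhat-release
    case "$1" in
        *"release 5.3"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print a reminder if the current environment is not SL 5.3
if [ -r /etc/redhat-release ] && ! is_sl53 "$(cat /etc/redhat-release)"; then
    echo "Not running SL 5.3; run: CHOS=sl53 chos"
fi
```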
Set up the environment:
source /common/atlas/scripts/setupATLAS.sh
setupATLAS
localSetupDQ2Client --skipConfirm
voms-proxy-init --voms=atlas
Now you can use dq2-ls, dq2-get, etc. Documentation for the tools is here.
If you have problems with "certificate out of date" errors, please do the following and re-try the dq2-get:
export X509_CERT_DIR=/usr/common/nsg/etc/certificates
If you want to create a new dataset visible to grid users from data that is local on PDSF you can use dq2-put:
dq2-put -L NERSC_SCRATCHDISK -s sourceDir datasetName
Please refer to the dq2 twiki page linked above for details and for the format of the dataset name (must be of the form user.[UserName].*). On PDSF the above functionality should work. However, you may encounter errors when writing files to the disk, which look like:
>> Transfer of file MC11_7TeV.107499.singlepart_empty.pileup_Pythia8_A2M_noslim_2011BS.mu9.VTXD3PD.root to SE: FAILED
In this case please open a NERSC ticket and ask for the destination directory to be made group-writable. The destination directory in case you are writing to NERSC_SCRATCHDISK will be like:
/eliza2/atlasdata/atlasscratchdisk/user/[YourUserName]/[XXX]/
where [YourUserName] is your nickname on the grid and [XXX] is the part of the dataset name between dots following your nickname, e.g. user.[YourUserName].[XXX].someOtherInfo.v1.0/
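The mapping from dataset name to destination directory described above can be written down as a small function, which is handy when filling in the NERSC ticket. This is a sketch only: scratchdisk_dir is a hypothetical helper name, the base path is copied from the text above, and the dataset name in the example is made up.

```shell
#!/bin/bash
# Sketch: derive the NERSC_SCRATCHDISK directory from a dataset name
# of the form user.[UserName].[XXX].<anything else>
# scratchdisk_dir is a hypothetical helper name.
scratchdisk_dir() {
    local dataset="${1%/}"          # drop a trailing slash, if any
    local prefix nickname tag rest
    # Split on dots: first three fields, remainder goes into "rest"
    IFS=. read -r prefix nickname tag rest <<< "$dataset"
    if [ "$prefix" != "user" ] || [ -z "$nickname" ] || [ -z "$tag" ]; then
        echo "dataset name must look like user.[UserName].*" >&2
        return 1
    fi
    echo "/eliza2/atlasdata/atlasscratchdisk/user/${nickname}/${tag}/"
}

# Example with a made-up dataset name:
#   scratchdisk_dir user.jdoe.mytest.someOtherInfo.v1.0/
```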
Using the PDSF grid server
To list datasets local to PDSF: set up the DQ2 environment, then
dq2-ls -s NERSC_SCRATCHDISK
or
dq2-ls -s NERSC_LOCALGROUPDISK
The files are physically located on the /eliza2 disk, so you can also log in to PDSF and use "ls":
ls /eliza2/atlas/atlasdata/atlasscratchdisk/
to find files, and use the files directly in your athena or ROOT jobs.
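To feed such files to a job, it is often convenient to collect their full paths in a text file first. A minimal sketch, assuming the data files have names containing ".root" (for example file.root or file.root.1); make_filelist is a hypothetical helper name and the directory in the example is a placeholder.

```shell
#!/bin/bash
# Sketch: list all ROOT files below a dataset directory, one path per line.
# make_filelist is a hypothetical helper name.
make_filelist() {
    local dir="$1"
    # Sorted so that the list is reproducible between runs
    find "$dir" -type f -name '*.root*' | sort
}

# Example (replace the placeholder with a real dataset directory):
#   make_filelist <dataset directory> > filelist.txt
```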
To request that a dataset be transferred to PDSF, follow these instructions: http://panda.cern.ch/server/pandamon/query?mode=ddm_req
Transfer requests to NERSC_SCRATCHDISK are auto-approved. To get data to NERSC_LOCALGROUPDISK you may need an explicit approval. Ian can do that.
Datasets on the SCRATCHDISK are removed automatically. If you copied data to LOCALGROUPDISK and don't need it any more, please use
dq2-delete-replicas -d dataset-name NERSC_LOCALGROUPDISK
to clean up. Everyone should be able to delete datasets they requested to PDSF. Otherwise ask Ian. The "-d" is necessary to actually delete the files, as opposed to just erasing the listing of the dataset at NERSC from the server.
If you get "permission denied" errors when trying to transfer or write data to our grid endpoints, then you may need to either (a) register with DaTRI, or (b) request the "usatlas" role for your grid certificate.
(a) To register with DaTRI, visit the following page and follow the instructions there:
http://panda.cern.ch/server/pandamon/query?mode=ddm_user
(b) To get the "usatlas" role, follow the instructions on the following page, starting from "In addition, you should request to join the group associated to your country...."
https://twiki.cern.ch/twiki/bin/viewauth/Atlas/WorkBookStartingGrid
Frontier DB access on pdsf
If you use CernVM-FS based ATLAS releases, the Frontier server below is set up automatically.
Otherwise, to get very fast online DB access you should set up Frontier access yourself. This works for releases 15.5.X and later. Set up the ATLAS software, then do
export FRONTIER_SERVER='(proxyurl=http://cernvm.lbl.gov:3128)(serverurl=http://frontier.racf.bnl.gov:8000/frontieratbnl)(retrieve-ziplevel=5)'
then run athena and marvel at how fast it is.
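The FRONTIER_SERVER value is a sequence of parenthesised key=value pairs. The sketch below assembles the same string from its parts, which makes it easier to swap the proxy or server; frontier_server_string is a hypothetical helper name, and the URLs are the ones given above.

```shell
#!/bin/bash
# Sketch: build the FRONTIER_SERVER string from its three parts.
# frontier_server_string is a hypothetical helper name.
frontier_server_string() {
    local proxy="$1" server="$2" ziplevel="${3:-5}"
    echo "(proxyurl=${proxy})(serverurl=${server})(retrieve-ziplevel=${ziplevel})"
}

# Same value as the export line above
export FRONTIER_SERVER="$(frontier_server_string \
    http://cernvm.lbl.gov:3128 \
    http://frontier.racf.bnl.gov:8000/frontieratbnl)"
```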
Using Kerberos at PDSF
To be able to check out packages from CVS or use your CERN AFS space, you need Kerberos authentication. CERN has now switched to Kerberos v5. To be properly authenticated, you need to define a variable:
export KRB5_CONFIG=/common/atlas/kits/setup_files/krb5.conf
and use
/usr/kerberos/bin/kinit username@CERN.CH
to authenticate yourself.
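To avoid typing the password more often than needed, the two steps can be wrapped in a function that only calls kinit when no valid ticket exists. This is a sketch under assumptions: ensure_cern_ticket is a hypothetical helper name, klist is assumed to be on the PATH, and "klist -s" only reports whether the credentials cache holds unexpired tickets, not whether they are for CERN.CH.

```shell
#!/bin/bash
# Sketch: request a Kerberos ticket only when none is valid.
# ensure_cern_ticket is a hypothetical helper name; KLIST and KINIT can be
# overridden if the binaries live elsewhere.
export KRB5_CONFIG=/common/atlas/kits/setup_files/krb5.conf
KLIST="${KLIST:-klist}"
KINIT="${KINIT:-/usr/kerberos/bin/kinit}"

ensure_cern_ticket() {
    local principal="$1"            # e.g. username@CERN.CH
    # "klist -s" prints nothing and returns 0 only if the credentials
    # cache is readable and not expired
    if "$KLIST" -s; then
        echo "valid Kerberos ticket found"
    else
        "$KINIT" "$principal"
    fi
}

# Example:
#   ensure_cern_ticket username@CERN.CH
```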
Using pAthena on PDSF: Running Grid Jobs
- Using pathena is an efficient way to do analysis over the grid. The documentation is quite good and should be able to guide you through using it. Here are some quick installation instructions.
You can use pathena through the AtlasLocalRootBase setup:
setupATLAS
asetup 17.2.0.2 # or your favorite athena release
localSetupPandaClient
pathena --help
Now pathena is ready to use. If you're using prun instead, you likely do not need to set up athena:
setupATLAS
localSetupPandaClient
prun --help
Useful Atlas Tools on pdsf
- Valgrind is a good way to debug code, find memory leaks and understand seg faults. To set it up on PDSF under SL4, set up your normal software and then do
source /afs/cern.ch/sw/lcg/external/valgrind/3.3.0/slc4_amd64_gcc34/_SPI/start.sh
To run valgrind, do
valgrind --leak-check=yes --trace-children=yes --num-callers=8 --show-reachable=yes `which athena.py` jobOptions.py > valgrind.log 2>&1
You can decode the output using the Atlas documentation.
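The log from an athena job is long, so a quick first look at the leak totals can help before reading it in full. A minimal sketch, assuming the log contains memcheck's usual "definitely lost:" and "possibly lost:" summary lines; leak_summary is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: pull the leak totals out of a valgrind log.
# leak_summary is a hypothetical helper name; it only matches the
# "definitely lost:" and "possibly lost:" lines of the leak summary.
leak_summary() {
    grep -E 'definitely lost:|possibly lost:' "$1"
}

# Example:
#   leak_summary valgrind.log
```

With --trace-children=yes there may be one summary per process, so several lines per category are expected.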
Other Tips for Easy Usage
Compiling is quite slow with the atlas software on PDSF. If you need to compile something and are not adding any new header or source files, you can do
make QUICK=1
You can't use this if you've just checked out a package and are compiling for the first time. This is only to be used when you've made relatively minor changes.
There are default soft limits on the address space users are allowed to consume on the interactive machines. This can, for instance, cause problems when opening large ROOT files for browsing or with a MakeClass script. To get around this, try executing this command in your bash shell:
ulimit -v 5242880
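In bash, the value given to "ulimit -v" is in units of 1024 bytes, so 5242880 corresponds to 5 GiB of virtual memory. The sketch below does the conversion for other sizes; gib_to_kib is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: convert GiB to the 1024-byte units that "ulimit -v" expects.
# gib_to_kib is a hypothetical helper name.
gib_to_kib() {
    echo $(( $1 * 1024 * 1024 ))
}

# Example: set only the soft limit to 5 GiB (same value as above)
#   ulimit -S -v "$(gib_to_kib 5)"
```

Using -S changes only the soft limit; without -S or -H, bash sets both the soft and the hard limit, and a lowered hard limit cannot be raised again in the same session.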
Known Problems with using ATLAS specific software on PDSF
- 10% of BS->ESD->AOD->DPD _reco_ jobs fail without error messages. Hypothesis: this is due to high GPFS load on /common resulting from db-file copying and shared library loading. Iwona and Sven are investigating. Update as of Sep 5: after running a few hundred analysis jobs on AOD files (these jobs don't access the db a lot), I'm happy to report that I don't see this failing-without-error-message problem there. This supports the GPFS load hypothesis and means that for the most common analysis use pattern things are OK.
- Initial package compilation is 3x slower than on lxplus. "cmt make -j4" helps a bit, but is still twice as slow as "cmt make" on lxplus. Someone should investigate what the limiting factor is and try to improve it.
- Kerberos against lxplus does not work on 64-bit interactive PDSF nodes (pdsf1 and pdsf2) with CHOS=sl44 --> Sven filed ticket: 080918-000102 --> fixed 9/19/08
- The only 64-bit interactive nodes, pdsf1 and pdsf2, freeze roughly once a week, leading to reboots. Memory has been exchanged and the BIOS upgraded, but the problem persists. NERSC is planning to exchange the processors next (Sven has filed a bug report). Update 9/19/08 (Sven): no more crashes on pdsf2 after the CPU replacement last week, and today the pdsf1 CPUs were also replaced. Let's keep our fingers crossed.
