Using PDSF for ATLAS Work

If you find that any of these instructions are incomplete or obsolete, please send mail to Mike.

Mailing Lists

We have our own mailing list for ATLAS-specific PDSF issues and general requests for help, which you can sign up for through eGroups: atlas-lbl-pdsf-users. Please cc this list on support requests that you think would be of interest to the general group.

PDSF maintains two mailing lists. The first, pdsf-users (@nersc.gov), automatically includes all users. The second, pdsf-status, is used to announce downtimes, meetings, etc.; you are advised to join it.

Recent Information/News

See this Tuesday meeting talk for some recent bits of information regarding PDSF.

OS environment

PDSF has several operating systems installed, and you can switch between them using CHOS. The suggested OS is sl64 (Scientific Linux 6.4). If you need sl53 it is still available, but everyone is encouraged to move to SL6 as soon as they can; for certain tasks (panda job submission, DQ2, using Athena release 19 and higher) SL5 is no longer supported.

To use CHOS in (Ba)sh

  export CHOS=sl53
  chos

To use CHOS in (t)csh

  setenv CHOS sl53
  chos

If you find that the default shell isn't in your preferred OS, then set the CHOS variable in your ~/.chos file.
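
For example, to make sl64 the default OS for future logins (a minimal sketch, assuming ~/.chos takes just the OS name on a single line):

  echo "sl64" > ~/.chos    # new login shells will then start in sl64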

Login shell

A user can configure the login shell on PDSF at nim.nersc.gov; log in with your PDSF username and password. The ATLAS-recommended shell is bash. csh is known to break in some cases because of the huge number of environment variables that cmt puts into the environment.

Access ATLAS software releases

Last reviewed Feb. 2011 by Yushu Yao

All ATLAS releases are managed by CernVM-FS, a web-based, read-only file system. New releases are installed from the central CERN location and maintained by the ATLAS Release Management Team. No local installation or customization is needed.

To use ATLAS Software on PDSF, always do the following first:

   source /common/atlas/scripts/setupATLAS.sh
   setupATLAS

The above should also be the first lines of any batch job script.

To show the available ATLAS software releases:

   showVersions --show athena

To show the available ATLAS DB releases:

   showVersions --show dbrelease

To set up an ATLAS release with a custom testarea:

   asetup 17.2.0.2 --testarea=$HOME/mytestarea

The ATLAS software is managed by the ATLASLocalRootBase/ManageTier3SW package; please refer to https://twiki.atlas-canada.ca/bin/view/AtlasCanada/ATLASLocalRootBase for more features. The available ATLAS releases are listed at http://atlas-computing.web.cern.ch/atlas-computing/projects/releases/status/. Please allow 3-5 days after the announcement of a new release for it to be installed and updated on CernVM-FS at PDSF.
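
Putting the above together, a typical interactive setup session looks like the following sketch (the release number and testarea are only examples):

   source /common/atlas/scripts/setupATLAS.sh
   setupATLAS
   showVersions --show athena                     # pick an installed release from the list
   asetup 17.2.0.2 --testarea=$HOME/mytestarea    # example release and testarea
   which athena.py                                # should now point into the CernVM-FS release area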

Running Batch Jobs

PDSF uses the SGE batch system, which differs from the LSF batch system that most of you are used to from lxplus at CERN.

Submit your jobs simply using

 qsub

(there is no need to have $HOME/.sge_request any more!)

To submit to a test queue with a CPU limit of 10 minutes, use:

 qsub -l debug=1


Below is a sample job script that uses the CernVM-FS-provided ATLAS releases:

 % cat myscript.sh
 #!/bin/bash
 shopt -s expand_aliases                       # required so the setupATLAS/asetup aliases work inside a script
 source /common/atlas/scripts/setupATLAS.sh
 setupATLAS
 asetup 15.6.3
 athena.py blabla                              # replace "blabla" with your job options file

Note that the line with "shopt" has to be there.

When submitting, use the "cvmfs=1" option with qsub:

 qsub -l cvmfs=1 <normal options>
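
For example, a submission requesting CVMFS and 2 GB of virtual memory for the script above might look like this (the memory request is only an illustration):

 qsub -l cvmfs=1 -l h_vmem=2G myscript.sh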

Interactive Batch Jobs

For interactive jobs that are a bit heavier than just editing files, you can run an interactive session on a batch node directly. The command

  qsh

will open a new window with a terminal session on a batch node (it may take a few seconds to get an empty slot). If you're connected remotely, then you can use

  qlogin

instead, which basically just ssh's into a batch slot from your current session. Unfortunately the batch nodes do not have LDAP access, so you'll need to set up SSH keys on your PDSF account to gain access this way:

  pdsf2 | ~ [8]: ssh-keygen -t rsa
  Generating public/private rsa key pair.
  Enter file in which to save the key (/u/mhance/.ssh/id_rsa):     # just hit enter
  Enter passphrase (empty for no passphrase):                      # enter a passphrase if you want
  Enter same passphrase again:                                     # same passphrase again
  Your identification has been saved in /u/mhance/.ssh/id_rsa.
  Your public key has been saved in /u/mhance/.ssh/id_rsa.pub.
  The key fingerprint is:
  <numbers> mhance@pdsf2
  pdsf2 | ~ [9]: cat .ssh/id_rsa.pub >> .ssh/authorized_keys

That should do it.

Once you're in the shell, you'll likely have to set up your environment again, including any "chos" commands you'd usually include in a batch submission script. Then you should be able to treat it like an interactive node. The session typically times out after 24 hours, just like a normal batch job.
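
For example, a typical sequence after landing in a qsh/qlogin session might be the following (the release number is only an example; chos starts a new shell, so type the remaining commands once it gives you a prompt):

  CHOS=sl64 chos
  source /common/atlas/scripts/setupATLAS.sh
  setupATLAS
  asetup 17.2.0.2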

Note also that if you normally need, e.g., 2GB of memory, or CVMFS, or anything like that, you should specify that in your login. For example:

  qsh -l cvmfs -l h_vmem=2G


Nx Server for PDSF Use

Connecting to PDSF using the NX server is highly recommended, particularly for those working at CERN. To set it up, follow the instructions here.

Setting Up SVN Access

For password-less access to SVN, create a file called config in your ~/.ssh directory containing:

 Host svn.cern.ch svn
         Protocol 2,1
         GSSAPIAuthentication yes
         GSSAPIDelegateCredentials yes
         ForwardX11Trusted yes
         ForwardX11 yes 

If your username is different for svn.cern.ch, then add the following line under that host entry:

         User mhance

of course replacing "mhance" with the username that svn.cern.ch expects.

In your ~/.bashrc file add

 export KRB5_CONFIG="/common/atlas/kits/setup_files/krb5.conf"

In each PDSF session, run

 /usr/kerberos/bin/kinit username@CERN.CH

after which you should not have to type your password for every SVN operation.
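
As a quick test that the setup works, you can try a checkout over svn+ssh (the package path below is only an illustration):

 svn co svn+ssh://svn.cern.ch/reps/atlasoff/PhysicsAnalysis/SomePackage/trunk SomePackage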

Disk space

Location                      Filesystem   Quota or Size   Comments
/project/projectdirs/atlas    GPFS         1 TB            No backup (backup service in development). Can be used for data files or SW.
/common/atlas                 GPFS         1.5 TB (?)      Nightly backup. This disk is for software, not for data files.
/eliza1/atlas                 GPFS         35 TB           No backup. Put data files here.
/eliza2/atlas                 GPFS         35 TB           No backup. Grid server storage - do not write to this disk.
/eliza4/atlas                 GPFS         12 TB           No backup. Put data files here (being decommissioned).
/eliza18/atlas                GPFS         142 TB          No backup. Put data files here.
  • To check group disk usage and quotas (-G displays quotas in GB instead of MB):
    myquota -G -g atlas
  • To check your own disk usage and quota:
    myquota -G
  • In addition to the disk space quota, there is also a quota on the number of files (inodes). A kit for a single ATLAS release requires about 80k inodes, so please check the inode quota before installing a new kit.
  • Note that when you are near your quota on /home (>90%), your batch jobs will not run, so please keep your home area clean.
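
Before copying large files to one of the data disks, it can also be useful to check how much space is left on the filesystem itself with the standard df command (the path is one of the data disks from the table above):

   df -h /eliza18/atlas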


Getting/Distributing data locally: using DQ2-xxx on PDSF

There is a dedicated machine, pdsfdtn1.nersc.gov, for data transfers. This machine should not be used for other purposes (running jobs, etc.). On the other hand, do not use interactive PDSF nodes for data transfers: it makes them slow and affects all users.

To use grid tools:

  ssh pdsfdtn1.nersc.gov

Check that you have the right OS:

  cat /etc/redhat-release 
  Scientific Linux SL release 5.3 (Boron)

If you see something else (e.g. 5.4), run chos by hand (type "CHOS=sl53 chos").

Set up the environment:

  source /common/atlas/scripts/setupATLAS.sh
  setupATLAS
  localSetupDQ2Client --skipConfirm
  voms-proxy-init --voms=atlas

Now you can use dq2-ls, dq2-get, etc. Documentation for the tools is here.
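
For example, to list the files in a dataset and then download it (the dataset name is hypothetical):

  dq2-ls -f user.SomeUser.mydataset.v1/
  dq2-get user.SomeUser.mydataset.v1/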

If you have problems with "certificate out of date" errors, please do the following and re-try the dq2-get:

  export X509_CERT_DIR=/usr/common/nsg/etc/certificates

If you want to create a new dataset visible to grid users from data that is local on PDSF you can use dq2-put:

  dq2-put -L NERSC_SCRATCHDISK -s sourceDir datasetName

Please refer to the dq2 twiki page linked above for details and for the format of the dataset name (it must be of the form user.[UserName].*). On PDSF the above functionality should work; however, you may encounter errors in writing files to the disk, which look like:

>> Transfer of file MC11_7TeV.107499.singlepart_empty.pileup_Pythia8_A2M_noslim_2011BS.mu9.VTXD3PD.root to SE: FAILED

In this case please open a NERSC ticket and ask for the destination directory to be made group-writable. The destination directory in case you are writing to NERSC_SCRATCHDISK will be like:

  /eliza2/atlasdata/atlasscratchdisk/user/[YourUserName]/[XXX]/ 

where [YourUserName] is your nickname on the grid and [XXX] is the part of the dataset name between dots following your nickname, e.g. user.[YourUserName].[XXX].someOtherInfo.v1.0/

Using the PDSF grid server

To list datasets local to PDSF: set up the DQ2 environment, then

  dq2-ls -s NERSC_SCRATCHDISK 

or

  dq2-ls -s NERSC_LOCALGROUPDISK

The files are physically located on the /eliza2 disk, so you can also log in to PDSF and use "ls":

  ls /eliza2/atlas/atlasdata/atlasscratchdisk/

to find files, and use the files directly in your athena or ROOT jobs.
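
For example, a file stored there can be opened directly in ROOT (the path below is hypothetical):

  root -l /eliza2/atlas/atlasdata/atlasscratchdisk/user/someuser/user.someuser.mydataset/NTUP_example.root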

To request a dataset transfer to PDSF, follow these instructions: http://panda.cern.ch/server/pandamon/query?mode=ddm_req

Transfer requests to NERSC_SCRATCHDISK are auto-approved. To get data to NERSC_LOCALGROUPDISK you may need an explicit approval. Ian can do that.

Datasets on the SCRATCHDISK are removed automatically. If you copied data to LOCALGROUPDISK and don't need it any more, please use

  dq2-delete-replicas  -d dataset-name NERSC_LOCALGROUPDISK

to clean up. Everyone should be able to delete datasets they requested to PDSF. Otherwise ask Ian. The "-d" is necessary to actually delete the files, as opposed to just erasing the listing of the dataset at NERSC from the server.

If you get "permission denied" errors when trying to transfer or write data to our grid endpoints, then you may need to either (a) register with DaTRI, or (b) request the "usatlas" role for your grid certificate.

(a) To register with DaTRI, visit the following page and follow the instructions there:

http://panda.cern.ch/server/pandamon/query?mode=ddm_user

(b) To get the "usatlas" role, follow the instructions on the following page, starting from "In addition, you should request to join the group associated to your country...."

https://twiki.cern.ch/twiki/bin/viewauth/Atlas/WorkBookStartingGrid


Frontier DB access on pdsf

If you use the CernVM-FS-based ATLAS releases, the following Frontier server is set up automatically.

To get very fast online DB access you should set up Frontier access. This works for release 15.5.X and later. Set up the ATLAS software, then do

   export FRONTIER_SERVER='(proxyurl=http://cernvm.lbl.gov:3128)(serverurl=http://frontier.racf.bnl.gov:8000/frontieratbnl)(retrieve-ziplevel=5)'

then run athena and marvel at how fast it is.

Using Kerberos at PDSF

To be able to check out packages from CVS or use your CERN AFS space, you need Kerberos authentication. CERN has now switched to Kerberos v5. To be properly authenticated, you need to define a variable:

 export KRB5_CONFIG=/common/atlas/kits/setup_files/krb5.conf

and use

 /usr/kerberos/bin/kinit username@CERN.CH

to authenticate yourself.
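
To check that you obtained a valid ticket, you can run the standard Kerberos klist command (same directory as kinit above):

 /usr/kerberos/bin/klist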

Using pAthena on PDSF: Running Grid Jobs

  • Using pathena is an efficient way to do analysis over the grid. The documentation is quite good and should be able to guide you through using it. Here are some quick installation instructions.

You can use pathena through the ATLASLocalRootBase setup:

 setupATLAS
 asetup 17.2.0.2          # or your favorite athena release
 localSetupPandaClient
 pathena --help


Now pathena is ready to use. If you're using prun instead, you likely do not need to set up athena:

 setupATLAS
 localSetupPandaClient
 prun --help
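
Once the client is set up, a typical pathena grid submission looks like the following sketch (the job options file and dataset names are purely illustrative):

 pathena MyJobOptions.py \
     --inDS mc11_7TeV.12345.some_sample.merge.AOD.e825_s1310_r3043/ \
     --outDS user.yourNickname.test.v1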


Useful Atlas Tools on pdsf

  • Valgrind is a good way to debug code, find memory leaks, and understand seg faults. To set it up on PDSF (SL4), set up your normal software and then do
 source /afs/cern.ch/sw/lcg/external/valgrind/3.3.0/slc4_amd64_gcc34/_SPI/start.sh

To run valgrind, do

 valgrind --leak-check=yes --trace-children=yes --num-callers=8 --show-reachable=yes \
   `which athena.py` jobOptions.py > valgrind.log 2>&1

You can decode the output using the ATLAS documentation.


Other Tips for Easy Usage

Compiling the ATLAS software on PDSF is quite slow. If you need to compile something and are not adding any new header or source files, you can do

 make QUICK=1

You can't use this if you've just checked out a package and are compiling for the first time. This is only to be used when you've made relatively minor changes.

There are default soft limits on the address space users are allowed to consume on the interactive machines. This can, for instance, cause problems when opening large ROOT files for browsing or with a MakeClass script. To get around this, try executing this command in your bash shell:

  ulimit -v 5242880    # raise the virtual-memory limit to 5 GB (value is in kB)

Known Problems with using ATLAS specific software on PDSF

  • 10% of BS->ESD->AOD->DPD _reco_ jobs fail without error messages. Hypothesis: this is due to high GPFS load on /common resulting from db-file copying and shared-library loading. Iwona and Sven are investigating. Update as of Sep 5: after running a few hundred analysis jobs on AOD files (these jobs don't access the db much), I'm happy to report that I don't see this failing-without-error-message problem. This supports the GPFS load hypothesis and means that for the most common analysis use pattern things are OK.
  • Initial package compilation is 3x slower than on lxplus. "cmt make -j4" helps a bit, but is still twice as slow as "cmt make" on lxplus. Someone should investigate what the limiting factor is and try to improve it.
  • Kerberos against lxplus does not work on 64-bit interactive PDSF nodes (pdsf1 and pdsf2) with CHOS=sl44 --> Sven filed ticket: 080918-000102 --> fixed 9/19/08
  • The only 64-bit interactive nodes, pdsf1 and pdsf2, freeze roughly once a week, leading to reboots. Memory has been exchanged and the BIOS upgraded, but the problem persists. NERSC is planning to exchange the processors next (Sven has filed a bug report). Update 9/19/08 (Sven): No more crashes on pdsf2 after CPU replacement last week, and today pdsf1 CPUs were also replaced. Let's keep our fingers crossed.