CernVM File System on Personal Workstations
Summary
This page documents how staff should install and configure CVMFS for reliable use at the scale of computing clusters.
About CernVM File System
The CernVM File System ("CernVM-FS" or "CVMFS") is a tool that allows for efficient global distribution of software and data that do not change frequently. Its name reflects its origins as a tool for virtual machines used by the high energy physics community, but it has wider applicability and usage. It caches files on disk so that, after the initial download, file access is fast for the client.
Within LIGO, CVMFS is being used to distribute both instrument data ("frame files") and analysis software for use at LIGO computing sites and on the Open Science Grid (OSG).
Installation and Initial Configuration
Please follow the instructions and simple configuration described in CvmfsUser. This configuration is only appropriate for the low amount of traffic one might expect from a single client. You will find advanced configuration options appropriate for large-scale computing centers below.
I do not intend these instructions to replace upstream CVMFS documentation but, instead, to supply LIGO computing administrators with some smart choices for a basic first installation.
Networking considerations
CVMFS clients download software files over HTTP on ports 80 and 8000, and they download instrument data over ports 1094 and 8443 with certificate-based authentication. Your clients must be able to make direct HTTPS connections to upstream servers, while HTTP traffic can be handled through indirect proxy connections.
Security considerations
An underlying assumption of CVMFS is that all software files are strictly public: accessible by anyone on the planet. By contrast, instrument data are private and can only be accessed with a certificate like that generated by ligo-proxy-init.
All CVMFS software files are downloaded over HTTP, while instrument data (which are private) are downloaded with HTTPS and require certificate-based authentication. This section is only concerned with software files over HTTP and reducing the number of repeated downloads from your site to the public internet. Reasons you should do this:
- To avoid overwhelming the central CVMFS servers that host LIGO software
- To avoid N>>1 clients using your entire bandwidth to the internet
- Your proxies' external connections should, in total, be slower than your data center bandwidth
Specific details regarding the CVMFS client:
- CVMFS clients can be configured to balance among several proxies in a variety of ways. If the configured proxies fail to respond, CVMFS clients will fail over to direct upstream connections. If you do not want them to fail over to your external link, ensure that at least one proxy is active at any time. Otherwise (or in addition), ensure that clients cannot reach the internet over ports 80 or 8000.
- Users should configure their workstations to use the campus proxy, and you should configure the proxy to reject connections from off campus. This will ensure that users use the proxy while on campus and fail over to a direct connection when off campus.
- What the CVMFS documentation calls "load balancing" is more accurately called random selection. If CVMFS_HTTP_PROXY is configured with a list of proxy servers in a balanced configuration (e.g. CVMFS_HTTP_PROXY="http://squid0.lan:3128|http://squid1.lan:3128"), each client will randomly pick one of the servers for a given filesystem and, every so often, switch to another. So any given client will only be using one proxy for each filesystem at a time, though it may be accessing multiple filesystems. Through this approach, a large population of clients will balance its load across all N proxies.
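The proxy list syntax discussed above can be sketched as follows. In upstream CVMFS, "|" separates proxies inside one randomly selected group and ";" separates groups that are tried in order, with the keyword DIRECT meaning a connection without a proxy. The hostnames are the placeholders from the example above, and the helper list_proxy_groups is hypothetical, included only to make the grouping visible. Only the assignment would go in /etc/cvmfs/default.local.

```shell
# Sketch of a workstation-style setting: prefer the two campus proxies,
# then fall back to an explicit direct connection as the last group.
# squid0.lan and squid1.lan are placeholder hostnames.
CVMFS_HTTP_PROXY="http://squid0.lan:3128|http://squid1.lan:3128;DIRECT"

# Hypothetical helper: print the failover groups in order, one per line,
# so the "|" (same group) versus ";" (next group) structure is easy to check.
list_proxy_groups() {
    printf '%s\n' "$1" | tr ';' '\n'
}

list_proxy_groups "$CVMFS_HTTP_PROXY"
```

For cluster nodes that should never use the external link, the same setting without the trailing ";DIRECT" group may be the better starting point, combined with the firewall advice above.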
Specific details regarding the SQUID proxies:
Debugging:
- Run sudo attr -qg proxy /cvmfs/oasis.opensciencegrid.org to see which proxy is active for your connection. Remember, the active proxy may differ for each filesystem!
- Read and understand the SQUID logs on that proxy!
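Because the active proxy is tracked per filesystem, it can help to query every repository you use in one pass. The following is a minimal sketch that wraps the attr command above in a loop. The function name show_active_proxy is hypothetical, and the two repository paths are the ones mentioned on this page. Prefix the attr call with sudo if your site requires it, as in the command above.

```shell
# Hypothetical helper: report the active proxy for each repository path
# given as an argument, or say so if the path is not mounted.
show_active_proxy() {
    for repo in "$@"; do
        if [ -d "$repo" ]; then
            # attr -qg prints only the value of the "proxy" extended attribute
            proxy=$(attr -qg proxy "$repo" 2>/dev/null || echo unknown)
            printf '%s: %s\n' "$repo" "$proxy"
        else
            printf '%s: not mounted\n' "$repo"
        fi
    done
}

show_active_proxy /cvmfs/oasis.opensciencegrid.org /cvmfs/ligo.osgstorage.org
```

A result of "unknown" means the path exists but the attribute could not be read, which is worth checking before turning to the proxy logs.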
While the HTTP proxy described above operates as a layer of caching for your data center, CVMFS has two distinct methods for the client itself to cache both software (HTTP) and instrument data (HTTPS).
Local disk cache
The default behavior is to cache up to CVMFS_QUOTA_LIMIT (in MB) in the directory CVMFS_CACHE_BASE. This should be an area local to the machine and must not be set to a shared directory, because file locking is not handled for shared access.
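As a sketch of the two settings named above, the following shows how they might appear in /etc/cvmfs/default.local, followed by a quick check that the cache directory is not on a network filesystem. The quota value is illustrative rather than a recommendation, the helper is_shared_fs_type is hypothetical, its list of filesystem types is an assumption that you should extend for your site, and the stat invocation assumes GNU coreutils.

```shell
# Settings as they might appear in /etc/cvmfs/default.local (illustrative values)
CVMFS_CACHE_BASE=/var/lib/cvmfs   # must be on a disk local to the machine
CVMFS_QUOTA_LIMIT=20000           # cache limit in MB

# Hypothetical helper: succeed if the filesystem type names a network or
# cluster filesystem. The list is an assumption, not an exhaustive one.
is_shared_fs_type() {
    case "$1" in
        nfs*|cifs|smb*|lustre|gpfs|ceph) return 0 ;;
        *) return 1 ;;
    esac
}

# GNU stat prints the type of the filesystem holding the cache directory;
# fall back to "unknown" if the directory does not exist yet.
fs_type=$(stat -f -c %T "$CVMFS_CACHE_BASE" 2>/dev/null || echo unknown)
if is_shared_fs_type "$fs_type"; then
    echo "WARNING: $CVMFS_CACHE_BASE is on $fs_type; choose a local disk" >&2
fi
```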
Shared "alien" cache
The "alien" cache replaces the local disk cache with a location that can be shared among N clients. If you expect your access to be low and you wish to ensure that LIGO data are downloaded over your public link very few times, a single NFS server with sufficient storage may suffice. Otherwise, an appropriate choice would be a shared file system (Hadoop, etc.) or a set of NFS servers that match the directory structure created by the client.
In CVMFS 2.4, it is hoped that one might use librados to cache data directly in a Ceph object store.
This method bypasses certificate-based authentication for LIGO frame files. It is only appropriate for clusters without non-LVC users or where file permissions are set to prevent non-LVC users from reading LIGO data.
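A minimal sketch of alien cache settings follows, assuming they live in /etc/cvmfs/default.local. The path /mnt/shared/cvmfs-alien is hypothetical. Upstream CVMFS documentation states that CVMFS_ALIEN_CACHE requires CVMFS_QUOTA_LIMIT=-1 and CVMFS_SHARED_CACHE=no, because the alien cache is not managed by the client. The helper alien_cache_settings_ok is hypothetical and only checks that the three values agree.

```shell
# Settings as they might appear in /etc/cvmfs/default.local
CVMFS_ALIEN_CACHE=/mnt/shared/cvmfs-alien   # hypothetical shared location
CVMFS_QUOTA_LIMIT=-1                        # no quota: the alien cache is unmanaged
CVMFS_SHARED_CACHE=no                       # required together with an alien cache

# Hypothetical helper: succeed only if the three settings are consistent.
alien_cache_settings_ok() {
    [ -n "$CVMFS_ALIEN_CACHE" ] &&
        [ "$CVMFS_QUOTA_LIMIT" = "-1" ] &&
        [ "$CVMFS_SHARED_CACHE" = "no" ]
}

if alien_cache_settings_ok; then
    echo "alien cache settings are consistent"
else
    echo "alien cache settings conflict; check quota and shared cache" >&2
fi
```

Since no quota applies, the shared location grows without limit unless you clean it up yourself, so size it with that in mind.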
Certificate-protected files are found under /cvmfs/ligo.osgstorage.org/frames. However, it is best for clients to access them under /cvmfs/oasis.opensciencegrid.org/ligo/frames. This is because /cvmfs/oasis.opensciencegrid.org/ligo/frames is a configurable symbolic link that can point to an existing local/shared filesystem with instrument data. This filesystem should have filenames that match the convention found in /cvmfs/ligo.osgstorage.org/frames, e.g.
/cvmfs/ligo.osgstorage.org/frames/O2/hoft/L1/L-L1_HOFT_C00-11690/L-L1_HOFT_C00-1169022976-4096.gwf
Configure this symbolic link in /etc/cvmfs/config.d/oasis.opensciencegrid.org.local. For example:
export LIGO_DATA_FRAMES=/mnt/hadoop/user/ligo/frames
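To check that a local frame store follows the same naming convention, the directory name can be derived from a frame file name. The sketch below assumes, based only on the single example path above, that the directory is the file name's prefix followed by the first five digits of the GPS start time. The function frame_dir is hypothetical, and the O2/hoft/L1 components are copied from that example rather than computed.

```shell
LIGO_DATA_FRAMES=/mnt/hadoop/user/ligo/frames   # value from the example above

# Hypothetical helper: derive the directory name for a frame file, assuming
# the pattern <prefix>-<first five digits of GPS start time>.
frame_dir() {
    name=${1%.gwf}      # drop the extension
    rest=${name%-*}     # drop the trailing duration field
    gps=${rest##*-}     # GPS start time
    prefix=${rest%-*}   # everything before the GPS start time
    printf '%s-%s\n' "$prefix" "$(printf '%s' "$gps" | cut -c1-5)"
}

frame=L-L1_HOFT_C00-1169022976-4096.gwf
# Full local path the symbolic link would need to resolve for this frame
echo "$LIGO_DATA_FRAMES/O2/hoft/L1/$(frame_dir "$frame")/$frame"
```

For the example file, frame_dir yields L-L1_HOFT_C00-11690, which matches the directory in the path shown above.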