CernVM File System on Personal Workstations
About CernVM File System
The CernVM File System ("CernVM-FS" or "CVMFS") is a tool that allows for efficient global distribution of software and data that do not change frequently. Its name reflects its origins in virtual machines used by the high energy physics community, but it has much wider applicability and usage. It caches files to disk so that, after the initial download, file access for the client is fast.
Within LIGO, CVMFS is being used to distribute both instrument data ("frame files") and analysis software for use at LIGO computing sites and on the Open Science Grid (OSG).
Installation and Initial Configuration
Please follow the instructions and simple configuration described in CvmfsUser. This configuration is only appropriate for the low amount of traffic one might expect from a single client. You will find advanced configuration options appropriate for large-scale computing centers below.
I do not intend these instructions to replace upstream CVMFS documentation but, instead, to supply LIGO computing administrators with some smart choices for a basic first installation.
CVMFS clients download software files over HTTP on ports 80 and 8000, while they download instrument data over (TPD believes) port 443 with certificate-based authentication. Your clients must be able to make direct HTTPS connections to upstream servers, while HTTP traffic can be routed indirectly through proxies.
An underlying assumption of CVMFS is that all software files are strictly public: accessible by anyone on the planet. By contrast, instrument data are private and can only be accessed with a certificate like that generated by
Configure CVMFS to download software via HTTP proxies
All CVMFS software files are downloaded over HTTP, while instrument data (which are private) are downloaded with HTTPS and require certificate-based authentication. This section is only concerned with software files over HTTP and reducing the number of repeated downloads from your site to the public internet. Reasons you should do this:
- To avoid overwhelming the central CVMFS servers that host LIGO software
- To avoid N>>1 clients using your entire bandwidth to the internet
- Your proxy/proxies should have external connections that, in total, are slower than your data center bandwidth
Specific details regarding the CVMFS client:
- CVMFS clients can be configured to balance among several proxies in a variety of ways. If the configured proxies fail to respond, CVMFS clients will fail over to direct upstream connections. If you do not want them to fail over to your external link, ensure that at least one proxy is active at any time. Otherwise (or in addition), ensure that clients cannot reach the internet over ports 80 or 8000.
- Users should configure their workstations to use the campus proxy and you should configure the proxy to reject connections from off-campus. This will ensure that users use the proxy while on campus and fail over to a direct connection when off campus.
- What the CVMFS documents call "load balancing" is more accurately called random selection. If CVMFS_HTTP_PROXY is configured with a list of proxy servers in a balanced configuration (e.g. CVMFS_HTTP_PROXY="http://squid0.lan:3128|http://squid1.lan:3128"), each client will randomly pick one of the servers for a given filesystem and, every so often, switch to another. So any given client uses only one proxy per filesystem at a time, though it may be accessing multiple filesystems. Through this approach, a large population of clients will balance its load across all N proxies.
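The proxy behavior described above might be captured in a client configuration fragment like the one below. This is a minimal sketch under stated assumptions: that the client reads site-wide settings from /etc/cvmfs/default.local, that the squid0.lan and squid1.lan hostnames from the example above are your proxies, and that you want an explicit DIRECT entry as a last-resort failover group. In the CVMFS_HTTP_PROXY syntax, "|" separates proxies within one load-balance group and ";" separates failover groups.

```shell
# /etc/cvmfs/default.local (assumed location of site-wide client settings)
# Proxies joined by "|" form one load-balance group: each client picks one at random.
# ";" starts the next failover group, tried only if the whole previous group fails.
# DIRECT means "connect without a proxy"; leave it out if clients should
# never use the external link directly.
CVMFS_HTTP_PROXY="http://squid0.lan:3128|http://squid1.lan:3128;DIRECT"
```

After changing the file, running cvmfs_config reload should apply the new settings to mounted repositories.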
Specific details regarding the Squid proxies:
- Run sudo attr -qg proxy /cvmfs/oasis.opensciencegrid.org to see which proxy is active for your connection. Remember, these will differ for each filesystem!
- Read and understand the Squid logs on that proxy!
Configure CVMFS to locally cache software and frame files
While the HTTP proxy described above operates as a layer of caching for your data center, CVMFS has two distinct methods for the client itself to cache both software (HTTP) and instrument data (HTTPS).
Local disk cache
The default behavior is to cache up to
in MB in the directory
. This should be an area local to the machine and must not be set to a shared directory
as the file-locking is not handled for shared access.
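A local cache configuration might look like the sketch below. It assumes the standard client parameters CVMFS_CACHE_BASE (the cache directory) and CVMFS_QUOTA_LIMIT (a soft size limit in MB), set in /etc/cvmfs/default.local. The values shown are illustrative only; they are not recommendations or the defaults referred to above.

```shell
# /etc/cvmfs/default.local (assumed location of site-wide client settings)
# Cache directory: must live on a disk local to this machine, not on a shared filesystem.
CVMFS_CACHE_BASE=/var/lib/cvmfs
# Soft limit on the cache size, in MB (illustrative value, roughly 20 GB).
CVMFS_QUOTA_LIMIT=20000
```

Size the limit to the working set you expect: a cache that is too small forces repeated downloads of the same files.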
Shared "alien" cache
The "alien" cache replaces the local disk cache with a location that can be shared among N clients. If you expect access to be light and you want LIGO data to be downloaded over your public link as few times as possible, a single NFS server with sufficient storage may suffice. Otherwise, an appropriate choice would be a shared file system (Hadoop, etc.) or a set of NFS servers that match the directory structure created by the client.
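An alien cache setup might be sketched as follows. The mount point /mnt/shared/cvmfs-alien is hypothetical and stands in for whatever shared storage your site provides; the file location /etc/cvmfs/default.local is likewise an assumption. Per the upstream client documentation, an alien cache is unmanaged, so it is used together with a disabled quota and a non-shared local cache manager.

```shell
# /etc/cvmfs/default.local (assumed location of site-wide client settings)
# Hypothetical mount point of the shared storage, visible to all N clients.
CVMFS_ALIEN_CACHE=/mnt/shared/cvmfs-alien
# The client does not manage an alien cache, so quota management is switched off.
CVMFS_QUOTA_LIMIT=-1
# The alien cache cannot be combined with the shared local cache manager.
CVMFS_SHARED_CACHE=no
```

Because nothing evicts files from an alien cache, it grows without bound; any cleanup is left to the site.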
In CVMFS 2.4, it is hoped that one might use
to cache data directly in a Ceph object store.
Configure CVMFS to use local frame files
This method bypasses certificate-based authentication for LIGO frame files. It is only appropriate for clusters without non-LVC users or where file permissions are set to prevent non-LVC users from reading LIGO data.
Certificate-protected files are found under
. However, it is best for clients to access them under
. This is because
is a configurable symbolic link that can point to an existing local/shared filesystem with instrument data. This filesystem should have filenames that match the convention found in
Configure this symbolic link in
. For example: