RCommands tutorial

Objective

The objective of this demonstration is to explore how the SRB can be interfaced with some new CCLRC metadata tools to provide a more sophisticated method for data management, archiving and (eventually) retrieval. The plan is to give you the scope to explore the tools in whatever way you like, rather than guide you through a set of pre-determined steps.

You will need some data in the SRB to work with. This is trickier to set up than you might at first imagine. If we simply provided you with sets of files of numbers, it would be very hard for you to see what the files are. This is why metadata is so important: if the files are not self-describing (i.e. if a file doesn't contain some sort of header telling you exactly what is in the file) and not well organised, you need something else that provides the basic information about each file. This demonstration is about adding metadata and organisation to files of data. The most obvious (to us) type of file that is sufficiently self-describing to be useful for a demonstration is a publication, and one option for this demonstration is a set of pdf files containing publications on escience on which one or more NIEeS staff are co-authors. We have organised these into four groups which can be obtained from:

  • Papers on grid computing
  • Papers on data management
  • Papers on collaborative tools
  • Papers on escience applications

There is no need to download all these files for this demonstration; you just need enough to be able to play with. Download enough to put into the SRB as per yesterday's demonstration.

Thinking about data organisation

Before you start to compute in earnest, you should think about data organisation. The tools used in this demonstration will assume a three-layer hierarchy:

  • The study level. This is the over-arching level under which you will group all files concerned with one particular piece of work. An example might be a study of sea surface temperatures in the North Atlantic Ocean. If you use the pdf publication files as your data, all together they might represent a single study called "escience".
  • The dataset level. This grouping will consist of a set of files associated with one aspect of the study. For example, in a study of sea surface temperatures, it might be one season or one region. If you use the pdf publications files above, we have already separated these into possible data sets ("grid computing", "data management", "collaborative tools" and "applications").
  • The data object level. This will consist of a single file or a natural collection of files (such as the complete set of files produced by a single computation). If you use the pdf publication files, each file will be a data object.

One important point should be noted: the study and dataset levels are completely abstract. In contrast, the data objects correspond to URIs (see the Wikipedia reference) that point to real objects, including (but not limited to) files or collections of files in the SRB.

You should not feel constrained by this hierarchy. For example, you may feel that your whole life's work is one study, so that this level has little meaning. On the other hand, you may feel that any one study should only have data objects. This hierarchy has many interpretations and should be used in the way that best suits the investigator.

It is possible to add metadata to each of these levels. Within the framework of the tools you will be using, each level will have an ID number that is used in the scriptable RCommands.

There are two aspects of this work that we won't be too concerned about here but which you will see glimpses of. First, one of the purposes of metadata is to let you share data: if a set of data is annotated with an appropriate set of metadata, a colleague no longer needs to keep asking you what something means. Thus the data organisation also includes the concept of other investigators, who may be co-owners of the data or people you want to share your data with. Second, it is possible to associate data with topics to better enable colleagues to browse for data.

Adding metadata: the RCommands

Introducing the RCommands

You will conclude from this demonstration that adding metadata to files or collections of files can be a very tedious business. That is why metadata continues to be a challenge to the community, in spite of the fact highlighted in the introduction that without metadata it is very difficult to attach meaning to files of data in a useful way.

One approach, which is particularly useful for studies that involve simulations or computer-based analysis of data, is to have scriptable commands to add metadata. This means that creation of metadata can be semi-automated. The RCommands represent one implementation of this approach. The RCommands work in ways that are analogous to the Scommands, and will apply to data that are held within the SRB (although they could also apply to files held within an FTP server or on a web page). The RCommands will insert and modify metadata held within a central metadata server.

There are only ten RCommands, with detailed descriptions provided in the links.

  • Rinit: starts an RCommand session, and is needed in order to read information from configuration files.
  • Rpasswd: changes your password that is associated with your access to the metadata server.
  • Rcreate: creates a metadata object, i.e. any of the study, dataset and data object levels of metadata.
  • Rannotate: adds a description or a metadata parameter name/value pair to a study or dataset.
  • Rls: lists the different entities within the metadata database.
  • Rget: displays the metadata associated with a particular entity.
  • Rrm: removes entities from the metadata database.
  • Rchmod: adds or removes investigators to or from a study.
  • Rsearch: searches the metadata associated with studies and datasets for name/value pairs or keyword descriptions.
  • Rexit: ends an RCommand session and has the primary effect of cleaning away hidden files created during the session.

To use the RCommands, use the PuTTY ssh client to log in to one of the NIEeS Linux machines. Details of the IP address and username/password you should use are provided, as per the demonstration yesterday on the Scommands. First you should look at the essential configuration files contained within the .rcommands folder using the commands:

cd .rcommands
ls -a
cat rcommands.config

Getting started

Now initiate an RCommand session using the Rinit command. You can test that all is well by typing the Rls command: it will return a message telling you (correctly at this point) that you have no studies. To get information about other commands, you can simply type the command name with no arguments, you can use the Unix man command, or you can look at the web pages above (which copy from the man pages). If you make any mistakes that you want to remove, this can be done using the metadata edit outlined in section 4 below.

Creating your first study

First use the Rcreate command to create a study level. To use Rcreate you will need to give the study a name, add a description, and assign it to a topic, via:

Rcreate -n <name> -k <description> -t <topicID>

First you should think about the topic. You can list all topics by the command

Rls -t

Choose a topic and note the number; this will be the topicID label. If you can't decide, just make an arbitrary choice; for the purpose of this exercise it doesn't matter. Run the Rcreate command to create a study. The name and description labels can contain more than one word if enclosed in quotes. For example:

Rcreate -n "Workshop papers" -k "Papers for workshop" -t 4

Now check that this has worked by running the Rls command. This will return information like


StudyID: 1026 Name: Workshop papers

where the StudyID number will differ for different people. Now look at this in more detail using the Rget command:

Rget -s studyID

where you add your StudyID number. For the example above:

Rget -s 1026

gives


StudyID: 1026 Name: Workshop papers Description: Papers for workshop Created by: martin dove Status: In Progress Start_date: 07-01-2006

Adding datasets with metadata

Now we want to add some data sets to the study. Following the example of pdf publications, we could create some datasets by

Rcreate -s 1026 -n "Papers on grid computing"
Rcreate -s 1026 -n "Papers on data management"
Rcreate -s 1026 -n "Papers on collaborative tools"
Rcreate -s 1026 -n "Papers on escience applications"

Each invocation will create a DatasetID, which will be echoed to the screen. Now check on the results of these commands by

Rls -s 1026

This will show you the DatasetID for each dataset (again, different users will get different numbers). You can look at any one dataset by using the command

Rget -d DatasetID

where you use the appropriate number for each DatasetID.
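
Because the Rcreate calls for the datasets differ only in the dataset name, they lend themselves to scripting. The following is a minimal sketch, not part of the RCommands themselves: a hypothetical shell helper, dataset_commands, that prints one Rcreate line per name for a given StudyID, assuming the Rcreate -s ... -n ... form used in this tutorial. It only prints the commands; nothing is sent to the metadata server.

```shell
#!/bin/sh
# Hypothetical helper (not an RCommand): print one Rcreate line per dataset
# name for the study given as the first argument.
# Assumes the form "Rcreate -s <StudyID> -n <name>" used in this tutorial.
# Names containing double quotes are not handled.
dataset_commands() {
    study_id=$1
    shift
    for name in "$@"; do
        printf 'Rcreate -s %s -n "%s"\n' "$study_id" "$name"
    done
}

# Example with the StudyID and dataset names from the text.
dataset_commands 1026 \
    "Papers on grid computing" \
    "Papers on data management" \
    "Papers on collaborative tools" \
    "Papers on escience applications"
```

Once the printed lines look right, they could be piped to sh on a machine where the RCommands are installed and an Rinit session is active.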

Now we will add some metadata to each dataset. For this we use the Rannotate command. In my example, when I run Rls -s 1026 I get


Dataset ID: 26 Dataset Name: Papers on grid computing Parent StudyID: 1026
Dataset ID: 27 Dataset Name: Papers on data management Parent StudyID: 1026
Dataset ID: 28 Dataset Name: Papers on collaborative tools Parent StudyID: 1026
Dataset ID: 29 Dataset Name: Papers on escience applications Parent StudyID: 1026

We can use the Rannotate command in two ways. First we can add a description to the dataset. My example is

Rannotate -d 29 -k "Collection of papers on escience applications"

Second we can add some name/value pairs. My example is

Rannotate -d 29 -p topic=escience
Rannotate -d 29 -p topicarea=applications

Running the Rget -d 29 command to view the metadata gives


DatasetID: 29 Name: Papers on escience applications Parent StudyID: 1026 Created by: martin dove Creation_date: 07-01-2006 Description: Collection of papers on escience applications

Note that this shows the description but not the name/value pairs. To see the name/value pairs I need to use the command Rget -d 29 -p, which yields:


Parameter Name: topic Parameter Value: escience
Parameter Name: topicarea Parameter Value: applications

You can repeat this for other datasets, and you can add whatever name/value pairs you like.
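
If you want to reuse the name/value pairs in a script, the parameter listing can be reduced to name=value lines. The following is a minimal sketch, assuming that Rget -p prints one parameter per line in exactly the layout shown above (the real output may differ); params_to_pairs is a hypothetical helper name, not an RCommand.

```shell
#!/bin/sh
# Hypothetical helper: turn lines of the form
#   Parameter Name: N Parameter Value: V
# (the layout shown in this tutorial) into "N=V" lines.
# Lines that do not match this layout are dropped.
params_to_pairs() {
    sed -n 's/^Parameter Name: \(.*\) Parameter Value: \(.*\)$/\1=\2/p'
}

# Demonstration on the sample listing from the text. On a real system the
# input would instead come from something like: Rget -d 29 -p | params_to_pairs
printf '%s\n' \
    'Parameter Name: topic Parameter Value: escience' \
    'Parameter Name: topicarea Parameter Value: applications' |
    params_to_pairs
```

The resulting lines have the same name=value shape that Rannotate -p takes, so they could be used to copy annotations from one dataset to another.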

Adding data objects with metadata

Finally we reach the point where we can add metadata to the data objects. You need to first have data somewhere, and in our case our data are in the SRB. The data object can either be a file or a collection of files within the SRB. The command for adding metadata to a data object is

Rcreate -u <URI> -d <datasetID> -n <name>

The -u argument specifies where the file is and has the form:

srb:////

In general is composed of

/home/.//.../

An example might be srb://Test/home/nieessrb40.srbdom/test.dat. The -d argument gives the dataset that you want to associate the file with, and the -n argument is the name you want to give the data object.
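
Registering many files in one SRB collection is again a repetitive task. The following is a minimal sketch under stated assumptions: object_commands is a hypothetical helper (not an RCommand) that prints one Rcreate line per file, taking the Rcreate -u ... -d ... -n ... form given above, building each URI by appending the file name to a collection URI, and using the file name as the data object name. The collection URI follows the example in the text; the file names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print one Rcreate line per file so that each file in
# an SRB collection becomes a data object of one dataset.
# Arguments: <collection URI> <DatasetID> <file>...
# File names with spaces or double quotes are not handled.
object_commands() {
    base_uri=${1%/}      # collection URI with any trailing slash removed
    dataset_id=$2
    shift 2
    for file in "$@"; do
        name=$(basename "$file")
        printf 'Rcreate -u %s/%s -d %s -n "%s"\n' \
            "$base_uri" "$name" "$dataset_id" "$name"
    done
}

# Dataset 26 is "Papers on grid computing" in the example listing;
# paper1.pdf and paper2.pdf are placeholder file names.
object_commands srb://Test/home/nieessrb40.srbdom 26 paper1.pdf paper2.pdf
```

As with any generated commands, it is worth reading the printed lines before running them, since each one creates an entry in the metadata database.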

You then add metadata with the Rannotate command in the same way that you added name/value pair metadata to the dataset:

Rannotate -o dataObjectID -p <name>=<value>

where you get the dataObjectID from the dataset using the command

Rls -d <datasetID>

Hopefully by now you are getting more familiar with the various ID labels: studyID, datasetID and now dataObjectID for the study, dataset and data object respectively.

As before, you can use the Rget command to get the metadata from a data object:

Rget -o <dataObjectID> -p

Searching on the metadata

The power of metadata comes down to what you do with it! The RCommands provide for this with the Rsearch command. There are several ways to use this command:

Rsearch -s studyID -p <name>=<value>
Rsearch -d datasetID -p <name>=<value>
Rsearch -d datasetID -k <keyword>
Rsearch -o dataObjectID -k <keyword>

Once you have created enough metadata you can experiment with the Rsearch command.

Topic revision: r1 - 19 Jan 2009 - 11:04:05 - RobAllan
 