MCS Dev

This page supports the development of my_condor_submit (MCS), so that we can continue to add functionality without either introducing new bugs or making the code so complex that further development becomes impossible. This mandates a relatively slow development process, with new features discussed in detail before they are added to the code. Such discussions should take place on this wiki page and on the mcs mailing list (mcs@niees.ac.uk). Current discussions include:

  • MCS Dev/Improved parser
  • MCS Dev/Points of failure
  • MCS Dev/small cleaning tasks
  • MCS Dev/MPI issues

Status

The current release version is 1.4.1. At present there are no plans for a 1.4.2 branch, but such a release remains a possibility. The 1.5 branch is under development and contains some significant structural changes to the code as well as some significant new features. Release notes for past, present and future versions of MCS can be found at the following links:

  • 1.4.0
  • 1.4.1

Future release

1.4.2

Any further urgent bug fixes could be introduced into a hypothetical 1.4.2 release.

Bugs fixed

(These fixes are at the tip of the 1.4 branch, would be included in a 1.4.2 release, and are deployed on tamar.)

  • User specified metadata capture fixed.

1.5.0

Plans for 1.5.0 were devised at a meeting in September 2007. The major drive will be to modularise the MCS code to remove overlap between MCS and RMCS. Andrew will handle modularising the database layer and will introduce database-driven machine selection to support MACE (http://amazon.dl.ac.uk/bugzilla/show_bug.cgi?id=69), RMCS retries, etc. Rik will modularise the script generation layer and will tackle the parsing layer. The new parsing layer will have a two-stage design: the first stage builds a hash-of-hashes data structure and the second validates this input and performs the needed escaping / normalisation. This will simplify handling other input formats.
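As a rough illustration of the two-stage design, the sketch below (in Perl, like MCS itself) builds a hash-of-hashes from a simple "Key = value" input and then validates and escapes it in a separate pass. The "[Section]" input format and the subroutine names are illustrative assumptions, not the actual MCS syntax:

    use strict;
    use warnings;

    # Stage one: read raw input into a hash-of-hashes, deferring all
    # validation. The "[Section]" / "Key = value" format is assumed
    # purely for illustration.
    sub parse_input {
        my ($fh) = @_;
        my %input;
        my $section = 'global';
        while ( my $line = <$fh> ) {
            chomp $line;
            next if $line =~ /^\s*(?:#|$)/;              # skip comments/blanks
            if ( $line =~ /^\s*\[(\w+)\]\s*$/ ) {
                $section = $1;
            }
            elsif ( $line =~ /^\s*(\w+)\s*=\s*(.*?)\s*$/ ) {
                $input{$section}{$1} = $2;
            }
        }
        return \%input;
    }

    # Stage two: validate the structure and perform the needed
    # escaping / normalisation in one place.
    sub validate_input {
        my ($input) = @_;
        for my $section ( keys %$input ) {
            for my $key ( keys %{ $input->{$section} } ) {
                my $value = $input->{$section}{$key};
                die "Empty value for '$key' in [$section]\n" if $value eq '';
                $value =~ s/(["\\])/\\$1/g;              # escape quotes/backslashes
                $input->{$section}{$key} = $value;
            }
        }
        return $input;
    }

    my $input = validate_input( parse_input(\*DATA) );
    print $input->{global}{Executable}, "\n";

    __DATA__
    Executable = /bin/hostname

Keeping stage one free of validation means that supporting an alternative input format only requires a new stage-one reader.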

There will also be some more minor enhancements. Rik will implement Schmod (http://amazon.dl.ac.uk/bugzilla/show_bug.cgi?id=18), add compression of output files (http://amazon.dl.ac.uk/bugzilla/show_bug.cgi?id=5), ensure the clean-up of working directories (http://amazon.dl.ac.uk/bugzilla/show_bug.cgi?id=85) on the gatekeepers, introduce the ability to use a non-standard AgentX directory structure (http://amazon.dl.ac.uk/bugzilla/show_bug.cgi?id=61) and look at the handling of arguments (http://amazon.dl.ac.uk/bugzilla/show_bug.cgi?id=46). Andrew will improve the logging (http://amazon.dl.ac.uk/bugzilla/show_bug.cgi?id=116) of globus calls and introduce sha1 hashing (http://amazon.dl.ac.uk/bugzilla/show_bug.cgi?id=38) of output files.
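For the sha1 hashing of output files, something along the lines of the following would suffice, using the core Digest::SHA module; the helper name and file name here are hypothetical:

    use strict;
    use warnings;
    use Digest::SHA;

    # Compute the hex sha1 digest of an output file so it can be
    # recorded (e.g. alongside the metadata) and checked after retrieval.
    sub sha1_of_file {
        my ($path) = @_;
        my $sha = Digest::SHA->new('sha1');
        $sha->addfile($path);
        return $sha->hexdigest;
    }

    print sha1_of_file('output.dat'), "\n";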

See also MCS Dev/small cleaning tasks.

Draft release notes

  • Much of the MCS code base has been moved into separate, self-contained modules to improve maintainability: previous versions of MCS were a single monolithic file.

Draft enhancements

  • All globus calls (instead of just globus failures) are now logged to the database for later analysis. (Bug 116 (http://amazon.dl.ac.uk/bugzilla/show_bug.cgi?id=116))
  • MACE support added.
  • Scheduling of jobs to grid resources now makes use of a plug-in architecture (a minimal sketch follows below).
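A minimal sketch of what the plug-in architecture might look like: each scheduler plug-in is a Perl module exposing a common select_machine interface, and MCS picks a plug-in by name. The MCS::Scheduler:: namespace, the method names and the hostnames are all assumptions for illustration:

    use strict;
    use warnings;

    # One possible plug-in: pick machines in turn. A MACE-aware
    # plug-in would instead consult the machine-selection database.
    package MCS::Scheduler::RoundRobin;

    sub new { my ($class) = @_; return bless { next => 0 }, $class; }

    sub select_machine {
        my ( $self, $machines ) = @_;
        return $machines->[ $self->{next}++ % @$machines ];
    }

    package main;

    my @machines  = ( 'grid1.example.org', 'grid2.example.org' );
    my $scheduler = MCS::Scheduler::RoundRobin->new;
    print $scheduler->select_machine( \@machines ), "\n";    # grid1.example.org
    print $scheduler->select_machine( \@machines ), "\n";    # grid2.example.org

Swapping scheduling policies then only means loading a different module, which should make database-driven selection for MACE straightforward to slot in.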

Draft bug fixes

1.5.1

A major driver for 1.5.1 will be the ability to support longer-running jobs that need restarting. In general this needs "file system gymnastics", support in the metadata layer (http://amazon.dl.ac.uk/bugzilla/show_bug.cgi?id=48) and, ultimately, the ability to modify input files.

Feature requests not scheduled for inclusion

  • Remove the "throughput" and "performance" categories (and merge the DB tables?) and replace them with more useful data we can use for intelligent (possibly user-driven) scheduling. Suggested fields: maximum job time, MPI capability, memory per node, approximate performance (MIPS). This is a second stage of the MACE work.

Madder ideas to be binned?

  • In debug mode, set the ulimit core size in the post script and upload the core file if one exists (an ultradebug mode for AgentX issues). Nice idea, but how can we do this?
  • A pre script to add metadata of the form "JobStatus = submitted", and a post script to overwrite this with "JobStatus = completed".
  • Remove the Sdirect flag and store this information within the Seagul database. We should probably insist that vaults are writable by all resources?
  • Submission to GridSAM-controlled grids (if such things exist), or to grids which condor does not submit to, could be implemented by wrapping the job submission script in a condor job constrained to run on the local machine. Not worth doing unless a significant non-globus grid is ever made available to our users. Are there any resources to warrant this?

Brain Dump for Future Discussion

  • A way to set platform-specific env vars, e.g. for dynamically linked binaries
  • MACE support. Done.
  • Remove %userOptions
  • Finally formalise the input file syntax
  • Try an FTP-style data storage port
  • Try using OTPW for data grid authentication
  • Add an option to use a single vault for SRB upload (cf. noHose in Monty)
  • Audit error messages and users (bagsies not my job). Done: Rik's script to automate future audits is committed.
  • Have a hash or RE to define tags that are unique and hence encapsulated within the Sdir block [and use only two error strings for this]
  • Do we need empty-string error messages?
  • Take config out of MCS itself: dynamic settings to an INI file? Static settings (e.g. parser stuff) to a .pm or other? Done.
  • Can we catch pre.pl failure by dumping a tag to disk and using a local condor job in the DAG to check for this (Globus code internal)?
  • Use connection pooling within RMCS

