System Testing and Validation

Regular system testing and validation of the CCSM is required to ensure that model quality and integrity are maintained throughout the development process. This section establishes the system testing standards and the procedures that will be used to verify that the standards have been met. It is assumed that component model development teams have unit tested their components prior to making them available for system testing. See section [*] for more information on testing of individual components and unit-testing of individual subroutines and modules within components.

There are two general categories of model evaluations: frequent short test runs and infrequent long validation integrations.

Model testing refers to short (3 to 31 day) model runs designed to verify that the underlying mechanics and performance of the coupled model continue to meet specifications. This includes verifying that the model actually starts up and runs, benchmarking model performance and the relative speed/cost of each model component, and checking that the model restarts exactly. These tests are done on each of the target platforms. Model testing does not address whether the model answer is correct; it merely verifies that the model mechanically operates as specified.

Model validation involves longer (at least 1 year) integrations to ensure that the model results are in acceptable agreement with both previous model climate statistics and observed characteristics of the real climate system. Model validation occurs with each minor CCSM version (i.e. CCSM2.1, CCSM2.2) or at the request of the CCSM scientists and working groups. Once requested, model validation is only carried out after CCSM scientists have been consulted and the model testing phase is successfully completed. The model validation results are documented on a publicly accessible web page
(http://www.ccsm.ucar.edu/models/ccsm2.0beta/testing/status.html).

Port validation is defined as verification that the differences between two otherwise identical model simulations obtained on different machines or using different environments are caused by machine roundoff errors only.

13.1 Model Testing Procedures for the CCSM

Formal testing of the CCSM is required for each tagged version of the model. The CCSM quality assurance lead is responsible for ensuring that these tests are run, either by running them personally or by having them run by a qualified person. If a model component is identified as having a problem, the liaison for that component is expected to make resolving that problem their highest priority. The results of the testing and benchmarking will be included with the tagged model to document the run characteristics of the model. The actual testing and analysis scripts will be part of the CCSM CVS repository to encourage use by outside users.

13.1.1 Development Testing Steps

1. Successful build: CCSM shall compile on each of the target platforms with no changes to the scripts, codes or datasets.
2. Successful startup: CCSM will start from an initial state and run for 10 days.
3. Successful restart: CCSM will start from an initial state and halt after 5 days, then restart and run from day 6 to day 10.
4. Successful branch: CCSM will start from an initial state and halt after 5 days, then carry out a branch case with only a case name change and run from day 6 to day 10.
5. Exact restart: A bit-for-bit match must occur between the 10-day initial run and the restart and branch runs using the same number of processors (see the comparison sketch after this list).
6. Signal trapping: A signal trapping test should be conducted with the environment variable DEBUG set to true in the Makefile.
7. Other diagnostics: A diagnostic test will be performed with info_dbug set to level 2 in the coupler input.
8. Port diagnostics: A port diagnostic test will be performed with info_bcheck set to level 3 in the coupler input, and a 10 day run will be carried out.
9. Performance benchmarking: The total CPU time, memory usage, output volume, GAU cost, disk space use and wall clock time for the 10 day run will be recorded. The relative cost of each component will also be recorded.
10. Test report: The results of all steps above are to be documented in a test report with emphasis on results, comparisons to the previous test and recommendations for improvements. Any faults or defects observed shall be noted and must be brought to the attention of the liaison responsible for that component and the software engineering manager.
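The exact-restart check in step 5 can be automated by comparing, byte-for-byte, the history output written after the restart point. The following is a minimal sketch of such a comparison, assuming the initial, restart, and branch cases write their day 6-10 history files into separate directories; the directory layout, file pattern, and script itself are illustrative and are not part of the actual CCSM test scripts.

    # Minimal sketch: bit-for-bit comparison of the history files written after
    # the restart point by the initial run and by a restart or branch run.
    # Directory names and the file pattern are hypothetical.
    import hashlib
    import sys
    from pathlib import Path

    def file_digest(path: Path) -> str:
        """Return the SHA-256 digest of a file, read in binary mode."""
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def compare_runs(baseline_dir: str, other_dir: str, pattern: str = "*.nc") -> bool:
        """Compare matching files in two case directories; any mismatch fails."""
        identical = True
        for base_file in sorted(Path(baseline_dir).glob(pattern)):
            other_file = Path(other_dir) / base_file.name
            if not other_file.exists():
                print(f"MISSING  {other_file}")
                identical = False
            elif file_digest(base_file) != file_digest(other_file):
                print(f"DIFFERS  {base_file.name}")
                identical = False
        return identical

    if __name__ == "__main__":
        ok = compare_runs(sys.argv[1], sys.argv[2])
        print("exact restart: PASS" if ok else "exact restart: FAIL")
        sys.exit(0 if ok else 1)

Only the files covering days 6 through 10 should be compared, since the initial and restart runs legitimately differ before the restart point.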

13.1.2 Ongoing Test Steps

1. Smoke-test: A major criterion used in evaluating the effectiveness of a test procedure is the length of time that has elapsed since the last time the system was tested. To test for system or software changes, an automated six day test run will be made each weekend with the latest CCSM distribution on each of the supported platforms (a sketch of such a driver follows this list). A restart test will be conducted on the first weekend of each month.

2. Test report: The results of step 1 will be automatically documented in a test report.
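A weekend smoke test of this kind is straightforward to automate with a small driver launched periodically (for example from cron). The sketch below only illustrates the idea; the case directory, run script name, log file, and completion string are hypothetical and would need to be replaced with the actual CCSM run script and its log conventions.

    # Minimal smoke-test driver sketch.  All paths and the completion string
    # are hypothetical, not part of the actual CCSM scripts.
    import subprocess
    from datetime import datetime
    from pathlib import Path

    CASE_DIR = Path("/ptmp/ccsm/smoke_test")      # hypothetical case directory
    RUN_SCRIPT = CASE_DIR / "run.ccsm"            # hypothetical run script
    LOG_FILE = CASE_DIR / "ccsm.log"              # hypothetical log file
    SUCCESS_STRING = "SUCCESSFUL TERMINATION"     # hypothetical completion message

    def smoke_test() -> bool:
        """Launch a short run and report whether it completed normally."""
        result = subprocess.run([str(RUN_SCRIPT)], cwd=CASE_DIR)
        if result.returncode != 0 or not LOG_FILE.exists():
            return False
        return SUCCESS_STRING in LOG_FILE.read_text()

    if __name__ == "__main__":
        stamp = datetime.now().isoformat(timespec="seconds")
        print(f"{stamp} smoke test {'PASSED' if smoke_test() else 'FAILED'}")

The driver's output can then be collected into the automatically generated test report described in step 2.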


13.2 Model Validation Procedures for the CCSM

Model validation occurs with each minor CCSM version (i.e. CCSM2.1, CCSM2.2) or at the request of the CCSM scientists and working groups. Before starting a validation run, the CCSM Quality Assurance Lead will consult with the CCSM scientists to design the validation experiment.

Pre-Validation Run Steps:

1. Tests successful: The model version to be validated must successfully complete the testing steps outlined above.

2. Scientist sign-on: The CCSM scientists must agree to make themselves available to informally analyze the results while the run is in progress and to formally review the results within one week of the completion of the run.

Validation Steps:

1. Comparison with previous model runs: Results must be in acceptable agreement with previous model runs.

2. Comparison with observed climate: Results must be in acceptable agreement with the observed climate.

13.3 Port Validation of the CCSM

13.3.1 Background

Port validation is defined as verification that the differences between two otherwise identical model simulations obtained on different machines or using different environments are caused by machine roundoff errors only. Roundoff errors can be caused by using two machines with different internal floating point representation, or by using a different number of processing elements on the same machine which may cause a known re-ordering of some calculations, or by using different compiler versions or options (on a single machine or different machines) which parse internal computations differently.

The following paper offers a primary reference for port validation (hereafter referred to as RW):

Rosinski, J. M. and D. L. Williamson: The Accumulation of Rounding Errors and Port Validation for Global Atmospheric Models. SIAM Journal on Scientific Computing, Vol. 18, No. 2, March 1997.

As established in RW, three conditions of model solution behavior must be fulfilled to successfully validate a port of atmospheric general circulation models:

1. during the first few timesteps, differences between the original and ported solutions should be within one to two orders of magnitude of machine rounding;
2. during the first few days, growth of the difference between the original and ported solutions should not exceed the growth of an initial perturbation introduced into the lowest-order bits of the original solution;
3. the statistics of a long simulation must be representative of the climate of the model as produced by the original code.

The extent to which these conditions apply to models other than an atmospheric model has not yet been established. Also, note that the third condition is not the focus of this section (see section 13.2).

13.3.2 Full CCSM Port Validation

Validation of the full CCSM system, defined as the combination of all active model components participating in the full computation, is a two-step process:

1. Validate each model as a standalone system
2. Validate the coupled system

Validation of each component model alone should be performed by the model developers, and it may not be necessary to perform the standalone tests as part of regular, frequent validation testing.

To validate the fully coupled CCSM, the objective is to establish a procedure which will allow one to conclude confidently that the port of the full system (all components active) is valid. However, there are at least two potential problems which should be noted:

* Will the procedure be sufficient to draw conclusions confidently? That is, it must have little potential to conclude a good port when the port is, in fact, bad.
* Upon a conclusion that the port is bad, it is likely that no information will be available pinpointing which component of the full system is suspect.

13.3.3 Recommended Procedure

The general procedure for port validation of the full CCSM is to examine the growth of differences between two solutions over a suitable number of integration timesteps. This error growth can be compared to the growth of differences between two solutions on a single machine, where the differing solution was produced by introducing a random perturbation of the smallest amplitude that can be felt by the model at the precision of the machine.

It is recommended that the procedure examine the growth of differences in a state variable which resides at the primary physical interface (that is, the surface), where the accumulation of errors in all components will act quickly and where the action of the CCSM coupler is also significant (for example, grid mapping).

It is also recommended that the procedure be performed on a coupled system where the exchange of information between active components is frequent. Exchanges of information at model day boundaries may mask the detection of an invalid port because the magnitude of the error differences could reach roundoff saturation levels prior to an exchange of data. See example 5 in section 13.3.4.

The recipe for CCSM validation is as follows:

1. run the CCSM on a selected machine on which confidence in the solution has been established;

2. re-run the CCSM on the same machine, introducing an initial error perturbation in the atmospheric model 3-D temperature using the procedure available in the CCM (see -need a web link-);
3. run the CCSM on the target machine using the same code, same model input namelist files, and same model input data files, and compare the error growth in the perturbed solution versus the error growth in the ported solution.

The errors should satisfy the first two conditions described in RW.
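Step 2 relies on the perturbation facility already available in the CCM; the sketch below is only meant to illustrate the idea of a lowest-order-bit perturbation of the 3-D temperature field and is not the CCM's actual procedure. The array name, shape, and perturbation form are assumptions.

    # Illustrative sketch only: introduce a random perturbation near the
    # lowest-order bits of a 3-D temperature field.  Not the CCM procedure.
    import numpy as np

    def perturb_temperature(t3d: np.ndarray, magnitude: float = 1.0e-14,
                            seed: int = 0) -> np.ndarray:
        """Return a copy of t3d with a relative perturbation of order `magnitude`."""
        rng = np.random.default_rng(seed)
        # Uniform noise in [-magnitude, +magnitude], applied multiplicatively so
        # the change lands in the last bits of a 64-bit value of any size.
        noise = rng.uniform(-magnitude, magnitude, size=t3d.shape)
        return t3d * (1.0 + noise)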

Specific recommendations for a port validation of CCSM:


Item                     Recommendation
length of test           5-30 days
field to examine         2-D surface temperature on the atmospheric grid
frequency of samples     every timestep
size of perturbation     smallest which can be felt on the original machine (1.0E-14)
error statistic          area-averaged RMS difference of the field


Note that the field being examined must be processed using the full machine precision. The field must be saved at full machine precision during the model history archival step, and the error statistic must be computed at full machine precision.
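As an illustration of the recommended error statistic, the area-averaged RMS difference of the 2-D surface temperature field could be computed along the lines of the sketch below, with all arithmetic carried out in 64-bit precision. The variable names and the cell-area weighting array are illustrative; in practice the fields would be read from the full-precision history files at every sampled timestep.

    # Sketch: area-weighted RMS difference of a 2-D field, in full precision.
    # Field and weight names are illustrative.
    import numpy as np

    def area_weighted_rms_diff(field_a: np.ndarray, field_b: np.ndarray,
                               area: np.ndarray) -> float:
        """RMS of (field_a - field_b) weighted by grid-cell area."""
        a = field_a.astype(np.float64)
        b = field_b.astype(np.float64)
        w = area.astype(np.float64)
        diff2 = (a - b) ** 2
        return float(np.sqrt((diff2 * w).sum() / w.sum()))

    # Usage sketch: compare the ported-versus-original difference against the
    # perturbed-versus-original difference, timestep by timestep.
    # for n, (orig, ported, perturbed) in enumerate(zip(orig_ts, ported_ts, perturbed_ts)):
    #     e_port = area_weighted_rms_diff(orig, ported, cell_area)
    #     e_pert = area_weighted_rms_diff(orig, perturbed, cell_area)
    #     print(n, e_port, e_pert)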

Component Model Testing, Unit-Testing, Code Reviews

Complex software such as the CCSM requires extensive testing in order to prevent model defects and to provide stable, solid models to work with. Layered testing has been shown to be the most effective approach for catching software defects. Layered testing refers to testing on different levels, from individual subroutines up to more complex systems, possibly with several layers of increasingly complex systems in between. Testing the individual component models stand-alone is an example of a system less complex than the entire CCSM. Unit-testing is the first layer: testing individual subroutines or modules. Unit-testing by itself will not catch defects that depend on relationships between different modules, but testing the entire system sometimes will not catch errors within an individual module. That is why using both extremes is useful in catching model defects. Section [*] covers testing for the entire CCSM modeling system; this section goes over testing of individual model components and unit-testing of subroutines and modules within those components.

Another way to help eliminate code errors is periodic code-reviews. Code-reviews can be implemented in many different fashions, but in general they involve having at least one person besides the author go through the written code and examine the implementation both for design and for errors. Jones-1986[3] states that ``the average defect-detection rate is only 25 percent for unit testing, 35 percent for function testing, and 45 percent for integration testing. In contrast, the average effectiveness of design and code inspections are 55 percent and 60 percent.'' McConnell-1993[4] also notes that, as well as being more effective in catching errors, code-reviews catch different types of errors than testing does. In addition, when developers realize their code will be reviewed, they tend to be more careful when programming.

Since the CCSM and the component models take substantial computer resources to run, catching errors early can cut computing costs significantly. In addition, as pointed out by McConnell-1993[4], development time decreases dramatically when formal quality assurance methods, including code-reviews, are implemented.

12.1 Component Model Testing

Each component model needs to develop and maintain its own suite of tests. It is recommended that each component model development team analyze the kinds of testing required for its model and write the results down in a formal testing plan. Creating automated tests that run a suite of standard tests can also be useful to ensure the models work and continue to work as needed. This is especially useful for making sure models continue to work on multiple platforms. McConnell-1996[5] refers to this as the daily ``build and smoke test'': you build and run your code every day to ensure it continues to work and doesn't just sit there and ``smoke''.

12.1.1 Designing Good Tests

In order to design a comprehensive testing plan we want to take advantage of the following types of tests.

unit-testing
Testing done on a single subroutine or module.
functional-testing
Testing for a given functional group of subroutines or modules, for example, testing model dynamics alone without the model physics.
system-testing
Testing done on the whole system.

12.1.2 Unit-tests

Unit-tests are a good way to flush out certain types of defects. Since unit-tests run on only one subroutine, they are easier to use, faster to build and run, allow more comprehensive testing over a wider range of input data, help document how to use the routine and check for valid answers, and allow faster testing of individual pieces. By building and maintaining unit-tests, the same tests can be run and used by other developers as part of a more comprehensive testing package. Without maintained unit-tests, developers often do less testing than required - since system tests are so much harder to do - or they have to ``hack'' together their own unit-tests for each change. By maintaining unit-tests we allow others to leverage off previous work and provide a framework for quickly doing extensive checking.

Good unit-tests will do the following:

1. Check that applicable requirements are met.
2. Exercise every line of code.
3. Check that the full range of possible input data works (i.e. if temperature is an input, check that values near both the minimum and maximum possible values work).
4. Boundary analysis: logical statements that refer to threshold states are checked to ensure they are correct.
5. Check for bad input data.
6. Test for scientific validity.

By analyzing the code to be tested, different test cases can be designed to ensure that all logical statements are exercised in the unit-test. Similarly, input can be designed to test logical threshold states (boundary analysis). Testing scientific validity is of course the most difficult, but testing states where the answer is known analytically can sometimes be useful, and the degree to which energy, heat, or mass is conserved for conservative processes can often be measured or ensured. These types of tests may be applied to more complex functional and system tests as well.
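To make these ideas concrete, the sketch below shows what a small unit-test covering input-range checking, boundary analysis, bad input data, and a known-answer scientific check might look like. CCSM components are written in Fortran; Python is used here purely for illustration, and the routine under test (saturation_vapor_pressure) and its valid input range are hypothetical.

    # Illustrative unit-test sketch.  The routine under test and its valid
    # temperature range are hypothetical.
    import math
    import unittest

    T_MIN, T_MAX = 150.0, 350.0          # hypothetical valid input range (K)

    def saturation_vapor_pressure(t_kelvin: float) -> float:
        """Hypothetical routine under test (simple Tetens-style formula, Pa)."""
        if not (T_MIN <= t_kelvin <= T_MAX):
            raise ValueError(f"temperature {t_kelvin} K outside valid range")
        t_c = t_kelvin - 273.15
        return 610.78 * math.exp(17.27 * t_c / (t_c + 237.3))

    class TestSaturationVaporPressure(unittest.TestCase):
        def test_full_input_range(self):
            # Values near both the minimum and maximum valid inputs must work.
            for t in (T_MIN, T_MIN + 0.1, 273.15, T_MAX - 0.1, T_MAX):
                self.assertGreater(saturation_vapor_pressure(t), 0.0)

        def test_boundary_analysis(self):
            # Threshold logic: on the boundary passes, just outside raises.
            saturation_vapor_pressure(T_MIN)
            with self.assertRaises(ValueError):
                saturation_vapor_pressure(T_MIN - 0.1)

        def test_bad_input_data(self):
            with self.assertRaises(ValueError):
                saturation_vapor_pressure(-10.0)

        def test_scientific_validity(self):
            # Known point: saturation vapor pressure at 0 degC is about 611 Pa.
            self.assertAlmostEqual(saturation_vapor_pressure(273.15), 610.78, delta=1.0)

    if __name__ == "__main__":
        unittest.main()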

12.1.3 Functional-tests

Functional tests take a given sub-set of the system and test it for a particular functionality. Scientific functional tests are common. For example, the Column Radiation Model (CRM) is used to check the radiation part of the atmospheric model. Quite often scientific functional tests are implemented as namelist options to component models; the atmospheric model, for instance, has a namelist item to test the dynamics alone by turning the physics off. Functional tests could also be created for infrastructure issues such as parallel decomposition or the handling of input and output data. Important functional tests should be maintained in CVS as separate modules that include the directories maintained for the main component model.

12.1.4 System-tests

System tests for a given component model need to ensure that the given model compiles, builds, and runs, and that it passes important model requirements. For example, most models require that restart runs give results that are bit-for-bit identical to continuous simulations. This requirement can be tested fairly easily.

12.1.5 CCSM Testing requirements and implementation details

Unit-tests for component models should meet the following minimum requirements.

1. Maintained in CVS, either with the rest of the model's source code or as a separate module that can be checked out. The module and/or directory name should be easily identifiable, such as `unit_testers'.
2. Have documentation on how to use them.
3. Check error conditions so that problems are reported with a clear error message.
4. Prompt for any input in a useful way (so you don't have to read the code to figure out that you have to enter something).
5. Have a Makefile associated with them. It may be useful to leverage off the main Makefile so that the compiler options are the same and so that platform dependencies don't have to be maintained twice.
6. In general, unit-tests should be run with as many compiler debug options on as possible (bounds checking, signal trapping, etc.).

Component model system tests should meet the following minimum requirements.

1. Ensure that the given model will compile, build and run on at least one production platform.
2. Ensure that the given model will work with the CCSM system on at least one production platform.

12.2 Code-Reviews

Formal reviews, in which the code is gone through line-by-line in groups or in pairs, have been shown to be one of the most effective ways to catch errors (McConnell-1993 [4]). As such, it is recommended that component model development teams create a strategy for regularly reviewing their code.

12.2.1 Strategies for Implementation of Code-Reviews

Code reviews can be implemented in different ways.

* Code librarian - Before code is checked into CVS it goes through a ``librarian'' who is responsible not only for testing and validation of the changes, but also for reviewing them for design and adherence to coding standards.
* Peer reviews - Before code is checked into CVS a peer developer reviews the changes.
* Pair programming - All code is developed with two people looking at the same screen (one of the practices of Extreme Programming - [2]).
* Formal configuration management - All code modifications are presented to a configuration management team who extensively reviews and tests changes and incorporates changes as dictated by project management.
* Formal group walk-through - Code is presented and gone through by an entire group.
* Formal individual walk-through - Different individuals are assigned and take responsibility to review different subroutines.

It is recommended that development teams provide a mechanism to review incremental changes and also hold formal walk-throughs of important pieces of code as a group. This serves two purposes: the design is communicated to a larger group, and the design and implementation are also reviewed by the entire group.

By adopting quality assurance techniques, CCSM model codes can be of higher quality, development time can be lowered, and machine time can be saved by decreasing errors.