Posted on Friday, August 07, 2009 at 6:06 AM
Here's a detab utility.
This should be bread-and-butter code. In fact it is tricky to ensure that a
malicious user cannot cause a buffer overrun. Hence the maxmemory() routine.
getline() is also difficult.
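To show where the overrun risk comes from, here is a minimal sketch of tab expansion into a fixed-size buffer. It is not the utility's actual code: the name expand_tabs, its signature and the 0 / -1 return convention are all assumptions made for illustration, and it handles a single line with evenly spaced tab stops only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
  Expand tabs in one line of text (hypothetical helper, not from the utility).
  src      - NUL-terminated input line, assumed to contain no newline
  out      - destination buffer
  outsize  - size of out in bytes, including room for the terminating NUL
  tabwidth - distance between tab stops, must be > 0
  Returns 0 on success, -1 if the expanded line does not fit.
  out is NUL-terminated in both cases, provided outsize > 0.
*/
int expand_tabs(const char *src, char *out, size_t outsize, size_t tabwidth)
{
  size_t col = 0;  /* output column, which is also the index into out */

  assert(src != NULL && out != NULL && tabwidth > 0);
  if(outsize == 0)
    return -1;

  for(; *src; src++)
  {
    /* a tab pads to the next stop, any other character takes one column */
    size_t need = (*src == '\t') ? tabwidth - col % tabwidth : 1;

    /* always keep one byte spare for the terminator */
    if(need >= outsize - col)
    {
      out[col] = '\0';
      return -1;
    }
    if(*src == '\t')
      memset(out + col, ' ', need);
    else
      out[col] = *src;
    col += need;
  }
  out[col] = '\0';
  return 0;
}
```

The point of the sketch is that the size check happens before every write, so a line stuffed with tabs is rejected rather than written past the end of the buffer.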
Utility to replace tabs in text files with spaces.
Can either do a blind replace or replace with user-specified tab stops
By Malcolm McLean, 2009
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <assert.h>
count the occurrences of ch in str
size_t chcount(const char *str, int ch)
size_t answer = 0;
if(*str++ == ch)
is a string an integer?
int integral(char *str)
x = strtol(str, &end, 10);
if(x > INT_MAX || x < INT_MIN)
check that an array of integers is ascending...
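The excerpt above cuts off most of the integer check, so here is a minimal sketch of how a strtol()-based test can be completed. The function name is_int_string is hypothetical and this is not necessarily how the original integral() is written. It assumes that leading whitespace is acceptable, because strtol() skips it.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
  Does str hold a complete decimal integer that fits in an int?
  (hypothetical helper, a sketch rather than the original integral())
  Returns 1 if so, 0 otherwise.
*/
int is_int_string(const char *str)
{
  char *end;
  long x;

  assert(str != NULL);

  errno = 0;
  x = strtol(str, &end, 10);
  if(end == str)        /* no digits were consumed */
    return 0;
  if(*end != '\0')      /* trailing characters after the number */
    return 0;
  if(errno == ERANGE)   /* value did not even fit in a long */
    return 0;
  if(x > INT_MAX || x < INT_MIN)
    return 0;
  return 1;
}
```

The errno test matters on platforms where long and int are the same width, since there the INT_MAX comparison alone can never fail.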
Posted on Saturday, January 10, 2009 at 5:48 AM
The big mainframe we use to do protein structure calculations has gone down, probably with a virus.
In the meantime I'm thinking about a new use for my Basic Interpreter (this is available from my website, I also publish it as a book). The idea is BattleBugs.
You have a game arena consisting of a terrain on which grass grows. You then have two teams of bugs. The bugs are controlled by little Basic programs. The bugs move to eat grass, which gives them energy. After a bug has reached a certain energy level, it can reproduce. When it runs out of energy, it dies.
Bugs are allowed to execute one line of Basic per move. When they move into other bugs they fight.
The skeleton of the game is working, however I need a designer for the fighting rules. The rules have got to have a sort of open-ended quality to them so that programs can become more and more ingenious. Also at present the game does not have enough visual appeal. Fighting rules should change the morphology of the bug...
Posted on Friday, November 28, 2008 at 9:55 AM
All the legacy protein energy potential code is in Fortran 77. However I use C for bread-and-butter work.
For instance I recently had to load in over 400 data files, a matrix of interactions for various amino acids (we represent the amino acid residue by virtual atom "balls", see previous posts). 400 paths are too many to hardcode or even put in a configuration file. However my naming system is regular - amino acid 1 letter code, the virtual atom or ball id (2 characters) and the same for its partner, to produce a 6 letter filename.
In C it is dead easy to generate the files. Simply say
sprintf(filename, "%s/%c%s%c%s", directory, aa[i], ball[i], aa[j], ball[j]);
fp = fopen(filename, "r");
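A slightly more defensive version of the same idea, as a sketch: snprintf() reports when the path would not fit, so a long directory name cannot overrun the buffer. The function name make_filename and its parameters are hypothetical, and the ball ids in the usage note are made-up placeholders, not real names from my data set.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
  Build "directory/" + aa1 + ball1 + aa2 + ball2 into out.
  (hypothetical helper; the 2-character ball id is the naming rule described above)
  Returns 0 on success, -1 if the result does not fit in outsize bytes.
*/
int make_filename(char *out, size_t outsize, const char *directory,
                  char aa1, const char *ball1, char aa2, const char *ball2)
{
  int n;

  assert(out != NULL && directory != NULL && ball1 != NULL && ball2 != NULL);
  assert(strlen(ball1) == 2 && strlen(ball2) == 2);

  n = snprintf(out, outsize, "%s/%c%s%c%s", directory, aa1, ball1, aa2, ball2);
  if(n < 0 || (size_t) n >= outsize)
    return -1;   /* encoding error or truncated path */
  return 0;
}
```

Called as make_filename(filename, sizeof filename, "data", 'A', "B1", 'G', "B2"), this would produce data/AB1GB2, which can then be passed to fopen() as before.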
In Fortran 77, believe it or not, there's no built-in way to get the length of a string's contents. Strings are just padded with spaces. However you can loop through the string, find the first space, and use that to concatenate.
do 100 i = 1, 70
if(plen .eq. 0 .and. path(i:i) .eq. ' ') plen = i-1...
Posted on Monday, October 13, 2008 at 8:29 AM
Finally got some decent models out of the computer.
My problem is processing time. Because of the way I build up my protein models, each amino acid residue adds a factor of 18 to the number of conformations. That is, I have 18^N conformations. Somehow these have to be searched to find the minimum energy structure.
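To put numbers on that, here is a small sketch that computes 18^N in 64-bit arithmetic. The function name conformation_count and the choice to return 0 on overflow are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define STATES_PER_RESIDUE 18  /* the factor of 18 per residue quoted above */

/*
  Number of conformations for a chain of n residues, 18^n.
  (hypothetical helper) Returns 0 if the count does not fit in 64 bits.
*/
uint64_t conformation_count(unsigned n)
{
  uint64_t total = 1;
  unsigned i;

  for(i = 0; i < n; i++)
  {
    if(total > UINT64_MAX / STATES_PER_RESIDUE)
      return 0;   /* the next multiplication would overflow */
    total *= STATES_PER_RESIDUE;
  }
  /* a product of positive factors is never zero, so 0 is free to mean overflow */
  assert(total != 0);
  return total;
}
```

Ten residues already give 3,570,467,226,624 conformations, and by sixteen residues the count no longer fits in a 64-bit integer, which is why exhaustive search is out of the question.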
However a molecule doesn't exist in only one conformation, except at zero Kelvin or when crystallised. It moves about in the solvent. We hope to capture this. What we do is run it in a simulated temperature bath, until it has found the minimum conformation. Then we run it for as long again, taking a thousand samples from the run. We then superimpose these to gain a view of the conformation.
This leads to the question, how do we know it has found the minimum? What we do is run three independent simulations. If all three find the same minimum value, we assume that the genuine minimum has been found.
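The agreement test can be sketched in a few lines. The function name runs_agree and the tolerance parameter are assumptions: energies are floating point, so "the same minimum value" in practice has to mean "within some small tolerance", and what tolerance is sensible depends on the forcefield.

```c
#include <assert.h>
#include <stddef.h>

/*
  Do three independent runs agree on the minimum energy?
  (hypothetical helper; tolerance is an assumed parameter)
  Returns 1 if the spread between the lowest and highest minimum
  is no more than tolerance, 0 otherwise.
*/
int runs_agree(const double energy[3], double tolerance)
{
  double lo, hi;
  int i;

  assert(energy != NULL && tolerance >= 0.0);

  lo = hi = energy[0];
  for(i = 1; i < 3; i++)
  {
    if(energy[i] < lo)
      lo = energy[i];
    if(energy[i] > hi)
      hi = energy[i];
  }
  return hi - lo <= tolerance;
}
```

Agreement between runs is only evidence, not proof: three runs can all get stuck in the same local minimum.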
Here are the results for a bovine rhodopsin segment (protein data bank code 1eds)...
Posted on Friday, September 19, 2008 at 9:56 AM
We use several software tools in our research. These include word processors, graphics packages, slide show presentation packages, tools for manipulating data, and of course programming languages.
There are basically two ways of designing a tool. You can point and click, or you can use some sort of scripting. Programs on the Beowulf cluster are entirely scripting - it has no graphical screen round which to build a GUI. On the desktop, we've got a choice.
Excel is a predominantly GUI-based statistics package. It is designed as a spreadsheet. You enter numbers into cells, then select manipulations on them from a menu. Finally you create charts from the data, again largely mouse driven. However it is not entirely a GUI - it is possible to use macros that represent a substantial chunk of code. The other statistics package we use is R. This is a "little language" package. You set up the data as csv files. You then read the data into an R data frame, which you manipulate using the R scripting language. It is in fact a full programming language, though it has features that make statistics particularly easy, like dedicated commands for drawing graphs...
Posted on Wednesday, September 10, 2008 at 9:52 AM
As I mentioned in the last post, we've added rotamers to the protein model.
This threw up a problem. Proteins have torsional degrees of freedom along the backbone. However the angles are generally confined to one or two areas of Ramachandran space - given by plotting the phi backbone angle against the psi backbone angle. The other torsion angle, the omega angle, is usually planar - it moves but only very slightly.
What we do is reduce the backbone angles to 6 points on the Ramachandran plot.
So now we can describe the protein backbone as a string of integers, like this
The string of 5s is typical and indicates an alpha helix - point 5 is in the helical area of the plot.
However when I add the rotamers we also need a number between 1 and 3 for each sidechain. The description becomes
Posted on Wednesday, August 27, 2008 at 9:04 AM
Here's the chap who invented the term "soft coding".
I've added my rotamer set to the proteins. Now I've got to run the forcefield over them to see whether adding sidechain degrees of freedom has improved our ability to discriminate folds. It's not the end of the world if it hasn't - some tweaking of the forcefield to accommodate the extra accuracy might well be required.
However each run needs an input file. That input file references a backbone file with the backbone degrees of freedom in it, a sequence file with the sequence data, an output file for the protein atom co-ordinates, a configuration file with the start configuration, a balls file with the original sidechains, a new balls file I've just created with the rotamers. All in all it's a huge edifice. Then the input files need to be submitted to the cluster job scheduling code. So this necessitates a shell script file for each job, and an output file to hold diagnostic prints...
Posted on Thursday, August 21, 2008 at 5:34 AM
Sometimes you just wish you could program visually.
Proteins consist of a long chain of amino acids, or "residues", connected together in what we call the "backbone". All the backbone atoms for each residue are identical. The residues differ in the sidechain. Glycine, the simplest, has a single hydrogen for its sidechain. Tryptophan, the most complex, has 10 heavy atoms arranged in two rings.
To a first approximation, all the bond lengths and bond angles are invariant. The protein has dihedral or torsional degrees of freedom. What I am concentrating on is the sidechain. Each residue has between 0 and 4 'chi' angles, torsion angles that describe the orientation of four connected atoms.
For fundamental reasons, atoms like to go into the spaces between other atoms. So the chi one angle has three major orientations, +60 degrees, -60 degrees, and trans, or 180 degrees. These slot the sidechain atoms into the spaces between the other atoms in the backbone. So the idea is to reduce the description of the protein sidechain to one number between 1 and 3...
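As a sketch of that reduction, the function below snaps a chi one angle to the nearest of the three orientations. The name chi1_state and the numbering (1 for +60, 2 for -60, 3 for trans) are assumptions made for the example, as is the requirement that the angle is given in degrees between -360 and 360.

```c
#include <assert.h>

/* smallest absolute difference between two angles in degrees, 0 to 180 */
static double angle_diff(double a, double b)
{
  double d = a - b;

  while(d > 180.0)
    d -= 360.0;
  while(d < -180.0)
    d += 360.0;
  return d < 0.0 ? -d : d;
}

/*
  Reduce a chi one angle to a state number between 1 and 3.
  (hypothetical helper; the numbering 1 = +60, 2 = -60, 3 = trans is assumed)
*/
int chi1_state(double chi1_degrees)
{
  static const double centre[3] = { 60.0, -60.0, 180.0 };
  int best = 0;
  int i;

  /* also rejects NaN, which keeps the wrap-around loops bounded */
  assert(chi1_degrees >= -360.0 && chi1_degrees <= 360.0);

  for(i = 1; i < 3; i++)
    if(angle_diff(chi1_degrees, centre[i]) < angle_diff(chi1_degrees, centre[best]))
      best = i;
  return best + 1;
}
```

The wrap-around in angle_diff() is the part that is easy to get wrong: -175 degrees is only 5 degrees away from trans, not 355.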