# 1 Overview and Background

Many of the assignments in this course will introduce you to topics in ...

###### Question:

Tyrosine Tyr, Valine Val

Each line that starts with the word ATOM represents a unique atom in the protein. So do the lines that start with HETATM, but these are atoms in water molecules surrounding the particular protein when it was crystallized, and we want to ignore them for now. Lines that start with ATOM contain the three-letter code for the amino acid of which that atom is a part. For example, an atom line for an atom in a phenylalanine molecule looks like this:

    ATOM   3814  N   PHE J  24     -17.763  -7.816 -12.014  1.00  0.00

The three-letter code is always in uppercase.

The exact form of a PDB file is standardized, and the standard is revised every few years. The most recent standard that describes the format can be found at:

https://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html

On that page you can scroll down to the Coordinate Section and find the link to the ATOM record format. There you will see that the line for an atom is defined by the following table:

| COLUMNS | DATA TYPE | FIELD | DEFINITION |
|---|---|---|---|
| 1 - 6 | Record name | "ATOM  " | |
| 7 - 11 | Integer | serial | Atom serial number. |
| 13 - 16 | Atom | name | Atom name. |
| 17 | Character | altLoc | Alternate location indicator. |
| 18 - 20 | Residue name | resName | Residue name. |
| 22 | Character | chainID | Chain identifier. |
| 23 - 26 | Integer | resSeq | Residue sequence number. |
| 27 | AChar | iCode | Code for insertion of residues. |
| 31 - 38 | Real(8.3) | x | Orthogonal coordinates for X in Angstroms. |
| 39 - 46 | Real(8.3) | y | Orthogonal coordinates for Y in Angstroms. |
| 47 - 54 | Real(8.3) | z | Orthogonal coordinates for Z in Angstroms. |
| 55 - 60 | Real(6.2) | occupancy | Occupancy. |
| 61 - 66 | Real(6.2) | tempFactor | Temperature factor. |
| 77 - 78 | LString(2) | element | Element symbol, right-justified. |
| 79 - 80 | LString(2) | charge | Charge on the atom. |
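Because the ATOM record uses fixed columns, the residue name can be pulled out by character position rather than by splitting on whitespace. The following is a minimal sketch, assuming the PDB text arrives on standard input; the function name `residue_names` and the HETATM line in the sample are invented for illustration and are not part of the assignment.

```shell
#!/bin/sh
# Hypothetical helper: print the residue name (columns 18-20)
# of every ATOM record read from standard input.
residue_names() {
    # '^ATOM' matches lines whose record name is ATOM; HETATM lines
    # start with a different word and are therefore skipped.
    # cut -c18-20 then keeps only the fixed columns that hold resName.
    grep '^ATOM' | cut -c18-20
}

# Sample input: the ATOM line from the text plus a made-up HETATM record.
sample='ATOM   3814  N   PHE J  24     -17.763  -7.816 -12.014  1.00  0.00
HETATM 3900  O   HOH J 101       1.000   2.000   3.000  1.00  0.00'

printf '%s\n' "$sample" | residue_names
```

On the sample this prints `PHE` only. Cutting by column is the safer choice here because adjacent fields in a PDB line are not guaranteed to be separated by spaces.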
2. A DNA string is a sequence of the letters a, c, g, and t in any order, whose length is a multiple of 3. For example, aacgtttgtaaccagaactgt is a DNA string of length 21. Each letter is called a base, and a sequence of three consecutive letters is called a codon. For example, in the preceding string, the codons are aac, gtt, tgt, aac, cag, aac, and tgt.

   A DNA string can be hundreds of thousands of codons long, even millions of codons long, so it is hard to count them by hand. It would be useful to have a simple utility script that could count the number of occurrences of a specific codon in such a string. For instance, in the example string above, aac occurs three times and tgt occurs twice. For simplicity, we always assume that we look for codons at positions that are multiples of three in the file, i.e., starting at positions 0, 3, 6, 9, 12, and so on.

   Write a script named countcodon that is given two arguments on the command line. The first is a lowercase three-letter codon string such as aaa or cgt. The second is the name of a file containing a DNA string with no newline characters or white space characters of any kind except at the end after the sequence of bases; it is just a sequence of the letters a, c, g, and t. The script will output a single number, which is the number of occurrences of the given codon in the given file. It should output nothing but that number. If it finds no occurrences, it should output 0. For example, if the above string is in a file named dnafile, then it should work like this:

       $ countcodon ttt dnafile
       0
       $ countcodon aac dnafile
       3
       $ countcodon ccc dnafile
       0

   The script should check that it has two arguments and exit with a usage message if it does not. It should make sure that it can open the file for reading and print a usage statement if it cannot. It does not have to check that the string is actually a codon, but it should check that the file contains nothing but the bases and a possible terminating newline character.

   Hint: You will not be able to solve this problem using grep alone. There are a number of commands that might be useful, such as sort, cut, fold, and uniq. One of these makes it very easy. Find the right one.
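One possible approach to the counting step is sketched below, assuming that `fold` is the command the hint has in mind: it breaks the sequence into lines of three characters, so each aligned codon sits on its own line and whole-line matches can be counted. The function name `count_codon`, the wording of the messages, and the use of `mktemp` in the demo are assumptions for illustration. The argument-count check that the assignment requires is left out.

```shell
#!/bin/sh
# Sketch only: count_codon is a hypothetical helper, not the full countcodon script.
count_codon() {
    codon=$1
    file=$2
    # Refuse files that cannot be read.
    [ -r "$file" ] || { echo "usage: countcodon codon file" >&2; return 1; }
    # grep examines each line without its trailing newline, so this
    # rejects any character other than the four bases.
    if grep -q '[^acgt]' "$file"; then
        echo "countcodon: $file contains characters other than a, c, g, t" >&2
        return 1
    fi
    # fold -w 3 puts one codon per line, aligned at positions 0, 3, 6, ...
    # grep -c -x -F counts lines equal to the codon as a fixed string.
    # grep exits non-zero when the count is 0, hence the '|| true'.
    fold -w 3 "$file" | grep -c -x -F -- "$codon" || true
}

# Demo on the example string from the text, stored in a temporary file.
tmp=$(mktemp)
printf 'aacgtttgtaaccagaactgt\n' > "$tmp"
count_codon aac "$tmp"
count_codon ttt "$tmp"
rm -f "$tmp"
```

The demo prints 3 and then 0. The second result shows why alignment matters: the letters ttt do appear in the string, but they straddle the codons gtt and tgt, so they are not counted.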