1 Overview and Background Many of the assignments in this course will introduce you to topics in ...

Question:

1 Overview and Background

Many of the assignments in this course will introduce you to topics in computational biology. You do not need to know anything about biology to do these assignments other than what is contained in the description itself. The objective of each assignment is for you to acquire particular skills or knowledge, and the choice of topic is independent of that objective. Sometimes the topics will be related to computational problems in biology, chemistry, or physics, and sometimes not.

This particular assignment is an exercise in extracting information from files that are too big for mere mortals to process manually. The real power of computers is that they can do simple things extremely quickly, meaning millions, perhaps billions, of times per second, much faster than people can.

There is a kind of file called a PDB file that contains structural information about proteins, nucleic acids, and other macromolecules. A macromolecule is just a big molecule. Macro means big. PDB is an acronym for the Protein Data Bank. PDB files can be downloaded from the Protein Data Bank at http://www.rcsb.org/pdb/home/home.do.

A PDB file contains information obtained experimentally, usually by either X-ray crystallography, NMR spectroscopy, or cryo-electron microscopy. (You do not need to know this to do the assignment, but it is important for those who intend to pursue a bioinformatics concentration.) These files completely characterize the molecule, providing, for example, the three-dimensional positions of every single atom in the file, where the bonds are, which amino acids it contains if it is a protein (or nucleotides if DNA or RNA), and much more. The information is not necessarily exact. Associated with some of this information are confidence values that indicate how accurate it is.

A PDB file is a plain text file; you can view its contents in any text editor, such as gedit or nedit, or with commands such as cat, more, and less.
Each line in a PDB file begins with a word that characterizes what type of line it is. These individual lines are called records. For example, some lines start with the word REMARK, which means they are comments about the file itself, or about the experiment through which the data was collected. Some lines start with SOURCE, and they have information about the source of the data in the file. Some lines start with words such as MODEL, CONECT, ATOM, and HETATM. Each has a different meaning in the file.

Take a look at some of the PDB files in the directory /data/biocs/b/student.accounts/cs132/data/pdb.files before you read any further, so that you can see what they contain. I suggest picking files that are small, meaning smaller than a megabyte in size.

Proteins are chains of amino acids. Amino acids are the building blocks of proteins, which may contain many thousands of them. Amino acids are organic compounds that carry out many important bodily functions, such as giving cells their structure. They are also instrumental in the transport and the storage of nutrients, and in the functioning of organs, glands, tendons, and arteries. Amino acids have names such as alanine, glycine, tyrosine, and tryptophan. They are also known more succinctly by unique three-letter codes. The table below lists the twenty standard amino acids with their three-letter codes.

(For a summary of what the experimental methods used to obtain PDB data are, see http://www.pdb.org/pdb/static.do?p=education_discussion/Looking-at-Structures/methods.html.)
    ...
    Tyrosine    Tyr
    Valine      Val

Each line that starts with the word ATOM represents a unique atom in the protein. So do the lines that start with HETATM, but these are atoms in water molecules surrounding the particular protein when it was crystallized, and we want to ignore them for now. Lines that start with ATOM contain the three-letter code for the amino acid of which that atom is a part. For example, an atom line for an atom in a phenylalanine molecule looks like this:

    ATOM   3814  N   PHE J  24     -17.763  -7.816 -12.014  1.00  0.00

The three-letter code is always in uppercase.

The exact form of a PDB file is standardized. The standard is revised every few years. The most recent standard that describes the format can be found at:

    https://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html

On that page you can scroll down to the Coordinate Section and find the link to the ATOM record format. There you will see that the line for an atom is defined by the following table:

    COLUMNS   DATA TYPE      FIELD        DEFINITION
    ------------------------------------------------------------------------
     1 -  6   Record name    "ATOM  "
     7 - 11   Integer        serial       Atom serial number.
    13 - 16   Atom           name         Atom name.
    17        Character      altLoc       Alternate location indicator.
    18 - 20   Residue name   resName      Residue name.
    22        Character      chainID      Chain identifier.
    23 - 26   Integer        resSeq       Residue sequence number.
    27        AChar          iCode        Code for insertion of residues.
    31 - 38   Real(8.3)      x            Orthogonal coordinates for X in Angstroms.
    39 - 46   Real(8.3)      y            Orthogonal coordinates for Y in Angstroms.
    47 - 54   Real(8.3)      z            Orthogonal coordinates for Z in Angstroms.
    55 - 60   Real(6.2)      occupancy    Occupancy.
    61 - 66   Real(6.2)      tempFactor   Temperature factor.
    77 - 78   LString(2)     element      Element symbol, right-justified.
    79 - 80   LString(2)     charge       Charge on the atom.
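Because the fields sit in fixed columns, they can be pulled out with cut. The following is a minimal sketch, not part of the assignment text: the file sample_atoms.pdb and its contents are made up for illustration, and the function names record_counts and atoms_per_residue are hypothetical.

```shell
#!/bin/bash
# Hypothetical miniature file; the atoms and coordinates are made up.
cat > sample_atoms.pdb <<'EOF'
REMARK   1 MADE-UP SAMPLE, NOT A REAL ENTRY
ATOM      1  N   PHE A   1     -17.763  -7.816 -12.014  1.00  0.00
ATOM      2  CA  PHE A   1     -16.500  -7.100 -11.900  1.00  0.00
ATOM      3  N   GLY A   2     -15.000  -6.000 -10.000  1.00  0.00
HETATM    4  O   HOH A 101      -1.000   2.000   3.000  1.00  0.00
EOF

# The record name occupies columns 1-6; tally how often each one occurs.
record_counts() {
    cut -c1-6 "$1" | sort | uniq -c
}

# Keep only ATOM records, cut out the residue name (columns 18-20),
# and count how many atoms belong to each amino acid.
atoms_per_residue() {
    grep '^ATOM' "$1" | cut -c18-20 | sort | uniq -c
}

record_counts sample_atoms.pdb
atoms_per_residue sample_atoms.pdb
```

On this sample, the second function reports two atoms for PHE and one for GLY; the HETATM line is skipped because it does not start with ATOM.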
2. A DNA string is a sequence of the letters a, c, g, and t in any order, whose length is a multiple of 3. For example, aacgtttgtaaccagaactgt is a DNA string of length 21. Each letter is called a base, and a sequence of three consecutive letters is called a codon. For example, in the preceding string, the codons are aac, gtt, tgt, aac, cag, aac, and tgt. A DNA string can be hundreds of thousands of codons long, even millions of codons long, so it is hard to count them by hand. It would be useful to have a simple utility script that could count the number of occurrences of a specific codon in such a string. For instance, in the example string above, aac occurs three times and tgt occurs twice. For simplicity, we always assume that we look for codons at positions that are multiples of three in the file, i.e., starting at positions 0, 3, 6, 9, 12, and so on.

Write a script named countcodon that is given two arguments on the command line. The first is a lowercase three-letter codon string such as aaa or cgt. The second is the name of a file containing a DNA string with no newline characters or white space characters of any kind except at the end after the sequence of bases; it is just a sequence of the letters a, c, g, and t. The script will output a single number, which is the number of occurrences of the given codon in the given file. It should output nothing but that number. If it finds no occurrences, it should output 0. For example, if the above string is in a file named dnafile, then it should work like this:

    $ countcodon ttt dnafile
    0
    $ countcodon aac dnafile
    3
    $ countcodon ccc dnafile
    0

The script should check that it has two arguments and exit with a usage message if it does not. It should make sure that it can open the file for reading and print a usage statement if it cannot. It does not have to check that the string is actually a codon, but it should check that the file contains nothing but the bases and a possible terminating newline character.

Hint: You will not be able to solve this problem using grep alone. There are a number of commands that might be useful, such as sort, cut, fold, and uniq. One of these makes it very easy. Find the right one.

Answers

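    #!/bin/bash
    # Usage: countcodon <codon> <dnafile>
    # The first argument is the codon; the second is the file holding the DNA string.
    codon=$1
    file=$2

    # fold -w3 puts one codon per line, starting at offsets 0, 3, 6, ...;
    # grep -c -x counts the lines that match the codon exactly.
    count=$(fold -w3 "${file}" | grep -c -x -- "${codon}")

    echo "${count}"

Splitting the string with fold ensures that only codons at positions that are multiples of three are counted. A plain grep -o on the whole string would also match ttt where it straddles the codons gtt and tgt.

OUTPUT :

    $ ./countcodon ttt dnafile
    0
    $ ./countcodon aac dnafile
    3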
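The assignment also asks for an argument-count check, a readability check, and a check of the file's contents. The sketch below shows one way these might be added. It is written as a shell function so it can be tried from a single file; the function layout and the wording of the messages are assumptions, not something the assignment prescribes.

```shell
#!/bin/bash
# Sketch of countcodon with the requested checks.
# Message wording and the function form are assumptions for illustration.
countcodon() {
    if [ "$#" -ne 2 ]; then
        echo "usage: countcodon <codon> <dnafile>" >&2
        return 1
    fi
    local codon=$1 file=$2

    if [ ! -r "$file" ]; then
        echo "usage: countcodon <codon> <dnafile> (cannot read $file)" >&2
        return 1
    fi

    # Reject the file if any line contains a character other than a, c, g, or t.
    # Note: this simple check does not detect newlines in the middle of the sequence.
    if grep -q '[^acgt]' "$file"; then
        echo "countcodon: $file contains characters other than a, c, g, t" >&2
        return 1
    fi

    # fold -w3 puts one codon on each line; grep -c -x counts exact matches.
    fold -w3 "$file" | grep -c -x -- "$codon"
    return 0
}

# Try it on the example string from the question.
printf 'aacgtttgtaaccagaactgt\n' > dnafile
countcodon aac dnafile   # prints 3
countcodon ttt dnafile   # prints 0
```

The explicit return 0 is there because grep exits with a nonzero status when the count is zero, even though it still prints 0.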

(Alternative pipeline using sed and awk. It tallies every codon in a FASTA file named input.fa, skipping header lines that contain ">".)

    grep -v ">"  < input.fa | tr -d '\n' | sed 's/\([ATGCatgcNn]\{3,3\}\)/\1#/g' | tr "#" "\n" | awk '(length($1)==3)' | sort | uniq -c
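The same tally can be sketched more simply with fold, which replaces the sed and tr steps that insert and split on "#". The file sample.fa and the function name codon_table are hypothetical, made up so the sketch is self-contained.

```shell
#!/bin/bash
# Hypothetical FASTA file with one header line and two sequence lines.
cat > sample.fa <<'EOF'
>seq1 made-up header
aacgtttgt
aaccagaactgt
EOF

# Drop header lines, join the sequence lines, split into codons with fold,
# keep only complete three-letter codons, and tally each distinct codon.
codon_table() {
    grep -v '>' "$1" | tr -d '\n' | fold -w3 | awk 'length($0) == 3' | sort | uniq -c
}

codon_table sample.fa
```

On this sample the table lists aac three times, tgt twice, and cag and gtt once each.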
