I399 Bioinformatics and Cyberinfrastructure project - 1000 Genomes protein analysis


Recent improvements in sequencing technology ("next-gen" sequencing platforms) have sharply reduced the cost of sequencing. The 1000 Genomes Project is the first project to sequence the genomes of a large number of people in order to provide a comprehensive resource on human genetic variation. Its goal is to find most genetic variants that have frequencies of at least 1% in the populations studied. While recent work has focused on sequence alignment and nucleotide matching, there is a strong need for protein sequencing and comparison across the 697 currently sequenced datasets. This project will look at protein synthesis at a low level in order to identify differences between members of the population, which we hope will lead to a better understanding of how proteins differ between individuals.

Intellectual Merit

This project will study the protein biosynthesis processes across the 697 currently sequenced members of the 1000 Genomes Project [http://www.1000genomes.org/]. This work will identify the computational pipeline needed for a direct comparison between protein peptide sequences within the currently sequenced data. The pipeline will employ current cloud computing technologies to streamline development and minimize computation time while accommodating a wide variety of workloads.

Broader Impact

This knowledge discovery process and pipeline may be of great interest to the 1000 Genomes Project members who are currently looking to employ these processes en masse. Furthermore, undergraduate informatics students will be immersed in the project from the beginning, giving them an in-depth view of the research process of knowledge discovery.

Use of FutureGrid

We intend to create and deploy many virtual machines to carry out the protein peptide biosynthesis analysis, and to use one or more storage devices to hold the large datasets in-house, either in a SAN or in S3 or EBS storage.

Scale Of Use

Initially the scale of use will be minimal, as the goal of the I399 project is only to outline this process and perform some elementary analysis. If demand for this system increases, computation may need to ramp up drastically.


The 1000 Genomes Project is the first project of its kind to take DNA information from a large number of participants and release it publicly in order to provide a detailed resource for genome sequencing. It is a collaboration of numerous international research teams, with the end goal of finding most genetic variants that have frequencies of at least 1% in the populations studied. Participants were enlisted from regions across the world in order to capture regional and cultural differences and to avoid relying on a homogeneous population.

We used the Japanese and European data sets for our project because there appeared to be a clear contrast between the two groups. The difference seemed large enough for the results to be interesting and interpretable, and to indicate whether a larger project could stem from this one (for example, future research on protein discrepancies between other groups). Data size relative to the time available for the project was also a concern. Processing a single person's genome is not especially time-consuming, but we used the sequenced genomes of ninety European participants and one hundred and five Japanese participants; the more samples we have from each region, the more accurate our analysis will be, since we are trying to interpret our results with respect to lineage in particular. Had the comparison been done sequentially, it would have taken years to compute even for this subset of the data. As such, a distributed architecture was needed.

FutureGrid provided an ideal testing platform for building the environment needed for this large-scale data analysis. Using the Eucalyptus cloud system available on India, a specialized virtual machine (VM) was constructed from a minimal Ubuntu-based image together with the necessary bioinformatics toolkits. Multiple VM instances, each with this specialized environment, were then instantiated en masse. Each VM collected the necessary sequence data from the 1000 Genomes Project's data repositories at the NCBI and EBI and reformatted the input data to match the input expected by the BioPerl tools. Each VM then ran the tools, computed where each genetic mutation occurs between the two test groups, and sent the resulting output files back to a central location.
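As a rough illustration of how the per-participant work described above might be divided among VM instances, the following sketch deals sample IDs out round-robin. The function name, the sample IDs, and the choice of eight workers are all hypothetical; the text does not state how many VMs were used or how samples were actually assigned.

```python
from typing import Dict, List


def assign_samples(sample_ids: List[str], n_workers: int) -> Dict[int, List[str]]:
    """Deal sample IDs out round-robin so worker loads differ by at most one."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    plan: Dict[int, List[str]] = {w: [] for w in range(n_workers)}
    for i, sample_id in enumerate(sample_ids):
        plan[i % n_workers].append(sample_id)
    return plan


# Placeholder IDs (not real 1000 Genomes identifiers): 90 European and
# 105 Japanese participants, matching the group sizes given in the text.
samples = [f"EUR_{i:03d}" for i in range(90)] + [f"JPN_{i:03d}" for i in range(105)]

# Eight workers is an arbitrary, assumed number of VM instances.
plan = assign_samples(samples, n_workers=8)
for worker, batch in plan.items():
    print(f"VM {worker}: {len(batch)} samples")
```

Round-robin assignment is only one option; it keeps the batches nearly equal in size, which matters when each participant takes roughly the same time to process.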

We then took the list of mutation consequences from the output of the program and applied it to the data provided by the 1000 Genomes Project. A portion of that data lists, for every individual in the populations, which of those mutations are present in their genome. By dividing the count of negative mutations by the total number of mutations, we could determine the percentage of mutations that were detrimental. To determine significance, we decided as a group to use Student's t-test, a statistical test used to compare the means of two samples. In other words, it is used to determine whether two sets of data can be considered different in a statistically meaningful way. In this case, we compared the European participants with the Japanese participants. The result of a Student's t-test is a p-value; if the p-value is below .05, then the difference between Japanese and European functional mutations is statistically significant. After running the t-test we got a result of p=3.08848x10^-0. As the results show, this is far smaller than .05, which indicates that the Japanese participants have a significantly higher rate of functional protein mutations within their genetic makeup.
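The per-individual percentage described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the consequence labels treated as detrimental, the function names, and the toy input are all assumptions, since the text does not list which consequence types were counted.

```python
import statistics
from typing import Dict, List

# Hypothetical labels for consequences treated as detrimental; the
# original analysis does not say which consequence types were counted.
DETRIMENTAL = {"missense", "nonsense", "frameshift"}


def percent_detrimental(consequences: List[str]) -> float:
    """Percentage of one individual's mutations that are detrimental."""
    if not consequences:
        raise ValueError("individual has no mutations recorded")
    hits = sum(1 for c in consequences if c in DETRIMENTAL)
    return 100.0 * hits / len(consequences)


def summarize(percentages: List[float]) -> Dict[str, float]:
    """Average, sample standard deviation, minimum and maximum of a group."""
    return {
        "average": statistics.mean(percentages),
        "stdev": statistics.stdev(percentages),
        "minimum": min(percentages),
        "maximum": max(percentages),
    }


# Toy input: three made-up individuals with made-up consequence lists.
individuals = {
    "person_a": ["synonymous"] * 9 + ["missense"],
    "person_b": ["synonymous"] * 8 + ["missense", "nonsense"],
    "person_c": ["synonymous"] * 17 + ["frameshift"] * 3,
}
percentages = [percent_detrimental(c) for c in individuals.values()]
summary = summarize(percentages)
print(summary)
```

Running the same two functions once per population would yield one summary per group, which is the shape of comparison the t-test then operates on.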

                      European    Japanese
Average               10.3%       11.4%
Standard Deviation    1.02%       1.31%
Minimum               6.91%       7.98%
Maximum               12.4%       14.8%
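As a sanity check, a Student's t statistic can be recomputed from the summary values in the table alone. The sketch below makes several assumptions: that the group sizes are the 90 European and 105 Japanese participants mentioned earlier, that the standard deviations are sample standard deviations, and that the equal-variance (pooled) form of the test was used. Because the table values are rounded, the result will not exactly match a test run on the raw data. The p-value line uses a normal approximation to the t distribution, so it gives only a rough order of magnitude.

```python
import math


def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Student's two-sample t statistic (pooled variance) from summary stats.

    Returns (t, degrees_of_freedom); t is positive when mean_b > mean_a.
    """
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    std_err = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean_b - mean_a) / std_err, df


# Means and standard deviations (in percent) from the table; the group
# sizes 90 and 105 are assumed from the participant counts in the text.
t_stat, df = pooled_t(10.3, 1.02, 90, 11.4, 1.31, 105)

# Two-sided p-value under a normal approximation (an assumption made to
# stay within the standard library; not an exact t-distribution p-value).
p_approx = math.erfc(abs(t_stat) / math.sqrt(2))

print(f"t = {t_stat:.2f} with {df} degrees of freedom")
print(f"approximate two-sided p = {p_approx:.2e}")
```

For comparison, the two-sided critical value at the .05 level for this many degrees of freedom is about 1.97, so a t statistic well above that is consistent with the conclusion that the two group means differ.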

There are several possible reasons why the rate of functional mutations is higher in the Japanese population, all dealing with evolution and population genetics. The first possibility is that there is less migration into the Japanese population. Japan is a relatively secluded island nation and historically has not been subject to much contact or interbreeding with other populations of the world. Breeding within a population over the course of time can lead to the accumulation of deleterious mutations. The second possibility is that without strong selection there would be less pressure for those detrimental mutations to be removed from the gene pool. If people can still survive and reproduce even though they carry the mutation, then it will persist in the population through future generations. The third possibility is that fixation has not yet occurred at these sites within the Japanese population. Fixation is a concept within population genetics which refers to one variant at a site reaching a frequency of 100%, so that competing variants, such as these mutations, are lost from the gene pool over time. Several factors influence fixation, such as population size and heritability. The process of fixation can take a very long time, which can help to explain why these mutations are still present in the gene pool.
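To illustrate why a variant can linger in a gene pool for many generations before it is either lost or fixed, the following sketch simulates random genetic drift under a neutral Wright-Fisher model. This model, the function name, and all parameter values are assumptions chosen for illustration; the project did not run such a simulation, and real populations are also shaped by selection and migration.

```python
import random


def generations_until_resolved(pop_size, start_freq, rng, max_generations=100_000):
    """Simulate neutral drift of one variant in a diploid population.

    Returns (outcome, generations), where outcome is "fixed" (frequency 1),
    "lost" (frequency 0), or "unresolved" if max_generations is reached.
    """
    copies_total = 2 * pop_size  # two gene copies per diploid individual
    count = round(start_freq * copies_total)
    if count <= 0:
        return "lost", 0
    if count >= copies_total:
        return "fixed", 0
    for generation in range(1, max_generations + 1):
        freq = count / copies_total
        # Each copy in the next generation is drawn independently from
        # the current gene pool (binomial sampling).
        count = sum(1 for _ in range(copies_total) if rng.random() < freq)
        if count == 0:
            return "lost", generation
        if count == copies_total:
            return "fixed", generation
    return "unresolved", max_generations


rng = random.Random(42)  # fixed seed so repeated runs give the same numbers
for pop_size in (10, 50):
    times = [generations_until_resolved(pop_size, 0.5, rng)[1] for _ in range(20)]
    print(f"population {pop_size}: mean generations to resolve = "
          f"{sum(times) / len(times):.1f}")
```

In this kind of model, larger populations tend to take longer to resolve, which is one way to picture how a mutation can remain present for a long time without being removed.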

Analysis of only two populations opens the door to comparing several different populations at the same time. The data are available for Europeans, East Asians, West Africans, and Americans. A further step beyond our plan would be to look at the diseases or protein disorders which commonly affect each sub-population. FutureGrid has provided us with the tools necessary to evaluate these complex problems within the 1000 Genomes Project and paved the way for a new computational environment that would not otherwise have been available.

The research group has also made an educational video describing the process and work involved in the project, available at http://www.youtube.com/watch?v=nV6UyJw6oZc&feature=youtu.be

Andrew Younge
Indiana University

Project Members

Alex Wu
Blaine Rothrock
Fuxiao Xin
Jonathan Gold
Kym Pagel
Ryan Konz
Tony Gao
Zixuan Li

