Sequence Alignment for Phylogenetic Analysis: Difference between revisions

From Bridges Lab Protocols
Added PhyloBayes information
Added details about BLAST search
 
== Locate Sequences and Generate FASTA File ==
* The easiest way to find sequences is to start with a seed sequence, then run BLAST searches restricted to RefSeq and the species of interest.
* To find a seed sequence, start with NCBI Gene and find the first RefSeq mRNA (its accession should start with NM_), then click on it and find the protein (its accession should start with NP_).
* Paste that into your FASTA file (see next section) and name accordingly.
* Paste that sequence or its NP id into [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome NCBI Protein Blast].
* Set the parameters to:
** Database: Reference Proteins (refseq_protein)
** Organism: Start with mouse (''Mus musculus'') or human (''Homo sapiens''). Depending on your goal, consider adding zebrafish (''Danio rerio''), fruit fly (''Drosophila melanogaster''), chicken (''Gallus gallus'') and nematode (''Caenorhabditis elegans'').


=== Generating a FASTA File===
* FASTA format is described [https://zhanglab.ccmb.med.umich.edu/FASTA/ here] and [https://en.wikipedia.org/wiki/FASTA_format here]. Each sequence must start with a >SEQUENCENAME line, followed by a return and then the sequence (in this case the protein sequence). An example of a FASTA file would be:
<code>
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
</code>
* Save sequences in Notepad, [https://notepad-plus-plus.org/ Notepad++] or [https://www.sublimetext.com/ Sublime Text] (not Word) as a <FILENAME>.fasta file.
* Sequence names cannot contain spaces.  Generally it is better to use names such as '''mm_Gdf15-NM_004864.4''', where mm indicates mouse, Gdf15 is the gene name and NM indicates a [https://www.ncbi.nlm.nih.gov/refseq/ RefSeq mRNA].  If there are multiple mRNAs for the gene, name them


== Create Multiple Sequence Alignment using CLUSTAL Omega ==
* Generate phylogenetic trees with [http://megasun.bch.umontreal.ca/People/lartillot/www/download.html PhyloBayes] or MrBayes (see [[Using Mr Bayes to For Phlyogenetic Analysis]]).


=== PhyloBayes Analysis ===


* Mark in your notes the software version used.
* The PhyloBayes manual can be found [http://megasun.bch.umontreal.ca/People/lartillot/www/phylobayes4.1.pdf here].

Latest revision as of 13:16, 18 April 2019

Locate Sequences and Generate FASTA File

  • The easiest way to find sequences is to start with a seed sequence, then run BLAST searches restricted to RefSeq and the species of interest.
  • To find a seed sequence, start with NCBI Gene and find the first RefSeq mRNA (its accession should start with NM_), then click on it and find the protein (its accession should start with NP_).
  • Paste that into your FASTA file (see next section) and name accordingly.
  • Paste that sequence or its NP id into NCBI Protein Blast.
  • Set the parameters to:
    • Database: Reference Proteins (refseq_protein)
    • Organism: Start with mouse (Mus musculus) or human (Homo sapiens). Depending on your goal, consider adding zebrafish (Danio rerio), fruit fly (Drosophila melanogaster), chicken (Gallus gallus) and nematode (Caenorhabditis elegans).
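
The search settings above can also be written down as a request to the NCBI BLAST URL API, which is handy for recording exactly what was searched. The following is a minimal sketch, not part of the lab protocol: it only builds the request URL and does not submit it. The helper name build_blastp_params and the accession NP_000000.0 are hypothetical placeholders, and the parameter names (CMD, PROGRAM, DATABASE, QUERY, ENTREZ_QUERY) are taken from the BLAST URL API as I understand it, so check them against the current NCBI documentation before use.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blastp_params(query, organisms):
    """Build the parameters for a protein BLAST restricted to RefSeq proteins.

    query: a protein sequence or an NP_ accession.
    organisms: scientific names used to limit the search.
    """
    # Limit hits to the listed organisms with an Entrez query.
    entrez_query = " OR ".join(f'"{name}"[Organism]' for name in organisms)
    return {
        "CMD": "Put",                  # submit a new search
        "PROGRAM": "blastp",           # protein vs. protein
        "DATABASE": "refseq_protein",  # Reference Proteins
        "QUERY": query,
        "ENTREZ_QUERY": entrez_query,
    }


# NP_000000.0 is a placeholder; substitute the NP_ accession of your seed protein.
params = build_blastp_params("NP_000000.0", ["Mus musculus", "Homo sapiens"])
request_url = BLAST_URL + "?" + urlencode(params)
print(request_url)
```

Pasting the printed URL into your notes alongside the date gives a reproducible record of the database and organism restrictions used.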

Generating a FASTA File

  • FASTA format is described here and here. Each sequence must start with a >SEQUENCENAME line, followed by a return and then the sequence (in this case the protein sequence). An example of a FASTA file would be:

>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

  • Save sequences in Notepad, Notepad++ or Sublime Text (not Word) as a <FILENAME>.fasta file.
  • Sequence names cannot contain spaces. Generally it is better to use names such as mm_Gdf15-NM_004864.4, where mm indicates mouse, Gdf15 is the gene name and NM indicates a RefSeq mRNA. If there are multiple mRNAs for the gene, name them
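
As an alternative to assembling the file by hand, the formatting rules above (a > header line, no spaces in names, sequence on the following lines) can be checked in a few lines of Python. This is a sketch under assumptions rather than part of the protocol: the helper name format_fasta, the 60-character line width and the short example sequence are illustrative choices, not requirements of the FASTA format.

```python
def format_fasta(records, width=60):
    """Return FASTA text for a list of (name, sequence) pairs.

    Raises ValueError if a name is empty or contains whitespace,
    since sequence names cannot have spaces.
    """
    lines = []
    for name, sequence in records:
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid sequence name: {name!r}")
        # Drop any spaces or line breaks picked up when copying the sequence.
        cleaned = "".join(sequence.split()).upper()
        lines.append(">" + name)
        # Break the sequence into fixed-width lines.
        for start in range(0, len(cleaned), width):
            lines.append(cleaned[start:start + width])
    return "\n".join(lines) + "\n"


# Illustrative record: the name follows the species_gene-accession pattern,
# and the sequence is a made-up 70-residue string.
example = format_fasta([("mm_Gdf15-NM_004864.4", "MTEITAAMVK" * 7)])
print(example)
```

The returned text can be written to a <FILENAME>.fasta file with open(path, "w"), which keeps the file as plain text.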

Create Multiple Sequence Alignment using CLUSTAL Omega

PhyloBayes Analysis

  • Mark in your notes the software version used.
  • The PhyloBayes manual can be found here.