
Pyrosequencing Pipeline Help

Help Topics:

:: Primer Design :: Alignment Merger :: Rarefaction
:: Pipeline Initial Processor :: Complete Linkage Clustering :: Fasta Sequence Selection
:: NEW Initial Processor stats and charts files :: NEW Clustering stats and charts files :: Spade Input Formatter
:: Aligner :: Representative Sequence :: Cluster To R Formatter
:: NEW Aligner stats and charts files :: Shannon and Chao1 Index :: EstimateS Input Formatter
:: Job Name :: BIOM Format


Sequencing SSU rRNA genes from environmental samples is a standard method for determining bacterial community composition. New sequencing technologies such as pyrosequencing have been used successfully as a rapid and efficient tool for in-depth analysis of bacterial composition. This website provides a collection of tools that facilitate data processing and simplify the computationally intensive analysis of such large sequencing libraries. The tools include an initial data processor, a fast aligner, a complete-linkage clustering tool and ecology metrics. The processed data are available in formats suitable for common ecological and statistical packages such as Spade, EstimateS and R.

:: Primer Design

We designed a set of primers targeting the hypervariable V4 region of the 16S rRNA gene and developed an analysis pipeline that together allow simultaneous sequencing and analysis of up to 80 samples using the Genome Sequencer FLX system. The V4 region has an appropriate length for FLX sequencing. In addition, V4 is one of the variable regions providing the most accurate taxonomic classification, and it has a conserved secondary structure that aids alignment. We developed primers targeting highly conserved regions flanking V4 and tested them for coverage against sequences from RDP release 9.53 and from the GOS database of marine bacterial sequences. The primers perfectly matched 94.6% and 94.7% of sequences, respectively.

The V4 FLX forward primer is "AYTGGGYDTAAAGNG" (E. coli position 563-577) and the reverse primers are "TACNVGGGTATCTAATCC", "TACCRGGGTHTCTAATCC", "TACCAGAGTATCTAATTC", "CTACDSRGGTMTCTAATC" (E. coli position 785-802).

For Titanium pyrosequencing we recommend using the V4 forward primer "AYTGGGYDTAAAGNG" (E. coli position 563-577) and the reverse primer "CCGTCAATTCMTTTRAGT" (E. coli 907-924).
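The primers above use IUPAC degenerate base codes (for example Y for C or T, N for any base). As a minimal sketch of how such a primer can be matched against a read, the Python below expands each code into a regular-expression character class. The function name and the example read are made up for illustration; this is not the pipeline's own matching code, which also tolerates a configurable edit distance.

```python
import re

# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def primer_to_regex(primer):
    """Compile a degenerate primer into a regex over A/C/G/T (exact match only)."""
    return re.compile("".join("[%s]" % IUPAC[base] for base in primer.upper()))

# V4 forward primer quoted in the text above.
forward = primer_to_regex("AYTGGGYDTAAAGNG")

# Hypothetical read that starts with one concrete instance of the primer.
read = "ACTGGGCGTAAAGCGTGCGTAGG"
match = forward.match(read)
if match:
    trimmed = read[match.end():]  # read with the primer removed
```

An exact regex match is the simplest case; allowing mismatches or indels requires an edit-distance comparison instead.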

Some publications that use the above V4 primers:

The analysis pipeline was initially designed for primers targeting the V4 region, but it is not limited to that region.

In addition to the 16S rRNA priming region, the primers contain sequences (adaptors) required for 454 FLX pyrosequencing, and the forward primer contains an additional short key (also called tag) sequence. Twenty forward primers were synthesized, each with a different key. Twenty samples can then be amplified, each with a different forward primer, mixed together prior to sequencing, and the resulting sequences sorted back into the original samples (Fig. 1).

We provide lists of 6-base and 8-base tag sequences that have a minimum distance of 2 from one another, no bases in flow order and no homopolymers. There are 72 tags of 8 bases and 20 tags of 6 bases, each list available as a downloadable file.
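Two of the tag properties listed above can be checked with a short script. The sketch below assumes that "distance" means Hamming distance between equal-length tags, which the text does not state, and it does not attempt the flow-order check. The tags in the usage lines are invented, not taken from the RDP tag lists.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length tags differ."""
    if len(a) != len(b):
        raise ValueError("tags must have the same length")
    return sum(x != y for x, y in zip(a, b))

def has_homopolymer(tag):
    """True if any base is immediately repeated (a run of 2 or more)."""
    return any(x == y for x, y in zip(tag, tag[1:]))

def check_tags(tags, min_distance=2):
    """Return a list of (item, problem) pairs; an empty list means all checks passed."""
    problems = [(t, "homopolymer") for t in tags if has_homopolymer(t)]
    for a, b in combinations(tags, 2):
        if hamming(a, b) < min_distance:
            problems.append((a + "/" + b, "too close"))
    return problems

# Hypothetical 6-base tags.
ok = check_tags(["ACACAC", "AGAGAG"])
bad = check_tags(["ACACAC", "ACACAG"])
```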

Figure 1. Multiplex FLX rRNA gene sequencing protocol.

Pyro Analysis Pipeline

:: Pipeline Initial Processor

Initial processing of the sequences is handled by the GS FLX software. Sequences that do not pass the FLX quality controls are discarded, and the 454-specific portions of the primer sequences are trimmed. We developed software that takes these raw sequences, sorts them by tag sequence, trims the 16S primers and filters out additional low-quality sequences. If the gene is 16S rRNA, the orientation of each sequence is checked and the sequence is reverse complemented if needed.
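Reverse complementing, mentioned above, can be sketched in a few lines of Python. The translation table covers the IUPAC ambiguity codes as well as A, C, G and T. How the pipeline decides that a read is in the wrong orientation is not described here, so that step is left out.

```python
# Each IUPAC code maps to the code for the complementary set of bases.
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (IUPAC codes allowed)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

rc = reverse_complement("AACG")
```

Applying the function twice returns the original sequence, which is a convenient sanity check.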

From the initial processor page, users upload four input files: a primer file, a tag file, a sequence file and a quality file obtained from GS FLX sequencing. Users can control the quality of the output sequences by adjusting the maximum forward primer edit distance, the maximum reverse primer edit distance, the maximum number of N's and the minimum sequence length. An output directory containing the trimmed sequence file and quality file is created for each tag. A summary stats file and a chart are also created for each tag. The stats file contains 5 summary lines:

  1. Tag name
  2. Total sequences with the tag
  3. Number of sequences that passed filtering
  4. Average length of sequences after filtering and trimming
  5. Standard deviation of the length of the sequences after filtering and trimming

Following the summary lines is a list of sequence lengths and the number of sequences that long in the tag.

The stats chart is a histogram of the sequence lengths after filtering and trimming (generated from the data in the stats file).
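The per-tag summary described above can be reproduced from a list of trimmed sequence lengths. The following is a minimal sketch: the function name, the returned dictionary layout and the use of the population standard deviation are assumptions, since the help text does not specify the stats file layout or which standard deviation is reported.

```python
from collections import Counter
from statistics import mean, pstdev

def summarize_tag(tag_name, total_with_tag, passed_lengths):
    """Summarize one tag: the five summary values plus a length histogram."""
    has_data = len(passed_lengths) > 0
    return {
        "tag": tag_name,
        "total": total_with_tag,
        "passed": len(passed_lengths),
        "mean_length": mean(passed_lengths) if has_data else None,
        # Population standard deviation; the tool may use the sample form instead.
        "sd_length": pstdev(passed_lengths) if has_data else None,
        # length -> number of sequences of that length, sorted by length
        "length_counts": dict(sorted(Counter(passed_lengths).items())),
    }

# Hypothetical tag with 5 reads, 4 of which passed filtering.
summary = summarize_tag("tagA", 5, [200, 210, 210, 220])
```

The `length_counts` mapping holds the same data that the histogram chart would plot.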

Finally, there is a new <tag_name>_dropped_seqs.txt file that lists every dropped sequence, the filter that dropped it and the reason it was dropped.
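The filters named above (primer edit distance, number of N's, minimum length) can be illustrated with a small function that returns the reason a read would be dropped. This is a sketch under stated assumptions: the threshold defaults are hypothetical, the primer is treated as non-degenerate, only the forward primer is checked, and the reason strings are invented rather than copied from the real dropped_seqs file.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def drop_reason(seq, primer, max_primer_dist=2, max_ns=0, min_length=50):
    """Return None if the read passes, else a short reason string."""
    if edit_distance(seq[:len(primer)], primer) > max_primer_dist:
        return "forward primer mismatch"
    trimmed = seq[len(primer):]
    if trimmed.count("N") > max_ns:
        return "too many Ns"
    if len(trimmed) < min_length:
        return "too short"
    return None

# Hypothetical primer and reads.
primer = "ACGTAC"
kept = drop_reason(primer + "ACGT" * 20, primer)
dropped = drop_reason("TTTTTT" + "ACGT" * 20, primer)
```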

You can upload an SFF file (which contains both the sequence and quality information), or a FASTA sequence file with or without a corresponding FASTA quality file (in plain text).

The tag file must be a tab-delimited plain ASCII text file. Each line must contain a tag sequence and a sample name, separated by a tab.
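A file in this two-column layout can be read with a few lines of Python. The tags and sample names in the example text are invented for illustration, and the parser is a sketch rather than the pipeline's own reader.

```python
def parse_tag_file(text):
    """Parse tab-delimited 'tag<TAB>sample name' lines into a dict of tag -> sample."""
    tags = {}
    for line_no, line in enumerate(text.splitlines(), 1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError("line %d: expected tag<TAB>sample name" % line_no)
        tag, sample = fields[0].strip().upper(), fields[1].strip()
        if tag in tags:
            raise ValueError("line %d: duplicate tag %s" % (line_no, tag))
        tags[tag] = sample
    return tags

# Hypothetical tag file content: one tag and one sample name per line.
example = "ACACAC\tsoil_1\nAGAGAG\tsoil_2\n"
tag_to_sample = parse_tag_file(example)
```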


After initial processing, the sequences can be assigned to the bacterial taxonomy using the RDP Classifier. Comparison between samples can be made using RDP Library Compare. Sequences can be aligned using the Infernal aligner. See Aligner.

RDP Pyro Initial Processing Length Histogram

:: Aligner

This tool aligns the sequences using the INFERNAL aligner, an SCFG-based (stochastic context-free grammar), secondary-structure-aware aligner (Nawrocki & Eddy, 2007). The INFERNAL aligner provides several significant advantages over RNACAD: it is about 25x faster, it handles sequencing errors much more intuitively, and it solves some known problems with incorrect alignment of short partial sequences. We trained the aligner on a small hand-curated set of high-quality full-length rRNA sequences derived mainly from genome sequencing projects.

The Infernal aligner does not accept sequences that contain dots or dashes, so our system removes all dots and dashes from your sequences before submitting them to the aligner. Sequences are reverse complemented if necessary. Sequences that contain non-IUPAC characters or that are shorter than 150 bases are removed before submission.
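The pre-alignment cleanup described above can be sketched as a single function. The set of accepted characters is an assumption (the IUPAC nucleotide codes plus U), and the orientation check is omitted because the text does not describe how it is done.

```python
# IUPAC nucleotide codes; including U for RNA input is an assumption.
IUPAC_CHARS = set("ACGTURYSWKMBDHVN")

def prepare_for_alignment(seq, min_length=150):
    """Strip gap characters, then return the cleaned sequence or None if rejected."""
    cleaned = seq.replace(".", "").replace("-", "").upper()
    if len(cleaned) < min_length:
        return None  # shorter than the 150-base minimum
    if not set(cleaned) <= IUPAC_CHARS:
        return None  # contains a non-IUPAC character
    return cleaned

# Hypothetical gapped input: 160 bases once the gap characters are removed.
accepted = prepare_for_alignment("AC-GT." * 40)
```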

When the aligner job is finished, an email is sent to the user with a link to download the resulting zip file. The zip file contains the aligned FASTA file. A summary stats file and a chart are created for each alignment file produced by the aligner tool. The stats file contains 6 summary lines:

  1. Total sequences submitted for alignment
  2. Alignment model used (bacteria or archaea)
  3. Model reference sequence (base positions listed are positions in this sequence)
  4. Number of sequences containing only gaps
  5. Average number of comparable positions
  6. Average sequence length (measured as the distance between the reference beginning and end position)

The rest of the file contains one line per sequence containing the start and end position for that sequence.

The chart contains a histogram of starting and end positions for the sequences in the aligned file.

RDP Pyro Alignment Start and End Model Position Histogram

:: Alignment Merger

This tool merges multiple alignment files, be it in fasta or stockholm format, into one fasta file.

The resulting zip file will contain the merged alignment fasta file.

:: Fasta Sequence Selection

The Fasta Sequence Selection tool lets users select or exclude a subset of sequences from an original sequence file, which can be either a FASTA file or a Stockholm file. The result file is generated in FASTA format. The tool requires two input files:

  1. The original sequence file from which you want to make a selection. This can be a FASTA or Stockholm file.
  2. The sequence ID file, which contains the list of sequence IDs to be included or excluded. The sequence IDs must match the IDs in the original sequence file. This file can be an assignment detail file downloaded from RDP Classifier, a Stockholm file, or a text file that contains a list of sequence IDs separated by spaces, tabs or new lines.
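For the simplest case (a FASTA file plus a plain-text ID list), the selection logic can be sketched as follows. The sketch assumes that a sequence ID is the first whitespace-delimited token of the FASTA header, and it does not handle Stockholm or Classifier assignment detail files. The records in the usage lines are invented.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def select_sequences(records, id_text, exclude=False):
    """Keep records whose ID is listed in id_text, or drop them if exclude=True."""
    ids = set(id_text.split())  # split() handles spaces, tabs and new lines
    kept = []
    for header, seq in records:
        tokens = header.split()
        seq_id = tokens[0] if tokens else ""
        if (seq_id in ids) != exclude:
            kept.append((header, seq))
    return kept

# Hypothetical input.
fasta = ">s1 soil\nACGT\nACGT\n>s2 soil\nGGCC\n>s3 water\nTTAA\n"
records = read_fasta(fasta)
included = select_sequences(records, "s1\ts3")
excluded = select_sequences(records, "s1 s3", exclude=True)
```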

:: Spade Formatter

This tool creates a SPADE (http://chao.stat.nthu.edu.tw/softwareCE.html) compatible input file from the complete linkage cluster file, in one of 2 formats:

  1. simple format (species frequency or abundance data of one community/assemblage) - For any species found in the sample, this type of data includes the number of times (frequency) that the species was discovered, or the number of individuals (abundance) that the species was represented in the sample.
  2. tabular format (species frequency or abundance data of two or more communities/assemblages) - Assume that in each community, a sample of individuals is taken. The species frequencies (or abundances) are arranged in two columns. The first column denotes the frequency (or abundance) of a species discovered in Community I, and the second column denotes the frequency (or abundance) of the same species in Community II. The two frequencies are separated by at least one blank space.

The tool requires start and end distance cutoff values. It creates a zip file containing one result file for each distance in your cluster file that falls within the specified cutoff range.

:: Complete Linkage Clustering

Hierarchical cluster analysis (or hierarchical clustering) is a general approach to cluster analysis in which the goal is to group together objects or records that are "close" to one another. A key component of the analysis is the repeated calculation of distance measures between objects, and between clusters once objects begin to be grouped into clusters.

Complete linkage clustering (or the farthest neighbor method) is one way of calculating the distance between clusters in hierarchical cluster analysis: the distance between two clusters is the largest distance between any pair of members, one from each cluster. This tool creates a cluster file from an aligned sequence file, which must be an aligned FASTA or Stockholm file.

The other inputs are the maximum cluster distance and the step, which is the increment between cluster distances. Distances are entered as percentages: to specify a 30% maximum cluster distance, enter 30 in the input field.

A summary stat file and a chart are created for each resulting cluster file produced by the clustering tool. The summary file contains the total number of sequences clustered and the number of clusters (OTUs) at each distance cutoff. The chart is a scatter plot of distance cutoffs vs. number of clusters.
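The farthest-neighbor rule and the cutoff/step inputs can be illustrated with a naive implementation on a small precomputed distance matrix. This is a teaching sketch, not the RDP clustering code: it recomputes every cluster-to-cluster distance at each merge, so it is only practical for small inputs, and the distance values are invented.

```python
def complete_linkage(dist, cutoff):
    """Merge clusters while the farthest-neighbor distance is <= cutoff.

    dist is a symmetric square matrix (list of lists) of pairwise distances.
    Returns a list of clusters, each a list of sequence indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def cluster_distance(a, b):
        # Complete linkage: the largest distance between any two members.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > cutoff:
            break  # the closest pair of clusters is already too far apart
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Hypothetical distances (as fractions) between four sequences.
dist = [
    [0.00, 0.01, 0.10, 0.12],
    [0.01, 0.00, 0.11, 0.15],
    [0.10, 0.11, 0.00, 0.02],
    [0.12, 0.15, 0.02, 0.00],
]

# Maximum distance 15 and step 3, entered as percentages as on the input form.
max_pct, step_pct = 15, 3
otus_per_cutoff = {
    pct: len(complete_linkage(dist, pct / 100))
    for pct in range(0, max_pct + 1, step_pct)
}
```

The `otus_per_cutoff` mapping corresponds to the "number of clusters at each distance cutoff" reported in the summary file.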

RDP Pyro Cluster Scatter Plot

:: Representative Sequence

This tool allows users to select a representative sequence from each cluster. The sequence with the minimum sum of the square of distances between sequences within a cluster is assigned as the representative sequence for that cluster. The result is a tab delimited file which contains the representative sequence ID, cluster id, number of sequences, maximum distance and minimum sum of squares of distances for each cluster.

It requires a complete linkage cluster file, one or many aligned fasta / stockholm files and the maximum distance as input.
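The selection rule (minimum sum of squared distances to the other members of the cluster) can be written directly. The sketch below assumes the pairwise distances are available as a nested dictionary keyed by sequence ID; the IDs and distances are invented, and ties are broken by member order, which may differ from the tool's behavior.

```python
def representative(members, dist):
    """Return (representative ID, its sum of squared distances, max distance in cluster)."""
    def sum_sq(i):
        return sum(dist[i][j] ** 2 for j in members if j != i)

    rep = min(members, key=sum_sq)
    max_dist = max((dist[i][j] for i in members for j in members if i != j),
                   default=0.0)
    return rep, sum_sq(rep), max_dist

# Hypothetical cluster of three sequences.
dist = {
    "seqA": {"seqB": 0.02, "seqC": 0.02},
    "seqB": {"seqA": 0.02, "seqC": 0.04},
    "seqC": {"seqA": 0.02, "seqB": 0.04},
}
rep, rep_sum_sq, max_dist = representative(["seqA", "seqB", "seqC"], dist)
```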

:: Cluster To R Formatter

This tool creates community data matrix input files for R. It requires a complete linkage cluster file and the distance cutoff range you are interested in. The output is a set of tab-delimited files, one for each distance within the specified cutoff range, each containing the number of sequences from each sample in each OTU. The strings "aligned_" and "_trimmed" are removed from sample names if present.
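A community data matrix of this kind can be sketched from a list of (sample name, OTU) assignments, one per sequence. The orientation (samples as rows, OTUs as columns), the header label and the input structure are assumptions; the real tool reads the cluster file directly, and the names below are invented.

```python
from collections import Counter

def clean_sample_name(name):
    """Remove the 'aligned_' and '_trimmed' markers from a sample name."""
    return name.replace("aligned_", "").replace("_trimmed", "")

def community_matrix(assignments):
    """Build a tab-delimited sample-by-OTU count table from (sample, otu) pairs."""
    counts = Counter((clean_sample_name(s), otu) for s, otu in assignments)
    samples = sorted({s for s, _ in counts})
    otus = sorted({o for _, o in counts})
    lines = ["\t".join(["sample"] + otus)]
    for s in samples:
        # A Counter returns 0 for sample/OTU pairs that never occurred.
        lines.append("\t".join([s] + [str(counts[(s, o)]) for o in otus]))
    return "\n".join(lines)

# Hypothetical assignments at one distance cutoff.
assignments = [
    ("aligned_soil1_trimmed", "OTU_1"),
    ("aligned_soil1_trimmed", "OTU_1"),
    ("aligned_soil1_trimmed", "OTU_2"),
    ("aligned_soil2_trimmed", "OTU_2"),
]
table = community_matrix(assignments)
```

A file in this layout can be loaded in R with `read.table(file, header=TRUE, row.names=1, sep="\t")`.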


:: EstimateS Input Formatter

For regions of less-well-studied bacterial diversity, query classification is often not well supported, even for higher taxonomic ranks. We have found that a high percentage of sequences from some environmental clone libraries are classified with less than 80% confidence, even at the phylum level. Such low confidence classification results may identify sequences where a thorough phylogenetic analysis is warranted.

EstimateS is a free software application for Windows and Macintosh operating systems that computes a variety of biodiversity functions, estimators, and indices based on biotic sampling data. Some features require species relative abundance data, others only species presence/absence data.

:: Shannon and Chao1 Index

Each input cluster file must contain a single sample. If you need to calculate the Shannon and Chao1 indices for multiple sample files combined, first merge them into one sample file using Alignment Merger, then run Complete Linkage Clustering on the merged file before using this tool. The input cluster file name must end with the ".clust" extension.

The result is a tab-delimited file that contains the columns sampleID, distance, N, clusters, chao, LCI95, UCI95, H', varH and E.
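Several of these columns can be computed from the list of OTU sizes at one distance. The sketch below uses the textbook definitions: Shannon index H' = -sum(p_i ln p_i), evenness E = H' / ln(S), and the bias-corrected Chao1 estimator S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of OTUs with exactly one and two sequences. Which Chao1 variant the tool uses is not stated, so that choice is an assumption, and the confidence limits and varH are not covered.

```python
import math

def shannon(counts):
    """Shannon index H' using natural logarithms."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

def evenness(counts):
    """E = H' / ln(S), where S is the number of observed OTUs."""
    s = sum(1 for c in counts if c > 0)
    return shannon(counts) / math.log(s) if s > 1 else 0.0

def chao1(counts):
    """Bias-corrected Chao1 richness estimate."""
    s_obs = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)  # singleton OTUs
    f2 = sum(1 for c in counts if c == 2)  # doubleton OTUs
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

# Hypothetical OTU sizes for one sample at one distance cutoff.
counts = [10, 5, 2, 2, 1, 1, 1]
row = {
    "N": sum(counts),
    "clusters": len(counts),
    "chao": chao1(counts),
    "H'": shannon(counts),
    "E": evenness(counts),
}
```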

:: Rarefaction

The input is one or more cluster files, each of which contains one or more samples. The name of each input cluster file must end with the ".clust" extension. Recommendation: compress the input files for faster uploading.

The result is a tab-delimited file. The first row contains a column header for each distance in your cluster file. The column headers marked with "U" and "L" give the upper and lower limits of the 95% confidence interval for each distance. Open the result file in Excel to plot a rarefaction curve.
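For reference, a rarefaction curve can be computed analytically from OTU sizes with the classical formula E[S_n] = sum over OTUs of (1 - C(N - N_i, n) / C(N, n)), where N is the total number of sequences and N_i the size of OTU i. The help text does not say whether the tool uses this formula or random resampling, so the sketch below is an assumption about method, and it does not compute the confidence limits.

```python
from math import comb

def expected_otus(counts, n):
    """Expected number of OTUs seen in a random subsample of n sequences."""
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("subsample size must be between 0 and N")
    denom = comb(total, n)
    # comb(a, n) is 0 when n > a, so abundant OTUs contribute exactly 1.
    return sum(1 - comb(total - c, n) / denom for c in counts)

def rarefaction_curve(counts, step=1):
    """List of (n, expected OTUs) points from n = 0 up to at most N, in steps."""
    total = sum(counts)
    return [(n, expected_otus(counts, n)) for n in range(0, total + 1, step)]

# Hypothetical OTU sizes: 4 sequences in 3 OTUs.
curve = rarefaction_curve([2, 1, 1])
```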

:: Job Name

Most of the tools allow you to name the job you submit. The job name becomes the prefix of the result file and appears in the email notice, which makes it easier to keep track of the jobs you have submitted to our system. Only word characters [a-zA-Z_0-9] are allowed. Any non-word character is replaced with "_".
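The replacement rule can be reproduced with one regular expression. The explicit character class is used instead of `\W` because Python's `\W` treats non-ASCII letters as word characters by default. The function name is made up for illustration.

```python
import re

def sanitize_job_name(name):
    """Replace every character outside [a-zA-Z_0-9] with an underscore."""
    return re.sub(r"[^a-zA-Z_0-9]", "_", name)

clean = sanitize_job_name("soil run #1")
```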

:: BIOM Format

The RDP Clustering tools produce minimal dense BIOM files as part of their results. The Classifier accepts a minimal (or rich) dense BIOM file as input, with an optional metadata file, and produces a rich dense BIOM file.

If an input cluster BIOM file (version 1.0) is provided along with a sequence file (output from the Representative Sequence tool), the classification result of each sequence replaces the taxonomy of the corresponding cluster. If a metadata file is provided, its information replaces the metadata of the corresponding sample. The resulting rich dense BIOM file can be used by third-party tools such as phyloseq or QIIME.

The BIOM file format is designed to be a general-use format for representing biological sample by observation contingency tables. BIOM is a recognized standard for the Earth Microbiome Project and is a Genomics Standards Consortium supported project.

Here are some example input files:

Minimal Dense BIOM File produced by RDP Clustering tool

Rich Dense BIOM File with sample metadata

Metadata File, a tab-delimited file, with first row containing attribute name and first column containing the sample name
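To give a sense of what a minimal dense BIOM file looks like, the sketch below builds one as JSON. The field names follow the BIOM 1.0 format as commonly documented; the exact content of the files produced by the RDP Clustering tool may differ, and the OTU IDs, sample names and "generated_by" value are invented.

```python
import json
from datetime import datetime, timezone

def minimal_dense_biom(otu_ids, sample_ids, counts):
    """Build a minimal dense BIOM 1.0 table; counts[i][j] is OTU i in sample j."""
    if len(counts) != len(otu_ids) or any(len(r) != len(sample_ids) for r in counts):
        raise ValueError("counts must have one row per OTU and one column per sample")
    return {
        "id": None,
        "format": "Biological Observation Matrix 1.0.0",
        "format_url": "http://biom-format.org",
        "type": "OTU table",
        "generated_by": "hypothetical example script",
        "date": datetime.now(timezone.utc).isoformat(),
        # "Minimal" means rows and columns carry no metadata.
        "rows": [{"id": o, "metadata": None} for o in otu_ids],
        "columns": [{"id": s, "metadata": None} for s in sample_ids],
        "matrix_type": "dense",
        "matrix_element_type": "int",
        "shape": [len(otu_ids), len(sample_ids)],
        "data": counts,
    }

table = minimal_dense_biom(["OTU_1", "OTU_2"], ["soil1", "soil2"], [[2, 0], [1, 1]])
biom_text = json.dumps(table, indent=2)
```

A rich BIOM file has the same structure, with the `metadata` entries filled in (for example taxonomy on rows and sample attributes on columns).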

Questions/comments: rdpstaff@msu.edu