Read pre-processing

ncPRO-seq supports input read sequences in three formats: fastq from Solexa, csfasta from SOLiD, and fasta from 454. In this step, ncPRO-seq generates several figures for each sequencing library that describe basic properties of the reads: the length distribution of distinct reads, the length distribution of abundant reads, and, for fastq input, the mean quality score at each read position. These figures are useful for assessing the basic quality of the sequencing reads. Note that distinct reads count each unique sequence only once, i.e. they ignore read abundance, whereas abundant reads are all sequenced reads.
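
The difference between distinct and abundant reads can be illustrated with a minimal Python sketch. This is not ncPRO-seq's own code; the function name length_distributions and the in-memory list of sequences are assumptions made only for illustration.

```python
from collections import Counter


def length_distributions(reads):
    """Return (distinct, abundant) read-length histograms.

    `reads` is a list of read sequences (hypothetical in-memory input;
    a real pipeline would parse them from fastq, csfasta or fasta files).
    """
    # Abundant reads: every sequenced read contributes to the histogram.
    abundant = Counter(len(seq) for seq in reads)
    # Distinct reads: each unique sequence contributes exactly once.
    distinct = Counter(len(seq) for seq in set(reads))
    return distinct, abundant


reads = ["ACGT", "ACGT", "ACGTA", "TTGCA", "ACGT"]
distinct, abundant = length_distributions(reads)
```

In this toy input the sequence ACGT occurs three times, so length 4 is counted three times among abundant reads but only once among distinct reads.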

ncPRO-seq provides a useful option, GROUP_READ (see 3.1), which controls the read clustering process. If it is set to 1, reads with identical sequences are clustered into non-redundant read groups, each of which is assigned a unique group id and a read count. For reads in fastq format, the positional quality score of a read group is the mean positional quality score of all reads clustered in that group. In the subsequent analyses, the read groups, which contain far fewer items than the original read data, are used as input instead. We recommend this option especially for very large sequencing libraries, as it significantly reduces both the CPU time of all analyses and the disk space needed to store intermediate results. A further advantage is that additional read profiles are computed from the read groups (i.e. distinct reads), as shown in 5.5.5.
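
The grouping described above can be sketched in Python as follows. This is an illustrative approximation only, not the ncPRO-seq implementation: the function name group_reads, the (sequence, quality list) record layout, and the numeric group ids are assumptions, and the sketch assumes that reads with identical sequence have quality lists of equal length.

```python
from collections import defaultdict


def group_reads(records):
    """Cluster reads with identical sequence into read groups.

    `records` is an iterable of (sequence, qualities) pairs, where
    `qualities` is a list of per-position quality scores (hypothetical
    layout; real input would be parsed from a fastq file).
    """
    # Collect the quality lists of all reads sharing the same sequence.
    by_sequence = defaultdict(list)
    for seq, quals in records:
        by_sequence[seq].append(quals)

    groups = []
    for group_id, (seq, qual_lists) in enumerate(by_sequence.items(), start=1):
        count = len(qual_lists)
        # Mean quality at each position across all reads in the group.
        mean_quality = [sum(column) / count for column in zip(*qual_lists)]
        groups.append(
            {"id": group_id, "sequence": seq, "count": count,
             "mean_quality": mean_quality}
        )
    return groups


records = [
    ("ACGT", [30, 30, 20, 10]),
    ("ACGT", [40, 30, 30, 20]),
    ("TTGA", [35, 35, 35, 35]),
]
groups = group_reads(records)
```

Here three reads collapse into two read groups. The group for ACGT has a read count of 2 and a positional quality equal to the position-wise mean of its two members, which is why later steps have fewer items to process.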

Chongjian Chen 2012-01-26