Large variability in sequence lengths after merging paired reads #2010

Open
alannajmcc16 opened this issue Sep 3, 2024 · 1 comment

@alannajmcc16

Hi,

I'm relatively new to bioinformatics and microbiology, and I've been following the DADA2 tutorial to process my 16S gDNA & eDNA sequencing data. After merging my paired reads and constructing the sequence table, I visualized the sequence lengths and noticed a considerable amount of variability. I'm unsure whether this variability is due to biological reasons or if it might be caused by technical issues, such as incomplete merging or sequencing errors.

Up until this point, everything else in the pipeline has looked good. I'm curious if this variability in sequence lengths is a common observation at this stage when working with the 16S marker. If anyone could offer some advice, I would greatly appreciate it :)

Here's some additional information about my data:
Illumina MiSeq, 2x300 paired-end sequencing
V3-V4 target region
Primer set: FWD: CCTACGGGNGGCWGCAG, REV: GACTACHVGGGTATCTAATCC
primers have been successfully removed

thanks in advance!

[Image: distribution of total reads by sequence length]

@benjjneb
Owner

benjjneb commented Sep 5, 2024

> Up until this point, everything else in the pipeline has looked good. I'm curious if this variability in sequence lengths is a common observation at this stage when working with the 16S marker.

Yes. First off, the two modes (peaks) of your sequence length distribution are expected. The V3-V4 region of the 16S rRNA gene has a naturally bimodal length distribution, with the two modes differing by about 20 nt.

The other lengths you observe are not uncommon, and typically come from a mix of off-target amplification and library artefacts. It is completely valid to "cut a band in silico" and remove the ASVs outside the expected length distribution (this is described in the "Construct sequence table" section of the DADA2 tutorial: https://benjjneb.github.io/dada2/tutorial.html).
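
As a rough illustration of what "cutting a band in silico" means, here is a minimal sketch in plain Python. It is not the DADA2 API: the tutorial does this step in R on the sequence table itself. The sketch assumes the ASVs are available as a mapping from sequence to total read count. The names `length_histogram`, `cut_band`, `MIN_LEN` and `MAX_LEN`, the toy sequences, and the 400-430 window are all hypothetical placeholders. The actual window should be read off your own length distribution.

```python
from collections import Counter

def length_histogram(asv_counts):
    """Return total reads per sequence length, sorted by length."""
    hist = Counter()
    for seq, count in asv_counts.items():
        hist[len(seq)] += count
    return dict(sorted(hist.items()))

def cut_band(asv_counts, min_len, max_len):
    """Keep only ASVs whose length falls inside [min_len, max_len]."""
    return {
        seq: count
        for seq, count in asv_counts.items()
        if min_len <= len(seq) <= max_len
    }

# Toy data (made up): sequence -> total read count across samples.
asv_counts = {
    "A" * 402: 1200,  # first mode
    "C" * 403: 300,
    "G" * 427: 900,   # second mode, roughly 20 nt longer
    "T" * 251: 15,    # short off-target product
    "AC" * 230: 4,    # 460 nt, longer than expected
}

# Hypothetical window; choose it from your own histogram.
MIN_LEN, MAX_LEN = 400, 430

hist = length_histogram(asv_counts)
kept = cut_band(asv_counts, MIN_LEN, MAX_LEN)

print(hist)
print(f"kept {len(kept)} of {len(asv_counts)} ASVs, "
      f"{sum(kept.values())} of {sum(asv_counts.values())} reads")
```

Comparing the reads retained against the total is a quick sanity check: if the window discards a large share of reads, the window or an upstream step probably deserves a second look.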
