Quantifying A/T stretches and narrow-minor-groove tetranucleotides in mouse satellite DNA arrays.
Based on the long-read assembly of the Mus musculus genome from
Ton et al., Sci. Data 2020 (PMID: 33203859; run SRR11606870),
and using the major and minor satellite reads from
Packiaraj & Thakur, Genome Biol. 2024 (PMID: 38378611)
as templates, the scripts compute counts of A/T stretches and narrow-minor-groove
tetranucleotides in major and minor satellite arrays.
Requirements: blast (v2.6.0), biopython (v1.83), mkl-service (v2.4.0), regex (v2024.5.15)

(Recommended: terminal / bash.)
majsat.py and minsat.py.

(Recommended: terminal / bash.)
Download SRR11606870 from NCBI SRA:
    prefetch SRR11606870
    fasterq-dump pathtoSRAfile --outdir pathtoSRAfile/fasta --fasta-unsorted
Output file: SRR11606870.fasta.
Copy the template reads (major satellite SRR11606870_2342980.fasta and minor
satellite SRR11606870_111923.fasta) from the repo into the same folder as SRR11606870.fasta.

(In PyCharm 2023.2.5 Community Edition.)
Clone https://github.com/DDudka9/DNA-shape.git
Open requirements.txt and click Install requirements.
majsat.py and minsat.py. Use PyCharm's built-in Python Console to run subsequent parts
of the scripts by copy-pasting the code into the console and pressing Return.
SRR11606870_2342980.fasta and SRR11606870_111923.fasta.
Tetranucleotide column order in all CSV/FASTA outputs:
AAAT · AATA · AATC · AATT · AAAA · AAGT · GAAT · GAAA · TAAT · AAAC.
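
As a rough illustration of how counts in this column order could be produced, here is a minimal sketch. It is not the code in majsat.py or minsat.py. The function names, the 1000 bp default window, counting overlapping matches, and scanning only the given strand are all assumptions made for this example.

```python
# Column order as listed in this README.
TETRANUCLEOTIDES = ["AAAT", "AATA", "AATC", "AATT", "AAAA",
                    "AAGT", "GAAT", "GAAA", "TAAT", "AAAC"]


def count_tetranucleotides(seq):
    """Count overlapping occurrences of each tetranucleotide.

    Assumption: only the strand passed in is scanned (no reverse complement).
    Returns a list of counts in TETRANUCLEOTIDES order.
    """
    seq = seq.upper()
    counts = {tetra: 0 for tetra in TETRANUCLEOTIDES}
    # Slide a 4-base window one position at a time so overlaps are counted.
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:
            counts[kmer] += 1
    return [counts[tetra] for tetra in TETRANUCLEOTIDES]


def windowed_counts(seq, window=1000):
    """Return one row of counts per non-overlapping window.

    The last window may be shorter than `window`. Tetranucleotides that
    span a window boundary are not counted in either window.
    """
    return [count_tetranucleotides(seq[start:start + window])
            for start in range(0, len(seq), window)]


if __name__ == "__main__":
    # Toy sequence, not real satellite data.
    for row in windowed_counts("AAAATGAAATTC", window=6):
        print(row)
```

Each row can be written as one CSV line. Whether the original scripts count overlapping matches or both strands is not stated here, so adjust the sketch to match them before comparing numbers.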
SRR11606870_Maj_2342980_tetranucleotides.csv: per-1 kb counts of narrow-minor-groove
tetranucleotides along a representative major satellite array (SRR11606870_2342980).
SRR11606870_Min_111923_tetranucleotides.csv: per-1 kb counts of narrow-minor-groove
tetranucleotides along a representative minor satellite array (SRR11606870_111923).
SRR11606870_Maj_tetranucleotides_average.fasta: narrow-minor-groove tetranucleotide
counts per 234 bp across 500 major satellite arrays (averages at the end of the file).
SRR11606870_Min_tetranucleotides_average.fasta: same for 500 minor satellite arrays.
SRR11606870_Maj_ATstretches_average.fasta: AT-stretch counts (default minimum length 4)
per 234 bp across 500 major satellite arrays.
SRR11606870_Min_ATstretches_average.fasta: same for 500 minor satellite arrays.

You can modify the array ID inside the scripts to analyze any other array, or substitute a different dataset entirely.
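
To make the AT-stretch outputs concrete, the sketch below shows one way such a count could be computed and normalized. It assumes that an "AT stretch" is a maximal run of consecutive A or T bases at least `min_len` long, and that "per 234 bp" means the count scaled by 234 / array length. Both are assumptions, as are the function names. It uses the standard-library `re` module rather than the `regex` package listed in the requirements.

```python
import re


def count_at_stretches(seq, min_len=4):
    """Count maximal runs of A/T bases that are at least `min_len` long.

    Assumption: A and T may be mixed within one stretch (e.g. 'AATTA').
    """
    pattern = re.compile(r"[AT]{%d,}" % min_len)
    # The greedy quantifier consumes each run fully, so every run is counted once.
    return sum(1 for _ in pattern.finditer(seq.upper()))


def stretches_per_unit(seq, unit=234, min_len=4):
    """Scale the stretch count to a fixed unit length (234 bp by default)."""
    if not seq:
        return 0.0
    return count_at_stretches(seq, min_len) * unit / len(seq)


if __name__ == "__main__":
    # Toy sequence, not real satellite data.
    toy = "AAAATTGCATATGGC"
    print(count_at_stretches(toy))
    print(stretches_per_unit(toy))
```

Raising `min_len` restricts the count to longer stretches, which corresponds to changing the default minimum length of 4 mentioned above.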