Dudka Lab — DNA-shape Documentation

Overview

The scripts work on the long-read assembly of the Mus musculus genome from Ton et al., Sci. Data 2020 (PMID: 33203859) (run SRR11606870) and use the major and minor satellite reads from Packiaraj & Thakur, Genome Biol. 2024 (PMID: 38378611) as templates. They compute:

The number of contiguous A/T stretches (default minimum length = 4).
The number of tetranucleotides associated with a narrow minor groove (per Rohs et al., Nature 2009; PMID: 19865164).
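
The two quantities above can be illustrated with a minimal Python sketch. It is not the code in majsat.py or minsat.py. It assumes that an "A/T stretch" is a maximal run made of A and/or T, that tetranucleotide occurrences may overlap, and that only the given strand is scanned. The function names are hypothetical; the tetranucleotide list is the one given in this README.

```python
import re

# Tetranucleotides listed in this README, in the documented column order.
NARROW_GROOVE_TETRAMERS = ["AAAT", "AATA", "AATC", "AATT", "AAAA",
                           "AAGT", "GAAT", "GAAA", "TAAT", "AAAC"]


def count_at_stretches(seq, min_len=4):
    """Count maximal runs of A/T that are at least min_len long.

    Assumption: a stretch may mix A and T (e.g. AATT counts as one run).
    """
    pattern = re.compile(r"[AT]{%d,}" % min_len)
    return len(pattern.findall(seq.upper()))


def count_tetramers(seq, tetramers=NARROW_GROOVE_TETRAMERS):
    """Count occurrences of each tetranucleotide with a sliding window.

    Assumption: overlapping matches are counted, forward strand only.
    """
    seq = seq.upper()
    counts = {t: 0 for t in tetramers}
    for i in range(len(seq) - 3):
        window = seq[i:i + 4]
        if window in counts:
            counts[window] += 1
    return counts
```

For example, count_at_stretches("AAAATGCATTTTTGAT") finds two runs of length 4 or more (AAAAT and ATTTTT), and the trailing AT is too short to count.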

Operating system tested: macOS Ventura (Apple M1)
Language: Python 3.11
Installer: Miniconda
Installation time: ~10 minutes
Code run time: 10–30 minutes
Modules: blast (v2.6.0), biopython (v1.83), mkl-service (v2.4.0), regex (v2024.5.15)

(Recommended: terminal / bash.)

Download Miniconda.
Install BLAST via Miniconda — see anaconda.org/bioconda/blast.
Install sra-tools. Useful guides: fasterq-dump wiki · fastq-dump blog.
(Recommended) PyCharm 2023.2.5 Community Edition for running majsat.py and minsat.py.

(Recommended: terminal / bash.)

Download all reads unsorted in FASTA format:

fasterq-dump pathtoSRAfile --outdir pathtoSRAfile/fasta --fasta-unsorted

Rename the output to SRR11606870.fasta.
Copy the two template FASTA files (major satellite SRR11606870_2342980.fasta and minor satellite SRR11606870_111923.fasta) from the repo into the same folder as SRR11606870.fasta.

(In PyCharm 2023.2.5 Community Edition.)

Clone the GitHub repository: Git → Clone → URL https://github.com/DDudka9/DNA-shape.git
Create a new interpreter: Python interpreter (bottom right) → Add New Interpreter → Add Local Interpreter → Conda Environment → Create New Environment (provide the path to Miniconda; select Python 3.11).
Select requirements.txt and click Install requirements.
Follow the instructions inside majsat.py and minsat.py. Run the scripts part by part in PyCharm's built-in Python Console: copy each part into the console and press Return before moving on to the next.
Output appears in the folder containing SRR11606870_2342980.fasta and SRR11606870_111923.fasta.

Tetranucleotide column order in the tetranucleotide CSV/FASTA outputs: AAAT · AATA · AATC · AATT · AAAA · AAGT · GAAT · GAAA · TAAT · AAAC.

SRR11606870_Maj_2342980_tetranucleotides.csv — per-1 kb counts of narrow-minor-groove tetranucleotides along a representative major satellite array (SRR11606870_2342980).
SRR11606870_Min_111923_tetranucleotides.csv — per-1 kb counts of narrow-minor-groove tetranucleotides along a representative minor satellite array (SRR11606870_111923).
SRR11606870_Maj_tetranucleotides_average.fasta — narrow-minor-groove tetranucleotide counts per 234 bp across 500 major satellite arrays (averages at the end of the file).
SRR11606870_Min_tetranucleotides_average.fasta — same for 500 minor satellite arrays.
SRR11606870_Maj_ATstretches_average.fasta — AT-stretch counts (default minimum length 4) per 234 bp across 500 major satellite arrays.
SRR11606870_Min_ATstretches_average.fasta — same for 500 minor satellite arrays.
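
As a rough illustration of how per-window counts like those in the tetranucleotide CSV files could be produced, the sketch below splits a sequence into fixed-size windows and emits one row per window in the documented column order. It is not taken from the repository. The function names, the header row, and the choice to ignore tetranucleotides that span a window boundary are all assumptions.

```python
import csv
import io

# Column order as documented in this README.
TETRAMERS = ["AAAT", "AATA", "AATC", "AATT", "AAAA",
             "AAGT", "GAAT", "GAAA", "TAAT", "AAAC"]


def windowed_tetramer_counts(seq, window=1000):
    """Return one row of counts per non-overlapping window of `window` bp.

    Assumption: tetranucleotides spanning a window boundary are not counted.
    """
    seq = seq.upper()
    rows = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        counts = dict.fromkeys(TETRAMERS, 0)
        for i in range(len(chunk) - 3):
            kmer = chunk[i:i + 4]
            if kmer in counts:
                counts[kmer] += 1
        rows.append([counts[t] for t in TETRAMERS])
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text; the header row is an assumption."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(TETRAMERS)
    writer.writerows(rows)
    return buf.getvalue()
```

Passing window=234 instead of the default would give the per-234 bp binning used in the averaged outputs, under the same assumptions.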

You can modify the array ID inside the scripts to analyze any other array, or substitute a different dataset entirely.
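
When picking a different array, it can help to confirm that the read is present in the downloaded FASTA file first. The helper below is a hypothetical, standard-library sketch and is not part of the repository. It streams a multi-FASTA file and returns the first record whose header starts with the requested ID. The exact header format written by fasterq-dump is not stated in this README, so check the headers in your own file before choosing the ID string.

```python
def extract_record(fasta_lines, read_id):
    """Return (header, sequence) for the first record whose first header
    token equals read_id, or None if no such record exists.

    fasta_lines can be any iterable of lines, e.g. an open file handle.
    """
    header, chunks = None, []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                # The wanted record ended at this new header.
                return header, "".join(chunks)
            tokens = line[1:].split()
            first_token = tokens[0] if tokens else ""
            header = line[1:] if first_token == read_id else None
            chunks = []
        elif header is not None:
            # Sequence line belonging to the wanted record.
            chunks.append(line.strip())
    if header is not None:
        return header, "".join(chunks)
    return None
```

Because the file is read line by line and the search stops at the first match, the whole read set never has to be held in memory.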

Department of Biological Sciences
Iacocca Hall, 111 Research Drive
Bethlehem, PA 18015