
Overview

Using the long-read assembly of the Mus musculus genome from Ton et al., Sci. Data 2020 (PMID: 33203859; run SRR11606870), with the major and minor satellite reads from Packiaraj & Thakur, Genome Biol. 2024 (PMID: 38378611) as templates, the scripts compute:

  1. The number of contiguous A/T stretches (default minimum length = 4).
  2. The number of tetranucleotides associated with a narrow minor groove (per Rohs et al., Nature 2009; PMID: 19865164).
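
The two quantities above can be sketched in a few lines of Python. This is an illustrative sketch, not the code in majsat.py or minsat.py: the function names are hypothetical, an A/T stretch is assumed to be a maximal run of A and/or T bases, and tetranucleotides are assumed to be counted as overlapping matches on the given strand only. The ten tetranucleotides are the ones listed under "Expected output files".

```python
import re

# Tetranucleotides in the column order given under "Expected output files".
TETRANUCLEOTIDES = ["AAAT", "AATA", "AATC", "AATT", "AAAA",
                    "AAGT", "GAAT", "GAAA", "TAAT", "AAAC"]


def count_at_stretches(seq, min_length=4):
    """Count maximal runs of A/T bases at least min_length long.

    Assumption: a stretch may mix A and T (e.g. ATTA counts as length 4).
    """
    pattern = re.compile(r"[AT]{%d,}" % min_length)
    return len(pattern.findall(seq.upper()))


def count_tetranucleotides(seq):
    """Count overlapping occurrences of each listed tetranucleotide.

    Assumption: only the given strand is scanned (no reverse complement).
    """
    seq = seq.upper()
    counts = {tetra: 0 for tetra in TETRANUCLEOTIDES}
    for i in range(len(seq) - 3):
        window = seq[i:i + 4]
        if window in counts:
            counts[window] += 1
    return counts
```

For example, count_at_stretches("AAATGCATTTTA") finds two stretches (AAAT and ATTTTA), and passing min_length=5 would keep only the second.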

Requirements

Tools to download

(Recommended: terminal / bash.)

  1. Download Miniconda.
  2. Install BLAST via Miniconda — see anaconda.org/bioconda/blast.
  3. Install sra-tools. Useful guides: fasterq-dump wiki · fastq-dump blog.
  4. (Recommended) PyCharm 2023.2.5 Community Edition for running majsat.py and minsat.py.

Prepare the dataset

(Recommended: terminal / bash.)

  1. Download SRR11606870 from NCBI SRA.
  2. Prefetch the dataset:
    prefetch SRR11606870
  3. Extract all reads in unsorted FASTA format (replace pathtoSRAfile with the path to the prefetched SRA file):
    fasterq-dump pathtoSRAfile --outdir pathtoSRAfile/fasta --fasta-unsorted
  4. Rename the output to SRR11606870.fasta.
  5. Copy the two template FASTA files (major satellite SRR11606870_2342980.fasta and minor satellite SRR11606870_111923.fasta) from the repo into the same folder as SRR11606870.fasta.
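
Before running the scripts, it can help to confirm that the three FASTA files sit in one folder and parse cleanly. The sketch below is not part of the repository: read_fasta and missing_inputs are hypothetical helper names, and the parser assumes plain, uncompressed FASTA text.

```python
from pathlib import Path

# File names taken from the steps above.
EXPECTED = ["SRR11606870.fasta",
            "SRR11606870_2342980.fasta",
            "SRR11606870_111923.fasta"]


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file, one record at a time."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def missing_inputs(folder, names=EXPECTED):
    """Return the expected file names that are not present in folder."""
    return [name for name in names if not (Path(folder) / name).is_file()]
```

Because read_fasta is a generator, it streams the large SRR11606870.fasta file record by record instead of loading it into memory at once.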

Run the scripts

(In PyCharm 2023.2.5 Community Edition.)

  1. Clone the GitHub repository: Git → Clone → URL https://github.com/DDudka9/DNA-shape.git
  2. Create a new interpreter: Python interpreter (bottom right) → Add New Interpreter → Add Local Interpreter → Conda Environment → Create New Environment (provide the path to Miniconda; select Python 3.11).
  3. Select requirements.txt and click Install requirements.
  4. Follow the instructions inside majsat.py and minsat.py. Run the scripts one part at a time in PyCharm's built-in Python Console: copy each part into the console in order and press Return.
  5. Output appears in the folder containing SRR11606870_2342980.fasta and SRR11606870_111923.fasta.

Expected output files

Tetranucleotide column order in all CSV/FASTA outputs: AAAT · AATA · AATC · AATT · AAAA · AAGT · GAAT · GAAA · TAAT · AAAC.

  1. SRR11606870_Maj_2342980_tetranucleotides.csv: per-1 kb counts of narrow-minor-groove tetranucleotides along a representative major satellite array (SRR11606870_2342980).
  2. SRR11606870_Min_111923_tetranucleotides.csv: per-1 kb counts of narrow-minor-groove tetranucleotides along a representative minor satellite array (SRR11606870_111923).
  3. SRR11606870_Maj_tetranucleotides_average.fasta: narrow-minor-groove tetranucleotide counts per 234 bp across 500 major satellite arrays (averages at the end of the file).
  4. SRR11606870_Min_tetranucleotides_average.fasta: the same counts across 500 minor satellite arrays.
  5. SRR11606870_Maj_ATstretches_average.fasta: AT-stretch counts (default minimum length 4) per 234 bp across 500 major satellite arrays.
  6. SRR11606870_Min_ATstretches_average.fasta: the same counts across 500 minor satellite arrays.
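
To make the per-1 kb CSV layout concrete, here is a minimal sketch of windowed counting. It is an assumption-laden illustration, not the repository's implementation: windowed_counts and to_csv are hypothetical names, windows are assumed to be non-overlapping, tetranucleotides that span a window boundary are not counted, and the header row is an assumption about the file layout.

```python
import csv
import io

# Column order as stated at the top of this section.
TETRANUCLEOTIDES = ["AAAT", "AATA", "AATC", "AATT", "AAAA",
                    "AAGT", "GAAT", "GAAA", "TAAT", "AAAC"]
INDEX = {tetra: i for i, tetra in enumerate(TETRANUCLEOTIDES)}


def windowed_counts(seq, window=1000):
    """Return one row of tetranucleotide counts per non-overlapping window.

    Matches that straddle two windows are skipped (simplifying assumption).
    """
    seq = seq.upper()
    rows = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        row = [0] * len(TETRANUCLEOTIDES)
        for i in range(len(chunk) - 3):
            tetra = chunk[i:i + 4]
            if tetra in INDEX:
                row[INDEX[tetra]] += 1
        rows.append(row)
    return rows


def to_csv(rows):
    """Render rows as CSV text with the tetranucleotides as a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(TETRANUCLEOTIDES)
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling windowed_counts(sequence, window=234) would give counts per 234 bp instead, matching the window size used in the averaged outputs.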

You can modify the array ID inside the scripts to analyze any other array, or substitute a different dataset entirely.

Department of Biological Sciences
Iacocca Hall, 111 Research Drive
Bethlehem, PA 18015