Marchantia polymorpha Reference Genome Sequences ver. 7.1
Overview
- Reference genome sequences for the male Tak-1 strain (MpTak1_v7.1) and the female Tak-2 strain (MpTak2_v7.1)
- Standard genome sequences for analytical use (MpTak_v7.1) consisting of autosomal sequences from Tak-1 and sex chromosomal sequences (chrU from Tak-2 and chrV from Tak-1).
- Nearly gap-free chromosome-scale assembly, with telomere motifs or rDNA clusters identified at the ends of all chromosomes. The only remaining gaps are one in chr3 of Tak-1 and one in chr4 of Tak-2.
- Gene annotations lifted over from the previous reference genomes. Some gene models were improved through manual curation.
Statistics
MpTak1_v7.1
- Total length: 242,451,485 bp
- Number of sequences: 9 (chr1-8, chrV)
- Number of gene loci: 18,200 (protein coding: 17,944, miRNA: 256)
- Number of coding sequences: 22,172 (including alternative transcripts)
- BUSCO score: 98.8% (genome), 99.6% (proteome)
- Kmer completeness: 99.6%
- Consensus quality value (QV): 82.9
MpTak2_v7.1
- Total length: 241,484,302 bp
- Number of sequences: 9 (chr1-8, chrU)
- Number of gene loci: 18,391 (protein coding: 18,136, miRNA: 255)
- Number of coding sequences: 22,321 (including alternative transcripts)
- BUSCO score: 98.8% (genome), 99.6% (proteome)
- Kmer completeness: 99.6%
- Consensus quality value (QV): 82.9
MpTak_v7.1 (standard genome)
- Total length: 248,042,680 bp
- Number of sequences: 10 (chr1-8, chrU, chrV)
- Number of gene loci: 18,263 (mRNA: 18,007, miRNA: 256)
- Number of coding sequences: 22,248 (including alternative transcripts)
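Because the standard genome is described as the Tak-1 sequences plus chrU from Tak-2, the figures above imply what chrU contributes. The following is a small worked calculation, assuming the standard genome contains exactly the nine Tak-1 sequences plus chrU and nothing else; the derived values are implied by subtraction, not stated on this page, and the variable names are illustrative.

```python
# Figures copied from the statistics listed above.
tak1 = {"length_bp": 242_451_485, "sequences": 9, "gene_loci": 18_200, "cds": 22_172}
standard = {"length_bp": 248_042_680, "sequences": 10, "gene_loci": 18_263, "cds": 22_248}

# Assumption: standard genome = all Tak-1 sequences + chrU (from Tak-2).
# Under that assumption, the difference is the contribution of chrU.
chru_implied = {key: standard[key] - tak1[key] for key in tak1}

for key, value in chru_implied.items():
    print(f"{key}: {value:,}")
```

Under that assumption, chrU accounts for the one additional sequence and for the difference in total length and gene counts between MpTak1_v7.1 and MpTak_v7.1.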
Gene Identifier (MpGene ID)
The MpGene ID system was introduced with the v5.1 genome (the first chromosome-scale genome).
- Format: Mp#g????? (gene ID), Mp#g?????.$ (transcript ID)
#: chromosome number (1-9, U, V), ?????: serial number, $: transcript number
- Example: Mp1g01230, Mp5g00270.2, MpUg00370, MpVg00350.1
The serial number roughly follows the order of appearance along a chromosome, but in some regions the order is locally reversed. This is due to local inversions between the v5.1 and v7.1 genomes (most likely misassemblies in the v5.1 genome).
To ensure continuity, these gene IDs will be carried over to subsequent versions as far as possible in future updates of the genome sequences and gene annotations (permanent IDs).
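The ID format described above can be checked mechanically. The following is a minimal sketch, assuming a five-digit serial number (as in the examples) and a purely numeric transcript number. The names `MPGENE_RE`, `MpGeneID` and `parse_mpgene_id` are illustrative and not part of any official tool. IDs containing an underscore are deliberately rejected by this pattern.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: "Mp", chromosome (1-9, U or V), "g", five digits,
# and an optional ".N" transcript suffix.
MPGENE_RE = re.compile(
    r"^Mp(?P<chrom>[1-9UV])g(?P<serial>\d{5})(?:\.(?P<transcript>\d+))?$"
)


class MpGeneID(NamedTuple):
    chromosome: str            # "1".."9", "U" or "V"
    serial: int                # serial number within the chromosome
    transcript: Optional[int]  # None for a gene ID without transcript suffix


def parse_mpgene_id(text: str) -> MpGeneID:
    """Split a permanent MpGene ID or transcript ID into its parts."""
    match = MPGENE_RE.match(text)
    if match is None:
        raise ValueError(f"not a permanent MpGene ID: {text!r}")
    transcript = match.group("transcript")
    return MpGeneID(
        chromosome=match.group("chrom"),
        serial=int(match.group("serial")),
        transcript=int(transcript) if transcript is not None else None,
    )


for example in ["Mp1g01230", "Mp5g00270.2", "MpUg00370", "MpVg00350.1"]:
    print(example, parse_mpgene_id(example))
```

Since serial numbers only roughly follow chromosomal order, sorting by the parsed serial number should not be treated as a substitute for sorting by genomic coordinates.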
Provisional gene ID
Example: Mp3g09300_L1, Mp3g09295_R2, Mp2g90010_P
Gene IDs containing an underscore ‘_’ are provisional IDs, which will not be carried over in future updates.
In the Tak-1 genome, tandemly repeated genes are encoded in the region around the assembly gap in chr3. Genes to the left of the gap are assigned gene IDs with the suffix L1, L2… and genes to the right of the gap are assigned gene IDs with the suffix R1, R2… Gene IDs sharing the same serial number (e.g. Mp3g09300_L1 and Mp3g09300_R1) are paralogues of each other.
The gene annotations for the Tak-2 genome were mostly transferred semi-automatically from the Tak-1 genome. A small number of genes were newly annotated, either by mapping known transcript sequences or as extra copies of known genes. These newly annotated genes are assigned provisional IDs with the suffix ‘_P’.
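The provisional suffixes described above can be separated from the base ID with a short pattern. This is a sketch under stated assumptions: the base part is assumed to follow the same `Mp#g?????` layout as permanent IDs, and only the three suffix forms mentioned here (`_L<n>`, `_R<n>`, `_P`) are recognised. The function names are hypothetical, and transcript-level IDs of provisional genes are not handled.

```python
import re
from typing import Optional, Tuple

# Assumed layout: permanent-style base ID, underscore, then L<n>, R<n> or P.
PROVISIONAL_RE = re.compile(
    r"^(?P<base>Mp[1-9UV]g\d{5})_(?:(?P<side>[LR])(?P<copy>\d+)|(?P<new>P))$"
)


def split_provisional(gene_id: str) -> Optional[Tuple[str, str, Optional[int]]]:
    """Return (base_id, kind, copy_number), or None if the ID is not provisional.

    kind is "L" or "R" for genes left/right of the Tak-1 chr3 gap,
    and "P" for genes newly annotated in Tak-2 (copy_number is None).
    """
    match = PROVISIONAL_RE.match(gene_id)
    if match is None:
        return None
    if match.group("new"):
        return match.group("base"), "P", None
    return match.group("base"), match.group("side"), int(match.group("copy"))


def share_serial_around_gap(id_a: str, id_b: str) -> bool:
    """True if both IDs are L/R provisional IDs with the same base ID."""
    a, b = split_provisional(id_a), split_provisional(id_b)
    if a is None or b is None:
        return False
    return a[0] == b[0] and a[1] in "LR" and b[1] in "LR" and id_a != id_b


print(split_provisional("Mp3g09300_L1"))
print(split_provisional("Mp2g90010_P"))
print(share_serial_around_gap("Mp3g09300_L1", "Mp3g09300_R1"))
```

A filter of this kind may be useful for flagging IDs that should not be relied on as stable keys across genome versions.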
Note
Up to the v6.1 genome, genes located on unplaced scaffolds were assigned gene IDs beginning with Mpzg (e.g. Mpzg00100). As of the v7.1 genome, all of them have been placed in one of the chromosomal sequences and reassigned gene IDs. See the gene ID correspondence table.
For example, Mpzg01410 has been reassigned as Mp2g20695.
INSDC
The genome sequences are also available from INSDC (DDBJ/ENA/GenBank). Due to technical limitations, identical CDSs derived from alternative transcripts are annotated as misc_feature, so the numbers of CDSs and mRNAs do not match.
Each gene locus is assigned an identifier (locus_tag) such as 'MPTK1_1g00200' in the Tak-1 genome or 'MPTK2_Ug00200' in the Tak-2 genome, which correspond to the MpGene IDs Mp1g00200 and MpUg00200, respectively.
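The correspondence between locus_tags and MpGene IDs can be expressed as a simple string conversion. The sketch below is inferred only from the two examples given here: it assumes the prefix `MPTK1_` or `MPTK2_` is followed by the MpGene ID without its leading `Mp`. The function name is hypothetical, and how provisional IDs appear in locus_tags is not stated on this page, so they are not handled.

```python
import re
from typing import Tuple

# Assumed layout, based on the examples 'MPTK1_1g00200' and 'MPTK2_Ug00200'.
LOCUS_TAG_RE = re.compile(r"^MPTK(?P<strain>[12])_(?P<rest>[1-9UV]g\d{5})$")


def locus_tag_to_mpgene(locus_tag: str) -> Tuple[str, str]:
    """Convert an INSDC locus_tag into (strain name, MpGene ID)."""
    match = LOCUS_TAG_RE.match(locus_tag)
    if match is None:
        raise ValueError(f"unrecognised locus_tag: {locus_tag!r}")
    strain = f"Tak-{match.group('strain')}"
    return strain, "Mp" + match.group("rest")


print(locus_tag_to_mpgene("MPTK1_1g00200"))
print(locus_tag_to_mpgene("MPTK2_Ug00200"))
```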
Tak-1
- INSDC accession numbers: AP031342-AP031350
- BioProject: PRJDB16711
- BioSample: SAMD00647143
- SRA raw read sequences: DRR504758
- Assembly: GCA_037833805.1
Tak-2
- INSDC accession numbers: AP031351-AP031359
- BioProject: PRJDB16711
- BioSample: SAMD00647144
- SRA raw read sequences: DRR504759
- Assembly: GCA_037833965.1
Tak-1 + chrU (Tak-2) + MT + CP
Method
Genome sequencing
- PacBio HiFi technology using PacBio Sequel-II
- Genome coverage: 160x
Genome assembly
- De novo assembly using hifiasm ver. 0.19.5
- Error correction using nextpolish2 ver. 0.1.1
- Reads mapped to chloroplast and mitochondrion sequences were removed before conducting de novo assembly.
- Short contigs possibly derived from chloroplast, mitochondrion, and rDNA clusters were removed.
Gene annotation
- MpTak1_v7.1
  1. Gene annotations were lifted over from the v6.1 genome (MpTak_v6.1r2) using liftoff ver. 1.6.3.
  2. Full-length transcript sequences (PacBio Iso-seq) were mapped using GMAP ver. 2023.10.10.
  3. De novo gene prediction was performed using GINGER ver. 1.0.1.
  4. Newly identified genes from steps 2 and 3 were merged with the annotations from step 1 after manual curation.
- MpTak2_v7.1
  1. Gene annotations were lifted over from MpTak1_v7.1 using liftoff ver. 1.6.3.
  2. Full-length transcript sequences (PacBio Iso-seq) were mapped using GMAP ver. 2023.10.10.
  3. Extra copies of existing genes from step 1 and newly identified genes were assigned provisional gene IDs and incorporated into the final annotation after manual curation.
Quality assessment
- Genome/annotation completeness: BUSCO v. 5.6.0
- Kmer-based completeness and quality value: Merqury v. 2020-01-29