Task 4 - Ensembl Plants Query

This task shows how to retrieve gene names, descriptions and orthologs from the Ensembl Plants database using R from the command line. The provided Ensembl_Plants_Query.R script also lets the user retrieve this information with predefined queries. Alternatively, the task can be performed manually on the Ensembl Plants website using the BioMart query tool.

The following steps require R (version 4.1+). If it is not already installed, download and install R from the R Project website (https://www.r-project.org/).

Step-by-Step Single Gene Query

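#Install (if needed) and load biomaRt, the R interface to Ensembl Plants BioMart
#biomaRt is distributed through Bioconductor, so BiocManager is installed first if missing
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
if (!requireNamespace("biomaRt", quietly = TRUE)) BiocManager::install("biomaRt")
library(biomaRt)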

#Connect to the Ensembl Plants Database
plant_ensembl <- useEnsemblGenomes(biomart="plants_mart")

#List the available plant datasets
All_Datasets <- listDatasets(plant_ensembl)
View(All_Datasets)

#Connect to a specific species dataset using the following syntax: first letter of the genus + species name + "_eg_gene" (e.g. "qsuber_eg_gene")
#In this case we are connecting to the cork oak dataset
plant_mart <- useEnsemblGenomes(biomart="plants_mart", dataset= "qsuber_eg_gene")

#Check the available list of Dataset attributes/features
listAttributes(plant_mart)
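
The dataset naming rule described in the comments above can be captured in a small helper. This is only a minimal sketch: `make_dataset_name` is a hypothetical name and not part of biomaRt, and some datasets (for example cultivar-specific assemblies) do not follow the simple pattern, so the result should be checked against the output of `listDatasets()`.

```r
# Hypothetical helper (not part of biomaRt): build an Ensembl Plants dataset
# name from a binomial species name, following the pattern
# "first letter of the genus + species name + _eg_gene".
make_dataset_name <- function(species) {
  parts <- strsplit(tolower(trimws(species)), "\\s+")[[1]]
  stopifnot(length(parts) >= 2)
  paste0(substr(parts[1], 1, 1), parts[2], "_eg_gene")
}

make_dataset_name("Quercus suber")
```

With a live connection, `make_dataset_name("Quercus suber") %in% All_Datasets$dataset` confirms whether the generated name actually exists before it is passed to `useEnsemblGenomes()`.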

Note

In cork oak’s case, all gene IDs must have a “gene-” prefix for a successful query.
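
Because input lists may mix prefixed and unprefixed IDs, it can help to add the prefix only where it is missing. The helper below is a sketch with a hypothetical name (`add_gene_prefix`); it uses base R only and assumes the prefix is exactly "gene-", as stated in the note.

```r
# Hypothetical helper: add the "gene-" prefix only to IDs that lack it,
# so that already-prefixed IDs are not prefixed twice.
add_gene_prefix <- function(ids, prefix = "gene-") {
  ids <- trimws(ids)  # drop stray whitespace from file input
  ifelse(startsWith(ids, prefix), ids, paste0(prefix, ids))
}

add_gene_prefix(c("CFP56_57155", "gene-CFP56_18234"))
```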

#Obtaining gene descriptions for a single gene
single_description <- getBM(attributes = c('ensembl_gene_id', 'ensembl_peptide_id', 'description'), filters = 'ensembl_gene_id', values = "gene-CFP56_23532", mart = plant_mart)
single_description

ensembl_gene_id	ensembl_peptide_id	description
gene-CFP56_23532	cds-POE85947.1	NA

ensembl_gene_id - Gene stable ID
ensembl_peptide_id - Protein stable ID
description - Gene description

Step-by-Step Multiple Genes Query

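#Install (if needed) and load biomaRt, the R interface to Ensembl Plants BioMart
#biomaRt is distributed through Bioconductor, so BiocManager is installed first if missing
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
if (!requireNamespace("biomaRt", quietly = TRUE)) BiocManager::install("biomaRt")
library(biomaRt)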

#Connect with the Ensembl Plants Database
plant_ensembl <- useEnsemblGenomes(biomart="plants_mart")

#Connect to the cork oak dataset
plant_mart <- useEnsemblGenomes(biomart="plants_mart", dataset= "qsuber_eg_gene")

#Read Node Table (contains cork oak Gene IDs)
genes <- readLines("corkoak_node.csv")
#OR
#Define a cork oak gene list
genes <- c("CFP56_57155", "CFP56_18234", "CFP56_55251")

#Modifying all Gene IDs to have the "gene-" prefix
new_genes <- paste0("gene-", genes)

#Obtaining multiple gene descriptions at once
#In this case, instead of specifying a single gene ID in the values field, we provide a gene list (new_genes)
multiple_descriptions <- getBM(attributes=c('ensembl_gene_id', 'ensembl_peptide_id', 'description'),filters = 'ensembl_gene_id', values = new_genes, mart = plant_mart)
multiple_descriptions

ensembl_gene_id	ensembl_peptide_id	description
gene-CFP56_57155	cds-POE85887.1	NA
gene-CFP56_18234	cds-POE60447.1	NA
gene-CFP56_55251	cds-POF01545.1	NA
…	…	…
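
Since `getBM()` returns rows only for the IDs it finds, misspelled or unprefixed IDs drop out of the result without an error. The sketch below shows one way to list them; the `result` data frame is a mock stand-in for `multiple_descriptions` (values copied from the output above), and the last queried ID is deliberately made up.

```r
# Mock stand-in for the data frame returned by getBM();
# in practice use `multiple_descriptions` instead.
result <- data.frame(
  ensembl_gene_id    = c("gene-CFP56_57155", "gene-CFP56_18234"),
  ensembl_peptide_id = c("cds-POE85887.1", "cds-POE60447.1"),
  description        = c(NA, NA),
  stringsAsFactors   = FALSE
)

# Queried IDs; the last one is a deliberately invalid, made-up ID
queried <- c("gene-CFP56_57155", "gene-CFP56_18234", "gene-NOT_A_REAL_ID")

# IDs that were queried but are absent from the result
missing_ids <- setdiff(queried, result$ensembl_gene_id)
missing_ids
```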

Note

The description column is NA because no gene descriptions were found in the Ensembl Plants database, owing to the lack of available cork oak annotations.

More columns can be retrieved by specifying more attributes within the attributes field.

Some additional attributes of interest include:

ensembl_transcript_id - Transcript stable ID

ensembl_exon_id - Exon stable ID

chromosome_name - Chromosome/scaffold name

start_position - Gene start (bp)

end_position - Gene end (bp)

strand - Strand

band - Karyotype band

transcript_start - Transcript start (bp)

transcript_end - Transcript end (bp)

transcription_start_site - Transcription start site (TSS)

transcript_length - Transcript length (including UTRs and CDS)

transcript_is_canonical - Ensembl Canonical

transcript_count - Transcript count

percentage_gene_gc_content - Gene % GC content

gene_biotype - Gene type

transcript_biotype - Transcript type

source - Source (gene)

transcript_source - Source (transcript)

Note

For a complete attribute list, run the following:

All_attributes <- listAttributes(plant_mart)
View(All_attributes)

Step-by-Step Multiple Genes Query - Annotations

#The same syntax applies: changing the attributes in the attributes field changes the information retrieved
#In this case, to obtain gene annotations (GO terms) and the corresponding descriptions for a list of genes, run:
gene_annotations <- getBM(attributes=c('ensembl_gene_id','go_id','name_1006'),filters = 'ensembl_gene_id', values = new_genes, mart = plant_mart)
gene_annotations

ensembl_gene_id	go_id	name_1006
gene-CFP56_18234	GO:0009834	plant-type secondary cell wall biogenesis
gene-CFP56_18234	GO:0010417	glucuronoxylan biosynthetic process
…	…	…
gene-CFP56_55251	GO:0010215	cellulose microfibril organization
gene-CFP56_55251	GO:0031225	anchored component of membrane
…	…	…
gene-CFP56_57155

ensembl_gene_id - Gene stable ID
go_id - GO term accession
name_1006 - GO term name

According to the retrieved annotations, the queried cork oak genes are putatively related to plant growth. This is suggested by their involvement in the biosynthesis of glucuronoxylan, the most common hemicellulose in hardwood trees, and by their role in secondary cell wall organization, which is essential for secondary growth.
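
The annotation table has one row per gene and GO term. To obtain one row per gene, the terms can be collapsed into a single string, similar to the `go_id_description` column produced by the predefined script. The sketch below is one possible approach and not necessarily how the script does it; the data frame is a mock stand-in for `gene_annotations` built from values shown in this task.

```r
# Mock stand-in for `gene_annotations`; the last gene has no GO term
gene_annotations <- data.frame(
  ensembl_gene_id = c("gene-CFP56_18234", "gene-CFP56_18234",
                      "gene-CFP56_55251", "gene-CFP56_57155"),
  go_id     = c("GO:0009834", "GO:0010417", "GO:0010215", ""),
  name_1006 = c("plant-type secondary cell wall biogenesis",
                "glucuronoxylan biosynthetic process",
                "cellulose microfibril organization", ""),
  stringsAsFactors = FALSE
)

# Keep only rows that carry a GO term (treat both NA and "" as missing)
has_go <- !is.na(gene_annotations$go_id) & gene_annotations$go_id != ""
annotated <- gene_annotations[has_go, ]

# Join accession and name, then collapse all terms of a gene into one string
annotated$term <- paste(annotated$go_id, annotated$name_1006, sep = "_")
collapsed <- aggregate(term ~ ensembl_gene_id, data = annotated,
                       FUN = paste, collapse = " | ")
collapsed
```

Genes without any GO term are dropped by the filter, so they need to be merged back (for example with `merge(..., all.x = TRUE)`) if every input gene should appear in the final table.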

Other available annotation attributes include:

definition_1006 - GO term definition

go_linkage_type - GO term evidence code

namespace_1003 - GO domain

goslim_goa_accession - GOSlim GOA Accession(s)

goslim_goa_description - GOSlim GOA Description

embl - European Nucleotide Archive ID

uniparc - UniParc ID

uniprotswissprot - UniProtKB/Swiss-Prot ID

pfam - Pfam ID

scanprosite - PROSITE patterns ID

superfamily - Superfamily ID

tigrfam - TIGRFAM ID

interpro - Interpro ID

interpro_short_description - Interpro Short Description

interpro_description - Interpro Description

Step-by-Step Multiple Genes Query - Homologs

Gene homologs can be retrieved for most plant species using the following attribute syntax:

[first letter of the genus + species name + “_eg_homolog_ensembl_gene”] (e.g. “athaliana_eg_homolog_ensembl_gene”)

Note

For a complete list of all plant species available for homolog query, run the following:

All_attributes <- listAttributes(plant_mart)
View(All_attributes)
#Scroll down to the *Homologs* section. Every line containing _eg_homolog_ensembl_gene is an available species for query
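
Instead of scrolling, the attribute table can be filtered by name. The sketch below uses a small mock table (`attrs_mock`, a hypothetical name) with the same `name` and `description` columns that `listAttributes()` returns; with a live connection, replace it with `listAttributes(plant_mart)`.

```r
# Mock stand-in for listAttributes(plant_mart), reduced to a few rows
attrs_mock <- data.frame(
  name = c("ensembl_gene_id",
           "athaliana_eg_homolog_ensembl_gene",
           "athaliana_eg_homolog_associated_gene_name",
           "alyrata_eg_homolog_ensembl_gene"),
  description = c("Gene stable ID",
                  "Arabidopsis thaliana gene stable ID",
                  "Arabidopsis thaliana gene name",
                  "Arabidopsis lyrata gene stable ID"),
  stringsAsFactors = FALSE
)

# Keep attributes whose name ends in "_eg_homolog_ensembl_gene"
is_homolog <- grepl("_eg_homolog_ensembl_gene$", attrs_mock$name)
homolog_species <- attrs_mock[is_homolog, ]

# Strip the suffix to recover the species prefix (e.g. "athaliana")
homolog_species$prefix <- sub("_eg_homolog_ensembl_gene$", "",
                              homolog_species$name)
homolog_species[, c("prefix", "description")]
```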

Gathering Arabidopsis thaliana Homologs:

#Gathering Arabidopsis thaliana homologs
gene_athaliana_homologs <- getBM(attributes=c('ensembl_gene_id','athaliana_eg_homolog_ensembl_gene','athaliana_eg_homolog_associated_gene_name'),filters = 'ensembl_gene_id', values = new_genes, mart = plant_mart)
View(gene_athaliana_homologs)

ensembl_gene_id	athaliana_eg_homolog_ensembl_gene	athaliana_eg_homolog_associated_gene_name
gene-CFP56_18234	AT1G27440	GUT2
gene-CFP56_55251	AT5G15630	IRX6
gene-CFP56_57155	AT5G60490	FLA12
…	…	…

ensembl_gene_id - Gene stable ID
athaliana_eg_homolog_ensembl_gene - Arabidopsis thaliana gene stable ID
athaliana_eg_homolog_associated_gene_name - Arabidopsis thaliana gene name

In addition to Arabidopsis thaliana, homologs can be retrieved for the following species (this list may be outdated):

achinensis_eg_homolog_ensembl_gene - Actinidia chinensis gene stable ID

atauschii_eg_homolog_ensembl_gene - Aegilops tauschii gene stable ID

atrichopoda_eg_homolog_ensembl_gene - Amborella trichopoda gene stable ID

acomosus_eg_homolog_ensembl_gene - Ananas comosus gene stable ID

ahalleri_eg_homolog_ensembl_gene - Arabidopsis halleri gene stable ID

alyrata_eg_homolog_ensembl_gene - Arabidopsis lyrata gene stable ID

aalpina_eg_homolog_ensembl_gene - Arabis alpina gene stable ID

aofficinalis_eg_homolog_ensembl_gene - Asparagus officinalis gene stable ID

asot3098_eg_homolog_ensembl_gene - Avena sativa OT3098 gene stable ID

assang_eg_homolog_ensembl_gene - Avena sativa Sang gene stable ID

bvulgaris_eg_homolog_ensembl_gene - Beta vulgaris gene stable ID

bdistachyon_eg_homolog_ensembl_gene - Brachypodium distachyon gene stable ID

bjuncea_eg_homolog_ensembl_gene - Brassica juncea gene stable ID

bnapus_eg_homolog_ensembl_gene - Brassica napus gene stable ID

boleracea_eg_homolog_ensembl_gene - Brassica oleracea gene stable ID

brro18_eg_homolog_ensembl_gene - Brassica rapa R-o-18 gene stable ID

ccajan_eg_homolog_ensembl_gene - Cajanus cajan (pigeon pea) - GCA_000340665.1 gene stable ID

csativa_eg_homolog_ensembl_gene - Camelina sativa gene stable ID

csfemale_eg_homolog_ensembl_gene - Cannabis sativa female gene stable ID

cannuum_eg_homolog_ensembl_gene - Capsicum annuum gene stable ID

cbraunii_eg_homolog_ensembl_gene - Chara braunii gene stable ID

cquinoa_eg_homolog_ensembl_gene - Chenopodium quinoa gene stable ID

…

Predefined Queries with Ensembl_Plants_Query.R

This script takes a few input arguments and produces an output table with the following format:

geneID	ensembl_peptide_id	description	go_id_description	athaliana_eg_homolog_gene	athaliana_eg_homolog_associated_gene_name	athaliana_eg_homolog_perc_identity
CFP56_18234	POE85887.1	NA	GO:0016757_glycosyltransferase activity | GO:0006486_protein glycosylation | GO:0009834_plant-type secondary cell wall biogenesis | GO:0010417_glucuronoxylan biosynthetic process | GO:0047517_1,4-beta-D-xylan synthase activity | GO:0080116_glucuronoxylan glucuronosyltransferase activity	AT1G27440	GUT2	87.3786
CFP56_55251	POE60447.1	NA	GO:0010215_cellulose microfibril organization | GO:0031225_anchored component of membrane | GO:0005886_plasma membrane | GO:0009834_plant-type secondary cell wall biogenesis	AT5G15630	IRX6	75.6381
CFP56_57155	POF01545.1	NA	_	AT5G60490	FLA12	61.0442
…	…	…	…	…	…	…

geneID - input gene ID
ensembl_peptide_id - Protein stable ID
description - Gene description
go_id_description - list of GO terms and their descriptions associated with a given gene
athaliana_eg_homolog_gene - Arabidopsis thaliana homolog ID (when several homologs exist, the one with the highest % identity to the query gene is kept)
athaliana_eg_homolog_associated_gene_name - Arabidopsis thaliana gene name
athaliana_eg_homolog_perc_identity - % identity of the query gene with the target Arabidopsis thaliana gene
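
The "highest % identity" filter described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the script's actual implementation: the first three rows reuse values from the example table, while the fourth row (`AT_HYPOTHETICAL`, 50.0) is made up purely to demonstrate the filtering step.

```r
# Mock homolog table; the LAST ROW IS MADE UP for illustration only
homologs <- data.frame(
  geneID = c("CFP56_18234", "CFP56_55251", "CFP56_57155", "CFP56_18234"),
  athaliana_eg_homolog_gene = c("AT1G27440", "AT5G15630", "AT5G60490",
                                "AT_HYPOTHETICAL"),
  athaliana_eg_homolog_perc_identity = c(87.3786, 75.6381, 61.0442, 50.0),
  stringsAsFactors = FALSE
)

# Sort by gene and descending identity, then keep the first row per gene
ord <- order(homologs$geneID, -homologs$athaliana_eg_homolog_perc_identity)
sorted <- homologs[ord, ]
best <- sorted[!duplicated(sorted$geneID), ]
best
```

Sorting first and then dropping duplicates keeps exactly one homolog per gene, so ties are resolved by input order rather than reported.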

This script receives the following mandatory arguments:

Gene ID list (-g; .csv or .txt format, one gene ID per line)
Species name (-s; e.g. qsuber)

And optional arguments:

Output name and format (-o; e.g. qsuber_annotated.csv)

Example of use:

#Within the command line:
Rscript Ensembl_Plants_Query.R -g corkoak_nodes.csv -s qsuber -o qsuber_annotated
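
For readers writing a similar script, the flags used in the command above could be read with base R as sketched below. This is a hypothetical illustration and not the code of Ensembl_Plants_Query.R: `parse_args` and the default output name "output" are assumptions.

```r
# Hypothetical sketch: parse "-g", "-s" and "-o" from a character vector of
# arguments, such as the one returned by commandArgs(trailingOnly = TRUE).
parse_args <- function(args) {
  get_value <- function(flag, default = NULL) {
    pos <- match(flag, args)
    if (is.na(pos) || pos == length(args)) return(default)
    args[pos + 1]
  }
  opts <- list(genes   = get_value("-g"),
               species = get_value("-s"),
               output  = get_value("-o", default = "output"))  # assumed default
  if (is.null(opts$genes) || is.null(opts$species)) {
    stop("Both -g (gene ID list) and -s (species name) are required")
  }
  opts
}

opts <- parse_args(c("-g", "corkoak_nodes.csv", "-s", "qsuber",
                     "-o", "qsuber_annotated"))
opts$species
```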

Congratulations, this task concludes the present use case. Further questions or recommendations can be submitted to: hugo.miguelr99@gmail.com.