Task 4 - Ensembl Plants Query
This task will guide the user on how to retrieve gene names, descriptions and orthologs from the Ensembl Plants database using R within the command line. The provided Ensembl_Query.R script also allows the user to retrieve information with predefined queries. Alternatively, this task can be manually performed in the Ensembl Plants website using the BioMart Query tool.
The following steps require R (version 4.1+) to be installed. If not already installed, please download and install R here.
Step-by-Step Single Gene Query
#Install and load the required Ensembl Plants - BioMart R package
if (!require("biomaRt")) BiocManager::install("biomaRt")
library(biomaRt)
#Connect to the Ensembl Plants Database
plant_ensembl <- useEnsemblGenomes(biomart="plants_mart")
#Check the available list of Plant Datasets available
All_Datasets <- listDatasets(plant_ensembl)
View(All_Datasets)
#Connect to a specific species dataset using the following syntax: "Genus 1º letter + Species name + _eg_gene" (e.g. "qsuber_eg_gene")
#In this case we are connecting to the cork oak dataset
plant_mart <- useEnsemblGenomes(biomart="plants_mart", dataset= "qsuber_eg_gene")
#Check the available list of Dataset attributes/features
listAttributes(plant_mart)
Note
In cork oak’s case, all gene ID’s must have a “gene-” prefix for a sucessfull query
#Obtaining gene descriptions for a single gene
single_description <- getBM(attributes=c('ensembl_gene_id', 'ensembl_peptide_id','description'),filters = 'ensembl_gene_id', values = "gene-CFP56_45155", mart = plant_mart)
single_description
ensembl_gene_id |
ensembl_peptide_id |
description |
|---|---|---|
gene-CFP56_23532 |
cds-POE85947.1 |
NA |
ensembl_gene_id- Gene stable IDensembl_peptide_id- Protein stable IDdescription- Gene description
Step-by-Step Multiple Genes Query
#Install and load the required Ensembl Plants API R package
if (!require("biomaRt")) BiocManager::install("biomaRt")
library(biomaRt)
#Connect with the Ensembl Plants Database
plant_ensembl <- useEnsemblGenomes(biomart="plants_mart")
#Connect to the cork oak dataset
plant_mart <- useEnsemblGenomes(biomart="plants_mart", dataset= "qsuber_eg_gene")
#Read Node Table (contains cork oak Gene IDs)
genes <- readLines("corkoak_node.csv")
#OR
Define a cork oak gene list
genes <- c("CFP56_57155", "CFP56_18234", "CFP56_55251")
#Modifying all Gene IDs to have the "gene-" prefix
new_genes <- paste0("gene-", genes)
#Obtaining multiple gene descriptions at once
#In this case, instead of specifying a single gene ID in the *values field*, we provide a gene list (new_genes)
multiple_descriptions <- getBM(attributes=c('ensembl_gene_id', 'ensembl_peptide_id', 'description'),filters = 'ensembl_gene_id', values = new_genes, mart = plant_mart)
multiple_descriptions
ensembl_gene_id |
ensembl_peptide_id |
description |
|---|---|---|
gene-CFP56_57155 |
cds-POE85887.1 |
NA |
gene-CFP56_18234 |
cds-POE60447.1 |
NA |
gene-CFP56_55251 |
cds-POF01545.1 |
NA |
… |
… |
… |
Note
Available descriptions were not found within the Ensembl Plants database due to a lack of available cork oak annotations.
ensembl_transcript_id- Transcript stable ID
ensembl_exon_id- Exon stable ID
chromosome_name- Chromosome/scaffold name
start_position- Gene start (bp)
end_position- Gene end (bp)
strand- Strand
band- Karyotype band
transcript_start- Transcript start (bp)
transcript_end- Transcript end (bp)
transcription_start_site- Transcription start site (TSS)
transcript_length- Transcript length (including UTRs and CDS)
transcript_is_canonical- Ensembl Canonical
transcript_count- Transcript count
percentage_gene_gc_content- Gene % GC content
gene_biotype- Gene type
transcript_biotype- Transcript type
source- Source (gene)
transcript_source- Source (transcript)
Note
For a complete attribute list, run the following:
All_attributes <- listAttributes(plant_mart)
View(All_attributes)
Step-by-Step Multiple Genes Query - Annotations
#Following the same syntax, by changing attributes within the attribute field, the retrieved information will be different
#In this case, to obtain gene annotations (GO:Terms) and correspondent descriptions for a list of genes, run:
gene_annotations <- getBM(attributes=c('ensembl_gene_id','go_id','name_1006'),filters = 'ensembl_gene_id', values = new_genes, mart = plant_mart)
gene_annotations
ensembl_gene_id |
go_id |
name_1006 |
|---|---|---|
gene-CFP56_18234 |
GO:0009834 |
plant-type secondary cell wall biogenesis |
gene-CFP56_18234 |
GO:0010417 |
glucuronoxylan biosynthetic process |
… |
… |
… |
gene-CFP56_55251 |
GO:0009834 |
cellulose microfibril organization |
gene-CFP56_55251 |
GO:0009834 |
anchored component of membrane |
… |
… |
… |
gene-CFP56_57155 |
ensembl_gene_id- Gene stable IDgo_id- GO term accessionname_1006- GO term name
According to the retrieved annotations, we observe that the queried cork oak genes are putatively related with plant growth, apparent by their activity on glucuronoxylan biosynthesis, the most common hemicellulose found on hardwood trees, and their role in secondary wall organization, essential for the tree secondary growth development.
Some available atributes regarding Annotation, in addition to the previous, include:
definition_1006- GO term definition
go_linkage_type- GO term evidence code
namespace_1003- GO domain
goslim_goa_accession- GOSlim GOA Accession(s)
goslim_goa_description- GOSlim GOA Description
embl- European Nucleotide Archive ID
uniparc- UniParc ID
uniprotswissprot- UniProtKB/Swiss-Prot ID
pfam- Pfam ID
scanprosite- PROSITE patterns ID
superfamily- Superfamily ID
tigrfam- TIGRFAM ID
interpro- Interpro ID
interpro_short_description- Interpro Short Description
interpro_description- Interpro Description
Step-by-Step Multiple Genes Query - Homologs
Note
For a complete list of all plant species available for homolog query, run the following:
All_attributes <- listAttributes(plant_mart)
View(All_attributes)
#Scroll down to the *Homologs* section. Every line containing _eg_homolog_ensembl_gene is an available species for query
Gathering Arabidopsis thaliana Homologs:
#Gathering Arabidopsis thaliana homologs
gene_athaliana_homologs <- getBM(attributes=c('ensembl_gene_id','athaliana_eg_homolog_ensembl_gene','athaliana_eg_homolog_associated_gene_name'),filters = 'ensembl_gene_id', values = new_genes, mart = plant_mart)
View(gene_athaliana_homologs)
ensembl_gene_id |
athaliana_eg_homolog_ensembl_gene |
athaliana_eg_homolog_associated_gene_name |
|---|---|---|
gene-CFP56_18234 |
AT1G27440 |
GUT2 |
gene-CFP56_55251 |
AT5G15630 |
IRX6 |
gene-CFP56_57155 |
AT5G60490 |
FLA12 |
… |
… |
… |
ensembl_gene_id- Gene stable IDathaliana_eg_homolog_ensembl_gene- Arabidopsis thaliana gene stable IDathaliana_eg_homolog_associated_gene_name- Arabidopsis thaliana gene name
A species list (possibly outdated) which allows homologs retrieval are, in addition to the previous:
achinensis_eg_homolog_ensembl_gene- Actinidia chinensis gene stable ID
atauschii_eg_homolog_ensembl_gene- Aegilops tauschii gene stable ID
atrichopoda_eg_homolog_ensembl_gene- Amborella trichopoda gene stable ID
acomosus_eg_homolog_ensembl_gene- Ananas comosus gene stable ID
ahalleri_eg_homolog_ensembl_gene- Arabidopsis halleri gene stable ID
alyrata_eg_homolog_ensembl_gene- Arabidopsis lyrata gene stable ID
aalpina_eg_homolog_ensembl_gene- Arabis alpina gene stable ID
aofficinalis_eg_homolog_ensembl_gene- Asparagus officinalis gene stable ID
asot3098_eg_homolog_ensembl_gene- Avena sativa OT3098 gene stable ID
assang_eg_homolog_ensembl_gene- Avena sativa Sang gene stable ID
bvulgaris_eg_homolog_ensembl_gene- Beta vulgaris gene stable ID
bdistachyon_eg_homolog_ensembl_gene- Brachypodium distachyon gene stable ID
bjuncea_eg_homolog_ensembl_gene- Brassica juncea gene stable ID
bnapus_eg_homolog_ensembl_gene- Brassica napus gene stable ID
boleracea_eg_homolog_ensembl_gene- Brassica oleracea gene stable ID
brro18_eg_homolog_ensembl_gene- Brassica rapa R-o-18 gene stable ID
ccajan_eg_homolog_ensembl_gene- Cajanus cajan (pigeon pea) - GCA_000340665.1 gene stable ID
csativa_eg_homolog_ensembl_gene- Camelina sativa gene stable ID
csfemale_eg_homolog_ensembl_gene- Cannabis sativa female gene stable ID
cannuum_eg_homolog_ensembl_gene- Capsicum annuum gene stable ID
cbraunii_eg_homolog_ensembl_gene- Chara braunii gene stable ID
cquinoa_eg_homolog_ensembl_gene- Chenopodium quinoa gene stable ID…
Predefined Queries with Ensembl_Plants_Query.R
This script allows the user to specify few input arguments in order to obtain a output table with the following format:
geneID |
ensembl_peptide_id |
description |
go_id_description |
athaliana_eg_homolog_gene |
athaliana_eg_homolog_associated_gene_name |
athaliana_eg_homolog_perc_identity |
|---|---|---|---|---|---|---|
CFP56_18234 |
POE85887.1 |
NA |
GO:0016757_glycosyltransferase activity | GO:0006486_protein glycosylation | GO:0009834_plant-type secondary cell wall biogenesis | GO:0010417_glucuronoxylan biosynthetic process | GO:0047517_1,4-beta-D-xylan synthase activity | GO:0080116_glucuronoxylan glucuronosyltransferase activity |
AT1G27440 |
GUT2 |
87.3786 |
CFP56_55251 |
POE60447.1 |
NA |
GO:0010215_cellulose microfibril organization | GO:0031225_anchored component of membrane | GO:0005886_plasma membrane | GO:0009834_plant-type secondary cell wall biogenesis |
AT5G15630 |
IRX6 |
75.6381 |
CFP56_57155 |
POF01545.1 |
NA |
_ |
AT5G60490 |
FLA12 |
61.0442 |
… |
… |
… |
… |
… |
… |
… |
geneID- input gene IDensembl_peptide_id- Protein stable IDdescription- Gene descriptiongo_id_description- list of GO:Terms and respective descriptions associated with a given geneathaliana_eg_homolog_gene- Arabidopsis thaliana homolog ID (filtered for the highest %identity between query and Arabidopsis thaliana gene)athaliana_eg_homolog_associated_gene_name- Arabidopsis thaliana gene nameathaliana_eg_homolog_perc_identity- % identity of the query gene with the target Arabidopsis thaliana gene
This script receives the following mandatory arguments:
Gene ID list (.csv or .txt format, one per line)
Species name (e.g. qsuber)
And optional arguments:
Output name and format (e.g. qsuber_annotated.csv)
Example of use:
#Within the command line:
Rscript Ensembl_Plants_Query.R -g corkoak_nodes.csv -s qsuber -o qsuber_annotated
Congratulations, this task concludes the present use-case. Further questions or recomendations can be submitted to: hugo.miguelr99@gmail.com.