airway datasetThe airway dataset contains more than 64k features. How many of these features overlaps with transcripts on the autosomes (chromosomes 1-22) as represented by the TxDb.Hsapiens.UCSC.hg19.knownGene package? A feature has to overlap the actual transcript, not the intron of a transcript. So you will need to make sure that the transcript representation does not contain introns.
library(dplyr); library(BiocInstaller); library(biomaRt); library(GenomicFeatures); library(ALL); library(AnnotationHub); library(TxDb.Hsapiens.UCSC.hg19.knownGene); library(airway); data(airway)
setAnnotationHubOption("CACHE", "D:/RStudio/AnnotationHub")
seqlevels(TxDb.Hsapiens.UCSC.hg19.knownGene) <- # use only chromosomes 1-22
seqlevels(TxDb.Hsapiens.UCSC.hg19.knownGene)[1:22]
TxDb.Hsapiens.UCSC.hg19.knownGene %>%
transcripts() %>%
GenomicRanges::setdiff(., intronicParts(TxDb.Hsapiens.UCSC.hg19.knownGene)) %>% # take only transcripts from which introns are cut
{. ->> overlaps} %>%
reduce() %>%
seqlevels() %>%
mapSeqlevels("NCBI") %>%
renameSeqlevels(overlaps, .) %>% # change seqnames
subsetByOverlaps(airway, .) %>% # intersect SummarizedExperiment with transcripts
dim() %>%
.[1]
## [1] 24415
The expression measures of the airway dataset are the number of reads mapping to each feature. In the previous question we have established that many of these features do not overlap autosomal transcripts from the TxDb.Hsapiens.UCSC.hg19.knownGene. But how many reads map to features which overlaps these transcripts? For sample SRR1039508, how big a percentage (expressed as a number between 0 and 1) of the total reads in the airway dataset for that sample, are part of a feature which overlaps an autosomal TxDb.Hsapiens.UCSC.hg19.knownGene transcript?
TxDb.Hsapiens.UCSC.hg19.knownGene %>%
transcripts() %>%
GenomicRanges::setdiff(., intronicParts(TxDb.Hsapiens.UCSC.hg19.knownGene)) %>% # take only transcripts from which introns are cut
{. ->> overlaps} %>%
seqlevels() %>%
mapSeqlevels("NCBI") %>%
renameSeqlevels(overlaps, .) %>% # change seqnames
subsetByOverlaps(airway, .) %>%
assays() %>%
.$counts %>%
.[, "SRR1039508"] %>%
sum() / (airway %>% # count reads mapped to SRR1039508
assays() %>%
.$counts %>%
.[, "SRR1039508"] %>%
sum()) # count total reads in SRR1039508
## [1] 0.8926014
Consider sample SRR1039508 and only consider features which overlap autosomal transcripts from TxDb.Hsapiens.UCSC.hg19.knownGene. We should be able to very roughly divide these transcripts into expressed and non expressed transcript. Expressed transcripts should be marked by H3K4me3 at their promoter. The airway dataset have assayed “airway smooth muscle cells”. In the Roadmap Epigenomics data set, the E096 is supposed to be “lung”. Obtain the H3K4me3 narrowPeaks from the E096 sample using the AnnotationHub package. What is the median number of counts per feature (for sample SRR1039508) containing a H3K4me narrowPeak in their promoter (only features which overlap autosomal transcripts from TxDb.Hsapiens.UCSC.hg19.knownGene are considered)? We are using the standard 2.2kb default Bioconductor promotor setting.
E096_H3K4me3 <- AnnotationHub() %>%
subset(species == "Homo sapiens") %>%
query(c("H3K4me3", "E096", "EpigenomeRoadMap")) %>%
.[["AH30596"]] %>% # AH30596 | E096-H3K4me3.narrowPeak.gz
keepSeqlevels(value = seqlevels(TxDb.Hsapiens.UCSC.hg19.knownGene)[1:22],
pruning.mode = "coarse") # use only chromosomes 1-22
# features (for sample SRR1039508) containing a H3K4me3 narrowPeak in their promoter
TxDb.Hsapiens.UCSC.hg19.knownGene %>%
transcripts() %>% # extract transcripts
GenomicRanges::setdiff(., intronicParts(TxDb.Hsapiens.UCSC.hg19.knownGene)) %>% # remove introns
{. ->> trans} %>%
promoters() %>% # extract promoters
findOverlaps(., E096_H3K4me3) %>% # take promoters which overlap H3K4me3 regions
queryHits() %>%
trans[.] %>% # take intronless transcripts which contain H3K4me3 in their promoter
unique() %>%
{. ->> trans} %>%
seqlevels() %>% # rename seqlevels
mapSeqlevels("NCBI") %>%
renameSeqlevels(trans, .) %>%
subsetByOverlaps(airway, .) %>% # take only regions that overlap intronless transcripts
assays() %>%
.$counts %>%
.[, "SRR1039508"] %>%
median() # calculate median
## [1] 200
Compare this to the median number of counts for features without a H3K4me3 peak. Note that this short analysis has not taken transcript lengths into account and it compares different genomic regions to each other; this is highly susceptible to bias such as sequence bias.
TxDb.Hsapiens.UCSC.hg19.knownGene %>%
transcripts() %>%
GenomicRanges::setdiff(., intronicParts(TxDb.Hsapiens.UCSC.hg19.knownGene)) %>%
{. ->> trans} %>%
promoters() %>%
findOverlaps(., E096_H3K4me3) %>%
queryHits() %>%
trans[.] %>%
setdiff(trans, .) %>%
unique() %>%
{. ->> trans} %>%
seqlevels() %>%
mapSeqlevels("NCBI") %>%
renameSeqlevels(trans, .) %>%
subsetByOverlaps(airway, .) %>%
assays() %>%
.$counts %>%
.[, "SRR1039508"] %>%
median()
## [1] 26