1 Analysing the `airway` dataset

1.1 Features overlapping autosomal transcripts

The airway dataset contains more than 64k features. How many of these features overlaps with transcripts on the autosomes (chromosomes 1-22) as represented by the TxDb.Hsapiens.UCSC.hg19.knownGene package? A feature has to overlap the actual transcript, not the intron of a transcript. So you will need to make sure that the transcript representation does not contain introns.

library(dplyr); library(BiocInstaller); library(biomaRt); library(GenomicFeatures); library(ALL); library(AnnotationHub); library(TxDb.Hsapiens.UCSC.hg19.knownGene); library(airway); data(airway)

setAnnotationHubOption("CACHE", "D:/RStudio/AnnotationHub")

seqlevels(TxDb.Hsapiens.UCSC.hg19.knownGene) <- # use only chromosomes 1-22
    seqlevels(TxDb.Hsapiens.UCSC.hg19.knownGene)[1:22]

TxDb.Hsapiens.UCSC.hg19.knownGene %>%
    transcripts() %>% 
    GenomicRanges::setdiff(., intronicParts(TxDb.Hsapiens.UCSC.hg19.knownGene)) %>% # take only transcripts from which introns are cut
    {. ->> overlaps} %>%
    reduce() %>%
    seqlevels() %>% 
    mapSeqlevels("NCBI") %>% 
    renameSeqlevels(overlaps, .) %>% # change seqnames
    subsetByOverlaps(airway, .) %>% # intersect SummarizedExperiment with transcripts
    dim() %>%
    .[1]

## [1] 24415

1.2 Reads overlapping autosomal transcripts

The expression measures of the airway dataset are the number of reads mapping to each feature. In the previous question we have established that many of these features do not overlap autosomal transcripts from the TxDb.Hsapiens.UCSC.hg19.knownGene. But how many reads map to features which overlaps these transcripts? For sample SRR1039508, how big a percentage (expressed as a number between 0 and 1) of the total reads in the airway dataset for that sample, are part of a feature which overlaps an autosomal TxDb.Hsapiens.UCSC.hg19.knownGene transcript?

TxDb.Hsapiens.UCSC.hg19.knownGene %>%
    transcripts() %>% 
    GenomicRanges::setdiff(., intronicParts(TxDb.Hsapiens.UCSC.hg19.knownGene)) %>% # take only transcripts from which introns are cut
    {. ->> overlaps} %>%
    seqlevels() %>% 
    mapSeqlevels("NCBI") %>% 
    renameSeqlevels(overlaps, .) %>% # change seqnames
    subsetByOverlaps(airway, .) %>% 
    assays() %>% 
    .$counts %>% 
    .[, "SRR1039508"] %>% 
    sum() / (airway %>% # count reads mapped to SRR1039508
                 assays() %>% 
                 .$counts %>% 
                 .[, "SRR1039508"] %>% 
                 sum()) # count total reads in SRR1039508

## [1] 0.8926014

1.3 Expressed and non-expressed autosomal transcripts

Consider sample SRR1039508 and only consider features which overlap autosomal transcripts from TxDb.Hsapiens.UCSC.hg19.knownGene. We should be able to very roughly divide these transcripts into expressed and non expressed transcript. Expressed transcripts should be marked by H3K4me3 at their promoter. The airway dataset have assayed “airway smooth muscle cells”. In the Roadmap Epigenomics data set, the E096 is supposed to be “lung”. Obtain the H3K4me3 narrowPeaks from the E096 sample using the AnnotationHub package. What is the median number of counts per feature (for sample SRR1039508) containing a H3K4me narrowPeak in their promoter (only features which overlap autosomal transcripts from TxDb.Hsapiens.UCSC.hg19.knownGene are considered)? We are using the standard 2.2kb default Bioconductor promotor setting.

E096_H3K4me3 <- AnnotationHub() %>% 
    subset(species == "Homo sapiens") %>%
    query(c("H3K4me3", "E096", "EpigenomeRoadMap")) %>% 
    .[["AH30596"]] %>%     # AH30596 | E096-H3K4me3.narrowPeak.gz
    keepSeqlevels(value = seqlevels(TxDb.Hsapiens.UCSC.hg19.knownGene)[1:22], 
                  pruning.mode = "coarse")                 # use only chromosomes 1-22

# features (for sample SRR1039508) containing a H3K4me3 narrowPeak in their promoter
TxDb.Hsapiens.UCSC.hg19.knownGene %>%
    transcripts() %>%  # extract transcripts
    GenomicRanges::setdiff(., intronicParts(TxDb.Hsapiens.UCSC.hg19.knownGene)) %>% # remove introns
    {. ->> trans} %>%
    promoters() %>% # extract promoters
    findOverlaps(., E096_H3K4me3) %>% # take promoters which overlap H3K4me3 regions
    queryHits() %>%
    trans[.] %>% # take intronless transcripts which contain H3K4me3 in their promoter
    unique() %>%
    {. ->> trans} %>%
    seqlevels() %>% # rename seqlevels
    mapSeqlevels("NCBI") %>% 
    renameSeqlevels(trans, .) %>% 
    subsetByOverlaps(airway, .) %>%  # take only regions that overlap intronless transcripts
    assays() %>% 
    .$counts %>% 
    .[, "SRR1039508"] %>%
    median() # calculate median

## [1] 200

Compare this to the median number of counts for features without a H3K4me3 peak. Note that this short analysis has not taken transcript lengths into account and it compares different genomic regions to each other; this is highly susceptible to bias such as sequence bias.

TxDb.Hsapiens.UCSC.hg19.knownGene %>%
    transcripts() %>% 
    GenomicRanges::setdiff(., intronicParts(TxDb.Hsapiens.UCSC.hg19.knownGene)) %>%
    {. ->> trans} %>%
    promoters() %>%
    findOverlaps(., E096_H3K4me3) %>%
    queryHits() %>%
    trans[.] %>%
    setdiff(trans, .) %>%
    unique() %>%
    {. ->> trans} %>%
    seqlevels() %>% 
    mapSeqlevels("NCBI") %>% 
    renameSeqlevels(trans, .) %>% 
    subsetByOverlaps(airway, .) %>% 
    assays() %>% 
    .$counts %>% 
    .[, "SRR1039508"] %>%
    median()

## [1] 26

Bioconductor Assignment

Bojan Zunar

15 August 2018

1 Analysing the `airway` dataset

1.1 Features overlapping autosomal transcripts

1.2 Reads overlapping autosomal transcripts

1.3 Expressed and non-expressed autosomal transcripts

Bioconductor Assignment

Bojan Zunar

15 August 2018

1 Analysing the airway dataset

1.1 Features overlapping autosomal transcripts

1.2 Reads overlapping autosomal transcripts

1.3 Expressed and non-expressed autosomal transcripts

1 Analysing the `airway` dataset