This RMarkdown notebook analyzes single cell RNA-seq data to find genes that are expressed exclusively in human lung ciliated cells. The data is published in GEO with accession number GSE167295.

Seurat package is used for this analysis.

library(Seurat)

There are three samples taken from three chronic obstructive pulmonary disease (COPD) patients. Since COPD do not have any known effect on cilia function, we assume that transcription profile of ciliary genes do not alter in these patients; hence we can use this dataset to determine ciliary genes.

First read data which includes matrix with number of reads, barcodes and genes.

sc167<-ReadMtx(mtx = "files/COPD/GSM5100998_SC167matrix.mtx.gz", features = "files/COPD/GSM5100998_SC167genes.tsv.gz",
              cells = "files/COPD/GSM5100998_SC167barcodes.tsv.gz")

sc170<-ReadMtx(mtx = "files/COPD/GSM5100999_SC170matrix.mtx.gz", features = "files/COPD/GSM5100999_SC170genes.tsv.gz",
               cells = "files/COPD/GSM5100999_SC170barcodes.tsv.gz")

sc200<-ReadMtx(mtx = "files/COPD/GSM5101000_SC200matrix.mtx.gz", features = "files/COPD/GSM5101000_SC200genes.tsv.gz",
               cells = "files/COPD/GSM5101000_SC200barcodes.tsv.gz")

Then we can create Seurat objects that are needed for downstream analysis with Seurat package.

sc167<-CreateSeuratObject(counts = sc167, project = "sc167")
sc170<-CreateSeuratObject(counts = sc170, project = "sc170")
sc200<-CreateSeuratObject(counts = sc200, project = "sc200")

We can add new column to meta data, indicating sample numbers. This might be useful if we want to analyse differences between samples.

sc167@meta.data$type<-"sc167"
sc170@meta.data$type<-"sc170"
sc200@meta.data$type<-"sc200"

Before integrating these samples, we need to clean and normalize data. To apply the same code for all three samples, we put them in a list and apply various functions to the list. First, we need to determine the percentage of reads that map to the mitochondrial genome. Low quality cells may show high mitochondrial contamination. Then we filter out the cells expressing too much or too few unique genes, which is stored in the column nFeature_RNA. Next, we normalize data and identify genes having highly variable expression between different cells. These genes will be a better indicator of cell types and different conditions.

All these newly generated data is added to the seurat objects.

object<-list(sc167, sc170, sc200)

object<- lapply(X = object, FUN = function(x) {
  x[["percent.mt"]]<-PercentageFeatureSet(x, pattern = "^MT-")
  x@meta.data<-x@meta.data[!is.nan(x@meta.data$percent.mt),]
  x<-subset(x, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
  x<-NormalizeData(x)
  x<-FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})

Next, we select genes having highly variable expression between dataset and integrate all three datasets.

features<-SelectIntegrationFeatures(object.list = object)
anchors<-FindIntegrationAnchors(object.list = object, anchor.features = features)
combined<-IntegrateData(anchorset = anchors)

Then we scale the data and perform dimensional reduction (PCA) on integrated data.

combined<-ScaleData(combined)
combined<-RunPCA(combined)

In order to determine the number of PC to use in clustering, we draw elbow plot to see the variance in each PC. We can see that 24 PC can represent the majority of data. Therefore, we will use first 24 PC in the following functions.

ElbowPlot(combined, ndims = 50)

Next, we will cluster the cells and draw UMAP plot to see cell groups.

combined<-FindNeighbors(combined, reduction = "pca", dims = 1:24)
combined<-FindClusters(combined, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 8973
## Number of edges: 389421
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9161
## Number of communities: 18
## Elapsed time: 1 seconds
combined<-RunUMAP(combined, reduction = "pca", dims = 1:24)

Draw the UMAP plot.

DimPlot(combined, reduction = "umap", label = TRUE, repel = TRUE)

DefaultAssay(combined)<-"RNA"
markers1<-FindAllMarkers(combined, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
markers1$cluster[markers1$gene %in% "FOXJ1"]
## [1] 6
## Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
ciliary_genes<-markers1[markers1$cluster == 6,]
ciliary_genes
ABCDEFGHIJ0123456789
 
 
p_val
<dbl>
avg_log2FC
<dbl>
pct.1
<dbl>
pct.2
<dbl>
p_val_adj
<dbl>
cluster
<fct>
CAPS0.000000e+005.41766080.9480.0730.000000e+006
TPPP30.000000e+005.11136310.9150.0710.000000e+006
C20orf850.000000e+004.87523440.8500.0390.000000e+006
GSTA10.000000e+004.68349030.7610.0470.000000e+006
C9orf240.000000e+004.58740760.8070.0490.000000e+006
RSPH10.000000e+004.57586690.8500.0300.000000e+006
GSTA20.000000e+004.51441250.5720.0230.000000e+006
TSPAN10.000000e+004.40343640.7650.0430.000000e+006
AGR30.000000e+004.38157890.7320.0410.000000e+006
C1orf1940.000000e+004.29178890.7650.0300.000000e+006