Data preparation

Output from kallisto-bustools (kp-python)

In this example we use raw sequencing data stored in .fastq format, from 1000 PBMC, the data can be accessed at https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_1k_v3

Note

This is by no means a tutorial for processing scRNA-seq data. We only demonstrate the workflow of connecting upstream analysis software and DCS.

Download the data and unpack the .tar file (~5.17 GB):

wget https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_fastqs.tar
tar -xvf pbmc_1k_v3_fastqs.tar

To process sequencing data one could use kallisto bus tool to generate BUS file following by bustools count to generate count matrices from a BUS file. However, we prefer to use kb-python, a package that wraps the kallisto and bustools single-cell RNA-seq workflow (Bray, N. L., Pimentel, H., Melsted, P., & Pachter, L. (2016). Near-optimal probabilistic RNA-seq quantification. Nature biotechnology, 34(5), 525) kb-python can be installed with pip.

kb count -i kallisto_index/homo_sapiens/transcriptome.idx \
         -g kallisto_index/homo_sapiens/transcripts_to_genes.txt \
         -x 10xv3 \
         --filter \
         -t 4 \
         pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz \
         pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz \
         pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz \
         pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz

Output from kb count command above [2020-11-20 14:32:51,136] INFO Using index kallisto_index/homo_sapiens/transcriptome.idx to generate BUS file to . from

[2020-11-20 14:32:51,136] INFO pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz

[2020-11-20 14:32:51,136] INFO pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz

[2020-11-20 14:32:51,136] INFO pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz

[2020-11-20 14:32:51,136] INFO pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz

[2020-11-20 14:36:33,477] INFO Sorting BUS file ./output.bus to ./tmp/output.s.bus

[2020-11-20 14:36:57,118] INFO Whitelist not provided

[2020-11-20 14:36:57,118] INFO Copying pre-packaged 10XV3 whitelist to .

[2020-11-20 14:36:57,675] INFO Inspecting BUS file ./tmp/output.s.bus

[2020-11-20 14:37:07,641] INFO Correcting BUS records in ./tmp/output.s.bus to ./tmp/output.s.c.bus with whitelist ./10xv3_whitelist.txt

[2020-11-20 14:37:29,264] INFO Sorting BUS file ./tmp/output.s.c.bus to ./output.unfiltered.bus

[2020-11-20 14:37:47,478] INFO Generating count matrix ./counts_unfiltered/cells_x_genes from BUS file ./output.unfiltered.bus

[2020-11-20 14:37:59,662] INFO Filtering with bustools

[2020-11-20 14:37:59,662] INFO Generating whitelist ./filter_barcodes.txt from BUS file ./output.unfiltered.bus

[2020-11-20 14:37:59,790] INFO Correcting BUS records in ./output.unfiltered.bus to ./tmp/output.unfiltered.c.bus with whitelist ./filter_barcodes.txt

[2020-11-20 14:38:14,344] INFO Sorting BUS file ./tmp/output.unfiltered.c.bus to ./output.filtered.bus

[2020-11-20 14:38:30,918] INFO Generating count matrix ./counts_filtered/cells_x_genes from BUS file ./output.filtered.bus


The output directory that we are interested in is counts_filtered/. Rename it:

mv counts_filtered/ kb_1k_PBMC_output/

This will generate counts data in the directory kb_1k_PBMC_output/.

Output from CellRanger

Here we use CellRanger-processed data stored in .mtx format, from 1000 PBMC, the data can be accessed at https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_1k_v3

Download the data and unpack the .tar.gz file (~9 MB):

wget https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_filtered_feature_bc_matrix.tar.gz
tar -xzf pbmc_1k_v3_filtered_feature_bc_matrix.tar.gz && mv filtered_feature_bc_matrix/ cellranger_1k_PBMC_output/

These two commands will prepare the processed counts data in the directory cellranger_1k_PBMC_output/.

Import from kallisto-bustools (kp-python)

import DigitalCellSorter
from DigitalCellSorter.core import readMTXdata

# Read the MTX data
df = readMTXdata(dataDir='kb_1k_PBMC_output/', origin='kb-python')

# (Optional) Convert gene names to HUGO
DCS = DigitalCellSorter.DigitalCellSorter()
DCS.prepare(df)
DCS.convert('ensembl', 'hugo')

# Check the DCS data
print(DCS.df_expr)

Import from CellRanger

import DigitalCellSorter
from DigitalCellSorter.core import readMTXdata

# Read the MTX data
df = readMTXdata(dataDir='cellranger_1k_PBMC_output/', origin='cellranger')

# (Optional) Convert gene names to HUGO
DCS = DigitalCellSorter.DigitalCellSorter()
DCS.prepare(df)
DCS.convert('ensembl', 'hugo')

# Check the DCS data
print(DCS.df_expr)

Function readMTXdata

Function to read data in MTX format (see usage examples above).

readMTXdata(dataDir, origin, fileMatrix=None, fileBarcodes=None, fileGenes=None, headerRows=None, sampleName=None, stripGeneVersions=True, saveData=True, dropGeneDuplicates=True, dropCellDuplicates=True)[source]

Read MTX format into pandas DataFrame compatible with DCS input format

Parameters
dataDir: str

Path to gene expression counts data

origin: str
Name of the software where the data was generated. Supported options are:

‘kb-python’ for kallisto-bustools ‘cellranger’ for cellRanger

fileMatrix: str, Default None

Name of the matrix file

fileBarcodes: str, Default None

Name of the cell barcodes file

fileGenes: str, Default None

Name of the genes file

headerRows: list, Default None

List of rows in matrix file to skip

sampleName: str, Default None

Name of the data sample to include in the batch level

stripGeneVersions: boolean, Default True

Remove ensembl gene version. E.g. “ENSG00000236246.1” –> “ENSG00000236246”

saveData: boolean, Default True

Whether to save data in hdf format. If True then the data is saved to a compressed hdf at the same location as matrix data

dropGeneDuplicates: boolean, Default True

Whether to remove gene duplicates (keep first)

dropGeneDuplicates: boolean, Default True

Whether to remove barcode duplicates (keep first)

Returns:
pandas.DataFrame

Table that has genes in rows and cells in columns

Usage:

df = readMTX(dataDir=’filtered_feature_bc_matrix/’, origin=’cellranger’) #df = readMTX(dataDir=’counts_filtered/’, origin=’kb-python’)

DCS = DigitalCellSorter.DigitalCellSorter() DCS.prepare(df) DCS.convert(‘ensembl’, ‘hugo’) print(DCS.df_expr)

Human Cell Atlas tools

Set of generic tools for retrieving, loading, and preparation of Human Cell Atlas (HCA) datasets is contained in this module.

Example:

import os
import DigitalCellSorter.ReadPrepareDataHCA as prep

# Example URL of a relatvely small dataset of scRNA-seq of human pancreas
url = "https://data.humancellatlas.org/project-assets/project-matrices/cddab57b-6868-4be4-806f-395ed9dd635a.homo_sapiens.mtx.zip"

# Path of directories where the data will be placed
extractPath = os.path.join(os.path.join(os.path.dirname(__file__), ''), 'data', os.path.splitext(os.path.basename(url))[0])

# Download data and unpack it to a specified directory
prep.getHCAdataByURL(url, extractPath)

# Record *.h5 files of individual donor IDs
IDs = prep.recordFilesOfIndividualDonors(extractPath, organName='islet of Langerhans')

# Load ready-to-use dataset of the first donor ID
df = prep.getDataframeByDonorID(extractPath, IDs[0])

# Print the shape of just loaded dataset
print(df.shape)

Submodule ReadPrepareDataHCA

Functions:

PrepareDataOnePatient_PREVIEW_DATASET(…[, …])

Prepare data from Human Cell Atlas (HCA) preview dataset h5 data file.

getDataframeByDonorID(extractPath, donorID)

Get pandas.DataFrame by Donor ID

getHCAdataByURL(url, extractPath[, extractData])

Download and extract data from Human Cell Atlas Portal

prepareDemo5kData(dir)

recordFilesOfIndividualDonors(extractPath[, …])

Record h5 files of HCA individual donors in a dataset

getHCAdataByURL(url, extractPath, extractData=True)[source]

Download and extract data from Human Cell Atlas Portal

Parameters:
url: str

URL of the data of interest

extractPath: str

Path where to save and extract data to

extractData: boolean, Default True

Whether to extract downloaded data

Returns:

None

Usage:

getHCAdataByURL(url, extractPath)

recordFilesOfIndividualDonors(extractPath, organName=None, donorIDcolumn='donor_organism.provenance.document_id', organColumn='derived_organ_parts_label', useHogoGeneNames=True)[source]

Record h5 files of HCA individual donors in a dataset

Parameters:
extractPath: str

Path of directories where HCA matrix files were downloaded and extracted. See function getHCAdataByURL() for detail.

organName: str, Default None

Name of the organ name. E.g. ‘pancreas’, ‘bone marrow’, etc.

donorIDcolumn: str, Default donor_organism.provenance.document_id

Column with unique IDs of donors in the file. Another option is ‘specimen_from_organism.provenance.document_id’ IDs at samples level is needed.

organColumn: str, Default ‘derived_organ_parts_label’

‘derived_organ_label’

‘derived_organ_parts_label’ This option is ignored when organName parameter is None.

useHogoGeneNames: boolean, Default True

Whether to use HUGO gene names.

Returns:
list

List of donor IDs

Usage:

recordFilesOfIndividualDonors(extractPath, organName=’retina’)

getDataframeByDonorID(extractPath, donorID)[source]

Get pandas.DataFrame by Donor ID

Parameters:
extractPath: str

Path of directories where HCA matrix files were downloaded and extracted. See function getHCAdataByURL() for detail.

donorID: str

Donor ID.

Returns:
pandas.DataFrame

Matrix corresponding to the Donor ID

Usage:

getDataframeByDonorID(extractPath, donorID)

PrepareDataOnePatient_PREVIEW_DATASET(filename, patient, saveFolderName, useAllData=True, cellsLimitToUse=1000, randomlySample=True, randomSeed=0)[source]

Prepare data from Human Cell Atlas (HCA) preview dataset h5 data file. The user can download the file ica_bone_marrow_h5.h5 from https://preview.data.humancellatlas.org/ (Raw Counts Matrix - Bone Marrow) and place in folder data. The file is ~485Mb and contains all 378000 cells from 8 bone marrow donors (BM1-BM8). Note: this data file is no longer available at HCA data server, however, some users may have a copy of it and need to extract data from it.

Parameters:
filename: str

Path and name of the file to store binary data in

patient: str

Identifier of the patient: ‘BM1’, ‘BM2’, ‘BM3’, ‘BM4’, ‘BM5’, ‘BM6’, ‘BM7’ or ‘BM8’

saveFolderName: str

Path where to save prepared data file

useAllData: boolean, Default True

Whether to use all data or a subset

cellsLimitToUse: int, Default 1000

Number of cells to use if useAllData=False

randomlySample: boolean, Default True

Whether to sample cell randomly of pick top number

randomSeed: int, Default 0

Random seed

Returns:

None

Usage:

PrepareDataOnePatient(os.path.join(‘data’, ‘ica_bone_marrow_h5.h5’), ‘BM1’, os.path.join(‘data’, ‘’), useAllData=False, cellsLimitToUse=5000)

prepareDemo5kData(dir)[source]