Core class

Submodule core

Description of the package functionality

The main class of DigitalCellSorter. The class includes tools for:

  1. Pre-processing of single cell RNA sequencing data

  2. Quality control

  3. Batch effects correction

  4. Cell anomaly score evaluation

  5. Dimensionality reduction

  6. Clustering

  7. Annotation of cell types

  8. Visualization

  9. Post-processing



class DigitalCellSorter(df_expr=None, dataName='dataName', species='Human', geneNamesType='alias', geneListFileName=None, mitochondrialGenes=None, sigmaOverMeanSigma=0.01, nClusters=10, nFineClusters=3, doFineClustering=True, splitFineClusters=False, subSplitSize=100, medianScaleFactor=10000, minSizeForFineClustering=50, clusteringFunction=<class 'sklearn.cluster._agglomerative.AgglomerativeClustering'>, nComponentsPCA=200, nSamples_pDCS=3000, nSamples_Hopfield=200, saveDir='', makeMarkerSubplots=False, availableCPUsCount=2, zScoreCutoff=0.3, subclusteringName=None, doQualityControl=True, doBatchCorrection=False, makePlots=True, useUnderlyingNetwork=True, minimumNumberOfMarkersPerCelltype=10, nameForUnknown='Unassigned', nameForLowQC='Failed QC', matplotlibMode='Agg', countDepthCutoffQC=0.5, numberOfGenesCutoffQC=0.5, mitochondrialGenesCutoffQC=1.5, excludedFromQC=None, countDepthPrecutQC=500, numberOfGenesPrecutQC=250, precutQC=False, minSubclusterSize=25, thresholdForUnknown_pDCS=0.0, thresholdForUnknown_ratio=0.0, thresholdForUnknown_Hopfield=0.0, thresholdForUnknown=0.2, layout='TSNE', safePlotting=True, HopfieldTemperature=0.1, annotationMethod='ratio-pDCS-Hopfield', useNegativeMarkers=True, removeLowQualityScores=True, updateConversionDictFile=True, verbose=1, random_state=None)[source]

Bases: DigitalCellSorter.VisualizationFunctions.VisualizationFunctions

Class of Digital Cell Sorter with methods for processing single cell RNA-seq data. Includes analyses and visualization tools.

Parameters:
df_expr: pandas.DataFrame, Default None

Gene expression in a form of a table, where genes are rows, and cells/batches are columns

dataName: str, Default ‘dataName’

Name used in output files

geneNamesType: str, Default ‘alias’

Input gene name convention

geneListFileName: str, Default None

Name of the marker genes file

mitochondrialGenes: list, Default None

List of mitochondrial genes to use in quality control

sigmaOverMeanSigma: float, Default 0.01

Threshold to consider a gene constant

nClusters: int, Default 10

Number of clusters

nFineClusters: int, Default 3

Number of fine clusters to determine with Spectral Co-clustering routine. This option is ignored if doFineClustering is False.

doFineClustering: boolean, Default True

Whether to do fine clustering or not

minSizeForFineClustering: int, Default 50

Minimum number of cells required to do fine clustering of a cluster. This option is ignored if doFineClustering is False.

clusteringFunction: function, Default AgglomerativeClustering

Clustering function to use. Other options: KMeans, {‘k_neighbors’:40}, etc. Note: the function should have a .fit method and the same input and output. For network-based clustering, pass a dictionary {‘k_neighbors’:40, ‘metric’:‘euclidean’, ‘clusterExpression’:True}; this way the best number of clusters will be determined automatically

nComponentsPCA: int, Default 200

Number of pca components

nSamples_pDCS: int, Default 3000

Number of random samples in distribution for pDCS annotation method

nSamples_Hopfield: int, Default 200

Number of repetitions for Hopfield annotation method

saveDir: str, Default os.path.join(‘’)

Directory for output files

makeMarkerSubplots: boolean, Default False

Whether to make subplots on markers

makePlots: boolean, Default True

Whether to make all major plots

availableCPUsCount: int, Default min(12, os.cpu_count())

Number of CPUs used in pDCS method

zScoreCutoff: float, Default 0.3

Z-Score cutoff when setting expression of a cluster as significant

thresholdForUnknown: float, Default 0.2

Threshold when assigning label “Unknown”. This option is used only with a combination of 2 or more annotation methods

thresholdForUnknown_pDCS: float, Default 0.0

Threshold when assigning label “Unknown” in pDCS method

thresholdForUnknown_ratio: float, Default 0.0

Threshold when assigning label “Unknown” in ratio method

thresholdForUnknown_Hopfield: float, Default 0.0

Threshold when assigning label “Unknown” in Hopfield method

annotationMethod: str, Default ‘ratio-pDCS-Hopfield’
Method to use for annotation of cell types to clusters. Options are:

‘pDCS’: main DCS voting scheme with null testing

‘ratio’: simple voting score

‘Hopfield’: Hopfield Network classifier

‘pDCS-ratio’: ‘pDCS’ adjusted with ‘ratio’

‘pDCS-Hopfield’: ‘pDCS’ adjusted with ‘Hopfield’

‘ratio-Hopfield’: ‘ratio’ adjusted with ‘Hopfield’

‘pDCS-ratio-Hopfield’: ‘pDCS’ adjusted with ‘ratio’ and ‘Hopfield’

subclusteringName: str, Default None

Parameter used for certain labels on plots

doQualityControl: boolean, Default True

Whether to remove low quality cells

doBatchCorrection: boolean, Default False

Whether to correct data for batches

minimumNumberOfMarkersPerCelltype: int, Default 10

Minimum number of markers per cell type to keep that cell type in annotation options

nameForUnknown: str, Default ‘Unassigned’

Name to use for clusters where label assignment yielded uncertain results

nameForLowQC: str, Default ‘Failed QC’

Name to use for cells that do not pass quality control

layout: str, Default ‘TSNE’
Projection layout used in visualization. Options are:

‘TSNE’: t-SNE layout L.J.P. van der Maaten. Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research 15(Oct):3221-3245, 2014.

‘PCA’: use two largest principal components

‘UMAP’: use uniform manifold approximation, McInnes, L., Healy, J., UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018

‘PHATE’: use potential of heat diffusion for affinity-based transition embedding, Moon, K.R., van Dijk, D., Wang, Z. et al. Visualizing structure and transitions in high-dimensional biological data. Nat Biotechnol 37, 1482–1492 (2019).

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.clean()

Methods:

KeyInFile(key, file)

Check if a key exists in a HDF file.

alignSeries(se1, se2, tagForMissing)

Align two pandas.Series

annotate([mapNonexpressedCelltypes])

Produce cluster voting results, annotate cell types, and update marker expression with cell type labels

annotateWith_Hopfield_Scheme(…)

Produce cluster annotation results

annotateWith_pDCS_Scheme(df_markers_expr, …)

Produce cluster annotation results

annotateWith_ratio_Scheme(df_markers_expr, …)

Produce cluster annotation results

batchEffectCorrection([method])

Batch effect correction.

calculateQCmeasures()

Calculate Quality Control (QC) measures

calculateV(args)

Calculate the voting scores (celltypes by clusters)

clean()

Clean pandas.DataFrame: validate index, remove index duplicates, replace missing with zeros, remove all-zero rows and columns

cluster()

Cluster PCA-reduced data into a desired number of clusters

convert([nameFrom, nameTo])

Convert index to hugo names; if any names in the index are duplicated, remove duplicates

convertColormap(colormap)

Convert colormap from the form (1.,1.,1.,1.) to ‘rgba(255,255,255,1.)’

createReverseDictionary(inputDictionary)

Efficient way to create a reverse dictionary from a dictionary.

getAnomalyScores(trainingSet, testingSet[, …])

Function to get anomaly score of cells based on some reference set

getCells([celltype, clusterIndex, clusterName])

Get cell annotations in a form of pandas.Series

getCountsDataframe(se1, se2[, tagForMissing])

Get a pandas.DataFrame with cross-counts (overlaps) between two pandas.Series

getExprOfCells(cells)

Get expression of a set of cells.

getExprOfGene(gene[, analyzeBy])

Get expression of a gene.

getHugoName(gene[, printAliases])

Get gene hugo name(s).

getIndexOfGoodQualityCells([QCplotsSubDir])

Get index of cells that satisfy the QC criteria

getNewMarkerGenes([cluster, top, …])

Extract new marker genes based on the cluster annotations

getQualityControlCutoff(se, cutoff[, …])

Function to calculate QC quality cutoff

getSubnetworkOfPCN(subnetworkGenes[, …])

Extract subnetwork of PCN network

loadAnnotatedLabels([detailed, …])

Load cell annotations resulted from function ‘annotate’

loadExpressionData()

Load processed expression data from the internal HDF storage.

makeAnomalyScoresPlot([cells, suffix, noPlot])

Make anomaly scores plot

makeHopfieldLandscapePlot([…])

Make and plot Hopfield landscape

makeIndividualGeneExpressionPlot(genes, **kwargs)

Produce individual gene expression plot on a 2D layout

makeIndividualGeneTtestPlot(gene[, analyzeBy])

Produce individual gene t-test plot of the two-tailed p-value.

makeMarkerSubplots(**kwargs)

Produce subplots on each marker and its expression on all clusters

makeProjectionPlotAnnotated(**kwargs)

Produce projection plot colored by cell types

makeProjectionPlotByBatches(**kwargs)

Produce projection plot colored by batches

makeProjectionPlotByClusters(**kwargs)

Produce projection plot colored by clusters

makeProjectionPlotsQualityControl(**kwargs)

Produce Quality Control projection plots

mergeIndexDuplicates(df_expr[, method, …])

Merge index duplicates

normalize([median])

Normalize pandas.DataFrame: rescale all cells, log-transform data, remove constant genes, sort index

prepare(obj)

Prepare pandas.DataFrame for input to function process(). If the input is a pandas.DataFrame, validate that it has the correct structure.

prepareMarkers([expressedGenes, …])

Get dictionary of markers for each cell type.

process([dataIsNormalized, cleanData])

Process data before using any annotation or visualization functions

project([PCAonly, do_fast_tsne])

Project pandas.DataFrame to lower dimensions

propagateHopfield([sigma, xi, T, tmax, …])

Function is used internally to propagate Hopfield network over a set number of time steps

qualityControl(**kwargs)

Remove low quality cells

readMarkerFile([mergeFunction, mergeCutoff])

Read markers file, prepare markers

recordAnnotationResults(df_marker_cell_type, …)

Record cell type annotation results to spreadsheets.

recordExpressionData()

Record expression data to the internal HDF storage.

visualize()

Aggregate of visualization tools of this class.

zScoreOfSeries(se)

Calculate z-score of pandas.Series and modify the Series in place

Attributes:

df_expr

fileHDFpath

geneListFileName

saveDir

property saveDir
property fileHDFpath
property df_expr
property geneListFileName
prepare(obj)[source]

Prepare pandas.DataFrame for input to function process(). If the input is a pandas.DataFrame, validate that it has the correct structure.

Parameters:
obj: str, pandas.DataFrame, pandas.Series

Expression data in a form of pandas.DataFrame, pandas.Series, or name and path to a csv file with data

Returns:

None

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.prepare('data.csv')

convert(nameFrom=None, nameTo=None, **kwargs)[source]

Convert index to hugo names; if any names in the index are duplicated, remove duplicates

Parameters:
nameFrom: str, Default ‘alias’

Gene name type to convert from

nameTo: str, Default ‘hugo’

Gene name type to convert to

Any parameters that function ‘mergeIndexDuplicates’ can accept

Returns:

None

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.convert()

clean()[source]

Clean pandas.DataFrame: validate index, remove index duplicates, replace missing with zeros, remove all-zero rows and columns

Parameters:

None

Returns:

None

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.clean()
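
The cleaning steps listed above can be illustrated with a small dependency-free sketch. It is not the library's implementation: the helper name `clean_table`, the list-of-rows layout, and the choice to keep the first of duplicated genes are assumptions made for this example.

```python
import math

def clean_table(rows):
    """rows: list of (gene, values) pairs; returns dict gene -> list of floats.

    Hypothetical helper: keeps the first of duplicated genes, replaces
    missing values (None/NaN) with 0, drops all-zero rows and columns.
    """
    table = {}
    for gene, values in rows:
        if gene in table:  # index duplicate: keep the first occurrence
            continue
        table[gene] = [
            0.0 if v is None or (isinstance(v, float) and math.isnan(v)) else float(v)
            for v in values
        ]
    # Drop genes (rows) that are zero in every cell
    table = {g: v for g, v in table.items() if any(x != 0 for x in v)}
    if not table:
        return table
    # Drop cells (columns) that are zero for every remaining gene
    n_cells = len(next(iter(table.values())))
    keep = [j for j in range(n_cells) if any(v[j] != 0 for v in table.values())]
    return {g: [v[j] for j in keep] for g, v in table.items()}

rows = [("CD4", [1, None, 0]), ("CD4", [9, 9, 9]),
        ("CD8A", [0, 0, 0]), ("SDC1", [2, 3, 0])]
cleaned = clean_table(rows)
```

Here the duplicated 'CD4' row, the all-zero 'CD8A' row, and the all-zero third column are removed.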

normalize(median=None)[source]

Normalize pandas.DataFrame: rescale all cells, log-transform data, remove constant genes, sort index

Parameters:
median: float, Default None

Scale factor, if not provided will be computed as median across all cells in data

Returns:

None

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.normalize()
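
As a rough sketch of the normalization steps named above (rescale, log-transform, remove constant genes, sort index), the following plain-Python example may help. The helper `normalize_table`, the use of log(1 + x), and the strict zero-variance test for "constant" genes are assumptions for illustration; the class itself uses a threshold parameter (sigmaOverMeanSigma) for that last step.

```python
import math
from statistics import median, pstdev

def normalize_table(table, scale=None):
    """table: dict gene -> list of counts, one entry per cell (no empty cells)."""
    genes = list(table)
    n_cells = len(table[genes[0]])
    totals = [sum(table[g][j] for g in genes) for j in range(n_cells)]
    if scale is None:
        scale = median(totals)  # scale factor computed across all cells
    out = {}
    for g in genes:
        # Rescale each cell to the common total, then log-transform
        values = [math.log1p(table[g][j] * scale / totals[j]) for j in range(n_cells)]
        if pstdev(values) > 0:  # remove constant genes
            out[g] = values
    return dict(sorted(out.items()))  # sort index

counts = {"SDC1": [6, 0, 4], "ACTB": [2, 1, 4], "CD4": [0, 3, 8]}
normalized = normalize_table(counts)
```

In this toy input 'ACTB' is a fixed fraction of every cell's total, so it becomes constant after rescaling and is dropped.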

project(PCAonly=False, do_fast_tsne=True)[source]

Project pandas.DataFrame to lower dimensions

Parameters:
PCAonly: boolean, Default False

Perform Principal component analysis only

do_fast_tsne: boolean, Default True

Do FI-tSNE instead of “exact” tSNE. This option is ignored if layout is not ‘TSNE’

Returns:
tuple

Processed data

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

xPCA, PCs, tSNE = DCS.project()

cluster()[source]

Cluster PCA-reduced data into a desired number of clusters

Parameters:

None

Returns:

None

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.cluster()

annotate(mapNonexpressedCelltypes=True)[source]

Produce cluster voting results, annotate cell types, and update marker expression with cell type labels

Parameters:
mapNonexpressedCelltypes: boolean, Default True

If True then cell types coloring will be consistent across all datasets, regardless of what cell types are annotated in all datasets for a given input marker list file.

Returns:
dictionary

Voting results, a dictionary in form of: {cluster label: assigned cell type}

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

results = DCS.annotate()
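
To make the shape of the returned dictionary concrete, here is a minimal sketch that turns per-cluster voting scores into a {cluster label: assigned cell type} mapping. The function `assign_cell_types`, the score layout, and the simple "highest score above a threshold" rule are assumptions for illustration; how the class combines its annotation methods is not shown here.

```python
def assign_cell_types(scores, threshold=0.2, name_for_unknown="Unassigned"):
    """scores: dict cluster -> dict cell type -> voting score."""
    results = {}
    for cluster, votes in scores.items():
        # Pick the cell type with the highest score for this cluster
        best_type, best_score = max(votes.items(), key=lambda kv: kv[1])
        # Fall back to the "unknown" label when the best score is too low
        results[cluster] = best_type if best_score >= threshold else name_for_unknown
    return results

scores = {
    0: {"T cell": 0.7, "B cell": 0.1},
    1: {"T cell": 0.05, "B cell": 0.1},
}
results = assign_cell_types(scores)
```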

process(dataIsNormalized=False, cleanData=True)[source]

Process data before using any annotation or visualization functions

Parameters:
dataIsNormalized: boolean, Default False

Whether DCS.df_expr is normalized or not

Returns:

None

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.process()

visualize()[source]

Aggregate of visualization tools of this class.

Parameters:

None

Returns:

None

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.process()

DCS.visualize()

makeProjectionPlotAnnotated(**kwargs)[source]

Produce projection plot colored by cell types

Parameters:

Any parameters that function ‘makeProjectionPlot’ can accept

Returns:

None

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.process()

DCS.makeProjectionPlotAnnotated()

makeProjectionPlotByBatches(**kwargs)[source]

Produce projection plot colored by batches

Parameters:

Any parameters that function ‘makeProjectionPlot’ can accept

Returns:

None

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.process()

DCS.makeProjectionPlotByBatches()

makeProjectionPlotByClusters(**kwargs)[source]

Produce projection plot colored by clusters

Parameters:

Any parameters that function ‘makeProjectionPlot’ can accept

Returns:

None

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.process()

DCS.makeProjectionPlotByClusters()

makeProjectionPlotsQualityControl(**kwargs)[source]

Produce Quality Control projection plots

Parameters:

Any parameters that function ‘makeProjectionPlot’ can accept

Returns:

None

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.process()

DCS.makeProjectionPlotsQualityControl()

makeMarkerSubplots(**kwargs)[source]

Produce subplots on each marker and its expression on all clusters

Parameters:

Any parameters that function ‘internalMakeMarkerSubplots’ can accept

Returns:

None

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.process()

DCS.makeMarkerSubplots()

makeAnomalyScoresPlot(cells='All', suffix='', noPlot=False, **kwargs)[source]

Make anomaly scores plot

Parameters:
cells: pandas.MultiIndex, Default ‘All’

Index of cells of interest

Any parameters that function ‘makeProjectionPlot’ can accept

Returns:

None

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.process()

cells = DCS.getCells(celltype='T cell')

DCS.makeAnomalyScoresPlot(cells)

makeIndividualGeneTtestPlot(gene, analyzeBy='label', **kwargs)[source]

Produce individual gene t-test plot of the two-tailed p-value.

Parameters:
gene: str

Name of gene of interest

analyzeBy: str, Default ‘label’

What level of labels to include. Other possible options are ‘cluster’ and ‘celltype’

Any parameters that function ‘makeTtestPlot’ can accept

Returns:

None

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.makeIndividualGeneTtestPlot('SDC1')

makeIndividualGeneExpressionPlot(genes, **kwargs)[source]

Produce individual gene expression plot on a 2D layout

Parameters:
genes: str, or list-like

Name of gene of interest. E.g. ‘CD4, CD33’, ‘PECAM1’, [‘CD4’, ‘CD33’]

hideClusterLabels: boolean, Default False

Whether to hide the clusters labels

outlineClusters: boolean, Default True

Whether to outline the clusters with circles

Any parameters that function ‘internalMakeMarkerSubplots’ can accept

Returns:

None

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.makeIndividualGeneExpressionPlot('CD4')

makeHopfieldLandscapePlot(meshSamplingRate=1000, plot3D=True, reuseData=False, **kwargs)[source]

Make and plot Hopfield landscape

Parameters:
meshSamplingRate: int, Default 1000

Defines quality of sampling around attractor states

plot3D: boolean, Default True

Whether to plot 2D or 3D figure

reuseData: boolean, Default False

Whether to attempt using precalculated data.

Any parameters that function ‘HopfieldLandscapePlot’ or ‘HopfieldLandscapePlot3D’ can accept

Returns:

None

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.makeHopfieldLandscapePlot()

getAnomalyScores(trainingSet, testingSet, printResults=False)[source]

Function to get anomaly score of cells based on some reference set

Parameters:
trainingSet: pandas.DataFrame

With cells to train isolation forest on

testingSet: pandas.DataFrame

With cells to score

printResults: boolean, Default False

Whether to print results

Returns:
1d numpy.array

Anomaly score(s) of tested cell(s)

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

scores = DCS.getAnomalyScores(df_expr.iloc[:, 5:], df_expr.iloc[:, :5])

getHugoName(gene, printAliases=False)[source]

Get gene hugo name(s).

Parameters:
gene: str

‘hugo’ or ‘alias’ name of a gene

Returns:
str

Hugo name if found, otherwise input name

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.getHugoName('CD138')

getExprOfGene(gene, analyzeBy='cluster')[source]

Get expression of a gene. Run this function only after function process()

Parameters:
gene: str

Name of gene of interest

analyzeBy: str, Default ‘cluster’

What level of labels to include. Other possible options are ‘label’ and ‘celltype’

Returns:
pandas.DataFrame

With expression of the gene of interest

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.process()

DCS.getExprOfGene('SDC1')

getExprOfCells(cells)[source]

Get expression of a set of cells. Run this function only after function process()

Parameters:
cells: pandas.MultiIndex

2-level Index of cells of interest, must include levels ‘batch’ and ‘cell’

Returns:
pandas.DataFrame

With expression of the cells of interest

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.process()

DCS.getExprOfCells(cells)

getCells(celltype=None, clusterIndex=None, clusterName=None)[source]

Get cell annotations in a form of pandas.Series

Parameters:
celltype: str, Default None

Cell type to extract

clusterIndex: int, Default None

Index of the cluster to extract

clusterName: str, Default None

Name of the cluster to extract

Returns:
pandas.MultiIndex

Index of labelled cells

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.process()

labels = DCS.getCells()

getIndexOfGoodQualityCells(QCplotsSubDir='QC_plots', **kwargs)[source]

Get index of cells that satisfy the QC criteria

Parameters:
count_depth_cutoff: float, Default 0.5

Fraction of median to take as count depth cutoff

number_of_genes_cutoff: float, Default 0.5

Fraction of median to take as number of genes cutoff

mitochondrial_genes_cutoff: float, Default 3.0

The cutoff is median + standard_deviation * this_parameter

Any parameters that function ‘makeQualityControlHistogramPlot’ can accept

Returns:
pandas.Index

Index of cells

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

index = DCS.getIndexOfGoodQualityCells()
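
The cutoff rules described in the parameters above (a fraction of the median for count depth and number of genes, and median plus a multiple of the standard deviation for the mitochondrial fraction) can be sketched as follows. The helper names, the use of the population standard deviation, and the inclusive comparisons are assumptions for this example, not the library's code.

```python
from statistics import median, pstdev

def qc_cutoffs(count_depth, n_genes, mito_fraction,
               count_depth_cutoff=0.5, number_of_genes_cutoff=0.5,
               mitochondrial_genes_cutoff=3.0):
    """Compute the three QC cutoffs from per-cell measures (lists of numbers)."""
    return {
        "count_depth": count_depth_cutoff * median(count_depth),
        "number_of_genes": number_of_genes_cutoff * median(n_genes),
        "mito": median(mito_fraction)
                + pstdev(mito_fraction) * mitochondrial_genes_cutoff,
    }

def good_quality_cells(count_depth, n_genes, mito_fraction, **kwargs):
    """Return positions of cells that pass all three cutoffs."""
    cut = qc_cutoffs(count_depth, n_genes, mito_fraction, **kwargs)
    return [i for i in range(len(count_depth))
            if count_depth[i] >= cut["count_depth"]
            and n_genes[i] >= cut["number_of_genes"]
            and mito_fraction[i] <= cut["mito"]]

count_depth = [1000, 1200, 300, 1100]
n_genes = [500, 600, 400, 550]
mito = [0.02, 0.03, 0.02, 0.9]
good = good_quality_cells(count_depth, n_genes, mito,
                          mitochondrial_genes_cutoff=1.0)
```

With these toy values the third cell fails the count depth cutoff and the fourth fails the mitochondrial cutoff.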

getQualityControlCutoff(se, cutoff, precut=1.0, mito=False, MakeHistogramPlot=True, **kwargs)[source]

Function to calculate QC quality cutoff

Parameters:
se: pandas.Series

With data to analyze

cutoff: float

Parameter for calculating the quality control cutoff

mito: boolean, Default False

Whether the analysis is of the mitochondrial genes fraction

plotPathAndName: str, Default None

Text to include in the figure title and file name

MakeHistogramPlot: boolean, Default True

Whether to make a histogram plot

Any parameters that function ‘makeQualityControlHistogramPlot’ can accept

Returns:
float

Cutoff value

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

cutoff = DCS.getQualityControlCutoff(se, 0.5)

getCountsDataframe(se1, se2, tagForMissing='N/A')[source]

Get a pandas.DataFrame with cross-counts (overlaps) between two pandas.Series

Parameters:
se1: pandas.Series

Series with the first set of items

se2: pandas.Series

Series with the second set of items

tagForMissing: str, Default ‘N/A’

Label to assign to non-overlapping items

Returns:
pandas.DataFrame

Contains counts

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

df = DCS.getCountsDataframe(se1, se2)
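
The idea of cross-counts can be shown without pandas. In this sketch two labelings are plain dictionaries (item to label), and items present in only one of them receive the missing tag. The function `cross_counts` and the dictionary layout are assumptions for illustration only.

```python
from collections import Counter

def cross_counts(labels1, labels2, tag_for_missing="N/A"):
    """labels1, labels2: dict item -> label.

    Returns a Counter keyed by (label1, label2) pairs.
    """
    counts = Counter()
    # Walk over the union of items so non-overlapping ones are counted too
    for item in sorted(set(labels1) | set(labels2)):
        pair = (labels1.get(item, tag_for_missing),
                labels2.get(item, tag_for_missing))
        counts[pair] += 1
    return counts

se1 = {"c1": "T cell", "c2": "T cell", "c3": "B cell", "c5": "B cell"}
se2 = {"c2": "cluster 0", "c3": "cluster 1", "c4": "cluster 1", "c5": "cluster 1"}
table = cross_counts(se1, se2)
```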

getNewMarkerGenes(cluster=None, top=100, zScoreCutoff=None, removeUnknown=False, **kwargs)[source]

Extract new marker genes based on the cluster annotations

Parameters:
cluster: int, Default None

Cluster #, if provided genes of only this cluster will be returned

top: int, Default 100

Upper bound for number of new markers per cell type

zScoreCutoff: float, Default 0.3

Lower bound for a marker z-score to be significant

removeUnknown: boolean, Default False

Whether to remove type “Unknown”

Any parameters that function ‘makePlotOfNewMarkers’ can accept

Returns:

None

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.getNewMarkerGenes()

classmethod calculateV(args)[source]

Calculate the voting scores (celltypes by clusters)

Parameters:
args: tuple

Tuple of sub-arguments

df_M: pandas.DataFrame

Marker cell type DataFrame

df_X: pandas.DataFrame

Markers expression DataFrame

cluster_index: 1d numpy.array

Clustering index

cutoff: float

Significance cutoff, i.e. a threshold for a given marker to be significant

giveSignificant: boolean

Whether to return the significance matrix along with the scores

removeLowQCscores: boolean

Whether to remove low quality scores, i.e. those with less than 10% of markers that are supporting

Returns:
pandas.DataFrame

Contains voting scores per celltype per cluster

Usage:

Function is used internally.

df = calculateV((df_M, df_X, cluster_index, 0.3, False, True))

annotateWith_pDCS_Scheme(df_markers_expr, df_marker_cell_type)[source]

Produce cluster annotation results

Parameters:
df_markers_expr: pandas.DataFrame

Data with marker genes by cells expression

df_marker_cell_type: pandas.DataFrame

Data with marker genes by cell types

Returns:

tuple

Usage:

Function should be called internally only

annotateWith_ratio_Scheme(df_markers_expr, df_marker_cell_type)[source]

Produce cluster annotation results

Parameters:
df_markers_expr: pandas.DataFrame

Data with marker genes by cells expression

df_marker_cell_type: pandas.DataFrame

Data with marker genes by cell types

Returns:

tuple

Usage:

Function should be called internally only

annotateWith_Hopfield_Scheme(df_markers_expr, df_marker_cell_type)[source]

Produce cluster annotation results

Parameters:
df_markers_expr: pandas.DataFrame

Markers expression DataFrame

df_marker_cell_type: pandas.DataFrame

Marker cell type DataFrame

Returns:

tuple

Usage:

Function should be called internally only

recordAnnotationResults(df_marker_cell_type, df_markers_expr, df_L, df_V, dict_expressed_markers, df_null_distributions=None)[source]

Record cell type annotation results to spreadsheets.

Parameters:
df_marker_cell_type: pandas.DataFrame

Markers to cell types table

df_markers_expr: pandas.DataFrame

Markers expression in each cluster

df_L: pandas.DataFrame

Annotation scores along with other information

df_V: pandas.DataFrame

Annotation scores along with other information

dict_expressed_markers: dictionary

Dictionary of markers significantly expressed in each cluster

df_null_distributions: pandas.DataFrame, Default None

Table with null distributions

Returns:

None

Usage:

This function is intended to be used internally only

propagateHopfield(sigma=None, xi=None, T=0.2, tmax=200, fractionToUpdate=0.5, mode=4, meshSamplingRate=200, underlyingNetwork=None, typesNames=None, clustersNames=None, printInfo=False, recordTrajectories=True, id=None, printSwitchingFraction=False, path=None, verbose=0)[source]

Function is used internally to propagate Hopfield network over a set number of time steps

Parameters:
sigma: pandas.DataFrame, Default None

Markers expression

xi: pandas.DataFrame, Default None

Marker cell type DataFrame

T: float, Default 0.2

Noise (Temperature) parameter

tmax: int, Default 200

Number of steps to iterate through

fractionToUpdate: float, Default 0.5

Fraction of nodes to randomly update at each iteration

mode: int, Default 4
Options are:

1: non-orthogonalized, non-weighted attractors

2: orthogonalized, non-weighted attractors

3: orthogonalized, weighted attractors

4: orthogonalized, weighted attractors, asymmetric and diluted dynamics

meshSamplingRate: int, Default 200

Visualization parameter to control the quality of the color mesh near the attractors

underlyingNetwork: 2d numpy.array, Default None

Network of underlying connections between genes

typesNames: list-like, Default None

Names of cell types

clustersNames: list-like, Default None

Names or identifiers of the clusters

printInfo: boolean, Default False

Whether to print details

recordTrajectories: boolean, Default True

Whether to record trajectories data to files

id: int, Default None

Identifier of this function call

printSwitchingFraction: boolean, Default False

Whether to print fraction of clusters that switch their maximum overlapping attractor

path: str, Default None

Path for saving trajectories data

Returns:
2d numpy.array

Overlaps

Usage:

result = propagateHopfield(sigma=sigma, xi=df_attrs)
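
For readers unfamiliar with Hopfield dynamics, the following is a generic textbook-style sketch, not the method's actual implementation: Hebbian weights built from the attractors, a random fraction of nodes updated at each step, and a temperature that adds noise. It roughly corresponds to the simplest mode (non-orthogonalized, non-weighted attractors); the function `propagate` and all of its details are assumptions for illustration.

```python
import math
import random

def propagate(state, attractors, T=0.0, tmax=50, fraction_to_update=0.5, seed=0):
    """Return overlaps of the final state with each attractor (+1/-1 lists)."""
    rng = random.Random(seed)
    n = len(state)
    state = list(state)
    # Hebbian weights with zero diagonal
    W = [[0.0 if i == j else sum(a[i] * a[j] for a in attractors) / n
          for j in range(n)] for i in range(n)]
    k = max(1, int(fraction_to_update * n))
    for _ in range(tmax):
        # Update a random subset of nodes, one after another
        for i in rng.sample(range(n), k):
            h = sum(W[i][j] * state[j] for j in range(n))  # local field
            if T > 0:
                # Noisy update; exponent clamped to avoid overflow
                x = max(-50.0, min(50.0, 2.0 * h / T))
                p_up = 1.0 / (1.0 + math.exp(-x))
                state[i] = 1 if rng.random() < p_up else -1
            elif h != 0:
                state[i] = 1 if h > 0 else -1  # deterministic update
    return [sum(a[i] * state[i] for i in range(n)) / n for a in attractors]

xi = [[1, 1, 1, 1, -1, -1, -1, -1],
      [1, -1, 1, -1, 1, -1, 1, -1]]
noisy = [-1, 1, 1, 1, -1, -1, -1, -1]  # first attractor with one flipped node
overlaps = propagate(noisy, xi)
```

At zero temperature the perturbed state relaxes back to the first attractor, so its overlap with that attractor approaches 1 while the overlap with the orthogonal second attractor goes to 0.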

classmethod convertColormap(colormap)[source]

Convert colormap from the form (1.,1.,1.,1.) to ‘rgba(255,255,255,1.)’

Parameters:
colormap: dictionary

Colormap to convert

Returns:
dictionary

Converted colormap

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

colormap = DCS.convertColormap(colormap)
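
The conversion described above can be sketched in a few lines. This is an illustration under stated assumptions: the helper `convert_colormap` is hypothetical, channels are assumed to be floats in [0, 1], and the alpha value is printed by Python as '1.0' rather than '1.'.

```python
def convert_colormap(colormap):
    """Convert {name: (r, g, b, a)} with floats in [0, 1] to 'rgba(R,G,B,a)' strings."""
    converted = {}
    for name, (r, g, b, a) in colormap.items():
        # Scale the color channels to 0..255; alpha is kept as is
        rgb = [int(round(255 * channel)) for channel in (r, g, b)]
        converted[name] = "rgba(%d,%d,%d,%s)" % (rgb[0], rgb[1], rgb[2], a)
    return converted

colormap = convert_colormap({
    "T cell": (1.0, 1.0, 1.0, 1.0),
    "B cell": (0.0, 0.2, 1.0, 0.8),
})
```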

classmethod zScoreOfSeries(se)[source]

Calculate z-score of pandas.Series and modify the Series in place

Parameters:
se: pandas.Series

Series to process

Returns:
pandas.Series

Processed series

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

se = DCS.zScoreOfSeries(se)
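
For reference, a z-score subtracts the mean and divides by the standard deviation. The sketch below works on a plain list and returns a new list instead of modifying its input; the use of the population standard deviation and the handling of constant input are assumptions for this example.

```python
from statistics import mean, pstdev

def z_score(values):
    """Return the z-scores of a list of numbers."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # Constant input: every value sits exactly at the mean
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

z = z_score([1.0, 2.0, 3.0])
```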

classmethod KeyInFile(key, file)[source]

Check if a key exists in a HDF file.

Parameters:
key: str

Key name to check

file: str

HDF file name to check

Returns:
boolean

True if the key is found, False otherwise

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.KeyInFile('df_expr', 'data/file.h5')

getSubnetworkOfPCN(subnetworkGenes, min_shared_first_targets=30)[source]

Extract subnetwork of PCN network

Parameters:
subnetworkGenes: list-like

Set of genes that the subnetwork should contain

min_shared_first_targets: int, Default 30

Minimum number of shared first targets required to connect two nodes

Returns:
pandas.DataFrame

Adjacency matrix

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

df_subnetwork = DCS.getSubnetworkOfPCN(genes)
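
One plausible reading of the min_shared_first_targets rule is that two genes are connected when their sets of first neighbors overlap by at least that many genes. The sketch below implements only that reading on toy data; the function name, the input layout, and the rule itself are assumptions rather than a description of the library's internals.

```python
from itertools import combinations

def shared_target_subnetwork(targets, subnetwork_genes, min_shared_first_targets=30):
    """targets: dict gene -> set of first neighbors in the full network.

    Returns a nested dict used as a 0/1 adjacency matrix.
    """
    genes = sorted(g for g in subnetwork_genes if g in targets)
    adjacency = {g: {h: 0 for h in genes} for g in genes}
    for g, h in combinations(genes, 2):
        # Connect two genes when they share enough first neighbors
        if len(targets[g] & targets[h]) >= min_shared_first_targets:
            adjacency[g][h] = adjacency[h][g] = 1
    return adjacency

targets = {"A": {"x", "y", "z"}, "B": {"y", "z", "w"}, "C": {"w"}}
adjacency = shared_target_subnetwork(targets, ["A", "B", "C"],
                                     min_shared_first_targets=2)
```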

alignSeries(se1, se2, tagForMissing)[source]

Align two pandas.Series

Parameters:
se1: pandas.Series

Series with the first set of items

se2: pandas.Series

Series with the second set of items

tagForMissing: str, Default ‘Missing’

Label to assign to non-overlapping items

Returns:
pandas.DataFrame

Contains two aligned pandas.Series

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

df = DCS.alignSeries(pd.Index(['A', 'B', 'C', 'D']).to_series(), pd.Index(['B', 'C', 'D', 'E', 'F']).to_series(), tagForMissing='Missing')

createReverseDictionary(inputDictionary)[source]

Efficient way to create a reverse dictionary from a dictionary. Utilizes pandas.DataFrame.groupby and NumPy array indexing.

Parameters:
inputDictionary: dictionary

Dictionary to reverse

Returns:
dictionary

Reversed dictionary

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

revDict = DCS.createReverseDictionary(Dict)
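
The effect of reversing a dictionary can be illustrated in plain Python. This sketch assumes the input maps each key to a list of values (for example, cell type to marker genes) and does not use the pandas/NumPy approach mentioned above; the function name is hypothetical.

```python
from collections import defaultdict

def create_reverse_dictionary(input_dictionary):
    """Map each value back to the list of keys that contained it."""
    reverse = defaultdict(list)
    for key, values in input_dictionary.items():
        for value in values:
            reverse[value].append(key)
    return dict(reverse)

markers = {"T cell": ["CD3E", "CD4"], "Monocyte": ["CD14", "CD4"]}
rev = create_reverse_dictionary(markers)
```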

readMarkerFile(mergeFunction='mean', mergeCutoff=0.25)[source]

Read markers file, prepare markers

Parameters:
mergeFunction: str, Default ‘mean’
Function used for grouping of the cell sub-types. Options are:

‘mean’: average of the values

‘max’: maximum of the values, effectively a logical OR function

mergeCutoff: float, Default 0.25

Values below cutoff are set to zero. This option is used if mergeFunction is ‘mean’

Returns:
pandas.DataFrame

Celltype/markers matrix

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

df_marker_cell_type = DCS.readMarkerFile()
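
The two merge options can be compared on a single marker gene. In this sketch the input is the list of that gene's values across the sub-types of one cell type; the helper `merge_subtypes` and the 0/1 encoding of the values are assumptions for illustration.

```python
def merge_subtypes(values, merge_function="mean", merge_cutoff=0.25):
    """Merge one marker's values across the sub-types of a cell type."""
    if merge_function == "max":
        # Effectively a logical OR for 0/1 indicators
        return max(values)
    if merge_function == "mean":
        merged = sum(values) / len(values)
        # Values below the cutoff are set to zero
        return merged if merged >= merge_cutoff else 0.0
    raise ValueError("merge_function must be 'mean' or 'max'")

rare = merge_subtypes([1, 0, 0, 0, 0])           # mean 0.2 is below the cutoff
common = merge_subtypes([1, 1, 0, 0])            # mean 0.5 is kept
any_subtype = merge_subtypes([1, 0, 0, 0, 0], "max")
```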

mergeIndexDuplicates(df_expr, method='average', printDuplicates=False, verbose=1)[source]

Merge index duplicates

Parameters:
df_expr: pandas.DataFrame

Gene expression table

method: str, Default ‘average’
How to deal with index duplicates. Options are:

‘average’: average values of duplicates

‘first’: keep only first of duplicates, discard rest

Returns:
pandas.DataFrame

Gene expression table

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

df_expr = DCS.mergeIndexDuplicates(df_expr)
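
The difference between the ‘average’ and ‘first’ options can be seen on a toy table with one repeated gene. This is a dependency-free sketch; the function `merge_index_duplicates` and the list-of-rows input are assumptions made for the example.

```python
def merge_index_duplicates(rows, method="average"):
    """rows: list of (gene, values) pairs, possibly with repeated genes."""
    grouped = {}
    for gene, values in rows:
        grouped.setdefault(gene, []).append(values)
    merged = {}
    for gene, group in grouped.items():
        if method == "first":
            # Keep only the first of the duplicates
            merged[gene] = list(group[0])
        elif method == "average":
            # Average the duplicates cell by cell
            merged[gene] = [sum(column) / len(group) for column in zip(*group)]
        else:
            raise ValueError("method must be 'average' or 'first'")
    return merged

rows = [("CD4", [2.0, 0.0]), ("SDC1", [1.0, 1.0]), ("CD4", [4.0, 2.0])]
averaged = merge_index_duplicates(rows)
first = merge_index_duplicates(rows, method="first")
```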

recordExpressionData()[source]

Record expression data to the internal HDF storage.

Parameters:

None

Returns:

None

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.recordExpressionData()

loadAnnotatedLabels(detailed=False, includeLowQC=True, infoType='label')[source]

Load cell annotations resulted from function ‘annotate’

Parameters:
detailed: boolean, Default False

Whether to give cluster- or celltype- resolution data

includeLowQC: boolean, Default True

Whether to include low quality cells in the output

Returns:

pandas.Series

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.loadAnnotatedLabels()

loadExpressionData()[source]

Load processed expression data from the internal HDF storage.

Parameters:

None

Returns:

None

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.loadExpressionData()

prepareMarkers(expressedGenes=None, createColormapForCelltypes=True)[source]

Get dictionary of markers for each cell type.

Parameters:
expressedGenes: pandas.Index, Default None

If not None then the marker DataFrame will be intersected with this index, i.e. all non-expressed genes will be filtered from the marker file

createColormapForCelltypes: boolean, Default True

Create (or update) a colormap for cell types based on a marker-celltype matrix. This will make coloring of cell clusters consistent across all plots.

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.prepareMarkers()

calculateQCmeasures()[source]

Calculate Quality Control (QC) measures

Parameters:

None

Returns:

None

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.calculateQCmeasures()

qualityControl(**kwargs)[source]

Remove low quality cells

Parameters:

Any parameters that function ‘getIndexOfGoodQualityCells’ can accept

Returns:

None

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.qualityControl()

batchEffectCorrection(method='COMBAT')[source]

Batch effect correction.

Parameters:
method: str, Default ‘COMBAT’

Stein, C.K., Qu, P., Epstein, J. et al. Removing batch effects from purified plasma cell gene expression microarrays with modified ComBat. BMC Bioinformatics 16, 63 (2015)

Returns:

None

Usage:

DCS = DigitalCellSorter.DigitalCellSorter()

DCS.batchEffectCorrection()