Input Data Format

Gene Expression Data Format

The input gene expression data is expected in one of the following formats:

1. A comma-separated values (csv) spreadsheet containing a condensed matrix with the columns ('cell', 'gene', 'expr'). If the data contains batches, the matrix must have the columns ('batch', 'cell', 'gene', 'expr'). The column order can be arbitrary.

cell    gene    expr
C1      G1      3
C1      G2      2
C1      G3      1
C2      G1      1
C2      G4      5

or:

batch     cell    gene    expr
batch0    C1      G1      3
batch0    C1      G2      2
batch0    C1      G3      1
batch1    C2      G1      1
batch1    C2      G4      5
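
As an illustration of the condensed layout, the following minimal sketch loads such a csv with pandas and turns it into a Series indexed by ('batch', 'cell', 'gene'). The file contents are hypothetical and inlined through `io.StringIO` so the example is self-contained, and the columns are shuffled on purpose to show that their order does not matter. This is one possible way to load the data, not necessarily how the package reads it internally.

```python
import io

import pandas as pd

# Hypothetical file contents in the condensed layout; columns deliberately shuffled
csv_text = """gene,expr,batch,cell
G1,3,batch0,C1
G2,2,batch0,C1
G3,1,batch0,C1
G1,1,batch1,C2
G4,5,batch1,C2
"""

long_df = pd.read_csv(io.StringIO(csv_text))

# Columns are selected by name, so their order in the file is irrelevant
se = long_df.set_index(["batch", "cell", "gene"])["expr"]
```

For data without batches, the same approach should work with `set_index(["cell", "gene"])`.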

2. A comma-separated values (csv) spreadsheet in which rows are genes and columns are cells, holding gene expression counts. If the data contains batches, the first row of the spreadsheet should be 'batch' and the second 'cell'.

cell    C1    C2    C3    C4
G1      3     1     7
G2      2     2     2
G3      3     1     5
G4      10    5     4

or:

batch    batch0    batch0    batch1    batch1
cell     C1        C2        C3        C4
G1       3         1         7
G2       2         2         2
G3       3         1         5
G4       10        5         4
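
A genes-by-cells csv with the two header rows 'batch' and 'cell' can be read into a DataFrame with two-level columns, as in the minimal sketch below. The file contents are hypothetical (including the positions of the empty cells) and are inlined through `io.StringIO`. The sketch relies only on the standard `header` and `index_col` parameters of `pandas.read_csv` and is not necessarily how the package loads such files.

```python
import io

import pandas as pd

# Hypothetical file contents: two header rows, then one row per gene.
# Empty fields stand for missing expression values.
csv_text = """batch,batch0,batch0,batch1,batch1
cell,C1,C2,C3,C4
G1,3,1,4,7
G2,2,,2,2
G3,10,5,,4
"""

# header=[0, 1] builds two-level columns from the first two rows;
# index_col=0 uses the first column (gene names) as the row index.
wide = pd.read_csv(io.StringIO(csv_text), header=[0, 1], index_col=0)
```

Empty fields are read as NaN. For a file without batches, a single header row (the default `header=0`) together with `index_col=0` should be sufficient.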

3. Pandas DataFrame where axis 0 is genes and axis 1 is cells. If the data contains batches, the index of axis 1 should have two levels, e.g. ('batch', 'cell'), with the first level indicating the patient, batch or experiment in which the cell was sequenced, and the second level containing cell barcodes for identification.

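import numpy as np
import pandas as pd

# Genes on axis 0, (batch, cell) pairs on axis 1; np.nan marks missing values
df = pd.DataFrame(data=[[2,np.nan],[3,8],[3,5],[np.nan,1]],
                  index=['G1','G2','G3','G4'],
                  columns=pd.MultiIndex.from_arrays([['batch0','batch1'],['C1','C2']], names=['batch', 'cell']))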

4. Pandas Series whose index has two levels, e.g. ('cell', 'gene'). If the data contains batches, the index should have three levels: the first indicating the patient, batch or experiment in which the cell was sequenced, the second containing cell barcodes for identification, and the third containing gene names.

import pandas as pd

# One value per (batch, cell, gene) triple; absent triples are simply omitted
se = pd.Series(data=[1,8,3,5,5],
               index=pd.MultiIndex.from_arrays([['batch0','batch0','batch1','batch1','batch1'],
                                                ['C1','C1','C1','C2','C2'],
                                                ['G1','G2','G3','G1','G4']], names=['batch', 'cell', 'gene']))
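
The DataFrame and Series layouts carry the same information and can be converted into each other with plain pandas. The sketch below uses only `Series.unstack`, `DataFrame.unstack` and `dropna`. It is meant to show how the two layouts relate, and makes no claim about what the package does internally.

```python
import pandas as pd

se = pd.Series(
    data=[1, 8, 3, 5, 5],
    index=pd.MultiIndex.from_arrays(
        [['batch0', 'batch0', 'batch1', 'batch1', 'batch1'],
         ['C1', 'C1', 'C1', 'C2', 'C2'],
         ['G1', 'G2', 'G3', 'G1', 'G4']],
        names=['batch', 'cell', 'gene']))

# Series -> DataFrame: move 'batch' and 'cell' to the columns; genes stay on axis 0.
# Gene/cell combinations absent from the Series become NaN.
df = se.unstack(level=['batch', 'cell'])

# DataFrame -> Series: fold the columns back into the index and drop the NaN gaps.
# The resulting index levels are ('batch', 'cell', 'gene').
se_back = df.unstack().dropna()
```

Because NaN is introduced in the intermediate DataFrame, the values in `se_back` are floats even though the original Series held integers.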

Data in any of the formats outlined above must be prepared and validated with the function prepare().