Flow cytometry simulation.

Using NBNode as basis enables the simulation of (flow) cytometry data. Cytometry data consists of an unordered matrix of n cells and p features. The ordering of cells in the matrix does not hold any significant information, except in cases where data corruption has occurred during the experiment.

As an example, we previously utilized a cell matrix originating from the Beckman Coulter DURAClone IM T Cell Subsets Antibody Panel. This particular panel measures 10 markers in fluorochrome combinations that provide robust population identification, including CD3, CD4, CD8, CD27, CD28, CD45, CD45RA, CD57, CD197 (CCR7), CD279 (PD-1). Additionally, forward scatter and side scatter are usually measured, increasing the number of features per cell to 13.

According to their feature values, cells can be classified (“gated”) into cell types, e.g. CD4+ central memory T cells or CD27+ CD28+ CD4+ T cells. Within one cell type, features do still vary. In this package, we use cell population for any defined set of cells.

Our class FlowSimulationTree uses an existing gating applied to multiple samples and simulates two major parts:

  1. How many cells are of which cell population

  2. How are cell features distributed within one cell population

In this example, we will use data from 48 healthy human peripheral blood samples stained with the DURAClone IM T Cell Subsets Tube (Beckman Coulter GmbH). The data is available under zenodo 7883353.

Synthetic data based on previous samples

To enable the analysis also with data supplied with package, I supply the final result flowsim_tree.pickle which has done downloading gating and initiating the simulation.

First, download the data.

[1]:
%%script false --no-raise-error

# This downloading takes about 1.5 minutes

# From https://github.com/zenodo/zenodo/issues/1888
import os

import requests

# params = {"access_token": "ZENODO_ACCESS_TOKEN"}  # only necessary until public
params = {}

record_id = "7883353"

r = requests.get(
    f"https://zenodo.org/api/records/{record_id}", params=params
)
print(r.json())
download_urls = [f["links"]["self"] for f in r.json()["files"]]
filenames = [f["key"] for f in r.json()["files"]]

print(r.status_code)
print(download_urls)

outdir = "example_data/asinh.align_manual.CD3_Gate"
os.makedirs(outdir, exist_ok=True)

for filename, url in zip(filenames, download_urls):
    print("Downloading:", filename)
    r = requests.get(url, params=params)
    with open(os.path.join(outdir, filename), "wb") as f:
        f.write(r.content)

The (pre-defined) tree for this dataset is nbtree.tree_complete_aligned_v2()

[2]:
%%script false --no-raise-error

import nbnode.nbnode_trees as nbtree
complete_tree = nbtree.tree_complete_aligned_v2()
complete_tree.pretty_print()

nbtree.tree_complete_aligned_v2 has specific ideas about the naming of the markers, so we need to use the same names here.

new_colnames replaces the column names in all CSV files, regardless of how they were named before, so use with caution!

gate_csv() now places all cells from all samples in the NBNode.

[3]:
%%script false --no-raise-error

# This gate_csv takes about a minute
from nbnode.apply.gate_csv import gate_csv
all_files = [
        os.path.join(outdir, file_x)
        for file_x in os.listdir(outdir)
    ]
new_colnames = [
        "FS",
        "FS.0",
        "SS",
        "CD45RA",
        "CCR7",
        "CD28",
        "PD1",
        "CD27",
        "CD4",
        "CD8",
        "CD3",
        "CD57",
        "CD45",
    ]
celltree_gated = gate_csv(
        csv=all_files,
        celltree=complete_tree,
        new_colnames=new_colnames,
    )
[4]:
%%script false --no-raise-error

celltree_gated.pretty_print()

After we have the number of cells per node, we can generate a Dirichlet-based simulation. In particular, the number of cells per node are described by a single Dirichlet distribution containing a concentration parameter for every leaf node.

[5]:
%%script false --no-raise-error

from nbnode.simulation.FlowSimulationTree import FlowSimulationTreeDirichlet

flowsim_tree = FlowSimulationTreeDirichlet(
    rootnode=celltree_gated,
    include_features=new_colnames,
)

from nbnode.io.pickle_open_dump import pickle_open_dump

pickle_open_dump(flowsim_tree, "flowsim_tree.pickle")

[6]:
import os
import pickle
with open(os.path.join(os.pardir, os.pardir, "tests", "testdata", "flowcytometry", "flowsim_tree.pickle"), "rb") as f:
    flowsim_tree = pickle.load(f)

That’s it. We can now simulate synthetic flow cytometry samples.

[7]:
flowsim_tree.sample(n_cells=10)
[7]:
FS FS.0 SS CD45RA CCR7 CD28 PD1 CD27 CD4 CD8 CD3 CD57 CD45
0 0.100577 5.251694 -0.287347 0.065961 0.659882 0.955657 1.205756 0.564757 0.998757 0.028991 0.049317 0.044155 -0.168126
1 -0.092387 5.286552 -0.061012 0.007936 0.567971 1.060524 0.914231 0.944080 1.109905 -0.007508 -0.070442 0.025814 0.349142
2 -0.159748 5.295267 -0.008783 -0.022918 0.773974 1.353215 0.590003 0.732128 1.040836 0.065331 -0.086695 -0.058003 0.352973
3 0.036379 5.286429 -0.355162 0.022301 0.555761 1.093256 1.016961 0.729056 0.925360 0.085008 0.100495 -0.051198 -0.029892
4 -0.211879 5.305992 -0.224904 -0.010840 0.883872 1.357975 1.194074 1.329266 0.886225 0.041453 -0.734879 0.003398 0.009937
5 0.037094 5.235038 -0.212276 0.514944 0.924698 0.973519 0.950437 0.984918 1.038688 0.063404 -0.441071 -0.025791 -0.398935
6 -0.236324 5.226253 -0.254201 0.772118 1.087614 0.953043 0.559132 1.006041 0.903030 0.079715 0.196283 -0.033585 -0.073690
7 -0.044698 5.232621 -0.174454 0.569029 1.018878 0.909895 0.714083 1.076313 1.044681 0.078261 0.222197 -0.030222 0.158465
8 0.274186 5.271703 0.091757 0.426830 -0.053408 -0.130320 0.889216 -0.187032 0.013959 1.186677 -0.058428 0.947983 -0.061106
9 0.368384 5.290500 -0.148017 -0.016317 -0.096373 0.959208 0.616525 0.484474 0.039499 -0.003198 0.573195 0.695084 -0.030929
[8]:
synthetic_sample, cell_numbers = flowsim_tree.sample(n_cells=1000, return_sampled_cell_numbers=True)
print("Number of cells per population in the synthetic sample\n")
print(cell_numbers)

print("\n\n\nSynthetic sample")
synthetic_sample
Number of cells per population in the synthetic sample

/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57+/PD1+        0.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57+/PD1-        0.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57-           298.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28-/CD57+/PD1+        0.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28-/CD57-             0.0
                                                    ...
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57+/PD1+      0.0
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57+/PD1-      0.0
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57-           0.0
/AllCells/DN                                         84.0
/AllCells/DP                                          0.0
Name: 0, Length: 96, dtype: float64



Synthetic sample
[8]:
FS FS.0 SS CD45RA CCR7 CD28 PD1 CD27 CD4 CD8 CD3 CD57 CD45
0 -0.199053 5.234598 -0.088241 -0.034998 0.924329 1.131150 1.082737 0.809666 1.125293 0.004234 0.046163 -0.045944 0.179582
1 -0.008096 5.277367 0.184243 0.056715 0.681012 0.935260 0.782681 1.024696 1.124938 -0.051580 -0.774571 -0.036563 0.035795
2 0.031465 5.147636 0.214739 0.009059 0.498950 1.033221 0.657734 0.879537 0.999712 0.055606 -0.289094 -0.055068 0.119610
3 0.205881 5.244811 -0.124324 -0.077907 0.594776 1.187281 0.837427 0.892707 0.987820 0.037409 0.041813 -0.041016 -0.032496
4 0.009315 5.210594 -0.185055 0.002598 0.657654 1.181162 0.950851 1.259253 1.112355 0.025556 -0.008730 -0.000847 -0.078750
... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 -0.159198 5.304853 0.272190 -0.230301 0.271408 0.809975 0.973990 0.149426 -0.036235 -0.036501 0.595494 0.777754 -0.411414
996 -0.209859 5.266154 0.135413 0.382660 0.265313 1.075741 1.255181 0.844357 -0.004307 -0.002984 -0.164177 -0.372914 -0.011075
997 -0.077003 5.277108 0.339912 0.050631 0.568426 0.513111 0.259577 0.814484 0.058262 -0.021971 0.919959 0.512811 -0.023790
998 0.015126 5.278076 0.214904 0.111095 0.130813 0.388185 0.078628 0.316861 -0.033330 0.009760 0.600114 0.737948 0.521447
999 0.144214 5.218433 0.261650 0.407119 0.179927 0.320780 0.522724 0.805193 0.035672 0.024522 0.070480 -0.048929 -0.143095

1000 rows × 13 columns

If you do not need the actual cell features but only the number of cells per cell population, you can do that with sample_populations()

[9]:
cell_numbers = flowsim_tree.sample_populations(n_cells=1000)
cell_numbers
[9]:
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57+/PD1+       11.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57+/PD1-        3.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57-           305.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28-/CD57+/PD1+        3.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28-/CD57-             0.0
                                                    ...
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57+/PD1+      0.0
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57+/PD1-      0.0
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57-           0.0
/AllCells/DN                                         68.0
/AllCells/DP                                         12.0
Name: 0, Length: 96, dtype: float64

Internally:

  1. sample_populations() is called to generate the number of cells per leaf

  2. multivariate_normal() generates the cell features for each cell

You can also set a seed for reproducibility.

[10]:
flowsim_tree.set_seed(42)
synthetic_sample = flowsim_tree.sample(n_cells=3)
print(synthetic_sample)

flowsim_tree.set_seed(42)
synthetic_sample = flowsim_tree.sample(n_cells=3)
print(synthetic_sample)
         FS      FS.0        SS    CD45RA      CCR7      CD28       PD1
0  0.033016  5.212589  0.027994 -0.042323  1.068687  1.230491  0.727421  \
1 -0.034233  5.219806  0.041332  0.009111  0.016646  1.218523  0.445606
2  0.395116  5.366544  0.085398  0.737223  0.762656  0.641986  1.200716

       CD27       CD4       CD8       CD3      CD57      CD45
0  0.667649  1.017606 -0.012887 -0.232470 -0.015077  0.049724
1  0.971163  1.167465  0.033722  0.081410  0.019712 -0.191248
2  1.028306 -0.014907  0.892293 -0.686076 -0.011778  0.000069
         FS      FS.0        SS    CD45RA      CCR7      CD28       PD1
0  0.033016  5.212589  0.027994 -0.042323  1.068687  1.230491  0.727421  \
1 -0.034233  5.219806  0.041332  0.009111  0.016646  1.218523  0.445606
2  0.395116  5.366544  0.085398  0.737223  0.762656  0.641986  1.200716

       CD27       CD4       CD8       CD3      CD57      CD45
0  0.667649  1.017606 -0.012887 -0.232470 -0.015077  0.049724
1  0.971163  1.167465  0.033722  0.081410  0.019712 -0.191248
2  1.028306 -0.014907  0.892293 -0.686076 -0.011778  0.000069

Per default, the normal distribution is based on a diagonal covariance matrix estimated on the cells per leaf node. However, you can enable the full covariance matrix!

[11]:
flowsim_tree.set_seed(42)
# default
synthetic_sample = flowsim_tree.sample(n_cells=3, use_only_diagonal_covmat=True)
print(synthetic_sample)

flowsim_tree.set_seed(42)
synthetic_sample = flowsim_tree.sample(n_cells=3, use_only_diagonal_covmat=False)
print(synthetic_sample)
         FS      FS.0        SS    CD45RA      CCR7      CD28       PD1
0  0.033016  5.212589  0.027994 -0.042323  1.068687  1.230491  0.727421  \
1 -0.034233  5.219806  0.041332  0.009111  0.016646  1.218523  0.445606
2  0.395116  5.366544  0.085398  0.737223  0.762656  0.641986  1.200716

       CD27       CD4       CD8       CD3      CD57      CD45
0  0.667649  1.017606 -0.012887 -0.232470 -0.015077  0.049724
1  0.971163  1.167465  0.033722  0.081410  0.019712 -0.191248
2  1.028306 -0.014907  0.892293 -0.686076 -0.011778  0.000069
         FS      FS.0        SS    CD45RA      CCR7      CD28       PD1
0 -0.014040  5.287691  0.113469 -0.056042  0.963664  1.316469  0.788263  \
1  0.077410  5.221344  0.198341 -0.028692  0.049742  1.251280  0.402444
2 -0.093948  5.399214  0.514198  1.252213  1.269626  0.644032  0.538064

       CD27       CD4       CD8       CD3      CD57      CD45
0  1.189001  1.073660  0.097426 -0.041899  0.009045 -0.067505
1  0.653476  0.899301  0.023174 -0.265431  0.006454 -0.508050
2  1.183130  0.016804  1.130555 -0.224201  0.028751  0.131174

Synthetic data with effects

Previously, our ability was limited to mimicing the original dataset. However, we can also introduce effects in any cell population! This means we can modify the proportion of cells derived from a particular cell population in a synthesized sample. Specifically, we defined the proportion of cells in a population using a Dirichlet distribution and its corresponding concentration parameters. By altering these parameters, we can adjust the number of cells in the population.

Note that we are not changing how the cells are samples within a leaf node!

We show two ways to simulate different effects:

  1. TreeMeanRelative(): Change any cell population by a relative amount.

  2. TreeMeanDistributionSampler(): Force any distribution on a single cell population

TreeMeanRelative changes the concentration of a cell population by a certain factor (e.g. old = 173.2, new = old*2) and all other concentration parameters decrease (by a factor) such that the overall sum of concentration parameters remains equal.

TreeMeanDistributionSampler() first samples a target proportion from a given distribution and then sets the concentration parameter such that on average the cell population contains this proportion. Again, the sum of concentration parameters remains equal.

TreeMeanRelative

The following sets up a TreeMeanRelative where the cell population /AllCells/DN is changed by the factor 1 (therefore remains equal). n_samples is set to 2 and n_cells to 100.

Therefore, calling tmr.sample() generates two samples with the original distribution, 100 cells each. tmr.sample() returns the number of cells per leaf population for both samples, the parameters after changing (the original tree is not changed) and the actually sampled samples.

[12]:
from nbnode.simulation.TreeMeanRelative import TreeMeanRelative
tmr = TreeMeanRelative(
    flowsim_tree,
    change_pop_mean_proportional={"/AllCells/DN": 1},
    n_samples=2,
    n_cells=100,
    verbose=False
)
print(tmr)
true_popcounts, changed_parameters, sampled_samples = tmr.sample()

true_popcounts
<nbnode.simulation.TreeMeanRelative.TreeMeanRelative object at 0x7f937d9c8b50>
[12]:
sample_0 sample_1
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57+/PD1+ 0.0 0.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57+/PD1- 0.0 0.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57- 31.0 29.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28-/CD57+/PD1+ 0.0 0.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28-/CD57- 1.0 0.0
... ... ...
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57+/PD1+ 0.0 0.0
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57+/PD1- 0.0 1.0
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57- 0.0 0.0
/AllCells/DN 6.0 6.0
/AllCells/DP 0.0 0.0

96 rows × 2 columns

[13]:
changed_parameters["alpha"]  # these are the concentration parameters of the Dirichlet distribution
[13]:
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57+/PD1+       0.901580
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57+/PD1-       0.295210
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57-           48.203633
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28-/CD57+/PD1+       0.091623
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28-/CD57-            0.096534
                                                      ...
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57+/PD1+     0.159252
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57+/PD1-     0.153660
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57-          0.125163
/AllCells/DN                                        15.218566
/AllCells/DP                                         1.755718
Name: 0, Length: 96, dtype: float64
[14]:
print(f"There are {len(sampled_samples)} synthetic samples, showing only the first\n")
sampled_samples[0]
There are 2 synthetic samples, showing only the first

[14]:
FS FS.0 SS CD45RA CCR7 CD28 PD1 CD27 CD4 CD8 CD3 CD57 CD45
0 0.062235 5.241828 0.167840 -0.084488 0.462969 1.327377 1.020327 0.735205 0.919547 -0.024936 -0.459670 -0.024190 -0.155744
1 -0.137097 5.211038 0.006071 -0.013674 0.631770 1.354644 0.924415 1.071354 0.909430 -0.011694 -0.594609 -0.015247 -0.368281
2 0.049543 5.226160 -0.092377 0.032866 0.720488 1.218761 0.788295 1.015296 1.006380 0.065456 -0.138280 0.082154 0.100491
3 -0.161587 5.303274 -0.203429 -0.063234 0.272455 1.249179 0.712787 1.112724 1.045490 0.031186 -0.482921 0.040894 -0.401152
4 0.153456 5.171440 0.188800 0.084201 0.742456 1.389247 0.397038 0.904283 1.021047 0.004681 -0.283981 -0.004825 -0.402051
... ... ... ... ... ... ... ... ... ... ... ... ... ...
95 0.315077 5.217177 0.494209 -0.770656 -0.407820 0.650297 1.125016 0.144094 -0.011654 0.032542 0.723826 1.004760 0.156519
96 0.534101 5.270793 0.460711 0.347915 0.157224 0.784490 1.131765 0.869825 0.019530 -0.018556 0.217103 0.367984 0.458230
97 -0.226989 5.252685 -0.189814 0.154086 -0.156006 0.536826 1.165235 0.703251 0.030829 -0.052641 0.138474 0.634439 0.154382
98 -0.001238 5.322612 0.424035 0.193686 0.277735 1.124228 0.598610 0.624192 0.010864 -0.030898 0.231010 -0.259944 0.003429
99 0.108700 5.239704 -0.103004 0.554299 0.218544 0.567261 0.524959 0.422115 -0.107181 -0.006251 0.420655 0.294118 -0.431249

100 rows × 13 columns

Increasing a specific proportion on average increases the respective cell population’s number of cells:

[15]:
tmr_base = TreeMeanRelative(
    flowsim_tree,
    change_pop_mean_proportional={"/AllCells/DN": 1},
    n_samples=10,
    n_cells=100,
    verbose=False
)
tmr_ridiculously_changed = TreeMeanRelative(
    flowsim_tree,
    change_pop_mean_proportional={"/AllCells/DN": 10},
    n_samples=10,
    n_cells=100,
    verbose=False
)

You can check the changed parameters when sampling from the distribution:

[16]:
true_popcounts, changed_parameters, sampled_samples = tmr_base.sample()
r_true_popcounts, r_changed_parameters, r_sampled_samples = tmr_ridiculously_changed.sample()

changed_parameters["alpha"]
[16]:
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57+/PD1+       0.901580
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57+/PD1-       0.295210
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57-           48.203633
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28-/CD57+/PD1+       0.091623
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28-/CD57-            0.096534
                                                      ...
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57+/PD1+     0.159252
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57+/PD1-     0.153660
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57-          0.125163
/AllCells/DN                                        15.218566
/AllCells/DP                                         1.755718
Name: 0, Length: 96, dtype: float64
[17]:
r_changed_parameters["alpha"]
[17]:
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57+/PD1+        0.169073
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57+/PD1-        0.055361
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57-             9.039610
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28-/CD57+/PD1+        0.017182
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28-/CD57-             0.018103
                                                       ...
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57+/PD1+      0.029865
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57+/PD1-      0.028816
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57-           0.023472
/AllCells/DN                                        152.185662
/AllCells/DP                                          0.329249
Name: 0, Length: 96, dtype: float64

We observe that the concentration parameter of /AllCells/DN strongly increased! Therefore also the number of cells from that population changed

[18]:
true_popcounts
[18]:
sample_0 sample_1 sample_2 sample_3 sample_4 sample_5 sample_6 sample_7 sample_8 sample_9
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57+/PD1+ 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57+/PD1- 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57- 31.0 29.0 36.0 31.0 22.0 30.0 29.0 24.0 29.0 31.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28-/CD57+/PD1+ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28-/CD57- 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ...
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57+/PD1+ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57+/PD1- 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57- 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
/AllCells/DN 6.0 6.0 7.0 7.0 17.0 9.0 8.0 13.0 12.0 5.0
/AllCells/DP 0.0 0.0 1.0 2.0 0.0 0.0 0.0 0.0 2.0 0.0

96 rows × 10 columns

[19]:
r_true_popcounts
[19]:
sample_0 sample_1 sample_2 sample_3 sample_4 sample_5 sample_6 sample_7 sample_8 sample_9
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57+/PD1+ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57+/PD1- 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57- 4.0 4.0 6.0 4.0 3.0 4.0 5.0 3.0 6.0 5.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28-/CD57+/PD1+ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28-/CD57- 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ...
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57+/PD1+ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57+/PD1- 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57- 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
/AllCells/DN 86.0 90.0 90.0 90.0 83.0 85.0 82.0 89.0 85.0 85.0
/AllCells/DP 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

96 rows × 10 columns

TreeMeanDistributionSampler

The following sets up a TreeMeanDistributionSampler where the cell population /AllCells/DN is sampled from a normal distribution with mean of the original data and a standard deviation of 1 (default).
n_samples is set to 2 and n_cells to 100.

Therefore, calling tmr.sample() generates two samples with the values from a normal distribution with a mean of 8.28% being in that population, a standard deviation of 1. The value from that normal distribution is then set as target for the Dirichlet distribution, and 100 cells are synthesized.

tmr.sample() returns the number of cells per leaf population for both samples, the parameters after changing (the original tree is not changed) and the actually sampled samples.

[20]:
from nbnode.simulation.TreeMeanDistributionSampler import TreeMeanDistributionSampler

tmds = TreeMeanDistributionSampler(
    flowsim_tree,
    population_name_to_change="/AllCells/DN",
    n_samples=2,
    n_cells=100,
    verbose=False,
)
(
    all_true_popcounts,
    all_changed_parameters,
    all_sampled_samples,
    all_targets,
) = tmds.sample()



print(flowsim_tree.pop_mean("/AllCells/DN"))
all_true_popcounts
0.08279977774316377
/home/gugl/.conda_envs/nbnode_pyscaffold/lib/python3.8/site-packages/nbnode/simulation/sim_target.py:90: UserWarning: Changing more than 1 population is NOT setting the means to the specifiedvalues - only the LAST value will be exactly the set percentage
  warnings.warn(
[20]:
sample_0 sample_1
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57+/PD1+ 0.0 0.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57+/PD1- 0.0 0.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57- 32.0 29.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28-/CD57+/PD1+ 0.0 0.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28-/CD57- 0.0 0.0
... ... ...
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57+/PD1+ 0.0 0.0
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57+/PD1- 0.0 1.0
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57- 0.0 0.0
/AllCells/DN 6.0 8.0
/AllCells/DP 0.0 0.0

96 rows × 2 columns

Alternatively, you can insert any population distribution with a function mean_distribution taking one argument (original_mean) and returning a class with a .sample() method which can be casted to float.

Internally, the target_mean_distribution is set to mean_distribution(original_mean) of the original population. Then for every of the n_samples, a single value is sampled by float(target_mean_distribution.sample()) and this value is set as mean of the Dirichlet distribution for the population_name_to_change.

We included a PseudoTorchDistributionNormal with the relevant method, but you can use any distribution from pytorch-distributions.

In the following example, we introduce a much higher variance on one population by basing it on a normal distribution with scale 10.

[21]:
from nbnode.simulation.TreeMeanDistributionSampler import PseudoTorchDistributionNormal

tmds_custom = TreeMeanDistributionSampler(
    flowsim_tree,
    population_name_to_change="/AllCells/DN",
    mean_distribution=lambda original_mean: PseudoTorchDistributionNormal(
        loc=original_mean, scale=10
    ),
    n_samples=10,
    n_cells=10000,
    verbose=False,
)
(
    all_true_popcounts,
    all_changed_parameters,
    all_sampled_samples,
    all_targets,
) = tmds_custom.sample()

all_true_popcounts

/home/gugl/.conda_envs/nbnode_pyscaffold/lib/python3.8/site-packages/nbnode/simulation/sim_target.py:90: UserWarning: Changing more than 1 population is NOT setting the means to the specifiedvalues - only the LAST value will be exactly the set percentage
  warnings.warn(
/home/gugl/.conda_envs/nbnode_pyscaffold/lib/python3.8/site-packages/nbnode/simulation/sim_target.py:90: UserWarning: Changing more than 1 population is NOT setting the means to the specifiedvalues - only the LAST value will be exactly the set percentage
  warnings.warn(
/home/gugl/.conda_envs/nbnode_pyscaffold/lib/python3.8/site-packages/nbnode/simulation/sim_target.py:90: UserWarning: Changing more than 1 population is NOT setting the means to the specifiedvalues - only the LAST value will be exactly the set percentage
  warnings.warn(
/home/gugl/.conda_envs/nbnode_pyscaffold/lib/python3.8/site-packages/nbnode/simulation/sim_target.py:90: UserWarning: Changing more than 1 population is NOT setting the means to the specifiedvalues - only the LAST value will be exactly the set percentage
  warnings.warn(
/home/gugl/.conda_envs/nbnode_pyscaffold/lib/python3.8/site-packages/nbnode/simulation/sim_target.py:90: UserWarning: Changing more than 1 population is NOT setting the means to the specifiedvalues - only the LAST value will be exactly the set percentage
  warnings.warn(
/home/gugl/.conda_envs/nbnode_pyscaffold/lib/python3.8/site-packages/nbnode/simulation/sim_target.py:90: UserWarning: Changing more than 1 population is NOT setting the means to the specifiedvalues - only the LAST value will be exactly the set percentage
  warnings.warn(
/home/gugl/.conda_envs/nbnode_pyscaffold/lib/python3.8/site-packages/nbnode/simulation/sim_target.py:90: UserWarning: Changing more than 1 population is NOT setting the means to the specifiedvalues - only the LAST value will be exactly the set percentage
  warnings.warn(
/home/gugl/.conda_envs/nbnode_pyscaffold/lib/python3.8/site-packages/nbnode/simulation/sim_target.py:90: UserWarning: Changing more than 1 population is NOT setting the means to the specifiedvalues - only the LAST value will be exactly the set percentage
  warnings.warn(
/home/gugl/.conda_envs/nbnode_pyscaffold/lib/python3.8/site-packages/nbnode/simulation/sim_target.py:90: UserWarning: Changing more than 1 population is NOT setting the means to the specifiedvalues - only the LAST value will be exactly the set percentage
  warnings.warn(
[21]:
sample_0 sample_1 sample_2 sample_3 sample_4 sample_5 sample_6 sample_7 sample_8 sample_9
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57+/PD1+ 23.0 40.0 50.0 125.0 23.0 81.0 32.0 50.0 29.0 81.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57+/PD1- 0.0 0.0 0.0 0.0 22.0 0.0 0.0 23.0 1.0 4.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57- 2671.0 2698.0 2865.0 2693.0 2262.0 2362.0 3006.0 2262.0 2621.0 2837.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28-/CD57+/PD1+ 0.0 32.0 32.0 2.0 0.0 0.0 0.0 0.0 0.0 3.0
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28-/CD57- 27.0 0.0 0.0 0.0 0.0 4.0 14.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ...
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57+/PD1+ 0.0 13.0 0.0 0.0 3.0 14.0 0.0 9.0 0.0 0.0
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57+/PD1- 0.0 165.0 0.0 4.0 24.0 6.0 1.0 0.0 0.0 0.0
/AllCells/CD4-/CD8+/naive/CD27-/CD28-/CD57- 44.0 4.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
/AllCells/DN 529.0 111.0 2215.0 1092.0 524.0 1056.0 0.0 1677.0 954.0 497.0
/AllCells/DP 96.0 86.0 28.0 102.0 3.0 119.0 59.0 70.0 242.0 19.0

96 rows × 10 columns

FlowSimulationTree methods and attributes

FlowSimulationTree has a number of interesting methods and attributes.

flowsim_tree.estimate_population_distribution() Given node percentages per sample for all lefa nodes estimate their corresponding dirichlet parameters

flowsim_tree.remove_population() You can remove a population from the Dirichlet distribution. No cells are going to be sampled from that one and the corresponding concentration parameter is removed (not redistributed).

[22]:
import os
import pickle
with open(os.path.join(os.pardir, os.pardir, "tests", "testdata", "flowcytometry", "flowsim_tree.pickle"), "rb") as f:
    flowsim_tree = pickle.load(f)
[23]:
# Precision is equal to the total sum of concentration parameters
flowsim_tree.precision
[23]:
183.79960241689196
[24]:
# concentration parameters for all populations, also intermediate populations
flowsim_tree.alpha_all
[24]:
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57+/PD1+      0.901580
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57+/PD1-      0.295210
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28+/CD57-          48.203633
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28-/CD57+/PD1+      0.091623
/AllCells/CD4+/CD8-/Tcm/CD27+/CD28-/CD57-           0.096534
                                                     ...
/AllCells/CD4+/CD8-/Tem/CD27-/CD28+                 3.228699
/AllCells/CD4+/CD8-/Tem/CD27-/CD28-                 0.530434
/AllCells/CD4+/CD8-/Tem                            11.384744
/AllCells/CD4+/CD8-                               120.537038
/AllCells                                         183.799602
Name: 0, Length: 139, dtype: float64
[25]:
# All leaf node names (i.e. populations) corresponding to the given population
print(flowsim_tree.pop_leafnode_names("/AllCells/CD4+/CD8-/naive"))

# Also works directly with NBNodes
flowsim_tree.pop_leafnode_names(flowsim_tree.rootnode_structure["/AllCells/CD4+/CD8-/naive"])
['/AllCells/CD4+/CD8-/naive/CD27+/CD28+/CD57+/PD1+', '/AllCells/CD4+/CD8-/naive/CD27+/CD28+/CD57+/PD1-', '/AllCells/CD4+/CD8-/naive/CD27+/CD28+/CD57-', '/AllCells/CD4+/CD8-/naive/CD27+/CD28-/CD57+/PD1+', '/AllCells/CD4+/CD8-/naive/CD27+/CD28-/CD57+/PD1-', '/AllCells/CD4+/CD8-/naive/CD27+/CD28-/CD57-', '/AllCells/CD4+/CD8-/naive/CD27-/CD28+/CD57+/PD1+', '/AllCells/CD4+/CD8-/naive/CD27-/CD28+/CD57+/PD1-', '/AllCells/CD4+/CD8-/naive/CD27-/CD28+/CD57-', '/AllCells/CD4+/CD8-/naive/CD27-/CD28-/CD57+/PD1+', '/AllCells/CD4+/CD8-/naive/CD27-/CD28-/CD57+/PD1-', '/AllCells/CD4+/CD8-/naive/CD27-/CD28-/CD57-']
[25]:
['/AllCells/CD4+/CD8-/naive/CD27+/CD28+/CD57+/PD1+',
 '/AllCells/CD4+/CD8-/naive/CD27+/CD28+/CD57+/PD1-',
 '/AllCells/CD4+/CD8-/naive/CD27+/CD28+/CD57-',
 '/AllCells/CD4+/CD8-/naive/CD27+/CD28-/CD57+/PD1+',
 '/AllCells/CD4+/CD8-/naive/CD27+/CD28-/CD57+/PD1-',
 '/AllCells/CD4+/CD8-/naive/CD27+/CD28-/CD57-',
 '/AllCells/CD4+/CD8-/naive/CD27-/CD28+/CD57+/PD1+',
 '/AllCells/CD4+/CD8-/naive/CD27-/CD28+/CD57+/PD1-',
 '/AllCells/CD4+/CD8-/naive/CD27-/CD28+/CD57-',
 '/AllCells/CD4+/CD8-/naive/CD27-/CD28-/CD57+/PD1+',
 '/AllCells/CD4+/CD8-/naive/CD27-/CD28-/CD57+/PD1-',
 '/AllCells/CD4+/CD8-/naive/CD27-/CD28-/CD57-']
[26]:
flowsim_tree.rootnode_structure.pretty_print()
AllCells (counter:0)
├── DN (counter:0)
├── DP (counter:0)
├── CD4-/CD8+ (counter:0)
│   ├── naive (counter:0)
│   │   ├── CD27+/CD28+ (counter:0)
│   │   │   ├── CD57+/PD1+ (counter:0)
│   │   │   ├── CD57+/PD1- (counter:0)
│   │   │   └── CD57- (counter:0)
│   │   ├── CD27+/CD28- (counter:0)
│   │   │   ├── CD57+/PD1+ (counter:0)
│   │   │   ├── CD57+/PD1- (counter:0)
│   │   │   └── CD57- (counter:0)
│   │   ├── CD27-/CD28+ (counter:0)
│   │   │   ├── CD57+/PD1+ (counter:0)
│   │   │   ├── CD57+/PD1- (counter:0)
│   │   │   └── CD57- (counter:0)
│   │   └── CD27-/CD28- (counter:0)
│   │       ├── CD57+/PD1+ (counter:0)
│   │       ├── CD57+/PD1- (counter:0)
│   │       └── CD57- (counter:0)
│   ├── Tcm (counter:0)
│   │   ├── CD27+/CD28+ (counter:0)
│   │   │   ├── CD57+/PD1+ (counter:0)
│   │   │   ├── CD57+/PD1- (counter:0)
│   │   │   └── CD57- (counter:0)
│   │   ├── CD27+/CD28- (counter:0)
│   │   │   ├── CD57+/PD1+ (counter:0)
│   │   │   ├── CD57+/PD1- (counter:0)
│   │   │   └── CD57- (counter:0)
│   │   ├── CD27-/CD28+ (counter:0)
│   │   │   ├── CD57+/PD1+ (counter:0)
│   │   │   ├── CD57+/PD1- (counter:0)
│   │   │   └── CD57- (counter:0)
│   │   └── CD27-/CD28- (counter:0)
│   │       ├── CD57+/PD1+ (counter:0)
│   │       ├── CD57+/PD1- (counter:0)
│   │       └── CD57- (counter:0)
│   ├── Temra (counter:0)
│   │   ├── CD27+/CD28+ (counter:0)
│   │   │   ├── CD57+/PD1+ (counter:0)
│   │   │   ├── CD57+/PD1- (counter:0)
│   │   │   └── CD57- (counter:0)
│   │   ├── CD27+/CD28- (counter:0)
│   │   │   ├── CD57+/PD1+ (counter:0)
│   │   │   ├── CD57+/PD1- (counter:0)
│   │   │   └── CD57- (counter:0)
│   │   ├── CD27-/CD28+ (counter:0)
│   │   │   ├── CD57+/PD1+ (counter:0)
│   │   │   ├── CD57+/PD1- (counter:0)
│   │   │   └── CD57- (counter:0)
│   │   └── CD27-/CD28- (counter:0)
│   │       ├── CD57+/PD1+ (counter:0)
│   │       ├── CD57+/PD1- (counter:0)
│   │       └── CD57- (counter:0)
│   └── Tem (counter:0)
│       ├── CD27+/CD28+ (counter:0)
│       │   ├── CD57+/PD1+ (counter:0)
│       │   ├── CD57+/PD1- (counter:0)
│       │   └── CD57- (counter:0)
│       ├── CD27+/CD28- (counter:0)
│       │   ├── CD57+/PD1+ (counter:0)
│       │   ├── CD57+/PD1- (counter:0)
│       │   └── CD57- (counter:0)
│       ├── CD27-/CD28+ (counter:0)
│       │   ├── CD57+/PD1+ (counter:0)
│       │   ├── CD57+/PD1- (counter:0)
│       │   └── CD57- (counter:0)
│       └── CD27-/CD28- (counter:0)
│           ├── CD57+/PD1+ (counter:0)
│           ├── CD57+/PD1- (counter:0)
│           └── CD57- (counter:0)
└── CD4+/CD8- (counter:0)
    ├── naive (counter:0)
    │   ├── CD27+/CD28+ (counter:0)
    │   │   ├── CD57+/PD1+ (counter:0)
    │   │   ├── CD57+/PD1- (counter:0)
    │   │   └── CD57- (counter:0)
    │   ├── CD27+/CD28- (counter:0)
    │   │   ├── CD57+/PD1+ (counter:0)
    │   │   ├── CD57+/PD1- (counter:0)
    │   │   └── CD57- (counter:0)
    │   ├── CD27-/CD28+ (counter:0)
    │   │   ├── CD57+/PD1+ (counter:0)
    │   │   ├── CD57+/PD1- (counter:0)
    │   │   └── CD57- (counter:0)
    │   └── CD27-/CD28- (counter:0)
    │       ├── CD57+/PD1+ (counter:0)
    │       ├── CD57+/PD1- (counter:0)
    │       └── CD57- (counter:0)
    ├── Tcm (counter:0)
    │   ├── CD27+/CD28+ (counter:0)
    │   │   ├── CD57+/PD1+ (counter:0)
    │   │   ├── CD57+/PD1- (counter:0)
    │   │   └── CD57- (counter:0)
    │   ├── CD27+/CD28- (counter:0)
    │   │   ├── CD57+/PD1+ (counter:0)
    │   │   ├── CD57+/PD1- (counter:0)
    │   │   └── CD57- (counter:0)
    │   ├── CD27-/CD28+ (counter:0)
    │   │   ├── CD57+/PD1+ (counter:0)
    │   │   ├── CD57+/PD1- (counter:0)
    │   │   └── CD57- (counter:0)
    │   └── CD27-/CD28- (counter:0)
    │       ├── CD57+/PD1+ (counter:0)
    │       ├── CD57+/PD1- (counter:0)
    │       └── CD57- (counter:0)
    ├── Temra (counter:0)
    │   ├── CD27+/CD28+ (counter:0)
    │   │   ├── CD57+/PD1+ (counter:0)
    │   │   ├── CD57+/PD1- (counter:0)
    │   │   └── CD57- (counter:0)
    │   ├── CD27+/CD28- (counter:0)
    │   │   ├── CD57+/PD1+ (counter:0)
    │   │   ├── CD57+/PD1- (counter:0)
    │   │   └── CD57- (counter:0)
    │   ├── CD27-/CD28+ (counter:0)
    │   │   ├── CD57+/PD1+ (counter:0)
    │   │   ├── CD57+/PD1- (counter:0)
    │   │   └── CD57- (counter:0)
    │   └── CD27-/CD28- (counter:0)
    │       ├── CD57+/PD1+ (counter:0)
    │       ├── CD57+/PD1- (counter:0)
    │       └── CD57- (counter:0)
    └── Tem (counter:0)
        ├── CD27+/CD28+ (counter:0)
        │   ├── CD57+/PD1+ (counter:0)
        │   ├── CD57+/PD1- (counter:0)
        │   └── CD57- (counter:0)
        ├── CD27+/CD28- (counter:0)
        │   ├── CD57+/PD1+ (counter:0)
        │   ├── CD57+/PD1- (counter:0)
        │   └── CD57- (counter:0)
        ├── CD27-/CD28+ (counter:0)
        │   ├── CD57+/PD1+ (counter:0)
        │   ├── CD57+/PD1- (counter:0)
        │   └── CD57- (counter:0)
        └── CD27-/CD28- (counter:0)
            ├── CD57+/PD1+ (counter:0)
            ├── CD57+/PD1- (counter:0)
            └── CD57- (counter:0)
[27]:
# Concentration parameter for a single population
flowsim_tree.pop_alpha("/AllCells/CD4+/CD8-/naive")
[27]:
54.5458005721906
[28]:
# Mean for a single population
flowsim_tree.pop_mean("/AllCells/CD4+/CD8-/naive")
[28]:
0.2967677832537989
[29]:
# Set the new mean of the Dirichlet distribution for a single population
flowsim_tree.new_pop_mean("/AllCells/CD4+/CD8-/naive", 0.5)
print(flowsim_tree.pop_alpha("/AllCells/CD4+/CD8-/naive"))
print(flowsim_tree.pop_mean("/AllCells/CD4+/CD8-/naive"))

91.89980120844596
0.5000000000000001

The previous command new_pop_mean actively changes the whole dirichlet distribution. It is usually important to reset the changed parameters to the original values. You can do this easily with reset_populations

[30]:
flowsim_tree.reset_populations()
print(flowsim_tree.pop_alpha("/AllCells/CD4+/CD8-/naive"))
print(flowsim_tree.pop_mean("/AllCells/CD4+/CD8-/naive"))
54.5458005721906
0.2967677832537989