nbnode.simulation package

Submodules

nbnode.simulation.FlowSimulationTree module

class nbnode.simulation.FlowSimulationTree.BaseFlowSimulationTree(rootnode: NBNode, data_cellgroup_col: str = 'sample_name', node_percentages: DataFrame | None = None, seed: int = 12987, include_features: List[str] = 'dataset_melanoma', verbose: bool = False)[source]

Bases: object

Base class for flow simulation.

estimate_cell_distributions(nodes: List[NBNode]) Dict[str, Dict[Literal['mu', 'cov'], DataFrame]][source]

Estimate the distribution of cells in each node.

If no distribution can be estimated (fewer than 2 cells), the node is removed from the simulation.

Parameters:

nodes (List[NBNode]) – A list of nodes whose distribution should be estimated.

Returns:

A dictionary of the form:

{
    "node_name": {
        "mu": pd.DataFrame,  # mean of the distribution
        "cov": pd.DataFrame,  # covariance of the distribution
    }
}

The mean and covariance matrix are calculated for the features given in self.include_features.

Return type:

Dict[str, Dict[Literal["mu", "cov"], pd.DataFrame]]
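
As a rough illustration of the estimate described above, the following standard-library sketch computes a mean vector and a sample covariance matrix per node and drops nodes with fewer than two cells. The function names, the example node names and the plain-list data layout are assumptions made for this example; the library itself works with pd.DataFrame objects.

```python
from statistics import fmean
from typing import Dict, List, Optional

Matrix = List[List[float]]  # rows are cells, columns are features


def estimate_mean_cov(cells: Matrix) -> Optional[Dict[str, object]]:
    """Return {"mu": ..., "cov": ...} for one node, or None if it has < 2 cells."""
    n_cells = len(cells)
    if n_cells < 2:
        return None  # a covariance cannot be estimated from fewer than 2 cells
    n_features = len(cells[0])
    mu = [fmean(row[j] for row in cells) for j in range(n_features)]
    # Unbiased sample covariance (divides by n_cells - 1).
    cov = [
        [
            sum((row[i] - mu[i]) * (row[j] - mu[j]) for row in cells) / (n_cells - 1)
            for j in range(n_features)
        ]
        for i in range(n_features)
    ]
    return {"mu": mu, "cov": cov}


def estimate_all(node_cells: Dict[str, Matrix]) -> Dict[str, Dict[str, object]]:
    """Estimate every node and skip nodes without a usable estimate."""
    estimates = {}
    for name, cells in node_cells.items():
        estimate = estimate_mean_cov(cells)
        if estimate is not None:
            estimates[name] = estimate
    return estimates


# Hypothetical nodes: the second one has a single cell and is dropped.
node_cells = {
    "/AllCells/A": [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]],
    "/AllCells/B": [[0.5, 0.5]],
}
estimates = estimate_all(node_cells)
```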

abstract static estimate_population_distribution(node_percentages) Dict[Literal['__name'] | str, Any][source]

Estimate the distribution of populations.

The distribution is estimated from the given node percentages. The distribution parameters are usually pd.DataFrames.

Parameters:

node_percentages (pd.DataFrame) – A DataFrame with the samples as columns and the populations as rows. The values are the percentage of cells in the population.

Returns:

Dict[str, Any]:

{
    # __name must be given
    "__name": list(population_means.index),
    "mean": population_means,   # distribution parameter
    "cov": population_cov,      # distribution parameter
}

Example:

# Calculate mean and covariance for each of the populations
# (rows of node_percentages)

population_means = node_percentages.mean(axis=1)
if node_percentages.shape[1] > 1:
    population_cov = node_percentages.T.cov()
else:
    population_cov = np.identity(len(population_means))
    population_cov = pd.DataFrame(
        population_cov,
        columns=population_means.index,
        index=population_means.index,
    )
return {
    "__name": list(population_means.index),
    "mean": population_means,
    "cov": population_cov,
}

abstract generate_populations(population_parameters: Dict[str, Any], n_cells: int, *args, **kwargs) List[float][source]

Generate a list of percentages per cell population.

Parameters:
  • population_parameters (Dict[str, Any]) – Parameters for the distribution to draw from.

  • n_cells (int) – How many cells should be drawn per sample

Returns:

List of percentages per cell population

Return type:

List[float]

Example:

# 1. Generate random percentages for each population
random_mean = self._rng.multivariate_normal(
    **population_parameters
)
# HACK: shift the values so that all of them are non-negative. This is a
# workaround and should be replaced with a better distribution.
random_mean -= min(random_mean)
# add a small fraction of the second-smallest value such that the
# "most negative population" has at least _some_ chance of occurring
random_mean += sorted(set(random_mean))[1] / 1e3
# normalize to 1
random_mean = random_mean / sum(random_mean)

# 2. "Sample" the number of cells according to the random percentages
onesample_ncells_perpop = random_mean * n_cells
onesample_ncells_perpop = np.floor(onesample_ncells_perpop)

if sum(onesample_ncells_perpop) < n_cells:
    # because of floor, too few cells were sampled
    remaining_cells = self._rng.choice(
        [population_i for population_i in range(len(random_mean))],
        size=int(n_cells - sum(onesample_ncells_perpop)),
        replace=True,
        p=random_mean,
    )
    for cell_from_pop_i in remaining_cells:
        onesample_ncells_perpop[cell_from_pop_i] += 1
return onesample_ncells_perpop

ncells_from_percentages(percentages: DataFrame, n_cells: int) List[int][source]

‘Sample’ the number of cells according to the random percentages

Parameters:
  • percentages (pd.DataFrame) – A DataFrame with the sample percentages as columns and the populations as rows.

  • n_cells (int) – The total number of cells to be sampled.

Returns:

A list with the number of cells per population. The sum of the list is equal to n_cells.

Return type:

List[int]
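
The guarantee that the counts sum to n_cells can be sketched as follows: floor each expected count, then hand out the missing cells at random, weighted by the percentages. This mirrors the approach shown in the generate_populations example; the function name and the use of a plain list instead of a pd.DataFrame are assumptions made for this example.

```python
import math
import random
from typing import List


def counts_from_percentages(
    percentages: List[float], n_cells: int, rng: random.Random
) -> List[int]:
    """Turn proportions (summing to 1) into integer counts summing to n_cells."""
    counts = [math.floor(p * n_cells) for p in percentages]
    missing = n_cells - sum(counts)
    if missing > 0:
        # Flooring lost some cells: assign them to populations at random,
        # with probabilities given by the percentages.
        extra = rng.choices(range(len(percentages)), weights=percentages, k=missing)
        for population_i in extra:
            counts[population_i] += 1
    return counts


rng = random.Random(12987)
counts = counts_from_percentages([0.3333, 0.3333, 0.3334], 10, rng)
```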

abstract remove_population(population_name: str)[source]

Remove a certain population from the simulation. This is necessary if a population had no cells, in which case the cell parameters for that population cannot be estimated. Implementations must ALWAYS call self.population_parameters["__name"].remove(population_name).

Parameters:

population_name (str) – The name of the population which should be removed from the population_parameters.

Example:

self.population_parameters["__name"].remove(population_name)
self.population_parameters["mean"].drop(
    population_name, inplace=True
)
self.population_parameters["cov"].drop(
    population_name, inplace=True, axis=0
)
self.population_parameters["cov"].drop(
    population_name, inplace=True, axis=1
)

reset_populations()[source]

Reset the population parameters to the initially estimated values.

sample(n_cells: int = 10000, return_sampled_cell_numbers: bool = False, use_only_diagonal_covmat: bool = True, **population_parameters) Tuple[DataFrame, Series] | DataFrame[source]

Sample cells from the tree.

Parameters:
  • n_cells (int, optional) – Number of cells to sample from the tree. Defaults to 10000.

  • return_sampled_cell_numbers (bool, optional) – Whether to return the number of cells sampled per population as well as the sampled cells themselves. Defaults to False.

  • use_only_diagonal_covmat (bool, optional) – Whether to use only the diagonal of the covariance matrix when sampling cells. Defaults to True.

Returns:

If return_sampled_cell_numbers is True, a tuple with the sampled cells and the number of cells sampled per population is returned. Otherwise, only the sampled cells are returned.

Return type:

Union[Tuple[pd.DataFrame, pd.Series], pd.DataFrame]
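
To illustrate what use_only_diagonal_covmat=True implies, the sketch below draws cells for a single population while ignoring all off-diagonal covariance entries, so every feature is sampled independently from its own normal distribution. The function name and the list-based inputs are assumptions made for this example and do not reflect the library's internals.

```python
import math
import random
from typing import List


def sample_cells_diagonal(
    mu: List[float], cov: List[List[float]], n_cells: int, rng: random.Random
) -> List[List[float]]:
    """Draw n_cells rows using only the diagonal (variances) of cov."""
    sds = [math.sqrt(cov[j][j]) for j in range(len(mu))]
    return [[rng.gauss(m, sd) for m, sd in zip(mu, sds)] for _ in range(n_cells)]


mu = [1.0, -2.0]
cov = [[4.0, 1.5], [1.5, 0.25]]  # the off-diagonal 1.5 is ignored
cells = sample_cells_diagonal(mu, cov, 2000, random.Random(12987))
```

Using the full covariance matrix instead would require a joint multivariate normal draw, which preserves the correlations between features.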

sample_populations(n_cells: int = 10000, **population_parameters) Series[source]

Generate number of cells according to leaf node population distributions.

Parameters:

n_cells (int, optional) – Number of cells to sample. Defaults to 10000.

Returns:

A pandas Series with the number of cells per population.

Return type:

pd.Series

set_seed(seed: int)[source]

Set the seed for the random number generator.

Parameters:

seed (int) – The seed for the random number generator.

class nbnode.simulation.FlowSimulationTree.FlowSimulationTreeDirichlet(rootnode: NBNode, data_cellgroup_col: str = 'sample_name', node_percentages: DataFrame | None = None, seed: int = 12987, include_features='dataset_melanoma', verbose: bool = False)[source]

Bases: BaseFlowSimulationTree

Simulate a tree of cell populations using the Dirichlet distribution.

property alpha_all: Series

The alpha parameter of the Dirichlet distribution for all populations

Concentration parameters “alpha” of the Dirichlet distribution. The alpha parameter is a vector of positive values, where each value corresponds to a population. The larger the value, the more cells will be generated for that population.

alpha_all contains the concentration parameters for all populations, including the intermediate ones.

Returns:

A series (named) of alpha parameters per cell population (including intermediate populations).

Return type:

pd.Series

static estimate_population_distribution(node_percentages)[source]

Estimate the population distribution using the Dirichlet distribution.

generate_populations(population_parameters, n_cells: int, *args, **kwargs) DataFrame[source]

Generate a population of cells using the Dirichlet distribution.

Parameters:
  • population_parameters (Dict[str, Any]) –

    Given as a dictionary with keys:

    • alpha: The alpha parameter of the Dirichlet distribution

    • __name: The name of the population

  • n_cells (int) – The number of cells to generate

Returns:

A dataframe with the generated cells per population

Return type:

pd.DataFrame
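
For readers unfamiliar with the Dirichlet distribution, one standard way to draw population proportions from concentration parameters is to normalise independent gamma draws. The sketch below uses only the standard library; the function name and all population names other than the one used elsewhere in this documentation are hypothetical, and the library's own implementation may differ.

```python
import random
from typing import Dict


def sample_dirichlet(alpha: Dict[str, float], rng: random.Random) -> Dict[str, float]:
    """Draw one vector of proportions from Dirichlet(alpha)."""
    # Gamma(alpha_k, 1) draws, normalised to sum to 1, are Dirichlet distributed.
    draws = {name: rng.gammavariate(a, 1.0) for name, a in alpha.items()}
    total = sum(draws.values())
    return {name: draw / total for name, draw in draws.items()}


alpha = {
    "/AllCells/CD4+/CD8-/Tem": 5.0,
    "/AllCells/CD4+/CD8-/other": 15.0,  # hypothetical population
    "/AllCells/CD4-/CD8+": 30.0,  # hypothetical population
}
proportions = sample_dirichlet(alpha, random.Random(12987))
```

The proportions can then be converted into cell counts for a sample of n_cells cells.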

property mean_leafs: Series

Mean of the Dirichlet distribution.

See: Thomas P. Minka, "Estimating a Dirichlet distribution", 2000 (revised 2003, 2009, 2012).

Returns:

A series (named) of means per cell population

Return type:

pd.Series
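
In the parameterisation used by Minka, the concentration vector alpha factors into a precision (the sum of all alpha values) and a mean vector (alpha divided by the precision). The sketch below shows this relationship with plain dictionaries; the function name and the population names are placeholders, and whether the library computes its properties in exactly this way is an assumption.

```python
from typing import Dict, Tuple


def mean_and_precision(alpha: Dict[str, float]) -> Tuple[Dict[str, float], float]:
    """Split Dirichlet concentration parameters into mean vector and precision."""
    precision = sum(alpha.values())
    mean = {name: a / precision for name, a in alpha.items()}
    return mean, precision


alpha = {"pop_a": 2.0, "pop_b": 6.0, "pop_c": 12.0}
mean, precision = mean_and_precision(alpha)
```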

new_pop_mean(population_node_full_name: str, percentage: float)[source]

Set the new mean of the Dirichlet distribution for a given population

Parameters:
  • population_node_full_name (str) – The get_name_full() of a population or the node itself.

  • percentage (float) – The new percentage of cells that should be generated for the given population. Must be between 0 and 1.
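
One plausible way to change the mean of a single leaf population is sketched below: keep the total precision fixed, set the target population to the requested proportion, and rescale all other populations so that the means still sum to 1. This is an assumption about how such an update could work, not a description of the library's implementation, and the function and population names are hypothetical.

```python
from typing import Dict


def with_new_mean(
    alpha: Dict[str, float], population: str, percentage: float
) -> Dict[str, float]:
    """Return new alphas where `population` has mean `percentage`."""
    if not 0.0 < percentage < 1.0:
        raise ValueError("percentage must be between 0 and 1")
    if len(alpha) < 2:
        raise ValueError("at least two populations are required")
    precision = sum(alpha.values())
    old_mean = alpha[population] / precision
    # Factor that shrinks or grows all other populations proportionally.
    rescale = (1.0 - percentage) / (1.0 - old_mean)
    return {
        name: percentage * precision if name == population else a * rescale
        for name, a in alpha.items()
    }


alpha = {"pop_a": 2.0, "pop_b": 6.0, "pop_c": 12.0}
new_alpha = with_new_mean(alpha, "pop_a", 0.4)
```

With this scheme the ratios between the unchanged populations stay the same, and the overall precision of the distribution is preserved.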

pop_alpha(population_node_full_name: str) float[source]

Get the alpha parameter of the Dirichlet distribution for a given population

Parameters:

population_node_full_name (str) – The get_name_full() of a population or the node itself.

Returns:

The alpha parameter of the Dirichlet distribution for the given population.

Return type:

float

pop_leafnode_names(population_node_full_name: str | NBNode) List[str][source]

Get the names of the leaf nodes of any intermediate population

Parameters:

population_node_full_name (Union[str, NBNode]) – The get_name_full() of a population, or the node itself.

Returns:

A list of the get_name_full() of the leaf nodes below the given population.

Return type:

List[str]

pop_mean(population_node_full_name: str)[source]

Get the mean of the Dirichlet distribution for a given population

property precision: float

Precision of the Dirichlet distribution.

See: Thomas P. Minka, "Estimating a Dirichlet distribution", 2000 (revised 2003, 2009, 2012).

Returns:

Total precision of all populations

Return type:

float

remove_population(population_name: str)[source]

Remove a population from the tree

Parameters:

population_name (str) – The get_name_full() of a population

nbnode.simulation.TreeMeanDistributionSampler module

class nbnode.simulation.TreeMeanDistributionSampler.PseudoTorchDistributionNormal(loc: float, scale: float)[source]

Bases: object

A class that mimics the torch.distributions.Distribution class.

This class is used as a fallback if torch is not installed. It is used within TreeMeanDistributionSampler to sample a new mean for a population that is to be changed.

So the calls are:

mean_distribution = PseudoTorchDistributionNormal(loc=new_mean, scale=1)
new_value_from_distribution = mean_distribution.sample()

sample() float[source]

Sample a value from a normal distribution with the given parameters.

Returns:

A value from a normal distribution with the given parameters.

Return type:

float
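
The idea of a torch-free stand-in can be sketched with the standard library alone: any object that exposes a sample() method returning a float can play the role of the distribution. The class below is a hypothetical illustration (including its name and its seed argument), not the library's PseudoTorchDistributionNormal.

```python
import random
from typing import Optional


class NormalLike:
    """Hypothetical stand-in exposing the same sample() interface."""

    def __init__(self, loc: float, scale: float, seed: Optional[int] = None):
        self.loc = loc
        self.scale = scale
        self._rng = random.Random(seed)  # seed is an addition for reproducibility

    def sample(self) -> float:
        """Draw one value from a normal distribution N(loc, scale**2)."""
        return self._rng.gauss(self.loc, self.scale)


mean_distribution = NormalLike(loc=0.05, scale=1.0, seed=0)
new_value_from_distribution = mean_distribution.sample()
```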

class nbnode.simulation.TreeMeanDistributionSampler.TreeMeanDistributionSampler(flowsim_tree: str | ~nbnode.simulation.FlowSimulationTree.FlowSimulationTreeDirichlet, population_name_to_change: str, mean_distribution=<function mean_dist_fun>, n_samples=100, n_cells=10000, use_only_diagonal_covmat=False, verbose=False, seed_sample_0=129873, save_dir='sim/sim00_m0.sd1', save_type: str = 'csv', only_return_sampled_cell_numbers=False, save_changed_parameters=False, minimum_target_mean_proportion=1e-09)[source]

Bases: object

A class synthesizing cytometry samples with a distribution for the mean of a population.

sample()[source]

Synthesize cytometry samples with a distribution for the mean of a population.

See the __init__ method for the description of the arguments.

Returns:

  • A dataframe with the sampled cell numbers.

  • A dictionary with the parameters of the dirichlet distribution.

  • A list of dataframes with the sampled cell matrices (n_cells X features) for each sample.

Return type:

Tuple[pd.DataFrame, Dict[str, Any], List[pd.DataFrame]]

nbnode.simulation.TreeMeanDistributionSampler.mean_dist_fun(original_mean: float) PseudoTorchDistributionNormal[source]

A function that returns a distribution for the new mean.

This is a fallback function that is used if torch is not installed. Within TreeMeanDistributionSampler, this function is used to sample a distribution for the new mean. The distribution is then used to sample a new mean for the population that is to be changed.

So the calls are:

mean_distribution = mean_dist_fun(new_mean)
new_value_from_distribution = mean_distribution.sample()

Parameters:

original_mean (float) – The mean of the normal distribution

Returns:

Mimics the torch.distributions.Distribution class in the sense that it has a sample() method that returns a new value from the distribution.

Return type:

PseudoTorchDistributionNormal

nbnode.simulation.TreeMeanRelative module

class nbnode.simulation.TreeMeanRelative.TreeMeanRelative(flowsim_tree: str | FlowSimulationTreeDirichlet, change_pop_mean_proportional: Dict[str, float], n_samples=100, n_cells=10000, use_only_diagonal_covmat=False, verbose=False, seed_sample_0=129873, save_dir='sim/sim00_pure_estimate', only_return_sampled_cell_numbers=False, save_changed_parameters=True)[source]

Bases: object

Sample from a tree with a relative change in a population.

sample() Tuple[DataFrame, Dict[str, Any], List[DataFrame]][source]

A method to sample with a relative change in a population mean.

See the __init__ method for the description of the arguments.

Returns:

  • A dataframe with the sampled cell numbers.

  • A dictionary with the parameters of the dirichlet distribution.

  • A list of dataframes with the sampled cell matrices (n_cells X features) for each sample.

Return type:

Tuple[pd.DataFrame, Dict[str, Any], List[pd.DataFrame]]

sample_customize(n_samples=None, n_cells=None, change_pop_mean_proportional=None, use_only_diagonal_covmat=None, verbose=None, seed_sample_0=None, save_dir=None, _only_return_sampled_cell_numbers=None, save_changed_parameters=False) Tuple[DataFrame, Dict[str, Any], List[DataFrame]][source]

A customizable method to sample with a relative change in a population mean.

See the __init__ method for the description of the arguments. In contrast to .sample(), this method allows each parameter to be changed individually. Any argument that is not given falls back to the value set in the __init__ method.

Returns:

  • A dataframe with the sampled cell numbers.

  • A dictionary with the parameters of the dirichlet distribution.

  • A list of dataframes with the sampled cell matrices (n_cells X features) for each sample.

Return type:

Tuple[pd.DataFrame, Dict[str, Any], List[pd.DataFrame]]

nbnode.simulation.save_sample module

nbnode.simulation.save_sample.save_sample(df, save_dir, sample_name, save_type, verbose)[source]

nbnode.simulation.sim_proportional module

nbnode.simulation.sim_proportional.sim_proportional(flowsim: FlowSimulationTreeDirichlet, n_samples=100, n_cells=25000, use_only_diagonal_covmat=True, change_pop_mean_proportional={'/AllCells/CD4+/CD8-/Tem': 1}, save_dir='sim/intraassay/sim00_baseline', save_type: str = 'csv', seed_sample_0=129873, verbose=False, only_return_sampled_cell_numbers=False) Tuple[DataFrame, Dict[str, Any], List[DataFrame]][source]

This function simulates new cells (n_cells) for n_samples samples according to the given flow simulation flowsim.

  1. The population means of the keys in change_pop_mean_proportional are multiplied by their respective values and set via flowsim.new_pop_mean(old_mean * change_prop).

  2. n_samples samples with n_cells cells each are drawn from the changed FlowSimulation.

  3. (optional) The generated samples are saved to save_dir.

  4. The actual number of cells and the changed parameters are returned.

Parameters:
  • n_samples (int, optional) –

    The number of simulated samples of n_cells.

    Defaults to 100.

  • n_cells (int, optional) –

    The number of cells per sample.

    Defaults to 25000.

  • use_only_diagonal_covmat (bool, optional) –

    If False, the complete covariance matrix per cell population is used to draw new cells. If True, all off-diagonal elements of the covariance matrix are set to 0.

    Defaults to True.

  • change_pop_mean_proportional (dict, optional) –

    A dictionary of which cell population(s) should be changed by which fraction. A value of 1 does not change the mean proportion of the cell population. The changes to flowsim are not persistent as they are undone after the simulation.

    Defaults to {"/AllCells/CD4+/CD8-/Tem": 1}.

  • save_dir (str, optional) –

    If given, the created samples (n cells X p markers) are saved into that directory as f"sample_{sample_i}.csv".

    Defaults to "sim/intraassay/sim00_baseline".

  • seed_sample_0 (int) – flowsim.set_seed(seed_sample_0 + sample_i)

  • verbose (bool, optional) – Verboseness. Defaults to False.

  • only_return_sampled_cell_numbers (bool) – If True, only the number of cells per population is returned, not the actual samples.

Returns:

  • pd.DataFrame:

    The true number of generated cells per leaf-population.

  • Dict:

    A deep copy of flowsim.population_parameters.

Return type:

Tuple
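
The two pieces of bookkeeping described for this function, the proportional change of a population mean and the per-sample seed seed_sample_0 + sample_i, can be sketched in isolation as follows. The helper names, the example means and the second population name are hypothetical; the sketch does not call the library.

```python
from typing import Dict, List


def proportional_targets(
    means: Dict[str, float], change_pop_mean_proportional: Dict[str, float]
) -> Dict[str, float]:
    """New mean per changed population: old_mean * change_prop."""
    return {
        name: means[name] * change_prop
        for name, change_prop in change_pop_mean_proportional.items()
    }


def sample_seeds(seed_sample_0: int, n_samples: int) -> List[int]:
    """One seed per sample, so every sample is reproducible on its own."""
    return [seed_sample_0 + sample_i for sample_i in range(n_samples)]


# Hypothetical current means of two leaf populations.
means = {"/AllCells/CD4+/CD8-/Tem": 0.05, "/AllCells/other": 0.95}
targets = proportional_targets(means, {"/AllCells/CD4+/CD8-/Tem": 2.0})
seeds = sample_seeds(129873, 3)
```

A factor of 1 leaves the mean unchanged, which corresponds to the baseline setting of the default argument.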

nbnode.simulation.sim_target module

nbnode.simulation.sim_target.sim_target(flowsim: FlowSimulationTreeDirichlet, change_pop_mean_target: List[Dict[str, float]] = [{'/AllCells/CD4+/CD8-/Tem': 0.05}], n_cells=25000, use_only_diagonal_covmat=True, save_dir='sim/intraassay/sim00_target', save_type='csv', sample_name=None, seed_sample_0=129873, verbose=False, only_return_sampled_cell_numbers=False, save_changed_parameters=False) Tuple[DataFrame, Dict[str, Any], List[DataFrame]][source]

This function simulates new cells (n_cells) for n_samples samples according to the given flow simulation flowsim.

  1. flowsim.reset_populations() (for consistency)

  2. For every list element in change_pop_mean_target, change all contained populations (keys) to their respective mean proportions (values). Values outside (0, 1) are not allowed.

  3. For every sample: flowsim.simulate_cells(n_cells).

  4. Return the true number of generated cells per leaf-population, a deep copy of flowsim.population_parameters and the generated samples. If save_dir is given, the samples are also saved.

Parameters:
  • n_samples (int, optional) –

    The number of simulated samples of n_cells.

    Defaults to 100.

  • n_cells (int, optional) –

    The number of cells per sample.

    Defaults to 25000.

  • use_only_diagonal_covmat (bool, optional) –

    If False, the complete covariance matrix per cell population is used to draw new cells. If True, all off-diagonal elements of the covariance matrix are set to 0.

    Defaults to True.

  • change_pop_mean_target (List[Dict[str, float]]) – A list of dictionaries, each mapping cell population(s) to the mean proportion of all cells they should be changed to. Values must lie in (0, 1).

  • save_dir (str, optional) –

    If given, the created samples (n cells X p markers) are saved into that directory as f"sample_{sample_i}.csv".

    Defaults to "sim/intraassay/sim00_target".

  • seed_sample_0 (int) – flowsim.set_seed(seed_sample_0 + sample_i)

  • verbose (bool, optional) – Verboseness. Defaults to False.

  • only_return_sampled_cell_numbers (bool) – If True, only the number of cells per population is returned, not the actual samples.

Returns:

  • pd.DataFrame:

    The true number of generated cells per leaf-population.

  • Dict:

    A deep copy of flowsim.population_parameters.

  • List[pd.DataFrame]:

    The generated samples (n cells X p markers), potentially also saved into the given save_dir as f"sample_{sample_i}.csv".

Return type:

Tuple

Usage:

simulated_cell_populations, changed_parameters, simulated_samples = sim_target(
    flowsim
)
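
Because target values outside (0, 1) are not allowed, it can be useful to check a change_pop_mean_target argument before starting a long simulation. The helper below is a hypothetical pre-check written for this example; it is not part of nbnode, and how the library itself reports invalid values is not documented here.

```python
from typing import Dict, List


def validate_targets(change_pop_mean_target: List[Dict[str, float]]) -> None:
    """Raise ValueError if any target mean proportion lies outside (0, 1)."""
    for element_i, targets in enumerate(change_pop_mean_target):
        for population, proportion in targets.items():
            if not 0.0 < proportion < 1.0:
                raise ValueError(
                    f"element {element_i}: target for {population!r} "
                    f"must be in (0, 1), got {proportion}"
                )


# The default argument from the signature passes the check.
validate_targets([{"/AllCells/CD4+/CD8-/Tem": 0.05}])
```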

Module contents