nbnode.simulation package
Submodules
nbnode.simulation.FlowSimulationTree module
- class nbnode.simulation.FlowSimulationTree.BaseFlowSimulationTree(rootnode: NBNode, data_cellgroup_col: str = 'sample_name', node_percentages: DataFrame | None = None, seed: int = 12987, include_features: List[str] = 'dataset_melanoma', verbose: bool = False)
Bases: object
Base class for flow simulation.
- estimate_cell_distributions(nodes: List[NBNode]) -> Dict[str, Dict[Literal['mu', 'cov'], DataFrame]]
Estimate the distribution of cells in each node.
If no distribution can be estimated (fewer than 2 cells), the node is removed from the simulation.
- Parameters:
nodes (List[NBNode]) – A list of nodes whose distribution should be estimated.
- Returns:
A dictionary of the form:
{
    "node_name": {
        "mu": pd.DataFrame,   # mean of the distribution
        "cov": pd.DataFrame,  # covariance of the distribution
    }
}
The mean and covariance matrix are calculated for the features given in self.include_features.
- Return type:
Dict[str, Dict[Literal["mu", "cov"], pd.DataFrame]]
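The per-node estimation described above can be illustrated with a minimal sketch. It assumes the cells of each node are available as a plain pandas DataFrame; the function name `estimate_distributions` and the toy node names are hypothetical and not part of nbnode.

```python
from typing import Dict, List

import numpy as np
import pandas as pd


def estimate_distributions(
    cells_per_node: Dict[str, pd.DataFrame],
    include_features: List[str],
) -> Dict[str, Dict[str, pd.DataFrame]]:
    """Estimate mean and covariance per node; skip nodes with fewer than 2 cells."""
    distributions = {}
    for node_name, cells in cells_per_node.items():
        if len(cells) < 2:
            # A covariance cannot be estimated from fewer than two cells,
            # so the node is left out (hypothetical stand-in for "removed").
            continue
        features = cells[include_features]
        distributions[node_name] = {
            "mu": features.mean().to_frame(name="mu"),
            "cov": features.cov(),
        }
    return distributions


rng = np.random.default_rng(0)
toy_cells = {
    "/AllCells/A": pd.DataFrame(rng.normal(size=(50, 2)), columns=["CD4", "CD8"]),
    "/AllCells/B": pd.DataFrame(rng.normal(size=(1, 2)), columns=["CD4", "CD8"]),
}
toy_distributions = estimate_distributions(toy_cells, ["CD4", "CD8"])
```

Here the node with a single cell is dropped, mirroring the documented behaviour for nodes whose distribution cannot be estimated.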
- abstract static estimate_population_distribution(node_percentages) -> Dict[Literal['__name'] | str, Any]
Estimate the distribution of populations.
The distribution is estimated from the given node percentages. The distribution parameters are usually pd.DataFrames.
- Parameters:
node_percentages (pd.DataFrame) – A DataFrame with the samples as columns and the populations as rows. The values are the percentage of cells in the population.
- Returns:
Dict[str, Any]:
{
    # __name must be given
    "__name": list(population_means.index),
    "mean": population_means,  # distribution parameter
    "cov": population_cov,     # distribution parameter
}
Example:
# Calculate mean and covariance for each of the populations
# (rows of node_percentages)
population_means = node_percentages.mean(axis=1)
if node_percentages.shape[1] > 1:
    population_cov = node_percentages.T.cov()
else:
    population_cov = np.identity(len(population_means))
    population_cov = pd.DataFrame(
        population_cov,
        columns=population_means.index,
        index=population_means.index,
    )
return {
    "__name": list(population_means.index),
    "mean": population_means,
    "cov": population_cov,
}
- abstract generate_populations(population_parameters: Dict[str, Any], n_cells: int, *args, **kwargs) -> List[float]
Generate a list of percentages per cell population.
- Parameters:
- Returns:
List of percentages per cell population
- Return type:
List[int]
Example:
# 1. Generate random percentages for each population
random_mean = self._rng.multivariate_normal(
    **population_parameters
)
# HACK: shift the values so that none is negative; this is kind of a hack
# and should be replaced with a better distribution
random_mean -= min(random_mean)
# add a small value such that the "most negative population" has
# at least _some_ chance of occurring
random_mean += sorted(set(random_mean))[1] / 1e3
# normalize to 1
random_mean = random_mean / sum(random_mean)

# 2. "Sample" the number of cells according to the random percentages
onesample_ncells_perpop = random_mean * n_cells
onesample_ncells_perpop = np.floor(onesample_ncells_perpop)
if sum(onesample_ncells_perpop) < n_cells:
    # because of floor, too few cells were sampled
    remaining_cells = self._rng.choice(
        [population_i for population_i in range(len(random_mean))],
        size=int(n_cells - sum(onesample_ncells_perpop)),
        replace=True,
        p=random_mean,
    )
    for cell_from_pop_i in remaining_cells:
        onesample_ncells_perpop[cell_from_pop_i] += 1
return onesample_ncells_perpop
- ncells_from_percentages(percentages: DataFrame, n_cells: int) -> List[int]
"Sample" the number of cells according to the random percentages.
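One way to turn percentages into integer cell counts is to floor the expected counts and then draw the missing cells according to the same percentages. The sketch below is a standalone illustration under that assumption; `counts_from_percentages` is a hypothetical helper, not the nbnode implementation.

```python
import numpy as np


def counts_from_percentages(percentages, n_cells, rng):
    """Turn proportions that sum to 1 into integer counts that sum to n_cells."""
    percentages = np.asarray(percentages, dtype=float)
    counts = np.floor(percentages * n_cells).astype(int)
    n_missing = int(n_cells - counts.sum())
    if n_missing > 0:
        # Flooring drops some cells; draw the remainder according to the
        # same proportions so the total matches n_cells exactly.
        extra = rng.choice(
            len(percentages), size=n_missing, replace=True, p=percentages
        )
        counts += np.bincount(extra, minlength=len(percentages))
    return counts


rng = np.random.default_rng(12987)
example_counts = counts_from_percentages([0.5, 0.3, 0.2], n_cells=1001, rng=rng)
```

The floored counts are `[500, 300, 200]`, so exactly one cell is assigned at random in this example.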
- abstract remove_population(population_name: str)
Remove a certain population from the simulation. Necessary if any population had no cells and therefore the cell parameters for the population cannot be estimated. ALWAYS call self.population_parameters["__name"].remove(population_name).
- Parameters:
population_name (str) – The name of the population which should be removed from the population_parameters.
Example:
self.population_parameters["__name"].remove(population_name)
self.population_parameters["mean"].drop(
    population_name, inplace=True
)
self.population_parameters["cov"].drop(
    population_name, inplace=True, axis=0
)
self.population_parameters["cov"].drop(
    population_name, inplace=True, axis=1
)
- sample(n_cells: int = 10000, return_sampled_cell_numbers: bool = False, use_only_diagonal_covmat: bool = True, **population_parameters) -> Tuple[DataFrame, Series] | DataFrame
Sample cells from the tree.
- Parameters:
n_cells (int, optional) – Number of cells to sample from the tree. Defaults to 10000.
return_sampled_cell_numbers (bool, optional) – Whether to return the number of cells sampled per population as well as the sampled cells themselves. Defaults to False.
use_only_diagonal_covmat (bool, optional) – Whether to use only the diagonal of the covariance matrix when sampling cells. Defaults to True.
- Returns:
If return_sampled_cell_numbers is True, a tuple with the sampled cells and the number of cells sampled per population is returned. Otherwise, only the sampled cells are returned.
- Return type:
Union[Tuple[pd.DataFrame, pd.Series], pd.DataFrame]
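The effect of use_only_diagonal_covmat can be illustrated outside of nbnode. The sketch below assumes a single population with two features and made-up parameters; it only shows what dropping the off-diagonal covariance entries does to the sampled cells.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up mean and covariance for one population with two features.
mu = np.array([1.0, -2.0])
cov = np.array([[1.0, 0.8],
                [0.8, 2.0]])

# Keep only the variances; all off-diagonal covariances are set to 0.
cov_diagonal = np.diag(np.diag(cov))

cells_full = rng.multivariate_normal(mu, cov, size=5000)
cells_diag = rng.multivariate_normal(mu, cov_diagonal, size=5000)

# Correlation between the two features in each set of sampled cells.
corr_full = np.corrcoef(cells_full, rowvar=False)[0, 1]
corr_diag = np.corrcoef(cells_diag, rowvar=False)[0, 1]
```

With the full matrix the features stay correlated; with the diagonal matrix they are sampled independently, so their correlation is close to zero.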
- sample_populations(n_cells: int = 10000, **population_parameters) -> Series
Generate number of cells according to leaf node population distributions.
- Parameters:
n_cells (int, optional) – Number of cells to sample. Defaults to 10000.
- Returns:
A pandas Series with the number of cells per population.
- Return type:
pd.Series
- class nbnode.simulation.FlowSimulationTree.FlowSimulationTreeDirichlet(rootnode: NBNode, data_cellgroup_col: str = 'sample_name', node_percentages: DataFrame | None = None, seed: int = 12987, include_features='dataset_melanoma', verbose: bool = False)
Bases: BaseFlowSimulationTree
Simulate a tree of cell populations using the Dirichlet distribution.
- property alpha_all: Series
The alpha parameter of the Dirichlet distribution for all populations.
Concentration parameters "alpha" of the Dirichlet distribution. The alpha parameter is a vector of positive values, where each value corresponds to a population. The larger a value relative to the others, the larger the expected share of cells generated for that population.
alpha_all are the concentration parameters for all populations, including the intermediate populations.
- Returns:
A series (named) of alpha parameters per cell population (including intermediate populations).
- Return type:
pd.Series
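The relationship between concentration parameters and expected proportions can be checked numerically. The following sketch uses made-up population names and alpha values; treating the alpha of an intermediate population as the sum of its leaf alphas follows from the aggregation property of the Dirichlet distribution, but whether alpha_all is computed this way is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up concentration parameters for three leaf populations.
alpha = pd.Series(
    {"/AllCells/A/x": 2.0, "/AllCells/A/y": 6.0, "/AllCells/B": 12.0}
)

# The mean of a Dirichlet distribution is alpha / sum(alpha).
expected_mean = alpha / alpha.sum()

rng = np.random.default_rng(12987)
proportions = rng.dirichlet(alpha.to_numpy(), size=20000)
empirical_mean = pd.Series(proportions.mean(axis=0), index=alpha.index)

# Aggregation property: merging leaf populations sums their alphas.
alpha_intermediate_A = alpha[["/AllCells/A/x", "/AllCells/A/y"]].sum()
```

The empirical means of the sampled proportions approach `alpha / alpha.sum()`, which is why a larger alpha yields a larger expected share of cells.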
- static estimate_population_distribution(node_percentages)
Estimate the population distribution using the Dirichlet distribution.
- generate_populations(population_parameters, n_cells: int, *args, **kwargs) -> DataFrame
Generate a population of cells using the Dirichlet distribution.
- Parameters:
population_parameters (dict) –
Given as a dictionary with keys:
alpha: The alpha parameter of the Dirichlet distribution
__name: The name of the population
n_cells (int) – The number of cells to generate
- Returns:
A dataframe with the generated cells per population
- Return type:
pd.DataFrame
- property mean_leafs: Series
Mean of the Dirichlet distribution.
Reference: Thomas P. Minka, "Estimating a Dirichlet distribution", 2000 (revised 2003, 2009, 2012).
- Returns:
A series (named) of means per cell population
- Return type:
pd.Series
- new_pop_mean(population_node_full_name: str, percentage: float)
Set the new mean of the Dirichlet distribution for a given population.
- pop_alpha(population_node_full_name: str) -> float
Get the alpha parameter of the Dirichlet distribution for a given population.
- pop_leafnode_names(population_node_full_name: str | NBNode) -> List[str]
Get the names of the leaf nodes of any intermediate population.
- pop_mean(population_node_full_name: str)
Get the mean of the Dirichlet distribution for a given population.
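To make the idea behind new_pop_mean concrete, here is a minimal sketch of one possible way to change the mean of a single population in a Dirichlet distribution. It assumes that the total concentration (the sum of all alphas) is kept and that the remaining populations are rescaled proportionally; both are assumptions for illustration, and `set_population_mean` is a hypothetical helper, not the nbnode method.

```python
import pandas as pd


def set_population_mean(
    alpha: pd.Series, population: str, new_mean: float
) -> pd.Series:
    """Return alphas whose Dirichlet mean for `population` equals `new_mean`."""
    if not 0.0 < new_mean < 1.0:
        raise ValueError("new_mean must lie in (0, 1)")
    precision = alpha.sum()
    old_mean = alpha[population] / precision
    # Rescale all other populations so that the means still sum to 1.
    new_means = (alpha / precision) * (1.0 - new_mean) / (1.0 - old_mean)
    new_means[population] = new_mean
    # Keep the total concentration unchanged (assumption).
    return new_means * precision


alpha = pd.Series({"A": 2.0, "B": 6.0, "C": 12.0})
new_alpha = set_population_mean(alpha, "B", 0.5)
```

In this example the mean of `B` moves from 0.3 to 0.5 while the ratio between `A` and `C` stays the same.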
nbnode.simulation.TreeMeanDistributionSampler module
- class nbnode.simulation.TreeMeanDistributionSampler.PseudoTorchDistributionNormal(loc: float, scale: float)
Bases: object
A class that mimics the torch.distributions.Distribution class.
This class is used as a fallback if torch is not installed. It is used within TreeMeanDistributionSampler to sample a new mean for a population that is to be changed.
So the calls are:
mean_distribution = PseudoTorchDistributionNormal(loc=new_mean, scale=1)
new_value_from_distribution = mean_distribution.sample()
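The fallback pattern can be sketched with the standard library alone. The class `PseudoNormal` and the function `example_mean_dist_fun` below are hypothetical stand-ins that only mimic the documented interface (an object with a `sample()` method); the `seed` argument is an addition for reproducibility and is not part of the documented signature.

```python
import random


class PseudoNormal:
    """Hypothetical stand-in exposing a torch-like sample() method."""

    def __init__(self, loc: float, scale: float, seed=None):
        self.loc = loc
        self.scale = scale
        self._rng = random.Random(seed)

    def sample(self) -> float:
        # Draw one value from a normal distribution N(loc, scale^2).
        return self._rng.gauss(self.loc, self.scale)


def example_mean_dist_fun(original_mean: float) -> PseudoNormal:
    """Return a distribution centred on the original mean."""
    return PseudoNormal(loc=original_mean, scale=1.0, seed=0)


mean_distribution = example_mean_dist_fun(0.05)
new_value_from_distribution = mean_distribution.sample()
```

Any object with a compatible `sample()` method could be passed in the same way, which is what makes the torch dependency optional in this pattern.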
- class nbnode.simulation.TreeMeanDistributionSampler.TreeMeanDistributionSampler(flowsim_tree: str | FlowSimulationTreeDirichlet, population_name_to_change: str, mean_distribution=<function mean_dist_fun>, n_samples=100, n_cells=10000, use_only_diagonal_covmat=False, verbose=False, seed_sample_0=129873, save_dir='sim/sim00_m0.sd1', save_type: str = 'csv', only_return_sampled_cell_numbers=False, save_changed_parameters=False, minimum_target_mean_proportion=1e-09)
Bases: object
A class synthesizing cytometry samples with a distribution for the mean of a population.
- sample()
Synthesize cytometry samples with a distribution for the mean of a population.
See the __init__ method for the description of the arguments.
- Returns:
A dataframe with the sampled cell numbers.
A dictionary with the parameters of the Dirichlet distribution.
A list of dataframes with the sampled cell matrices (n_cells X features) for each sample.
- Return type:
Tuple[pd.DataFrame, Dict[str, Any], List[pd.DataFrame]]
- nbnode.simulation.TreeMeanDistributionSampler.mean_dist_fun(original_mean: float) -> PseudoTorchDistributionNormal
A function that returns a distribution for the new mean.
This is a fallback function that is used if torch is not installed. Within TreeMeanDistributionSampler, this function is used to sample a distribution for the new mean. The distribution is then used to sample a new mean for the population that is to be changed.
So the calls are:
mean_distribution = mean_dist_fun(new_mean)
new_value_from_distribution = mean_distribution.sample()
- Parameters:
original_mean (float) – The mean of the normal distribution
- Returns:
Mimics the torch.distributions.Distribution class in the sense that it has a sample() method that returns a new value from the distribution.
- Return type:
Pseudo-D.Distribution
nbnode.simulation.TreeMeanRelative module
- class nbnode.simulation.TreeMeanRelative.TreeMeanRelative(flowsim_tree: str | FlowSimulationTreeDirichlet, change_pop_mean_proportional: Dict[str, float], n_samples=100, n_cells=10000, use_only_diagonal_covmat=False, verbose=False, seed_sample_0=129873, save_dir='sim/sim00_pure_estimate', only_return_sampled_cell_numbers=False, save_changed_parameters=True)
Bases: object
Sample from a tree with a relative change in a population.
- sample() -> Tuple[DataFrame, Dict[str, Any], List[DataFrame]]
A method to sample with a relative change in a population mean.
See the __init__ method for the description of the arguments.
- Returns:
A dataframe with the sampled cell numbers.
A dictionary with the parameters of the Dirichlet distribution.
A list of dataframes with the sampled cell matrices (n_cells X features) for each sample.
- Return type:
Tuple[pd.DataFrame, Dict[str, Any], List[pd.DataFrame]]
- sample_customize(n_samples=None, n_cells=None, change_pop_mean_proportional=None, use_only_diagonal_covmat=None, verbose=None, seed_sample_0=None, save_dir=None, _only_return_sampled_cell_numbers=None, save_changed_parameters=False) -> Tuple[DataFrame, Dict[str, Any], List[DataFrame]]
A customizable method to sample with a relative change in a population mean.
See the __init__ method for the description of the arguments. In contrast to .sample(), this method allows each parameter to be changed individually. If an argument is not given, the default value set in the __init__ method is used.
- Returns:
A dataframe with the sampled cell numbers.
A dictionary with the parameters of the Dirichlet distribution.
A list of dataframes with the sampled cell matrices (n_cells X features) for each sample.
- Return type:
Tuple[pd.DataFrame, Dict[str, Any], List[pd.DataFrame]]
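The "fall back to the value from __init__ when an argument is None" behaviour described for sample_customize is a common pattern. The sketch below shows it on a reduced, hypothetical class (`RelativeSamplerSketch`) with only three of the parameters; it does not sample anything and is not the nbnode implementation.

```python
class RelativeSamplerSketch:
    """Hypothetical class showing per-call overrides of __init__ defaults."""

    def __init__(self, n_samples=100, n_cells=10000, verbose=False):
        self.n_samples = n_samples
        self.n_cells = n_cells
        self.verbose = verbose

    def resolve(self, n_samples=None, n_cells=None, verbose=None):
        # Use the per-call value if given, otherwise the stored default.
        return {
            "n_samples": self.n_samples if n_samples is None else n_samples,
            "n_cells": self.n_cells if n_cells is None else n_cells,
            "verbose": self.verbose if verbose is None else verbose,
        }


sampler = RelativeSamplerSketch()
settings = sampler.resolve(n_cells=500)
```

Comparing against `None` rather than relying on truthiness keeps explicit values such as `0` or `False` from being replaced by the defaults.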
nbnode.simulation.save_sample module
nbnode.simulation.sim_proportional module
- nbnode.simulation.sim_proportional.sim_proportional(flowsim: FlowSimulationTreeDirichlet, n_samples=100, n_cells=25000, use_only_diagonal_covmat=True, change_pop_mean_proportional={'/AllCells/CD4+/CD8-/Tem': 1}, save_dir='sim/intraassay/sim00_baseline', save_type: str = 'csv', seed_sample_0=129873, verbose=False, only_return_sampled_cell_numbers=False) -> Tuple[DataFrame, Dict[str, Any], List[DataFrame]]
This function simulates new cells (n_cells) for n_samples samples according to the given flow simulation flowsim.
1. The population means of the keys of change_pop_mean_proportional are multiplied by their respective values and set via flowsim.new_pop_mean(old_mean * change_prop).
2. n_samples samples with n_cells cells each are sampled from the changed FlowSimulation.
3. (optional) The generated samples are saved to save_dir.
4. The actual number of cells and the changed parameters are returned.
- Parameters:
n_samples (int, optional) –
The number of simulated samples of n_cells.
Defaults to 100.
n_cells (int, optional) –
The number of cells per sample.
Defaults to 25000.
use_only_diagonal_covmat (bool, optional) –
If False, the complete covariance matrix per cell population is used to draw new cells. If True, all off-diagonal elements of the covariance matrix are set to 0.
Defaults to True.
change_pop_mean_proportional (dict, optional) –
A dictionary of which cell population(s) should be changed by which fraction. A value of 1 does not change the mean proportion of the cell population. The changes to flowsim are not persistent as they are undone after the simulation.
Defaults to {"/AllCells/CD4+/CD8-/Tem": 1}.
save_dir (str, optional) –
If given, the created samples (n cells X p markers) are saved into that directory as f"sample_{sample_i}.csv".
Defaults to "sim/intraassay/sim00_baseline".
seed_sample_0 (int) – flowsim.set_seed(seed_sample_0 + sample_i)
verbose (bool, optional) – Verboseness. Defaults to False.
only_return_sampled_cell_numbers (bool) – If True, only the number of cells per population is returned, not the actual samples.
- Returns:
- pd.DataFrame:
Returns the true number of generated cells per leaf-population.
- Dict:
A deep copy of flowsim.population_parameters
- Return type:
Tuple
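The per-sample seeding (seed_sample_0 + sample_i) can be sketched independently of nbnode. The example below assumes that cell counts per population are drawn by first sampling proportions from a Dirichlet distribution and then drawing counts from a multinomial distribution; this is a simplification for illustration, and `simulate_counts` and the alpha values are hypothetical.

```python
import numpy as np
import pandas as pd


def simulate_counts(
    alpha: pd.Series, n_samples: int, n_cells: int, seed_sample_0: int
) -> pd.DataFrame:
    """Draw cell counts per population for several reproducible samples."""
    rows = []
    for sample_i in range(n_samples):
        # Each sample gets its own seed, so single samples can be reproduced.
        rng = np.random.default_rng(seed_sample_0 + sample_i)
        proportions = rng.dirichlet(alpha.to_numpy())
        rows.append(rng.multinomial(n_cells, proportions))
    return pd.DataFrame(
        rows,
        columns=alpha.index,
        index=[f"sample_{sample_i}" for sample_i in range(n_samples)],
    )


alpha = pd.Series({"/AllCells/A": 5.0, "/AllCells/B": 15.0, "/AllCells/C": 30.0})
counts = simulate_counts(alpha, n_samples=4, n_cells=25000, seed_sample_0=129873)
counts_again = simulate_counts(alpha, n_samples=4, n_cells=25000, seed_sample_0=129873)
```

Because the seed depends only on seed_sample_0 and the sample index, rerunning the simulation yields identical counts for every sample.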
nbnode.simulation.sim_target module
- nbnode.simulation.sim_target.sim_target(flowsim: FlowSimulationTreeDirichlet, change_pop_mean_target: List[Dict[str, float]] = [{'/AllCells/CD4+/CD8-/Tem': 0.05}], n_cells=25000, use_only_diagonal_covmat=True, save_dir='sim/intraassay/sim00_target', save_type='csv', sample_name=None, seed_sample_0=129873, verbose=False, only_return_sampled_cell_numbers=False, save_changed_parameters=False) -> Tuple[DataFrame, Dict[str, Any], List[DataFrame]]
This function simulates new cells (n_cells) for n_samples samples according to the given flow simulation flowsim.
1. Call flowsim.reset_populations() (for consistency).
2. For every list element in change_pop_mean_target, change all contained populations (keys) to their respective mean proportions (values). Values outside (0, 1) are not allowed.
3. For every sample, call flowsim.simulate_cells(n_cells).
4. Return the true number of generated cells per leaf-population, a deep copy of flowsim.population_parameters and the generated samples. If save_dir is given, the samples are also saved.
- Parameters:
n_samples (int, optional) –
The number of simulated samples of n_cells.
Defaults to 100.
n_cells (int, optional) –
The number of cells per sample.
Defaults to 25000.
use_only_diagonal_covmat (bool, optional) –
If False, the complete covariance matrix per cell population is used to draw new cells. If True, all off-diagonal elements of the covariance matrix are set to 0.
Defaults to True.
change_pop_mean_target (List[Dict[str, float]]) – A list of dictionaries, each mapping cell population(s) to the mean proportion of all cells they should be changed to. The values must lie in (0, 1).
save_dir (str, optional) –
If given, the created samples (n cells X p markers) are saved into that directory as f"sample_{sample_i}.csv".
Defaults to "sim/intraassay/sim00_target".
seed_sample_0 (int) – flowsim.set_seed(seed_sample_0 + sample_i)
verbose (bool, optional) – Verboseness. Defaults to False.
only_return_sampled_cell_numbers (bool) – If True, only the number of cells per population is returned, not the actual samples.
- Returns:
- pd.DataFrame:
The true number of generated cells per leaf-population.
- Dict:
A deep copy of flowsim.population_parameters.
- List[pd.DataFrame]:
The generated samples (n cells X p markers), potentially also saved into the given save_dir as f"sample_{sample_i}.csv".
- Return type:
Tuple
Usage:
simulated_cell_populations, changed_parameters, simulated_samples = sim_target(
    flowsim
)
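Since target values outside (0, 1) are not allowed, it can help to check the argument before starting a long simulation. The following sketch shows such a check; `validate_targets` is a hypothetical helper and not part of nbnode.

```python
from typing import Dict, List


def validate_targets(change_pop_mean_target: List[Dict[str, float]]) -> None:
    """Raise ValueError if any target mean proportion is outside (0, 1)."""
    for changes in change_pop_mean_target:
        for population, target in changes.items():
            if not 0.0 < target < 1.0:
                raise ValueError(
                    f"Target mean for {population!r} must lie in (0, 1), "
                    f"got {target}"
                )


# Passes silently: 0.05 is a valid mean proportion.
validate_targets([{"/AllCells/CD4+/CD8-/Tem": 0.05}])
```

A call with an invalid value, for example `validate_targets([{"/AllCells/CD4+/CD8-/Tem": 1.5}])`, raises a `ValueError` before any cells are simulated.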