Counting and math

Apart from introducing non-binary trees, the power of NBNode comes from its built-in counting and math mechanisms. Each NBNode has a math_node_attribute, which names the attribute that math operations act on. It is usually set to counter.

In this example, we will use a small test dataset that ships with the package. It comes from a flow cytometry experiment and contains 13 features (columns) for 999 cells (rows). Each cell can further be classified into a cell type. We defined the cell types from prior biological knowledge as a tree, which is provided in nbtree.

Counting

We start by introducing how to count.

[1]:
import os
import re
import pandas as pd

print(os.getcwd())
cellmat = pd.read_csv(
    os.path.join(
        os.pardir,
        os.pardir,
        "tests",
        "testdata",
        "flowcytometry",
        "gated_cells",
        "cellmat.csv",
    )
)
# FS TOF (against FS INT which is "FS")
cellmat.rename(columns={"FS_TOF": "FS.0"}, inplace=True)
cellmat.columns = [re.sub("_.*", "", x) for x in cellmat.columns]
print(cellmat)

/home/gugl/clonedgit/ccc_verse/nbnode/docs/notebooks
         FS  FS.0      SS  CD45RA  CCR7  CD28   PD1  CD27   CD4   CD8   CD3
0    197657    94  186372    3.90  6.34  4.97 -1.98  7.51  5.87  3.55  5.83  \
1    180716    92  135447    6.48  6.63  5.17  3.07  7.38  5.49  2.64  5.83
2    134129    90  168268    5.92  6.53  5.39  2.60  7.57  5.70  2.54  5.74
3    239241    94   79262    5.47  6.57  4.68  3.30  7.36  5.75  2.76  6.06
4    246527    89   97635    6.12  6.26  5.22  3.05  7.40  5.70  2.66  6.29
..      ...   ...     ...     ...   ...   ...   ...   ...   ...   ...   ...
994  176236    90  149982    6.48 -1.11  2.85 -1.55  2.28  0.59  1.70  0.39
995  191863    99  115406    6.30  5.19  3.01  2.07 -1.58  0.62  1.02  0.73
996  217752    93  124675    6.35  4.75  0.42  1.89  2.02  0.52  1.48  0.53
997  334174    97  210458    1.90  1.36  1.22  2.52 -0.72  0.59  1.03  0.75
998  308089   103  219747    6.48 -0.42  1.23  2.64  7.07  0.57  1.82  1.72

     CD57  CD45
0    2.62  6.78
1    2.39  6.76
2    1.02  6.46
3    1.14  6.59
4    2.22  6.33
..    ...   ...
994  4.22  6.49
995  2.69  6.22
996  2.92  6.50
997  2.98  5.38
998  2.87  6.03

[999 rows x 13 columns]
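
The two renaming steps in the cell above can be tried in isolation. The following is a minimal sketch; apart from "FS_TOF", the raw column names are hypothetical, because the notebook does not show the original CSV header. It illustrates why FS_TOF is renamed first: the regular expression strips everything from the first underscore onwards, so without that step two columns would both end up as "FS".

```python
import re

# Hypothetical raw column names; only "FS_TOF" is confirmed by the notebook.
raw_columns = ["FS_INT", "FS_TOF", "CD4_PE", "CD8_APC"]

# Step 1: rename FS_TOF explicitly so it does not collapse onto "FS".
renamed = ["FS.0" if name == "FS_TOF" else name for name in raw_columns]

# Step 2: drop the first underscore and everything after it.
cleaned = [re.sub("_.*", "", name) for name in renamed]
print(cleaned)  # ['FS', 'FS.0', 'CD4', 'CD8']
```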
[2]:
import nbnode.nbnode_trees as nbtree
cell_tree = nbtree.tree_complete_aligned_trunk()
cell_tree.pretty_print("__long__")
AllCells (counter:0, decision_name:None, decision_value:None)
├── DN (counter:0, decision_name:['CD4', 'CD8'], decision_value:[-1, -1])
├── DP (counter:0, decision_name:['CD4', 'CD8'], decision_value:[1, 1])
├── CD4-/CD8+ (counter:0, decision_name:['CD4', 'CD8'], decision_value:[-1, 1])
│   ├── naive (counter:0, decision_name:['CCR7', 'CD45RA'], decision_value:[1, 1])
│   ├── Tcm (counter:0, decision_name:['CCR7', 'CD45RA'], decision_value:[1, -1])
│   ├── Temra (counter:0, decision_name:['CCR7', 'CD45RA'], decision_value:[-1, 1])
│   └── Tem (counter:0, decision_name:['CCR7', 'CD45RA'], decision_value:[-1, -1])
└── CD4+/CD8- (counter:0, decision_name:['CD4', 'CD8'], decision_value:[1, -1])
    ├── naive (counter:0, decision_name:['CCR7', 'CD45RA'], decision_value:[1, 1])
    ├── Tcm (counter:0, decision_name:['CCR7', 'CD45RA'], decision_value:[1, -1])
    ├── Temra (counter:0, decision_name:['CCR7', 'CD45RA'], decision_value:[-1, 1])
    └── Tem (counter:0, decision_name:['CCR7', 'CD45RA'], decision_value:[-1, -1])

Let’s predict the cell type of all cells. This returns one predicted node per cell, 999 in total!

[3]:
cell_preds = cell_tree.predict(cellmat)
print(cell_preds)
0      (((NBNode('/AllCells/DP', counter=0, decision_...
1      (((NBNode('/AllCells/DP', counter=0, decision_...
2      (((NBNode('/AllCells/DP', counter=0, decision_...
3      (((NBNode('/AllCells/DP', counter=0, decision_...
4      (((NBNode('/AllCells/DP', counter=0, decision_...
                             ...
994    (((NBNode('/AllCells/DP', counter=0, decision_...
995    (((NBNode('/AllCells/DP', counter=0, decision_...
996    (((NBNode('/AllCells/DP', counter=0, decision_...
997    (((NBNode('/AllCells/DP', counter=0, decision_...
998    (((NBNode('/AllCells/DP', counter=0, decision_...
Length: 999, dtype: object

This by itself did not change anything in the tree.

I will introduce another NBNode attribute: NBNode.ids. This is a list of numerical indices indicating which predicted nodes are “contained” in a specific node. Naturally, root.ids should contain ALL ids, while every other node holds only the ids of the predictions that end in that node or pass through it on the way to a leaf node.

Even after predicting, no ids are set, so this is still an empty list.

[4]:
print(cell_tree.ids)
print(cell_tree["/AllCells/DP"].ids)
[]
[]

To set the ids, you have to actively hand the predicted nodes to the tree. cell_tree.id_preds takes a list of predicted nodes and sorts them into the tree. The numerical index refers to the order in which the predicted nodes occurred!

[5]:
cell_tree.id_preds(cell_preds)
print(cell_tree.ids[0:10])
print(len(cell_tree.ids))

# Here we see that the cells with ids [69, 74, 443, 972, 973] are all in
# /AllCells/CD4-/CD8+ or in a node below it!
print(cell_tree["/AllCells/CD4-/CD8+"].ids[0:10])
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
999
[69, 74, 443, 972, 973]
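
To make the idea of ids more tangible, here is a minimal sketch of how such indices could be collected. ToyNode and assign_ids are hypothetical names for this illustration; this is not the NBNode implementation, only one way to obtain the behaviour described above, where each prediction's position is recorded in the predicted node and in every ancestor up to the root.

```python
class ToyNode:
    """Hypothetical stand-in for a tree node; not the NBNode class."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.ids = []


def assign_ids(predicted_nodes):
    """Append each prediction's position to every node on its root path."""
    for index, node in enumerate(predicted_nodes):
        while node is not None:
            node.ids.append(index)
            node = node.parent


root = ToyNode("AllCells")
dp = ToyNode("DP", parent=root)
cd8 = ToyNode("CD4-/CD8+", parent=root)
naive = ToyNode("naive", parent=cd8)

# Four predictions: positions 0, 1 and 3 end in DP, position 2 in naive.
assign_ids([dp, dp, naive, dp])

print(root.ids)   # [0, 1, 2, 3]
print(dp.ids)     # [0, 1, 3]
print(cd8.ids)    # [2]
print(naive.ids)  # [2]
```

The root ends up with all indices, and an intermediate node such as CD4-/CD8+ holds the indices of every prediction that ended below it.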

However, it would be interesting to know how many cells are in each node. For this, we can use cell_tree.count(cell_preds).

[6]:
cell_tree.count(cell_preds)

If NBNode.ids has already been set, we do not need to recount: we can directly use len(node.ids) for every node, which saves a lot of computation.

[7]:
cell_tree.count(cell_preds, use_ids=True)

Internally, this iterates over every predicted node and walks the tree until that node is reached. The (numerical) index of the predicted node is appended to node.ids of every node passed on the way.

[8]:
cell_tree.pretty_print()
AllCells (counter:999)
├── DN (counter:0)
├── DP (counter:973)
├── CD4-/CD8+ (counter:5)
│   ├── naive (counter:5)
│   ├── Tcm (counter:0)
│   ├── Temra (counter:0)
│   └── Tem (counter:0)
└── CD4+/CD8- (counter:21)
    ├── naive (counter:20)
    ├── Tcm (counter:0)
    ├── Temra (counter:1)
    └── Tem (counter:0)

We see that the printed counters are now filled, and that the majority of cells are in /AllCells/DP (double-positive T cells, but that does not matter for our examples).
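
Because every cell ends in a leaf node and is counted in all nodes it passed, each counter should equal the total of the leaf counters below it. The sketch below checks this for the numbers printed above. The nested (counter, children) layout and the helper names are assumptions made for this illustration and are not part of nbnode.

```python
# Counters copied from the pretty_print() output above, stored as
# (counter, children) pairs. This nested layout is only an illustration.
tree = (999, {
    "DN": (0, {}),
    "DP": (973, {}),
    "CD4-/CD8+": (5, {"naive": (5, {}), "Tcm": (0, {}),
                      "Temra": (0, {}), "Tem": (0, {})}),
    "CD4+/CD8-": (21, {"naive": (20, {}), "Tcm": (0, {}),
                       "Temra": (1, {}), "Tem": (0, {})}),
})


def leaf_total(node):
    """Sum the counters of all leaf nodes below (or at) this node."""
    counter, children = node
    if not children:
        return counter
    return sum(leaf_total(child) for child in children.values())


def is_consistent(node):
    """True if every node's counter equals the total of its leaves."""
    counter, children = node
    return counter == leaf_total(node) and all(
        is_consistent(child) for child in children.values()
    )


print(leaf_total(tree))     # 999
print(is_consistent(tree))  # True
```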

Finally, we can export the counts per node. For this, .data should be set first; see another Jupyter notebook for further explanation.

[9]:
cell_tree.data = cellmat
print("\nCounts for every sample, only leaf (=end) nodes:")
print(cell_tree.export_counts(only_leafnodes=True).transpose())
print("\n\nCounts for every sample, leaf AND intermediate nodes:")
print(cell_tree.export_counts(only_leafnodes=False).transpose())

Counts for every sample, only leaf (=end) nodes:
Sample                       0
/AllCells/DN                 0
/AllCells/DP               973
/AllCells/CD4-/CD8+/naive    5
/AllCells/CD4-/CD8+/Tcm      0
/AllCells/CD4-/CD8+/Temra    0
/AllCells/CD4-/CD8+/Tem      0
/AllCells/CD4+/CD8-/naive   20
/AllCells/CD4+/CD8-/Tcm      0
/AllCells/CD4+/CD8-/Temra    1
/AllCells/CD4+/CD8-/Tem      0


Counts for every sample, leaf AND intermediate nodes:
Sample                       0
/AllCells/DN                 0
/AllCells/DP               973
/AllCells/CD4-/CD8+/naive    5
/AllCells/CD4-/CD8+/Tcm      0
/AllCells/CD4-/CD8+/Temra    0
/AllCells/CD4-/CD8+/Tem      0
/AllCells/CD4-/CD8+          5
/AllCells/CD4+/CD8-/naive   20
/AllCells/CD4+/CD8-/Tcm      0
/AllCells/CD4+/CD8-/Temra    1
/AllCells/CD4+/CD8-/Tem      0
/AllCells/CD4+/CD8-         21
/AllCells                  999
/home/gugl/.conda_envs/nbnode_pyscaffold/lib/python3.8/site-packages/nbnode/nbnode.py:353: UserWarning: self.ids was an empty list, subset an empty dataframe. Did you call celltree.id_preds(predicted_nodes)? Can also be a node with no cells.
  warnings.warn(

Math on one NBNode

Now that a useful number is assigned to each node, we can do quite a bit of math. Each NBNode has a math_node_attribute, which names the attribute that math operations act on. It is usually set to counter.

We can then use the usual math operators to add, subtract, multiply, etc. a number to, from or with every node.

Note that the resulting counters are then no longer backed by NBNode.ids!

[10]:
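# Math operations do not happen in place: cell_tree itself stays unchanged.
# pretty_print() prints the tree and returns None, so wrapping it in print()
# adds the "None" line after each tree in the output.
added_tree = cell_tree + 100
print(added_tree.pretty_print())

print(cell_tree.pretty_print())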
AllCells (counter:1099)
├── DN (counter:100)
├── DP (counter:1073)
├── CD4-/CD8+ (counter:105)
│   ├── naive (counter:105)
│   ├── Tcm (counter:100)
│   ├── Temra (counter:100)
│   └── Tem (counter:100)
└── CD4+/CD8- (counter:121)
    ├── naive (counter:120)
    ├── Tcm (counter:100)
    ├── Temra (counter:101)
    └── Tem (counter:100)
None
AllCells (counter:999)
├── DN (counter:0)
├── DP (counter:973)
├── CD4-/CD8+ (counter:5)
│   ├── naive (counter:5)
│   ├── Tcm (counter:0)
│   ├── Temra (counter:0)
│   └── Tem (counter:0)
└── CD4+/CD8- (counter:21)
    ├── naive (counter:20)
    ├── Tcm (counter:0)
    ├── Temra (counter:1)
    └── Tem (counter:0)
None
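
The "not in place" behaviour can be illustrated with ordinary Python operator overloading. The following is a minimal sketch using a hypothetical ToyNode class (not the NBNode class, whose internals are not shown here): `__add__` works on a deep copy, so the original tree keeps its counters.

```python
import copy


class ToyNode:
    """Hypothetical node with a numeric counter; not the NBNode class."""

    def __init__(self, name, counter=0, children=None):
        self.name = name
        self.counter = counter
        self.children = children or []

    def iter_nodes(self):
        """Yield this node and all nodes below it."""
        yield self
        for child in self.children:
            yield from child.iter_nodes()

    def __add__(self, number):
        result = copy.deepcopy(self)  # leave the original untouched
        for node in result.iter_nodes():
            node.counter += number
        return result


# A trimmed toy tree with only two of the child nodes from above.
tree = ToyNode("AllCells", 999, [ToyNode("DN", 0), ToyNode("DP", 973)])
added = tree + 100

print([n.counter for n in added.iter_nodes()])  # [1099, 100, 1073]
print([n.counter for n in tree.iter_nodes()])   # [999, 0, 973]
```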

Re-counting by using the ids RESETS all math operations and overwrites the counter with the length of the ids!

[11]:
added_tree.count(use_ids=True)
added_tree.pretty_print()
AllCells (counter:999)
├── DN (counter:0)
├── DP (counter:973)
├── CD4-/CD8+ (counter:5)
│   ├── naive (counter:5)
│   ├── Tcm (counter:0)
│   ├── Temra (counter:0)
│   └── Tem (counter:0)
└── CD4+/CD8- (counter:21)
    ├── naive (counter:20)
    ├── Tcm (counter:0)
    ├── Temra (counter:1)
    └── Tem (counter:0)
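
Why recounting discards the math can be seen in a small sketch. ToyNode and count_from_ids are hypothetical names for this illustration: scalar math only changes the counter, the ids stay as they were, and recounting from the ids simply overwrites the counter with their length.

```python
class ToyNode:
    """Hypothetical node holding ids and a counter; not the NBNode class."""

    def __init__(self, name, ids):
        self.name = name
        self.ids = list(ids)
        self.counter = 0

    def count_from_ids(self):
        # Overwrite the counter with the number of stored ids.
        self.counter = len(self.ids)


node = ToyNode("CD4-/CD8+", ids=[69, 74, 443, 972, 973])
node.count_from_ids()
print(node.counter)  # 5

node.counter += 100  # scalar math changes the counter, not the ids
print(node.counter)  # 105

node.count_from_ids()  # recounting discards the +100
print(node.counter)  # 5
```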

To focus on the important math, we will only print the root node from now on, but we could use pretty_print() every time to show that the operations happen on every node.

[12]:
new_tree = cell_tree + 100
print(new_tree)

new_tree = new_tree - 10
print(new_tree)

new_tree = new_tree * 2
print(new_tree)

NBNode('/AllCells', counter=1099, decision_name=None, decision_value=None)
NBNode('/AllCells', counter=1089, decision_name=None, decision_value=None)
NBNode('/AllCells', counter=2178, decision_name=None, decision_value=None)

Sometimes the numeric type of the counter matters for an operation. There are two ways to handle this:

  1. Change the math operation such that it is appropriate

  2. Modify the type of the tree

[13]:
try:
    new_tree = new_tree / 3
except TypeError:
    print("TypeError: descriptor '__truediv__' requires a 'float' object but received a 'int'")

new_tree = new_tree / 3.0
# Note that such an error does NOT necessarily happen in the root node, so some math
# operations may already have been applied by the time it is raised!
# In the run shown here no TypeError message was printed, so both divisions took
# effect: 2178 / 3 / 3.0 = 242.0
print(new_tree)

# With astype_math_node_attribute we can change the type of the math node attribute
# of all nodes in the tree
print(new_tree.astype_math_node_attribute(float))
print(new_tree.astype_math_node_attribute(int))
NBNode('/AllCells', counter=242.0, decision_name=None, decision_value=None)
NBNode('/AllCells', counter=242.0, decision_name=None, decision_value=None)
NBNode('/AllCells', counter=242, decision_name=None, decision_value=None)
[14]:
print(new_tree / 15)
print(new_tree // 15)
print(new_tree % 15)
print(new_tree << 2)
print(new_tree >> 2)

NBNode('/AllCells', counter=16.133333333333333, decision_name=None, decision_value=None)
NBNode('/AllCells', counter=16, decision_name=None, decision_value=None)
NBNode('/AllCells', counter=2, decision_name=None, decision_value=None)
NBNode('/AllCells', counter=968, decision_name=None, decision_value=None)
NBNode('/AllCells', counter=60, decision_name=None, decision_value=None)
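
The root counters printed above can be checked with plain Python integers. This sketch applies the same five operators to the value 242 and does not involve nbnode at all.

```python
counter = 242  # root counter used in the cell above, as a plain int

results = {
    "/": counter / 15,    # true division, always a float
    "//": counter // 15,  # floor division
    "%": counter % 15,    # remainder
    "<<": counter << 2,   # left shift, same as * 2**2
    ">>": counter >> 2,   # right shift, same as // 2**2
}

for symbol, value in results.items():
    print(symbol, value)
```

The floor division, remainder and shifts give 16, 2, 968 and 60, matching the counters shown above.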

Equalities

Once counters are introduced, two “trees” with the same structure are suddenly not identical anymore:

[15]:
print(cell_tree == cell_tree)
print(cell_tree == cell_tree + 100)
True
False

Therefore we introduce the difference between “structural” and “complete” identity. Two trees are structurally equal if their node.name, node.decision_name and node.decision_value are equal; everything else can be different.

[16]:
from nbnode.nbnode import NBNode
import nbnode.nbnode_trees as nbtree

original_tree = nbtree.tree_simple()
new_tree = nbtree.tree_simple()
print(new_tree == new_tree + 100)
print(new_tree.eq_structure(new_tree + 100))

NBNode("ADDITIONAL_NODE", parent=new_tree)

new_tree.pretty_print()
original_tree.pretty_print()

print(original_tree == original_tree + 100)
print(original_tree == new_tree)

# You can generate a new tree by only copying the structure, then counts and data are not copied:
new_tree = original_tree.copy_structure()
False
True
a (counter:0)
├── a0 (counter:0)
├── a1 (counter:0)
│   └── a1a (counter:0)
├── a2 (counter:0)
└── ADDITIONAL_NODE (counter:0)
a (counter:0)
├── a0 (counter:0)
├── a1 (counter:0)
│   └── a1a (counter:0)
└── a2 (counter:0)
False
False
/home/gugl/.conda_envs/nbnode_pyscaffold/lib/python3.8/site-packages/nbnode/nbnode.py:364: UserWarning: data is no pandas.DataFrame, converting it via pd.DataFrame(data).
  warnings.warn(
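
The distinction between the two kinds of equality can be sketched with a small dataclass. ToyNode, same_structure, build and the decision name "m1" are hypothetical; this is not how NBNode implements `==` or eq_structure. It only illustrates the definition above: structural equality compares name, decision_name and decision_value recursively, while complete equality also compares the counters. Note that this sketch additionally assumes children appear in the same order.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical node; the generated __eq__ compares every field."""

    name: str
    decision_name: object = None
    decision_value: object = None
    counter: int = 0
    children: list = field(default_factory=list)


def same_structure(a, b):
    """Compare names and decisions recursively, ignoring counters."""
    return (
        (a.name, a.decision_name, a.decision_value)
        == (b.name, b.decision_name, b.decision_value)
        and len(a.children) == len(b.children)
        and all(same_structure(x, y) for x, y in zip(a.children, b.children))
    )


def build(counter=0):
    """Build a two-leaf toy tree whose nodes all share one counter value."""
    return ToyNode("a", counter=counter, children=[
        ToyNode("a0", ["m1"], [1], counter),
        ToyNode("a1", ["m1"], [-1], counter),
    ])


plain, shifted = build(), build(counter=100)
print(plain == shifted)                # False
print(same_structure(plain, shifted))  # True

extended = build()
extended.children.append(ToyNode("ADDITIONAL_NODE"))
print(same_structure(plain, extended))  # False
```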

Math with multiple NBNodes

We can also use the usual math operators to add, subtract, multiply, etc. trees with each other. Explicitly, this traverses all nodes in both trees simultaneously and applies the mathematical operation to the math_node_attribute of each pair of nodes. The result is saved in the math_node_attribute of the returned tree; neither input tree is changed in place.

Note that the resulting counters are then no longer backed by NBNode.ids!

[17]:
import copy
import nbnode.nbnode_trees as nbtree
cell_tree = nbtree.tree_complete_aligned_trunk()
cell_tree.id_preds(cell_tree.predict(cellmat))
cell_tree.count(use_ids=True)
cell_tree.pretty_print()

cell_tree_2 = copy.deepcopy(cell_tree)
# Reset the counts of the nodes
cell_tree_2.reset_counts()
cell_tree_2 = cell_tree_2 + 1

# You can set the counter values manually.
# Keep in mind that setting an intermediate node (like this one)
#  might not make any sense biologically as every cell must reach a leaf node
cell_tree_2["/AllCells/CD4-/CD8+"].counter = 1000
cell_tree_2.pretty_print()

AllCells (counter:999)
├── DN (counter:0)
├── DP (counter:973)
├── CD4-/CD8+ (counter:5)
│   ├── naive (counter:5)
│   ├── Tcm (counter:0)
│   ├── Temra (counter:0)
│   └── Tem (counter:0)
└── CD4+/CD8- (counter:21)
    ├── naive (counter:20)
    ├── Tcm (counter:0)
    ├── Temra (counter:1)
    └── Tem (counter:0)
AllCells (counter:1)
├── DN (counter:1)
├── DP (counter:1)
├── CD4-/CD8+ (counter:1000)
│   ├── naive (counter:1)
│   ├── Tcm (counter:1)
│   ├── Temra (counter:1)
│   └── Tem (counter:1)
└── CD4+/CD8- (counter:1)
    ├── naive (counter:1)
    ├── Tcm (counter:1)
    ├── Temra (counter:1)
    └── Tem (counter:1)
[18]:
# Add the two trees
(cell_tree + cell_tree_2).pretty_print()
print(cell_tree)
print(cell_tree_2)
AllCells (counter:1000)
├── DN (counter:1)
├── DP (counter:974)
├── CD4-/CD8+ (counter:1005)
│   ├── naive (counter:6)
│   ├── Tcm (counter:1)
│   ├── Temra (counter:1)
│   └── Tem (counter:1)
└── CD4+/CD8- (counter:22)
    ├── naive (counter:21)
    ├── Tcm (counter:1)
    ├── Temra (counter:2)
    └── Tem (counter:1)
NBNode('/AllCells', counter=999, decision_name=None, decision_value=None)
NBNode('/AllCells', counter=1, decision_name=None, decision_value=None)
[19]:
(cell_tree - cell_tree_2).pretty_print()
AllCells (counter:998)
├── DN (counter:-1)
├── DP (counter:972)
├── CD4-/CD8+ (counter:-995)
│   ├── naive (counter:4)
│   ├── Tcm (counter:-1)
│   ├── Temra (counter:-1)
│   └── Tem (counter:-1)
└── CD4+/CD8- (counter:20)
    ├── naive (counter:19)
    ├── Tcm (counter:-1)
    ├── Temra (counter:0)
    └── Tem (counter:-1)
[20]:
(cell_tree * cell_tree_2).pretty_print()
AllCells (counter:999)
├── DN (counter:0)
├── DP (counter:973)
├── CD4-/CD8+ (counter:5000)
│   ├── naive (counter:5)
│   ├── Tcm (counter:0)
│   ├── Temra (counter:0)
│   └── Tem (counter:0)
└── CD4+/CD8- (counter:21)
    ├── naive (counter:20)
    ├── Tcm (counter:0)
    ├── Temra (counter:1)
    └── Tem (counter:0)
[21]:

(cell_tree / cell_tree_2).pretty_print()
AllCells (counter:999.0)
├── DN (counter:0.0)
├── DP (counter:973.0)
├── CD4-/CD8+ (counter:0.005)
│   ├── naive (counter:5.0)
│   ├── Tcm (counter:0.0)
│   ├── Temra (counter:0.0)
│   └── Tem (counter:0.0)
└── CD4+/CD8- (counter:21.0)
    ├── naive (counter:20.0)
    ├── Tcm (counter:0.0)
    ├── Temra (counter:1.0)
    └── Tem (counter:0.0)
[22]:

(cell_tree % cell_tree_2).pretty_print()
AllCells (counter:0)
├── DN (counter:0)
├── DP (counter:0)
├── CD4-/CD8+ (counter:5)
│   ├── naive (counter:0)
│   ├── Tcm (counter:0)
│   ├── Temra (counter:0)
│   └── Tem (counter:0)
└── CD4+/CD8- (counter:0)
    ├── naive (counter:0)
    ├── Tcm (counter:0)
    ├── Temra (counter:0)
    └── Tem (counter:0)
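
The simultaneous traversal can be sketched in a few lines. The nested (counter, children) layout, combine and counter_at are assumptions made for this illustration and are not the nbnode implementation; the trees are trimmed copies of cell_tree and cell_tree_2 from above. Raising an error for structurally different trees is a choice of this sketch, not a documented NBNode behaviour.

```python
import operator

# Trimmed copies of the two trees above as (counter, children) pairs.
tree_1 = (999, {"DN": (0, {}), "DP": (973, {}),
                "CD4-/CD8+": (5, {"naive": (5, {})})})
tree_2 = (1, {"DN": (1, {}), "DP": (1, {}),
              "CD4-/CD8+": (1000, {"naive": (1, {})})})


def combine(op, left, right):
    """Apply op to matching counters of two structurally equal trees."""
    left_count, left_children = left
    right_count, right_children = right
    if left_children.keys() != right_children.keys():
        raise ValueError("trees differ in structure")
    return (
        op(left_count, right_count),
        {name: combine(op, left_children[name], right_children[name])
         for name in left_children},
    )


def counter_at(tree, *path):
    """Return the counter of the node reached by following path."""
    node = tree
    for name in path:
        node = node[1][name]
    return node[0]


for op in (operator.add, operator.sub, operator.mul,
           operator.truediv, operator.mod):
    result = combine(op, tree_1, tree_2)
    print(op.__name__, counter_at(result), counter_at(result, "CD4-/CD8+"))
# add 1000 1005
# sub 998 -995
# mul 999 5000
# truediv 999.0 0.005
# mod 0 5
```

The printed values agree with the root and CD4-/CD8+ counters in the five outputs above, and tree_1 and tree_2 are left unchanged because combine always builds new tuples.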