PC constraint-based BN learning algorithm

Posted on Wed 17 August 2016 in score_based

The past while I have been working on basic constraint-based BN learning. This required a method to perform conditional independence tests on the data set. Surprisingly, such tests for conditional independence are not part of scipy.stats or other statistics libraries.

To test if X _|_ Y | Zs, one has to manually construct the frequencies one would expect if the variables were conditionally independent, namely \(P(X,Y,Zs)=P(X|Zs)\cdot P(Y|Zs)\cdot P(Zs)\) and compare it with the observed frequencies, using e.g. a \(\chi^2\) deviance statistic (provided by scipy.stats). Expected frequencies can be computed as \(\frac{P(X, Zs)\cdot P(Y, Zs)}{P(Zs)}\), So one can start with a joint state_count/frequency table and marginalize out \(X\), \(Y\), and both and compute the expected distribution from the margins.

Once such a testing method is in place, the PC algorithm can be used to infer a partially directed acyclic graph (PDAG) structure to capture the dependencies in the data, in polynomial time. Finally, the PDAG can be fully oriented and completed to a Bayesian network. The implementation looks like this:

Methods of the `ConstraintBasedEstimator` class:

`test_conditional_independence(self, X, Y, Zs=[])`

Chi-square conditional independence test (PGM book, 18.2.2.3, page 789)

`build_skeleton(nodes, independencies)`

Build undirected graph from independencies/1st part of PC algorithm (PGM book, 3.4.2.1, page 85, like Algorithm 3.3)

`skeleton_to_pdag(skel, seperating_sets)`

Orients compelled edges of skeleton/2nd part of PC (Neapolitan, Learning Bayseian Networks, Section 10.1.2, page 550, Algorithm 10.2)

`pdag_to_dag(pdag)`

Method to (faithfully) orient the remaining edges to obtain BayesianModel (Implemented as described here on page 454 last paragraph (in text)).

Finally three methods that combine the above parts for convenient access:

`estimate(self, p_value=0.01)`

-> returns BayesianModel estimate for data

`estimate_skeleton(self, p_value=0.01)`

-> returns UndirectedGraph estimate for data

`estimate_from_independencies(nodes, independencies)`

-> static, takes set of independencies and estimates BayesianModel.

Examples:

import pandas as pd
import numpy as np
from pgmpy.base import DirectedGraph
from pgmpy.estimators import ConstraintBasedEstimator
from pgmpy.independencies import Independencies

data = pd.DataFrame(np.random.randint(0, 5, size=(2500, 3)), columns=list('XYZ'))
data['sum'] = data.sum(axis=1)
print(data)

# estimate BN structue:
c = ConstraintBasedEstimator(data)
model = c.estimate()
print("Resulting network: ", model.edges())

Output:

      X  Y  Z  sum
0     1  3  4    8
1     3  3  0    6
2     4  4  1    9
...  .. .. ..  ...
2497  0  4  2    6
2498  0  3  1    4
2499  2  1  3    6

[2500 rows x 4 columns]

Resulting network: [('Z', 'sum'), ('X', 'sum'), ('Y', 'sum')]

Using parts of the algorithm manually:

# some (in)dependence tests:
data = pd.DataFrame(np.random.randint(0, 2, size=(50000, 4)), columns=list('ABCD'))
data['E'] = data['A'] + data['B'] + data['C']
c = ConstraintBasedEstimator(data)

print("\n P-value for hypothesis test that A, C are dependent: ",
      c.test_conditional_independence('A', 'C'))
print("P-value for hypothesis test that A, B are dependent, given D: ",
      c.test_conditional_independence('A', 'B', 'D'))
print("P-value for hypothesis test that A, B are dependent, given D and E: ",
      c.test_conditional_independence('A', 'B', ['D', 'E']))

# build skeleton from list of independencies:
ind = Independencies(['B', 'C'], ['A', ['B', 'C'], 'D'])
ind = ind.closure()
skel, sep_sets = ConstraintBasedEstimator.build_skeleton("ABCD", ind)
print("Some skeleton: ", skel.edges())

# build PDAG from skeleton (+ sep_sets):
data = pd.DataFrame(np.random.randint(0, 4, size=(5000, 3)), columns=list('ABD'))
data['C'] = data['A'] - data['B']
data['D'] += data['A']
c = ConstraintBasedEstimator(data)
pdag = c.skeleton_to_pdag(*c.estimate_skeleton())
print("Some PDAG: ", pdag.edges())  # edges: A->C, B->C, A--D (not directed)

# complete PDAG to DAG:
pdag1 = DirectedGraph([('A', 'B'), ('C', 'B'), ('C', 'D'), ('D', 'C'), ('D', 'A'), ('A', 'D')])
print("PDAG: ", pdag1.edges())
dag1 = ConstraintBasedEstimator.pdag_to_dag(pdag1)
print("DAG:  ", dag1.edges())

Output:

P-value for hypothesis test that A, C are dependent:  0.995509460079
P-value for hypothesis test that A, B are dependent, given D:  0.998918522413
P-value for hypothesis test that A, B are dependent, given D and E:  0.0
Some skeleton:  [('A', 'D'), ('C', 'D'), ('B', 'D')]
Some PDAG:  [('A', 'C'), ('A', 'D'), ('D', 'A'), ('B', 'C')]
PDAG:  [('A', 'D'), ('A', 'B'), ('C', 'D'), ('C', 'B'), ('D', 'A'), ('D', 'C')]
DAG:   [('A', 'B'), ('C', 'B'), ('D', 'A'), ('D', 'C')]

Methods of the ConstraintBasedEstimator class:

test_conditional_independence(self, X, Y, Zs=[])

build_skeleton(nodes, independencies)

skeleton_to_pdag(skel, seperating_sets)

pdag_to_dag(pdag)

estimate(self, p_value=0.01)

estimate_skeleton(self, p_value=0.01)

estimate_from_independencies(nodes, independencies)