PC constraint-based BN learning algorithm
Posted on Wed 17 August 2016 in score_based
The past while I have been working on basic constraint-based BN learning. This required a
method to perform conditional independence tests on the data set. Surprisingly, such
tests for conditional independence are not part of scipy.stats
or other statistics libraries.
To test if X _|_ Y | Zs
, one has to manually construct the frequencies one would expect if the variables were conditionally independent, namely \(P(X,Y,Zs)=P(X|Zs)\cdot P(Y|Zs)\cdot P(Zs)\) and compare it with the observed frequencies, using e.g. a \(\chi^2\) deviance statistic (provided by scipy.stats
). Expected frequencies can be computed as \(\frac{P(X, Zs)\cdot P(Y, Zs)}{P(Zs)}\),
So one can start with a joint state_count/frequency table and marginalize out \(X\), \(Y\), and both and compute the expected distribution from the margins.
Once such a testing method is in place, the PC algorithm can be used to infer a partially directed acyclic graph (PDAG) structure to capture the dependencies in the data, in polynomial time. Finally, the PDAG can be fully oriented and completed to a Bayesian network. The implementation looks like this:
Methods of the ConstraintBasedEstimator
class:
test_conditional_independence(self, X, Y, Zs=[])
Chi-square conditional independence test (PGM book, 18.2.2.3, page 789)
build_skeleton(nodes, independencies)
Build undirected graph from independencies/1st part of PC algorithm (PGM book, 3.4.2.1, page 85, like Algorithm 3.3)
skeleton_to_pdag(skel, seperating_sets)
Orients compelled edges of skeleton/2nd part of PC (Neapolitan, Learning Bayseian Networks, Section 10.1.2, page 550, Algorithm 10.2)
pdag_to_dag(pdag)
Method to (faithfully) orient the remaining edges to obtain BayesianModel
(Implemented as described here on page 454 last paragraph (in text)).
Finally three methods that combine the above parts for convenient access:
estimate(self, p_value=0.01)
-> returns BayesianModel
estimate for data
estimate_skeleton(self, p_value=0.01)
-> returns UndirectedGraph
estimate for data
estimate_from_independencies(nodes, independencies)
-> static, takes set of independencies and estimates BayesianModel
.
Examples:
import pandas as pd import numpy as np from pgmpy.base import DirectedGraph from pgmpy.estimators import ConstraintBasedEstimator from pgmpy.independencies import Independencies data = pd.DataFrame(np.random.randint(0, 5, size=(2500, 3)), columns=list('XYZ')) data['sum'] = data.sum(axis=1) print(data) # estimate BN structue: c = ConstraintBasedEstimator(data) model = c.estimate() print("Resulting network: ", model.edges())
Output:
X Y Z sum 0 1 3 4 8 1 3 3 0 6 2 4 4 1 9 ... .. .. .. ... 2497 0 4 2 6 2498 0 3 1 4 2499 2 1 3 6 [2500 rows x 4 columns] Resulting network: [('Z', 'sum'), ('X', 'sum'), ('Y', 'sum')]
Using parts of the algorithm manually:
# some (in)dependence tests: data = pd.DataFrame(np.random.randint(0, 2, size=(50000, 4)), columns=list('ABCD')) data['E'] = data['A'] + data['B'] + data['C'] c = ConstraintBasedEstimator(data) print("\n P-value for hypothesis test that A, C are dependent: ", c.test_conditional_independence('A', 'C')) print("P-value for hypothesis test that A, B are dependent, given D: ", c.test_conditional_independence('A', 'B', 'D')) print("P-value for hypothesis test that A, B are dependent, given D and E: ", c.test_conditional_independence('A', 'B', ['D', 'E'])) # build skeleton from list of independencies: ind = Independencies(['B', 'C'], ['A', ['B', 'C'], 'D']) ind = ind.closure() skel, sep_sets = ConstraintBasedEstimator.build_skeleton("ABCD", ind) print("Some skeleton: ", skel.edges()) # build PDAG from skeleton (+ sep_sets): data = pd.DataFrame(np.random.randint(0, 4, size=(5000, 3)), columns=list('ABD')) data['C'] = data['A'] - data['B'] data['D'] += data['A'] c = ConstraintBasedEstimator(data) pdag = c.skeleton_to_pdag(*c.estimate_skeleton()) print("Some PDAG: ", pdag.edges()) # edges: A->C, B->C, A--D (not directed) # complete PDAG to DAG: pdag1 = DirectedGraph([('A', 'B'), ('C', 'B'), ('C', 'D'), ('D', 'C'), ('D', 'A'), ('A', 'D')]) print("PDAG: ", pdag1.edges()) dag1 = ConstraintBasedEstimator.pdag_to_dag(pdag1) print("DAG: ", dag1.edges())
Output:
P-value for hypothesis test that A, C are dependent: 0.995509460079 P-value for hypothesis test that A, B are dependent, given D: 0.998918522413 P-value for hypothesis test that A, B are dependent, given D and E: 0.0 Some skeleton: [('A', 'D'), ('C', 'D'), ('B', 'D')] Some PDAG: [('A', 'C'), ('A', 'D'), ('D', 'A'), ('B', 'C')] PDAG: [('A', 'D'), ('A', 'B'), ('C', 'D'), ('C', 'B'), ('D', 'A'), ('D', 'C')] DAG: [('A', 'B'), ('C', 'B'), ('D', 'A'), ('D', 'C')]