# PC constraint-based BN learning algorithm

Posted on Wed 17 August 2016 in score_based

The past while I have been working on basic constraint-based BN learning. This required a
method to perform conditional independence tests on the data set. Surprisingly, such
tests for conditional independence are not part of `scipy.stats`

or other statistics libraries.

To test if `X _|_ Y | Zs`

, one has to manually construct the frequencies one would expect if the variables were conditionally independent, namely \(P(X,Y,Zs)=P(X|Zs)\cdot P(Y|Zs)\cdot P(Zs)\) and compare it with the observed frequencies, using e.g. a \(\chi^2\) deviance statistic (provided by `scipy.stats`

). Expected frequencies can be computed as \(\frac{P(X, Zs)\cdot P(Y, Zs)}{P(Zs)}\),
So one can start with a joint state_count/frequency table and marginalize out \(X\), \(Y\), and both and compute the expected distribution from the margins.

Once such a testing method is in place, the PC algorithm can be used to infer a partially directed acyclic graph (PDAG) structure to capture the dependencies in the data, in polynomial time. Finally, the PDAG can be fully oriented and completed to a Bayesian network. The implementation looks like this:

## Methods of the `ConstraintBasedEstimator`

class:

`test_conditional_independence(self, X, Y, Zs=[])`

Chi-square conditional independence test (PGM book, 18.2.2.3, page 789)

`build_skeleton(nodes, independencies)`

Build undirected graph from independencies/1st part of PC algorithm (PGM book, 3.4.2.1, page 85, like Algorithm 3.3)

`skeleton_to_pdag(skel, seperating_sets)`

Orients compelled edges of skeleton/2nd part of PC (Neapolitan, Learning Bayseian Networks, Section 10.1.2, page 550, Algorithm 10.2)

`pdag_to_dag(pdag)`

Method to (faithfully) orient the remaining edges to obtain `BayesianModel`

(Implemented as described here on page 454 last paragraph (in text)).

Finally three methods that combine the above parts for convenient access:

`estimate(self, p_value=0.01)`

-> returns `BayesianModel`

estimate for data

`estimate_skeleton(self, p_value=0.01)`

-> returns `UndirectedGraph`

estimate for data

`estimate_from_independencies(nodes, independencies)`

-> static, takes set of independencies and estimates `BayesianModel`

.

## Examples:

import pandas as pd import numpy as np from pgmpy.base import DirectedGraph from pgmpy.estimators import ConstraintBasedEstimator from pgmpy.independencies import Independencies data = pd.DataFrame(np.random.randint(0, 5, size=(2500, 3)), columns=list('XYZ')) data['sum'] = data.sum(axis=1) print(data) # estimate BN structue: c = ConstraintBasedEstimator(data) model = c.estimate() print("Resulting network: ", model.edges())

Output:

X Y Z sum 0 1 3 4 8 1 3 3 0 6 2 4 4 1 9 ... .. .. .. ... 2497 0 4 2 6 2498 0 3 1 4 2499 2 1 3 6 [2500 rows x 4 columns] Resulting network: [('Z', 'sum'), ('X', 'sum'), ('Y', 'sum')]

Using parts of the algorithm manually:

# some (in)dependence tests: data = pd.DataFrame(np.random.randint(0, 2, size=(50000, 4)), columns=list('ABCD')) data['E'] = data['A'] + data['B'] + data['C'] c = ConstraintBasedEstimator(data) print("\n P-value for hypothesis test that A, C are dependent: ", c.test_conditional_independence('A', 'C')) print("P-value for hypothesis test that A, B are dependent, given D: ", c.test_conditional_independence('A', 'B', 'D')) print("P-value for hypothesis test that A, B are dependent, given D and E: ", c.test_conditional_independence('A', 'B', ['D', 'E'])) # build skeleton from list of independencies: ind = Independencies(['B', 'C'], ['A', ['B', 'C'], 'D']) ind = ind.closure() skel, sep_sets = ConstraintBasedEstimator.build_skeleton("ABCD", ind) print("Some skeleton: ", skel.edges()) # build PDAG from skeleton (+ sep_sets): data = pd.DataFrame(np.random.randint(0, 4, size=(5000, 3)), columns=list('ABD')) data['C'] = data['A'] - data['B'] data['D'] += data['A'] c = ConstraintBasedEstimator(data) pdag = c.skeleton_to_pdag(*c.estimate_skeleton()) print("Some PDAG: ", pdag.edges()) # edges: A->C, B->C, A--D (not directed) # complete PDAG to DAG: pdag1 = DirectedGraph([('A', 'B'), ('C', 'B'), ('C', 'D'), ('D', 'C'), ('D', 'A'), ('A', 'D')]) print("PDAG: ", pdag1.edges()) dag1 = ConstraintBasedEstimator.pdag_to_dag(pdag1) print("DAG: ", dag1.edges())

Output:

P-value for hypothesis test that A, C are dependent: 0.995509460079 P-value for hypothesis test that A, B are dependent, given D: 0.998918522413 P-value for hypothesis test that A, B are dependent, given D and E: 0.0 Some skeleton: [('A', 'D'), ('C', 'D'), ('B', 'D')] Some PDAG: [('A', 'C'), ('A', 'D'), ('D', 'A'), ('B', 'C')] PDAG: [('A', 'D'), ('A', 'B'), ('C', 'D'), ('C', 'B'), ('D', 'A'), ('D', 'C')] DAG: [('A', 'B'), ('C', 'B'), ('D', 'A'), ('D', 'C')]