Examples for basic BN learning

Posted on Thu 07 July 2016 in score_based

I’ll soon finish basic score-based structure estimation for BayesianModels. Below is the current state of my PR, with two examples.

Changes in pgmpy/estimators/

I rearranged the estimator classes to inherit from each other like this:

                                     MaximumLikelihoodEstimator
                                   /
                ParameterEstimator -- BayesianEstimator
              /
BaseEstimator                        ExhaustiveSearch
            | \                    /
            |   StructureEstimator -- HillClimbSearch
            |                      \
            |                        ConstraintBasedEstimator
            |
            |
            |                BayesianScore
            |              /
            StructureScore -- BicScore

BaseEstimator takes a data set and, optionally, state_names and a flag for how to handle missing values. ParameterEstimator and its subclasses additionally take a model. All *Search classes are initialized with the data set plus a StructureScore instance (BayesianScore by default).
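
To make this concrete, here is a minimal sketch of passing an explicit score to a search class (the scoring_method keyword and the toy data are my assumptions for illustration; the exact signature may still change in the PR):

import pandas as pd
import numpy as np
from pgmpy.estimators import ExhaustiveSearch, BicScore

# toy data, just to show how the constructors fit together
data = pd.DataFrame(np.random.randint(0, 2, size=(100, 3)), columns=list('XYZ'))

# pass an explicit StructureScore instance instead of the default BayesianScore
est = ExhaustiveSearch(data, scoring_method=BicScore(data))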

Example

Given a data set with 5 or fewer variables, we can search through all BayesianModels and find the best-scoring one using ExhaustiveSearch (currently 5 variables already take a few minutes, but this can be made faster):

import pandas as pd
import numpy as np
from pgmpy.estimators import ExhaustiveSearch

# create random data sample with 3 variables, where B and C are identical:
data = pd.DataFrame(np.random.randint(0, 5, size=(5000, 2)), columns=list('AB'))
data['C'] = data['B']

est = ExhaustiveSearch(data)

best_model = est.estimate()
print(best_model.nodes())
print(best_model.edges())

print('\nall scores:')
for score, model in est.all_scores():
    print(score, model.edges())

The example first prints nodes and edges of the best-fitting model and then the scores for all possible BayesianModels for this data set:

['A', 'B', 'C']
[('B', 'C')]

all scores:
-24243.15030635083 [('A', 'C'), ('A', 'B')]
-24243.149854387288 [('A', 'B'), ('C', 'A')]
-24243.149854387288 [('A', 'C'), ('B', 'A')]
-24211.96205284525 [('A', 'B')]
-24211.96205284525 [('A', 'C')]
-24211.961600881707 [('B', 'A')]
-24211.961600881707 [('C', 'A')]
-24211.961600881707 [('C', 'A'), ('B', 'A')]
-24180.77379933967 []
-16603.134367431743 [('A', 'C'), ('A', 'B'), ('B', 'C')]
-16603.13436743174 [('A', 'C'), ('A', 'B'), ('C', 'B')]
-16603.133915468195 [('A', 'B'), ('C', 'A'), ('C', 'B')]
-16603.133915468195 [('A', 'C'), ('B', 'A'), ('B', 'C')]
-16571.946113926162 [('A', 'C'), ('B', 'C')]
-16571.94611392616 [('A', 'B'), ('C', 'B')]
-16274.052597732147 [('A', 'B'), ('B', 'C')]
-16274.052597732145 [('A', 'C'), ('C', 'B')]
-16274.0521457686 [('B', 'A'), ('B', 'C')]
-16274.0521457686 [('C', 'A'), ('B', 'C')]
-16274.0521457686 [('C', 'B'), ('B', 'A')]
-16274.0521457686 [('C', 'A'), ('C', 'B')]
-16274.0521457686 [('C', 'A'), ('B', 'A'), ('B', 'C')]
-16274.0521457686 [('C', 'A'), ('C', 'B'), ('B', 'A')]
-16242.864344226566 [('B', 'C')]
-16242.864344226564 [('C', 'B')]

There is a big jump in score between the models that contain an edge between B and C (scores around -16200 to -16600) and those that don't (around -24200), as expected, since B and C are identical by construction.
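
One can also score a fixed candidate structure directly to see this jump, assuming the score classes expose a score(model) method as suggested by the diagram above (a sketch; BicScore is used here, and the data frame from the example above is reused):

from pgmpy.models import BayesianModel
from pgmpy.estimators import BicScore

# add A as an isolated node so that all three variables are scored in both candidates
connected = BayesianModel([('B', 'C')])
connected.add_node('A')
independent = BayesianModel()
independent.add_nodes_from(['A', 'B', 'C'])

bic = BicScore(data)
print(bic.score(connected))    # contains the B->C edge: scores much higher
print(bic.score(independent))  # no edges at all: scores much lower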

Example 2

I tried the same with the Kaggle Titanic data set:

import pandas as pd
from pgmpy.estimators import ExhaustiveSearch

# data from https://www.kaggle.com/c/titanic/download/train.csv
titanic_data = pd.read_csv('testdata/titanic_train.csv')
titanic_data = titanic_data[["Survived", "Sex", "Pclass"]]

est = ExhaustiveSearch(titanic_data)
for score, model in est.all_scores():
    print(score, model.edges())

Output:

-2072.9132364404695 []
-2069.071694164769 [('Pclass', 'Sex')]
-2069.0144197068785 [('Sex', 'Pclass')]
-2025.869489762676 [('Survived', 'Pclass')]
-2025.8559302273054 [('Pclass', 'Survived')]
-2022.0279474869753 [('Survived', 'Pclass'), ('Pclass', 'Sex')]
-2022.0143879516047 [('Pclass', 'Survived'), ('Pclass', 'Sex')]
-2021.9571134937144 [('Sex', 'Pclass'), ('Pclass', 'Survived')]
-2017.5258065853768 [('Survived', 'Pclass'), ('Sex', 'Pclass')]
-1941.3075053892835 [('Survived', 'Sex')]
-1941.2720031713893 [('Sex', 'Survived')]
-1937.4304608956886 [('Sex', 'Survived'), ('Pclass', 'Sex')]
-1937.4086886556925 [('Survived', 'Sex'), ('Sex', 'Pclass')]
-1937.3731864377983 [('Sex', 'Survived'), ('Sex', 'Pclass')]
-1934.134485060888 [('Survived', 'Sex'), ('Pclass', 'Sex')]
-1894.2637587114903 [('Survived', 'Sex'), ('Survived', 'Pclass')]
-1894.2501991761196 [('Survived', 'Sex'), ('Pclass', 'Survived')]
-1894.228256493596 [('Survived', 'Pclass'), ('Sex', 'Survived')]
-1891.0630673606006 [('Sex', 'Survived'), ('Pclass', 'Survived')]
-1887.2215250849 [('Sex', 'Survived'), ('Pclass', 'Survived'), ('Pclass', 'Sex')]
-1887.1642506270096 [('Sex', 'Survived'), ('Sex', 'Pclass'), ('Pclass', 'Survived')]
-1887.0907383830947 [('Survived', 'Sex'), ('Survived', 'Pclass'), ('Pclass', 'Sex')]
-1887.077178847724 [('Survived', 'Sex'), ('Pclass', 'Survived'), ('Pclass', 'Sex')]
-1885.9200755341908 [('Survived', 'Sex'), ('Survived', 'Pclass'), ('Sex', 'Pclass')]
-1885.8845733162966 [('Survived', 'Pclass'), ('Sex', 'Survived'), ('Sex', 'Pclass')]

Here it didn’t work as I hoped. [('Sex', 'Survived'), ('Pclass', 'Survived')] has the best score among models with two or fewer edges, but every model with 3 edges scores better. I haven’t had a closer look at the data set yet, but a weak dependency between Sex and Pclass would explain this.
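
One quick way to check that hypothesis is a chi-squared independence test on the Sex/Pclass contingency table, using scipy rather than anything from the PR (a sketch, just as a sanity check):

import pandas as pd
from scipy.stats import chi2_contingency

titanic_data = pd.read_csv('testdata/titanic_train.csv')

# contingency table of Sex vs Pclass; a small p-value indicates a dependency
crosstab = pd.crosstab(titanic_data['Sex'], titanic_data['Pclass'])
chi2, p, dof, expected = chi2_contingency(crosstab)
print(crosstab)
print('chi2 = %.2f, p = %.2g' % (chi2, p))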