Examples for basic BN learning
Posted on Thu 07 July 2016 in score_based
I’ll soon finish basic score-based structure estimation for BayesianModels. Below is the current state of my PR, with two examples.
Changes in pgmpy/estimators/
I rearranged the estimator classes to inherit from each other like this:
```
BaseEstimator
 ├── ParameterEstimator
 │    ├── MaximumLikelihoodEstimator
 │    └── BayesianEstimator
 ├── StructureEstimator
 │    ├── ExhaustiveSearch
 │    ├── HillClimbSearch
 │    └── ConstraintBasedEstimator
 └── StructureScore
      ├── BayesianScore
      └── BicScore
```
BaseEstimator takes a data set and optionally state_names and a flag for how to handle missing values. ParameterEstimator and its subclasses additionally take a model. All *Search classes are initialized with a StructureScore instance (BayesianScore by default) in addition to the data set.
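To make that concrete, wiring the classes together would look roughly like this. This is only a sketch: the `scoring_method` keyword, the `(model, data)` argument order for MaximumLikelihoodEstimator, and the toy variables are my assumptions, not necessarily the final PR API.

```python
import pandas as pd
import numpy as np
from pgmpy.models import BayesianModel
from pgmpy.estimators import MaximumLikelihoodEstimator, ExhaustiveSearch, BicScore

# toy sample with three variables
data = pd.DataFrame(np.random.randint(0, 2, size=(100, 3)), columns=list('XYZ'))

# ParameterEstimator subclasses additionally take a model;
# state_names can declare states that never occur in the sample.
model = BayesianModel([('X', 'Y'), ('Z', 'Y')])
mle = MaximumLikelihoodEstimator(model, data, state_names={'Y': [0, 1, 2]})

# *Search classes take the data set plus an optional StructureScore instance
# (BayesianScore is the default); here a BicScore is passed explicitly.
# The keyword name 'scoring_method' is an assumption.
search = ExhaustiveSearch(data, scoring_method=BicScore(data))
best_model = search.estimate()
```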
Example
Given a data set with 5 or fewer variables, we can search through all BayesianModels and find the best-scoring one using ExhaustiveSearch (currently 5 variables already take a few minutes, but this can be made faster):
```python
import pandas as pd
import numpy as np
from pgmpy.estimators import ExhaustiveSearch

# create random data sample with 3 variables, where B and C are identical:
data = pd.DataFrame(np.random.randint(0, 5, size=(5000, 2)), columns=list('AB'))
data['C'] = data['B']

est = ExhaustiveSearch(data)

best_model = est.estimate()
print(best_model.nodes())
print(best_model.edges())

print('\nall scores:')
for score, model in est.all_scores():
    print(score, model.edges())
```
The example first prints the nodes and edges of the best-fitting model and then the scores of all possible BayesianModels for this data set:
```
['A','B','C']
[('B','C')]

all scores:
-24243.15030635083 [('A', 'C'), ('A', 'B')]
-24243.149854387288 [('A', 'B'), ('C', 'A')]
-24243.149854387288 [('A', 'C'), ('B', 'A')]
-24211.96205284525 [('A', 'B')]
-24211.96205284525 [('A', 'C')]
-24211.961600881707 [('B', 'A')]
-24211.961600881707 [('C', 'A')]
-24211.961600881707 [('C', 'A'), ('B', 'A')]
-24180.77379933967 []
-16603.134367431743 [('A', 'C'), ('A', 'B'), ('B', 'C')]
-16603.13436743174 [('A', 'C'), ('A', 'B'), ('C', 'B')]
-16603.133915468195 [('A', 'B'), ('C', 'A'), ('C', 'B')]
-16603.133915468195 [('A', 'C'), ('B', 'A'), ('B', 'C')]
-16571.946113926162 [('A', 'C'), ('B', 'C')]
-16571.94611392616 [('A', 'B'), ('C', 'B')]
-16274.052597732147 [('A', 'B'), ('B', 'C')]
-16274.052597732145 [('A', 'C'), ('C', 'B')]
-16274.0521457686 [('B', 'A'), ('B', 'C')]
-16274.0521457686 [('C', 'A'), ('B', 'C')]
-16274.0521457686 [('C', 'B'), ('B', 'A')]
-16274.0521457686 [('C', 'A'), ('C', 'B')]
-16274.0521457686 [('C', 'A'), ('B', 'A'), ('B', 'C')]
-16274.0521457686 [('C', 'A'), ('C', 'B'), ('B', 'A')]
-16242.864344226566 [('B', 'C')]
-16242.864344226564 [('C', 'B')]
```
There is a big jump in score between those models where B and C influence each other (~-16274) and the rest (~-24211), as expected, since B and C are perfectly correlated in this sample.
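The same gap can be reproduced by scoring individual models directly. The sketch below assumes the StructureScore classes expose a score(model) method that sums the local scores of all nodes in the model; it reuses the `data` frame from the example above.

```python
from pgmpy.models import BayesianModel
from pgmpy.estimators import BayesianScore

# reuses the `data` frame from the example above
scorer = BayesianScore(data)

model_with_edge = BayesianModel([('B', 'C')])
model_with_edge.add_node('A')  # keep the isolated variable A in the model

model_without_edges = BayesianModel()
model_without_edges.add_nodes_from(['A', 'B', 'C'])

# should roughly reproduce the ~-16242 vs ~-24180 entries from the list above
print(scorer.score(model_with_edge))
print(scorer.score(model_without_edges))
```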
Example 2
I tried the same with the Kaggle Titanic data set:
```python
import pandas as pd
import numpy as np
from pgmpy.estimators import ExhaustiveSearch

# data_link - "https://www.kaggle.com/c/titanic/download/train.csv"
titanic_data = pd.read_csv('testdata/titanic_train.csv')
titanic_data = titanic_data[["Survived", "Sex", "Pclass"]]

est = ExhaustiveSearch(titanic_data)
for score, model in est.all_scores():
    print(score, model.edges())
```
Output:
```
-2072.9132364404695 []
-2069.071694164769 [('Pclass', 'Sex')]
-2069.0144197068785 [('Sex', 'Pclass')]
-2025.869489762676 [('Survived', 'Pclass')]
-2025.8559302273054 [('Pclass', 'Survived')]
-2022.0279474869753 [('Survived', 'Pclass'), ('Pclass', 'Sex')]
-2022.0143879516047 [('Pclass', 'Survived'), ('Pclass', 'Sex')]
-2021.9571134937144 [('Sex', 'Pclass'), ('Pclass', 'Survived')]
-2017.5258065853768 [('Survived', 'Pclass'), ('Sex', 'Pclass')]
-1941.3075053892835 [('Survived', 'Sex')]
-1941.2720031713893 [('Sex', 'Survived')]
-1937.4304608956886 [('Sex', 'Survived'), ('Pclass', 'Sex')]
-1937.4086886556925 [('Survived', 'Sex'), ('Sex', 'Pclass')]
-1937.3731864377983 [('Sex', 'Survived'), ('Sex', 'Pclass')]
-1934.134485060888 [('Survived', 'Sex'), ('Pclass', 'Sex')]
-1894.2637587114903 [('Survived', 'Sex'), ('Survived', 'Pclass')]
-1894.2501991761196 [('Survived', 'Sex'), ('Pclass', 'Survived')]
-1894.228256493596 [('Survived', 'Pclass'), ('Sex', 'Survived')]
-1891.0630673606006 [('Sex', 'Survived'), ('Pclass', 'Survived')]
-1887.2215250849 [('Sex', 'Survived'), ('Pclass', 'Survived'), ('Pclass', 'Sex')]
-1887.1642506270096 [('Sex', 'Survived'), ('Sex', 'Pclass'), ('Pclass', 'Survived')]
-1887.0907383830947 [('Survived', 'Sex'), ('Survived', 'Pclass'), ('Pclass', 'Sex')]
-1887.077178847724 [('Survived', 'Sex'), ('Pclass', 'Survived'), ('Pclass', 'Sex')]
-1885.9200755341908 [('Survived', 'Sex'), ('Survived', 'Pclass'), ('Sex', 'Pclass')]
-1885.8845733162966 [('Survived', 'Pclass'), ('Sex', 'Survived'), ('Sex', 'Pclass')]
```
Here it didn’t work as I had hoped: [('Sex', 'Survived'), ('Pclass', 'Survived')] has the best score among models with two or fewer edges, but every model with three edges scores better. I haven’t had a closer look at the data set yet, but a weak dependency between Sex and Pclass would explain this.
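A quick way to check that suspicion, independent of the PR, would be a chi-square test on the Sex/Pclass contingency table (a sketch using only pandas and scipy, nothing from pgmpy):

```python
import pandas as pd
from scipy.stats import chi2_contingency

titanic_data = pd.read_csv('testdata/titanic_train.csv')

# contingency table of Sex vs Pclass
crosstab = pd.crosstab(titanic_data['Sex'], titanic_data['Pclass'])
print(crosstab)

# a small p-value would mean Sex and Pclass are not independent
chi2, p_value, dof, expected = chi2_contingency(crosstab)
print(chi2, p_value)
```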